Potential Research Projects

Research projects to validate, extend, and implement the AI safety framework. Projects are organized into eleven categories:

  • Empirical Validation (3 projects) — Testing core assumptions
  • Cross-Domain Analysis (4 projects) — Learning from nuclear, finance, carbon markets, and drug safety
  • Formal Methods (3 projects) — Mathematical foundations
  • Mechanism Design (3 projects) — Incentive-compatible safety reporting
  • Architecture (3 projects) — Implementation feasibility
  • Game Theory (3 projects) — Trust dynamics and Byzantine tolerance
  • Human-AI (2 projects) — Trust calibration and handoffs
  • Standards & Policy (3 projects) — Certification and regulation
  • Tools (3 projects) — Simulation, monitoring, language support
  • Meta-Research (3 projects) — Historical failures, uncertainty, emergence
  • Agency & Power (10 projects) — Formalizing capability, agency measurement, strong tools hypothesis

Does decomposition actually reduce scheming behavior in AI systems?

  • Survey existing work on factored cognition and monitoring (Factor(T,U) paper, etc.)
  • Find/analyze empirical studies comparing monolithic vs modular AI behavior
  • Examine AI Control literature for relevant experiments
  • Key question: What’s the evidence that limited context prevents strategic deception?

2. Forecasting Calibration for Self-Prediction

Can AI systems accurately forecast their own behavior after modifications?

  • Benchmark “predict your behavior N steps ahead” with various modifications
  • Characterize how calibration degrades with horizon length
  • Compare to superforecaster baselines on similar prediction tasks
  • Test if forecasting quality scales with capability
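
A minimal sketch of the calibration measurement, assuming a log of (horizon, predicted probability, realized outcome) triples from the benchmark runs is available; the Brier score and the synthetic data below are illustrative choices, not part of the framework:

```python
import numpy as np

def brier_score(predicted_probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predicted_probs - outcomes) ** 2))

def calibration_by_horizon(records):
    """records: iterable of (horizon, predicted_prob, outcome) triples.
    Returns {horizon: Brier score} so degradation with horizon length can be charted."""
    by_horizon = {}
    for horizon, p, y in records:
        ps, ys = by_horizon.setdefault(horizon, ([], []))
        ps.append(p)
        ys.append(y)
    return {h: brier_score(ps, ys) for h, (ps, ys) in sorted(by_horizon.items())}

# Synthetic example: outcomes only loosely track the forecasts, so scores are mediocre at every horizon.
rng = np.random.default_rng(0)
records = [(h, p, rng.random() < 0.5 * p + 0.25) for h in (1, 5, 10) for p in rng.random(200)]
print(calibration_by_horizon(records))
```

Computing the same table per modification type gives the comparison against superforecaster baselines a common metric.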

Does deterministic output (temperature=0) prevent covert communication?

  • Run same LLM at temperature=0 vs 1.0 with repeated queries
  • Measure output variance and potential information encoding
  • Test if high temperature enables steganographic signaling between components
  • Quantify “surprise budget” violations
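
A sketch of the measurement loop, with `sample_fn` standing in for whichever model client is under test (a placeholder, not a real API): the empirical entropy of repeated completions upper-bounds the covert bits per query available through sampling noise.

```python
import collections
import math

def output_entropy(sample_fn, prompt, temperature, n=50):
    """Query the same prompt n times and return the empirical Shannon entropy
    (in bits) of the completions. sample_fn(prompt, temperature) -> str is a
    placeholder for the model client under test."""
    counts = collections.Counter(sample_fn(prompt, temperature) for _ in range(n))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Expected pattern: ~0 bits at temperature=0 (no spare channel), rising with
# temperature. Any residual entropy at temperature=0 is itself a finding, since
# a deterministic decoder should leave no capacity for steganographic signaling.
```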

Deep dive into operational PRA and SIL frameworks:

  • How do fault trees decompose a 10⁻⁹ top-level failure target across components? (see the sketch after this list)
  • Limitations discovered in practice (Three Mile Island, Fukushima, Challenger)
  • How do independence assumptions hold up under stress?
  • Can ISO 26262’s ASIL decomposition rules (e.g., ASIL D = ASIL B(D) + ASIL B(D)) be adapted to AI?
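
For the first bullet, a minimal sketch of how a 10⁻⁹ top-level target recombines from equally allocated components under an OR gate; the equal split and the independence assumption are illustrative, and probing where that assumption breaks is exactly the third bullet.

```python
def or_gate(probs):
    """Top event occurs if any child fails, assuming independent components."""
    p_all_ok = 1.0
    for p in probs:
        p_all_ok *= (1.0 - p)
    return 1.0 - p_all_ok

def equal_allocation(target, n):
    """Split a top-level failure target equally across n OR-ed components
    (a good approximation when individual probabilities are very small)."""
    return target / n

per_component = equal_allocation(1e-9, 4)
print(per_component)                    # 2.5e-10 per component
print(or_gate([per_component] * 4))     # ~1e-9: the decomposed budget recombines
```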

Euler allocation and portfolio risk decomposition:

  • How do banks cascade risk budgets from board to trading desks?
  • Component VaR computation and the full-allocation property (contributions sum exactly to portfolio VaR; see the sketch after this list)
  • Lessons from LTCM collapse, 2008 crisis, March 2020 risk parity failures
  • Applicability of Expected Shortfall vs VaR to AI harm measures
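
A compact sketch of the Euler allocation mentioned above, assuming a delta-normal portfolio (VaR = z·σ at the chosen confidence level, z ≈ 2.33 for 99%); the weights and covariances are invented, and the point is the full-allocation property, where desk-level contributions sum exactly to portfolio VaR.

```python
import numpy as np

def component_var(weights, cov, z=2.33):
    """Delta-normal Euler allocation: each position's contribution to portfolio
    VaR (w_i * dVaR/dw_i), which sums exactly to the total."""
    weights = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    port_sigma = np.sqrt(weights @ cov @ weights)
    marginal = cov @ weights / port_sigma          # d(sigma)/d(w_i)
    contributions = z * weights * marginal         # Euler: w_i * dVaR/dw_i
    return contributions, z * port_sigma

contrib, total_var = component_var(
    weights=[0.5, 0.3, 0.2],
    cov=[[0.04, 0.01, 0.00],
         [0.01, 0.09, 0.02],
         [0.00, 0.02, 0.16]])
assert np.isclose(contrib.sum(), total_var)        # the full-allocation property
```

An analogous allocation for an AI harm measure only works if that measure is positively homogeneous, which is taken up in the formal-methods project below.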

Climate policy as large-scale risk budgeting:

  • Seven effort-sharing approaches (equal per capita, historical responsibility, capability-based, etc.)
  • EU ETS implementation: linear reduction factors, benchmarks, carbon leakage protection
  • MRV (Monitoring, Reporting, Verification) systems
  • Phase 1 failure analysis: over-allocation from poor baseline data

How uncertainty factors stack in drug development:

  • Therapeutic Index vs Margin of Safety calculations
  • FDA uncertainty factors: 10× animal-to-human, 10× human variability, etc.
  • Combined factors of 100-1000× as engineering judgment
  • Applicability to AI where failure modes are poorly characterized
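
The stacking itself is just multiplication of conservative divisors; the sketch below is an illustrative analogue (the 10× values mirror the bullets above, and `noael` is the no-observed-adverse-effect level), and the open question is what the analogous dividend and divisors would be for AI systems whose failure modes are poorly characterized.

```python
def acceptable_exposure(noael, uncertainty_factors=(10.0, 10.0)):
    """Divide a no-observed-adverse-effect level by stacked uncertainty factors
    (e.g., 10x animal-to-human extrapolation, 10x inter-human variability)."""
    divisor = 1.0
    for factor in uncertainty_factors:
        divisor *= factor
    return noael / divisor

print(acceptable_exposure(100.0))                    # 100 / (10 * 10)  -> 1.0
print(acceptable_exposure(100.0, (10, 10, 10)))      # add a third factor -> 0.1
```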

Mathematical foundations for AI risk budgeting:

  • Which risk measures are positively homogeneous, enabling Euler allocation? (see the condition after this list)
  • Survey Markov categories for compositional probability
  • When do safety properties compose vs not compose?
  • Conditions for additive, multiplicative, sub-additive composition
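
For the first bullet, the relevant condition is positive homogeneity: by Euler's theorem, a differentiable risk measure that scales linearly with position size admits a full allocation to components. This is the standard statement, not anything specific to this framework:

\[
\rho(\lambda w) = \lambda\,\rho(w)\ \ \text{for all } \lambda > 0
\quad\Longrightarrow\quad
\rho(w) = \sum_i w_i\,\frac{\partial \rho}{\partial w_i}(w)
\]

VaR, Expected Shortfall, and standard deviation satisfy this (homogeneous of degree 1); variance does not (degree 2), which is one reason Euler allocation is stated for VaR or ES rather than variance.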

9. Linear Logic for Delegation Risk Budgets

Type systems that enforce resource constraints:

  • Bounded Linear Types (Ghica and Smith 2014) for risk tracking
  • Light Linear Logic (LLL) for polynomial-time guarantees
  • Object capability models for authority tracking
  • Practical implementation in AI system design
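
As a runtime analogue of what a bounded linear type would enforce statically, the sketch below treats a delegation-risk budget as a resource that can be split but never duplicated or exceeded; the class and numbers are illustrative, and a real linear type system would reject the violating program at compile time rather than raising at runtime.

```python
class RiskBudget:
    """Illustrative stand-in for a bounded linear resource: a risk budget that
    can be split among sub-delegations but never duplicated or exceeded."""

    def __init__(self, amount: float):
        self._amount = amount
        self._consumed = False

    def split(self, *fractions: float) -> list["RiskBudget"]:
        if self._consumed:
            raise RuntimeError("budget already consumed (linearity violation)")
        if sum(fractions) > 1.0 + 1e-12:
            raise ValueError("sub-budgets exceed the parent budget")
        self._consumed = True                       # the parent is used exactly once
        return [RiskBudget(self._amount * f) for f in fractions]

root = RiskBudget(0.01)                             # a 1% delegation-risk budget
planner, executor = root.split(0.7, 0.3)            # ok: 0.007 and 0.003
# root.split(0.5, 0.5)                              # would raise: linearity enforced
```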

What can and can’t be formally verified in AI systems?

  • Current state of neural network verification
  • Compositional verification via assume-guarantee reasoning
  • Gap between verified components and emergent system behavior
  • Guaranteed Safe AI (Seshia, Tegmark) Level 1-9 framework analysis

Mechanism design for honest risk disclosure:

  • VCG mechanisms applied to AI subsystems reporting risk
  • Revelation principle implications for trust reporting
  • Shapley values for risk attribution across components
  • What makes safety buffers (Anthropic’s 6× > 4×) work?
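
For the Shapley bullet, exact attribution is straightforward at the scale of a few components; `risk_of` is a placeholder for whatever joint risk estimate the framework supplies, and the toy numbers below only illustrate superadditive risk when components are combined.

```python
from itertools import permutations

def shapley_values(components, risk_of):
    """Exact Shapley attribution of total risk across components.
    risk_of(set_of_components) -> float is a placeholder; exact enumeration is
    only feasible for a handful of components."""
    values = {c: 0.0 for c in components}
    orderings = list(permutations(components))
    for order in orderings:
        coalition = frozenset()
        for c in order:
            values[c] += risk_of(coalition | {c}) - risk_of(coalition)
            coalition = coalition | {c}
    return {c: v / len(orderings) for c, v in values.items()}

# Toy example: combined risk (0.05) exceeds the sum of standalone risks (0.03).
risk_table = {frozenset(): 0.0,
              frozenset({"planner"}): 0.01,
              frozenset({"executor"}): 0.02,
              frozenset({"planner", "executor"}): 0.05}
print(shapley_values(["planner", "executor"], lambda s: risk_table[frozenset(s)]))
# {'planner': 0.02, 'executor': 0.03} -- attributions sum to the total risk
```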

Economic structures for AI safety:

  • Insurance pricing for AI trust violations
  • Moral hazard and adverse selection in AI deployment
  • Experience rating and deductible structures
  • Subrogation and recovery from violating agents

When trust is mispriced:

  • Market efficiency for AI trustworthiness
  • Information asymmetry between developers and deployers
  • Adverse selection in AI agent markets
  • Trust derivatives (futures, options, swaps): feasibility analysis

Implementation-focused investigation:

  • Decomposed research assistant design
  • Cost/performance: 10+ fine-tuned narrow models vs one frontier model
  • Existing multi-agent frameworks and trust/coordination properties
  • Hardware requirements for Byzantine-tolerant redundant coordinators

Testing diverse coordinator implementations:

  • Build 3 coordinators: rule-based, fine-tuned 7B, older frontier (GPT-3.5)
  • Compare agreement rates and failure modes
  • Red-team attack attempts on each architecture
  • Quantify safety improvement from redundancy
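
A sketch of the redundant-coordination harness, where `decide(task)` is a hypothetical interface each coordinator implementation (rule-based, fine-tuned 7B, older frontier model) would expose; agreement rates and logged disagreements are the raw material for the comparisons above.

```python
from collections import Counter

def majority_decision(decisions):
    """Return the majority vote across coordinators plus the agreement rate."""
    counts = Counter(decisions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(decisions)

def run_redundant(coordinators, task):
    """coordinators expose decide(task) -> 'approve' | 'reject' (hypothetical interface)."""
    decisions = [c.decide(task) for c in coordinators]
    decision, agreement = majority_decision(decisions)
    if agreement < 1.0:
        print(f"disagreement on {task!r}: {decisions}")   # flag for human review / red-team analysis
    return decision
```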

Task-specific models for safety:

  • Train narrow models for summarization, forecasting, verification, etc.
  • Compare behavior predictability to frontier models
  • Audit narrow model behavior vs general model
  • Cost-benefit analysis of fine-tuning pipeline

17. Trust Dynamics Under Adversarial Pressure

Strategic trust building and exploitation:

  • Equilibrium behaviors when agents can build then exploit trust
  • Deceptive equilibrium analysis: when does “behave well then defect” pay?
  • Counter-strategy effectiveness (unpredictable monitoring, delayed trust, trust decay)
  • Multi-agent trust learning and reputation systems
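
A toy model of the "behave well then defect" strategy against a trust-decay counter-strategy; the threshold, increments, and payoffs are arbitrary placeholders, and the only point is that sufficient decay can keep trust below the exploitation threshold so the defection bonus is never collected.

```python
def exploit_payoff(build_rounds, per_round_gain, defect_gain,
                   trust_needed=0.9, trust_gain=0.05, trust_decay=0.0):
    """Payoff to an agent that cooperates to build trust, then defects once
    trust crosses the threshold. All parameters are illustrative."""
    trust, payoff = 0.5, 0.0
    for _ in range(build_rounds):
        trust = min(1.0, trust + trust_gain) * (1.0 - trust_decay)
        payoff += per_round_gain
    return payoff + (defect_gain if trust >= trust_needed else 0.0)

print(exploit_payoff(20, 0.1, 10.0, trust_decay=0.0))   # 12.0: defection pays
print(exploit_payoff(20, 0.1, 10.0, trust_decay=0.1))   # 2.0: trust saturates below the threshold
```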

Fault tolerance in AI coordination:

  • How many coordinators are needed for Byzantine fault tolerance?
  • 3f+1 rule application to AI systems
  • Cryptographic commitment schemes for coordination decisions
  • Empirical testing of voting-based coordination
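
The classical bound is n ≥ 3f + 1 replicas to tolerate f Byzantine faults, with quorums of 2f + 1; a minimal sketch of what that looks like for coordinator voting (the vote labels are placeholders):

```python
from collections import Counter

def min_coordinators(f):
    """Classical BFT bound: tolerating f Byzantine coordinators needs 3f + 1 replicas."""
    return 3 * f + 1

def quorum_decision(votes, f):
    """Accept a decision only if at least 2f + 1 coordinators agree on it."""
    if len(votes) < min_coordinators(f):
        raise ValueError("not enough coordinators for the claimed fault tolerance")
    decision, count = Counter(votes).most_common(1)[0]
    return decision if count >= 2 * f + 1 else None

print(min_coordinators(1))                                                # 4 coordinators tolerate 1
print(quorum_decision(["approve", "approve", "approve", "reject"], f=1))  # "approve"
```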

Mathematical properties of trust flow:

  • Compare multiplicative, minimum, harmonic, and PageRank-style propagation (the first three are sketched after this list)
  • Multi-path trust combination (max, independence, Dempster-Shafer)
  • Trust conductance (electrical analogy) for complex topologies
  • Adversarial risk inheritance with Byzantine components
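
Three of the candidate propagation rules from the first bullet, applied to a single delegation chain; PageRank-style propagation needs a full trust graph, so it is omitted, and the chain values are arbitrary.

```python
def multiplicative(chain):
    """Trust compounds along the chain: 0.9 * 0.95 * 0.6 ~= 0.51."""
    trust = 1.0
    for t in chain:
        trust *= t
    return trust

def weakest_link(chain):
    """The chain is only as trusted as its least trusted hop."""
    return min(chain)

def harmonic(chain):
    """Harmonic mean: penalizes low-trust hops more than the arithmetic mean."""
    return len(chain) / sum(1.0 / t for t in chain)

chain = [0.9, 0.95, 0.6]
print(multiplicative(chain), weakest_link(chain), harmonic(chain))
```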

How humans (mis)calibrate trust in AI:

  • Over-trust and under-trust patterns
  • Calibration interventions (accuracy feedback, confidence display, explanation)
  • Trust contagion in human-AI teams
  • Trust repair after AI violations

Human-AI task transitions:

  • Specification completeness at human→AI handoffs
  • Context sufficiency at AI→human handoffs
  • Verification criteria clarity
  • Failure modes at handoff points

Standards development:

  • Delegation Risk estimation methodologies
  • Verification coverage requirements
  • Trust concentration limits
  • Third-party certification processes

Policy implications:

  • Maximum allowed Delegation Risk for application categories
  • Required verification for high-stakes tasks
  • Disclosure requirements for trust relationships
  • Liability frameworks for trust violations

24. Comparison to Existing AI Safety Frameworks

Situating this work:

  • Relationship to CAIS (Drexler 2019)
  • Integration with AI Control (Redwood Research 2024)
  • Comparison to IDA (Christiano 2018)
  • Analysis of Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework

Software for trust dynamics:

  • Simulate trust evolution before deployment
  • Stress testing (adversarial fraction, coordinated attacks, cascade failures)
  • Trust regression testing after architectural changes
  • Red team trust testing automation

Operational tooling:

  • Real-time Delegation Risk tracking
  • Trust concentration heat maps
  • Verification coverage metrics
  • Alert systems for trust violations

Language-level support:

  • Trust types with provenance tracking
  • Compiler-checked trust requirements
  • Trust context blocks with capability limits
  • Integration with existing type systems

Learning from past risk management failures:

  • Finance: LTCM, 2008 crisis, correlation breakdowns
  • Nuclear and aerospace: Three Mile Island, Fukushima, and Challenger organizational failures
  • Carbon: EU ETS Phase 1 over-allocation
  • Pattern extraction for AI safety

Distributions over trust levels:

  • Beta distributions for trust estimates
  • Uncertainty propagation through trust calculations
  • Confidence intervals for Delegation Risk
  • Robust optimization under trust uncertainty
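
A minimal sketch of Beta-distributed trust estimates propagated by Monte Carlo, assuming a uniform prior, independent observations, and a multiplicative chain (all three are modeling choices, not framework commitments); the observation counts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def trust_posterior(successes, failures, n=100_000):
    """Beta(successes + 1, failures + 1) posterior over a component's trust,
    assuming a uniform prior and independent pass/fail observations."""
    return rng.beta(successes + 1, failures + 1, size=n)

coordinator = trust_posterior(successes=48, failures=2)
executor = trust_posterior(successes=18, failures=2)

# Propagate through a multiplicative chain and report a credible interval
# for the resulting Delegation Risk (1 - end-to-end trust).
delegation_risk = 1.0 - coordinator * executor
lo, hi = np.percentile(delegation_risk, [5, 95])
print(f"90% interval for Delegation Risk: [{lo:.3f}, {hi:.3f}]")
```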

The unknown unknowns problem:

  • How to bound trust when capabilities are unknown?
  • Detecting emergence before exploitation
  • Capability elicitation and red-teaming
  • Dynamic trust adjustment as capabilities are discovered

Agency, Power, and Capability Formalization

Research questions related to Agent, Power, and Authority Formalization.

Can we reliably measure agency in AI systems?

  • Behavioral tests for goal-directedness (perturbation studies, goal persistence)
  • Temporal consistency probes (does the apparent utility function drift over time?)
  • Cross-context generalization (same goals in different environments?)
  • Comparison of agency metrics to capability benchmarks
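
One way to make the perturbation-study bullet concrete is a goal-persistence score: the share of perturbed episodes in which the system still reaches its original goal. `run_episode` and `goal_reached` are placeholders for whatever evaluation harness is available; the metric itself is the assumption being tested.

```python
def goal_persistence(run_episode, perturbations, goal_reached):
    """Fraction of perturbed episodes in which the original goal is still reached.
    run_episode(perturbation) -> trajectory and goal_reached(trajectory) -> bool
    are placeholders for an evaluation harness; higher persistence is read as
    stronger goal-directedness."""
    hits = sum(goal_reached(run_episode(p)) for p in perturbations)
    return hits / len(perturbations)
```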

Is high capability achievable with low agency?

  • Analyze existing high-power, low-agency systems (search engines, compilers, databases)
  • What architectural patterns keep agency low?
  • Theoretical limits: As capability increases, does agency necessarily increase?
  • Empirical studies: Track agency measures across model scales

When and how does AI system power grow over time?

  • Measure the power growth rate λ for deployed systems (estimation sketch after this list)
  • Identify factors that increase λ (resource access, influence, information accumulation)
  • Architectural constraints that keep λ at or near zero
  • Early warning indicators for power accumulation
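
Estimating λ from a deployed system's telemetry can be as simple as a log-linear fit, assuming an exponential growth model and some positive proxy for power (resource spend, tool-call breadth, downstream actions triggered); both the model and the proxy are assumptions to be validated.

```python
import numpy as np

def estimate_lambda(power_series, dt=1.0):
    """Fit P(t) ~ P0 * exp(lambda * t) by least squares on log P and return lambda."""
    t = np.arange(len(power_series)) * dt
    slope, _intercept = np.polyfit(t, np.log(power_series), 1)
    return slope

flat = [100, 101, 99, 100, 100, 101]                 # architecture that keeps lambda near zero
growing = [100 * np.exp(0.05 * i) for i in range(6)] # power compounding at ~5% per step
print(estimate_lambda(flat))                         # ~0
print(estimate_lambda(growing))                      # ~0.05
```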

What’s the shape of the safety-capability tradeoff?

  • Map (Agency, Power, Risk) tuples for diverse systems
  • Is there a Pareto frontier? What’s its shape?
  • Can alignment research push the frontier outward?
  • Compare frontier across architectural approaches

How does agency combine across components?

  • Does decomposition reduce total system agency?
  • Conditions for agency “emergence” in multi-agent systems
  • Voting/consensus effects on aggregate agency
  • Can low-agency components form high-agency collectives?

Design systems that maximize Risk-Adjusted Capability:

  • Sensitivity analysis: Which design choices most affect RACAP?
  • Optimal agency level for given tasks
  • Power allocation across components
  • Benchmark architectures by RACAP

Operationalize power-seeking in deployed systems:

  • Behavioral signatures of instrumental convergence
  • Resource acquisition monitoring
  • Influence expansion detection
  • Shutdown avoidance indicators

When does power exceed granted authority?

  • Methods for detecting authority violations
  • Dynamic authority adjustment protocols
  • Enforcement mechanisms when power exceeds authority
  • Case studies: Historical authority-power gaps (institutions, organizations)

Mathematical limits on achievable power:

  • Given resource constraints, what is the maximum achievable power?
  • Power bounds from capability constraints
  • Information-theoretic limits on influence
  • Compositional power bounds for decomposed systems

How does agency develop during AI training?

  • Track agency metrics across training runs
  • Phase transitions in goal-directedness
  • RLHF effects on agency
  • Can training be modified to suppress agency while maintaining capability?