AI Safety Frameworks Comparison
Comparison of existing AI safety frameworks and how they address trust decomposition, risk budgeting, coordinator safety, and verification.
Framework Comparison Matrix
| Framework | Trust Decomposition | Risk Budgeting | Coordinator Safety | Verification |
|---|---|---|---|---|
| CAIS (Drexler 2019) | Services not agents | None | Underspecified | Human oversight |
| AI Control (Redwood 2024) | Trusted/Untrusted | Qualitative | T monitors U | Empirical protocols |
| IDA (Christiano 2018) | Amplification | None | Amplified human | Distillation |
| Anthropic RSP | ASL tiers | Safety buffers | Governance only | Capability evals |
| Guaranteed Safe AI | Component contracts | Worst-case only | Formal specs | Mathematical proofs |
| This Framework | Hierarchical trust | Euler allocation | Formal constraints | Risk budgets + verification |
CAIS - Comprehensive AI Services (Drexler 2019)
Citation: Drexler, K.E. (2019). “Reframing Superintelligence: Comprehensive AI Services as General Intelligence.” FHI Technical Report #2019-1. FHI Oxford
Core Thesis
AI-driven general intelligence need not take the form of unified, goal-directed agents. Instead, it can emerge from an ecosystem of task-focused services with expanding capabilities, in which agency is optional.
Key Safety Arguments
- Avoid unified agency - treat unified, goal-directed AGI as a deviant branch of development
- Task-specific optimization - concrete task goals rather than open-ended utility maximization
- Human oversight - services informed by models of human approval
- Recursive service development - services improve other services under human guidance
Critiques
- Foundation models gap - did not anticipate shared base models propagating the same flaws across many services
- Agency creep - even models like GPT-4 are routinely scaffolded to act autonomously
- Coordination underspecified - how do services coordinate without becoming agents?
- Competitive dynamics - no mechanism to prevent racing toward unified agents
How This Framework Extends CAIS
| CAIS Gap | This Framework’s Solution |
|---|---|
| No formal trust composition | Delegation risk framework with propagation rules |
| No quantitative risk allocation | Euler allocation for risk budgets (sketched below) |
| Underspecified coordination | Formal coordinator constraints |
| Ignores foundation model reality | Accounts for shared base models |
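The Euler allocation referenced above can be made concrete with a minimal sketch, assuming a standard-deviation risk measure over simulated per-component losses; the component names, correlations, and sample counts below are illustrative, not taken from the framework.

```python
import numpy as np

def euler_contributions(losses: np.ndarray) -> np.ndarray:
    """Euler allocation under a standard-deviation risk measure.

    losses: shape (n_samples, n_components), per-component losses.
    Returns each component's risk contribution; by Euler's theorem for
    1-homogeneous risk measures, the contributions sum to the total risk.
    """
    total = losses.sum(axis=1)                     # system-level loss per sample
    sigma_total = total.std(ddof=1)
    covs = np.array([np.cov(losses[:, i], total, ddof=1)[0, 1]
                     for i in range(losses.shape[1])])
    return covs / sigma_total

# Illustrative: a coordinator and two sub-agents sharing a base-model risk factor.
rng = np.random.default_rng(0)
shared = rng.normal(size=10_000)
losses = np.column_stack([
    0.6 * shared + rng.normal(scale=0.4, size=10_000),   # coordinator
    0.3 * shared + rng.normal(scale=0.8, size=10_000),   # sub-agent A
    0.3 * shared + rng.normal(scale=0.8, size=10_000),   # sub-agent B
])
contrib = euler_contributions(losses)
assert np.isclose(contrib.sum(), losses.sum(axis=1).std(ddof=1))
```

Because the contributions sum exactly to the total risk, a system-level budget can be split across components in proportion to their contributions without double counting the risk that flows through a shared base model.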
AI Control (Redwood Research 2024)
Citation: Greenblatt, R. et al. (2023). “AI Control: Improving Safety Despite Intentional Subversion.” arXiv:2312.06942. Paper
Core Insight
Maintain safety even when the model is intentionally trying to subvert safety measures.
Architecture
- Untrusted Model (U): Powerful but potentially deceptive (e.g., GPT-4)
- Trusted Model (T): Less capable but reliable (e.g., GPT-3.5)
- Limited Trusted Labor: Human oversight with resource constraints
Key Protocols
- Trusted Monitoring - T reviews U’s outputs for suspicious behavior (see the sketch after this list)
- Untrusted Monitoring - another instance of U monitors U’s outputs (requires collusion-busting)
- Defer to Trusted - when suspicious, defer to weaker but trusted model
- Factor(T,U) - decompose tasks, monitor each child task
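A minimal sketch of the trusted-monitoring and defer-to-trusted protocols described above, assuming suspicion is scored in [0, 1]; the class, the callables, and the threshold value are illustrative, not the Redwood implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlProtocol:
    untrusted: Callable[[str], str]          # U: powerful but potentially deceptive
    trusted: Callable[[str], str]            # T: weaker but reliable
    suspicion: Callable[[str, str], float]   # T scores U's output, 0 = benign, 1 = suspicious
    threshold: float = 0.8                   # illustrative defer/audit threshold

    def run(self, task: str) -> str:
        candidate = self.untrusted(task)            # U attempts the task
        if self.suspicion(task, candidate) >= self.threshold:
            return self.trusted(task)               # defer to the trusted model
        return candidate                            # accept U's output
```

Factor(T,U) applies this same loop to each child task after decomposition, which is why attacks against simpler subtasks are easier for the monitor to flag.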
Empirical Results
Factor(T,U) December 2024: Safety improved 41% → 70% when adding factored cognition to trusted monitoring. Attacks on simpler child tasks are more obvious to monitors.
Scheming Detection (OpenAI/Apollo 2024):
- o4-mini: 8.7% → 0.3% scheming (29× reduction with deliberative alignment)
- o3: 13% → 0.4% scheming (33× reduction)
How This Framework Extends AI Control
| AI Control | This Framework |
|---|---|
| Qualitative T/U separation | Quantitative risk budgets |
| Single T-U pair | Multi-level hierarchies |
| Technical protocols only | Economic incentive alignment |
| No compositional framework | Mathematical trust composition |
Key integration: the coordinator is a formalized “trusted model T” with explicit capability bounds, risk budgets, and verification requirements (a sketch follows).
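One hedged way to picture that formalization as a data structure; every field and method name below is an illustrative assumption, not part of the AI Control protocols.

```python
from dataclasses import dataclass, field

@dataclass
class Coordinator:
    """A formalized 'trusted model T' role with explicit bounds and a budget."""
    allowed_actions: frozenset          # explicit capability bounds
    risk_budget: float                  # e.g. total expected-failure budget for a window
    risk_spent: float = 0.0
    verified_constraints: list = field(default_factory=list)   # constraints with checks/proofs

    def can_delegate(self, action: str, task_risk: float) -> bool:
        within_capabilities = action in self.allowed_actions
        within_budget = self.risk_spent + task_risk <= self.risk_budget
        return within_capabilities and within_budget

    def record(self, task_risk: float) -> None:
        self.risk_spent += task_risk
```

Delegation is permitted only while the requested action lies inside the capability bounds and the cumulative risk stays inside the budget.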
IDA - Iterated Distillation and Amplification (Christiano 2018)
Citation: Christiano, P. et al. (2018). “Supervising strong learners by amplifying weak experts.” arXiv:1810.08575. Paper
The Core Loop
Amplification: a weak agent given more resources and time produces better outputs (analogous to MCTS in AlphaGo)
Distillation: train a faster model to predict the amplified outputs
Iteration: repeat with the distilled model as the new base (the full loop is sketched below)
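A compact sketch of that loop, assuming amplification means decomposing a task and aggregating the base model’s answers to subtasks; decompose, aggregate, and distill are placeholder callables, not an API from the paper.

```python
def amplify(model, task, decompose, aggregate):
    """Weak agent + more compute: answer subtasks with `model`, then combine."""
    return aggregate([model(sub) for sub in decompose(task)])

def ida(model, tasks, decompose, aggregate, distill, rounds=3):
    """Iterated Distillation and Amplification (illustrative skeleton)."""
    for _ in range(rounds):
        # Distillation: train a faster model to imitate the slow, amplified system.
        targets = [(task, amplify(model, task, decompose, aggregate)) for task in tasks]
        model = distill(targets)   # returns the new, faster base model
    return model
```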
Connection to Hierarchical Delegation
- Agent receives difficult task
- Decomposes into subtasks
- Calls copies of itself to solve subtasks
- Integrates results
This is precisely the coordinator architecture pattern.
Factored Cognition
Central question: “Can we solve difficult problems by composing small contributions from individual agents who don’t know the big picture?”
Challenge: problems that require extended learning to build intuitions (e.g., a problem halfway through a math textbook depends on intuitions built up over the earlier chapters).
How This Framework Relates
| IDA | This Framework |
|---|---|
| Training methodology | Deployment architecture |
| Alignment during training | Safety during operation |
| No runtime risk quantification | Continuous risk budgets |
| Theoretical | Practical implementation |
Complementary: train with IDA, deploy in this architecture, and use risk budgets to determine when a model is “trusted enough” for a given role (a rough check is sketched below).
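One rough way to operationalize “trusted enough for a role”: compare an upper confidence bound on a model’s measured failure rate against the role’s per-operation risk budget. The rule-of-three bound, the normal approximation, and the budget value are illustrative assumptions, not part of the cited frameworks.

```python
def trusted_enough(failures: int, trials: int, role_budget: float) -> bool:
    """Is the empirical evidence strong enough for this role's risk budget?"""
    if failures == 0:
        upper = 3.0 / trials                                 # ~95% upper bound ("rule of three")
    else:
        p = failures / trials
        upper = p + 1.96 * (p * (1 - p) / trials) ** 0.5     # normal-approximation upper bound
    return upper <= role_budget

# A 1e-4 per-operation budget needs at least ~30,000 clean trials before promotion.
print(trusted_enough(failures=0, trials=30_000, role_budget=1e-4))   # True
```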
Anthropic RSP - Responsible Scaling Policy
Citation: Anthropic (2023-2025). “Responsible Scaling Policy.” Versions 1.0-2.2. Anthropic RSP
AI Safety Levels (ASL)
| Level | Description | Status |
|---|---|---|
| ASL-1 | No meaningful catastrophic risk | Historical LLMs |
| ASL-2 | Early dangerous capabilities (not yet useful) | Pre-Claude Opus 4 |
| ASL-3 | Substantially increased catastrophic misuse risk | Claude Opus 4 (May 2025) |
| ASL-4+ | Not yet defined | Future |
Safety Buffer Concept
- Yellow Line: Safety buffer before red line (originally 6× effective compute)
- Red Line: Actual ASL threshold requiring new protections
- Yellow line triggers comprehensive assessment and preparation
Evolution: Version 2.2 dropped the specific 6× figure, noting that the “science of evaluations is not currently mature enough.”
Evaluation Triggers
- 4× Effective Compute increase, OR
- Six months of accumulated post-training enhancements
How This Framework Extends RSP
| RSP | This Framework |
|---|---|
| Model-level ASL tiers | Component-level risk budgets |
| Qualitative thresholds | Quantitative probability budgets |
| Periodic assessments | Continuous risk tracking |
| Governance only | Technical + economic mechanisms |
Proposed Integration (a tracking sketch follows the list):
- ASL-2 component: Risk budget = 10⁻⁴ per operation
- ASL-3 component: Risk budget = 10⁻⁶ per operation
- System composition: Euler allocation for total risk
- Yellow line: 50% of budget consumed
- Red line: 80% of budget consumed
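A minimal sketch of continuous tracking against the yellow (50%) and red (80%) consumption thresholds listed above; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class RiskBudgetTracker:
    budget: float                 # total expected-failure budget for a deployment window
    spent: float = 0.0
    yellow: float = 0.5           # fraction consumed that triggers assessment and preparation
    red: float = 0.8              # fraction consumed that triggers new protections

    def record(self, estimated_risk: float) -> str:
        """Add one operation's estimated risk and return the current zone."""
        self.spent += estimated_risk
        fraction = self.spent / self.budget
        if fraction >= self.red:
            return "red"
        if fraction >= self.yellow:
            return "yellow"
        return "green"
```

Unlike periodic evaluations, the zone is recomputed on every recorded operation, which is what “continuous risk tracking” means in the comparison tables.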
Guaranteed Safe AI (Tegmark, Seshia et al. 2024)
Citation: Dalrymple, D. et al. (2024). “Towards Guaranteed Safe AI.” arXiv:2405.06624. Paper
Core Philosophy
“Mathematical proof is humanity’s most powerful tool for controlling AGIs. Regardless of how intelligent a system becomes, it cannot prove a mathematical falsehood.”
Three Components
- World Model: a mathematical description of how the AI affects the world
- Safety Specification: a mathematical description of acceptable effects
- Verifier: an auditable proof certificate that the AI satisfies the specification (the three roles are sketched as interfaces after this list)
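Read as an interface contract, the three components might look like the sketch below; all class and method names are invented for illustration and are not an API from the paper.

```python
from typing import Any, Optional, Protocol

class WorldModel(Protocol):
    def predict_effects(self, action: Any) -> Any:
        """Mathematical model of how a proposed action affects the world."""
        ...

class SafetySpecification(Protocol):
    def acceptable(self, effects: Any) -> bool:
        """Whether the predicted effects fall inside the acceptable set."""
        ...

class Verifier(Protocol):
    def certificate(self, policy: Any, world: WorldModel,
                    spec: SafetySpecification) -> Optional[bytes]:
        """An auditable proof certificate that the policy satisfies the spec, or None."""
        ...
```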
Practical Examples
- Machines that are provably impossible to log into without correct credentials
- DNA synthesizers that are provably unable to synthesize certain pathogens
- AI hardware that is provably geofenced or equipped with a remote kill switch
Limitations
- Requires complete formal specification (extremely difficult)
- Assumptions in world model may not hold
- Verification doesn’t scale easily to large neural networks
How This Framework Relates
| Guaranteed Safe AI | This Framework |
|---|---|
| Worst-case proofs | Probabilistic budgets |
| Complete specification needed | Works with partial specs |
| Formal verification | Risk budgets + verification |
Complementary:
- Formally verify coordinator constraints (provable bounds)
- Apply risk budgets to sub-agent behaviors (practical for complex tasks)
- Hybrid: proofs for critical paths, risk budgets for the rest
Other Frameworks
OpenAI Preparedness Framework
Citation: OpenAI (2023-2025). “Preparedness Framework.” OpenAI
Two thresholds:
- High capability: Could amplify existing pathways to severe harm
- Critical capability: Could introduce unprecedented new pathways
Similar governance (Safety Advisory Group review, board veto) but less granular than the RSP.
DeepMind Debate
Citation: Irving, G. et al. (2018). “AI Safety via Debate.” arXiv:1805.00899. Paper
Two AI systems argue opposite sides; human judge evaluates.
Core assumption: “Harder to lie convincingly than to refute a lie”
2024 results: first empirical evidence that debate helps judges recognize truth (Anthropic in February, DeepMind in July). A toy version of the protocol is sketched below.
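A toy sketch of the debate loop (two systems argue opposite sides, a judge evaluates the transcript); the agent and judge callables are placeholders, not the protocol from the cited paper.

```python
def debate(question, agent_a, agent_b, judge, rounds=3):
    """Toy debate: A argues one side, B the other, the judge decides from the transcript."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", agent_a(question, transcript)))   # argument
        transcript.append(("B", agent_b(question, transcript)))   # counter-argument
    return judge(question, transcript)   # e.g. a human judging which side was honest
```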
Constitutional AI
Citation: Bai, Y. et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073. Anthropic
Self-improvement without human labels for harmful outputs: the AI critiques and revises its own outputs against constitutional principles.
Key result: More helpful AND more harmless than standard RLHF.
Weak-to-Strong Generalization
Citation: Burns, C. et al. (2023). “Weak-to-Strong Generalization.” arXiv:2312.09390. OpenAI
Can supervision from a weak model elicit the full capabilities of a stronger model?
Finding: a GPT-2-level supervisor can elicit much of GPT-4’s capability (close to GPT-3.5 level), but not full recovery.
Implication: a weak coordinator supervising a strong sub-agent is exactly the weak-to-strong problem; risk budgets must account for these supervision limits (illustrated below).
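A back-of-the-envelope illustration of that implication: a weaker monitor catches fewer of the sub-agent’s failures, so each operation consumes more of the risk budget. All rates below are made-up numbers.

```python
# Illustrative only: residual (undetected) failure rate under different monitors.
base_failure_rate = 1e-3      # sub-agent failure rate before oversight (assumed)
detection_strong = 0.95       # detection rate of an equally capable monitor (assumed)
detection_weak = 0.60         # detection rate of a weak-to-strong monitor (assumed)

residual_strong = base_failure_rate * (1 - detection_strong)   # 5e-05
residual_weak = base_failure_rate * (1 - detection_weak)       # 4e-04
print(residual_weak / residual_strong)   # ≈ 8x more budget consumed per operation
```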
Multi-Agent Safety
Citation: Cooperative AI Foundation (2024). “Multi-Agent Risks from Advanced AI.” Report
Three failure modes:
- Miscoordination: Failure to cooperate despite shared goals
- Conflict: Failure due to differing goals
- Collusion: Undesirable cooperation
Coordinator architecture addresses all three via shared state, risk budgets, and monitoring.
Key Citations by Topic
Trust Decomposition
- Drexler (2019) - CAIS services architecture
- Greenblatt et al. (2023) - AI Control T/U decomposition
- Christiano et al. (2018) - IDA amplification
Risk Budgeting
- Anthropic RSP (2023-2025) - Safety buffers
- OpenAI Preparedness (2023-2025) - Capability thresholds
- This framework - Euler allocation from finance
Coordinator Safety
- AI Control protocols - Trusted monitoring
- Factor(T,U) (2024) - Factored cognition for monitoring
- This framework - Formal coordinator constraints
Verification
- Dalrymple et al. (2024) - Guaranteed Safe AI
- Urban & Miné (2021) - Formal methods for ML, arXiv:2104.02466
- OpenAI/Apollo (2024) - Scheming detection
Scalable Oversight
- Leike et al. (2018) - Recursive reward modeling, arXiv:1811.07871
- Irving et al. (2018) - AI Safety via Debate
- Burns et al. (2023) - Weak-to-strong generalization
Unique Contributions of This Framework
- Quantitative Risk Composition: First framework to mathematically compose risks across hierarchical AI systems using Euler allocation
- Coordinator Formalization: Explicit formal treatment of the “trusted overseer” role (informal in other frameworks)
- Economic Alignment: Mechanism design for AI systems (missing from technical safety work)
- Practical Integration: Bridges theoretical safety (IDA, Debate) and deployment reality (RSP, Preparedness)
- Heterogeneous Systems: Safe composition of components at different trust/capability levels
- Real-Time Risk Tracking: Continuous budget monitoring vs. periodic evaluations
- Foundation Model Reality: Accounts for shared base models and fine-tuning