AI Safety Frameworks Comparison

Comparison of existing AI safety frameworks and how they address trust decomposition, risk budgeting, coordinator safety, and verification.

| Framework | Trust Decomposition | Risk Budgeting | Coordinator Safety | Verification |
| --- | --- | --- | --- | --- |
| CAIS (Drexler 2019) | Services, not agents | None | Underspecified | Human oversight |
| AI Control (Redwood 2024) | Trusted/Untrusted | Qualitative | T monitors U | Empirical protocols |
| IDA (Christiano 2018) | Amplification | None | Amplified human | Distillation |
| Anthropic RSP | ASL tiers | Safety buffers | Governance only | Capability evals |
| Guaranteed Safe AI | Component contracts | Worst-case only | Formal specs | Mathematical proofs |
| This framework | Hierarchical trust | Euler allocation | Formal constraints | Risk budgets + verification |

CAIS - Comprehensive AI Services (Drexler 2019)


Citation: Drexler, K.E. (2019). “Reframing Superintelligence: Comprehensive AI Services as General Intelligence.” FHI Technical Report #2019-1. FHI Oxford

Core thesis: AI-driven general intelligence need not take the form of unified, goal-directed agents; it can instead be delivered as services with expanding capabilities, where agency is optional.

Core principles:

  1. Avoid unified agency - treat unified AGI agents as a deviant branch of the technology
  2. Task-specific optimization - concrete goals, not open-ended utility functions
  3. Human oversight - services informed by models of human approval
  4. Recursive service development - self-improvement under human guidance

Limitations:

  • Foundation models gap - did not anticipate that shared base models would propagate failures across services
  • Agency creep - even GPT-4-class models are being deployed to act autonomously
  • Coordination underspecified - how do services coordinate without becoming agents?
  • Competitive dynamics - no mechanism to prevent racing toward unified agents
| CAIS Gap | This Framework’s Solution |
| --- | --- |
| No formal trust composition | Delegation risk framework with propagation rules |
| No quantitative risk allocation | Euler allocation for risk budgets |
| Underspecified coordination | Formal coordinator constraints |
| Ignores foundation model reality | Accounts for shared base models |
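
To make “Euler allocation for risk budgets” concrete, here is a minimal sketch under a strong simplifying assumption: each sub-agent’s contribution to system risk is summarized by a covariance matrix of failure exposures. The numbers and the `euler_allocation` helper are illustrative, not part of any framework API.

```python
import numpy as np

def euler_allocation(exposures, cov, eps=1e-12):
    """Allocate total system risk (standard deviation) back to components.

    For the positively homogeneous risk measure sigma(w) = sqrt(w' C w),
    Euler's theorem gives component i's contribution as w_i * d sigma / d w_i,
    and the contributions sum exactly to the total risk.
    """
    w = np.asarray(exposures, dtype=float)
    cov = np.asarray(cov, dtype=float)
    total_risk = np.sqrt(w @ cov @ w)
    marginal = (cov @ w) / (total_risk + eps)   # d sigma / d w_i
    contributions = w * marginal                # Euler contributions
    return total_risk, contributions

# Hypothetical example: three sub-agents with correlated failure exposure
# (numbers are illustrative, not measurements).
exposure = [0.4, 0.35, 0.25]
cov = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.16]]

total, contrib = euler_allocation(exposure, cov)
print(f"total risk {total:.4f}")
for i, c in enumerate(contrib):
    print(f"  sub-agent {i}: contribution {c:.4f} ({c / total:.1%} of budget)")
```

The useful property is that per-component contributions sum exactly to the total, which is what lets a coordinator hand out individual budgets that respect a system-wide cap.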

AI Control (Greenblatt et al. 2023)

Citation: Greenblatt, R. et al. (2023). “AI Control: Improving Safety Despite Intentional Subversion.” arXiv:2312.06942. Paper

Maintain safety even when the model is intentionally trying to subvert safety measures.

Key components:

  • Untrusted Model (U): powerful but potentially deceptive (e.g., GPT-4)
  • Trusted Model (T): less capable but reliable (e.g., GPT-3.5)
  • Limited Trusted Labor: human oversight with resource constraints

Control protocols:

  1. Trusted Monitoring - T reviews U’s outputs for suspicious behavior
  2. Untrusted Monitoring - U monitors itself (requires collusion-busting)
  3. Defer to Trusted - when suspicious, defer to the weaker but trusted model
  4. Factor(T,U) - decompose tasks and monitor each child task

Factor(T,U) results (December 2024): safety improved from 41% to 70% when factored cognition was added to trusted monitoring. Attacks on simpler child tasks are more obvious to monitors.
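
A minimal sketch of the trusted-monitoring / defer-to-trusted pattern described above; the `ControlProtocol` class, `suspicion_score` function, and threshold are hypothetical stand-ins, not the paper’s implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlProtocol:
    """Trusted monitoring with a defer-to-trusted fallback (sketch).

    untrusted: powerful model U that produces candidate outputs.
    trusted: weaker but reliable model T used as the fallback.
    suspicion_score: T's estimate, in [0, 1], that U's output is subversive.
    threshold: suspicion level above which U's output is discarded.
    """
    untrusted: Callable[[str], str]
    trusted: Callable[[str], str]
    suspicion_score: Callable[[str, str], float]
    threshold: float = 0.5

    def run(self, task: str) -> str:
        candidate = self.untrusted(task)
        score = self.suspicion_score(task, candidate)
        if score > self.threshold:
            # Defer to trusted: safer but typically less capable.
            return self.trusted(task)
        return candidate

# Toy usage with stand-in callables (real deployments would wrap model APIs).
protocol = ControlProtocol(
    untrusted=lambda t: f"U's answer to {t}",
    trusted=lambda t: f"T's answer to {t}",
    suspicion_score=lambda t, out: 0.9 if "rm -rf" in out else 0.1,
)
print(protocol.run("summarize the deployment logs"))
```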

Scheming Detection (OpenAI/Apollo 2024):

  • o4-mini: 8.7% → 0.3% scheming (29× reduction with deliberative alignment)
  • o3: 13% → 0.4% scheming (33× reduction)
| AI Control | This Framework |
| --- | --- |
| Qualitative T/U separation | Quantitative risk budgets |
| Single T-U pair | Multi-level hierarchies |
| Technical protocols only | Economic incentive alignment |
| No compositional framework | Mathematical trust composition |

Key integration: the coordinator is a formalized version of the “trusted model T,” with explicit capability bounds, risk budgets, and verification requirements.


IDA - Iterated Distillation and Amplification (Christiano 2018)


Citation: Christiano, P. et al. (2018). “Supervising strong learners by amplifying weak experts.” arXiv:1810.08575. Paper

Amplification: Weak agent + more resources/time = better outputs (like MCTS in AlphaGo)

Distillation: Train faster model to predict amplified outputs

Iteration: Repeat with distilled model as new base

The amplification loop in practice:

  1. The agent receives a difficult task
  2. Decomposes into subtasks
  3. Calls copies of itself to solve subtasks
  4. Integrates results

This is precisely the coordinator architecture pattern.
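
A minimal sketch of that decompose-solve-integrate loop; `decompose`, `solve_leaf`, and `integrate` are hypothetical stand-ins for model calls, and a real coordinator would add monitoring and budget checks at each step.

```python
from typing import Callable, List

def amplify(task: str,
            decompose: Callable[[str], List[str]],
            solve_leaf: Callable[[str], str],
            integrate: Callable[[str, List[str]], str],
            depth: int = 2) -> str:
    """Recursively decompose a task, solve subtasks, and integrate the results.

    depth limits recursion; at depth 0 the (weaker) base solver answers directly.
    """
    if depth == 0:
        return solve_leaf(task)
    subtasks = decompose(task)
    if not subtasks:  # task is already atomic
        return solve_leaf(task)
    sub_answers = [amplify(s, decompose, solve_leaf, integrate, depth - 1)
                   for s in subtasks]
    return integrate(task, sub_answers)

# Toy usage: split on " and ", answer leaves verbatim, join the answers.
result = amplify(
    "plan the migration and update the docs",
    decompose=lambda t: t.split(" and ") if " and " in t else [],
    solve_leaf=lambda t: f"[done: {t}]",
    integrate=lambda t, parts: "; ".join(parts),
)
print(result)
```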

Central question: “Can we solve difficult problems by composing small contributions from individual agents who don’t know the big picture?”

Challenge: problems that require extended learning to build intuitions (e.g., a problem halfway through a math textbook depends on intuitions built from the earlier chapters).

| IDA | This Framework |
| --- | --- |
| Training methodology | Deployment architecture |
| Alignment during training | Safety during operation |
| No runtime risk quantification | Continuous risk budgets |
| Theoretical | Practical implementation |

Complementary: train with IDA, deploy within this architecture, and use risk budgets to determine when a model is trusted enough for a given role.


Anthropic RSP - Responsible Scaling Policy


Citation: Anthropic (2023-2025). “Responsible Scaling Policy.” Versions 1.0-2.2. Anthropic RSP

| Level | Description | Status |
| --- | --- | --- |
| ASL-1 | No meaningful catastrophic risk | Historical LLMs |
| ASL-2 | Early dangerous capabilities (not yet useful) | Pre-Claude Opus 4 |
| ASL-3 | Substantially increased catastrophic misuse risk | Claude Opus 4 (Dec 2024) |
| ASL-4+ | Not yet defined | Future |
  • Yellow Line: Safety buffer before red line (originally 6× effective compute)
  • Red Line: Actual ASL threshold requiring new protections
  • Yellow line triggers comprehensive assessment and preparation

Evolution: Version 2.2 dropped the specific 6× number because the “science of evaluations is not currently mature enough.” Comprehensive assessment is now triggered by either:

  • a 4× increase in effective compute, OR
  • six months of accumulated post-training enhancements
| RSP | This Framework |
| --- | --- |
| Model-level ASL tiers | Component-level risk budgets |
| Qualitative thresholds | Quantitative probability budgets |
| Periodic assessments | Continuous risk tracking |
| Governance only | Technical + economic mechanisms |

Proposed Integration:

  • ASL-2 component: risk budget = 10⁻⁴ per operation
  • ASL-3 component: risk budget = 10⁻⁶ per operation
  • System composition: Euler allocation for total risk
  • Yellow line: 50% of budget consumed
  • Red line: 80% of budget consumed
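
A minimal sketch of the proposed integration, using the illustrative per-operation budgets and the 50%/80% thresholds above; the `RiskBudgetTracker` class and its fields are assumptions for exposition, not an existing interface.

```python
from dataclasses import dataclass

# Illustrative per-operation risk budgets keyed by ASL tier (from the proposal above).
ASL_BUDGETS = {"ASL-2": 1e-4, "ASL-3": 1e-6}

@dataclass
class RiskBudgetTracker:
    """Tracks cumulative risk consumption against a component's budget."""
    asl_tier: str
    operations_budgeted: int          # planned number of operations this period
    consumed: float = 0.0             # cumulative estimated failure probability

    @property
    def total_budget(self) -> float:
        return ASL_BUDGETS[self.asl_tier] * self.operations_budgeted

    def record(self, estimated_risk: float) -> str:
        """Record one operation's estimated risk and return the current status."""
        self.consumed += estimated_risk
        fraction = self.consumed / self.total_budget
        if fraction >= 0.8:
            return "RED: halt and require new protections"
        if fraction >= 0.5:
            return "YELLOW: trigger comprehensive assessment"
        return "GREEN"

tracker = RiskBudgetTracker(asl_tier="ASL-3", operations_budgeted=10_000)
# One risky batch pushes consumption past 50% of the 0.01 budget -> YELLOW.
print(tracker.record(6e-3))
```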

Guaranteed Safe AI (Tegmark, Seshia et al. 2024)


Citation: Dalrymple, D. et al. (2024). “Towards Guaranteed Safe AI.” arXiv:2405.06624. Paper

“Mathematical proof is humanity’s most powerful tool for controlling AGIs. Regardless of how intelligent a system becomes, it cannot prove a mathematical falsehood.”

Three components:

  1. World Model: mathematical description of how the AI affects the world
  2. Safety Specification: mathematical description of acceptable effects
  3. Verifier: auditable proof certificate that the AI satisfies the specification

Example provable properties:

  • Machines that provably cannot be logged into without correct credentials
  • DNA synthesizers provably unable to synthesize certain pathogens
  • AI hardware provably geofenced or equipped with a remote kill-switch

Limitations:

  • Requires a complete formal specification (extremely difficult)
  • Assumptions in the world model may not hold
  • Verification does not scale easily to large neural networks
| Guaranteed Safe AI | This Framework |
| --- | --- |
| Worst-case proofs | Probabilistic budgets |
| Complete specification needed | Works with partial specs |
| Formal verification | Risk budgets + verification |

Complementary:

  • Formally verify coordinator constraints (provable bounds)
  • Risk budget sub-agent behaviors (practical for complex tasks)
  • Hybrid: proofs for critical paths, budgets for rest
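
A minimal sketch of that hybrid: hard invariants stand in for the formally verified critical-path constraints, and a probabilistic budget gates everything else. The predicate, field names, and budget values are illustrative assumptions.

```python
from typing import Callable, List

def hybrid_check(action: dict,
                 hard_invariants: List[Callable[[dict], bool]],
                 estimated_risk: float,
                 remaining_budget: float) -> bool:
    """Approve an action only if it passes every hard invariant (the 'proved'
    critical-path constraints) and its estimated risk fits the remaining budget."""
    if not all(inv(action) for inv in hard_invariants):
        return False  # critical-path violation: always reject
    return estimated_risk <= remaining_budget

def no_privilege_escalation(a: dict) -> bool:
    # Illustrative invariant: the coordinator never delegates more authority than it holds.
    return a.get("delegated_authority", 0) <= a.get("coordinator_authority", 0)

approved = hybrid_check(
    action={"delegated_authority": 5, "coordinator_authority": 10},
    hard_invariants=[no_privilege_escalation],
    estimated_risk=2e-7,
    remaining_budget=1e-6,
)
print(approved)  # True: invariant holds and the risk fits the budget
```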

OpenAI Preparedness Framework

Citation: OpenAI (2023-2025). “Preparedness Framework.” OpenAI

Two thresholds:

  • High capability: Could amplify existing pathways to severe harm
  • Critical capability: Could introduce unprecedented new pathways

Governance is similar (Safety Advisory Group review, board veto) but less granular than the RSP’s ASL tiers.

AI Safety via Debate (Irving et al. 2018)

Citation: Irving, G. et al. (2018). “AI Safety via Debate.” arXiv:1805.00899. Paper

Two AI systems argue opposite sides; human judge evaluates.

Core assumption: “Harder to lie convincingly than to refute a lie”

2024 results: first empirical evidence that debate helps judges recognize truth (Anthropic, February 2024; DeepMind, July 2024).

Constitutional AI (Bai et al. 2022)

Citation: Bai, Y. et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073. Anthropic

Self-improvement without human labels for harmful outputs: the AI evaluates its own outputs against constitutional principles.

Key result: More helpful AND more harmless than standard RLHF.

Weak-to-Strong Generalization (Burns et al. 2023)

Citation: Burns, C. et al. (2023). “Weak-to-Strong Generalization.” arXiv:2312.09390. OpenAI

Core question: Can weak supervision elicit the full capabilities of a stronger model?

Finding: A GPT-2-level supervisor can elicit much of GPT-4’s capability (performance close to GPT-3.5 level), but not full recovery.

Implication: a weak coordinator supervising a strong sub-agent is exactly the weak-to-strong problem, so risk budgets must account for supervision limitations.
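
One way this framework could account for that limitation is to discount effective oversight when budgeting risk for a supervised sub-agent. The sketch below uses an assumed linear discount model and illustrative numbers; it is not a result from the paper.

```python
def supervised_failure_bound(base_failure_rate: float,
                             catch_rate: float,
                             supervision_gap: float) -> float:
    """Upper-bound the residual failure probability of a strong sub-agent
    under a weaker supervisor.

    catch_rate: fraction of failures the supervisor would catch at equal capability.
    supervision_gap: in [0, 1]; 0 = supervisor matches the sub-agent's capability,
                     1 = supervisor catches nothing beyond chance.
    """
    effective_catch = catch_rate * (1.0 - supervision_gap)
    return base_failure_rate * (1.0 - effective_catch)

# Illustrative: 1e-4 base failure rate, 95% nominal catch rate,
# 40% capability gap between coordinator and sub-agent.
residual = supervised_failure_bound(1e-4, 0.95, 0.4)
print(f"residual failure probability ~ {residual:.2e}")  # must fit within the risk budget
```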

Multi-Agent Risks from Advanced AI (Cooperative AI Foundation 2024)

Citation: Cooperative AI Foundation (2024). “Multi-Agent Risks from Advanced AI.” Report

Three failure modes:

  1. Miscoordination: Failure to cooperate despite shared goals
  2. Conflict: Failure due to differing goals
  3. Collusion: Undesirable cooperation

Coordinator architecture addresses all three via shared state, risk budgets, and monitoring.


Key sources by comparison dimension:

Trust decomposition:

  • Drexler (2019) - CAIS services architecture
  • Greenblatt et al. (2023) - AI Control T/U decomposition
  • Christiano et al. (2018) - IDA amplification

Risk budgeting:

  • Anthropic RSP (2023-2025) - Safety buffers
  • OpenAI Preparedness (2023-2025) - Capability thresholds
  • This framework - Euler allocation from finance

Coordinator safety:

  • AI Control protocols - Trusted monitoring
  • Factor(T,U) (2024) - Factored cognition for monitoring
  • This framework - Formal coordinator constraints

Verification and oversight:

  • Dalrymple et al. (2024) - Guaranteed Safe AI
  • Urban & Miné (2021) - Formal methods for ML, arXiv:2104.02466
  • OpenAI/Apollo (2024) - Scheming detection
  • Leike et al. (2018) - Recursive reward modeling, arXiv:1811.07871
  • Irving et al. (2018) - AI Safety via Debate
  • Burns et al. (2023) - Weak-to-strong generalization

Novel Contributions of This Framework

  1. Quantitative Risk Composition: First framework to mathematically compose risks across hierarchical AI systems using Euler allocation

  2. Coordinator Formalization: Explicit formal treatment of the “trusted overseer” role (informal in other frameworks)

  3. Economic Alignment: Mechanism design for AI systems (missing from technical safety work)

  4. Practical Integration: Bridges theoretical safety (IDA, Debate) and deployment reality (RSP, Preparedness)

  5. Heterogeneous Systems: Safe composition of components at different trust/capability levels

  6. Real-Time Risk Tracking: Continuous budget monitoring vs. periodic evaluations

  7. Foundation Model Reality: Accounts for shared base models and fine-tuning