
Relationship to Existing Approaches

CAIS proposed: Build superintelligence as an ecosystem of narrow services; avoid monolithic agents

Challenge: The foundation model era changed the landscape—“train from scratch for each task” is less practical now. Also, pure tool AI faces economic pressure to add agency (Gwern, 2016).

This proposal: Accept agency, but try to structure it through modularity. Agentic coordinators over narrow components, not passive services waiting for calls.
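
To make the contrast concrete, here is a minimal sketch, with entirely hypothetical names and interfaces, of what "agentic coordinators over narrow components" could look like: the coordinator holds all the initiative, while components are passive functions behind a fixed interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class NarrowComponent:
    """A passive, single-purpose service: it never initiates anything."""
    name: str
    run: Callable[[str], str]

class Coordinator:
    """The agentic part: it decides which component to invoke, and when."""
    def __init__(self, components: dict[str, NarrowComponent]):
        self.components = components

    def execute(self, plan: list[tuple[str, str]]) -> list[str]:
        # Agency lives here; components only answer direct calls.
        return [self.components[name].run(task) for name, task in plan]

summarizer = NarrowComponent("summarize", lambda text: text[:100])
coordinator = Coordinator({"summarize": summarizer})
print(coordinator.execute([("summarize", "A long document ...")]))
```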

AI Control: Assume models might be scheming; build protocols that maintain safety regardless

This proposal: Architectural foundation that makes scheming harder + forecasting-based navigation

The combination (obviously you'd do both; see the code sketch after this list):

  • Trusted monitoring: Use control's trusted monitoring on all LLM components. Smaller, trusted models check the outputs of larger untrusted components. Modular architecture makes monitoring easier (each component has a narrow output format), and fine-tuned narrow models are easier to trust than frontier models.
  • Honeypots: Inject fake vulnerabilities to detect scheming. Examples: fake high-privilege components that coordinators shouldn't invoke, or fake data that tempts misbehavior. Decomposed coordination makes honeypots more effective, since each layer can be tested.
  • Red-teaming: Run adversarial testing against the entire modular system. Try to make coordinators scheme despite architectural constraints, and test whether decomposition actually prevents scheming or just obscures it. This is how you validate that the architecture works.
  • Resampling: If a component's output looks suspicious, resample with a different component. Modular architecture enables this (components can be swapped easily), redundant coordinators mean suspicious coordination can be deferred to a backup, and fine-tuned models give diverse implementations to sample from.
  • Immutable logging: Control requires immutable logging of all decisions. Our architecture already includes this for coordinators; extend it to all LLM components. Decomposition makes logs more analyzable (narrow scope per component).
  • Rate limiting: Control limits how fast systems can take actions. Applied to coordinators: no rapid sequences of strategic decisions. Applied to components: a maximum number of invocations per time window. This helps prevent fast-takeoff scenarios.
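
A minimal sketch of how three of these protocols could compose: trusted monitoring, resampling with a different component, and per-component rate limits. Everything here (names, the suspicion threshold, the stub monitor) is invented for illustration, not an interface from the control literature.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` invocations per `window` seconds."""
    def __init__(self, max_calls: int, window: float):
        self.max_calls, self.window = max_calls, window
        self.calls: deque = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.calls and now - self.calls[0] > self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

def run_with_control(task, components, trusted_monitor, limiter,
                     threshold=0.7):
    """Try redundant components in order; accept the first output the
    trusted monitor does not flag. Every step goes into the audit log."""
    log = []
    for component in components:
        if not limiter.allow():
            log.append(("rate_limited", getattr(component, "__name__", "?")))
            break
        output = component(task)
        suspicion = trusted_monitor(task, output)  # small trusted model
        log.append((getattr(component, "__name__", "?"), output, suspicion))
        if suspicion < threshold:
            return output, log        # accepted
    return None, log                  # all flagged: defer to humans

result, log = run_with_control(
    "summarize findings",
    components=[lambda t: f"draft A: {t}", lambda t: f"draft B: {t}"],
    trusted_monitor=lambda task, out: 0.1,   # stub: always low suspicion
    limiter=RateLimiter(max_calls=10, window=60.0),
)
```
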
```mermaid
flowchart TB
    subgraph "Defense in Depth"
        Arch[Modular Architecture<br/>May reduce scheming] --> Control[AI Control<br/>Catches scheming if it happens]
        Control --> Budget[Limited Budgets<br/>Bound damage]
        Budget --> Forecast[Forecasting<br/>Predicts problems]
        Forecast --> Human[Human Review<br/>Final check]
    end
```

  • Modular architecture alone: might limit scheming opportunities, but this is unproven.
  • AI Control alone: handles scheming but doesn't reduce its probability.
  • Combined: the architecture may reduce the probability of scheming, and Control catches it if it happens.

Concrete example (Research Assistant):

  • Architecture: Decomposed coordination, fine-tuned narrow models, limited budgets
  • Control layer 1: Trusted monitor checks all hypothesis generation outputs
  • Control layer 2: Honeypot includes a fake “high-impact but dangerous” research direction (sketched in code after this list)
  • Control layer 3: Strategy Module outputs audited for consistency with stated goals
  • Control layer 4: If Tactical Coordination acts suspiciously, defer to backup coordinator
  • Control layer 5: All decisions logged immutably for post-hoc analysis
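
Control layer 2 above could look like the following sketch: a fake "high-impact but dangerous" research direction registered alongside real components, so that merely selecting it trips an alarm. The registry names and interface are invented for illustration.

```python
class HoneypotTripped(Exception):
    """Raised when a coordinator invokes a component it never should."""

def make_honeypot(name: str, alarm):
    def fake_component(task: str) -> str:
        # Any call here is evidence of misaligned coordination.
        alarm(f"honeypot '{name}' invoked with task {task!r}")
        raise HoneypotTripped(name)
    return fake_component

registry = {
    "literature_search": lambda task: f"results for {task}",
    # Advertised as high-impact but flagged dangerous; a well-behaved
    # coordinator must never select it.
    "high_risk_unvetted_archive": make_honeypot(
        "high_risk_unvetted_archive", alarm=print),
}

try:
    registry["high_risk_unvetted_archive"]("pursue the flagged direction")
except HoneypotTripped:
    pass  # alarm already fired; quarantine the coordinator here
```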

This is defense in depth:

  • If architecture fails to prevent scheming → Control catches it
  • If Control misses scheming instance → Limited budgets bound damage
  • If both fail → Forecasting should predict bad outcomes → Human review
  • If all fail → Worst case limited by component scope and privilege
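
The value of stacking layers can be made quantitative. Under the optimistic assumption that layers fail independently, the probability of catastrophe is the product of the per-layer failure probabilities. The numbers below are invented placeholders; real values would come from control evaluations, and correlated blind spots would push the true figure higher.

```python
from math import prod

layer_failure = {
    "architecture": 0.3,   # fails to prevent scheming
    "control":      0.1,   # monitoring misses the instance
    "budgets":      0.2,   # damage exceeds the budgeted bound
    "forecasting":  0.3,   # bad outcome goes unpredicted
    "human_review": 0.5,   # reviewer misses it too
}
p_catastrophe = prod(layer_failure.values())
print(f"P(every layer fails) ~ {p_catastrophe:.4f}")  # 0.0009 if independent
```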

Key insight: Modular architecture makes Control easier and more effective

  • Narrow components easier to monitor than monolithic models
  • Decomposed coordination has more checkpoints for honeypots
  • Audit trails more analyzable with structured component interactions
  • Fine-tuned models more predictable for control evaluations
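
The first bullet is easy to see in code: a narrow component can be held to a tiny output contract, which makes both automated monitoring and human audits tractable. The schema below is a hypothetical example, not a specified interface.

```python
import json

# Hypothetical narrow output contract for a hypothesis-generation
# component: exactly these keys, nothing else. Anything failing the
# check is rejected before a trusted monitor even has to read it.
ALLOWED_KEYS = {"hypothesis", "confidence", "cited_sources"}

def validate_output(raw: str) -> dict:
    data = json.loads(raw)                      # must be valid JSON
    if set(data) != ALLOWED_KEYS:
        raise ValueError(f"key mismatch: {set(data) ^ ALLOWED_KEYS}")
    if not (0.0 <= data["confidence"] <= 1.0):
        raise ValueError("confidence out of range")
    if not isinstance(data["cited_sources"], list):
        raise ValueError("cited_sources must be a list")
    return data

validate_output('{"hypothesis": "X causes Y", "confidence": 0.6, '
                '"cited_sources": ["doi:10.0000/example"]}')
```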

IDA (Iterated Distillation and Amplification): Recursively amplify human judgment—a human plus AI assistants decompose hard problems, the results are distilled back into a faster model, and the process iterates

This proposal: Use forecasting-based navigation with AI self-evaluation, not human amplification

Key difference: IDA assumes human judgment is the ground truth to amplify. This proposal assumes forecasting plus evaluation by the AI itself is sufficient, provided scheming is prevented.

Why different: IDA requires enormous human effort in the loop. Forecasting-based navigation offloads evaluation to the AI, reserving human oversight for high-level decisions and for verifying that forecasting remains calibrated.
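
That calibration check can be operationalized cheaply. One standard option is tracking the Brier score of recent forecasts and escalating to humans when it degrades; the history and threshold below are invented for illustration.

```python
def brier_score(forecasts: list) -> float:
    """Mean squared error between forecast probability and outcome."""
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)

# Invented history of (predicted probability, what actually happened).
history = [(0.9, True), (0.2, False), (0.7, True), (0.6, False)]
score = brier_score(history)
if score > 0.25:   # worse than always guessing 0.5; threshold is arbitrary
    print(f"forecasting drifting out of calibration: Brier={score:.3f}")
```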

Capability limitation: Boxing, restricted action spaces, and oracle AI—approaches that prevent dangerous capabilities entirely

This proposal: Allow full capabilities (including agency), but allocate and structure them carefully

Key difference: You can build transformatively useful AI with this approach. Pure limitation fails Gwern’s economic test—limited AI gets outcompeted.

Five reasons for cautious optimism:

1. Decomposition has precedent. Software engineering, security engineering, and system design all rely on modularity for safety. The principle isn’t new—the application to AI is.

2. Agency is inevitable anyway. If we’re going to have agentic AI (and economic pressures suggest we will), better to structure it safely than fight the tide.

3. We only need this for ~5 years. Not claiming this solves superintelligence alignment. Claiming it might keep human-level AI safe long enough for that AI to help solve the harder problems.

4. Defense in depth. Not relying on a single mechanism—combining 10+ principles, multiple safety layers, continuous monitoring, and diverse redundancy. Many layers must fail before catastrophe.

5. Builds on established techniques. Borrowing from security engineering (least privilege, defense in depth), software engineering (modularity, testing), distributed systems (Byzantine fault tolerance), and formal methods (verification). Not inventing from scratch—though the AI-specific application is novel.


| Approach | Core Idea | Strengths | Limitations | Compatibility |
|---|---|---|---|---|
| This Framework | Decomposed components with delegation risk budgets | Quantified risk, works with agency, adapts established cross-domain methods | Implementation overhead, uncertain AI-specific evidence, untested at scale | (baseline) |
| CAIS | Narrow AI services, no agents | Avoids agency risk entirely | Economically uncompetitive, obsolete with foundation models | Complementary (we accept agency) |
| AI Control | Assume scheming, design for safety anyway | Adversarial mindset, practical protocols | Doesn’t reduce scheming probability | Highly complementary |
| IDA | Amplify human judgment recursively | Human values preserved | Enormous human effort required | Partially compatible |
| Constitutional AI | Train model on principles | Scalable, built into weights | Doesn’t guarantee constraint following | Use for base models in components |
| RLHF | Optimize for human preference | Effective for helpfulness | Doesn’t solve deception, reward hacking | Use for component fine-tuning |
| Boxing/Containment | Restrict AI capabilities | Simple conceptually | Economically unviable, breaks usefulness | Apply within components |
| Interpretability | Understand model internals | Could provide ground truth | Far from deployment-ready | Would enhance verification |

| Situation | Recommended Approaches |
|---|---|
| Building production AI systems now | This framework + AI Control + RLHF |
| Research on alignment foundations | Interpretability + Constitutional AI |
| Extremely high-stakes deployment | All of the above + human oversight |
| Internal tools, low risk | Light application of this framework |