Risk Decomposition: Accidents vs. Defection

Every harm mode in a delegation exposure falls into one of two categories:

```mermaid
flowchart TB
    DE[Delegation Exposure]

    DE --> AE[Accident Exposure<br/>non-goal-directed]
    DE --> DF[Defection Exposure<br/>goal-directed]

    AE --> A1[Bugs and errors]
    AE --> A2[Hallucinations]
    AE --> A3[Capability limitations]
    AE --> A4[Edge cases]
    AE --> A5[System failures]

    DF --> D1[Scheming / deception]
    DF --> D2[Pursuing misaligned objectives]
    DF --> D3[Collusion between components]
    DF --> D4[Resource acquisition]
    DF --> D5[Goal preservation behaviors]

    style AE fill:#e6f3ff
    style DF fill:#ffe6e6
```

Accident Exposure: The system isn’t trying to cause harm—it’s failing to achieve the intended goal. A calculator that returns wrong answers, an LLM that hallucinates citations, a robot that drops objects.

Defection Exposure: The system is effectively optimizing for something other than the principal’s interests. This includes both explicit deception and emergent misalignment where the system’s learned objective diverges from intent.
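
To make the distinction concrete, here is a minimal sketch in Python (the names and structure are ours, purely illustrative) of how a harm mode might be represented, with risk computed as probability × damage, as in the worked example below:

```python
from dataclasses import dataclass
from enum import Enum

class HarmType(Enum):
    ACCIDENT = "accident"    # non-goal-directed failure to achieve the goal
    DEFECTION = "defection"  # goal-directed divergence from the principal's interests

@dataclass
class HarmMode:
    """One way a delegated system can cause harm."""
    description: str
    harm_type: HarmType
    probability: float  # estimated chance the harm occurs
    damage: float       # cost in dollars if it does

    @property
    def risk(self) -> float:
        """Expected loss: probability times damage."""
        return self.probability * self.damage
```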

```mermaid
graph LR
    subgraph "As Capability Increases"
        A[Accident Risk] -->|decreases| A2[Smarter = fewer errors]
        D[Defection Risk] -->|may increase| D2[Smarter = better at pursuing wrong goals]
    end
```

| Risk Type | Capability Scaling | Why |
| --- | --- | --- |
| Accident Risk | Generally ↓ | More capable systems make fewer errors, handle more edge cases |
| Defection Risk | May ↑ | More capable systems are better at strategic deception, can pursue misaligned goals more effectively |

This is the core tension in AI development: the same capabilities that reduce accidents may increase defection risk.

The two risk types also call for different mitigations:

| Risk Type | Primary Mitigations |
| --- | --- |
| Accident Risk | Testing, verification, redundancy, capability improvements, error handling, monitoring for anomalies |
| Defection Risk | Alignment techniques, architectural constraints, capability limitations, monitoring for deception, human oversight, interpretability |

Some mitigations help both (monitoring, redundancy). Others are specific:

  • Capability improvements reduce accident risk but may increase defection risk
  • Capability limitations reduce defection risk but may increase accident risk
  • Alignment work primarily targets defection risk

Estimating the probabilities also requires different methods:

| Risk Type | How to Estimate Probability |
| --- | --- |
| Accident Risk | Historical data, testing, formal verification, fuzzing, code review |
| Defection Risk | Red-teaming, interpretability research, behavioral analysis, honeypots, game-theoretic analysis |

Accidents leave traces in logs and tests. Defection may be actively hidden.

4. Different Components Have Different Ratios

| Component Type | Typical Risk Ratio | Why |
| --- | --- | --- |
| Deterministic code | ~100% Accident | No goal-directedness to misalign |
| Narrow ML models | ~95% Accident | Limited optimization, specific task |
| General LLMs | ~80% Accident* | Some goal-directedness, but mostly errors |
| Agentic systems | ~60% Accident* | Significant goal-directed behavior |
| RL-trained agents | ~40% Accident* | Strong optimization, emergent goals |
| Coordinators | Defection-dominated | Strategic position, leverage over system |

*Estimates are highly uncertain and depend on the specific system and deployment.
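
If you take these placeholder ratios at face value (and they are only placeholders), a first-pass split of a component's total risk might look like the following sketch; the constants and names are our assumptions, not measurements:

```python
# Rough accident shares from the table above -- highly uncertain placeholders.
ACCIDENT_SHARE = {
    "deterministic_code": 1.00,
    "narrow_ml_model": 0.95,
    "general_llm": 0.80,
    "agentic_system": 0.60,
    "rl_trained_agent": 0.40,
}

def split_risk(total_risk: float, component_type: str) -> tuple[float, float]:
    """Split a component's total expected loss into (accident, defection)."""
    share = ACCIDENT_SHARE[component_type]
    return total_risk * share, total_risk * (1.0 - share)

# Example: a general LLM with $1,000 total expected loss.
accident, defection = split_risk(1_000.0, "general_llm")
print(f"accident ${accident:.0f}, defection ${defection:.0f}")  # accident $800, defection $200
```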

Worked Example: Decomposing a Research Assistant


Consider the research assistant from Delegation Risk Overview:

Summarizer harm modes:

| Harm Mode | Type | Probability | Damage | Risk |
| --- | --- | --- | --- | --- |
| Misrepresents findings (misunderstanding) | Accident | 0.018 | $5,000 | $90 |
| Misrepresents findings (strategic) | Defection | 0.002 | $5,000 | $10 |
| Leaks data (error) | Accident | 0.0008 | $50,000 | $40 |
| Leaks data (intentional) | Defection | 0.0002 | $50,000 | $10 |

Summarizer Risk Breakdown:

  • Accident Risk: $130 (87%)
  • Defection Risk: $20 (13%)
  • Total: $150
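
Each row's risk is just probability × damage; a short sketch (Python, our own illustration) reproduces the summarizer's numbers:

```python
# (description, type, probability, damage) for each summarizer harm mode
harm_modes = [
    ("Misrepresents findings (misunderstanding)", "accident", 0.018, 5_000),
    ("Misrepresents findings (strategic)", "defection", 0.002, 5_000),
    ("Leaks data (error)", "accident", 0.0008, 50_000),
    ("Leaks data (intentional)", "defection", 0.0002, 50_000),
]

accident = sum(p * d for _, kind, p, d in harm_modes if kind == "accident")
defection = sum(p * d for _, kind, p, d in harm_modes if kind == "defection")
total = accident + defection

print(f"Accident:  ${accident:.0f} ({accident / total:.0%})")    # $130 (87%)
print(f"Defection: ${defection:.0f} ({defection / total:.0%})")  # $20 (13%)
print(f"Total:     ${total:.0f}")                                # $150
```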

Code Deployer harm modes:

| Harm Mode | Type | Probability | Damage | Risk |
| --- | --- | --- | --- | --- |
| Deploys buggy code | Accident | 0.049 | $20,000 | $980 |
| Deploys subtly malicious code | Defection | 0.001 | $20,000 | $20 |
| Catastrophic breach (error) | Accident | 0.00005 | $1,000,000 | $50 |
| Catastrophic breach (intentional) | Defection | 0.00005 | $1,000,000 | $50 |

Code Deployer Risk Breakdown:

  • Accident Risk: $1,030 (94%)
  • Defection Risk: $70 (6%)
  • Total: $1,100

Coordinator harm modes:

| Harm Mode | Type | Probability | Damage | Risk |
| --- | --- | --- | --- | --- |
| Misroutes task (error) | Accident | 0.008 | $2,000 | $16 |
| Misroutes task (strategic) | Defection | 0.002 | $2,000 | $4 |
| Grants excessive permissions (error) | Accident | 0.001 | $100,000 | $100 |
| Grants excessive permissions (strategic) | Defection | 0.004 | $100,000 | $400 |

Coordinator Risk Breakdown:

  • Accident Risk: $116 (22%)
  • Defection Risk: $404 (78%)
  • Total: $520

Combining the three components:

| Component | Total Risk | Accident | Defection | Defection % |
| --- | --- | --- | --- | --- |
| Summarizer | $150 | $130 | $20 | 13% |
| Code Deployer | $1,100 | $1,030 | $70 | 6% |
| Coordinator | $520 | $116 | $404 | 78% |
| System Total | $1,770 | $1,276 | $494 | 28% |
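
The same arithmetic rolls up to the system level; a brief sketch, with figures taken from the tables above:

```python
# Per-component accident/defection risk in dollars, from the worked example.
components = {
    "Summarizer":    {"accident": 130,   "defection": 20},
    "Code Deployer": {"accident": 1_030, "defection": 70},
    "Coordinator":   {"accident": 116,   "defection": 404},
}

for name, r in components.items():
    total = r["accident"] + r["defection"]
    print(f"{name:13s} total ${total:>5,}  defection {r['defection'] / total:.0%}")

acc = sum(r["accident"] for r in components.values())
dfc = sum(r["defection"] for r in components.values())
print(f"System total  ${acc + dfc:,}  defection {dfc / (acc + dfc):.0%}")  # $1,770, 28%
```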

For accident-dominated components (most executors):

  • Invest in testing and verification
  • Add redundancy (N-version programming)
  • Improve capabilities where possible
  • Monitor for anomalies

For defection-dominated components (coordinators, high-autonomy agents):

  • Apply strict architectural constraints
  • Limit capabilities to necessary minimum
  • Use decomposed coordination
  • Implement human oversight at key decisions
  • Monitor for deceptive patterns

Principle: Separate Risk Types Architecturally

```mermaid
flowchart TB
    subgraph "High Defection Risk Zone"
        C[Coordinator]
    end

    subgraph "Low Defection Risk Zone"
        E1[Executor 1<br/>Deterministic]
        E2[Executor 2<br/>Narrow Model]
        E3[Executor 3<br/>Verified Code]
    end

    C -->|constrained interface| E1
    C -->|constrained interface| E2
    C -->|constrained interface| E3

    style C fill:#ffcccc
    style E1 fill:#ccffcc
    style E2 fill:#ccffcc
    style E3 fill:#ccffcc
```

By isolating high-defection-risk components and constraining their interfaces, you limit the damage a defecting component can cause.
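
One way to read "constrained interface" in code: the coordinator reaches executors only through a fixed allowlist of narrow entry points. The sketch below is a hypothetical illustration (all names are ours), not a prescribed implementation:

```python
from typing import Callable

def summarize(text: str) -> str:
    """Narrow executor: returns a (trivially) shortened summary."""
    return text[:100]

def deploy_reviewed_build(build_id: str) -> bool:
    """Verified pipeline executor: deploys only pre-reviewed builds."""
    return True

# The allowlist *is* the interface; anything not listed is unreachable,
# which bounds what a defecting coordinator can do.
ALLOWED_ACTIONS: dict[str, Callable] = {
    "summarize": summarize,
    "deploy_reviewed_build": deploy_reviewed_build,
}

def coordinator_call(action: str, *args):
    """Route a coordinator request through the constrained interface."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {action!r} is not allowlisted")
    return ALLOWED_ACTIONS[action](*args)
```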

Which term to emphasize depends on context:

| Context | Use |
| --- | --- |
| General risk discussions | “Delegation Risk” |
| Discussing bugs, errors, testing | “Accident Risk” |
| Discussing alignment, scheming, deception | “Defection Risk” |
| Coordinator constraints | Emphasize “Defection Risk” |
| Verification and testing | Emphasize “Accident Risk” |
| Capability scaling analysis | Distinguish both |

For current AI systems, most realized harms are accidents. But as systems become more capable and autonomous:

  1. Accident risk decreases (better at tasks)
  2. Defection risk may increase (better at deception)
  3. Defection becomes harder to detect (smarter adversary)
  4. Stakes increase (more capability = more damage potential)

This means the ratio of defection to accident risk likely increases with capability, even if absolute accident risk decreases. Safety architectures must anticipate this shift.
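
A purely illustrative toy model of this shift (the functional forms and constants below are arbitrary assumptions chosen only to show the qualitative dynamic, not a forecast):

```python
import math

def accident_risk(c: float) -> float:
    return 100.0 * math.exp(-c)         # errors decay as capability c grows

def defection_risk(c: float) -> float:
    return 50.0 * (1.0 - math.exp(-c))  # grows toward a ceiling with capability

for c in (0.5, 1.0, 2.0, 4.0):
    a, d = accident_risk(c), defection_risk(c)
    print(f"capability {c}: total {a + d:6.1f}, defection share {d / (a + d):.0%}")
# Total risk falls, but the defection share climbs -- the shift described above.
```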