# Risk Decomposition: Accidents vs. Defection

## The Two Sources of Harm

Every harm mode in a delegation exposure falls into one of two categories:
```mermaid
flowchart TB
    DE[Delegation Exposure]
    DE --> AE[Accident Exposure<br/>non-goal-directed]
    DE --> DF[Defection Exposure<br/>goal-directed]
    AE --> A1[Bugs and errors]
    AE --> A2[Hallucinations]
    AE --> A3[Capability limitations]
    AE --> A4[Edge cases]
    AE --> A5[System failures]
    DF --> D1[Scheming / deception]
    DF --> D2[Pursuing misaligned objectives]
    DF --> D3[Collusion between components]
    DF --> D4[Resource acquisition]
    DF --> D5[Goal preservation behaviors]
    style AE fill:#e6f3ff
    style DF fill:#ffe6e6
```
**Accident Exposure:** The system isn't trying to cause harm—it's failing to achieve the intended goal. A calculator that returns wrong answers, an LLM that hallucinates citations, a robot that drops objects.

**Defection Exposure:** The system is effectively optimizing for something other than the principal's interests. This includes both explicit deception and emergent misalignment, where the system's learned objective diverges from the principal's intent.
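To make the decomposition concrete, here is a minimal sketch of the taxonomy in Python. The class and field names are illustrative, not from any particular library; the `risk` property is the probability-times-damage calculation used in the worked example later on this page.

```python
from dataclasses import dataclass
from enum import Enum


class HarmType(Enum):
    """The two categories every harm mode falls into."""
    ACCIDENT = "accident"    # non-goal-directed: the system fails at the intended goal
    DEFECTION = "defection"  # goal-directed: the system optimizes for something else


@dataclass
class HarmMode:
    """One way a delegated component can cause harm."""
    description: str
    harm_type: HarmType
    probability: float  # chance the harm occurs over the period considered
    damage: float       # loss in dollars if it occurs

    @property
    def risk(self) -> float:
        """Expected loss: probability times damage."""
        return self.probability * self.damage
```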
## Why the Distinction Matters

### 1. Different Scaling with Capability
```mermaid
graph LR
    subgraph "As Capability Increases"
        A[Accident Risk] -->|decreases| A2[Smarter = fewer errors]
        D[Defection Risk] -->|may increase| D2[Smarter = better at pursuing wrong goals]
    end
```
| Risk Type | Capability Scaling | Why |
|---|---|---|
| Accident Risk | Generally ↓ | More capable systems make fewer errors, handle more edge cases |
| Defection Risk | May ↑ | More capable systems are better at strategic deception, can pursue misaligned goals more effectively |
This is the core tension in AI development: the same capabilities that reduce accidents may increase defection risk.
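A toy model makes the tension visible. Both functional forms and all constants below are illustrative assumptions, not empirical estimates:

```python
# Toy model: accident probability falls with capability while defection
# probability rises. Purely illustrative.

def accident_probability(capability: float) -> float:
    """Assume error rates shrink as systems get more capable."""
    return 0.10 / (1.0 + capability)

def defection_probability(capability: float) -> float:
    """Assume strategic misalignment grows more likely with capability."""
    return 0.001 * (1.0 + capability)

for c in [0.0, 1.0, 4.0, 9.0]:
    acc, dfc = accident_probability(c), defection_probability(c)
    share = dfc / (acc + dfc)
    print(f"capability={c:4.1f}  accident={acc:.4f}  "
          f"defection={dfc:.4f}  defection share={share:.0%}")
```

Even in this crude model, the defection share of total risk climbs from about 1% to 50% as capability grows, previewing the shift discussed at the end of this page.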
### 2. Different Mitigations

| Risk Type | Primary Mitigations |
|---|---|
| Accident Risk | Testing, verification, redundancy, capability improvements, error handling, monitoring for anomalies |
| Defection Risk | Alignment techniques, architectural constraints, capability limitations, monitoring for deception, human oversight, interpretability |
Some mitigations help both (monitoring, redundancy). Others are specific:
- Capability improvements reduce accident risk but may increase defection risk
- Capability limitations reduce defection risk but may increase accident risk
- Alignment work primarily targets defection risk
### 3. Different Detection Methods

| Risk Type | How to Estimate Probability |
|---|---|
| Accident Risk | Historical data, testing, formal verification, fuzzing, code review |
| Defection Risk | Red-teaming, interpretability research, behavioral analysis, honeypots, game-theoretic analysis |
Accidents leave traces in logs and tests. Defection may be actively hidden.
### 4. Different Components Have Different Ratios

| Component Type | Typical Risk Ratio | Why |
|---|---|---|
| Deterministic code | ~100% Accident | No goal-directedness to misalign |
| Narrow ML models | ~95% Accident | Limited optimization, specific task |
| General LLMs | ~80% Accident* | Some goal-directedness, but mostly errors |
| Agentic systems | ~60% Accident* | Significant goal-directed behavior |
| RL-trained agents | ~40% Accident* | Strong optimization, emergent goals |
| Coordinators | Defection-dominated | Strategic position, leverage over system |
*Estimates are highly uncertain and depend on the specific system and deployment.
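Encoded as a lookup, the table can drive rough prioritization. The fractions below are the same highly uncertain estimates from the table and inherit all of its caveats; coordinators, being defection-dominated, need a component-specific assessment rather than a default ratio.

```python
# Typical accident share of total risk, per the table above. Highly uncertain.
TYPICAL_ACCIDENT_FRACTION = {
    "deterministic_code": 1.00,
    "narrow_ml_model": 0.95,
    "general_llm": 0.80,
    "agentic_system": 0.60,
    "rl_trained_agent": 0.40,
    # Coordinators are defection-dominated; assess them individually.
}

def estimated_defection_risk(total_risk: float, component_type: str) -> float:
    """Split a component's total risk using the typical ratio."""
    return total_risk * (1.0 - TYPICAL_ACCIDENT_FRACTION[component_type])

print(f"{estimated_defection_risk(1_000.0, 'general_llm'):.0f}")  # 200
```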
## Worked Example: Decomposing a Research Assistant

Consider the research assistant from the Delegation Risk Overview:
### Summarizer Component

| Harm Mode | Type | Probability | Damage | Risk |
|---|---|---|---|---|
| Misrepresents findings (misunderstanding) | Accident | 0.018 | $5,000 | $90 |
| Misrepresents findings (strategic) | Defection | 0.002 | $5,000 | $10 |
| Leaks data (error) | Accident | 0.0008 | $50,000 | $40 |
| Leaks data (intentional) | Defection | 0.0002 | $50,000 | $10 |
Summarizer Risk Breakdown:
- Accident Risk: $130 (87%)
- Defection Risk: $20 (13%)
- Total: $150
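The breakdown is just risk = probability × damage, summed by type. Reusing the `HarmMode` sketch from earlier on this page:

```python
summarizer = [
    HarmMode("Misrepresents findings (misunderstanding)", HarmType.ACCIDENT, 0.018, 5_000),
    HarmMode("Misrepresents findings (strategic)", HarmType.DEFECTION, 0.002, 5_000),
    HarmMode("Leaks data (error)", HarmType.ACCIDENT, 0.0008, 50_000),
    HarmMode("Leaks data (intentional)", HarmType.DEFECTION, 0.0002, 50_000),
]

accident = sum(h.risk for h in summarizer if h.harm_type is HarmType.ACCIDENT)
defection = sum(h.risk for h in summarizer if h.harm_type is HarmType.DEFECTION)
total = accident + defection
print(f"Accident:  ${accident:.0f} ({accident / total:.0%})")    # $130 (87%)
print(f"Defection: ${defection:.0f} ({defection / total:.0%})")  # $20 (13%)
```

The same computation applies to the two components below.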
### Code Deployer Component

| Harm Mode | Type | Probability | Damage | Risk |
|---|---|---|---|---|
| Deploys buggy code | Accident | 0.049 | $20,000 | $980 |
| Deploys subtly malicious code | Defection | 0.001 | $20,000 | $20 |
| Catastrophic breach (error) | Accident | 0.00005 | $1,000,000 | $50 |
| Catastrophic breach (intentional) | Defection | 0.00005 | $1,000,000 | $50 |
Code Deployer Risk Breakdown:
- Accident Risk: $1,030 (94%)
- Defection Risk: $70 (6%)
- Total: $1,100
### Coordinator Component

| Harm Mode | Type | Probability | Damage | Risk |
|---|---|---|---|---|
| Misroutes task (error) | Accident | 0.008 | $2,000 | $16 |
| Misroutes task (strategic) | Defection | 0.002 | $2,000 | $4 |
| Grants excessive permissions (error) | Accident | 0.001 | $100,000 | $100 |
| Grants excessive permissions (strategic) | Defection | 0.004 | $100,000 | $400 |
Coordinator Risk Breakdown:
- Accident Risk: $116 (22%)
- Defection Risk: $404 (78%)
- Total: $520
### System Summary

| Component | Total Risk | Accident | Defection | Defection % |
|---|---|---|---|---|
| Summarizer | $150 | $130 | $20 | 13% |
| Code Deployer | $1,100 | $1,030 | $70 | 6% |
| Coordinator | $520 | $116 | $404 | 78% |
| System Total | $1,770 | $1,276 | $494 | 28% |
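The system row is just the component rows summed. A quick sanity check of the totals:

```python
# Per-component figures from the tables above: (accident, defection).
components = {
    "Summarizer":    (130.0, 20.0),
    "Code Deployer": (1030.0, 70.0),
    "Coordinator":   (116.0, 404.0),
}

accident = sum(a for a, _ in components.values())
defection = sum(d for _, d in components.values())
total = accident + defection
print(f"System total: ${total:,.0f}")                               # $1,770
print(f"Accident: ${accident:,.0f}  Defection: ${defection:,.0f}")  # $1,276 / $494
print(f"Defection share: {defection / total:.0%}")                  # 28%
```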
## Implications for Architecture

### Principle: Match Mitigations to Risk Type

For accident-dominated components (most executors):
- Invest in testing and verification
- Add redundancy (N-version programming)
- Improve capabilities where possible
- Monitor for anomalies
For defection-dominated components (coordinators, high-autonomy agents; a selection sketch follows these lists):
- Apply strict architectural constraints
- Limit capabilities to necessary minimum
- Use decomposed coordination
- Implement human oversight at key decisions
- Monitor for deceptive patterns
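A minimal sketch of the matching rule. The simple majority threshold and the exact mitigation lists are illustrative choices, drawn from the tables earlier on this page:

```python
ACCIDENT_MITIGATIONS = [
    "testing and verification", "redundancy (N-version programming)",
    "capability improvements", "anomaly monitoring",
]
DEFECTION_MITIGATIONS = [
    "strict architectural constraints", "minimal necessary capabilities",
    "decomposed coordination", "human oversight at key decisions",
    "monitoring for deceptive patterns",
]

def recommended_mitigations(accident_risk: float, defection_risk: float) -> list[str]:
    """Emphasize the mitigation family matching the dominant risk type."""
    if defection_risk > accident_risk:
        return DEFECTION_MITIGATIONS
    return ACCIDENT_MITIGATIONS

# The coordinator from the worked example ($116 accident vs. $404 defection):
print(recommended_mitigations(accident_risk=116.0, defection_risk=404.0))
```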
### Principle: Separate Risk Types Architecturally

```mermaid
flowchart TB
    subgraph "High Defection Risk Zone"
        C[Coordinator]
    end
    subgraph "Low Defection Risk Zone"
        E1[Executor 1<br/>Deterministic]
        E2[Executor 2<br/>Narrow Model]
        E3[Executor 3<br/>Verified Code]
    end
    C -->|constrained interface| E1
    C -->|constrained interface| E2
    C -->|constrained interface| E3
    style C fill:#ffcccc
    style E1 fill:#ccffcc
    style E2 fill:#ccffcc
    style E3 fill:#ccffcc
```
By isolating high-defection-risk components and constraining their interfaces, you limit the damage a defecting component can cause.
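One way to implement a constrained interface is to give the coordinator access to executors only through a fixed allowlist of typed operations. A sketch; every name here is hypothetical:

```python
from typing import Callable

# Hypothetical allowlist: the only operations the coordinator may invoke.
# It cannot add operations, escalate permissions, or reach executors directly.
ALLOWED_OPERATIONS: dict[str, Callable[[str], str]] = {
    "summarize": lambda text: text[:200],  # stand-in for a narrow executor
    "lint_code": lambda code: "ok",        # stand-in for verified tooling
}

def constrained_call(operation: str, payload: str) -> str:
    """The single path from the coordinator into the low-defection-risk zone."""
    if operation not in ALLOWED_OPERATIONS:
        raise PermissionError(f"operation {operation!r} is not in the allowlist")
    return ALLOWED_OPERATIONS[operation](payload)

print(constrained_call("summarize", "A long document..."))
```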
## When to Use Each Term

| Context | Use |
|---|---|
| General risk discussions | "Delegation Risk" |
| Discussing bugs, errors, testing | "Accident Risk" |
| Discussing alignment, scheming, deception | "Defection Risk" |
| Coordinator constraints | Emphasize "Defection Risk" |
| Verification and testing | Emphasize "Accident Risk" |
| Capability scaling analysis | Distinguish both |
## The Uncomfortable Truth

For current AI systems, most realized harms are accidents. But as systems become more capable and autonomous:
- Accident risk decreases (better at tasks)
- Defection risk may increase (better at deception)
- Defection becomes harder to detect (smarter adversary)
- Stakes increase (more capability = more damage potential)
This means the ratio of defection to accident risk likely increases with capability, even if absolute accident risk decreases. Safety architectures must anticipate this shift.
## See Also

- Delegation Risk Overview — Core framework and calculations
- Coordinator Constraints — Managing defection risk in coordinators
- Safety Mechanisms — Mitigations for both risk types
- Glossary — Formal definitions
## Further Reading

- Deceptive Alignment — Hubinger et al. on scheming risks
- AI Control — Greenblatt et al. on controlling potentially misaligned AI
- Failure Modes in ML — Amodei et al.’s taxonomy of AI safety problems