Quick Start

A practical checklist for applying the AI Safety Framework to a new system. Complete these steps in order.

Phase 1: System Definition and Risk Budget

  • What is the system’s primary task?
  • What actions can the system take?
  • What data does it have access to?
  • Who are the human principals (users, operators, admins)?
  • What’s the worst realistic outcome?
  • What damage would that cause? ($ or equivalent)
  • What’s acceptable expected damage per month? → This is your Delegation Risk budget

Typical budgets:

| System Type | Suggested Delegation Risk Budget |
| --- | --- |
| Internal tools, recoverable | $500-2,000/month |
| Customer-facing, monitored | $2,000-10,000/month |
| High-stakes, autonomous | $10,000-50,000/month |
| Safety-critical | Consult specialists |

Phase 2: Component Risk Estimation

List every component that will exist:

  • What does each component do?
  • What inputs does it receive?
  • What outputs does it produce?
  • What actions can it take?

For each component, complete this table:

| Component | Failure Mode | P(failure) | Damage | Delegation Risk |
| --- | --- | --- | --- | --- |
| name | what could go wrong | estimate | $ | P × D |

Guidance for estimates:

  • P = 0.1 (10%): Happens regularly without mitigation
  • P = 0.01 (1%): Occasional, seen in testing
  • P = 0.001 (0.1%): Rare, but plausible
  • P = 0.0001 (0.01%): Very rare, requires adversarial action

Then sum and compare:

  • Total component Delegation Risks: $______
  • Budget: $______
  • Within budget? Yes / No

If no, proceed to Phase 3 for mitigation; if yes, skip ahead to Phase 4. A minimal sketch of this computation appears below.
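
A minimal Python sketch of the Phase 2 arithmetic. The component names, probabilities, and damage figures are hypothetical placeholders, not values prescribed by the framework:

```python
# Delegation Risk per component: DR = P(failure) x Damage (dollars/month).
# All names and numbers below are hypothetical placeholders.

BUDGET = 1_000  # monthly Delegation Risk budget in dollars

components = {
    # name: (P(failure), damage in dollars)
    "retriever": (0.05, 100),
    "answerer": (0.10, 500),
    "poster": (0.001, 1_000),
}

def delegation_risk(p_failure: float, damage: float) -> float:
    """Expected damage per month for one component."""
    return p_failure * damage

total = sum(delegation_risk(p, d) for p, d in components.values())
for name, (p, d) in components.items():
    print(f"{name}: P={p}, D=${d} -> DR=${delegation_risk(p, d):.2f}")
print(f"total DR ${total:.2f} vs budget ${BUDGET}: "
      f"{'within budget' if total <= BUDGET else 'over budget -> Phase 3'}")
```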


Phase 3: Risk Mitigation (iterate until within budget)

3.1 Identify Highest-Delegation Risk Component

Which component contributes the most to total Delegation Risk? Focus there first.

3.2 Apply a Mitigation Strategy

| Strategy | Effect | Example |
| --- | --- | --- |
| Add verification | Reduce P(failure) | Code checker before deploy |
| Reduce permissions | Reduce Damage | Read-only instead of read-write |
| Add redundancy | Reduce P(failure) | Two models must agree |
| Add human gate | Reduce P(failure) | Require approval for high-stakes actions |
| Decompose | Reduce Damage per component | Split into smaller pieces |

  • Update Delegation Risk estimates with the mitigation applied
  • Still over budget? → Mitigate the next-highest component
  • Within budget? → Proceed to Phase 4 (a sketch of this loop follows)
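
A minimal sketch of this iteration loop, with hypothetical components and a placeholder mitigation (it assumes added verification halves P(failure); in practice you would re-estimate after applying a strategy from the table):

```python
# Phase 3 loop: mitigate the highest-DR component until within budget.
# All names and numbers are hypothetical placeholders.

BUDGET = 1_000  # dollars per month

# name: [P(failure), damage]; mitigation updates the estimates in place
risks = {"answerer": [0.10, 20_000], "poster": [0.01, 2_000]}

def total_dr() -> float:
    return sum(p * d for p, d in risks.values())

while total_dr() > BUDGET:
    # 3.1: focus on the component contributing most to total DR
    worst = max(risks, key=lambda name: risks[name][0] * risks[name][1])
    # 3.2 (placeholder): assume a verification step halves P(failure)
    risks[worst][0] /= 2
    print(f"mitigated {worst}: DR now ${risks[worst][0] * risks[worst][1]:.2f}")

print(f"total DR ${total_dr():.2f} is within budget ${BUDGET}")
```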

Phase 4: Apply the Principles

4.1 Principle Checklist

For each component, verify:

| Principle | Question | Answer |
| --- | --- | --- |
| Least Intelligence | Is this the dumbest implementation that works? | |
| Least Privilege | Does it have only necessary permissions? | |
| Least Context | Does it know only what it needs? | |
| Least Persistence | Is it stateless between invocations? | |
| Least Autonomy | Is human oversight proportionate to risk? | |
| Least Surprise | Is behavior deterministic/predictable? | |
| Least Connectivity | Are communication paths minimized? | |

4.2 Coordinator Constraints (if applicable)

If your system has a coordinator component, verify the following (a routing sketch follows this list):

  • Does it route rather than directly cause outcomes?
  • Does it work through intermediaries?
  • Is communication structured (not natural language)?
  • Is planning horizon limited?
  • Is it stateless between invocations?
  • Is there redundancy or verification?
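
A minimal sketch of a coordinator that satisfies these constraints: stateless, routing structured messages rather than free-form natural language, and acting only through intermediaries. The message kinds and workers are hypothetical:

```python
# Stateless coordinator routing typed messages to placeholder workers.

from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Task:
    """Structured message -- no free-form natural language between components."""
    kind: Literal["retrieve", "answer", "post"]
    payload: str

def route(task: Task) -> str:
    """Dispatch only: the coordinator keeps no state between invocations
    and never causes outcomes directly."""
    handlers = {
        "retrieve": lambda p: f"docs for {p!r}",  # placeholder worker
        "answer": lambda p: f"answer to {p!r}",   # placeholder worker
        "post": lambda p: f"posted {p!r}",        # placeholder worker
    }
    return handlers[task.kind](task.payload)

print(route(Task(kind="answer", payload="vacation policy")))
```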

Phase 5: Implementation Choices (2-4 hours)

5.1 Choose Implementation for Each Component

Use the Decision Guide for details.

| Component | Implementation | Justification |
| --- | --- | --- |
| name | Code / Narrow Model / General LLM | why |

Rule of thumb: default to verified code or narrow models; use a general LLM only when the task genuinely requires creativity or flexibility. A sketch of this rule as a decision function follows.
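
As a sketch, the rule of thumb reduces to a small decision function; the two predicates are a hypothetical simplification of the full Decision Guide:

```python
# Prefer the least-capable implementation that can do the job.

def choose_implementation(verifiable: bool, needs_flexibility: bool) -> str:
    if verifiable:
        return "verified code"          # deterministic, testable logic
    if not needs_flexibility:
        return "narrow model"           # fixed task, bounded outputs
    return "general LLM (constrained)"  # only when flexibility is required

print(choose_implementation(verifiable=True, needs_flexibility=False))
```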

5.2 Redundancy and Human Gates

For safety-critical components:

  • Which components need N-version redundancy? (see the sketch after this list)
  • Are implementations diverse (different architectures)?
  • What agreement threshold is required (unanimous or majority)?
  • Which actions require human approval?
  • Is approval synchronous, asynchronous, or batched?
  • What information does the human see in order to decide?
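
A minimal sketch of N-version redundancy with a majority threshold. The model callables are hypothetical stand-ins for diverse implementations; disagreement escalates to a human:

```python
# Act only if at least `threshold` of N diverse implementations agree.

from collections import Counter
from typing import Callable, Optional

def majority_vote(implementations: list[Callable[[str], str]],
                  query: str, threshold: int) -> Optional[str]:
    answers = [impl(query) for impl in implementations]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= threshold else None  # None -> human gate

# Three trivially fake implementations standing in for diverse models
models = [lambda q: "42", lambda q: "42", lambda q: "41"]
result = majority_vote(models, "meaning of life?", threshold=2)
print(result or "no agreement -> route to human approval")
```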

Phase 6: Monitoring and Rollback (1-2 hours)

For each component, define monitoring:

  • What metrics indicate problems?
  • What thresholds trigger alerts?
  • What thresholds trigger automatic shutdown? (a circuit-breaker sketch follows these lists)

Plan rollback:

  • How do you undo actions if problems are detected?
  • What is the maximum time to detect and roll back?
  • Are there actions that can’t be rolled back? → These need extra scrutiny

Plan validation:

  • How will you test that Delegation Risk estimates are accurate?
  • What data will you collect in production?
  • When will you re-assess the risk analysis?
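
A minimal circuit-breaker sketch for the shutdown thresholds above; the metric and threshold values are hypothetical:

```python
# Map a production metric to an action. "shutdown" trips the breaker:
# the component stops acting until a human re-enables it.

ALERT_THRESHOLD = 0.02     # hypothetical hallucination rate that pages a human
SHUTDOWN_THRESHOLD = 0.10  # hypothetical rate that triggers automatic shutdown

def check(metric: float) -> str:
    if metric >= SHUTDOWN_THRESHOLD:
        return "shutdown"
    if metric >= ALERT_THRESHOLD:
        return "alert"
    return "ok"

for rate in (0.01, 0.05, 0.20):
    print(f"hallucination rate {rate:.2f} -> {check(rate)}")
```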

Document:

  • System architecture diagram
  • Component Delegation Risk budgets
  • Trust relationships (who delegates to whom)
  • Human approval gates
  • Monitoring thresholds

Before deployment:

  • Total Delegation Risk within budget
  • All principles applied to each component
  • Coordinator constraints verified (if applicable)
  • Redundancy in place for critical components
  • Human gates at appropriate points
  • Monitoring and alerting configured
  • Rollback procedures tested
  • Documentation complete

Quick Reference

Delegation Risk = Σ P(outcome) × Damage(outcome)
Trust(A→C via B) = Trust(A→B) × Trust(B→C)
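
For example, assuming illustrative trust values (the numbers are hypothetical, not from the framework): if A trusts B at 0.9 and B trusts C at 0.8, then

Trust(A→C via B) = 0.9 × 0.8 = 0.72

so trust attenuates through every intermediary.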

Preferred implementation order:

Verified Code > Regular Code > Narrow Models > Constrained LLMs > Base LLMs > RL

Six questions to ask of any component:
  1. What’s the worst that could happen?
  2. How likely is it?
  3. Can I make it less likely (verification, redundancy)?
  4. Can I make it less bad (reduce permissions, add gates)?
  5. How will I know if it’s happening (monitoring)?
  6. How will I stop it (rollback, circuit breakers)?


Worked Example

System: Slack bot that answers employee questions using internal docs

Phase 1 (5 min):

  • Task: Answer questions from internal knowledge base
  • Actions: Read docs, post messages
  • Budget: $1,000/month (internal tool, recoverable)

Phase 2 (10 min):

  • Components: Retriever, Answerer, Poster
  • Failure modes:
    • Retriever returns wrong docs: P=0.05, D=$100 → DR=$5
    • Answerer hallucinates: P=0.1, D=$500 → DR=$50
    • Answerer leaks sensitive info: P=0.01, D=$10,000 → DR=$100
    • Poster spam: P=0.001, D=$1,000 → DR=$1
  • Total: $156/month ✓ Within budget

Phase 4 (10 min):

  • Least Intelligence: Answerer could be smaller model ✓
  • Least Context: Answerer doesn’t see user identity ✓
  • Least Persistence: Stateless ✓

Phase 5 (5 min):

  • Retriever: Code (vector search)
  • Answerer: Fine-tuned 7B (narrow task)
  • Poster: Code (rate limited)

Result: Simple, low-risk system. No redundancy needed. Monitor for hallucination rate.