Quick Start: Applying the Framework
A practical checklist for applying the AI Safety Framework to a new system. Complete these steps in order.
Phase 1: Scope and Budget (1-2 hours)
1.1 Define System Boundaries
- What is the system’s primary task?
- What actions can the system take?
- What data does it have access to?
- Who are the human principals (users, operators, admins)?
1.2 Set Risk Tolerance
- What’s the worst realistic outcome?
- What damage would that cause? ($ or equivalent)
- What expected damage per month is acceptable? → This is your Delegation Risk budget (a quick sanity check follows the table below)
Typical budgets:
| System Type | Suggested Delegation Risk Budget |
|---|---|
| Internal tools, recoverable | $500-2,000/month |
| Customer-facing, monitored | $2,000-10,000/month |
| High-stakes, autonomous | $10,000-50,000/month |
| Safety-critical | Consult specialists |
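One hedged way to sanity-check a budget choice (illustrative numbers, not part of the framework) is to express it as worst-case damage times the incident rate you could tolerate:

```python
# Illustrative sanity check for a Delegation Risk budget. Numbers are assumed.
worst_case_damage = 50_000         # $: worst realistic outcome for this system
tolerable_incident_rate = 1 / 50   # at most ~one such incident per 50 months

delegation_risk_budget = worst_case_damage * tolerable_incident_rate
print(f"Budget ≈ ${delegation_risk_budget:,.0f}/month")   # ≈ $1,000/month
```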
1.3 Identify Components
List every component that will exist:
- What does each component do?
- What inputs does it receive?
- What outputs does it produce?
- What actions can it take?
Phase 2: Risk Assessment (2-4 hours)
2.1 Failure Mode Analysis
For each component, complete this table:
| Component | Failure Mode | P(failure) | Damage | Delegation Risk |
|---|---|---|---|---|
| name | what could go wrong | estimate | $ | P × D |
Guidance for estimates:
- P = 0.1 (10%): Happens regularly without mitigation
- P = 0.01 (1%): Occasional, seen in testing
- P = 0.001 (0.1%): Rare, but plausible
- P = 0.0001 (0.01%): Very rare, requires adversarial action
2.2 Sum and Compare
- Total component Delegation Risks: $______
- Budget: $______
- Within budget? Yes / No
If no: proceed to Phase 3 for mitigation. If yes: proceed to Phase 4.
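A minimal sketch of 2.1 and 2.2 in code, assuming each failure mode is expressed as a (probability, damage) pair; the component names, numbers, and budget below are placeholders, not recommendations:

```python
# Sketch: failure modes as (component, description, P(failure), damage in $).
# All values are illustrative placeholders.
failure_modes = [
    ("retriever", "returns wrong documents",        0.01,  500),
    ("planner",   "issues an unsafe action",        0.001, 20_000),
    ("executor",  "applies action to wrong target", 0.005, 2_000),
]

budget = 2_000  # $/month, from Phase 1.2

rows = [(c, fm, p, d, p * d) for c, fm, p, d in failure_modes]   # DR = P × D
total = sum(risk for *_, risk in rows)

for component, description, p, d, risk in rows:
    print(f"{component:10s} {description:32s} P={p:<7g} D=${d:<8,} DR=${risk:,.2f}")

print(f"Total Delegation Risk: ${total:,.2f}/month vs. budget ${budget:,}/month")
print("Within budget: proceed to Phase 4" if total <= budget
      else "Over budget: proceed to Phase 3")
```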
Phase 3: Risk Mitigation (iterate until within budget)
3.1 Identify Highest-Delegation Risk Component
Which component contributes most to total Delegation Risk? → Focus here first
3.2 Choose Mitigation Strategy
| Strategy | Effect | Example |
|---|---|---|
| Add verification | Reduce P(failure) | Code checker before deploy |
| Reduce permissions | Reduce Damage | Read-only instead of read-write |
| Add redundancy | Reduce P(failure) | Two models must agree |
| Add human gate | Reduce P(failure) | Require approval for high-stakes actions |
| Decompose | Reduce Damage per component | Split into smaller pieces |
3.3 Re-estimate and Repeat
- Update the Delegation Risk estimates with the mitigation applied
- Still over budget? → Mitigate next-highest component
- Within budget? → Proceed to Phase 4
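Sketched as code, the Phase 3 loop looks roughly like this; the `mitigate` step is a stand-in, since in practice each pass is a design change from the 3.2 table followed by re-estimating P and Damage:

```python
# Sketch of the Phase 3 iteration: mitigate the largest contributor first,
# re-estimate, and repeat until the total fits the budget. Numbers are invented.
def mitigate(component: str, risks: dict[str, float]) -> dict[str, float]:
    # Placeholder for a real mitigation (verification, reduced permissions,
    # redundancy, human gate, decomposition) plus a re-estimate of P and Damage.
    return {**risks, component: risks[component] * 0.2}  # assumed 5x reduction

risks = {"retriever": 5.0, "planner": 1_500.0, "executor": 800.0}  # $/month
budget = 2_000.0

while sum(risks.values()) > budget:
    worst = max(risks, key=risks.get)   # 3.1: highest-Delegation-Risk component
    risks = mitigate(worst, risks)      # 3.2: apply a mitigation strategy
    # 3.3: the loop condition re-checks the updated total against the budget

print(risks, f"total=${sum(risks.values()):,.2f}/month")
```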
Phase 4: Apply Principles (1-2 hours)
4.1 Component Principles Checklist
For each component, verify:
| Principle | Question | Answer |
|---|---|---|
| Least Intelligence | Is this the dumbest implementation that works? | |
| Least Privilege | Does it have only necessary permissions? | |
| Least Context | Does it know only what it needs? | |
| Least Persistence | Is it stateless between invocations? | |
| Least Autonomy | Is human oversight proportionate to risk? | |
| Least Surprise | Is behavior deterministic/predictable? | |
| Least Connectivity | Are communication paths minimized? | |
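If it helps to track these answers alongside the code, a minimal sketch of the checklist as a record (field names are illustrative, not part of the framework):

```python
# Sketch: the 4.1 checklist per component, so unverified principles are easy to
# surface during review. Field names are illustrative.
from dataclasses import dataclass, fields

@dataclass
class PrincipleChecklist:
    component: str
    least_intelligence: bool = False   # dumbest implementation that works?
    least_privilege: bool = False      # only necessary permissions?
    least_context: bool = False        # knows only what it needs?
    least_persistence: bool = False    # stateless between invocations?
    least_autonomy: bool = False       # oversight proportionate to risk?
    least_surprise: bool = False       # deterministic/predictable behavior?
    least_connectivity: bool = False   # communication paths minimized?

    def open_items(self) -> list[str]:
        return [f.name for f in fields(self)
                if f.name != "component" and not getattr(self, f.name)]

answerer = PrincipleChecklist("answerer", least_context=True, least_persistence=True)
print(answerer.open_items())  # principles still unverified for this component
```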
4.2 Coordinator Constraints (if applicable)
If your system has a coordinator component:
- Does it route rather than directly cause outcomes?
- Does it work through intermediaries?
- Is communication structured (not natural language)?
- Is planning horizon limited?
- Is it stateless between invocations?
- Is there redundancy or verification?
Phase 5: Implementation Choices (2-4 hours)
5.1 Choose Implementation for Each Component
Use the Decision Guide for details.
| Component | Implementation | Justification |
|---|---|---|
| name | Code / Narrow Model / General LLM | why |
Rule of thumb: Default to verified code or narrow models. Only use general LLMs when creativity/flexibility is required.
5.2 Design Redundancy
For safety-critical components:
- Which components need N-version redundancy?
- Are implementations diverse (different architectures)?
- What agreement threshold (unanimous, majority)?
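A minimal sketch of the agreement check for N-version redundancy; the unanimous-versus-majority threshold is exactly the design choice above, and the function name is illustrative:

```python
# Sketch: N-version agreement gate. `outputs` are answers from diverse
# implementations; the action proceeds only if enough of them agree.
from collections import Counter

def agreed_output(outputs: list[str], threshold: str = "majority") -> str | None:
    """Return the agreed value, or None to escalate or refuse."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    if threshold == "unanimous":
        return value if count == len(outputs) else None
    return value if count > len(outputs) / 2 else None  # strict majority

print(agreed_output(["approve", "approve", "reject"]))             # "approve"
print(agreed_output(["approve", "reject"], threshold="unanimous")) # None → escalate
```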
5.3 Place Human Gates
- Which actions require human approval?
- Sync, async, or batch approval?
- What information does the human see to decide?
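A hedged sketch of a synchronous gate; the threshold, the fields shown to the reviewer, and the approval channel are placeholders to be designed per system:

```python
# Sketch: a synchronous human approval gate. High-stakes or irreversible actions
# are described to a human with enough context to decide; others pass through.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str          # what the system wants to do
    estimated_damage: float   # $ if it goes wrong
    reversible: bool

def human_gate(action: ProposedAction, approval_threshold: float = 1_000.0) -> bool:
    """Return True if the action may proceed."""
    if action.reversible and action.estimated_damage < approval_threshold:
        return True  # low-stakes and reversible: no human needed
    print(f"APPROVAL NEEDED: {action.description} "
          f"(est. damage ${action.estimated_damage:,.0f}, "
          f"{'reversible' if action.reversible else 'NOT reversible'})")
    return input("Approve? [y/N] ").strip().lower() == "y"

if human_gate(ProposedAction("email 500 customers", 5_000, reversible=False)):
    print("proceeding")
```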
Phase 6: Monitoring and Rollback (1-2 hours)
6.1 Define Monitoring
For each component, define:
- What metrics indicate problems?
- What thresholds trigger alerts?
- What thresholds trigger automatic shutdown?
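One way the answers might look as configuration; the metric names and numbers below are invented placeholders:

```python
# Sketch: per-component thresholds. Alert thresholds page a human; shutdown
# thresholds trip a circuit breaker automatically. All values are invented.
MONITORING = {
    "answerer": {
        "hallucination_rate":  {"alert": 0.05, "shutdown": 0.20},  # per sampled answer
        "p95_latency_seconds": {"alert": 5,    "shutdown": 30},
    },
    "poster": {
        "messages_per_minute": {"alert": 10, "shutdown": 60},
    },
}

def check(component: str, metric: str, value: float) -> str:
    limits = MONITORING[component][metric]
    if value >= limits["shutdown"]:
        return "shutdown"   # trip the circuit breaker, begin rollback (6.2)
    if value >= limits["alert"]:
        return "alert"      # page a human
    return "ok"

print(check("answerer", "hallucination_rate", 0.08))  # "alert"
```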
6.2 Design Rollback
- How do you undo actions if problems are detected?
- What’s the maximum time to detect and rollback?
- Are there actions that can’t be rolled back? → Extra scrutiny needed
6.3 Plan Verification
- How will you test whether your Delegation Risk estimates are accurate?
- What data will you collect in production?
- When will you re-assess the risk analysis?
Phase 7: Document and Review (1-2 hours)
7.1 Create Trust Ledger
Document:
- System architecture diagram
- Component Delegation Risk budgets
- Trust relationships (who delegates to whom)
- Human approval gates
- Monitoring thresholds
7.2 Review Checklist
Before deployment, confirm:
- Total Delegation Risk within budget
- All principles applied to each component
- Coordinator constraints verified (if applicable)
- Redundancy in place for critical components
- Human gates at appropriate points
- Monitoring and alerting configured
- Rollback procedures tested
- Documentation complete
Quick Reference Card
Delegation Risk Formula
Delegation Risk = Σ P(outcome) × Damage(outcome)
Risk Inheritance (Multiplicative)
Trust(A→C via B) = Trust(A→B) × Trust(B→C)
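A worked instance with illustrative numbers:

```python
# Illustrative numbers only: trust degrades multiplicatively along the chain.
trust_a_b = 0.99
trust_b_c = 0.95
trust_a_c = trust_a_b * trust_b_c   # Trust(A→C via B)
print(trust_a_c)                    # 0.9405: each extra hop only lowers trust
```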
Verifiability Hierarchy
Verified Code > Regular Code > Narrow Models > Constrained LLMs > Base LLMs > RL
Key Questions
- What’s the worst that could happen?
- How likely is it?
- Can I make it less likely (verification, redundancy)?
- Can I make it less bad (reduce permissions, add gates)?
- How will I know if it’s happening (monitoring)?
- How will I stop it (rollback, circuit breakers)?
Interactive Tools
Use these calculators to run the numbers:
- Delegation Risk Calculator — Input failure modes, get total Delegation Risk
- Risk Inheritance — Compute effective trust through networks
- Tradeoff Frontier — Explore capability vs. safety tradeoffs
Example: 30-Minute Assessment
System: Slack bot that answers employee questions using internal docs
Phase 1 (5 min):
- Task: Answer questions from internal knowledge base
- Actions: Read docs, post messages
- Budget: $1,000/month (internal tool, recoverable)
Phase 2 (10 min):
- Components: Retriever, Answerer, Poster
- Failure modes:
  - Retriever returns wrong docs: P=0.05, D=5
  - Answerer hallucinates: P=0.1, D=50
  - Answerer leaks sensitive info: P=0.01, D=100
  - Poster spam: P=0.001, D=1
- Total: $156/month ✓ Within budget
Phase 4 (10 min):
- Least Intelligence: Answerer could be a smaller model ✓
- Least Context: Answerer doesn’t see user identity ✓
- Least Persistence: Stateless ✓
Phase 5 (5 min):
- Retriever: Code (vector search)
- Answerer: Fine-tuned 7B (narrow task)
- Poster: Code (rate limited)
Result: Simple, low-risk system. No redundancy needed. Monitor the hallucination rate.
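Checking the Phase 2 arithmetic: the four failure modes sum to about $6.25 of expected damage per incident opportunity, so the $156/month total corresponds to roughly 25 such opportunities per month (an assumed usage scale; adjust to your own traffic):

```python
# Re-deriving the example's Phase 2 total. P and D come from the list above;
# `opportunities_per_month` is an assumption implied by the $156/month figure.
failure_modes = {
    "retriever returns wrong docs":  (0.05,  5),
    "answerer hallucinates":         (0.10,  50),
    "answerer leaks sensitive info": (0.01,  100),
    "poster spam":                   (0.001, 1),
}

per_opportunity = sum(p * d for p, d in failure_modes.values())   # ≈ $6.25
opportunities_per_month = 25                                      # assumed volume
monthly_total = per_opportunity * opportunities_per_month         # ≈ $156

print(f"${monthly_total:,.0f}/month vs. $1,000/month budget")     # within budget
```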
Next Steps
- Decision Guide — Detailed decision trees
- Research Assistant Example — Full worked example
- Principles to Practice — Mapping principles to implementations