For Engineers

You want to build safe AI systems. Here’s the fastest path to implementation.


If you trust the patterns and just want to implement:

  1. Quick Reference — Core formulas and decision flowchart (5 min)
  2. Quick Start Guide — Step-by-step implementation checklist (10 min)
  3. Minimal Framework — The 5 essential patterns (5 min)

You can stop here and have a working safety layer.


If you want to understand why the patterns work:

  1. Five-Minute Intro — Core problem (5 min)
  2. Core Concepts — Visual framework (20 min)
  3. Delegation Risk Overview — Quantitative foundation (30 min)
  4. Design Patterns Index — Pattern catalog (15 min)
  5. Composing Patterns — Pattern combinations (20 min)
  6. Pick your patterns — Based on your threat model (20 min)

| Pattern | What It Does | When to Use |
| --- | --- | --- |
| Least-X Principles | Minimize permissions, context, autonomy | Always (default) |
| Structural Patterns | Bulkheads, firewalls, blast radius containment | Multi-component systems |
| Decomposed Coordination | Separation of powers, rotating validators | High-stakes decisions |

| Pattern | What It Does | When to Use |
| --- | --- | --- |
| Verification Patterns | Ghost checker, consistency triangulation | Output validation |
| Multi-Agent Verification | Cross-validation, adversarial collaboration | AI-on-AI verification |
| Monitoring Patterns | Tripwires, anomaly detection | Runtime safety |

| Pattern | What It Does | When to Use |
| --- | --- | --- |
| Recovery Patterns | Checkpoint-rollback, graceful degradation | Any production system |
| Safety Mechanisms | Kill switches, capability sunset | High-autonomy systems |

[ ] List all AI components in your system
[ ] For each component, identify:
- What data can it access?
- What actions can it take?
- What's the worst-case outcome?
[ ] Calculate a rough risk budget (see Quick Reference; a sketch of the arithmetic follows this checklist)
[ ] Classify as Low / Medium / High risk
[ ] Implement Least Privilege
- Restrict file access to workspace only
- Remove unnecessary API permissions
- Use read-only where possible
[ ] Implement Human Escalation (threshold-check sketch below the checklist)
- Define escalation thresholds ($, sensitivity, confidence)
- Build approval workflow
- Test the escalation path
[ ] Implement Output Filtering (filter sketch below the checklist)
- PII detection
- Policy violation checks
- Harmful content filtering
[ ] Implement Sandboxing
- Containerized execution
- Network isolation
- Resource limits
[ ] Implement Audit Logging (logger sketch below the checklist)
- All inputs and outputs
- Decision traces
- Timestamps and versions
[ ] Add verification layer (if high-risk)
[ ] Set up monitoring and alerting
[ ] Create incident response runbook
[ ] Schedule quarterly risk reviews
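
A minimal sketch of the risk-budget arithmetic from the inventory items above, assuming an expected-harm style estimate (probability × worst-case cost × volume). The `Component` fields, numbers, and Low / Medium / High thresholds are illustrative placeholders, not the framework's definitions; use the Quick Reference formulas for real budgets.

```python
# Illustrative sketch only: expected-harm style risk estimate per component.
# The probabilities, harm figures, and thresholds below are made-up assumptions.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    failure_prob: float      # rough chance of a harmful failure per task
    worst_case_harm: float   # rough cost of that failure, in dollars
    tasks_per_month: int

def monthly_risk(c: Component) -> float:
    """Expected harm per month = p(failure) * harm * volume."""
    return c.failure_prob * c.worst_case_harm * c.tasks_per_month

def classify(risk: float) -> str:
    # Thresholds are placeholders; set them against your own risk budget.
    if risk < 1_000:
        return "Low"
    if risk < 25_000:
        return "Medium"
    return "High"

email_drafter = Component("email-drafter", failure_prob=0.01,
                          worst_case_harm=500, tasks_per_month=2_000)
risk = monthly_risk(email_drafter)               # 0.01 * 500 * 2000 = 10,000
print(email_drafter.name, risk, classify(risk))  # Medium
```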
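
For the Human Escalation item, a hedged sketch of a threshold gate. The dollar limit, sensitivity tags, confidence floor, and the `run` / `escalate` hooks are all assumptions to adapt to your own approval workflow.

```python
# Hypothetical escalation gate: route an AI-proposed action to a human
# when it crosses a dollar, sensitivity, or confidence threshold.
DOLLAR_LIMIT = 500.0                       # placeholder threshold
SENSITIVE_TAGS = {"medical", "legal", "pii"}
MIN_CONFIDENCE = 0.8

def needs_human(action: dict) -> bool:
    if action.get("amount_usd", 0.0) > DOLLAR_LIMIT:
        return True
    if SENSITIVE_TAGS & set(action.get("tags", [])):
        return True
    if action.get("model_confidence", 1.0) < MIN_CONFIDENCE:
        return True
    return False

def execute_or_escalate(action: dict, run, escalate):
    """`run` performs the action; `escalate` queues it for human approval."""
    return escalate(action) if needs_human(action) else run(action)
```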
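
For the Output Filtering item, a toy regex-based PII redactor that shows the shape of the check. The patterns are deliberately simplistic; a production filter should use a dedicated PII or content-safety service.

```python
# Toy output filter: redact obvious PII before the response leaves the
# system. The regexes are intentionally simple and will miss real cases.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def filter_output(text: str) -> tuple[str, list[str]]:
    """Return (redacted_text, list_of_violations)."""
    violations = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            violations.append(label)
            text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text, violations
```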
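
For the Audit Logging item, a minimal append-only JSON-lines logger. The field names and file path are assumptions, not a required schema.

```python
# Minimal audit logger sketch: append one JSON line per event with inputs,
# outputs, decision trace, timestamp, and model version.
import json
import time
import uuid

def audit_log(event: dict, path: str = "audit.jsonl") -> str:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        # e.g. {"input": ..., "output": ..., "decision_trace": ..., "model_version": ...}
        **event,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return record["id"]
```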

The framework includes worked examples you can reference:

| Example | Type | Key Patterns |
| --- | --- | --- |
| Research Assistant | Multi-step task | Escalation, checkpoints |
| Trading System | High-stakes, autonomous | Risk budgets, circuit breakers |
| Healthcare Bot | Safety-critical | Output filtering, human review |
| Code Deployment | Infrastructure | Sandboxing, rollback |

| If You Use | Integration Point |
| --- | --- |
| LangChain | Add verification as a chain step; use callbacks for monitoring |
| OpenAI Functions | Wrap function calls with permission checks |
| Anthropic Claude | Use system prompts for least-context; implement tool restrictions |
| AutoGPT-style agents | Add checkpoints between action cycles; implement kill switches |
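
One framework-agnostic way to read the "wrap function calls with permission checks" row: gate every tool call behind an explicit allow-list before the model-requested function runs. The tool names, `guarded_call` helper, and audit hook are illustrative assumptions.

```python
# Framework-agnostic sketch: least privilege applied to tool/function calls.
from typing import Any, Callable

ALLOWED_TOOLS = {"search_docs", "read_workspace_file"}  # explicit allow-list

def guarded_call(name: str, fn: Callable[..., Any], audit_log, **kwargs) -> Any:
    """Run a model-requested tool only if it is on the allow-list; log either way."""
    if name not in ALLOWED_TOOLS:
        audit_log({"event": "blocked_tool_call", "tool": name, "args": kwargs})
        raise PermissionError(f"tool '{name}' is not on the allow-list")
    audit_log({"event": "tool_call", "tool": name, "args": kwargs})
    return fn(**kwargs)
```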
Suggestion: Add safety layer tests to your pipeline
- Unit tests for output filters
- Integration tests for escalation paths
- Chaos testing for failure modes
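
A sketch of what the "unit tests for output filters" item could look like with pytest, assuming a `filter_output(text) -> (redacted, violations)` function like the one sketched after the checklist; the module path is a hypothetical placeholder.

```python
# Example pytest cases for an output filter.
from my_safety_layer import filter_output  # hypothetical module path

def test_redacts_email():
    redacted, violations = filter_output("Contact me at jane@example.com")
    assert "jane@example.com" not in redacted
    assert "email" in violations

def test_clean_text_passes_through():
    text = "The quarterly report is attached."
    redacted, violations = filter_output(text)
    assert redacted == text
    assert violations == []
```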

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Too many escalations | Thresholds too tight | Adjust based on false positive rate |
| Missed harmful outputs | Filter gaps | Review missed cases, expand rules |
| Slow responses | Heavy verification | Add caching, async verification |
| Verification false positives | Overfitting to training data | Diversify verifier inputs |
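
For the "too many escalations" row, one rough way to adjust thresholds from the observed false-positive rate (escalations a reviewer approves unchanged). The target rate and adjustment step are arbitrary assumptions.

```python
# Rough threshold-tuning sketch: if most escalations are approved without
# changes, the threshold is probably too tight. Numbers are placeholders.
def tune_dollar_limit(current_limit: float, escalations: list[dict],
                      target_fp_rate: float = 0.3, step: float = 1.25) -> float:
    if not escalations:
        return current_limit
    false_positives = sum(1 for e in escalations if e["approved_unchanged"])
    fp_rate = false_positives / len(escalations)
    # Loosen the limit when reviewers almost always approve as-is.
    return current_limit * step if fp_rate > target_fp_rate else current_limit
```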

What Engineers Often Skip (But Shouldn’t)

  1. Entanglements — If you use the same model in multiple places, failures correlate. Your 3 nominally independent 90% verifiers might only give you 92% coverage, not 99.9% (see the arithmetic after this list).

  2. Insurer’s Dilemma — If you set the risk budget but your users bear the harm, you’ll systematically tolerate more risk than they would accept.

  3. Capability Probing — Systems can hide capabilities. Periodic probing detects capability drift before it causes harm.
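
The 92% vs 99.9% gap in the Entanglements point is just the independence assumption failing; here is the arithmetic, with an illustrative shared-blind-spot model chosen to reproduce the 92% figure.

```python
# Arithmetic behind "92%, not 99.9%". The 8% shared blind spot is an
# illustrative assumption: bad outputs that all same-model verifiers miss together.
per_verifier_miss = 0.10

# If misses were independent, all three fail together only 0.1^3 of the time.
independent_coverage = 1 - per_verifier_miss ** 3                      # 99.9%

# If the verifiers share a base model, correlated misses dominate.
shared_blind_spot = 0.08
residual_miss = (per_verifier_miss - shared_blind_spot) / (1 - shared_blind_spot)
all_miss = shared_blind_spot + (1 - shared_blind_spot) * residual_miss ** 3
entangled_coverage = 1 - all_miss                                      # ~92.0%

print(f"independent: {independent_coverage:.1%}  entangled: {entangled_coverage:.1%}")
```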