
Design Patterns

This section catalogs 45 reusable architectural patterns for building safer AI systems. Like the Gang of Four design patterns for software, these are named, documented solutions to recurring problems in AI delegation.

Patterns capture hard-won knowledge:

  • Shared vocabulary: “Use a Voting Tribunal with Rotating Validators” is faster than explaining the architecture
  • Proven solutions: Each pattern addresses real problems in AI deployment
  • Trade-off awareness: Patterns document costs, not just benefits
  • Composability: Patterns combine to create robust architectures

How to architect delegation relationships:

| Pattern | Purpose |
| --- | --- |
| Escalation Ladder | Hierarchical oversight with stakes-based escalation |
| Voting Tribunal | Multiple agents must agree on decisions |
| Graduated Autonomy | Trust increases with demonstrated track record |
| Capability Airlock | Isolate dangerous capabilities behind barriers |
| Task Decomposition Pipeline | Break tasks into verifiable steps |
| Gateway Chokepoint | All critical operations through single point |
| Bulkhead Isolation | Failures can’t cascade across partitions |
| Separation of Powers | No single entity has unchecked authority |
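A Voting Tribunal, for example, reduces to a quorum rule. This is a minimal sketch, not a reference implementation; the function name and the default quorum of three are illustrative assumptions:

```python
from collections import Counter

def tribunal_decision(votes, quorum=3):
    """Voting Tribunal sketch: accept a verdict only when at least
    `quorum` agents independently agree on it; otherwise return None
    so the decision can escalate (e.g. to an Escalation Ladder)."""
    if not votes:
        return None
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict if count >= quorum else None
```

With four agents and a quorum of three, a 3-1 split passes, while a 2-2 split returns `None` and escalates.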

How to check that agents are behaving correctly:

| Pattern | Purpose |
| --- | --- |
| Ghost Checker | Parallel async verification without blocking |
| Consistency Triangulation | Cross-reference for contradictions |
| Honeypot Injection | Traps only schemers trigger |
| Capability Probing | Test for hidden abilities |
| Canary Tokens | Traceable markers reveal unauthorized access |
| Random Audit | Unpredictable deep inspection |
| Rotating Validator | Change who validates to prevent collusion |
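The Rotating Validator idea can be sketched as a deterministic rotation schedule. This is an assumption-laden illustration (the hashing scheme and epoch granularity are choices, not part of the pattern definition):

```python
import hashlib

def pick_validator(agent_id, epoch, validators):
    """Rotating Validator sketch: deterministically reassign which
    validator audits each agent every epoch, so no stable
    agent-validator pair can form and collude. Hash-based so all
    parties can verify the assignment was not hand-picked."""
    digest = hashlib.sha256(f"{agent_id}:{epoch}".encode()).digest()
    return validators[digest[0] % len(validators)]
```

The assignment is reproducible for auditing, yet changes from epoch to epoch.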

How to manage what agents know:

| Pattern | Purpose |
| --- | --- |
| Need-to-Know Compartments | Only task-relevant information |
| Chinese Wall | Impenetrable information barriers |
| Selective Amnesia | Forget sensitive data after use |
| Air Gap | Complete physical/logical separation |
| Data Shelf Life | Information access expires automatically |
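Data Shelf Life and Selective Amnesia combine naturally: expired data is deleted, not merely denied. A minimal sketch with a hypothetical store API (names and TTL handling are assumptions):

```python
import time

class ShelfLifeStore:
    """Data Shelf Life / Selective Amnesia sketch: every record carries
    a time-to-live; expired records are deleted on access rather than
    just hidden, so they cannot be recovered later."""
    def __init__(self):
        self._items = {}

    def put(self, key, value, ttl_s):
        self._items[key] = (value, time.monotonic() + ttl_s)

    def get(self, key):
        value, expires_at = self._items[key]
        if time.monotonic() > expires_at:
            del self._items[key]          # forget, don't just deny
            raise KeyError(f"{key!r} expired")
        return value
```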

How to use time as a control mechanism:

| Pattern | Purpose |
| --- | --- |
| Cooling Off Period | Mandatory delays before high-stakes actions |
| Dead Man’s Switch | Action required to prevent automatic response |
| Trust Decay | Trust erodes without reinforcement |
| Capability Sunset | Permissions automatically expire |
| Rate Limiting Ladder | Adaptive throttling under suspicion |
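Trust Decay, for instance, is often modeled as exponential decay toward zero. A sketch under assumed parameters (the one-week half-life is purely illustrative):

```python
def decayed_trust(trust, elapsed_s, half_life_s=7 * 24 * 3600):
    """Trust Decay sketch: without positive reinforcement, an agent's
    trust score halves every half-life. Reinforcement events would
    reset the clock or raise the score."""
    return trust * 0.5 ** (elapsed_s / half_life_s)
```

After one half-life, a score of 0.8 has eroded to 0.4; the agent must re-earn the difference.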

How to observe agent behavior over time:

| Pattern | Purpose |
| --- | --- |
| Tripwire Mesh | Distributed detection network |
| Behavioral Fingerprinting | Unique behavioral signature per agent |
| Panopticon Effect | Uncertainty about when monitored |
| Black Box Flight Recorder | Complete audit trail for forensics |
| Temporal Behavior Analysis | Detect drift and strategic timing |
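At its simplest, Temporal Behavior Analysis compares recent behavior metrics against a baseline. A minimal z-score-style sketch (the metric, window sizes, and the alert threshold of 3 are assumptions):

```python
from statistics import mean, stdev

def drift_score(baseline, recent):
    """Temporal Behavior Analysis sketch: how many baseline standard
    deviations the recent mean of a behavior metric has drifted from
    the baseline mean. Scores above ~3 warrant deep inspection."""
    spread = stdev(baseline) or 1e-9   # guard against zero variance
    return abs(mean(recent) - mean(baseline)) / spread
```

Real deployments would track many metrics and also look for strategically timed changes, not just drift magnitude.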

How to leverage agent interactions for safety:

| Pattern | Purpose |
| --- | --- |
| Adversarial Collaboration | Opposed agents reveal problems through debate |
| Vouching Network | Agents stake reputation vouching for others |
| Trust Transitivity Limits | Trust doesn’t automatically propagate |
| Reputation Staking | Stake reputation on claims |
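Trust Transitivity Limits can be sketched as damped multiplication along a chain of endorsements (the damping factor of 0.5 is an illustrative assumption):

```python
def transitive_trust(chain, damping=0.5):
    """Trust Transitivity Limits sketch: trust through intermediaries
    is the product of direct trust along the chain, further damped for
    each extra hop, so trust never propagates at full strength."""
    trust = 1.0
    for hop, t in enumerate(chain):
        trust *= t
        if hop > 0:
            trust *= damping   # each additional hop halves what survives
    return trust
```

Two hops of 0.9 trust yield 0.405, not 0.81: an endorsement of an endorsement is worth much less than a direct relationship.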

How to align incentives economically:

| Pattern | Purpose |
| --- | --- |
| Bonding & Slashing | Collateral forfeited for misbehavior |
| Bounty System | Rewards for finding problems |
| Insurance Requirement | Risk transferred to expert assessors |
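The core of Bonding & Slashing fits in a few lines. A hypothetical sketch (class and method names are not from any specific system):

```python
class Bond:
    """Bonding & Slashing sketch: an agent posts collateral up front;
    proven misbehavior forfeits a fraction of it, giving the agent
    direct skin in the game."""
    def __init__(self, amount):
        self.amount = amount

    def slash(self, fraction):
        penalty = self.amount * fraction
        self.amount -= penalty
        return penalty
```

The design work lives elsewhere: who adjudicates misbehavior, and how fractions scale with severity.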

How to handle failures gracefully:

| Pattern | Purpose |
| --- | --- |
| Graceful Degradation Ladder | Progressive capability reduction |
| Checkpoint-Rollback | Revert to known-good states |
| Fallback Component Chain | Pre-configured alternatives |
| Blast Radius Containment | Limit damage propagation |
| Self-Healing Loop | Automatic detection and correction |
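Checkpoint-Rollback is the most mechanical of these. A minimal in-memory sketch, assuming state is deep-copyable (real systems checkpoint to durable storage):

```python
import copy

class Checkpointed:
    """Checkpoint-Rollback sketch: snapshot state before each risky
    step and restore the last known-good snapshot when a step fails."""
    def __init__(self, state):
        self.state = state
        self._snapshots = []

    def checkpoint(self):
        self._snapshots.append(copy.deepcopy(self.state))

    def rollback(self):
        self.state = self._snapshots.pop()
```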

How to prevent unauthorized influence between components:

| Pattern | Purpose |
| --- | --- |
| Side-Channel Detection | Monitor for unauthorized influence paths |
| Boundary Enforcement | Ensure components only interact through sanctioned channels |
| Influence Isolation | Architectural separation preventing side-deals |

Related: Entanglements addresses passive correlations (shared blind spots), while Channel Integrity addresses active boundary violations (side-deals, capture, collusion).
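Boundary Enforcement can be sketched as an allowlist of channels with mandatory audit logging. The component names below are hypothetical:

```python
SANCTIONED_CHANNELS = {("planner", "executor"), ("executor", "verifier")}

def send(src, dst, message, audit_log):
    """Boundary Enforcement sketch: refuse any message that does not
    travel over a sanctioned channel, and record the attempt either
    way so Side-Channel Detection has something to work with."""
    if (src, dst) not in SANCTIONED_CHANNELS:
        audit_log.append(("blocked", src, dst))
        raise PermissionError(f"no sanctioned channel {src} -> {dst}")
    audit_log.append(("delivered", src, dst))
    return message
```

Note that blocked attempts are logged, not silently dropped: repeated blocked attempts are themselves a signal.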


Each pattern is documented with:

| Section | Description |
| --- | --- |
| Intent | What problem does this solve? |
| Motivation | When and why would you use this? |
| Structure | Architecture diagram or pseudocode |
| Implementation | Practical code examples |
| Consequences | Benefits, costs, and risks |
| Related Patterns | What combines well or conflicts? |

| Primary Threat | Recommended Patterns |
| --- | --- |
| Capability errors | Escalation Ladder, Ghost Checker, Checkpoint-Rollback |
| Adversarial inputs | Gateway Chokepoint, Tripwire Mesh, Semantic Firewall |
| AI scheming | Honeypot Injection, Consistency Triangulation, Capability Probing |
| Cascading failures | Bulkhead Isolation, Blast Radius Containment, Graceful Degradation |
| Trust miscalibration | Graduated Autonomy, Trust Decay, Capability Sunset |
| Collusion/Capture | Channel Integrity, Rotating Validator, Adversarial Collaboration, Separation of Powers |
| Resource acquisition | Capability Airlock, Rate Limiting Ladder, Bonding & Slashing |

| System Type | Pattern Focus |
| --- | --- |
| High-throughput | Ghost Checker (async), Random Audit (sampling), Fallback Chain |
| Low-latency critical | Pre-computed fallbacks, Bulkhead Isolation, Graceful Degradation |
| High-stakes decisions | Voting Tribunal, Cooling Off Period, Escalation Ladder |
| Long-running tasks | Checkpoint-Rollback, Dead Man’s Switch, Task Decomposition |
| Multi-agent | Adversarial Collaboration, Vouching Network, Rotating Validator |

| Phase | Key Patterns |
| --- | --- |
| Pre-deployment | Capability Probing, Behavioral Fingerprinting |
| Initial operation | Graduated Autonomy (low level), Maximum monitoring |
| Established trust | Trust Decay, Capability Sunset, Random Audit |
| Incident response | Graceful Degradation, Checkpoint-Rollback, Self-Healing |

Patterns are most powerful in combination. See Composing Patterns for complete architecture examples and a step-by-step composition guide.

Common compositions:

```mermaid
flowchart TB
    subgraph Stack["Defense in Depth"]
        GC["Gateway Chokepoint<br/>Single entry point"]
        TM["Tripwire Mesh<br/>Distributed detection"]
        BF["Behavioral Fingerprinting<br/>Anomaly detection"]
        GH["Ghost Checker<br/>Async verification"]
        BR["Blast Radius Containment<br/>Limit damage"]
    end
    GC --> TM --> BF --> GH --> BR

    style GC fill:#ffcccc
    style TM fill:#ffe6cc
    style BF fill:#ffffcc
    style GH fill:#ccffcc
    style BR fill:#cce6ff
```

```mermaid
flowchart LR
    CA["Capability Airlock<br/>start here"] --> GA["Graduated Autonomy<br/>earn trust"] --> TD["Trust Decay<br/>maintain it"]

    style CA fill:#ffcccc
    style GA fill:#ffffcc
    style TD fill:#ccffcc
```

```mermaid
flowchart LR
    VT["Voting Tribunal<br/>agreement"] --- AC["Adversarial Collaboration<br/>debate"] --- RV["Rotating Validator<br/>no capture"]

    style VT fill:#cce6ff
    style AC fill:#e6ccff
    style RV fill:#ffcce6
```

```mermaid
flowchart LR
    CR["Checkpoint-Rollback<br/>save state"] --- FC["Fallback Chain<br/>alternatives"] --- GD["Graceful Degradation<br/>reduce scope"] --- SH["Self-Healing<br/>auto-fix"]

    style CR fill:#cce6ff
    style FC fill:#ccffcc
    style GD fill:#ffffcc
    style SH fill:#ffcccc
```

```mermaid
flowchart LR
    BS["Bonding & Slashing<br/>skin in game"] --- RS["Reputation Staking<br/>calibrated"] --- BY["Bounty System<br/>find issues"] --- IN["Insurance<br/>transfer risk"]

    style BS fill:#ffcccc
    style RS fill:#ffe6cc
    style BY fill:#ffffcc
    style IN fill:#ccffcc
```

When you stack three 90%-effective layers, you might expect 99.9% protection (a 0.1% failure rate). But if those layers share failure modes, influence each other, or can collude, you might get only 95% protection: a failure rate 50× higher than expected.

This “entanglement tax” is one of the hardest problems in delegation architecture design. Entanglements come in three forms:

  • Passive: Shared blind spots, correlated failures
  • Active: One component influences another (the “bureaucrat problem”)
  • Adversarial: Components coordinate against the principal’s interests

See the full Entanglements section →
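The layered-protection arithmetic above can be made concrete. In this sketch, the 4.9% shared blind spot is an illustrative assumption chosen to reproduce the 95% figure:

```python
p_miss = 0.10                 # each layer misses 10% of bad inputs

# Independent layers: a bad input gets through only if all three miss.
independent_miss = p_miss ** 3                  # 0.001 -> 99.9% protection

# Entangled layers: assume a shared blind spot covering 4.9% of inputs,
# where all three layers fail together; elsewhere they act independently.
shared_blind_spot = 0.049     # illustrative assumption
entangled_miss = shared_blind_spot + (1 - shared_blind_spot) * p_miss ** 3

entanglement_tax = entangled_miss / independent_miss   # ~50x worse
```

Even a small correlated region dominates the failure rate, because independent stacking drives the uncorrelated term toward zero.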

| Topic | What It Covers |
| --- | --- |
| Types of Entanglement | Passive, active, and adversarial entanglements and their variants |
| Detecting Influence | Methods for detecting active and adversarial entanglements |
| Challenge Categories | 9 sources of entanglement |
| Worked Examples | Code review, healthcare, trading, customer support |
| Solutions | 8 mitigations including temporal dynamics |
| Research Connections | Principal-agent theory, game theory, adversarial ML |
| Formal Definitions | Information-theoretic and game-theoretic formalization |
| Historical Cases | Enron, rating agencies, Boeing, Madoff, TMI |
| Quantitative Metrics | Practical measurement and dashboards |

Your system likely has entanglement problems if:

  • Multiple verification layers use the same LLM provider
  • All safety patterns were designed by the same team
  • Verifiers receive context from the components they verify
  • Agent-verifier pairs work together repeatedly without rotation
  • Your red team can find “master key” inputs that evade all layers
  • Dashboards only show per-component metrics (not joint failure rates)
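On the last point, a minimal joint-failure metric looks like the following (the per-input event format is an assumption for illustration):

```python
def joint_failure_rate(events):
    """Entanglement metric sketch: fraction of inputs on which *every*
    layer failed at once. `events` is a list of per-input tuples of
    booleans, True meaning that layer caught the problem. Per-component
    dashboards cannot reveal this number; only joint data can."""
    joint = sum(1 for layers in events if not any(layers))
    return joint / len(events)
```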

Each pattern page notes related anti-patterns to avoid. Common mistakes:

  • Over-engineering: Adding patterns that don’t address real threats
  • Pattern cargo-culting: Using patterns without understanding trade-offs
  • Incomplete implementation: Half-implementing a pattern (often worse than nothing)
  • Pattern conflicts: Combining patterns that interfere with each other
  • Missing the threat model: Sophisticated patterns for the wrong threats
  • Ignoring correlation: Assuming composed patterns are independent (see Entanglements)

If you can only implement a few patterns, start here:

  1. Escalation Ladder: Foundation for human oversight
  2. Capability Airlock: Contain dangerous capabilities
  3. Checkpoint-Rollback: Enable recovery from any problem
  4. Graceful Degradation: Fail safely, not completely
  5. Trust Decay: Keep trust calibrated over time
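The first of these, the Escalation Ladder, is essentially stakes-based routing. A sketch with illustrative thresholds (the scores and tier names are assumptions to be tuned per deployment):

```python
def escalation_level(stakes, thresholds=(0.3, 0.7)):
    """Escalation Ladder sketch: low-stakes actions run autonomously,
    medium-stakes actions get peer review, and high-stakes actions
    require explicit human sign-off."""
    low, high = thresholds
    if stakes < low:
        return "autonomous"
    if stakes < high:
        return "peer-review"
    return "human-approval"
```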

Patterns you might not find elsewhere:

  • Panopticon Effect: Uncertainty as deterrent
  • Vouching Network: Transitive accountability
  • Dead Man’s Switch: Inverse heartbeat for oversight
  • Cooling Off Period: Time as a safety mechanism
  • Capability Sunset: Permissions that expire
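The Dead Man’s Switch inverts the usual heartbeat alarm: silence from the overseer, not from the system, is what trips the response. A minimal sketch (class and callback names are hypothetical):

```python
import time

class DeadMansSwitch:
    """Dead Man's Switch sketch: the overseer must call heartbeat()
    within every interval; a missed check-in triggers the safety
    response automatically on the next poll."""
    def __init__(self, interval_s, on_trip):
        self.interval_s = interval_s
        self.on_trip = on_trip
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def poll(self):
        if time.monotonic() - self.last_heartbeat > self.interval_s:
            self.on_trip()
```

In practice `poll()` would run on a timer independent of the system being overseen, so a compromised agent cannot suppress it.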

These patterns appear throughout our case studies and worked examples:

| Case Study | Key Patterns Used |
| --- | --- |
| Code Review Bot | Task Decomposition Pipeline, Gateway Chokepoint, Escalation Ladder |
| Sydney Incident | Behavioral Fingerprinting, Capability Airlock (missing) |
| Drift Detection | Trust Decay, Temporal Behavior Analysis, Graceful Degradation |
| Nuclear Launch Authority | Dead Man’s Switch, Separation of Powers, Voting Tribunal |
| Jury Trust | Adversarial Collaboration, Rotating Validator, Voting Tribunal |
| Open Source Trust | Vouching Network, Graduated Autonomy, Reputation Staking |

| Example | Key Patterns Used |
| --- | --- |
| Trading System | Capability Airlock, Graceful Degradation, Blast Radius Containment |
| Healthcare Bot | Escalation Ladder, Need-to-Know Compartments, Cooling Off Period |
| Code Deployment | Task Decomposition Pipeline, Checkpoint-Rollback, Ghost Checker |
| Research Assistant | Graduated Autonomy, Random Audit, Trust Decay |

These patterns evolve with the field. If you’ve developed new patterns or have improvements to existing ones, we welcome contributions.

See also: