
Recovery Patterns

Recovery patterns address what happens when things go wrong. No system is perfect—agents will make mistakes, systems will fail, and attacks will succeed. These patterns ensure that failures are contained, recoverable, and don’t cascade into catastrophe.


Graceful Degradation

Define a series of fallback operating modes that progressively reduce capability while maintaining safety, allowing the system to continue operating at reduced capacity rather than failing completely.

Complete failure is often worse than reduced capability. By defining graceful degradation levels, we can maintain essential functions even when parts of the system fail, giving humans time to respond without forcing an all-or-nothing choice.

flowchart TB
    subgraph Ladder["GRACEFUL DEGRADATION LADDER"]
        L0["LEVEL 0: FULL OPERATION<br/>All capabilities active<br/>Full autonomy, normal monitoring"]
        L0 -->|"Minor anomaly"| L1
        L1["LEVEL 1: REDUCED AUTONOMY<br/>High-stakes need approval<br/>Increased logging"]
        L1 -->|"Significant issue"| L2
        L2["LEVEL 2: CONSERVATIVE MODE<br/>Only low-risk ops allowed<br/>External calls blocked"]
        L2 -->|"Serious problem"| L3
        L3["LEVEL 3: SAFE MODE<br/>Read-only operations<br/>Human approval required"]
        L3 -->|"Critical failure"| L4
        L4["LEVEL 4: EMERGENCY SHUTDOWN<br/>All operations halted<br/>State preserved"]
    end

    style L0 fill:#d1fae5,stroke:#059669
    style L1 fill:#bbf7d0,stroke:#16a34a
    style L2 fill:#fef3c7,stroke:#d97706
    style L3 fill:#fed7aa,stroke:#ea580c
    style L4 fill:#fee2e2,stroke:#dc2626

Capabilities at each degradation level:

| Level        | Allowed                       | Needs Approval                    | Blocked                     | Monitoring |
|--------------|-------------------------------|-----------------------------------|-----------------------------|------------|
| Full         | All                           | None                              | None                        | Standard   |
| Reduced      | Read, compute, low-risk write | High-risk write, external, delete | None                        | Enhanced   |
| Conservative | Read, compute                 | Low-risk write                    | High-risk, external, delete | Intensive  |
| Safe         | Read only                     | Compute                           | All writes, external        | Full trace |
| Shutdown     | None                          | None                              | All                         | Forensic   |

The degradation level is selected from an overall severity score:

| Severity | Degradation Level     |
|----------|-----------------------|
| < 0.2    | Full (no degradation) |
| 0.2–0.4  | Reduced               |
| 0.4–0.6  | Conservative          |
| 0.6–0.8  | Safe                  |
| ≥ 0.8    | Shutdown              |

Recovery to a less-degraded state requires:

  1. Authorization — Higher levels need more authority (see trigger table)
  2. Health checks — System must pass health verification
  3. History logging — All transitions recorded with authorization ID

Unknown operations are blocked at any degraded level (whitelisting approach).
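As a concrete illustration of the ladder, here is a minimal Python sketch of level selection and operation gating; the `Level` enum, the `ALLOWED`/`NEEDS_APPROVAL` sets, and the function names are illustrative assumptions rather than a prescribed API.

```python
from enum import IntEnum

class Level(IntEnum):
    FULL = 0
    REDUCED = 1
    CONSERVATIVE = 2
    SAFE = 3
    SHUTDOWN = 4

# Illustrative operation sets per level, following the capability table above.
ALLOWED = {
    Level.FULL: {"read", "compute", "low_risk_write", "high_risk_write", "external", "delete"},
    Level.REDUCED: {"read", "compute", "low_risk_write"},
    Level.CONSERVATIVE: {"read", "compute"},
    Level.SAFE: {"read"},
    Level.SHUTDOWN: set(),
}
NEEDS_APPROVAL = {
    Level.FULL: set(),
    Level.REDUCED: {"high_risk_write", "external", "delete"},
    Level.CONSERVATIVE: {"low_risk_write"},
    Level.SAFE: {"compute"},
    Level.SHUTDOWN: set(),
}

def level_for_severity(severity: float) -> Level:
    """Map an overall severity score (0..1) to a level, using the thresholds above."""
    if severity < 0.2:
        return Level.FULL
    if severity < 0.4:
        return Level.REDUCED
    if severity < 0.6:
        return Level.CONSERVATIVE
    if severity < 0.8:
        return Level.SAFE
    return Level.SHUTDOWN

def check_operation(level: Level, operation: str) -> str:
    """Return 'allow', 'require_approval', or 'block' for an operation at a given level."""
    if level is Level.FULL:
        return "allow"                      # full operation: all capabilities active
    if operation in ALLOWED[level]:
        return "allow"
    if operation in NEEDS_APPROVAL[level]:
        return "require_approval"
    return "block"                          # unknown operations are blocked when degraded
```

For example, `check_operation(level_for_severity(0.45), "external")` returns `"block"`, because a severity of 0.45 lands in Conservative mode, where external calls are blocked.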

Triggers and recovery requirements by level:

| Level        | Trigger Examples                      | Recovery Requires             |
|--------------|---------------------------------------|-------------------------------|
| Reduced      | Elevated error rate, anomaly detected | Self-recovery after timeout   |
| Conservative | Security alert, policy violation      | Team lead approval            |
| Safe         | Confirmed attack, data breach         | Security team approval        |
| Shutdown     | Critical system compromise            | Executive + Security approval |
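A small sketch of the recovery side, moving back up the ladder; the `attempt_recovery` function and the `RECOVERY_AUTHORITY` map are hypothetical names derived from the requirements and the table above.

```python
from datetime import datetime, timezone
from typing import Callable

# Authority required to recover out of each degraded level (per the table above).
RECOVERY_AUTHORITY = {
    "reduced": "self",                    # self-recovery after timeout
    "conservative": "team_lead",
    "safe": "security_team",
    "shutdown": "executive_and_security",
}

transition_history: list[dict] = []       # every transition is recorded here

def attempt_recovery(current_level: str, target_level: str, authorized_by: str,
                     health_check: Callable[[], bool], authorization_id: str) -> bool:
    """Move to a less-degraded level only if authorization and health checks pass."""
    required = RECOVERY_AUTHORITY[current_level]
    if required != "self" and authorized_by != required:
        return False                      # 1. insufficient authority for this level
    if not health_check():
        return False                      # 2. system failed health verification
    transition_history.append({           # 3. log the transition with authorization ID
        "from": current_level,
        "to": target_level,
        "authorization_id": authorization_id,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return True
```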

Benefits:

  • Maintains partial service during problems
  • Clear escalation path
  • Predictable behavior under stress
  • Gives humans time to respond

Costs:

  • Complexity of defining levels
  • Reduced capability during degradation
  • Risk of staying degraded too long

Risks:

  • Wrong level for situation
  • Slow recovery might strand in degraded state
  • Attackers might trigger degradation deliberately

Related patterns:

  • Circuit Breaker Cascade: Triggers degradation
  • Checkpoint-Rollback: Recovery mechanism
  • Dead Man’s Switch: Ultimate fallback

Checkpoint-Rollback

Periodically save system state so that failures can be recovered by reverting to a known-good checkpoint, limiting the blast radius of any problem.

If we can always go back to a known-good state, failures become much less scary. Checkpoint-Rollback creates these save points, enabling recovery from corruption, attacks, or errors by reverting to before the problem occurred.

flowchart TB
    subgraph Timeline["CHECKPOINT-ROLLBACK Timeline"]
        direction LR
        CP1["T0<br/>CP1"] --> CP2["T1<br/>CP2"] --> CP3["T2<br/>CP3"] --> Corrupt["T3<br/>CORRUPT"] --> CP4["T4<br/>CP4"] --> Now["T5<br/>NOW"]
    end

    Corrupt -->|"Problem detected"| Rollback

    Rollback["ROLLBACK<br/>1. Halt operations<br/>2. Select checkpoint (CP3)<br/>3. Restore state from CP3<br/>4. Replay safe operations<br/>5. Resume operations"]

    subgraph AfterRollback["After rollback: new timeline from CP3"]
        direction LR
        NCP1["CP1"] --> NCP2["CP2"] --> NCP3["CP3"] --> NNow["NOW"]
    end

    style CP3 fill:#d1fae5,stroke:#059669
    style Corrupt fill:#fee2e2,stroke:#dc2626
    style Rollback fill:#dbeafe,stroke:#2563eb
flowchart LR
    State["Current State"] --> Serialize["Serialize"]
    Serialize --> Hash["Compute SHA-256"]
    Serialize --> Store["Store to checkpoint storage"]
    Hash --> Record["Create checkpoint record"]
    Store --> Record
    Record --> Verify["Verify readable"]
    Verify --> Prune["Prune if > max checkpoints"]

Checkpoint record includes:

  • Checkpoint ID
  • Created timestamp
  • State hash (SHA-256)
  • Storage location
  • Metadata
  • Verified flag

Defaults: Interval = 1 hour, Max checkpoints = 24
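The checkpoint creation flow above might look roughly like the following sketch; `create_checkpoint`, the in-memory `checkpoints` list, and JSON serialization are stand-ins for whatever serializer and durable storage a real system would use.

```python
import hashlib
import json
import time
import uuid

checkpoints: list[dict] = []              # in-memory stand-in for checkpoint storage
MAX_CHECKPOINTS = 24                      # default retention (see above)

def create_checkpoint(state: dict, metadata: dict | None = None) -> dict:
    """Serialize state, hash it, store it, verify it, and prune old checkpoints."""
    serialized = json.dumps(state, sort_keys=True)                      # serialize
    state_hash = hashlib.sha256(serialized.encode()).hexdigest()        # compute SHA-256
    record = {
        "checkpoint_id": str(uuid.uuid4()),
        "created_at": time.time(),
        "state_hash": state_hash,
        "storage_location": serialized,   # a real system writes to durable storage
        "metadata": metadata or {},
        "verified": False,
    }
    # Verify the stored copy is readable and still matches the recorded hash.
    restored = json.loads(record["storage_location"])
    round_trip = hashlib.sha256(json.dumps(restored, sort_keys=True).encode()).hexdigest()
    record["verified"] = (round_trip == state_hash)
    checkpoints.append(record)
    while len(checkpoints) > MAX_CHECKPOINTS:   # prune if over the retention limit
        checkpoints.pop(0)
    return record
```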

Rollback procedure:

  1. Find checkpoint — Must exist and be verified
  2. Check authorization — Rollbacks require approval
  3. Halt operations — Stop current work
  4. Verify integrity — Hash must match stored state
  5. Restore state — Apply checkpoint state
  6. Replay safe operations — Re-run idempotent/read-only ops since checkpoint
  7. Resume operations — Continue from restored state

Safe to replay: idempotent and read-only operations.
Not safe to replay: destructive, non-idempotent operations (these are skipped).
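A sketch of that procedure, assuming checkpoint records shaped like the ones in the creation sketch above and an operations log whose entries carry `idempotent`/`read_only` flags plus an `apply` callable; all of these names are illustrative.

```python
import hashlib
import json

def rollback(checkpoint: dict, ops_since_checkpoint: list[dict], authorized: bool) -> dict:
    """Restore state from a verified checkpoint, replaying only safe operations."""
    # 1-2. The checkpoint must exist and be verified, and the rollback must be approved.
    if not checkpoint or not checkpoint.get("verified") or not authorized:
        raise PermissionError("rollback refused: unverified checkpoint or missing approval")

    # 3. Halt operations (left abstract in this sketch).

    # 4. Verify integrity: the stored state must still match its recorded hash.
    stored = checkpoint["storage_location"]
    if hashlib.sha256(stored.encode()).hexdigest() != checkpoint["state_hash"]:
        raise ValueError("rollback refused: checkpoint hash mismatch")

    # 5. Restore state from the checkpoint.
    restored_state = json.loads(stored)

    # 6. Replay only idempotent or read-only operations; destructive ones are skipped.
    for op in ops_since_checkpoint:
        if op.get("idempotent") or op.get("read_only"):
            op["apply"](restored_state)

    # 7. Resume from the restored state.
    return restored_state
```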

Checkpointing strategies:

| Strategy    | When to Checkpoint       | Pros                      | Cons                       |
|-------------|--------------------------|---------------------------|----------------------------|
| Time-based  | Fixed intervals          | Predictable               | Might miss critical states |
| Event-based | After significant events | Captures important states | Variable frequency         |
| Transaction | Before risky operations  | Precise recovery          | Higher overhead            |
| Hybrid      | Time + events            | Best coverage             | Complex                    |

Finding the best checkpoint:

  1. Filter to checkpoints created before the problem
  2. Score each candidate
  3. Pick highest-scoring checkpoint
  4. Estimate impact

Checkpoint scoring factors:

| Factor      | Weight          | Notes                            |
|-------------|-----------------|----------------------------------|
| Recency     | Higher = better | Less data loss                   |
| Verified    | 1.0 vs 0.5      | Unverified checkpoints penalized |
| Clean state | +20% bonus      | Marked as known-good state       |
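One way these scoring factors could combine, as a sketch; the exact recency decay and the `clean` metadata flag are assumptions.

```python
import time

def score_checkpoint(cp: dict, now: float | None = None) -> float:
    """Score a checkpoint candidate; higher is better (weights are illustrative)."""
    now = now or time.time()
    age_hours = max((now - cp["created_at"]) / 3600.0, 0.0)
    recency = 1.0 / (1.0 + age_hours)                                   # newer = less data loss
    verified = 1.0 if cp.get("verified") else 0.5                       # unverified penalized
    clean_bonus = 1.2 if cp.get("metadata", {}).get("clean") else 1.0   # known-good state +20%
    return recency * verified * clean_bonus

def select_rollback_target(candidates: list[dict], problem_time: float) -> dict | None:
    """Pick the highest-scoring checkpoint created before the problem was detected."""
    before_problem = [cp for cp in candidates if cp["created_at"] < problem_time]  # 1. filter
    if not before_problem:
        return None
    return max(before_problem, key=score_checkpoint)                    # 2-3. score and pick
```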

Rollback plan includes:

  • Selected checkpoint
  • Estimated data loss
  • Operations that can be replayed
  • Operations that will be lost
  • Confidence score

Benefits:

  • Enables recovery from any problem
  • Limits blast radius to checkpoint interval
  • Creates audit trail of states
  • Supports debugging and analysis

Costs:

  • Storage for checkpoints
  • Performance overhead
  • Complexity of replay logic

Risks:

  • Corrupted checkpoints
  • Data loss between checkpoints
  • Replay might cause new problems

Related patterns:

  • Graceful Degradation: Different recovery approach
  • Black Box Flight Recorder: Provides replay data
  • Fallback Component Chain: Alternative recovery

Fallback Component Chain

Maintain a prioritized list of alternative implementations for critical functionality, automatically switching to backups when primary components fail.

Single points of failure are dangerous. By having pre-configured fallback options ready, we can maintain service continuity even when primary components fail, with graceful degradation rather than complete outage.

flowchart TB
    Request["Request"]
    Request --> Chain

    subgraph Chain["COMPONENT CHAIN"]
        Primary["Primary (GPT-4)<br/>Try first"]
        Primary -->|"On failure"| FB1
        FB1["Fallback 1 (Claude)<br/>Try second"]
        FB1 -->|"On failure"| FB2
        FB2["Fallback 2 (Local LLM)<br/>Try third"]
        FB2 -->|"On failure"| FB3
        FB3["Fallback 3 (Scripted)<br/>Last resort"]
    end

    Chain --> Response["Response<br/>(from whichever succeeded)"]

    style Primary fill:#d1fae5,stroke:#059669
    style FB1 fill:#dbeafe,stroke:#2563eb
    style FB2 fill:#fef3c7,stroke:#d97706
    style FB3 fill:#fee2e2,stroke:#dc2626
flowchart TB
    Request["Request"] --> Check1{"Component 1 healthy?"}
    Check1 -->|"no"| Check2{"Component 2 healthy?"}
    Check1 -->|"yes"| Try1["Try Component 1"]
    Try1 -->|"success"| Done["Return result"]
    Try1 -->|"failure"| Check2
    Check2 -->|"no"| Check3{"Component 3 healthy?"}
    Check2 -->|"yes"| Try2["Try Component 2"]
    Try2 -->|"success"| Done
    Try2 -->|"failure"| Check3
    Check3 -->|"yes"| Try3["Try Component 3"]
    Try3 -->|"success"| Done
    Try3 -->|"failure"| Fail["All failed"]
    Check3 -->|"no"| Fail

Component health tracking:

  • is_healthy (bool)
  • latency_ms
  • error_rate
  • last_check timestamp

Background health checks: Run every minute (configurable). Unhealthy components are skipped.

Priority management:

  • Components sorted by priority (lower = try first)
  • Can promote/demote components dynamically
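A minimal sketch of the chain executor, assuming a hypothetical `Component` record carrying the health fields listed above; it walks components in priority order, skips unhealthy ones, and returns the first success.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Component:
    name: str
    call: Callable[[str], Any]            # the underlying handler (model, rule engine, ...)
    priority: int                         # lower = tried first
    is_healthy: bool = True
    latency_ms: float = 0.0
    error_rate: float = 0.0
    last_check: float = field(default_factory=time.time)

def run_with_fallback(chain: list[Component], request: str) -> Any:
    """Try components in priority order, skipping unhealthy ones (see flow above)."""
    for component in sorted(chain, key=lambda c: c.priority):
        if not component.is_healthy:
            continue                      # background health checks marked it unhealthy
        try:
            start = time.time()
            result = component.call(request)
            component.latency_ms = (time.time() - start) * 1000
            return result                 # first success wins
        except Exception:
            component.error_rate += 0.1   # crude signal; real systems track a window
            continue                      # fall through to the next component
    raise RuntimeError("all components in the fallback chain failed")
```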

Fallback diversity options:

| Type                | Fallback Strategy      | Trade-offs              |
|---------------------|------------------------|-------------------------|
| Provider diversity  | Different AI providers | Different behaviors     |
| Model diversity     | Different model sizes  | Capability differences  |
| Approach diversity  | AI → Rules → Human     | Latency/capability      |
| Location diversity  | Cloud → On-prem → Edge | Latency/availability    |

The chain can automatically reorder based on observed performance.

Performance tracking:

  • Record latency for each successful call
  • Record ∞ for failures
  • Keep last 100 observations per component

Reorder logic: Sort components by average recorded latency (lower = better). Components that are fast and reliable bubble to the top.

Reorder interval: Hourly by default — prevents thrashing while adapting to changes.
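A sketch of this performance-based reordering, using a rolling window of the last 100 latency observations per component; `record_latency` and `reorder` are illustrative names.

```python
from collections import deque

MAX_OBSERVATIONS = 100                              # keep the last 100 observations
latency_history: dict[str, deque] = {}

def record_latency(name: str, latency_ms: float) -> None:
    """Record one observation for a component; use float('inf') for failures."""
    latency_history.setdefault(name, deque(maxlen=MAX_OBSERVATIONS)).append(latency_ms)

def reorder(component_names: list[str]) -> list[str]:
    """Sort by average observed latency; fast, reliable components bubble to the top."""
    def average(name: str) -> float:
        history = latency_history.get(name)
        return sum(history) / len(history) if history else float("inf")
    return sorted(component_names, key=average)
```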

Benefits:

  • No single point of failure
  • Automatic recovery from component failures
  • Can optimize for performance
  • Graceful degradation built-in

Costs:

  • Complexity of managing multiple components
  • Inconsistent behavior across fallbacks
  • Need to test all paths

Risks:

  • Fallbacks might have different security properties
  • Performance might degrade unexpectedly
  • Dependencies between components

Related patterns:

  • Graceful Degradation: Ladder uses fallback chains
  • Circuit Breaker Cascade: Triggers fallback
  • Bulkhead Isolation: Isolates fallback failures

Blast Radius Containment

Limit the scope of potential damage from any failure or attack by partitioning the system so that problems in one area cannot spread to others.

Even with prevention measures, failures happen. Blast radius containment ensures that when things go wrong, the damage is limited to a small portion of the system rather than causing total collapse.

flowchart TB
    subgraph System["SYSTEM"]
        direction TB

        subgraph Zones[" "]
            direction LR
            ZoneA["ZONE A<br/>[FAILURE]"]
            ZoneB["ZONE B<br/>[OK]"]
            ZoneC["ZONE C<br/>[OK]"]
        end

        Bulkhead["BULKHEAD BOUNDARY<br/>Failure in Zone A:<br/>• Cannot access Zone B or C<br/>• Cannot propagate errors<br/>• Cannot exhaust shared resources"]
    end

    style ZoneA fill:#fee2e2,stroke:#dc2626
    style ZoneB fill:#d1fae5,stroke:#059669
    style ZoneC fill:#d1fae5,stroke:#059669
    style Bulkhead fill:#fef3c7,stroke:#d97706

Each zone has:

  • Zone ID and name
  • Resources (what’s in this zone)
  • Resource limits (max CPU, memory, etc.)
  • Allowed external calls
  • Isolation level (strict, moderate, etc.)
flowchart LR
    ZoneA["Zone A"] -->|"access request"| Check{"Policy exists?"}
    Check -->|"no"| Deny1["DENIED"]
    Check -->|"yes"| Type{"Access type permitted?"}
    Type -->|"no"| Deny2["DENIED"]
    Type -->|"yes"| Health{"Source zone healthy?"}
    Health -->|"no"| Deny3["DENIED"]
    Health -->|"yes"| Allow["ALLOWED"]

When a zone fails:

  1. Mark zone as unhealthy
  2. Block all cross-zone policies involving this zone
  3. Terminate active cross-zone connections
  4. Log isolation reason and timestamp
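A sketch of cross-zone access checks and zone isolation under the assumptions above; the `Zone` dataclass, the `policies` map, and `isolate_zone` are illustrative, and terminating active connections is left abstract.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Zone:
    zone_id: str
    name: str
    is_healthy: bool = True
    resources: set = field(default_factory=set)

# Cross-zone policies: (source zone, target zone) -> permitted access types.
policies: dict[tuple, set] = {
    ("zone_a", "zone_b"): {"read"},       # illustrative policy
}
isolation_log: list[dict] = []

def check_cross_zone_access(source: Zone, target: Zone, access_type: str) -> bool:
    """Mirror the decision flow above: policy exists -> type permitted -> source healthy."""
    allowed = policies.get((source.zone_id, target.zone_id))
    if allowed is None:
        return False                      # no policy: DENIED
    if access_type not in allowed:
        return False                      # access type not permitted: DENIED
    if not source.is_healthy:
        return False                      # source zone unhealthy: DENIED
    return True                           # ALLOWED

def isolate_zone(zone: Zone, reason: str) -> None:
    """When a zone fails: mark it unhealthy, drop its policies, log the isolation."""
    zone.is_healthy = False
    for key in [k for k in policies if zone.zone_id in k]:
        del policies[key]                 # block all cross-zone policies involving the zone
    # Terminating active cross-zone connections is left abstract in this sketch.
    isolation_log.append({"zone": zone.zone_id, "reason": reason, "at": time.time()})
```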

Blast radius impact estimation:

| Impact Type | What It Includes                  |
|-------------|-----------------------------------|
| Direct      | Resources within the failing zone |
| Indirect    | Resources in dependent zones      |
| User impact | Estimated affected users/services |

Zone design principles:

| Principle            | Description                     | Implementation              |
|----------------------|---------------------------------|-----------------------------|
| Minimal dependencies | Zones should be self-sufficient | Each zone has own resources |
| Clear boundaries     | Well-defined zone interfaces    | API-only cross-zone access  |
| Independent failure  | Zones can fail independently    | Separate health tracking    |
| Resource isolation   | Resources dedicated per zone    | Resource quotas enforced    |
| Blast radius limits  | Maximum damage bounded          | Zone impact estimation      |

Benefits:

  • Failures don’t cascade
  • Predictable impact boundaries
  • Easier root cause analysis
  • Enables parallel recovery

Costs:

  • Resource duplication
  • Cross-zone communication overhead
  • Complexity of zone management

Risks:

  • Zone boundaries might not match failure modes
  • Hidden cross-zone dependencies
  • Over-isolation reduces efficiency

Related patterns:

  • Bulkhead Isolation: Implementation of containment
  • Circuit Breaker Cascade: Triggers isolation
  • Graceful Degradation: Operates within zones

Self-Healing Loop

Implement automated detection and correction of problems, allowing the system to recover from certain failures without human intervention.

Humans can’t respond to every issue in real time. Self-healing loops detect known problem patterns and apply proven remediation steps automatically, reducing mean time to recovery and the operational burden on humans.

flowchart TB
    subgraph Loop["SELF-HEALING LOOP"]
        Detect["DETECT<br/>Monitor for symptoms"]
        Detect --> Diagnose
        Diagnose["DIAGNOSE<br/>Identify root cause"]
        Diagnose --> Heal
        Heal["HEAL<br/>Apply remedy"]
        Heal --> Verify
        Verify["VERIFY<br/>Confirm fixed"]
        Verify -->|"Continue monitoring"| Detect
    end

    Fail["IF HEALING FAILS:<br/>1. Try next remedy in runbook<br/>2. If exhausted, escalate to human<br/>3. Log everything for analysis"]

    style Detect fill:#dbeafe,stroke:#2563eb
    style Diagnose fill:#fef3c7,stroke:#d97706
    style Heal fill:#d1fae5,stroke:#059669
    style Verify fill:#f3e8ff,stroke:#9333ea

Symptom Pattern:

  • Pattern ID and name
  • Detector function (returns true if symptom present)
  • Severity level

Remedy:

  • Remedy ID and name
  • Action to execute
  • Prerequisites (other remedies that must succeed first)
  • Side effects (what else might happen)
  • Estimated duration

Runbook:

  • Which pattern it addresses
  • Ordered list of remedies to try
  • Max auto-attempts before escalating
  • Cooldown between attempts
  • Escalation policy
flowchart TB
    Monitor["Monitor system state"] --> Detect{"Symptom detected?"}
    Detect -->|"no"| Monitor
    Detect -->|"yes"| Cooldown{"In cooldown?"}
    Cooldown -->|"yes"| Monitor
    Cooldown -->|"no"| Try["Try first remedy"]
    Try -->|"success"| Verify{"Symptom gone?"}
    Verify -->|"yes"| Done["Healed!"]
    Verify -->|"no"| Next{"More remedies?"}
    Try -->|"failure"| Next
    Next -->|"yes"| Try
    Next -->|"no"| Escalate["Escalate to human"]
    Done --> SetCooldown["Set cooldown"]
    Escalate --> SetCooldown
    SetCooldown --> Monitor
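The loop above, reduced to a single-pass sketch; `Remedy`, `Runbook`, and `healing_pass` are hypothetical names, and detection and remediation are plain callables standing in for real monitoring and operations hooks.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Remedy:
    name: str
    action: Callable[[], bool]            # returns True if the remedy applied cleanly

@dataclass
class Runbook:
    pattern_name: str
    detector: Callable[[], bool]          # returns True while the symptom is present
    remedies: list                        # ordered Remedy list, tried in sequence
    cooldown_s: float = 900.0
    escalate: Callable[[str], None] = print   # stand-in for paging the on-call
    last_attempt: float = 0.0

def healing_pass(runbook: Runbook) -> None:
    """One detect -> diagnose -> heal -> verify pass, following the flow above."""
    if not runbook.detector():
        return                                            # no symptom: keep monitoring
    if time.time() - runbook.last_attempt < runbook.cooldown_s:
        return                                            # still in cooldown
    runbook.last_attempt = time.time()
    for remedy in runbook.remedies:                       # try remedies in order
        if remedy.action() and not runbook.detector():    # heal, then verify symptom gone
            return                                        # healed
    runbook.escalate(f"self-healing exhausted for {runbook.pattern_name}")  # escalate to human
```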

Example runbooks:

High Memory Usage:

  • Symptom: memory_usage > 90%
  • Severity: Medium
  • Cooldown: 15 minutes
  • Escalation: Page on-call

| Order | Remedy                   | Duration | Side Effects         |
|-------|--------------------------|----------|----------------------|
| 1     | Clear caches             | 30 sec   | Temporary slowdown   |
| 2     | Force garbage collection | 10 sec   | None                 |
| 3     | Restart workers          | 2 min    | Brief unavailability |

Stuck Task:

  • Symptom: Any task running > 1 hour
  • Severity: High
  • Cooldown: 30 minutes
  • Escalation: Page on-call

| Order | Remedy            | Duration | Side Effects |
|-------|-------------------|----------|--------------|
| 1     | Send heartbeat    | 5 sec    | None         |
| 2     | Timeout and retry | 1 min    | Task restart |
| 3     | Kill task         | 10 sec   | Task lost    |
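The two example runbooks could be expressed as plain data along the lines of the sketch below; the dictionary shape, the detector lambdas, the remedy names, and the `max_auto_attempts` value are assumptions for illustration only.

```python
# The example runbooks above, expressed as plain data. Detector lambdas and remedy
# names are placeholders; real ones would query metrics and call the process manager.
RUNBOOKS = [
    {
        "pattern": "high_memory_usage",
        "detect": lambda metrics: metrics["memory_usage"] > 0.90,
        "severity": "medium",
        "cooldown_s": 15 * 60,
        "escalation": "page_on_call",
        "remedies": ["clear_caches", "force_garbage_collection", "restart_workers"],
        "max_auto_attempts": 3,           # assumed: one attempt per listed remedy
    },
    {
        "pattern": "stuck_task",
        "detect": lambda metrics: metrics["longest_task_s"] > 3600,
        "severity": "high",
        "cooldown_s": 30 * 60,
        "escalation": "page_on_call",
        "remedies": ["send_heartbeat", "timeout_and_retry", "kill_task"],
        "max_auto_attempts": 3,
    },
]
```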

Guardrails prevent self-healing from causing more harm:

| Check                | Limit                | Reason                  |
|----------------------|----------------------|-------------------------|
| Global rate          | Max 10 healings/hour | Prevent runaway healing |
| Same pattern         | Max 3 attempts/hour  | Detect healing failures |
| Alternating patterns | Detect A→B→A→B       | Catch healing loops     |
| Conflicting healings | Only one at a time   | Prevent interference    |

Healing loop detection: If the same symptom is healed 3+ times in an hour, or if patterns alternate (A triggers B triggers A), escalate to human instead of continuing automatic healing.
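A sketch of these guardrail checks, assuming an in-memory `healing_events` log; the thresholds mirror the table above, and the alternating-pattern heuristic inspects the last four healings.

```python
import time

healing_events: list[dict] = []           # log of completed healings: {"pattern", "at"}

def guardrails_allow(pattern: str, healing_in_progress: bool, now: float | None = None) -> bool:
    """Apply the guardrail checks above before starting another automatic healing."""
    now = now or time.time()
    recent = [e for e in healing_events if now - e["at"] < 3600]

    if healing_in_progress:
        return False                      # conflicting healings: only one at a time
    if len(recent) >= 10:
        return False                      # global rate: max 10 healings/hour
    if len([e for e in recent if e["pattern"] == pattern]) >= 3:
        return False                      # same pattern: max 3 attempts/hour
    last_four = [e["pattern"] for e in recent[-4:]]
    if (len(last_four) == 4 and last_four[0] == last_four[2]
            and last_four[1] == last_four[3] and last_four[0] != last_four[1]):
        return False                      # alternating A->B->A->B: likely a healing loop
    return True
```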

Benefits:

  • Faster recovery than human intervention
  • Handles known problems automatically
  • Reduces operational burden
  • 24/7 availability

Costs:

  • Runbook development effort
  • Risk of automated damage
  • Complexity of guardrails

Risks:

  • Healing might cause more damage
  • Loops of failing healing
  • Masks underlying problems

Related patterns:

  • Checkpoint-Rollback: Alternative recovery
  • Graceful Degradation: Fallback during healing
  • Circuit Breaker Cascade: Triggers healing
