
Core Concepts

This page introduces the key ideas without mathematical formalism. For quantitative details, see Delegation Risk.

How all the pieces fit together:

```mermaid
flowchart TB
    subgraph Foundation["Foundation: What We're Managing"]
        RISK["Delegation Risk<br/>Risk = Σ P(harm) × Damage"]
    end

    subgraph Theory["Theory: How Risk Behaves"]
        PROP["Risk Propagation<br/>How risk flows through chains"]
        COMP["Composition<br/>How component risks combine"]
    end

    subgraph Methods["Methods: Established Approaches"]
        BUDGET["Risk Budgeting<br/>Allocate, monitor, enforce"]
        MECH["Mechanism Design<br/>Incentive-compatible reporting"]
    end

    subgraph Principles["Principles: How We Constrain"]
        LEASTX["Least X Principles<br/>Capability, Privilege, Context..."]
        DECOMP["Decomposition<br/>Many limited components"]
    end

    subgraph Application["Application: AI Systems"]
        ARCH["Architecture<br/>Decomposed coordination"]
        SAFETY["Safety Mechanisms<br/>Tripwires, gates, verifiers"]
    end

    RISK --> PROP
    RISK --> COMP
    PROP --> BUDGET
    COMP --> BUDGET
    BUDGET --> MECH
    MECH --> LEASTX
    LEASTX --> DECOMP
    DECOMP --> ARCH
    ARCH --> SAFETY
```

Reading the diagram: Delegation Risk is the foundation. Theory explains how it behaves. Methods provide established approaches from other fields. Principles give actionable constraints. Application shows how to build systems (with AI as primary focus).

When you delegate a task, you’re accepting risk—the potential for harm if the delegate fails or misbehaves.

Delegation Risk quantifies this: for each delegate, sum over all possible bad outcomes, weighted by their probability. A delegate with access to your bank account has higher Delegation Risk than one that can only read public web pages.

Delegation Risk = Σ P(harm mode) × Damage(harm mode)

For each way something could go wrong (harm mode), multiply:

  • The probability it happens
  • The damage if it does

Sum across all harm modes to get total Delegation Risk.

Example: Employee with Check-Writing Authority

| Harm Mode | Probability | Damage | Risk Contribution |
| --- | --- | --- | --- |
| Clerical error | 0.01 | $1,000 | $10 |
| Unauthorized payment | 0.001 | $50,000 | $50 |
| Fraud | 0.0001 | $500,000 | $50 |

Delegation Risk = $10 + $50 + $50 = **$110** expected cost

This doesn’t mean you’ll lose $110. It means that over many such delegations, you’d expect $110 in losses on average. You can compare this to the value the delegation provides and decide if it’s worth it.
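
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The numbers come from the table above; the `delegation_risk` helper is illustrative, not part of any framework API.

```python
# Minimal sketch of the Delegation Risk sum, using the
# check-writing example above. The helper is illustrative.

harm_modes = [
    # (harm mode, probability, damage in dollars)
    ("clerical error", 0.01, 1_000),
    ("unauthorized payment", 0.001, 50_000),
    ("fraud", 0.0001, 500_000),
]

def delegation_risk(modes):
    """Sum P(harm mode) * Damage(harm mode) over all harm modes."""
    return sum(p * damage for _, p, damage in modes)

print(f"${delegation_risk(harm_modes):,.2f}")  # $110.00 expected cost
```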

When delegates delegate to other delegates, risk relationships form networks. If A trusts B with weight 0.8, and B trusts C with weight 0.7, how much should A trust C?

Different rules give different answers:

  • Multiplicative: 0.8 × 0.7 = 0.56 (each stage is independent risk)
  • Minimum: min(0.8, 0.7) = 0.7 (chain only as strong as weakest link)
  • Discounted: More complex models accounting for path length

```mermaid
graph LR
    A((A)) -->|"0.8"| B((B))
    B -->|"0.7"| C((C))
    A -.->|"0.56 mult / 0.7 min"| C

The right rule depends on what “trust” means in your context. The framework provides formal tools for reasoning about these relationships.

Choosing a propagation rule
| Rule | Formula | Best when… |
| --- | --- | --- |
| Multiplicative | T(A→C) = T(A→B) × T(B→C) | Each delegation is independent risk; default choice |
| Minimum | T(A→C) = min(T(A→B), T(B→C)) | Chain only as strong as weakest link |
| Discounted | T(A→C) = T(A→B) × T(B→C) × δⁿ | Longer paths inherently less trustworthy |
| Maximum path | T(A→C) = max over all paths | Any valid route suffices |

Most systems should start with multiplicative propagation—it’s intuitive, conservative, and composes well.
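
A short Python sketch of the first three rules, assuming trust weights in [0, 1] and taking n in the discounted rule to be the number of hops. The function names and the δ value are illustrative.

```python
# Sketch of three propagation rules for a chain of trust weights
# in [0, 1]. Here n in the discounted rule is the number of hops.

def multiplicative(weights):
    """Each hop is an independent risk: multiply along the chain."""
    result = 1.0
    for w in weights:
        result *= w
    return result

def minimum(weights):
    """The chain is only as strong as its weakest link."""
    return min(weights)

def discounted(weights, delta=0.9):
    """Multiplicative, with an extra per-hop penalty for path length."""
    return multiplicative(weights) * delta ** len(weights)

chain = [0.8, 0.7]  # A trusts B at 0.8, B trusts C at 0.7
print(round(multiplicative(chain), 3))  # 0.56
print(minimum(chain))                   # 0.7
print(round(discounted(chain), 3))      # 0.454
```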

A single powerful delegate is dangerous because it concentrates capability, context, and control. If it fails or acts against you, the damage is unbounded.

Decomposition breaks this pattern:

  • Split tasks across many specialized components
  • Each component has limited capability and context
  • No single component can cause catastrophic harm
  • Components can’t easily coordinate against the system

Think of it like separation of powers in government, or microservices in software architecture.

| Aspect | Centralized | Decomposed |
| --- | --- | --- |
| Single point of failure | CEO makes all decisions | Authority distributed across roles |
| Blast radius | Bad CEO decision affects everything | Bad decision contained to department |
| Coordination risk | CEO can pursue personal agenda | Requires multiple people to collude |
| Verification | Must verify CEO’s judgment on everything | Verify narrow, specific behaviors |

The same logic applies to software systems and AI agents.

Security engineering has the principle of least privilege: give each component minimum necessary access. We extend this across multiple dimensions:

  • Least Capability: Minimum ability to take actions
  • Least Privilege: Minimum permissions and access rights
  • Least Context: Minimum information about the broader system
  • Least Persistence: Minimum state maintained across invocations
  • Least Autonomy: Minimum ability to act without oversight
  • Least Connectivity: Minimum communication paths between components

Each principle bounds a different attack surface. Applied together, they create components that are limited in what they can do, know, remember, and coordinate.

Why each principle matters
| Principle | What it limits | Risk bounded |
| --- | --- | --- |
| Least Capability | Ability to act | Direct damage from actions |
| Least Privilege | Access rights | Scope of potential harm |
| Least Context | System knowledge | Strategic planning against system |
| Least Persistence | Memory across calls | Long-term planning, accumulating leverage |
| Least Autonomy | Unsupervised action | Scope of damage before intervention |
| Least Connectivity | Communication paths | Coordination between components |
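
One way to make these principles operational is to declare each component’s bounds explicitly and check them at the boundary. A hypothetical Python sketch follows; the `ComponentSpec` fields mirror the six principles, and none of these names come from a real library.

```python
from dataclasses import dataclass

# Hypothetical spec that makes each "Least X" bound explicit and
# checkable at the boundary. All names are illustrative, not a real API.

@dataclass(frozen=True)
class ComponentSpec:
    name: str
    allowed_actions: frozenset      # Least Capability
    allowed_resources: frozenset    # Least Privilege
    context_fields: frozenset       # Least Context
    persistent: bool = False        # Least Persistence
    requires_approval: bool = True  # Least Autonomy
    may_talk_to: frozenset = frozenset()  # Least Connectivity

searcher = ComponentSpec(
    name="searcher",
    allowed_actions=frozenset({"read"}),
    allowed_resources=frozenset({"public_web"}),
    context_fields=frozenset({"query"}),
)

def check_action(spec, action, resource):
    """Reject anything outside the component's declared bounds."""
    return action in spec.allowed_actions and resource in spec.allowed_resources

print(check_action(searcher, "read", "public_web"))     # True
print(check_action(searcher, "write", "bank_account"))  # False
```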

Borrowed from finance and nuclear safety: set a total acceptable risk level for the system, then allocate risk budgets to components.

Key requirements:

  • Compositional guarantees: Component risks must aggregate predictably
  • Principled allocation: Methods like Euler allocation ensure budgets sum correctly
  • Incentive-compatible reporting: Mechanisms that make honest risk reporting optimal
  • Verification infrastructure: Independent confirmation that claimed levels match reality
  • Conservative margins: Buffer for uncertainty and unknowns

```mermaid
flowchart TD
    S[System Risk Target<br/>e.g. $10,000/year] --> A[Component A<br/>Budget: $3,000]
    S --> B[Component B<br/>Budget: $4,000]
    S --> C[Component C<br/>Budget: $3,000]
    A --> V[Verification Layer]
    B --> V
    C --> V
```
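
A minimal sketch of the allocation step, using the numbers from the diagram. The proportional split and the margin parameter stand in for a principled method such as Euler allocation; all names are illustrative.

```python
# Sketch: split a system risk target across components, optionally
# holding back a margin for unknowns. The proportional split is a
# placeholder for a principled method such as Euler allocation.

def allocate_budgets(system_target, weights, margin=0.0):
    """Per-component budgets summing to (1 - margin) * system_target."""
    allocatable = system_target * (1 - margin)
    total = sum(weights.values())
    return {name: allocatable * w / total for name, w in weights.items()}

budgets = allocate_budgets(
    system_target=10_000,              # $/year, as in the diagram
    weights={"A": 3, "B": 4, "C": 3},  # relative risk shares
)
print(budgets)  # {'A': 3000.0, 'B': 4000.0, 'C': 3000.0}
```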

The central claim: safety can be a property of system architecture, not just individual component behavior.

If you can’t guarantee a delegate is trustworthy, you can still:

  • Limit what they can access (least privilege)
  • Limit what they know (least context)
  • Limit what they can coordinate (decomposition)
  • Limit how much damage they can cause (risk budgets)
  • Verify their behavior matches claims (verification layers)

The framework applies generally, but AI systems are our primary focus. For AI specifically:

Not all implementations are equally trustworthy:

  1. Provably secure code — Mathematical guarantees of correctness
  2. Regular code — Auditable, testable, deterministic
  3. Fine-tuned narrow models — Predictable within a specific domain
  4. Constrained general models — LLMs with extensive prompting and evaluation
  5. Base LLMs — General capability, harder to verify
  6. RL systems — Learned policies, least predictable

```mermaid
graph TB
    subgraph "Most Verifiable"
        A[Provably Secure Code]
    end
    A --> B[Regular Code]
    B --> C[Fine-tuned Narrow Models]
    C --> D[Constrained General Models]
    D --> E[Base LLMs]
    E --> F[RL Systems]
    subgraph "Least Verifiable"
        F
    end
```

When to use each level
| Level | Use when… | Example |
| --- | --- | --- |
| Provably secure code | Correctness is critical, logic is simple | Access control checks, rate limiters |
| Regular code | Behavior is fully specifiable | Data transformation, API routing |
| Fine-tuned narrow models | Task is well-defined, training data exists | Sentiment classification, entity extraction |
| Constrained general models | Judgment needed, outcomes recoverable | Draft generation, summarization |
| Base LLMs | Maximum flexibility needed, heavy oversight | Creative tasks, complex reasoning |
| RL systems | Learning from feedback is essential | Rarely recommended for safety-critical paths |

Architectural Spectrum: Monolithic vs Decomposed

Real systems exist on a spectrum. This diagram shows two stylized endpoints:

```mermaid
flowchart TB
    subgraph Before["Monolithic Agent"]
        direction TB
        M["Single Agent<br/>────────────<br/>• Full context access<br/>• All capabilities<br/>• High autonomy<br/>• Single point of failure"]
        M --> OUT1[Any Output]
    end

    subgraph After["Decomposed System"]
        direction TB
        subgraph Coord["Coordination (Code)"]
            R[Router]
        end
        subgraph Workers["Narrow Components"]
            W1["Searcher<br/>Read-only"]
            W2["Summarizer<br/>Text only"]
            W3["Analyzer<br/>No actions"]
        end
        subgraph Safety["Safety Layer"]
            V[Verifier]
            G[Human Gate]
        end
        R --> W1 & W2 & W3
        W1 & W2 & W3 --> V
        V --> G
        G --> OUT2[Verified Output]
    end
```

| Aspect | Monolithic | Decomposed |
| --- | --- | --- |
| Single point of failure | Yes—one bad output, system fails | No—failure isolated to component |
| Attack surface | Entire model capability | Only exposed component interfaces |
| Verification | Must verify everything the model can do | Verify narrow, specific behaviors |
| Blast radius | Unlimited—model can do anything | Bounded—each component has limits |
| Coordination risk | Model coordinates with itself | Components can’t easily collude |
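
A minimal sketch of the decomposed shape: coordination is plain code, each worker is narrow, and nothing leaves the system without passing the verifier and the human gate. All component names and checks are illustrative stand-ins.

```python
# Sketch of the decomposed pattern: plain-code coordination, narrow
# workers, and a safety layer in front of the output. All names and
# checks are illustrative stand-ins for real components.

def searcher(query):        # read-only: fetches, never writes
    return f"results for {query!r}"

def summarizer(text):       # text in, shorter text out
    return text[:100]

def verifier(output):
    """Verify a narrow, specific property rather than general behavior."""
    return len(output) < 500

def human_gate(output):
    """Stand-in for human approval before output leaves the system."""
    return True  # in a real system, a person decides here

def run(query):
    draft = summarizer(searcher(query))   # router logic is ordinary code
    if not (verifier(draft) and human_gate(draft)):
        raise ValueError("output rejected by safety layer")
    return draft

print(run("delegation risk"))  # verified output
```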

In summary, the framework provides:

  1. Vocabulary: Precise terms for discussing risk, delegation, and containment
  2. Theory: Mathematics for risk propagation, capability quantification, and optimization
  3. Methods: Established approaches adapted from finance, nuclear safety, mechanism design
  4. Principles: Actionable design constraints (the “Least X” family)
  5. Patterns: Decomposed coordination, verification layers, safety mechanisms

The goal is infrastructure for safely delegating at scale—not a complete solution to trust, but a foundation for managing risk as capabilities increase.

The framework isn’t just about minimizing risk—it’s about maximizing capability subject to risk constraints:

$$\max\ \text{Capability} \quad \text{s.t.} \quad \text{DelegationRisk} \leq \text{Budget}$$

Where Capability = Power × Agency:

  • Power: Ability to achieve diverse goals (what can the system accomplish?)
  • Agency: Coherence of goal-pursuit (how optimizer-like is the system?)
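
A toy illustration of this objective, with made-up options and numbers; it shows the shape of the decision, not how Power or Agency would actually be measured.

```python
# Toy example of: max Capability s.t. DelegationRisk <= Budget,
# with Capability = Power * Agency. All options and numbers are made up.

options = [
    # (name, power, agency, delegation risk in $/year)
    ("read-only narrow model", 2.0, 1.0, 500),
    ("constrained general model", 5.0, 3.0, 4_000),
    ("autonomous general agent", 9.0, 8.0, 50_000),
]

BUDGET = 10_000  # $/year

capability, choice = max(
    (power * agency, name)
    for name, power, agency, risk in options
    if risk <= BUDGET
)
print(choice, capability)  # constrained general model 15.0
```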

See Agent, Power, and Authority for full formalization.