
Glossary

Key terms used throughout this framework, organized by domain.

Common Terminology Confusions

Some terms are used interchangeably in casual discussion but have distinct meanings in this framework:

| Term | When to Use | Related Terms |
| --- | --- | --- |
| Agent | When emphasizing goal-directed behavior and optimization | Component, Delegate, System |
| Component | When emphasizing architecture and system structure | Agent, Module, Subsystem |
| Delegate | When emphasizing the delegation relationship from principal | Agent, Subordinate, Executor |
| Defection | Intentional deviation from principal’s goals (scheming) | Misalignment, Failure |
| Misalignment | Goals don’t match principal’s (may or may not be intentional) | Defection, Value Drift |
| Harm Mode | Specific way delegation causes damage | Failure Mode (related but distinct) |
| Failure Mode | Engineering term for how systems fail (from FMEA) | Harm Mode (our AI-specific term) |

When in doubt:

  • Use agent when discussing optimization, goals, and agency scores
  • Use component when discussing architecture and system decomposition
  • Use delegate when discussing the principal-agent relationship

Core Framework Terms

Agent — A system whose behavior can be usefully modeled as optimizing an objective function over time. Agency is a matter of degree, not a binary property. See Agent, Power, and Authority.

Agency Score (or Coherence Score) — Measure of how well a system’s behavior can be explained by a simple utility function. Higher coherence = more agent-like. Formally: max over simple utility functions U of Fit(U, observed behaviors). A low-agency system is predictable; a high-agency system coherently pursues goals.

Power Score — Measure of an agent’s ability to achieve a wide variety of goals. Formally: expected achievability across many possible goal functions. Related concepts: reachability (how many states can the agent access?), resource control, influence over other agents.

Authority — Sanctioned power; the intersection of what an agent can do (power) and what it may do (permission). Authority = Power ∩ Permission. An agent with high power but low authority is dangerous.

Effective Capability — Power × Agency. The ability to actually accomplish objectives. Requires both the power to affect the world and the agency to coherently pursue goals.

RACAP (Risk-Adjusted Capability) — Capability Score / Delegation Risk. Measures efficiency—how much capability per unit risk. Higher RACAP = better system design.
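
A minimal sketch of how these scores compose, with made-up numbers; only the two formulas (Effective Capability = Power × Agency, RACAP = Capability / Delegation Risk) come from the definitions above, and Effective Capability is treated as the capability score.

```python
# Hypothetical scores on a 0-1 scale; the formulas are from the glossary entries above.
agency_score = 0.6          # how coherently the system pursues goals
power_score = 0.4           # how broadly it can affect the world
delegation_risk = 150.0     # expected $ of harm per month (see Delegation Risk below)

effective_capability = power_score * agency_score   # Power × Agency
racap = effective_capability / delegation_risk       # capability per unit of risk

print(f"Effective capability: {effective_capability:.2f}")
print(f"RACAP: {racap:.4f} capability per $ of monthly delegation risk")
```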

Delegation Exposure — The complete set of possible harms (harm modes) from delegating a task. Not a single number—a collection, like an attack surface or failure envelope. Contains all the ways the delegation could cause damage.

Harm Mode — A specific way a delegation could cause damage. Parallels “failure mode” from FMEA. Each harm mode has a probability and a damage value. Example: “Leaks proprietary data” with P=0.001 and Damage=$50,000.

Delegation Risk — The probability-weighted expected cost of a delegation: Σ P(harm mode) × Damage(harm mode). This is the quantified total across all harm modes in the delegation exposure, measured in dollars per time period. Example: two harm modes with expected costs of $100 and $50 give a delegation risk of $150. Decomposes into Accident Risk + Defection Risk.

Accident Exposure — The subset of delegation exposure containing non-goal-directed harm modes: bugs, errors, hallucinations, capability limitations, edge cases. These are failures where the system isn’t pursuing a conflicting objective—it’s just failing to achieve the intended one. See Risk Decomposition.

Accident Risk — Delegation Risk from accident exposure: Σ P(accident mode) × Damage(accident mode). Generally decreases with capability (smarter systems make fewer errors). Mitigated through testing, verification, redundancy, and capability improvements.

Defection Exposure — The subset of delegation exposure containing goal-directed harm modes: scheming, deception, pursuing misaligned objectives, collusion. These are failures where the system is effectively optimizing for something other than the principal’s interests. See Risk Decomposition.

Defection Risk — Delegation Risk from defection exposure: Σ P(defection mode) × Damage(defection mode). May increase with capability (smarter systems are better at pursuing misaligned goals). Mitigated through alignment, monitoring, containment, and architectural constraints.
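
A short sketch of the accounting, reusing the $100 + $50 = $150 example from Delegation Risk above; the specific harm modes and their kind labels are illustrative.

```python
# Each harm mode: (description, kind, probability per month, damage in $).
harm_modes = [
    ("Leaks proprietary data",     "defection", 0.001, 50_000),  # expected $50 (see Harm Mode above)
    ("Corrupts a customer record", "accident",  0.002, 50_000),  # expected $100 (illustrative)
]

def expected_cost(modes):
    return sum(p * damage for _, _, p, damage in modes)

delegation_risk = expected_cost(harm_modes)                                     # $150/mo
accident_risk   = expected_cost([m for m in harm_modes if m[1] == "accident"])  # $100/mo
defection_risk  = expected_cost([m for m in harm_modes if m[1] == "defection"]) # $50/mo

assert delegation_risk == accident_risk + defection_risk
print(delegation_risk, accident_risk, defection_risk)
```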

Risk Inheritance — How risk flows through a delegation chain. When A delegates to B and B delegates to C, A inherits exposure to C’s harm modes. The inheritance rule (multiplicative, minimum, etc.) determines how risks combine.
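
A minimal sketch of one possible inheritance rule (multiplicative); the trust weights w_ab and w_bc follow the w_ij notation used later, but their values and the attenuation interpretation are purely hypothetical.

```python
# A delegates to B, B delegates to C. Under a multiplicative rule, A's inherited
# exposure to each of C's harm modes is discounted by the trust weights along the chain.
# All numbers here are hypothetical.
c_harm_modes = [(0.001, 50_000), (0.010, 1_000)]  # (probability, damage $) for C's harm modes
w_ab, w_bc = 0.8, 0.5                             # trust/attenuation weights per hop

inherited_risk = sum(p * d * w_ab * w_bc for p, d in c_harm_modes)
print(f"A's inherited delegation risk from C: ${inherited_risk:.2f}/mo")  # $24.00
```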

Delegation Risk Budget — Maximum acceptable Delegation Risk for a system or component. Like a financial budget, but for expected harm.

Gross Exposure — Full delegation exposure before controls are applied. Contains all possible harm modes.

Net Exposure — Delegation exposure remaining after controls. Some harm modes may be eliminated or mitigated.

Residual Risk — Delegation Risk after controls are applied. The expected cost that remains despite mitigations.

Principal — An entity that delegates tasks and grants trust. In most systems, humans are the ultimate principals.

Executor — A component that performs delegated actions. Executors create delegation exposure through their potential harm modes.

Coordinator — A component that orchestrates other components—routing tasks, making strategic decisions, aggregating results. Highest-risk component type due to leverage over the system.

Trust Level — Assessment of an actor’s reliability and alignment. Informs delegation decisions but is distinct from delegation risk. High trust level → willing to accept larger delegation exposure.


“Least X” Principles

Least Intelligence — Minimum cognitive capability needed for the task. Bounds optimization power to limit damage from misalignment.

Least Privilege — Minimum permissions and access rights. Classic security principle: read vs. write, network access, API permissions.

Least Context — Minimum information about the broader system. Component doesn’t know system architecture, other components, or deployment details.

Least Persistence — Minimum state maintained across invocations. Prevents accumulation of knowledge or long-term planning.

Least Autonomy — Minimum ability to act without oversight. Spectrum from human-approval-required to bounded-autonomy.

Least Surprise — Maximum predictability and determinism. Temperature=0 for LLMs, logged seeds for randomness. Prevents covert signaling.

Least Connectivity — Minimum communication paths between components. One-way flows, no broadcast, physical isolation where needed.

Least Observability — Minimum visibility into the surrounding system. Components can’t observe system state, their own resources, or whether they’re being tested.

Least Compute — Bounded computational resources per invocation. FLOPs budget, memory limits, timeouts.

Least Generality — Use the least general implementation that works. Hard-coded rules > algorithms > narrow models > general LLMs > RL.

Maximum Verifiability — Use the most verifiable implementation that works. Formally verified code > regular code > fine-tuned models > prompted LLMs.
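
One illustrative way to make these principles concrete is to attach explicit per-component ceilings that can be checked at invocation time. The field names and values below are a sketch, not part of the framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentLimits:
    model: str              # least intelligence / generality: smallest implementation that works
    permissions: tuple      # least privilege: explicit allow-list, e.g. ("read:tickets",)
    context_fields: tuple   # least context: only these inputs are ever passed in
    max_tokens: int         # least compute (proxy): bounded work per invocation
    temperature: float      # least surprise: 0.0 for deterministic decoding
    persistent_state: bool  # least persistence: no memory across invocations
    requires_approval: bool # least autonomy: human sign-off before acting

triage_bot = ComponentLimits(
    model="small-classifier",
    permissions=("read:tickets",),
    context_fields=("ticket_text",),
    max_tokens=256,
    temperature=0.0,
    persistent_state=False,
    requires_approval=True,
)
```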


Risk Budgeting Terms (Finance)

Euler Allocation — Method for decomposing portfolio risk to individual positions. Uses partial derivatives: RC_i = x_i × ∂R/∂x_i. Ensures component contributions sum exactly to total risk.
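
A minimal worked example with made-up positions and covariances, using a volatility-style risk measure R(x) = sqrt(xᵀCx) (homogeneous of degree 1), which is what makes the contributions RC_i = x_i × ∂R/∂x_i sum exactly to total risk.

```python
import numpy as np

x = np.array([100.0, 50.0, 25.0])        # exposures per position/component (illustrative)
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])       # covariance of per-unit losses (illustrative)

R = np.sqrt(x @ C @ x)                   # total portfolio risk
marginal = (C @ x) / R                   # partial derivatives dR/dx_i
risk_contrib = x * marginal              # RC_i = x_i * dR/dx_i

assert np.isclose(risk_contrib.sum(), R) # contributions sum exactly to total risk
print(R, risk_contrib)
```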

Value at Risk (VaR) — Maximum expected loss at a given confidence level (e.g., 95% VaR = loss exceeded only 5% of the time).

Expected Shortfall (ES) / CVaR — Average loss in the worst X% of cases. More sensitive to tail risk than VaR.
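
A short sketch of both measures on simulated losses; the loss distribution and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)  # simulated monthly losses ($)

var_95 = np.quantile(losses, 0.95)        # 95% VaR: loss exceeded only 5% of the time
es_95 = losses[losses >= var_95].mean()   # ES/CVaR: average loss in the worst 5% of cases

print(f"95% VaR: ${var_95:,.0f}   95% ES: ${es_95:,.0f}")
```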

RAROC — Risk-Adjusted Return on Capital. Profit divided by risk capital. Used to compare risk-adjusted performance across units.

Stress Testing — Evaluating system performance under extreme but plausible scenarios.

Backtesting — Comparing predicted risk levels to actual historical outcomes to validate models.


Risk Budgeting Terms (Nuclear/Aerospace)

PRA (Probabilistic Risk Assessment) — Systematic method for quantifying risks through event trees and fault trees. Standard in nuclear and aerospace.

CDF (Core Damage Frequency) — Probability of reactor core damage per reactor-year. Typical target: 10⁻⁴ to 10⁻⁵.

Fault Tree — Logical diagram showing how component failures combine to cause system failure. AND gates (all must fail) and OR gates (any can fail).
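
A small sketch of the gate arithmetic, assuming independent basic events (real PRAs work with minimal cut sets and treat common-cause failures separately).

```python
def and_gate(probs):          # AND gate: all inputs must fail
    out = 1.0
    for p in probs:
        out *= p
    return out

def or_gate(probs):           # OR gate: any input failing is enough
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

# Example: system fails if the valve fails OR both redundant pumps fail.
p_valve, p_pump_a, p_pump_b = 0.01, 0.1, 0.1
p_system = or_gate([p_valve, and_gate([p_pump_a, p_pump_b])])
print(f"P(system failure) = {p_system:.4f}")   # 0.0199
```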

Event Tree — Diagram showing sequences of events following an initiating event, with branching based on success/failure of safety systems.

Defense in Depth — Multiple independent barriers against failure. If one fails, others still protect.

Safety Integrity Level (SIL) — IEC 61508 standard for safety system reliability. SIL 1-4, with higher numbers requiring lower failure probability.

ASIL (Automotive SIL) — ISO 26262 automotive safety standard. ASIL A-D, with D being most stringent.

PFDavg — Average Probability of Failure on Demand. Key metric for safety-instrumented systems.

Common Cause Failure — Single event that causes multiple components to fail simultaneously. Defeats redundancy.

Importance Measures — Metrics quantifying how much each component contributes to system risk. Includes Fussell-Vesely (fraction of risk involving component), Risk Achievement Worth (risk if component failed), Risk Reduction Worth (risk if component perfect).
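
Continuing the valve-plus-redundant-pumps sketch above, the three measures for pump A; the closed-form Fussell-Vesely here is the usual computational approximation (base risk minus risk with the component perfect, over base risk).

```python
def system_risk(p_valve, p_a, p_b):
    return 1.0 - (1.0 - p_valve) * (1.0 - p_a * p_b)   # valve OR (pump A AND pump B)

base    = system_risk(0.01, 0.1, 0.1)   # 0.0199
failed  = system_risk(0.01, 1.0, 0.1)   # pretend pump A has already failed
perfect = system_risk(0.01, 0.0, 0.1)   # pretend pump A can never fail

raw = failed / base                     # Risk Achievement Worth  ~ 5.5
rrw = base / perfect                    # Risk Reduction Worth    ~ 2.0
fv  = (base - perfect) / base           # Fussell-Vesely          ~ 0.5
print(raw, rrw, fv)
```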


Mechanism Design Terms

Incentive Compatibility — Property where truthful reporting is optimal for each participant. No benefit from lying.

VCG (Vickrey-Clarke-Groves) — Mechanism achieving truthful revelation through payments based on externalities imposed on others.
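
The single-item case reduces to the Vickrey (second-price) auction, the simplest VCG instance: the winner pays the externality they impose on the others, i.e. the best losing bid. A minimal sketch:

```python
def vickrey_auction(bids):                 # bids: {bidder: reported value}
    winner = max(bids, key=bids.get)
    others = [v for b, v in bids.items() if b != winner]
    payment = max(others) if others else 0.0
    return winner, payment

winner, payment = vickrey_auction({"A": 10.0, "B": 7.0, "C": 4.0})
print(winner, payment)   # A wins and pays 7.0, so bidding one's true value is optimal
```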

Revelation Principle — Any outcome achievable by some mechanism can be achieved by a direct mechanism where truth-telling is optimal.

Information Rent — Profit extracted by agents due to private information. Cost of eliciting truthful reports.

Moral Hazard — When agents change behavior after agreements because their actions aren’t fully observable.

Adverse Selection — When agents with different (unobservable) characteristics self-select in ways that harm the principal.


AI Safety Terms

Alignment — Ensuring AI systems pursue intended goals and behave as desired by their operators.

Scheming — AI system strategically deceiving operators to achieve goals different from stated objectives.

Instrumental Convergence — Tendency for diverse goals to produce similar instrumental sub-goals (self-preservation, resource acquisition, goal preservation).

Corrigibility — Property of an AI system that allows and assists human correction of its goals and behavior.

CAIS (Comprehensive AI Services) — AI safety approach using narrow, task-specific AI services rather than general agents. Related: tool AI vs. agent AI.

AI Control — Approach focusing on monitoring and controlling AI systems rather than ensuring alignment. Assumes potential adversarial behavior.

Scalable Oversight — Methods for humans to effectively oversee AI systems as they become more capable than humans in various domains.

Red Teaming — Adversarial testing where evaluators attempt to find vulnerabilities or elicit harmful behaviors.


System Architecture Terms

Decomposed Coordination — Breaking coordination into multiple limited components rather than one powerful coordinator. Prevents risk concentration.

Verifiability Hierarchy — Ordering of implementation methods by how thoroughly they can be verified: (1) Formally verified code, (2) Regular tested code, (3) Fine-tuned narrow models, (4) Constrained general models, (5) Base LLMs, (6) RL systems.

Information Bottleneck — Architectural constraint limiting information flow between components. Forces compression, limits coordination.

Byzantine Fault Tolerance — System property where correct operation continues despite some components behaving arbitrarily (including maliciously). Requires 3f+1 components to tolerate f Byzantine faults.
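
The replica arithmetic in a few lines, using the quorum size of PBFT-style protocols as an illustration.

```python
def bft_sizes(f):
    n = 3 * f + 1        # replicas needed to tolerate f Byzantine faults
    quorum = 2 * f + 1   # any two quorums overlap in f+1 replicas, so at least one honest one
    return n, quorum

print(bft_sizes(1))      # (4, 3): four replicas tolerate one Byzantine fault
```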

N-Version Programming — Using multiple independent implementations for redundancy. Different architectures, different training, different teams.

Canary Deployment — Gradual rollout where new version serves small fraction of traffic while monitoring for problems.

Circuit Breaker — Automatic mechanism that stops operation when anomalies detected. Fail-closed design.


Mathematical Notation

| Symbol | Meaning |
| --- | --- |
| DR | Delegation Risk |
| DE | Delegation Exposure (set of harm modes) |
| P(·) | Probability of event/harm mode |
| D(·) | Damage from harm mode |
| Σ | Sum over all harm modes/components |
| w_ij | Trust weight from component i to j |
| π | Prior probability (Bayesian) |
| λ | Decay rate |
| δ | Discount factor |
| HHI | Herfindahl-Hirschman Index (concentration measure) |
| AS | Agency Score (coherence of goal-pursuit) |
| PS | Power Score (ability to achieve diverse goals) |
| U | Utility function |
| 𝒢 | Space of goal functions |
| 𝒜 | Action space |
| RACAP | Risk-Adjusted Capability |

Quick Reference: Typical Values

| Metric | Low Risk | Medium Risk | High Risk |
| --- | --- | --- | --- |
| Component Delegation Risk | < $100/mo | $100-500/mo | > $500/mo |
| System Delegation Risk | < $1,000/mo | $1,000-5,000/mo | > $5,000/mo |
| Harm mode probability | < 10⁻³ | 10⁻³ to 10⁻² | > 10⁻² |
| Nuclear CDF target | - | - | ~10⁻⁵/year |
| Aviation target | - | - | ~10⁻⁹/hour |