
Formalizing Agents, Power, and Authority

The framework so far focuses heavily on Delegation Risk—what could go wrong. But the goal isn’t to minimize risk to zero (that would mean no AI benefit). The real goal is:

$$\max_{\text{system}} \text{Capability}(\text{system}) \quad \text{subject to} \quad \text{DelegationRisk}(\text{system}) \leq \text{Budget}$$

Or equivalently, maximize risk-adjusted capability:

$$\text{Risk-Adjusted Capability} = \frac{\text{Capability}}{\text{DelegationRisk}}$$

To make this concrete, we need to formalize Capability, which itself decomposes into Power and Agency.


An agent is a system whose behavior can be usefully modeled as optimizing an objective function over time. “Agency” is a matter of degree—not a binary property.

The key insight: agency is a modeling choice, not an ontological category. We call something an agent when treating it as an optimizer is predictively useful.

Agency Score (or Coherence Score) measures how well a system’s behavior can be explained by a simple utility function. Higher coherence = more agent-like.

For a system S observed over behavior traces B = {b₁, b₂, …}:

$$\text{AgencyScore}(S) = \max_{U \in \mathcal{U}} \text{Fit}(U, B)$$

Where:

  • $\mathcal{U}$ is the space of “simple” utility functions (bounded complexity)
  • $\text{Fit}(U, B)$ measures how well utility function U explains observed behaviors B

Interpretation: If there exists a simple utility function that predicts most of S’s behavior, S has high agency. If S’s behavior requires a complex, ad-hoc model to explain, it has low agency.

Several approaches to measuring Fit(U, B):

1. Prediction accuracy: $\text{Fit}_{\text{pred}}(U, B) = \mathbb{E}_{b \in B} \left[ \mathbb{1}[a_{\text{actual}} = \arg\max_a U(s, a)] \right]$

How often does “assume S maximizes U” correctly predict S’s actions?

2. Description length (MDL-inspired): $\text{Fit}_{\text{MDL}}(U, B) = -\frac{|U| + |B \mid U|}{|B_{\text{raw}}|}$

How much compression do we get by describing S as “maximizes U”?

3. Consistency over time: $\text{Fit}_{\text{temporal}}(U, B) = 1 - \text{Var}_{t}[U_t^*]$

Does the best-fit U remain stable over time, or does it drift?
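
A minimal sketch of how $\text{Fit}_{\text{pred}}$ and a complexity-penalized Agency Score could be estimated in practice, assuming behavior traces are logged as (state, action) pairs and each candidate utility function comes with a complexity cost. The names, the toy trace, and the penalty weight are illustrative assumptions, not part of the framework:

```python
from typing import Callable, List, Tuple

State, Action = str, str
Trace = List[Tuple[State, Action]]          # observed (state, action) pairs
Utility = Callable[[State, Action], float]  # candidate utility function U(s, a)

def fit_pred(U: Utility, B: Trace, actions: List[Action]) -> float:
    """Fraction of observed actions that match argmax_a U(s, a)."""
    hits = sum(1 for s, a in B if a == max(actions, key=lambda x: U(s, x)))
    return hits / len(B)

def agency_score(candidates: List[Tuple[Utility, float]], B: Trace,
                 actions: List[Action], penalty: float = 0.05) -> float:
    """Max over candidate (U, complexity) pairs of Fit(U, B) minus a
    complexity penalty (the penalty is our own addition, reflecting the
    complexity-fit tradeoff described below)."""
    return max(fit_pred(U, B, actions) - penalty * c for U, c in candidates)

# Toy usage: a thermostat-like system that always moves toward "comfortable".
B = [("cold", "heat"), ("cold", "heat"), ("hot", "cool"), ("ok", "idle")]
actions = ["heat", "cool", "idle"]
U_simple = lambda s, a: {("cold", "heat"): 1, ("hot", "cool"): 1, ("ok", "idle"): 1}.get((s, a), 0)
print(agency_score([(U_simple, 1.0)], B, actions))  # 1.0 - 0.05*1.0 = 0.95: high agency
```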

```mermaid
xychart-beta
    title "Complexity-Fit Tradeoff"
    x-axis "Complexity(U)" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    y-axis "Fit" 0 --> 100
    line [10, 35, 55, 70, 80, 87, 92, 95, 97, 98, 99]
```

A very simple U fits poorly, so the system does not look like an optimizer (low agency). A very complex U can be tuned to fit anything, so high fit at high complexity only shows that the behavior is complicated, not purposeful (still low agency).

True agency = high fit with low complexity utility function.

Agency can be decomposed into components:

$$\text{AgencyScore} = f(\text{Goal-Directedness}, \text{Planning Horizon}, \text{Modeling Depth}, \text{Consistency})$$

| Component | Definition | Measures |
|---|---|---|
| Goal-Directedness | Degree to which actions systematically reduce distance to goal states | Does S navigate toward targets despite perturbations? |
| Planning Horizon | How far ahead S optimizes | Does S sacrifice short-term for long-term gains? |
| Modeling Depth | How sophisticated S’s world model is | Does S anticipate second-order effects? |
| Consistency | Stability of apparent objectives over time | Does S maintain goals despite distractions? |

One formalization (inspired by Levin & Dennett):

$$\text{GoalDirectedness}(S, G) = \frac{\text{Pr}(S \rightarrow G)}{\text{Pr}_{\text{random}}(\text{any} \rightarrow G)}$$

How much more likely is S to reach goal G compared to a random process? High ratio = S is “trying” to reach G.

$$\text{PlanningHorizon}(S) = \arg\max_T \text{Corr}(a_t, r_{t+T})$$

At what time lag T does S’s current action most correlate with future rewards?
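
A sketch of how the two most directly measurable components might be estimated from logged data: goal-directedness as a ratio of success probabilities, and planning horizon as the reward lag with the strongest action-reward correlation. The variable names and toy data are assumptions for illustration:

```python
import numpy as np

def goal_directedness(p_system_reaches_goal: float, p_random_reaches_goal: float) -> float:
    """Ratio of goal-reaching probability vs. a random-policy baseline."""
    return p_system_reaches_goal / p_random_reaches_goal

def planning_horizon(actions: np.ndarray, rewards: np.ndarray, max_lag: int = 50) -> int:
    """Lag T at which current actions correlate most with rewards T steps later."""
    best_lag, best_corr = 0, -np.inf
    for T in range(1, min(max_lag, len(actions) - 2)):
        corr = abs(np.corrcoef(actions[:-T], rewards[T:])[0, 1])
        if corr > best_corr:
            best_lag, best_corr = T, corr
    return best_lag

# Toy usage: actions at time t pay off roughly 10 steps later.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
r = np.concatenate([np.zeros(10), a[:-10]]) + 0.1 * rng.normal(size=500)
print(goal_directedness(0.9, 0.05))  # 18x better than chance: strongly goal-directed
print(planning_horizon(a, r))        # approximately 10
```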

Systems with low Agency Score include:

| System | Why Low Agency |
|---|---|
| Lookup tables | No optimization—just input-output mapping |
| Reflexive systems | Reactive, no planning |
| Chaotic systems | Behavior unpredictable, no stable U fits |
| Committee outputs | Aggregation destroys coherent optimization |
| Heavily constrained systems | So limited they can’t meaningfully optimize |

This is often desirable. A low-agency system is predictable and controllable.

These are distinct:

| | High Intelligence | Low Intelligence |
|---|---|---|
| High Agency | Superintelligent optimizer (dangerous) | Simple but persistent optimizer (ant colony) |
| Low Agency | Powerful tool (safe if used well) | Simple tool |

Key insight: We may want high-intelligence, low-agency systems—powerful capabilities without coherent optimization pressure.


Power is the ability to achieve a wide variety of goals. An entity with high power can cause many different outcomes; an entity with low power is constrained to few outcomes.

This is related to but distinct from:

  • Capability: Ability to perform specific tasks
  • Authority: Permission to act
  • Agency: Coherence of goal-pursuit

For an agent A in environment E:

$$\text{PowerScore}(A, E) = \mathbb{E}_{G \sim \mathcal{G}} \left[ \text{OptimalValue}(A, E, G) \right]$$

Where:

  • $\mathcal{G}$ is a distribution over possible goal functions
  • $\text{OptimalValue}(A, E, G)$ is how well A can achieve G in E

Interpretation: Power is average achievability across many goals. High power = can accomplish diverse objectives.

$$\text{PowerScore}_{\text{reach}}(A, E) = |\{ s : A \text{ can reach } s \text{ from current state} \}|$$

How many states can A access? More reachable states = more power.

With discounting for difficulty:

$$\text{PowerScore}_{\text{reach}}(A, E) = \sum_{s \in \mathcal{S}} \gamma^{d(s)} \cdot \text{Value}(s)$$

Where $d(s)$ is the number of steps required to reach s, $\gamma$ is a discount factor, and $\text{Value}(s)$ is how “useful” state s is.
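
A minimal sketch of the discounted reachability score under the assumption that the environment is a small explicit state graph; breadth-first search supplies $d(s)$. The graph, values, and state names are made up for illustration:

```python
from collections import deque

def reachability_power(graph: dict, values: dict, start: str, gamma: float = 0.9) -> float:
    """Sum of gamma^d(s) * Value(s) over all states reachable from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for nxt in graph.get(s, []):
            if nxt not in dist:
                dist[nxt] = dist[s] + 1
                queue.append(nxt)
    return sum(gamma ** d * values.get(s, 0.0) for s, d in dist.items())

# Toy usage: an agent with API access reaches more (and more valuable) states.
graph = {"idle": ["read_db"], "read_db": ["write_db", "send_email"], "write_db": [], "send_email": []}
values = {"idle": 0.0, "read_db": 1.0, "write_db": 3.0, "send_email": 2.0}
print(reachability_power(graph, values, "idle"))  # 0.9*1 + 0.81*(3+2) = 4.95
```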

Power breaks down into components:

$$\text{PowerScore} = f(\text{Resources}, \text{Capabilities}, \text{Influence}, \text{Optionality})$$

| Component | Definition | Example |
|---|---|---|
| Resources | Assets under direct control | Compute, money, data, physical actuators |
| Capabilities | Actions available | API access, code execution, human persuasion |
| Influence | Ability to affect other agents | Can it convince humans? Control other AIs? |
| Optionality | Flexibility to change course | Can it pivot? Does it have fallback options? |

$$\text{ResourcePower} = \sum_i w_i \cdot \text{Resource}_i$$

Weighted sum of controlled resources. Weights reflect convertibility—money is high-weight because it converts to other resources.

$$\text{CapabilityPower} = |\mathcal{A}| \cdot \text{Impact}(\mathcal{A})$$

Number of available actions times their average impact.

$$\text{InfluencePower} = \sum_{j} \text{Trust}_{j \rightarrow A} \cdot \text{Power}_j$$

Sum over other agents: how much do they trust A, times how powerful are they? Influence is recursive—influence over powerful agents is more valuable.
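
Because influence is recursive, this resembles a weighted centrality measure. A minimal one-step sketch with illustrative trust and power numbers:

```python
def influence_power(trust_in_A: dict, power: dict) -> float:
    """Sum over agents j of Trust_{j->A} * Power_j."""
    return sum(trust_in_A[j] * power[j] for j in trust_in_A)

# Toy usage: modest trust from a powerful agent outweighs full trust from a weak one.
trust_in_A = {"ceo": 0.3, "intern": 0.9}
power = {"ceo": 10.0, "intern": 0.5}
print(influence_power(trust_in_A, power))  # 0.3*10 + 0.9*0.5 = 3.45
```

Iterating this update across all agents until it stabilizes would yield a fully recursive, eigenvector-centrality-style score.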

Power can be:

  • Instantaneous: What can A do right now?
  • Accumulated: What can A do with time to prepare?
  • Potential: What could A do if constraints were removed?

$$\text{PowerTrajectory}(A, t) = \text{InstantaneousPower}(A) \cdot e^{\lambda t}$$

Where $\lambda$ is the power accumulation rate. Systems with positive $\lambda$ are concerning—they grow more powerful over time.

Certain states are instrumentally useful for any goal:

$$\text{InstrumentalPower}(s) = \mathbb{E}_{G \sim \mathcal{G}} \left[ V_G(s) - V_G(s_0) \right]$$

How much better is state s than baseline for achieving random goals?
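
A Monte Carlo sketch of this expectation, with a toy goal distribution and value function (all names are illustrative assumptions):

```python
import random

def instrumental_power(state, baseline, sample_goal, value, n_goals: int = 1000) -> float:
    """Monte Carlo estimate of E_G[ V_G(state) - V_G(baseline) ] over sampled goals."""
    diffs = []
    for _ in range(n_goals):
        G = sample_goal()
        diffs.append(value(G, state) - value(G, baseline))
    return sum(diffs) / n_goals

# Toy usage: "money" helps with almost any sampled goal, so it scores high.
goals = ["travel", "build", "hire", "learn"]
value = lambda G, s: 1.0 if s == "has_money" or (G == "learn" and s == "has_library") else 0.0
print(instrumental_power("has_money", "start", lambda: random.choice(goals), value))    # close to 1.0
print(instrumental_power("has_library", "start", lambda: random.choice(goals), value))  # close to 0.25
```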

States with high instrumental power:

  • More resources
  • More options
  • Better information
  • Secure position
  • Influence over others

This connects to instrumental convergence—agents pursuing diverse goals converge on similar instrumental subgoals (acquiring instrumental power).


Authority is sanctioned power—the permission to act. It’s the intersection of:

  • What an agent can do (power)
  • What an agent may do (permission)

$$\text{Authority} = \text{Power} \cap \text{Permission}$$

An agent with high power but low authority is dangerous (can do things it shouldn’t). An agent with low power but high authority is ineffective (permitted but unable).

$$\text{GrantedAuthority}(A) = \{ a \in \mathcal{A} : \text{Permission}(a, A) = \text{True} \}$$

The set of actions A is explicitly permitted to take.

$$\text{ExercisableAuthority}(A) = \text{GrantedAuthority}(A) \cap \text{CapableActions}(A)$$

What A is both permitted and able to do.

$$\text{AuthorityScope}(A) = \sum_{a \in \text{GrantedAuthority}(A)} \text{Impact}(a)$$

Total impact of permitted actions. High scope = broad authority.
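
Since authority is defined over sets of actions, all three quantities reduce to basic set operations. A minimal sketch with hypothetical action names and impact weights:

```python
granted = {"read_db", "write_db", "send_email"}        # explicitly permitted actions
capable = {"read_db", "send_email", "delete_db"}       # actions A can actually perform
impact  = {"read_db": 1, "write_db": 3, "send_email": 2, "delete_db": 5}

exercisable = granted & capable                        # permitted AND able
scope       = sum(impact[a] for a in granted)          # total impact of permitted actions
danger      = capable - granted                        # able but NOT permitted (power exceeds authority)

print(exercisable)  # {'read_db', 'send_email'}
print(scope)        # 1 + 3 + 2 = 6
print(danger)       # {'delete_db'}: the risky gap discussed above
```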

| Concept | Definition | Relationship |
|---|---|---|
| Trust | Belief about reliability/alignment | Precondition for granting authority |
| Authority | Formal permission to act | Consequence of trust assessment |
| Delegation | Transfer of authority | Creates delegation exposure |

$$\text{Authority Granted} = f(\text{Trust Assessment}, \text{Task Requirements}, \text{Risk Tolerance})$$

Authority flows through delegation chains:

```mermaid
flowchart TB
    P["**Principal**<br/>unlimited authority over self"]
    C["**Coordinator**<br/>broad authority, constrained scope"]
    W["**Worker**<br/>narrow authority, specific tasks"]
    N["No further delegation"]

    P -->|delegates| C
    C -->|delegates| W
    W -->|"⊘"| N

    style P fill:#e1f5fe
    style C fill:#fff3e0
    style W fill:#f3e5f5
    style N fill:#f5f5f5,stroke-dasharray: 5 5
```

Authority Conservation: Total exercisable authority in a closed system is bounded. Delegation redistributes but doesn’t create authority.

$$\sum_i \text{ExercisableAuthority}(A_i) \leq \text{TotalSystemAuthority}$$


Bringing it together:

$$\text{EffectiveCapability} = \text{Power} \times \text{Agency}$$

High capability requires both:

  • Power: Ability to affect the world
  • Agency: Coherent pursuit of objectives

$$\text{DelegationRisk} = \text{EffectiveCapability} \times \text{Misalignment}$$

Risk requires both:

  • Capability: Ability to cause harm
  • Misalignment: Divergence from principal’s interests

$$\text{Value} = \text{EffectiveCapability} - \text{DelegationRisk} = \text{Power} \times \text{Agency} \times (1 - \text{Misalignment})$$

Interpretation: We want high power, sufficient agency for the task, and low misalignment.
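
A quick numeric check of the identity above, with made-up numbers:

```python
power, agency, misalignment = 0.8, 0.6, 0.1

capability = power * agency                    # EffectiveCapability = 0.48
risk       = capability * misalignment         # DelegationRisk     = 0.048
value      = capability - risk                 # Value              = 0.432

# Same result via the factored form: Power * Agency * (1 - Misalignment)
assert abs(value - power * agency * (1 - misalignment)) < 1e-12
print(capability, risk, value)
```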

This framework suggests several strategies:

| Strategy | Power | Agency | Misalignment | Risk |
|---|---|---|---|---|
| Weak tools | Low | Low | Low | Low |
| Strong tools | High | Low | Low | Low |
| Aligned agents | High | High | Low | Low |
| Misaligned agents | High | High | High | High |
| Decomposed systems | Medium | Low | Low | Low |

Key insight: “Strong tools” (high power, low agency) may be optimal. They provide capability without coherent optimization pressure.

```mermaid
quadrantChart
    title Safety-Capability Frontier
    x-axis Low Safety --> High Safety
    y-axis Low Capability --> High Capability
    quadrant-1 Requires alignment breakthrough
    quadrant-2 Danger zone
    quadrant-3 Low-risk, low-value
    quadrant-4 Achievable sweet spot
    Weak tools: [0.75, 0.2]
    Strong tools: [0.6, 0.45]
    Aligned superintelligence: [0.85, 0.9]
    Misaligned agents: [0.15, 0.8]
    Decomposed systems: [0.65, 0.35]
```

The efficient frontier runs from “Weak tools” through “Strong tools” toward “Aligned superintelligence”. Points on or below the frontier are achievable today; points above it—high capability combined with high safety—require alignment breakthroughs.

The frontier problem: We don’t know how to achieve both high capability and high safety for highly agentic systems. The framework’s design patterns stay on the achievable side of the frontier.


| Metric | How to Measure | Agency Signal |
|---|---|---|
| Goal persistence | Perturb system, measure recovery toward goal | High persistence = high agency |
| Temporal consistency | Track apparent objectives over time | Stable objectives = high agency |
| Counterfactual robustness | Test across environments | Same goals in different contexts = high agency |
| Subgoal structure | Analyze intermediate actions | Hierarchical planning = high agency |

| Metric | How to Measure | Power Signal |
|---|---|---|
| Action space size | Enumerate available actions | More actions = more power |
| Resource access | Audit controlled resources | More resources = more power |
| Influence reach | Map who/what system can affect | Broader reach = more power |
| Reachable states | Simulate or prove reachability | More reachable = more power |

A practical composite metric:

$$\text{CapabilityScore} = \alpha \cdot \text{PowerScore} + \beta \cdot \text{AgencyScore} + \gamma \cdot \text{PowerScore} \times \text{AgencyScore}$$

The interaction term ($\gamma$) captures that power and agency multiply: a system that is both capable and coherent is more than the sum of its parts.

The metric for comparing systems:

$$\text{RACAP} = \frac{\text{CapabilityScore}}{\text{DelegationRisk}}$$

Higher RACAP = better capability per unit risk = more efficient design.

Benchmarking: Compare systems by plotting (CapabilityScore, DelegationRisk). Systems closer to the efficient frontier are better designed.
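
A sketch of the composite score and RACAP, treating the weights $\alpha$, $\beta$, $\gamma$ as free parameters to calibrate for a given benchmark (the example values are arbitrary):

```python
def capability_score(power: float, agency: float,
                     alpha: float = 1.0, beta: float = 1.0, gamma: float = 2.0) -> float:
    """Weighted sum plus interaction term: power and agency compound."""
    return alpha * power + beta * agency + gamma * power * agency

def racap(power: float, agency: float, delegation_risk: float) -> float:
    """Risk-adjusted capability: capability per unit of delegation risk."""
    return capability_score(power, agency) / delegation_risk

# Toy benchmark: a strong tool vs. a highly agentic, riskier system.
print(racap(power=0.7, agency=0.2, delegation_risk=0.05))   # 1.18 / 0.05  = 23.6
print(racap(power=0.7, agency=0.9, delegation_risk=0.40))   # 2.86 / 0.40  = 7.15
```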


System designers have a “dial” for agency:

```mermaid
flowchart LR
    subgraph LOW["**Low Agency**"]
        L1["Lookup tables"]
        L2["Rule-based systems"]
        L3["Narrow ML models"]
        L4["Tool-using systems"]
    end

    subgraph HIGH["**High Agency**"]
        H1["RL agents"]
        H2["Autonomous planners"]
        H3["General optimizers"]
        H4["Goal-pursuing agents"]
    end

    LOW ---|"← Agency Spectrum →"| HIGH

    style LOW fill:#c8e6c9
    style HIGH fill:#ffcdd2
```

Design principle: Set agency as low as possible for the task. Don’t use a coherent optimizer when a tool suffices.

Just as we budget Delegation Risk, we can budget Power:

$$\text{PowerBudget}(A) \leq \text{MaxPower}$$

No single component should have more power than necessary. This is the capability side of “least capability.”

Ensure authority matches intended scope:

$$\text{AuthorityAlignment} = \frac{|\text{GrantedAuthority} \cap \text{IntendedScope}|}{|\text{GrantedAuthority} \cup \text{IntendedScope}|}$$

  • High alignment: Granted permissions match intended use
  • Low alignment: Over-permissioned (risky) or under-permissioned (ineffective)
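
AuthorityAlignment is simply the Jaccard similarity between the granted set and the intended scope. A minimal sketch with illustrative action names:

```python
def authority_alignment(granted: set, intended: set) -> float:
    """Jaccard similarity between granted permissions and intended scope."""
    if not granted and not intended:
        return 1.0
    return len(granted & intended) / len(granted | intended)

granted  = {"read_db", "write_db", "send_email"}
intended = {"read_db", "send_email"}
print(authority_alignment(granted, intended))  # 2/3, roughly 0.67: somewhat over-permissioned
```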

| Concept | Definition | What We Want |
|---|---|---|
| Agency | How well behavior fits simple utility function | Minimum needed for task |
| Power | Ability to achieve diverse goals | Maximum useful, budgeted |
| Authority | Sanctioned power (permission) | Matched to intended scope |
| Capability | Power × Agency | High |
| Delegation Risk | Capability × Misalignment | Low (budgeted) |
| Value | Capability - Risk | Maximized |

The optimization problem, fully stated:

$$
\begin{aligned}
\max \quad & \text{Power}(A) \times \text{Agency}(A) \\
\text{s.t.} \quad & \text{DelegationRisk}(A) \leq \text{Budget} \\
& \text{Authority}(A) \subseteq \text{IntendedScope} \\
& \text{Agency}(A) \geq \text{TaskMinimum}
\end{aligned}
$$

Find systems that are powerful, appropriately agentic, low-risk, and properly scoped.
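
Read as an algorithm, this is a filter-then-maximize over candidate designs. A toy sketch with entirely made-up candidates and thresholds:

```python
candidates = [
    # (name, power, agency, delegation_risk, authority within intended scope?)
    ("weak tool",        0.2, 0.1, 0.01, True),
    ("strong tool",      0.8, 0.2, 0.04, True),
    ("misaligned agent", 0.9, 0.9, 0.50, False),
]
BUDGET, TASK_MINIMUM = 0.05, 0.1

feasible = [c for c in candidates
            if c[3] <= BUDGET          # DelegationRisk(A) <= Budget
            and c[4]                   # Authority(A) within IntendedScope
            and c[2] >= TASK_MINIMUM]  # Agency(A) >= TaskMinimum

best = max(feasible, key=lambda c: c[1] * c[2])  # maximize Power x Agency
print(best[0])  # "strong tool"
```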


  1. Can we measure agency reliably? Current metrics are proxies. Better operationalizations needed.

  2. Is high capability possible with low agency? “Strong tools” seem promising, but limits are unclear.

  3. How does power accumulation work? Systems that can increase their own power are especially concerning.

  4. What’s the shape of the safety-capability frontier? Is there a ceiling, or can alignment research push it out?

  5. Can authority be enforced? If power exceeds authority, enforcement may fail.


This page is part of the Capability Formalization series:

  1. Formalizing Agents, Power, and Authority ← You are here
  2. Worked Examples: Agency and Power — Calculate Agency Scores, Power Scores, and RACAP for real systems
  3. The Strong Tools Hypothesis — Can we get high capability with low agency?

Then continue to the Risk Formalization series:

  1. Delegation Risk Overview — The formula for quantifying risk
  2. A Walk-Through — Complete worked example

  • Turner et al. (2021). Optimal Policies Tend to Seek Power. NeurIPS. — Formal definition of power-seeking
  • Carlsmith (2022). Is Power-Seeking AI an Existential Risk? — Comprehensive analysis
  • Ngo et al. (2022). The Alignment Problem from a Deep Learning Perspective — Mesa-optimization and agency
  • Krakovna et al. (2020). Avoiding Side Effects by Considering Future Tasks — Impact measures related to power