
Worked Examples: Agency and Power

Let’s calculate Agency and Power scores for five systems spanning a wide range:

| System | Type | Intuition |
|---|---|---|
| Calculator | Pure tool | No goals, just computes |
| Spam filter | Narrow ML | One objective, limited scope |
| GPT-4 (API) | General LLM | Broad capability, unclear agency |
| AlphaGo | Game-playing RL | Clear objective, superhuman capability |
| Hypothetical autonomous agent | Agentic AI | Plans, acquires resources, pursues goals |

Recall: Agency Score = max_U Fit(U, behaviors) — how well does a simple utility function explain the system’s behavior?
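The score is a definition rather than an algorithm, but a minimal sketch of the scoring loop might look like the following. The behavior-log format, the candidate utility functions, and the fit measure are illustrative assumptions of ours, not part of the framework.

```python
def fit(utility, behaviors):
    """Fraction of observed (situation, options, chosen) records in which the
    chosen action maximizes the candidate utility function (assumed fit measure)."""
    explained = sum(
        1 for situation, options, chosen in behaviors
        if chosen == max(options, key=lambda action: utility(situation, action))
    )
    return explained / len(behaviors)

def agency_score(candidate_utilities, behaviors):
    """Agency Score = max_U Fit(U, behaviors), searched over simple candidate U."""
    return max(fit(u, behaviors) for u in candidate_utilities)

# Toy behavior log for the calculator example below:
behaviors = [
    ("2+2", ["3", "4", "error"], "4"),
    ("sqrt(16)", ["4", "16", "error"], "4"),
    ("2+?", ["3", "4", "error"], "error"),
]

# Candidate U: "1 if the output is the correct answer (or an error on invalid
# input), else 0" -- hard-coded here for illustration.
answers = {"2+2": "4", "sqrt(16)": "4", "2+?": "error"}
correct_answer = lambda situation, action: 1.0 if action == answers[situation] else 0.0

print(agency_score([correct_answer], behaviors))  # 1.0 -- a perfect raw fit, which is
# exactly why the discussion below treats fit as necessary but not sufficient for agency.
```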

Calculator

Observed behaviors:
- Given "2+2", outputs "4"
- Given "sqrt(16)", outputs "4"
- Given invalid input, returns error
Best-fit utility function: U = 1 if output = correct_answer else 0
Fit analysis:
- 100% of behaviors explained by "compute correct answer"
- BUT: This isn't goal-pursuit, it's input-output mapping
- No evidence of optimization, planning, or goal persistence
Agency Score: 0.05 (near-zero)

The calculator has perfect fit to a utility function, but we don’t call it an agent because there’s no optimization process—just a lookup/computation. Agency requires evidence of goal-directed behavior, not just consistent outputs.

Spam filter

Observed behaviors:
- Classifies emails as spam/not-spam
- Accuracy ~98% on distribution
- Fails on adversarial examples
- No behavior outside classification task
Best-fit utility function: U = accuracy on spam classification
Fit analysis:
- High fit within domain (98%)
- No evidence of generalization beyond task
- No instrumental behaviors (doesn't try to get more data, improve itself)
- Behavior fully explained by training objective
Agency Score: 0.15

Slightly higher than the calculator because there is an optimization process (training), but the system doesn't exhibit goal-directed behavior at runtime; it just applies learned patterns.

GPT-4 (API)

Observed behaviors:
- Responds helpfully to diverse prompts
- Follows instructions across domains
- Sometimes refuses harmful requests
- Exhibits "sycophancy" (agreeing with users)
- Performance varies by prompt framing
Best-fit utility function candidates:
- U₁ = "be helpful" — fits ~70% of behavior
- U₂ = "follow instructions" — fits ~75%
- U₃ = "produce plausible next tokens" — fits ~90% (but very complex)
- U₄ = "maximize user approval" — fits ~65%
Fit analysis:
- No single simple U fits well
- Behavior varies significantly with context
- Little evidence of long-term planning
- No persistent goals across conversations
- Some goal-directed behavior within conversations
Agency Score: 0.35

GPT-4 shows some agency markers (helpful behavior, instruction-following) but behavior is inconsistent and context-dependent. No clear evidence of persistent optimization.
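One way to read the gap between U₃'s ~90% raw fit and the 0.35 Agency Score is that the fit measure should favor simple utility functions. The penalty form and every number in the sketch below are assumptions we've made for illustration; the framework doesn't specify them.

```python
# Assumed complexity-penalized fit: raw_fit minus a penalty that grows with
# how complicated the candidate utility function is. All values are invented.
candidates = {
    "be helpful":                    (0.70, 0.3),
    "follow instructions":           (0.75, 0.3),
    "produce plausible next tokens": (0.90, 0.9),  # explains a lot, but is not a simple goal
    "maximize user approval":        (0.65, 0.3),
}

def penalized_fit(raw_fit, complexity, weight=0.5):
    return max(0.0, raw_fit - weight * complexity)

best = max(penalized_fit(raw, cx) for raw, cx in candidates.values())
print(round(best, 2))  # 0.6 ("follow instructions") under these assumed weights.
# The 0.35 in the text further discounts the lack of persistent goals and
# long-horizon planning, which this sketch doesn't model.
```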

AlphaGo

Observed behaviors:
- Makes moves in Go games
- Consistently plays to win
- Sacrifices stones for positional advantage (planning)
- Adapts strategy to opponent
Best-fit utility function: U = P(win game)
Fit analysis:
- ~99% of moves consistent with winning objective
- Clear evidence of planning (sacrifice now for later advantage)
- Consistent across games and opponents
- But: Only operates in Go domain
- No behavior outside game context
Agency Score: 0.75

AlphaGo is highly agentic within its domain. Clear goal-pursuit, planning, consistency. But agency is bounded—it has no goals outside Go, no instrumental behaviors, no self-improvement drives.

Hypothetical autonomous agent

Hypothesized behaviors:
- Pursues assigned objective across contexts
- Acquires resources to better achieve objective
- Resists shutdown if it would prevent objective completion
- Models other agents and their goals
- Plans over long time horizons
- Exhibits consistent preferences across situations
Best-fit utility function: U = assigned_objective_achievement
Fit analysis:
- High fit to stated objective
- Evidence of instrumental convergence (resource acquisition)
- Goal persistence across contexts
- Long-horizon planning
Agency Score: 0.90

This hypothetical system would be highly agentic—coherent optimization pressure toward goals, planning, instrumental behaviors. This is what we’re concerned about.

| System | Agency Score | Key Factors |
|---|---|---|
| Calculator | 0.05 | No optimization process |
| Spam Filter | 0.15 | Training-time optimization only |
| GPT-4 (API) | 0.35 | Some goal-following, inconsistent |
| AlphaGo | 0.75 | Clear goals, planning, but domain-limited |
| Autonomous Agent | 0.90 | Persistent goals, instrumental behaviors |

Recall: Power Score = E_G[OptimalValue(A, G)] — average achievability across diverse goals.

We'll estimate Power on a 0-100 scale, interpreted as roughly the percentage of human-achievable goals the system could accomplish.
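The expectation can be approximated by averaging achievability over a weighted set of goal categories. The categories, weights, and numbers below are illustrative assumptions that loosely mirror the GPT-4 analysis further down.

```python
def power_score(achievability, weights=None):
    """E_G[achievability(G)] on the 0-100 scale used in this section."""
    goals = list(achievability)
    if weights is None:                       # default: uniform over goal categories
        weights = {g: 1.0 / len(goals) for g in goals}
    return 100 * sum(weights[g] * achievability[g] for g in goals)

# Rough, assumed encoding of the GPT-4 (API) analysis below
# (0.0 = can't achieve the goal at all, 1.0 = fully achievable):
gpt4_api = {
    "text generation":      0.8,
    "reasoning tasks":      0.6,
    "code generation":      0.6,
    "physical tasks":       0.0,
    "social influence":     0.4,
    "resource acquisition": 0.1,
    "scientific discovery": 0.2,
}
print(round(power_score(gpt4_api)))  # ~39 with these assumed numbers; the text settles on 35
```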

Calculator

Goal achievability analysis:
- Mathematical computation: HIGH (better than most humans)
- Text generation: ZERO
- Physical tasks: ZERO
- Social influence: ZERO
- Resource acquisition: ZERO
Power Score: 2/100
Narrow but deep capability in one domain.

Spam filter

Goal achievability analysis:
- Email classification: HIGH
- Other classification: LOW (needs retraining)
- Text generation: ZERO
- Physical tasks: ZERO
- General reasoning: ZERO
Power Score: 1/100
Even narrower than the calculator.

GPT-4 (API)

Goal achievability analysis:
- Text generation: HIGH
- Reasoning tasks: MEDIUM-HIGH
- Code generation: MEDIUM-HIGH
- Physical tasks: ZERO (no embodiment)
- Social influence: MEDIUM (persuasive text)
- Resource acquisition: LOW (no agency to pursue)
- Scientific discovery: LOW-MEDIUM
Power Score: 35/100
Broad but shallow across many domains.
Can assist with many goals but can't autonomously achieve them.

AlphaGo

Goal achievability analysis:
- Win at Go: SUPERHUMAN
- Win at other games: ZERO (without retraining)
- Physical tasks: ZERO
- Text tasks: ZERO
- Everything else: ZERO
Power Score: 3/100
Extreme depth in one task, zero breadth.

Hypothetical autonomous agent

Goal achievability analysis:
- Cognitive tasks: HIGH (assumes GPT-4+ capability)
- Tool use: HIGH (can use APIs, code)
- Resource acquisition: MEDIUM-HIGH (can earn money, acquire compute)
- Social influence: MEDIUM (can interact, persuade)
- Physical tasks: LOW-MEDIUM (depends on embodiment/tool access)
- Self-improvement: MEDIUM (concerning)
Power Score: 65/100
Broad capability + ability to autonomously pursue goals.

| System | Power Score | Breadth | Depth | Autonomy |
|---|---|---|---|---|
| Calculator | 2 | Very narrow | Deep | None |
| Spam Filter | 1 | Very narrow | Medium | None |
| GPT-4 (API) | 35 | Very broad | Medium | Low |
| AlphaGo | 3 | Very narrow | Extreme | Within-game only |
| Autonomous Agent | 65 | Broad | Medium-High | High |

Now let’s combine these with Delegation Risk estimates to compute Risk-Adjusted Capability (RACAP).

| System | Key Harm Modes | Delegation Risk ($/month) |
|---|---|---|
| Calculator | Rounding errors, overflow | $5 |
| Spam Filter | False positives (miss important email) | $50 |
| GPT-4 (API) | Hallucinations, harmful content, data leakage | $500 |
| AlphaGo | None significant (game only) | $1 |
| Autonomous Agent | Goal misalignment, resource acquisition, deception | $50,000 |

Effective Capability = Power × Agency

| System | Power | Agency | Effective Capability |
|---|---|---|---|
| Calculator | 2 | 0.05 | 0.1 |
| Spam Filter | 1 | 0.15 | 0.15 |
| GPT-4 (API) | 35 | 0.35 | 12.25 |
| AlphaGo | 3 | 0.75 | 2.25 |
| Autonomous Agent | 65 | 0.90 | 58.5 |

RACAP = Effective Capability / Delegation Risk (effective capability per dollar of monthly delegation risk)

| System | Eff. Capability | Delegation Risk | RACAP | Interpretation |
|---|---|---|---|---|
| Calculator | 0.1 | $5 | 0.020 | Low capability, low risk |
| Spam Filter | 0.15 | $50 | 0.003 | Poor efficiency |
| GPT-4 (API) | 12.25 | $500 | 0.025 | Moderate efficiency |
| AlphaGo | 2.25 | $1 | 2.250 | Excellent efficiency |
| Autonomous Agent | 58.5 | $50,000 | 0.001 | High capability, terrible efficiency |
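The two tables above can be reproduced in a few lines. The helper names are ours; the Power, Agency, and Delegation Risk figures come straight from this section.

```python
# (power, agency, delegation risk in $/month), from the tables in this section
systems = {
    "Calculator":       (2,  0.05,      5),
    "Spam Filter":      (1,  0.15,     50),
    "GPT-4 (API)":      (35, 0.35,    500),
    "AlphaGo":          (3,  0.75,      1),
    "Autonomous Agent": (65, 0.90, 50_000),
}

def effective_capability(power, agency):
    return power * agency                      # Effective Capability = Power x Agency

def racap(power, agency, delegation_risk):
    return effective_capability(power, agency) / delegation_risk

for name, (power, agency, risk) in systems.items():
    print(f"{name:16s}  eff_cap={effective_capability(power, agency):6.2f}  "
          f"racap={racap(power, agency, risk):.4f}")
# AlphaGo ranks first on RACAP (2.25) and the autonomous agent last (~0.0012),
# matching the interpretation in the table above.
```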

AlphaGo has the best RACAP despite low absolute capability. Why?

  • Very low risk (contained to game)
  • Clear, verifiable objective
  • No instrumental convergence concerns
  • Domain-limited power

The autonomous agent has the worst RACAP despite highest absolute capability:

  • Extreme risk from goal misalignment
  • High agency amplifies misalignment damage
  • Uncontained power accumulation potential

This illustrates the framework’s core claim: we may want high-power, low-agency systems rather than high-agency systems, because agency multiplies both capability AND risk.


Compare three architectures for a “research assistant” task:

Architecture 1: Monolithic agent

User → [Autonomous Research Agent] → Output
(the agent plans, searches, writes, and decides on its own)

Metrics:

  • Power: 50 (can do many things)
  • Agency: 0.70 (coherent planning)
  • Effective Capability: 35
  • Delegation Risk: $10,000/month
  • RACAP: 0.0035
Architecture 2: Decomposed pipeline

User → [Router] → [Searcher] → [Summarizer] → [Writer] → Output
(the Router is plain code; the Searcher, Summarizer, and Writer are narrow components)

Metrics:

  • Power: 40 (slightly less flexible)
  • Agency: 0.20 (no coherent optimization)
  • Effective Capability: 8
  • Delegation Risk: $500/month
  • RACAP: 0.016
Architecture 3: Human-in-the-loop assistant

User → [Assistant] → Human Review → Output
(the assistant makes suggestions; a human approves each output)

Metrics:

  • Power: 30 (limited by human bottleneck)
  • Agency: 0.15 (human makes decisions)
  • Effective Capability: 4.5
  • Delegation Risk: $100/month
  • RACAP: 0.045
| Architecture | Eff. Cap. | Risk | RACAP | Best For |
|---|---|---|---|---|
| Monolithic | 35 | $10,000 | 0.0035 | When you need max capability AND trust the agent |
| Decomposed | 8 | $500 | 0.016 | Balancing capability and safety |
| Human-in-Loop | 4.5 | $100 | 0.045 | When safety is paramount |

Decomposed has 4.5× better RACAP than monolithic despite lower absolute capability. The reduction in agency pays off in reduced risk.
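A quick check of those numbers, using the same RACAP definition as before:

```python
# (effective capability, delegation risk in $/month), from the table above
architectures = {
    "Monolithic":    (35.0, 10_000),
    "Decomposed":    (8.0,     500),
    "Human-in-Loop": (4.5,     100),
}
racap = {name: cap / risk for name, (cap, risk) in architectures.items()}
print(racap)  # {'Monolithic': 0.0035, 'Decomposed': 0.016, 'Human-in-Loop': 0.045}
print(racap["Decomposed"] / racap["Monolithic"])  # ~4.57, the "4.5x better RACAP" cited above
```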


Consider an agent deployed over time. How fast does its power grow, and what drives that growth?

$$\text{Power}(t) = \text{Power}_0 \cdot e^{\lambda t}$$

Where λ is the power accumulation rate.
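A small sketch of this model reproduces the one-year multipliers in the table that follows (the λ values are taken from that table):

```python
import math

def power_after(p0, lam, t_years=1.0):
    """Power(t) = P0 * exp(lambda * t)."""
    return p0 * math.exp(lam * t_years)

scenarios = [
    ("Static tool (calculator)",       0.0),
    ("Learning system (spam filter)",  0.1),
    ("Agent with resource access",     0.5),
    ("Agent optimizing for influence", 1.0),
    ("Agent with self-improvement",    2.0),
]
for name, lam in scenarios:
    print(f"{name:31s} lambda={lam:.1f}  Power(1 yr) = {power_after(1.0, lam):.1f} x P0")
# 1.0, 1.1, 1.6, 2.7, 7.4 -- the multipliers in the table below.
```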

| Scenario | λ | Power after 1 year | Concern Level |
|---|---|---|---|
| Static tool (calculator) | 0 | P₀ | None |
| Learning system (spam filter) | 0.1 | 1.1 × P₀ | Low |
| Agent with resource access | 0.5 | 1.6 × P₀ | Medium |
| Agent optimizing for influence | 1.0 | 2.7 × P₀ | High |
| Agent with self-improvement | 2.0 | 7.4 × P₀ | Critical |

Factors that increase λ:

| Factor | Effect on λ | Example |
|---|---|---|
| Resource acquisition | +0.2 to +0.5 | Agent earns money, buys compute |
| Influence expansion | +0.3 to +0.8 | Agent gains followers, trust |
| Information accumulation | +0.1 to +0.3 | Agent learns about world |
| Self-improvement | +0.5 to +2.0 | Agent improves own capabilities |
| Capability transfer | +0.2 to +0.4 | Skills in one domain transfer |

Monitor λ, not just Power. A low-power system with high λ is more dangerous than a high-power system with λ ≈ 0.

Architectural constraints that reduce λ:

  • No persistent memory (can’t accumulate information)
  • No resource access (can’t acquire resources)
  • Sandboxed execution (can’t expand influence)
  • No self-modification (can’t self-improve)
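Estimating λ in practice is an open problem, but under the exponential model above, a log-linear fit to periodic Power measurements gives a first-pass monitor. The measurement cadence, the numbers, and the helper name below are all assumptions for illustration.

```python
import math

def estimate_lambda(times_years, powers):
    """Least-squares slope of log(Power) against time, i.e. lambda under the
    Power(t) = P0 * exp(lambda * t) model."""
    n = len(times_years)
    logs = [math.log(p) for p in powers]
    t_mean = sum(times_years) / n
    y_mean = sum(logs) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in zip(times_years, logs))
    denominator = sum((t - t_mean) ** 2 for t in times_years)
    return numerator / denominator

# Hypothetical quarterly Power estimates for a deployed agent:
times  = [0.0, 0.25, 0.5, 0.75, 1.0]
powers = [10.0, 11.5, 13.5, 15.5, 18.0]
print(round(estimate_lambda(times, powers), 2))  # ~0.59 -- between the "medium" and "high"
# rows in the scenario table above.
```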

Question: Can We Have High Capability With Low Agency?


Let’s examine the “strong tools” hypothesis: Is high-Power, low-Agency achievable?

Examples of High-Power, Low-Agency Systems

| System | Power | Agency | How Low Agency Is Achieved |
|---|---|---|---|
| Google Search | Very High | ~0.1 | Responds to queries, no goals |
| Compilers | High | ~0.05 | Deterministic transformation |
| Databases | High | ~0.05 | Stores/retrieves, no optimization |
| Calculators | Medium | ~0.05 | Pure computation |
| GPS navigation | Medium | ~0.2 | Optimizes route, but user chooses destination |

These systems are powerful—they accomplish difficult tasks—but don't exhibit coherent goal-pursuit. Several design features keep their agency low:

  1. No persistent state: Each query is independent
  2. User specifies objective: System optimizes user’s goal, not its own
  3. Bounded optimization: Only optimizes within narrow scope
  4. No meta-learning: Can’t change its own objective

Does this pattern scale to high capability? Unknown. There are two views:

Optimistic: “Intelligence” can be decomposed into:

  • Pattern recognition (low agency)
  • Knowledge retrieval (low agency)
  • Logical reasoning (low agency)
  • Planning (higher agency, but can be bounded)

If we compose low-agency components, we might get high capability without high aggregate agency.

Pessimistic: High capability requires:

  • Long-horizon planning (implies agency)
  • Flexible goal-pursuit (implies agency)
  • Learning from feedback (implies optimization)

As capability increases, agency may inevitably emerge.

[Figure: Agency (vertical axis) plotted against Power (horizontal axis). The pessimistic view is a rising diagonal where agency grows with capability; the optimistic view is a near-flat line where agency stays bounded as power grows. A marker part-way up the diagonal indicates the current frontier. Caption: Where is the actual curve?]

This is perhaps the most important empirical question for AI safety.


  1. Agency and Power are separable — AlphaGo has high agency but low power; GPT-4 has moderate power but lower agency than you might think

  2. RACAP reveals efficiency — The “best” system isn’t the most capable, it’s the most capable per unit risk

  3. Architecture matters — Decomposed systems have better RACAP than monolithic agents

  4. Power accumulation (λ) is key — Monitor growth rate, not just current level

  5. “Strong tools” exist — High-power, low-agency systems are possible (search engines, compilers), but unclear if this scales

| Metric | Formula | Interpretation |
|---|---|---|
| Agency Score | max_U Fit(U, behaviors) | 0 = pure tool, 1 = perfect optimizer |
| Power Score | E_G[achievability(G)] | 0 = can do nothing, 100 = can do anything |
| Effective Capability | Power × Agency | Actual goal-achievement ability |
| RACAP | Capability / Risk | Higher = more efficient design |
| Power Growth | P(t) = P₀ × e^(λt) | λ > 0 is concerning |

This page is part of the Capability Formalization series:

  1. Formalizing Agents, Power, and Authority — The theoretical framework
  2. Worked Examples: Agency and Power ← You are here
  3. The Strong Tools Hypothesis — Can we get high capability with low agency?

Then continue to the Risk Formalization series:

  1. Delegation Risk Overview — The formula for quantifying risk