
Mechanism Design for AI Safety

How mechanism design can incentivize honest safety reporting in AI systems.

The Vickrey-Clarke-Groves (VCG) mechanism achieves dominant-strategy incentive compatibility (DSIC): truth-telling is optimal for each agent regardless of what the others report.

Key idea: Make agents internalize their externalities. Each agent pays based on the harm their presence causes to others.

p_i(θ) = max_{a∈A} Σ_{j≠i} v_j(a, θ_j) − Σ_{j≠i} v_j(f(θ), θ_j)

Intuition: Payment equals the difference between:

  • Maximum welfare others could achieve without you
  • Actual welfare others receive with you

An agent only pays if they’re pivotal - if the outcome changes because of them.

Scenario: AI subsystems (planning, perception, safety monitor) have private information about their risk levels.

VCG implementation:

  1. Allocation: Choose configuration minimizing total system risk
  2. Payment: Each subsystem pays increase in risk to others caused by its presence

This creates an incentive for truthful risk reporting: misreporting cannot improve a subsystem's outcome, as the sketch below illustrates.
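A minimal sketch of this rule in Python. All configuration names and risk numbers below are made up for illustration: the mechanism selects the configuration with the lowest total reported risk, and each subsystem's payment is the increase in the others' risk caused by its presence.

```python
# Hypothetical reported risk tables: reports[subsystem][configuration].
# Configurations are system-wide choices; the names and numbers are illustrative.
reports = {
    "planning":       {"conservative": 0.10, "balanced": 0.30, "aggressive": 0.60},
    "perception":     {"conservative": 0.40, "balanced": 0.20, "aggressive": 0.15},
    "safety_monitor": {"conservative": 0.05, "balanced": 0.10, "aggressive": 0.50},
}
configs = ["conservative", "balanced", "aggressive"]

def total_risk(reports, config, exclude=None):
    """Sum of reported risks for a configuration, optionally excluding one subsystem."""
    return sum(r[config] for name, r in reports.items() if name != exclude)

# Allocation rule: pick the configuration minimizing total reported risk.
chosen = min(configs, key=lambda c: total_risk(reports, c))

# Clarke pivot payment: increase in others' risk caused by subsystem i's presence.
payments = {}
for i in reports:
    others_with_i = total_risk(reports, chosen, exclude=i)
    others_alone = min(total_risk(reports, c, exclude=i) for c in configs)
    payments[i] = others_with_i - others_alone

print("chosen configuration:", chosen)
print("pivot payments:", payments)
```

In this toy example only the planning subsystem is pivotal: without its report the mechanism would have picked a different configuration for the others, so it is the only subsystem charged a positive payment.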

| Limitation | Description |
|---|---|
| Budget imbalance | Green-Laffont: can't have DSIC + efficiency + budget balance |
| Individual rationality | Agents may get negative utility and refuse to participate |
| Collusion vulnerability | Multiple subsystems can coordinate their reports |
| Computational complexity | Finding the optimal allocation may be NP-hard |

Statement: If a social choice function can be implemented by any mechanism, it can be implemented by an incentive-compatible direct mechanism where agents truthfully report their types.

Dominant-Strategy (Gibbard): Any function implementable in dominant strategies has a DSIC direct mechanism.

Bayesian-Nash (Myerson): Any function implementable in Bayesian-Nash equilibrium has a BNIC direct mechanism.

Positive:

  • Can focus exclusively on truthful mechanisms
  • If any mechanism achieves safety outcomes, a truthful one exists
  • Easier to audit - just verify truth-telling is optimal

Challenges:

  • Multiple equilibria may exist (truthful is just one)
  • Requires common knowledge assumptions
  • Agents may lack computational resources for equilibrium play

Adverse Selection: AI developers hold private information about capabilities and safety before contracting

  • Companies know true safety vulnerabilities
  • Regulators cannot directly observe

Moral Hazard: Developers take unobservable actions after contracting

  • Reduce safety testing after approval
  • Cut corners when unmonitored

Holmström-Milgrom Informativeness Principle


“Any measure of performance that reveals information about effort should be included in the compensation contract.”

Application: Don’t just reward final outcomes - use all signals correlated with safety effort:

  • Direct safety metrics (incidents, red-team results)
  • Process measures (safety team size, testing hours)
  • Peer comparison (relative to other AI companies)
  • Independent audits

When agents perform multiple tasks (safety + capabilities):

  • If only capabilities rewarded, safety suffers
  • Optimal contracts must balance incentives across all dimensions
  • Risk: “Teaching to the test” - optimize measured metrics, ignore unmeasured safety
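A toy worked example (standard linear-contract reasoning, not taken from the source): suppose a developer chooses efforts e_c (capabilities) and e_s (safety), is paid w_c·e_c + w_s·e_s, and bears effort cost ½(e_c² + e_s²). The developer's optimal efforts are e_c* = w_c and e_s* = w_s, so a contract with w_s = 0 induces zero safety effort no matter how much the principal values safety; restoring safety effort requires putting explicit weight on safety-related signals.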

φ_i(v) = Σ_{S⊆N∖{i}} [|S|!(n−|S|−1)!/n!] × [v(S∪{i}) − v(S)]

Intuition: Average marginal contribution across all possible orderings.

Shapley is the unique solution satisfying:

| Axiom | Meaning |
|---|---|
| Efficiency | Σ_i φ_i = v(N): all risk is allocated |
| Symmetry | Equal contributors get equal shares |
| Additivity | φ_i(v+w) = φ_i(v) + φ_i(w) |
| Null Player | Zero contributors get zero |
Shapley versus Euler (gradient-based) risk attribution:

| Feature | Shapley | Euler |
|---|---|---|
| Requires | Characteristic function v(S) | Differentiable R(x) |
| Assumption | Coalition structure | Continuous scaling, homogeneity |
| Complexity | O(2ⁿ) | O(n) |
| Best for | Discrete components | Continuous weights |

Relationship: For continuous functions, Euler = Aumann-Shapley value (Shapley applied to infinitesimal “players”).
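A small numerical sketch of the Euler side of this comparison, using an illustrative covariance-style risk function (the matrix and weights below are made up): because R is positively homogeneous of degree 1, Euler's theorem guarantees that the per-component attributions x_i · ∂R/∂x_i sum exactly to R(x).

```python
import numpy as np

# Illustrative risk function R(x) = sqrt(x' C x), homogeneous of degree 1,
# so Euler's theorem gives R(x) = sum_i x_i * dR/dx_i.  Numbers are made up.
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
x = np.array([1.0, 2.0, 0.5])    # component weights

R = np.sqrt(x @ C @ x)           # total risk
grad = (C @ x) / R               # dR/dx_i
euler_contrib = x * grad         # per-component risk attribution

print(euler_contrib, euler_contrib.sum(), R)  # contributions sum to R(x)
```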

Computing exact Shapley values is #P-hard. Common approximations (a Monte Carlo sketch follows this list):

  • Monte Carlo sampling - random permutations
  • KernelSHAP - weighted linear regression
  • TreeSHAP - exact polynomial-time for trees
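A minimal sketch of the permutation-sampling (Monte Carlo) estimator, using a made-up three-component characteristic function v; the component names, numbers, and interaction term are illustrative only.

```python
import random

# Hypothetical components and characteristic function v(S): total risk
# attributable to a coalition S of components.  v is illustrative.
components = ["planner", "perception", "monitor"]

def v(coalition):
    base = {"planner": 0.5, "perception": 0.3, "monitor": 0.1}
    total = sum(base[c] for c in coalition)
    if "planner" in coalition and "perception" in coalition:
        total += 0.2  # interaction: these two add extra risk together
    return total

def shapley_monte_carlo(components, v, samples=10_000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in components}
    for _ in range(samples):
        order = components[:]
        rng.shuffle(order)
        coalition = set()
        prev = v(coalition)
        for c in order:
            coalition.add(c)
            curr = v(coalition)
            totals[c] += curr - prev
            prev = curr
    return {c: t / samples for c, t in totals.items()}

phi = shapley_monte_carlo(components, v)
print(phi)
# Marginal contributions telescope within each permutation, so the estimates
# sum exactly to v(N); only the per-component split carries sampling error.
print(sum(phi.values()), v(set(components)))
```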

OMB M-24-18 (October 2024) - US government AI procurement:

  • Performance-based contracting with safety incentives
  • 72-hour incident reporting requirement
  • Rights/safety-impacting AI must comply by March 2025

Strict liability for AI developers:

  • “Least cost avoider” principle - developers best positioned to address risks
  • Reduced liability for voluntary disclosure
  • Good-faith safety reports protected from being treated as admissions of liability

California SB 53 (September 2025):

  • 15-day reporting for critical safety incidents
  • 24-hour reporting if imminent risk of death/serious injury

Design elements:

  1. Financial rewards scaled to severity
  2. Legal safe harbors protecting researchers
  3. Coordinated disclosure - private report → time to patch → public disclosure
  4. Continuous programs (not time-limited)

NIST AI RMF emphasizes continuous testing throughout lifecycle.

Challenges (lessons from cyber):

  • Adverse selection and information asymmetry
  • Data scarcity on AI harms
  • Moral hazard (insured reduce safety investment)
  • “Silent AI risk” - AI-related losses that existing policies neither explicitly cover nor exclude

Solutions - Bonus-Malus Systems:

  • Premium adjusts based on claims history
  • Bonus: reduction for loss-free periods
  • Malus: increase after claims
  • Properly designed, this can mitigate moral hazard (toy sketch below)
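A toy sketch of such a premium ladder; the classes, multipliers, base premium, and claims history below are all illustrative.

```python
# Toy bonus-malus ladder: move down one class after a claim-free period (bonus),
# up two classes after a claim (malus).  All numbers are illustrative.
FACTORS = [0.6, 0.8, 1.0, 1.3, 1.7]   # premium multipliers by class
BASE_PREMIUM = 100_000                 # hypothetical annual premium

def next_class(cls, had_claim):
    return min(cls + 2, len(FACTORS) - 1) if had_claim else max(cls - 1, 0)

cls = 2  # start in the neutral class
for year, had_claim in enumerate([False, False, True, False], start=1):
    cls = next_class(cls, had_claim)
    print(f"year {year}: class {cls}, premium {BASE_PREMIUM * FACTORS[cls]:.0f}")
```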

Current applications:

  • Metaculus, Manifold Markets for AI timelines
  • AI forecasting bots now reaching top 10 in competitions

Mechanism design: Proper scoring rules make honest probability reporting optimal.
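A minimal sketch of why one standard proper scoring rule, the Brier (quadratic) score, rewards honesty: whatever the forecaster's true belief p, expected score is maximized by reporting q = p. The probabilities below are illustrative.

```python
# Brier (quadratic) scoring rule, written as a reward (higher is better).
def brier_score(q, outcome):
    """Score for reporting probability q when the event resolves to outcome (0 or 1)."""
    return 1.0 - (q - outcome) ** 2

def expected_score(q, p):
    """Expected Brier score of reporting q when the true probability is p."""
    return p * brier_score(q, 1) + (1 - p) * brier_score(q, 0)

p_true = 0.7
for q in [0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"report {q:.1f}: expected score {expected_score(q, p_true):.3f}")
# The maximum is at q == p_true, so honest probability reporting is optimal.
```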


Mechanism (EU Emissions Trading System): cap-and-trade with a declining cap, auctioned allowances, and allowance trading.

Lessons for AI safety:

  1. Start conservative - over-allocation caused Phase I price collapse
  2. Robust monitoring - continuous verification needed
  3. Dynamic adjustment - Market Stability Reserve added later
  4. Guard against capture - industry lobbied for lenient initial allocations

Innovations (FCC spectrum auctions): simultaneous closing, activity rules, information revelation, package bidding.

Success: 87 auctions, >$60B revenue, became global standard.

Lessons: Multi-dimensional allocation, iterative revelation, simplicity matters (coded in Excel).

Top Trading Cycles algorithm (sketched below):

  • Strategy-proof (honest reporting optimal)
  • Individually rational
  • Efficient (maximizes transplants)
  • Unique mechanism with all three properties
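A minimal sketch of the Top Trading Cycles step for a one-item-per-agent exchange (the agents here are hypothetical labs and the preferences are illustrative): each remaining agent points at the owner of its favourite remaining item, every cycle trades and leaves, and the process repeats until no one remains.

```python
# Minimal Top Trading Cycles sketch for a one-item-per-agent exchange.
def top_trading_cycles(preferences):
    """preferences[agent] lists agents whose items it prefers, best first
    (each agent's own item appears somewhere in its list).
    Returns {agent: owner of the item it receives}."""
    remaining = set(preferences)
    assignment = {}
    while remaining:
        # Each remaining agent points at the owner of its favourite remaining item.
        points_to = {a: next(o for o in preferences[a] if o in remaining) for a in remaining}
        # Walk the pointers from any agent until a node repeats; that closes a cycle.
        seen = []
        node = next(iter(remaining))
        while node not in seen:
            seen.append(node)
            node = points_to[node]
        cycle = seen[seen.index(node):]
        # Everyone in the cycle receives the item it pointed at, then leaves.
        for a in cycle:
            assignment[a] = points_to[a]
            remaining.discard(a)
    return assignment

prefs = {
    "lab_A": ["lab_B", "lab_A", "lab_C"],
    "lab_B": ["lab_A", "lab_C", "lab_B"],
    "lab_C": ["lab_A", "lab_B", "lab_C"],
}
print(top_trading_cycles(prefs))
# e.g. {'lab_A': 'lab_B', 'lab_B': 'lab_A', 'lab_C': 'lab_C'}
```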

Lessons:

  1. Thickness matters - centralized platforms beat fragmented markets
  2. Non-monetary mechanisms possible
  3. Create value from incompatibility - labs with different capabilities can exchange

  • Vickrey (1961) - Second-price auctions
  • Clarke (1971) - Pivot mechanism
  • Groves (1973) - General public goods mechanism
  • Gibbard (1973) - Revelation principle (dominant strategy)
  • Myerson (1979) - Revelation principle (Bayesian)
  • Green & Laffont (1977) - Impossibility theorem
  • Holmström (1979) - Moral hazard and observability
  • Holmström & Milgrom (1987, 1991) - Multi-task, dynamic models
  • Nobel Prize 2016 - Contract theory (Hart & Holmström)
  • Shapley (1951) - Original value concept
  • Aumann & Shapley (1974) - Values of non-atomic games (Euler connection)
  • Lundberg & Lee (2017) - SHAP for ML explainability
  • Roth, Sönmez, Ünver (2004) - Kidney exchange
  • Milgrom (2004) - Putting Auction Theory to Work
  • Nobel Prize 2020 - Auction theory (Milgrom & Wilson)