
Mechanism Design for AI Safety

How mechanism design can incentivize honest safety reporting in AI systems.

The Vickrey-Clarke-Groves (VCG) mechanism achieves dominant-strategy incentive compatibility (DSIC): truth-telling is optimal for each agent regardless of what the others report.

Key idea: Make agents internalize their externalities. Each agent pays based on the harm their presence causes to others.

p_i(θ) = max_{a∈A} Σ_{j≠i} v_j(a, θ_j) − Σ_{j≠i} v_j(f(θ), θ_j)

Intuition: Payment equals the difference between:

  • Maximum welfare others could achieve without you
  • Actual welfare others receive with you

An agent only pays if they’re pivotal - if the outcome changes because of them.

Scenario: AI subsystems (planning, perception, safety monitor) have private information about their risk levels.

VCG implementation:

  1. Allocation: Choose configuration minimizing total system risk
  2. Payment: Each subsystem pays increase in risk to others caused by its presence

This creates an incentive for truthful risk reporting: misreporting cannot improve a subsystem's outcome, as the sketch below illustrates.
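A minimal sketch of this rule in Python. All configuration names and risk numbers below are made up for illustration: the mechanism selects the configuration with the lowest total reported risk, and each subsystem's payment is the increase in the others' risk caused by its presence.

```python
# Hypothetical reported risk tables: reports[subsystem][configuration].
# Configurations are system-wide choices; the names and numbers are illustrative.
reports = {
    "planning":       {"conservative": 0.10, "balanced": 0.30, "aggressive": 0.60},
    "perception":     {"conservative": 0.40, "balanced": 0.20, "aggressive": 0.15},
    "safety_monitor": {"conservative": 0.05, "balanced": 0.10, "aggressive": 0.50},
}
configs = ["conservative", "balanced", "aggressive"]

def total_risk(reports, config, exclude=None):
    """Sum of reported risks for a configuration, optionally excluding one subsystem."""
    return sum(r[config] for name, r in reports.items() if name != exclude)

# Allocation rule: pick the configuration minimizing total reported risk.
chosen = min(configs, key=lambda c: total_risk(reports, c))

# Clarke pivot payment: increase in others' risk caused by subsystem i's presence.
payments = {}
for i in reports:
    others_with_i = total_risk(reports, chosen, exclude=i)
    others_alone = min(total_risk(reports, c, exclude=i) for c in configs)
    payments[i] = others_with_i - others_alone

print("chosen configuration:", chosen)
print("pivot payments:", payments)
```

In this toy example only the planning subsystem is pivotal: without its report the mechanism would have picked a different configuration for the others, so it is the only subsystem charged a positive payment.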

| Limitation | Description |
|---|---|
| Budget imbalance | Green-Laffont: can't have DSIC + efficiency + budget balance |
| Individual rationality | Agents may get negative utility and refuse to participate |
| Collusion vulnerability | Multiple subsystems can coordinate their reports |
| Computational complexity | Finding the optimal allocation may be NP-hard |

Statement: If a social choice function can be implemented by any mechanism, it can be implemented by an incentive-compatible direct mechanism where agents truthfully report their types.

Dominant-Strategy (Gibbard): Any function implementable in dominant strategies has a DSIC direct mechanism.

Bayesian-Nash (Myerson): Any function implementable in Bayesian-Nash equilibrium has a BNIC direct mechanism.

Positive:

  • Can focus exclusively on truthful mechanisms
  • If any mechanism achieves safety outcomes, a truthful one exists
  • Easier to audit - just verify truth-telling is optimal

Challenges:

  • Multiple equilibria may exist (truthful is just one)
  • Requires common knowledge assumptions
  • Agents may lack computational resources for equilibrium play

Adverse Selection: AI developers hold private information about capabilities and safety before contracting

  • Companies know true safety vulnerabilities
  • Regulators cannot directly observe

Moral Hazard: Developers take unobservable actions after contracting

  • Reduce safety testing after approval
  • Cut corners when unmonitored

Holmström-Milgrom Informativeness Principle


“Any measure of performance that reveals information about effort should be included in the compensation contract.”

Application: Don’t just reward final outcomes - use all signals correlated with safety effort:

  • Direct safety metrics (incidents, red-team results)
  • Process measures (safety team size, testing hours)
  • Peer comparison (relative to other AI companies)
  • Independent audits

When agents perform multiple tasks (safety + capabilities):

  • If only capabilities rewarded, safety suffers
  • Optimal contracts must balance incentives across all dimensions
  • Risk: “Teaching to the test” - optimize measured metrics, ignore unmeasured safety
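A toy worked example (standard linear-contract reasoning, not taken from the source): suppose a developer chooses efforts e_c (capabilities) and e_s (safety), is paid w_c·e_c + w_s·e_s, and bears effort cost ½(e_c² + e_s²). The developer's optimal efforts are e_c* = w_c and e_s* = w_s, so a contract with w_s = 0 induces zero safety effort no matter how much the principal values safety; restoring safety effort requires putting explicit weight on safety-related signals.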

φ_i(v) = Σ_{S⊆N∖{i}} [|S|!(n−|S|−1)!/n!] × [v(S∪{i}) − v(S)]

Intuition: Average marginal contribution across all possible orderings.

Shapley is the unique solution satisfying:

| Axiom | Meaning |
|---|---|
| Efficiency | Σ_i φ_i = v(N): all risk is allocated |
| Symmetry | Equal contributors get equal shares |
| Additivity | φ_i(v+w) = φ_i(v) + φ_i(w) |
| Null Player | Zero contributors get zero |
Shapley versus Euler (gradient-based) risk attribution:

| Feature | Shapley | Euler |
|---|---|---|
| Requires | Characteristic function v(S) | Differentiable R(x) |
| Assumption | Coalition structure | Continuous scaling, homogeneity |
| Complexity | O(2ⁿ) | O(n) |
| Best for | Discrete components | Continuous weights |

Relationship: For continuous functions, Euler = Aumann-Shapley value (Shapley applied to infinitesimal “players”).
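A small numerical sketch of the Euler side of this comparison, using an illustrative covariance-style risk function (the matrix and weights below are made up): because R is positively homogeneous of degree 1, Euler's theorem guarantees that the per-component attributions x_i · ∂R/∂x_i sum exactly to R(x).

```python
import numpy as np

# Illustrative risk function R(x) = sqrt(x' C x), homogeneous of degree 1,
# so Euler's theorem gives R(x) = sum_i x_i * dR/dx_i.  Numbers are made up.
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
x = np.array([1.0, 2.0, 0.5])    # component weights

R = np.sqrt(x @ C @ x)           # total risk
grad = (C @ x) / R               # dR/dx_i
euler_contrib = x * grad         # per-component risk attribution

print(euler_contrib, euler_contrib.sum(), R)  # contributions sum to R(x)
```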

Computing exact Shapley values is #P-hard. Common approximations (a Monte Carlo sketch follows this list):

  • Monte Carlo sampling - random permutations
  • KernelSHAP - weighted linear regression
  • TreeSHAP - exact polynomial-time for trees
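A minimal sketch of the permutation-sampling (Monte Carlo) estimator, using a made-up three-component characteristic function v; the component names, numbers, and interaction term are illustrative only.

```python
import random

# Hypothetical components and characteristic function v(S): total risk
# attributable to a coalition S of components.  v is illustrative.
components = ["planner", "perception", "monitor"]

def v(coalition):
    base = {"planner": 0.5, "perception": 0.3, "monitor": 0.1}
    total = sum(base[c] for c in coalition)
    if "planner" in coalition and "perception" in coalition:
        total += 0.2  # interaction: these two add extra risk together
    return total

def shapley_monte_carlo(components, v, samples=10_000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in components}
    for _ in range(samples):
        order = components[:]
        rng.shuffle(order)
        coalition = set()
        prev = v(coalition)
        for c in order:
            coalition.add(c)
            curr = v(coalition)
            totals[c] += curr - prev
            prev = curr
    return {c: t / samples for c, t in totals.items()}

phi = shapley_monte_carlo(components, v)
print(phi)
# Marginal contributions telescope within each permutation, so the estimates
# sum exactly to v(N); only the per-component split carries sampling error.
print(sum(phi.values()), v(set(components)))
```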

OMB M-24-18 (October 2024) - US government AI procurement:

  • Performance-based contracting with safety incentives
  • 72-hour incident reporting requirement
  • Rights/safety-impacting AI must comply by March 2025

Strict liability for AI developers:

  • “Least cost avoider” principle - developers best positioned to address risks
  • Reduced liability for voluntary disclosure
  • Good-faith safety reports protected from being treated as admissions of liability

California SB 53 (September 2025):

  • 15-day reporting for critical safety incidents
  • 24-hour reporting if imminent risk of death/serious injury

Design elements:

  1. Financial rewards scaled to severity
  2. Legal safe harbors protecting researchers
  3. Coordinated disclosure - private report → time to patch → public disclosure
  4. Continuous programs (not time-limited)

NIST AI RMF emphasizes continuous testing throughout lifecycle.

Challenges (lessons from cyber):

  • Adverse selection and information asymmetry
  • Data scarcity on AI harms
  • Moral hazard (insured reduce safety investment)
  • “Silent AI risk” - AI-related losses that existing policies neither explicitly cover nor exclude

Solutions - Bonus-Malus Systems:

  • Premium adjusts based on claims history
  • Bonus: reduction for loss-free periods
  • Malus: increase after claims
  • Properly designed, this can mitigate moral hazard (toy sketch below)
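A toy sketch of such a premium ladder; the classes, multipliers, base premium, and claims history below are all illustrative.

```python
# Toy bonus-malus ladder: move down one class after a claim-free period (bonus),
# up two classes after a claim (malus).  All numbers are illustrative.
FACTORS = [0.6, 0.8, 1.0, 1.3, 1.7]   # premium multipliers by class
BASE_PREMIUM = 100_000                 # hypothetical annual premium

def next_class(cls, had_claim):
    return min(cls + 2, len(FACTORS) - 1) if had_claim else max(cls - 1, 0)

cls = 2  # start in the neutral class
for year, had_claim in enumerate([False, False, True, False], start=1):
    cls = next_class(cls, had_claim)
    print(f"year {year}: class {cls}, premium {BASE_PREMIUM * FACTORS[cls]:.0f}")
```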

Current applications:

  • Metaculus, Manifold Markets for AI timelines
  • AI forecasting bots now reaching top 10 in competitions

Mechanism design: Proper scoring rules make honest probability reporting optimal.
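A minimal sketch of why one standard proper scoring rule, the Brier (quadratic) score, rewards honesty: whatever the forecaster's true belief p, expected score is maximized by reporting q = p. The probabilities below are illustrative.

```python
# Brier (quadratic) scoring rule, written as a reward (higher is better).
def brier_score(q, outcome):
    """Score for reporting probability q when the event resolves to outcome (0 or 1)."""
    return 1.0 - (q - outcome) ** 2

def expected_score(q, p):
    """Expected Brier score of reporting q when the true probability is p."""
    return p * brier_score(q, 1) + (1 - p) * brier_score(q, 0)

p_true = 0.7
for q in [0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"report {q:.1f}: expected score {expected_score(q, p_true):.3f}")
# The maximum is at q == p_true, so honest probability reporting is optimal.
```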


Mechanism (EU Emissions Trading System): cap-and-trade with a declining cap, auctioned allowances, and allowance trading.

Lessons for AI safety:

  1. Start conservative - over-allocation caused Phase I price collapse
  2. Robust monitoring - continuous verification needed
  3. Dynamic adjustment - Market Stability Reserve added later
  4. Guard against capture - industry lobbied for lenient initial allocations

Innovations (FCC spectrum auctions): simultaneous closing, activity rules, information revelation, package bidding.

Success: 87 auctions, >$60B revenue, became global standard.

Lessons: Multi-dimensional allocation, iterative revelation, simplicity matters (coded in Excel).

Top Trading Cycles algorithm (sketched below):

  • Strategy-proof (honest reporting optimal)
  • Individually rational
  • Efficient (maximizes transplants)
  • Unique mechanism with all three properties
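A minimal sketch of the Top Trading Cycles step for a one-item-per-agent exchange (the agents here are hypothetical labs and the preferences are illustrative): each remaining agent points at the owner of its favourite remaining item, every cycle trades and leaves, and the process repeats until no one remains.

```python
# Minimal Top Trading Cycles sketch for a one-item-per-agent exchange.
def top_trading_cycles(preferences):
    """preferences[agent] lists agents whose items it prefers, best first
    (each agent's own item appears somewhere in its list).
    Returns {agent: owner of the item it receives}."""
    remaining = set(preferences)
    assignment = {}
    while remaining:
        # Each remaining agent points at the owner of its favourite remaining item.
        points_to = {a: next(o for o in preferences[a] if o in remaining) for a in remaining}
        # Walk the pointers from any agent until a node repeats; that closes a cycle.
        seen = []
        node = next(iter(remaining))
        while node not in seen:
            seen.append(node)
            node = points_to[node]
        cycle = seen[seen.index(node):]
        # Everyone in the cycle receives the item it pointed at, then leaves.
        for a in cycle:
            assignment[a] = points_to[a]
            remaining.discard(a)
    return assignment

prefs = {
    "lab_A": ["lab_B", "lab_A", "lab_C"],
    "lab_B": ["lab_A", "lab_C", "lab_B"],
    "lab_C": ["lab_A", "lab_B", "lab_C"],
}
print(top_trading_cycles(prefs))
# e.g. {'lab_A': 'lab_B', 'lab_B': 'lab_A', 'lab_C': 'lab_C'}
```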

Lessons:

  1. Thickness matters - centralized platforms beat fragmented markets
  2. Non-monetary mechanisms possible
  3. Create value from incompatibility - labs with different capabilities can exchange

  • Vickrey (1961) - Second-price auctions
  • Clarke (1971) - Pivot mechanism
  • Groves (1973) - General public goods mechanism
  • Gibbard (1973) - Revelation principle (dominant strategy)
  • Myerson (1979) - Revelation principle (Bayesian)
  • Green & Laffont (1977) - Impossibility theorem
  • Holmström (1979) - Moral hazard and observability
  • Holmström & Milgrom (1987, 1991) - Multi-task, dynamic models
  • Nobel Prize 2016 - Contract theory (Hart & Holmström)
  • Shapley (1951) - Original value concept
  • Aumann & Shapley (1974) - Values of non-atomic games (Euler connection)
  • Lundberg & Lee (2017) - SHAP for ML explainability
  • Roth, Sönmez, Ünver (2004) - Kidney exchange
  • Milgrom (2004) - Putting Auction Theory to Work
  • Nobel Prize 2020 - Auction theory (Milgrom & Wilson)