Mechanism Design for AI Safety
How mechanism design can incentivize honest safety reporting in AI systems.
VCG Mechanisms
Vickrey-Clarke-Groves (VCG) mechanisms achieve dominant-strategy incentive compatibility (DSIC): truth-telling is optimal regardless of what other agents report.
Key idea: Make agents internalize their externalities. Each agent pays based on the harm their presence causes to others.
The Clarke Pivot Payment
p_i(θ) = max_{a∈A} Σ_{j≠i} v_j(a, θ_j) − Σ_{j≠i} v_j(f(θ), θ_j)
Intuition: Payment equals the difference between:
- Maximum welfare others could achieve without you
- Actual welfare others receive with you
An agent only pays if they’re pivotal - if the outcome changes because of them.
Application to AI Safety Reporting
Scenario: AI subsystems (planning, perception, safety monitor) have private information about their risk levels.
VCG implementation:
- Allocation: Choose configuration minimizing total system risk
- Payment: Each subsystem pays increase in risk to others caused by its presence
This makes truthful risk reporting a dominant strategy - misreporting cannot improve a subsystem's outcome (a minimal sketch follows).
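A minimal sketch of this mechanism, assuming hypothetical subsystem names, two candidate configurations, and made-up risk reports. Welfare here is negative risk, so the Clarke pivot formula above becomes: others' risk under the chosen configuration minus the minimum risk others could achieve if the subsystem were absent.

```python
# VCG sketch for safety reporting. Subsystem names, configurations, and
# risk numbers are all hypothetical.
reports = {                       # reported risk per candidate configuration
    "planning":       {"conservative": 0.10, "aggressive": 0.90},
    "perception":     {"conservative": 0.30, "aggressive": 0.05},
    "safety_monitor": {"conservative": 0.20, "aggressive": 0.10},
}
configs = ["conservative", "aggressive"]

def others_risk(config, exclude):
    return sum(r[config] for name, r in reports.items() if name != exclude)

def total_risk(config):
    return others_risk(config, exclude=None)

# Allocation rule: pick the configuration minimizing total reported risk.
chosen = min(configs, key=total_risk)          # "conservative" (0.60 vs 1.05)

# Clarke pivot payments: only pivotal subsystems pay anything.
payments = {
    name: others_risk(chosen, name) - min(others_risk(c, name) for c in configs)
    for name in reports
}
print(chosen, payments)
# planning pays 0.35 (without it, "aggressive" would be better for the others);
# perception and safety_monitor are not pivotal and pay 0.
```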
Limitations
| Limitation | Description |
|---|---|
| Budget imbalance | Green-Laffont theorem: no mechanism is simultaneously DSIC, efficient, and budget-balanced |
| Individual rationality | Agents may get negative utility and refuse to participate |
| Collusion vulnerability | Multiple subsystems can coordinate reports |
| Computational complexity | Finding optimal allocation may be NP-hard |
The Revelation Principle (Gibbard 1973; Myerson 1979)
Statement: If a social choice function can be implemented by any mechanism, it can be implemented by an incentive-compatible direct mechanism in which agents truthfully report their types.
Two Variants
Dominant-Strategy (Gibbard): Any function implementable in dominant strategies has a DSIC direct mechanism.
Bayesian-Nash (Myerson): Any function implementable in Bayesian-Nash equilibrium has a Bayesian-Nash incentive-compatible (BNIC) direct mechanism.
Implications for AI Safety
Positive:
- Can focus exclusively on truthful mechanisms
- If any mechanism achieves safety outcomes, a truthful one exists
- Easier to audit - just verify truth-telling is optimal
Challenges:
- Multiple equilibria may exist (truthful is just one)
- Requires common knowledge assumptions
- Agents may lack computational resources for equilibrium play
Principal-Agent Problems
Adverse Selection: AI developers have private information about capabilities/safety before contracting
- Companies know true safety vulnerabilities
- Regulators cannot directly observe
Moral Hazard: Developers take unobservable actions after contracting
- Reduce safety testing after approval
- Cut corners when unmonitored
The Informativeness Principle (Holmström 1979)
“Any measure of performance that reveals information about effort should be included in the compensation contract.”
Application: Don’t reward only final outcomes - use all signals correlated with safety effort (a stylized weighting sketch follows this list):
- Direct safety metrics (incidents, red-team results)
- Process measures (safety team size, testing hours)
- Peer comparison (relative to other AI companies)
- Independent audits
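A stylized illustration, assuming the standard setting where each signal is an independent, unbiased, noisy measure of safety effort; in that case low-noise signals get more weight, but every informative signal gets some weight. Signal names and noise variances are hypothetical.

```python
# Inverse-variance weighting of safety signals (illustrative values).
signal_variances = {
    "incident_rate": 0.50,
    "red_team_findings": 0.25,
    "testing_hours": 1.00,
    "independent_audit": 0.10,
}

raw = {k: 1.0 / v for k, v in signal_variances.items()}
total = sum(raw.values())
weights = {k: w / total for k, w in raw.items()}  # normalize to sum to 1

def performance_index(signals):
    """Contract pays on this weighted combination of observed signals."""
    return sum(weights[k] * signals[k] for k in weights)

print(weights)  # the audit (least noisy) carries the largest weight
```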
Multi-Task Problem
When agents perform multiple tasks (safety + capabilities), incentives on one task distort effort on the others (see the sketch after this list):
- If only capabilities rewarded, safety suffers
- Optimal contracts must balance incentives across all dimensions
- Risk: “Teaching to the test” - optimize measured metrics, ignore unmeasured safety
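A toy version of the multi-task logic, assuming quadratic effort costs and a substitution parameter gamma that makes safety and capabilities compete for attention; all parameter values are illustrative.

```python
import numpy as np

# Agent picks efforts (e_s, e_c) to maximize
#   w_s*e_s + w_c*e_c - 0.5*(e_s**2 + e_c**2) - gamma*e_s*e_c
def best_response(w_s, w_c, gamma=0.5):
    # First-order conditions give the 2x2 linear system A @ e = w;
    # negative solutions are clamped to zero as a simplification
    # (a full corner solution would re-solve for the remaining task alone).
    A = np.array([[1.0, gamma], [gamma, 1.0]])
    e = np.linalg.solve(A, np.array([w_s, w_c]))
    return np.maximum(e, 0.0)

print(best_response(w_s=0.0, w_c=1.0))  # safety effort collapses to 0
print(best_response(w_s=1.0, w_c=1.0))  # balanced incentives restore it
```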
Shapley Values for Risk Attribution
Section titled “Shapley Values for Risk Attribution”The Formula
φ_i(v) = Σ_{S⊆N∖{i}} [|S|!(n−|S|−1)!/n!] × [v(S∪{i}) − v(S)]
Intuition: Average marginal contribution across all possible orderings.
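A direct implementation of that intuition, averaging marginal contributions over all n! orderings; the coalition risk numbers are hypothetical, and this exact approach is only feasible for small n (see the approximations below).

```python
from itertools import permutations
from math import factorial

def shapley(players, v):
    """Exact Shapley values via all orderings."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    n_fact = factorial(len(players))
    return {p: total / n_fact for p, total in phi.items()}

# Hypothetical coalition risks for two subsystems, with a superadditive
# interaction (the combination is riskier than the parts).
risk = {
    frozenset(): 0.0,
    frozenset({"planning"}): 0.3,
    frozenset({"perception"}): 0.2,
    frozenset({"planning", "perception"}): 0.7,
}
phi = shapley(["planning", "perception"], lambda S: risk[S])
print(phi)  # {'planning': 0.4, 'perception': 0.3}
assert abs(sum(phi.values()) - 0.7) < 1e-9  # efficiency: all risk allocated
```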
Axiomatic Foundation
Shapley is the unique solution satisfying:
| Axiom | Meaning |
|---|---|
| Efficiency | Σ_i φ_i = v(N) - all risk allocated |
| Symmetry | Equal contributors get equal shares |
| Additivity | φ_i(v+w) = φ_i(v) + φ_i(w) |
| Null Player | Zero contributors get zero |
Shapley vs Euler Allocation
| Feature | Shapley | Euler |
|---|---|---|
| Requires | Characteristic function v(S) | Differentiable R(x) |
| Assumption | Coalition structure | Continuous scaling, homogeneity |
| Complexity | O(2ⁿ) | O(n) |
| Best for | Discrete components | Continuous weights |
Relationship: For continuous functions, Euler = Aumann-Shapley value (Shapley applied to infinitesimal “players”).
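A quick check of the Euler side, for a hypothetical degree-1 homogeneous risk measure R(x) = sqrt(xᵀCx); by Euler's theorem the per-component allocations x_i ∂R/∂x_i sum exactly to R(x), at O(n) cost.

```python
import numpy as np

C = np.array([[1.0, 0.3],
              [0.3, 2.0]])          # illustrative interaction matrix
x = np.array([0.6, 0.4])            # component exposure weights

def R(x):
    return float(np.sqrt(x @ C @ x))

grad = C @ x / R(x)                 # analytic gradient of R at x
euler = x * grad                    # allocation: x_i * dR/dx_i
print(euler, R(x))
assert abs(euler.sum() - R(x)) < 1e-9  # allocations sum to total risk
```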
Computational Approximations
Exact Shapley computation is #P-hard. Approximations (Monte Carlo sketched below):
- Monte Carlo sampling - random permutations
- KernelSHAP - weighted linear regression
- TreeSHAP - exact polynomial-time for trees
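A sketch of the Monte Carlo variant, reusing the characteristic-function interface (and the hypothetical `risk` table) from the exact sketch above; the estimate is unbiased and tightens as the number of sampled orderings grows.

```python
import random

def shapley_mc(players, v, n_samples=2000, seed=0):
    """Shapley values averaged over randomly sampled orderings
    instead of all n! permutations."""
    rng = random.Random(seed)
    players = list(players)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        rng.shuffle(players)
        coalition = frozenset()
        for p in players:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: total / n_samples for p, total in phi.items()}

# Using `risk` from the exact sketch above:
print(shapley_mc(["planning", "perception"], lambda S: risk[S]))
```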
Practical Applications
Safety-Contingent Procurement
OMB M-24-18 (October 2024) - US government AI procurement:
- Performance-based contracting with safety incentives
- 72-hour incident reporting requirement
- Rights/safety-impacting AI must comply by March 2025
Liability Frameworks
Strict liability for AI developers:
- “Least cost avoider” principle - developers best positioned to address risks
- Reduced liability for voluntary disclosure
- Good-faith safety reports protected from being treated as admissions of liability
California SB 53 (September 2025):
- 15-day reporting for critical safety incidents
- 24-hour reporting if imminent risk of death/serious injury
Bug Bounties and Red Teams
Design elements:
- Financial rewards scaled to severity
- Legal safe harbors protecting researchers
- Coordinated disclosure - private report → time to patch → public disclosure
- Continuous programs (not time-limited)
NIST AI RMF emphasizes continuous testing throughout lifecycle.
Insurance Mechanisms
Challenges (lessons from cyber insurance):
- Adverse selection and information asymmetry
- Data scarcity on AI harms
- Moral hazard (insured reduce safety investment)
- “Silent AI risk” - AI exposure that existing policies neither explicitly cover nor exclude
Solutions - bonus-malus systems (a toy schedule follows this list):
- Premium adjusts based on claims history
- Bonus: reduction for loss-free periods
- Malus: increase after claims
- Properly designed, such systems mitigate moral hazard
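A toy bonus-malus schedule; the class boundaries, factors, and transition rules are all hypothetical, but the structure (claims move you to worse classes, claim-free periods to better ones) is the standard one.

```python
# Premium = base premium x current class factor.
FACTORS = [0.5, 0.7, 0.85, 1.0, 1.3, 1.7]  # class 0 = best ... class 5 = worst

def next_class(current, claims):
    if claims == 0:
        return max(current - 1, 0)                       # bonus: one step down
    return min(current + 2 * claims, len(FACTORS) - 1)   # malus: two steps per claim

cls = 3                        # start in the neutral class
for claims in [0, 0, 2, 0]:    # four years of claims history
    cls = next_class(cls, claims)
    print(claims, cls, 1000.0 * FACTORS[cls])  # premium on a 1000 base
```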
Prediction Markets
Current applications:
- Metaculus, Manifold Markets for AI timelines
- AI forecasting bots now reaching top 10 in competitions
Mechanism design: Proper scoring rules make honest probability reporting optimal.
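A quick numerical check with the Brier (quadratic) score, one standard proper scoring rule: whatever the forecaster's true belief p, expected loss is minimized by reporting q = p, so honesty is optimal. The belief p = 0.7 is an arbitrary illustration.

```python
import numpy as np

def expected_brier_loss(p, q):
    # Event occurs with probability p; loss is the squared forecast error.
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.7
qs = np.linspace(0, 1, 101)
losses = [expected_brier_loss(p, q) for q in qs]
print(qs[int(np.argmin(losses))])  # -> 0.7: the truthful report wins
```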
Lessons from Existing Markets
EU Emissions Trading System
Mechanism: Cap-and-trade with declining cap, auctioned allowances, trading.
Lessons for AI safety:
- Start conservative - over-allocation caused Phase I price collapse
- Robust monitoring - continuous verification needed
- Dynamic adjustment - Market Stability Reserve added later
- Guard against capture - industry lobbied for lenient initial allocations
Spectrum Auctions (Milgrom-Wilson 1994)
Innovations: Simultaneous closing, activity rules, information revelation, package bidding.
Success: 87 auctions, >$60B revenue, became global standard.
Lessons: Multi-dimensional allocation, iterative revelation, simplicity matters (coded in Excel).
Kidney Exchange
Top Trading Cycles algorithm (sketched below):
- Strategy-proof (honest reporting optimal)
- Individually rational
- Efficient (maximizes transplants)
- Unique mechanism with all three properties
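A compact sketch of the house-allocation version of Top Trading Cycles: each patient arrives with one incompatible donor, and preference lists rank donors by their owners. Patients p1-p3 and their preferences are hypothetical.

```python
def top_trading_cycles(prefs):
    assignment = {}
    remaining = set(prefs)
    while remaining:
        # Each remaining agent points at the owner of its favorite remaining donor.
        points_to = {a: next(o for o in prefs[a] if o in remaining)
                     for a in remaining}
        # Follow pointers until a node repeats - that closes a trading cycle.
        path, node = [], next(iter(remaining))
        while node not in path:
            path.append(node)
            node = points_to[node]
        cycle = path[path.index(node):]
        for a in cycle:                  # everyone in the cycle trades
            assignment[a] = points_to[a]
        remaining -= set(cycle)
    return assignment

prefs = {
    "p1": ["p2", "p1", "p3"],   # p1 most prefers p2's donor, then its own, ...
    "p2": ["p1", "p3", "p2"],
    "p3": ["p1", "p2", "p3"],
}
print(top_trading_cycles(prefs))  # p1 and p2 swap donors; p3 keeps its own
```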
Lessons:
- Thickness matters - centralized platforms beat fragmented markets
- Non-monetary mechanisms possible
- Create value from incompatibility - labs with different capabilities can exchange
Key Citations
Mechanism Design Foundations
- Vickrey (1961) - Second-price auctions
- Clarke (1971) - Pivot mechanism
- Groves (1973) - General public goods mechanism
- Gibbard (1973) - Revelation principle (dominant strategy)
- Myerson (1979) - Revelation principle (Bayesian)
- Green & Laffont (1977) - Impossibility theorem
Principal-Agent Theory
- Holmström (1979) - Moral hazard and observability
- Holmström & Milgrom (1987, 1991) - Multi-task, dynamic models
- Nobel Prize 2016 - Contract theory (Hart & Holmström)
Shapley Values
- Shapley (1953) - Original value concept
- Aumann & Shapley (1974) - Values of non-atomic games (Euler connection)
- Lundberg & Lee (2017) - SHAP for ML explainability
Market Design
- Roth, Sönmez, Ünver (2004) - Kidney exchange
- Milgrom (2004) - Putting Auction Theory to Work
- Nobel Prize 2020 - Auction theory (Milgrom & Wilson)