
Research Connections

The challenges of entanglement in AI delegation systems connect to rich bodies of academic research. This page surveys relevant literature and highlights key insights for practitioners.


The delegation framework is fundamentally an application of principal-agent theory from economics—the study of how principals (those who delegate) can align the behavior of agents (those who act on their behalf) when interests may diverge and information is asymmetric.

| Concept | Definition | Entanglement Relevance |
| --- | --- | --- |
| Moral hazard | Agent takes hidden actions that affect outcomes | Agent may act in ways the verifier can’t observe |
| Adverse selection | Agent has hidden information the principal lacks | Agent knows its own capabilities/intentions better |
| Information asymmetry | Unequal knowledge between parties | Context contamination exploits this |
| Incentive compatibility | Mechanism where truthful behavior is optimal | Designing systems where honesty is the best strategy |

Jensen & Meckling (1976) - “Theory of the Firm”

  • Introduced agency costs: monitoring costs, bonding costs, residual loss
  • Insight for AI: Even with monitoring, some agency costs are unavoidable. The question is minimizing total cost.

Holmström (1979) - “Moral Hazard and Observability”

  • Showed that optimal contracts depend on what’s observable
  • Insight for AI: Verification is only as good as what can be observed. Unobservable agent reasoning creates irreducible risk.

Holmström & Milgrom (1991) - “Multitask Principal-Agent Analyses”

  • When agents have multiple tasks, incentivizing one may distort others
  • Insight for AI: Optimizing for measurable metrics may cause agents to neglect unmeasured but important behaviors.

When multiple principals oversee the same agent (or multiple agents oversee each other), new dynamics emerge:

Bernheim & Whinston (1986) - “Common Agency”

  • Agents serving multiple principals face conflicting incentives
  • Insight for AI: A verifier answerable to both an accuracy-focused principal and a speed-focused principal may end up compromising on both.

Dixit (1996) - “The Making of Economic Policy”

  • Political economy of multiple principals with different objectives
  • Insight for AI: Organizational entanglement often stems from conflicting stakeholder interests.

Adversarial entanglements—collusion, capture, strategic coordination—are fundamentally game-theoretic phenomena.

Laffont & Martimort (1997) - “Collusion under Asymmetric Information”

  • Formal analysis of when and how agents collude
  • Collusion is easier when agents can communicate and have aligned interests
  • Insight for AI: Minimizing communication channels and creating opposed incentives reduces collusion risk.

Che & Kim (2006) - “Robustly Collusion-Proof Implementation”

  • Mechanisms that remain incentive-compatible even if agents can collude
  • Insight for AI: Some verification architectures are inherently more collusion-resistant than others.

Aumann (1974, 1987) - Correlated Equilibrium

  • Players can coordinate using shared signals without explicit communication
  • Insight for AI: TDT-style logical correlation is a form of correlated equilibrium. Agents with similar “types” (training) may coordinate without communication.

Maskin (1999) - “Nash Equilibrium and Welfare Optimality”

  • When can social choice functions be implemented?
  • Insight for AI: Not all desirable verification properties are achievable through mechanism design. Some are fundamentally impossible.

Myerson (1981) - “Optimal Auction Design”

  • Revelation principle: any mechanism can be replicated by one where agents truthfully reveal information
  • Insight for AI: If an agent can be incentivized to behave well through complex monitoring, there exists a simpler mechanism achieving the same outcome. But finding it may be hard.

Fudenberg & Maskin (1986) - Folk Theorem

  • In repeated interactions, many outcomes (including collusion) become sustainable
  • Insight for AI: Long-term agent-verifier relationships enable implicit collusion. Rotation disrupts this.
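
The folk-theorem condition can be made concrete with a small sketch. The payoff numbers below are illustrative assumptions, not values from the literature: sustained collusion pays R per round, a one-shot betrayal pays T, mutual non-collusion pays P, and delta is the probability that the same agent-verifier pair interacts again (which rotation reduces).

```python
# A stylised sketch of the folk-theorem logic behind rotation, with assumed
# prisoner's-dilemma-style payoffs (T > R > P). Under grim-trigger punishment,
# colluding forever beats a one-shot betrayal only when the continuation
# probability delta clears a threshold; rotation pushes delta below it.

def collusion_sustainable(delta: float, T: float = 5.0, R: float = 3.0, P: float = 1.0) -> bool:
    """Grim trigger: collusion holds iff R/(1-delta) >= T + delta*P/(1-delta)."""
    threshold = (T - R) / (T - P)
    return delta >= threshold

print(collusion_sustainable(delta=0.9))   # stable pairing: collusion is sustainable
print(collusion_sustainable(delta=0.3))   # frequent rotation: collusion collapses
```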

Kreps & Wilson (1982) - “Reputation and Imperfect Information”

  • Reputation building in games with incomplete information
  • Insight for AI: Agents may build reputation specifically to exploit it later. Reputation is not a reliable signal of future behavior.

Information theory provides precise tools for quantifying entanglement.

Shannon (1948) - “A Mathematical Theory of Communication”

  • Mutual information I(X;Y) measures how much knowing X tells you about Y
  • Insight for AI: Entanglement between components A and B can be measured as I(A_failure; B_failure).
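
As a minimal sketch, assuming you log paired pass/fail outcomes for two components on the same inputs, the plug-in estimate of I(A_failure; B_failure) in bits looks like this; zero indicates independent failures, larger values indicate entangled ones.

```python
import numpy as np

def mutual_information_bits(a_fail, b_fail) -> float:
    """Plug-in estimate of I(A;B) in bits from two equal-length 0/1 arrays."""
    joint = np.zeros((2, 2))
    for a, b in zip(a_fail, b_fail):
        joint[int(a), int(b)] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * pb[j]))
    return float(mi)

# Example: failures that co-occur far more often than chance.
a = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
b = np.array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(mutual_information_bits(a, b))   # well above 0: correlated failure behavior
```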

Cover & Thomas (2006) - “Elements of Information Theory”

  • Standard reference for information-theoretic concepts
  • Relevant concepts: channel capacity, rate-distortion, data processing inequality

A fundamental result: information can only be lost, not created, through processing.

If X → Y → Z forms a Markov chain:
I(X; Z) ≤ I(X; Y)

Insight for AI: If context flows Agent → Summary → Verifier, the verifier cannot have more information about the original input than the summary contains. Lossy summarization creates irreversible information loss.
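
A quick numeric check of the inequality, assuming (for the sake of the toy example) that each arrow in the chain is a noisy binary channel:

```python
import numpy as np

def mi_bits(x, y):
    """Plug-in mutual information (bits) for two 0/1 arrays."""
    joint = np.array([[np.mean((x == i) & (y == j)) for j in (0, 1)] for i in (0, 1)])
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 200_000)          # original input seen by the agent
y = x ^ (rng.random(x.size) < 0.1)       # lossy summary: 10% of bits flipped
z = y ^ (rng.random(x.size) < 0.2)       # verifier's view: further degraded

print(mi_bits(x, y), ">=", mi_bits(x, z))   # the data processing inequality holds
```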

Shannon (1948) - Channel capacity limits

  • Every communication channel has a maximum rate of reliable information transfer
  • Insight for AI: Side-channels between components have limited bandwidth. If we can characterize the channel, we can bound the information leakage.
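
For example, if a suspected side-channel can be modelled as a binary symmetric channel (a simplifying assumption), its capacity C = 1 − H(p) bounds the leakage rate regardless of how cleverly the components encode their messages:

```python
import math

def bsc_capacity(p: float) -> float:
    """Capacity in bits per use of a binary symmetric channel with flip probability p."""
    if p in (0.0, 1.0):
        return 1.0
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # binary entropy H(p)
    return 1.0 - h

# A very noisy channel (p = 0.45) leaks at most ~0.007 bits per use, so moving
# even a few hundred bits would take tens of thousands of uses.
print(bsc_capacity(0.45))
```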

Bash et al. (2013) - “Limits of Reliable Communication with Low Probability of Detection”

  • Covert communication in the presence of a warden
  • Insight for AI: Agents attempting to collude while evading detection face fundamental limits on how much information they can exchange.
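
The headline result is a square-root law: over n channel uses, only on the order of √n bits can be exchanged while keeping the probability of detection low. A back-of-the-envelope sketch, with the constant K (which depends on the channel and the detection budget) chosen arbitrarily:

```python
import math

def covert_bit_budget(n_uses: int, k: float = 1.0) -> float:
    """O(sqrt(n)) covert-bit budget; k is channel- and detector-dependent (assumed here)."""
    return k * math.sqrt(n_uses)

for n in (1_000, 100_000, 10_000_000):
    print(n, round(covert_bit_budget(n)))   # the budget grows far more slowly than n
```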

Detecting influence is fundamentally a causal question: does A cause changes in B?

Pearl (2009) - “Causality”

  • Distinguishes observation P(Y|X) from intervention P(Y|do(X))
  • Insight for AI: Correlation between component failures doesn’t prove influence. We need interventional experiments.

The do-calculus provides rules for computing causal effects from observational data when certain assumptions hold (e.g., causal graph is known).
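
A small simulation makes the distinction concrete (all variables here are hypothetical binary stand-ins): a hidden confounder U drives both A's context exposure X and B's verdict Y, so the observational P(Y|X) overstates the causal effect; with the graph known, backdoor adjustment over U recovers P(Y|do(X)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
u = rng.random(n) < 0.5                        # hidden confounder
x = rng.random(n) < np.where(u, 0.8, 0.2)      # context exposure, driven by U
y = rng.random(n) < (0.3 + 0.2 * x + 0.4 * u)  # verdict, driven by X and U

p_y_given_x1 = y[x].mean()                     # observational P(Y=1 | X=1)

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_u P(Y=1 | X=1, U=u) * P(U=u)
p_do = sum(y[x & (u == val)].mean() * (u == val).mean() for val in (True, False))

print(f"P(Y=1 | X=1)     ≈ {p_y_given_x1:.3f}")   # inflated by confounding
print(f"P(Y=1 | do(X=1)) ≈ {p_do:.3f}")           # the causal effect
```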

Pearl (2009), Rubin (1974) - Potential outcomes framework

  • What would have happened if we had intervened differently?
  • Insight for AI: Influence detection via counterfactual: “Would B’s output have been different if A’s context had been different, holding the underlying input constant?”
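
In practice this suggests a paired-intervention probe. The sketch below is schematic: run_verifier and scrub_context are hypothetical stand-ins for your own components; the design point is holding the underlying input fixed while varying only the context that flowed from A, so any verdict change is attributable to that channel.

```python
from typing import Callable, Sequence

def influence_rate(inputs: Sequence,
                   contexts: Sequence,
                   run_verifier: Callable,
                   scrub_context: Callable) -> float:
    """Fraction of inputs where B's verdict flips when A's context is replaced."""
    flips = 0
    for inp, ctx in zip(inputs, contexts):
        verdict_with_ctx = run_verifier(inp, ctx)          # B sees A's actual context
        verdict_without = run_verifier(inp, scrub_context(ctx))  # same input, neutral context
        flips += (verdict_with_ctx != verdict_without)
    return flips / len(inputs)
```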

Spirtes, Glymour, Scheines (2000) - “Causation, Prediction, and Search”

  • Algorithms for discovering causal structure from data
  • Insight for AI: Constraint-based algorithms such as PC and FCI offer a way to probe for hidden influence channels in observational data.

Granger (1969) - Granger Causality

  • X Granger-causes Y if past values of X help predict Y beyond Y’s own past
  • Insight for AI: Temporal influence detection. If A’s past behavior helps predict B’s future behavior, there may be an influence channel.
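
A minimal Granger-style check using plain least squares: does adding A's lagged behavior reduce the error in predicting B's next value beyond B's own history? A large variance ratio is evidence of a temporal influence channel, not proof of causation.

```python
import numpy as np

def lag_matrix(series, lags):
    """Rows are t = lags..T-1; column k holds series[t-k] for k = 1..lags."""
    return np.column_stack([series[lags - k:len(series) - k] for k in range(1, lags + 1)])

def granger_variance_ratio(a, b, lags=2):
    """Residual variance of B ~ own lags, divided by that of B ~ own lags + A's lags."""
    y = b[lags:]
    ones = np.ones((len(y), 1))
    X_r = np.hstack([ones, lag_matrix(b, lags)])                       # restricted model
    X_f = np.hstack([ones, lag_matrix(b, lags), lag_matrix(a, lags)])  # full model
    rss_r = np.sum((y - X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]) ** 2)
    rss_f = np.sum((y - X_f @ np.linalg.lstsq(X_f, y, rcond=None)[0]) ** 2)
    return rss_r / rss_f     # materially > 1: A's past helps predict B's behavior

# Toy example: B echoes A with a one-step delay, so the ratio is large.
rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = np.roll(a, 1) + 0.1 * rng.normal(size=5000)
print(granger_variance_ratio(a, b))
```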

Correlated blind spots connect directly to adversarial ML research.

Szegedy et al. (2013) - “Intriguing Properties of Neural Networks”

  • Discovered adversarial examples and their transferability
  • Insight for AI: Adversarial examples that fool one model often fool others—this is the mechanism behind correlated blind spots.

Papernot et al. (2016) - “Transferability in Machine Learning”

  • Systematic study of why adversarial examples transfer
  • Models with similar decision boundaries are more susceptible to transfer
  • Insight for AI: “Diversity” in verification means diversity in decision boundaries, not just different providers.
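
A simple operational check of that kind of diversity (preds_a, preds_b, and labels are assumed outputs from your own evaluation set): measure whether the two models' errors co-occur, rather than whether the models merely come from different vendors.

```python
import numpy as np

def failure_correlation(preds_a, preds_b, labels) -> float:
    """Pearson correlation between the two models' per-example error indicators."""
    err_a = (np.asarray(preds_a) != np.asarray(labels)).astype(float)
    err_b = (np.asarray(preds_b) != np.asarray(labels)).astype(float)
    if err_a.std() == 0 or err_b.std() == 0:
        return 0.0          # degenerate case: one model never (or always) errs
    return float(np.corrcoef(err_a, err_b)[0, 1])

# Values near 1 mean the "independent" verifiers fail on the same inputs;
# values near 0 mean their blind spots are largely disjoint.
```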

Tramèr et al. (2017) - “Ensemble Adversarial Training”

  • Training on adversarial examples from multiple models improves robustness
  • Insight for AI: Verification layers trained adversarially against each other may have lower correlation.

Cohen et al. (2019) - “Certified Adversarial Robustness via Randomized Smoothing”

  • Provides provable guarantees against adversarial perturbations
  • Insight for AI: For some verification tasks, we can prove bounds on failure rates—but these bounds may still be correlated across components.
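
A sketch of the certification arithmetic, assuming Gaussian smoothing with scale sigma and a lower confidence bound p_a on the smoothed classifier's top-class probability: per Cohen et al., the prediction is stable within an L2 radius of sigma · Φ⁻¹(p_a) when p_a > 1/2.

```python
from statistics import NormalDist

def certified_radius(sigma: float, p_a_lower: float) -> float:
    """L2 radius within which the smoothed prediction provably cannot change."""
    if p_a_lower <= 0.5:
        return 0.0                                    # nothing can be certified
    return sigma * NormalDist().inv_cdf(p_a_lower)    # sigma * inverse normal CDF

print(certified_radius(sigma=0.25, p_a_lower=0.99))   # ≈ 0.58
```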

Human institutions have grappled with entanglement for centuries.

Stigler (1971) - “The Theory of Economic Regulation”

  • Regulation is often designed for the benefit of the regulated, not the public
  • Industries “capture” their regulators through information asymmetry and revolving doors
  • Insight for AI: Verifiers working closely with agents may be captured by agent interests.

Peltzman (1976) - Extended Stigler’s model

  • Regulators balance multiple interest groups
  • Insight for AI: Capture is not binary—it’s a spectrum of influence.

Carpenter & Moss (2013) - “Preventing Regulatory Capture”

  • Comprehensive modern treatment
  • Strategies: transparency, adversarial structure, rotation, external review
  • Insight for AI: These same strategies apply to AI verification systems.

Madison, Hamilton, Jay (1788) - The Federalist Papers

  • Classic arguments for separation of powers and checks and balances
  • Ambition must be made to counteract ambition
  • Insight for AI: Adversarial collaboration patterns have deep historical roots.

Persson, Roland, Tabellini (1997) - “Separation of Powers and Political Accountability”

  • Formal analysis of how separation of powers affects outcomes
  • Insight for AI: Multiple principals with different objectives can prevent capture by any single interest.

Moore et al. (2006) - “Conflicts of Interest and the Case of Auditor Independence”

  • Psychological and economic barriers to auditor independence
  • Even well-intentioned auditors are influenced by relationships
  • Insight for AI: Independence is not a fixed property but requires active maintenance.

Public Company Accounting Oversight Board (PCAOB) - Audit rotation requirements

  • Mandatory rotation of audit partners every 5 years
  • Insight for AI: Forced rotation prevents relationship-based capture.

Entanglements often exhibit complex system behavior.

Buldyrev et al. (2010) - “Catastrophic Cascade of Failures in Interdependent Networks”

  • Interdependent networks are more fragile than independent ones
  • Small failures can cascade across network boundaries
  • Insight for AI: Shared infrastructure creates interdependency. Failures propagate.

Watts (2002) - “A Simple Model of Global Cascades”

  • How local failures become global through network structure
  • Insight for AI: Understanding cascade topology helps design circuit breakers.
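
A compact sketch of the Watts threshold model (networkx is an assumed dependency; the parameters are illustrative): each component fails once the failed fraction of its neighbors reaches its threshold, and a single seed failure either stays local or sweeps the graph depending on topology.

```python
import random
import networkx as nx

def cascade_size(n=1000, avg_degree=4.0, threshold=0.18, seed=0):
    """Fraction of nodes that end up failed after seeding one failure (Watts-style cascade)."""
    random.seed(seed)
    g = nx.gnp_random_graph(n, avg_degree / n, seed=seed)
    failed = {random.randrange(n)}                  # one initial failure
    changed = True
    while changed:
        changed = False
        for node in g.nodes:
            if node in failed or g.degree(node) == 0:
                continue
            frac = sum(nb in failed for nb in g.neighbors(node)) / g.degree(node)
            if frac >= threshold:                   # enough neighbors failed
                failed.add(node)
                changed = True
    return len(failed) / n

print(cascade_size())   # near 0: failure stayed local; near 1: global cascade
```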

Holland (1998) - “Emergence: From Chaos to Order”

  • Complex behaviors emerge from simple rules
  • Insight for AI: Collusion and coordination may emerge without explicit design. Watch for emergent entanglements.

Kauffman (1993) - “The Origins of Order”

  • Self-organization in complex adaptive systems
  • Insight for AI: Multi-agent AI systems may self-organize in unexpected ways.

The TDT-style coordination problem connects to foundational questions in decision theory.

Yudkowsky (2010) - “Timeless Decision Theory”

  • Agents reason about their decision procedure, not just their actions
  • Similar agents reach similar conclusions through logical correlation
  • Insight for AI: Agents with similar training may coordinate without communication—a form of entanglement that can’t be blocked by isolating communication channels.

Soares & Fallenstein (2017) - “Functional Decision Theory”

  • Refined version of TDT
  • Insight for AI: The relevance to AI systems is debated but concerning. If AI agents engage in FDT-style reasoning, standard independence assumptions break down.

Garrabrant et al. (2016) - “Logical Induction”

  • Framework for reasoning under logical uncertainty
  • Insight for AI: Agents reasoning about what other similar agents will do face logical uncertainty. This connects to coordination without communication.

Understanding entanglement at the model level requires interpretability.

Kornblith et al. (2019) - “Similarity of Neural Network Representations Revisited”

  • Centered Kernel Alignment (CKA) for comparing representations
  • Insight for AI: High CKA between verification models suggests they may share blind spots.
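
A minimal linear-CKA sketch, assuming X and Y are activation matrices of shape (n_examples, n_features) collected from two verification models on the same probe inputs:

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices; values near 1 mean very similar representations."""
    x = x - x.mean(axis=0)                 # center each feature
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))

# High CKA between nominally independent verifiers is a warning sign of shared blind spots.
```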

Meng et al. (2022) - “Locating and Editing Factual Associations in GPT”

  • Techniques for understanding which components of a model affect which outputs
  • Insight for AI: Could be used to trace how context from one component influences another’s decisions.

Elhage et al. (2022) - “Toy Models of Superposition”; Anthropic circuits work (2023-2024)

  • Understanding what features models represent and how they compute
  • Insight for AI: If we understood what “features” verification models use, we could assess whether they’re truly diverse.

Summary: Key Takeaways from the Literature

| Field | Key Insight for Entanglement |
| --- | --- |
| Principal-Agent Theory | Some agency costs are unavoidable; design to minimize total cost |
| Mechanism Design | Not all desirable properties are achievable; some are fundamentally impossible |
| Game Theory | Repeated interaction enables collusion; rotation disrupts it |
| Information Theory | Entanglement is quantifiable; lossy channels create irreversible information loss |
| Causal Inference | Influence detection requires intervention, not just correlation |
| Adversarial ML | Model similarity predicts failure correlation; diversity means different decision boundaries |
| Organizational Theory | Capture is a spectrum; independence requires active maintenance |
| Complex Systems | Entanglements may emerge without design; cascades require structural understanding |
| Decision Theory | Logical correlation may be irreducible for similar reasoning systems |
| Interpretability | Understanding what models compute could reveal shared failure modes |

  • Laffont & Martimort (2002) - “The Theory of Incentives: The Principal-Agent Model”
  • Cover & Thomas (2006) - “Elements of Information Theory”
  • Pearl (2009) - “Causality: Models, Reasoning, and Inference”
  • Mas-Colell, Whinston, Green (1995) - “Microeconomic Theory” (mechanism design chapters)
  • Bolton & Dewatripont (2005) - “Contract Theory”
  • Chakraborty & Yılmaz (2017) - “Authority, Consensus, and Governance”
  • Goodfellow et al. (2018) - “Making Machine Learning Robust Against Adversarial Inputs”
  • Christiano et al. (2017) - “Deep Reinforcement Learning from Human Preferences”
  • Irving et al. (2018) - “AI Safety via Debate”
  • Hubinger et al. (2019) - “Risks from Learned Optimization”
