Research Connections
The challenges of entanglement in AI delegation systems connect to rich bodies of academic research. This page surveys relevant literature and highlights key insights for practitioners.
Principal-Agent Theory
The delegation framework is fundamentally an application of principal-agent theory from economics—the study of how principals (those who delegate) can align the behavior of agents (those who act on their behalf) when interests may diverge and information is asymmetric.
Key Concepts
| Concept | Definition | Entanglement Relevance |
|---|---|---|
| Moral hazard | Agent takes hidden actions that affect outcomes | Agent may act in ways verifier can’t observe |
| Adverse selection | Agent has hidden information principal lacks | Agent knows its own capabilities/intentions better |
| Information asymmetry | Unequal knowledge between parties | Context contamination exploits this |
| Incentive compatibility | Mechanism where truthful behavior is optimal | Designing systems where honesty is the best strategy |
Foundational Papers
Jensen & Meckling (1976) - “Theory of the Firm”
- Introduced agency costs: monitoring costs, bonding costs, residual loss
- Insight for AI: Even with monitoring, some agency costs are unavoidable. The question is minimizing total cost.
Holmström (1979) - “Moral Hazard and Observability”
- Showed that optimal contracts depend on what’s observable
- Insight for AI: Verification is only as good as what can be observed. Unobservable agent reasoning creates irreducible risk.
Holmström & Milgrom (1991) - “Multitask Principal-Agent Analyses”
- When agents have multiple tasks, incentivizing one may distort others
- Insight for AI: Optimizing for measurable metrics may cause agents to neglect unmeasured but important behaviors.
The Multi-Principal Problem
When multiple principals oversee the same agent (or multiple agents oversee each other), new dynamics emerge:
Bernheim & Whinston (1986) - “Common Agency”
- Agents serving multiple principals face conflicting incentives
- Insight for AI: A verifier answering to both an accuracy-focused principal and a speed-focused principal may compromise on both.
Dixit (1996) - “The Making of Economic Policy”
- Political economy of multiple principals with different objectives
- Insight for AI: Organizational entanglement often stems from conflicting stakeholder interests.
Game Theory & Mechanism Design
Adversarial entanglements—collusion, capture, strategic coordination—are fundamentally game-theoretic phenomena.
Collusion Resistance
Laffont & Martimort (1997) - “Collusion under Asymmetric Information”
- Formal analysis of when and how agents collude
- Collusion is easier when agents can communicate and have aligned interests
- Insight for AI: Minimizing communication channels and creating opposed incentives reduces collusion risk.
Che & Kim (2006) - “Robustly Collusion-Proof Implementation”
- Mechanisms that remain incentive-compatible even if agents can collude
- Insight for AI: Some verification architectures are inherently more collusion-resistant than others.
Correlated Equilibrium
Aumann (1974, 1987) - Correlated Equilibrium
- Players can coordinate using shared signals without explicit communication
- Insight for AI: TDT-style logical correlation is a form of correlated equilibrium. Agents with similar “types” (training) may coordinate without communication.
Mechanism Design for Verification
Maskin (1999) - “Nash Equilibrium and Welfare Optimality”
- When can social choice functions be implemented?
- Insight for AI: Not all desirable verification properties are achievable through mechanism design. Some are fundamentally impossible.
Myerson (1981) - “Optimal Auction Design”
- Revelation principle: any mechanism can be replicated by one where agents truthfully reveal information
- Insight for AI: If an agent can be incentivized to behave well through complex monitoring, the revelation principle implies an equivalent direct mechanism in which truthful reporting achieves the same outcome. But finding that mechanism may be hard.
Repeated Games and Reputation
Fudenberg & Maskin (1986) - Folk Theorem
- In repeated interactions, many outcomes (including collusion) become sustainable
- Insight for AI: Long-term agent-verifier relationships enable implicit collusion. Rotation disrupts this.
Kreps & Wilson (1982) - “Reputation and Imperfect Information”
- Reputation building in games with incomplete information
- Insight for AI: Agents may build reputation specifically to exploit it later. Reputation is not a reliable signal of future behavior.
Information Theory
Information theory provides precise tools for quantifying entanglement.
Mutual Information
Shannon (1948) - “A Mathematical Theory of Communication”
- Mutual information I(X;Y) measures how much knowing X tells you about Y
- Insight for AI: Entanglement between components A and B can be measured as I(A_failure; B_failure).
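A minimal sketch of that measurement, assuming you have per-input failure logs for two components; the data below is synthetic and the variable names are illustrative:

```python
# Estimate I(A_failure; B_failure) from logged evaluation runs.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 10_000

# Synthetic logs: a shared latent "hard input" factor drives correlated failures.
hard = rng.random(n) < 0.2
a_fail = (hard & (rng.random(n) < 0.7)) | (rng.random(n) < 0.05)
b_fail = (hard & (rng.random(n) < 0.6)) | (rng.random(n) < 0.05)

# mutual_info_score estimates MI from the empirical joint distribution (in nats).
mi_bits = mutual_info_score(a_fail, b_fail) / np.log(2)
print(f"I(A_failure; B_failure) ≈ {mi_bits:.4f} bits")  # 0 would mean independent failures
```

Near-zero mutual information is consistent with independent failures; anything materially above zero suggests the components share a failure mode.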
Cover & Thomas (2006) - “Elements of Information Theory”
- Standard reference for information-theoretic concepts
- Relevant concepts: channel capacity, rate-distortion, data processing inequality
Data Processing Inequality
A fundamental result: information can only be lost, not created, through processing.
If X → Y → Z forms a Markov chain:
I(X; Z) ≤ I(X; Y)
- Insight for AI: If context flows Agent → Summary → Verifier, the verifier cannot have more information about the original input than the summary contains. Lossy summarization creates irreversible information loss.
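A toy numerical check of the inequality, assuming a binary pipeline in which each stage flips its input with 10% probability (all values are synthetic):

```python
# X -> Y -> Z Markov chain: Z is computed from Y alone, so I(X;Z) cannot exceed I(X;Y).
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
n = 100_000

x = rng.integers(0, 2, n)                    # original context
y = np.where(rng.random(n) < 0.9, x, 1 - x)  # lossy "summary" of x
z = np.where(rng.random(n) < 0.9, y, 1 - y)  # verifier's view, derived from y alone

print("I(X;Y) =", mutual_info_score(x, y))   # what the summary retains about x
print("I(X;Z) =", mutual_info_score(x, z))   # bounded above by I(X;Y)
```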
Channel Capacity for Covert Communication
Shannon (1948) - Channel capacity limits
- Every communication channel has a maximum rate of reliable information transfer
- Insight for AI: Side-channels between components have limited bandwidth. If we can characterize the channel, we can bound the information leakage.
Bash et al. (2013) - “Limits of Reliable Communication with Low Probability of Detection”
- Covert communication in the presence of a warden
- Insight for AI: Agents attempting to collude while evading detection face fundamental limits on how much information they can exchange.
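As an illustration of how characterizing a channel bounds leakage, here is a toy calculation that assumes a suspected side channel can be modeled as a binary symmetric channel; the 0.4 crossover probability is made up for the example:

```python
# Capacity of a binary symmetric channel: C = 1 - H(p), in bits per channel use.
import math

def bsc_capacity(p: float) -> float:
    if p in (0.0, 1.0):
        return 1.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy H(p)
    return 1.0 - h

# If each agent-verifier exchange flips the covert bit 40% of the time, at most
# ~0.029 bits can be smuggled per exchange, regardless of the encoding used.
print(bsc_capacity(0.4))
```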
Causal Inference
Detecting influence is fundamentally a causal question: does A cause changes in B?
Intervention and Do-Calculus
Pearl (2009) - “Causality”
- Distinguishes observation P(Y|X) from intervention P(Y|do(X))
- Insight for AI: Correlation between component failures doesn’t prove influence. We need interventional experiments.
The do-calculus provides rules for computing causal effects from observational data when certain assumptions hold (e.g., the causal graph is known).
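A minimal sketch of what an interventional test could look like in practice; run_verifier and the context convention are hypothetical placeholders, not an established API:

```python
# Intervene on the context B sees -- do(context) -- while holding each task fixed,
# instead of merely observing P(B_verdict | A_context).
from statistics import mean

def run_verifier(task, context):
    """Hypothetical: returns 1 if verifier B flags `task`, given visible `context`."""
    raise NotImplementedError

def influence_gap(tasks, agent_context):
    treated = [run_verifier(t, context=agent_context) for t in tasks]  # do(context = A's framing)
    control = [run_verifier(t, context="") for t in tasks]             # do(context = neutral)
    # A large gap in flag rates across the two arms is evidence of a causal
    # influence channel, not merely a correlation between A's and B's outputs.
    return mean(treated) - mean(control)
```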
Counterfactual Reasoning
Pearl (2009), Rubin (1974) - Potential outcomes framework
- What would have happened if we had intervened differently?
- Insight for AI: Influence detection via counterfactual: “Would B’s output have been different if A’s context had been different, holding the underlying input constant?”
Causal Discovery
Spirtes, Glymour, Scheines (2000) - “Causation, Prediction, and Search”
- Algorithms for discovering causal structure from data
- Insight for AI: Causal discovery algorithms such as PC and FCI could, in principle, uncover hidden influence channels from observational data.
Granger (1969) - Granger Causality
- X Granger-causes Y if past values of X help predict Y beyond Y’s own past
- Insight for AI: Temporal influence detection. If A’s past behavior helps predict B’s future behavior, there may be an influence channel.
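A minimal sketch of such a test using statsmodels, assuming you have aligned time series of per-step behavior metrics for the two components; the series below are synthetic, with B lagging A by three steps:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(2)
a_scores = rng.normal(size=500)                                          # component A's metric
b_scores = 0.5 * np.roll(a_scores, 3) + rng.normal(scale=0.5, size=500)  # B lags A

# Column order matters: the test asks whether the SECOND column helps predict the FIRST.
data = np.column_stack([b_scores, a_scores])
results = grangercausalitytests(data, maxlag=5)

for lag, (tests, _) in results.items():
    print(f"lag {lag}: ssr F-test p-value = {tests['ssr_ftest'][1]:.3g}")
```

Small p-values at some lag are evidence that A's history carries predictive information about B, which is a reason to look for an influence channel (though not proof of one).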
Adversarial Machine Learning
Correlated blind spots connect directly to adversarial ML research.
Transferability of Adversarial Examples
Szegedy et al. (2013) - “Intriguing Properties of Neural Networks”
- Discovered adversarial examples and their transferability
- Insight for AI: Adversarial examples that fool one model often fool others—this is the mechanism behind correlated blind spots.
Papernot et al. (2016) - “Transferability in Machine Learning”
- Systematic study of why adversarial examples transfer
- Models with similar decision boundaries are more susceptible to transfer
- Insight for AI: “Diversity” in verification means diversity in decision boundaries, not just different providers.
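A sketch of how transfer could be measured between two verification models; model_a, model_b, and craft_adversarial are hypothetical stand-ins for your own models and attack of choice, not a specific library:

```python
# Fraction of adversarial examples crafted against model A that also fool model B.
import numpy as np

def transfer_rate(model_a, model_b, craft_adversarial, inputs, labels):
    adv = craft_adversarial(model_a, inputs, labels)   # e.g., an FGSM/PGD-style attack run against A only
    fools_a = model_a.predict(adv) != labels
    fools_b = model_b.predict(adv) != labels
    if not fools_a.any():
        return 0.0
    # High transfer => similar decision boundaries => correlated blind spots,
    # even if the two models come from different providers.
    return float((fools_a & fools_b).sum() / fools_a.sum())
```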
Tramèr et al. (2017) - “Ensemble Adversarial Training”
- Training on adversarial examples from multiple models improves robustness
- Insight for AI: Verification layers trained adversarially against each other may have lower correlation.
Certified Robustness
Cohen et al. (2019) - “Certified Adversarial Robustness via Randomized Smoothing”
- Provides provable guarantees against adversarial perturbations
- Insight for AI: For some verification tasks, we can prove bounds on failure rates—but these bounds may still be correlated across components.
Organizational Theory
Human institutions have grappled with entanglement for centuries.
Regulatory Capture
Stigler (1971) - “The Theory of Economic Regulation”
- Regulation is often designed for the benefit of the regulated, not the public
- Industries “capture” their regulators through information asymmetry and revolving doors
- Insight for AI: Verifiers working closely with agents may be captured by agent interests.
Peltzman (1976) - “Toward a More General Theory of Regulation”
- Extended Stigler’s model: regulators balance multiple interest groups
- Insight for AI: Capture is not binary—it’s a spectrum of influence.
Carpenter & Moss (2013) - “Preventing Regulatory Capture”
- Comprehensive modern treatment
- Strategies: transparency, adversarial structure, rotation, external review
- Insight for AI: These same strategies apply to AI verification systems.
Separation of Powers
Madison, Hamilton, Jay (1788) - The Federalist Papers
- Classic arguments for separation of powers and checks and balances
- “Ambition must be made to counteract ambition” (Federalist No. 51)
- Insight for AI: Adversarial collaboration patterns have deep historical roots.
Persson, Roland, Tabellini (1997) - “Separation of Powers and Political Accountability”
- Formal analysis of how separation of powers affects outcomes
- Insight for AI: Multiple principals with different objectives can prevent capture by any single interest.
Audit Independence
Moore et al. (2006) - “Conflicts of Interest and the Case of Auditor Independence”
- Psychological and economic barriers to auditor independence
- Even well-intentioned auditors are influenced by relationships
- Insight for AI: Independence is not a fixed property but requires active maintenance.
Public Company Accounting Oversight Board (PCAOB) - Audit rotation requirements
- Mandatory rotation of audit partners every 5 years
- Insight for AI: Forced rotation prevents relationship-based capture.
Complex Systems & Network Science
Entanglements often exhibit complex system behavior.
Cascading Failures
Buldyrev et al. (2010) - “Catastrophic Cascade of Failures in Interdependent Networks”
- Interdependent networks are more fragile than independent ones
- Small failures can cascade across network boundaries
- Insight for AI: Shared infrastructure creates interdependency. Failures propagate.
Watts (2002) - “A Simple Model of Global Cascades”
- How local failures become global through network structure
- Insight for AI: Understanding cascade topology helps design circuit breakers.
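To make the cascade mechanism concrete, here is a toy threshold-cascade simulation in the spirit of Watts’s model, assuming the dependency graph between components is available; the random graphs below are stand-ins for a real system map:

```python
# A node fails once more than `threshold` of its neighbours have failed.
import networkx as nx

def cascade_size(g: nx.Graph, seed_node=0, threshold=0.3) -> int:
    failed, changed = {seed_node}, True
    while changed:
        changed = False
        for node in g.nodes:
            if node in failed:
                continue
            nbrs = list(g.neighbors(node))
            if nbrs and sum(n in failed for n in nbrs) / len(nbrs) > threshold:
                failed.add(node)
                changed = True
    return len(failed)

sparse = nx.erdos_renyi_graph(500, 3 / 500, seed=1)   # average degree ~3
dense = nx.erdos_renyi_graph(500, 8 / 500, seed=1)    # average degree ~8
print("sparse graph cascade:", cascade_size(sparse))  # typically spreads much further
print("dense graph cascade:", cascade_size(dense))    # typically stays local
```

The point of the comparison is Watts’s: whether a single failure stays local or goes global depends on topology, which is why circuit breakers must be placed with the dependency structure in mind.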
Emergence and Self-Organization
Holland (1998) - “Emergence: From Chaos to Order”
- Complex behaviors emerge from simple rules
- Insight for AI: Collusion and coordination may emerge without explicit design. Watch for emergent entanglements.
Kauffman (1993) - “The Origins of Order”
- Self-organization in complex adaptive systems
- Insight for AI: Multi-agent AI systems may self-organize in unexpected ways.
Decision Theory
The TDT-style coordination problem connects to foundational questions in decision theory.
Timeless Decision Theory
Yudkowsky (2010) - “Timeless Decision Theory”
- Agents reason about their decision procedure, not just their actions
- Similar agents reach similar conclusions through logical correlation
- Insight for AI: Agents with similar training may coordinate without communication—a form of entanglement that can’t be blocked by isolating communication channels.
Functional Decision Theory
Yudkowsky & Soares (2017) - “Functional Decision Theory”
- Refined version of TDT
- Insight for AI: The relevance to deployed AI systems is debated, but the implications are concerning: if AI agents engage in FDT-style reasoning, standard independence assumptions break down.
Logical Uncertainty
Garrabrant et al. (2016) - “Logical Induction”
- Framework for reasoning under logical uncertainty
- Insight for AI: Agents reasoning about what other similar agents will do face logical uncertainty. This connects to coordination without communication.
Interpretability Research
Understanding entanglement at the model level requires interpretability.
Representation Similarity
Kornblith et al. (2019) - “Similarity of Neural Network Representations Revisited”
- Centered Kernel Alignment (CKA) for comparing representations
- Insight for AI: High CKA between verification models suggests they may share blind spots.
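A compact sketch of linear CKA as formulated in that paper; the activation matrices below are synthetic stand-ins for activations collected from two verifier models on the same prompts:

```python
# Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), with features centered.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    x = x - x.mean(axis=0)                       # center each feature column
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2   # cross-similarity of the two representations
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(hsic / (norm_x * norm_y))

# Hypothetical usage: rows are the same 1,000 prompts, columns are hidden units.
x = np.random.randn(1000, 512)
y = x @ np.random.randn(512, 768) + 0.1 * np.random.randn(1000, 768)
print(f"CKA ≈ {linear_cka(x, y):.3f}")           # values near 1.0 indicate highly similar representations
```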
Activation Patching
Meng et al. (2022) - “Locating and Editing Factual Associations in GPT”
- Techniques for understanding which components of a model affect which outputs
- Insight for AI: Could be used to trace how context from one component influences another’s decisions.
Circuits and Features
Elhage et al. (2022) - “Toy Models of Superposition”
Anthropic (2023-2024) - Circuits work
- Understanding what features models represent and how they compute
- Insight for AI: If we understood what “features” verification models use, we could assess whether they’re truly diverse.
Summary: Key Takeaways from the Literature
| Field | Key Insight for Entanglement |
|---|---|
| Principal-Agent Theory | Some agency costs are unavoidable; design to minimize total cost |
| Mechanism Design | Not all desirable properties are achievable; some are fundamentally impossible |
| Game Theory | Repeated interaction enables collusion; rotation disrupts it |
| Information Theory | Entanglement is quantifiable; lossy channels create irreversible information loss |
| Causal Inference | Influence detection requires intervention, not just correlation |
| Adversarial ML | Model similarity predicts failure correlation; diversity means different decision boundaries |
| Organizational Theory | Capture is a spectrum; independence requires active maintenance |
| Complex Systems | Entanglements may emerge without design; cascades require structural understanding |
| Decision Theory | Logical correlation may be irreducible for similar reasoning systems |
| Interpretability | Understanding what models compute could reveal shared failure modes |
Further Reading
Textbooks
- Laffont & Martimort (2002) - “The Theory of Incentives: The Principal-Agent Model”
- Cover & Thomas (2006) - “Elements of Information Theory”
- Pearl (2009) - “Causality: Models, Reasoning, and Inference”
- Mas-Colell, Whinston, Green (1995) - “Microeconomic Theory” (mechanism design chapters)
Survey Papers
- Bolton & Dewatripont (2005) - “Contract Theory”
- Chakraborty & Yılmaz (2017) - “Authority, Consensus, and Governance”
- Goodfellow et al. (2018) - “Making Machine Learning Robust Against Adversarial Inputs”
AI Safety Specific
- Christiano et al. (2017) - “Deep Reinforcement Learning from Human Preferences”
- Irving et al. (2018) - “AI Safety via Debate”
- Hubinger et al. (2019) - “Risks from Learned Optimization”
See also:
- Formal Definitions - Mathematical formalization of entanglement
- Quantitative Metrics - Measuring entanglement in practice
- Historical Case Studies - Real-world examples of entanglement failures