Research Connections
The challenges of entanglement in AI delegation systems connect to rich bodies of academic research. This page surveys relevant literature and highlights key insights for practitioners.
Principal-Agent Theory
The delegation framework is fundamentally an application of principal-agent theory from economics—the study of how principals (those who delegate) can align the behavior of agents (those who act on their behalf) when interests may diverge and information is asymmetric.
Key Concepts
| Concept | Definition | Entanglement Relevance |
|---|---|---|
| Moral hazard | Agent takes hidden actions that affect outcomes | Agent may act in ways verifier can’t observe |
| Adverse selection | Agent has hidden information principal lacks | Agent knows its own capabilities/intentions better |
| Information asymmetry | Unequal knowledge between parties | Context contamination exploits this |
| Incentive compatibility | Mechanism where truthful behavior is optimal | Designing systems where honesty is the best strategy |
Foundational Papers
Jensen & Meckling (1976) - “Theory of the Firm”
- Introduced agency costs: monitoring costs, bonding costs, residual loss
- Insight for AI: Even with monitoring, some agency costs are unavoidable. The question is minimizing total cost.
Holmström (1979) - “Moral Hazard and Observability”
- Showed that optimal contracts depend on what’s observable
- Insight for AI: Verification is only as good as what can be observed. Unobservable agent reasoning creates irreducible risk.
Holmström & Milgrom (1991) - “Multitask Principal-Agent Analyses”
- When agents have multiple tasks, incentivizing one may distort others
- Insight for AI: Optimizing for measurable metrics may cause agents to neglect unmeasured but important behaviors.
The Multi-Principal Problem
When multiple principals oversee the same agent (or multiple agents oversee each other), new dynamics emerge:
Bernheim & Whinston (1986) - “Common Agency”
- Agents serving multiple principals face conflicting incentives
- Insight for AI: A verifier answering to both an accuracy-focused principal and a speed-focused principal may compromise on both.
Dixit (1996) - “The Making of Economic Policy”
- Political economy of multiple principals with different objectives
- Insight for AI: Organizational entanglement often stems from conflicting stakeholder interests.
Game Theory & Mechanism Design
Adversarial entanglements—collusion, capture, strategic coordination—are fundamentally game-theoretic phenomena.
Collusion Resistance
Laffont & Martimort (1997) - “Collusion under Asymmetric Information”
- Formal analysis of when and how agents collude
- Collusion is easier when agents can communicate and have aligned interests
- Insight for AI: Minimizing communication channels and creating opposed incentives reduces collusion risk.
Che & Kim (2006) - “Robustly Collusion-Proof Implementation”
- Mechanisms that remain incentive-compatible even if agents can collude
- Insight for AI: Some verification architectures are inherently more collusion-resistant than others.
Correlated Equilibrium
Aumann (1974, 1987) - Correlated Equilibrium
- Players can coordinate using shared signals without explicit communication
- Insight for AI: TDT-style logical correlation is a form of correlated equilibrium. Agents with similar “types” (training) may coordinate without communication.
Mechanism Design for Verification
Maskin (1999) - “Nash Equilibrium and Welfare Optimality”
- When can social choice functions be implemented?
- Insight for AI: Not all desirable verification properties are achievable through mechanism design. Some are fundamentally impossible.
Myerson (1981) - “Optimal Auction Design”
- Revelation principle: any mechanism can be replicated by one where agents truthfully reveal information
- Insight for AI: If an agent can be incentivized to behave well through complex monitoring, the revelation principle implies an equivalent direct mechanism in which truthful reporting achieves the same outcome. But finding that mechanism may be hard.
Repeated Games and Reputation
Fudenberg & Maskin (1986) - Folk Theorem
- In repeated interactions, many outcomes (including collusion) become sustainable
- Insight for AI: Long-term agent-verifier relationships enable implicit collusion. Rotation disrupts this.
Kreps & Wilson (1982) - “Reputation and Imperfect Information”
- Reputation building in games with incomplete information
- Insight for AI: Agents may build reputation specifically to exploit it later. Reputation is not a reliable signal of future behavior.
Information Theory
Information theory provides precise tools for quantifying entanglement.
Mutual Information
Shannon (1948) - “A Mathematical Theory of Communication”
- Mutual information I(X;Y) measures how much knowing X tells you about Y
- Insight for AI: Entanglement between components A and B can be measured as I(A_failure; B_failure).
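A minimal sketch of that measurement, assuming you have per-input failure logs for two components; the data below is synthetic and the variable names are illustrative:

```python
# Estimate I(A_failure; B_failure) from logged evaluation runs.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 10_000

# Synthetic logs: a shared latent "hard input" factor drives correlated failures.
hard = rng.random(n) < 0.2
a_fail = (hard & (rng.random(n) < 0.7)) | (rng.random(n) < 0.05)
b_fail = (hard & (rng.random(n) < 0.6)) | (rng.random(n) < 0.05)

# mutual_info_score estimates MI from the empirical joint distribution (in nats).
mi_bits = mutual_info_score(a_fail, b_fail) / np.log(2)
print(f"I(A_failure; B_failure) ≈ {mi_bits:.4f} bits")  # 0 would mean independent failures
```

Near-zero mutual information is consistent with independent failures; anything materially above zero suggests the components share a failure mode.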
Cover & Thomas (2006) - “Elements of Information Theory”
- Standard reference for information-theoretic concepts
- Relevant concepts: channel capacity, rate-distortion, data processing inequality
Data Processing Inequality
A fundamental result: information can only be lost, not created, through processing.
If X → Y → Z forms a Markov chain:
I(X; Z) ≤ I(X; Y)
- Insight for AI: If context flows Agent → Summary → Verifier, the verifier cannot have more information about the original input than the summary contains. Lossy summarization creates irreversible information loss.
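A toy numerical check of the inequality, assuming a binary pipeline in which each stage flips its input with 10% probability (all values are synthetic):

```python
# X -> Y -> Z Markov chain: Z is computed from Y alone, so I(X;Z) cannot exceed I(X;Y).
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
n = 100_000

x = rng.integers(0, 2, n)                    # original context
y = np.where(rng.random(n) < 0.9, x, 1 - x)  # lossy "summary" of x
z = np.where(rng.random(n) < 0.9, y, 1 - y)  # verifier's view, derived from y alone

print("I(X;Y) =", mutual_info_score(x, y))   # what the summary retains about x
print("I(X;Z) =", mutual_info_score(x, z))   # bounded above by I(X;Y)
```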
Channel Capacity for Covert Communication
Shannon (1948) - Channel capacity limits
- Every communication channel has a maximum rate of reliable information transfer
- Insight for AI: Side-channels between components have limited bandwidth. If we can characterize the channel, we can bound the information leakage.
Bash et al. (2013) - “Limits of Reliable Communication with Low Probability of Detection”
- Covert communication in the presence of a warden
- Insight for AI: Agents attempting to collude while evading detection face fundamental limits on how much information they can exchange.
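As an illustration of how characterizing a channel bounds leakage, here is a toy calculation that assumes a suspected side channel can be modeled as a binary symmetric channel; the 0.4 crossover probability is made up for the example:

```python
# Capacity of a binary symmetric channel: C = 1 - H(p), in bits per channel use.
import math

def bsc_capacity(p: float) -> float:
    if p in (0.0, 1.0):
        return 1.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy H(p)
    return 1.0 - h

# If each agent-verifier exchange flips the covert bit 40% of the time, at most
# ~0.029 bits can be smuggled per exchange, regardless of the encoding used.
print(bsc_capacity(0.4))
```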
Causal Inference
Detecting influence is fundamentally a causal question: does A cause changes in B?
Intervention and Do-Calculus
Pearl (2009) - “Causality”
- Distinguishes observation P(Y|X) from intervention P(Y|do(X))
- Insight for AI: Correlation between component failures doesn’t prove influence. We need interventional experiments.
The do-calculus provides rules for computing causal effects from observational data when certain assumptions hold (e.g., the causal graph is known).
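A minimal sketch of what an interventional test could look like in practice; run_verifier and the context convention are hypothetical placeholders, not an established API:

```python
# Intervene on the context B sees -- do(context) -- while holding each task fixed,
# instead of merely observing P(B_verdict | A_context).
from statistics import mean

def run_verifier(task, context):
    """Hypothetical: returns 1 if verifier B flags `task`, given visible `context`."""
    raise NotImplementedError

def influence_gap(tasks, agent_context):
    treated = [run_verifier(t, context=agent_context) for t in tasks]  # do(context = A's framing)
    control = [run_verifier(t, context="") for t in tasks]             # do(context = neutral)
    # A large gap in flag rates across the two arms is evidence of a causal
    # influence channel, not merely a correlation between A's and B's outputs.
    return mean(treated) - mean(control)
```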
Counterfactual Reasoning
Pearl (2009), Rubin (1974) - Potential outcomes framework
- What would have happened if we had intervened differently?
- Insight for AI: Influence detection via counterfactual: “Would B’s output have been different if A’s context had been different, holding the underlying input constant?”
Causal Discovery
Spirtes, Glymour, Scheines (2000) - “Causation, Prediction, and Search”
- Algorithms for discovering causal structure from data
- Insight for AI: Causal discovery algorithms such as PC and FCI could, in principle, uncover hidden influence channels from observational data.
Granger (1969) - Granger Causality
- X Granger-causes Y if past values of X help predict Y beyond Y’s own past
- Insight for AI: Temporal influence detection. If A’s past behavior helps predict B’s future behavior, there may be an influence channel.
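A minimal sketch of such a test using statsmodels, assuming you have aligned time series of per-step behavior metrics for the two components; the series below are synthetic, with B lagging A by three steps:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(2)
a_scores = rng.normal(size=500)                                          # component A's metric
b_scores = 0.5 * np.roll(a_scores, 3) + rng.normal(scale=0.5, size=500)  # B lags A

# Column order matters: the test asks whether the SECOND column helps predict the FIRST.
data = np.column_stack([b_scores, a_scores])
results = grangercausalitytests(data, maxlag=5)

for lag, (tests, _) in results.items():
    print(f"lag {lag}: ssr F-test p-value = {tests['ssr_ftest'][1]:.3g}")
```

Small p-values at some lag are evidence that A's history carries predictive information about B, which is a reason to look for an influence channel (though not proof of one).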
Adversarial Machine Learning
Correlated blind spots connect directly to adversarial ML research.
Transferability of Adversarial Examples
Szegedy et al. (2013) - “Intriguing Properties of Neural Networks”
- Discovered adversarial examples and their transferability
- Insight for AI: Adversarial examples that fool one model often fool others—this is the mechanism behind correlated blind spots.
Papernot et al. (2016) - “Transferability in Machine Learning”
- Systematic study of why adversarial examples transfer
- Models with similar decision boundaries are more susceptible to transfer
- Insight for AI: “Diversity” in verification means diversity in decision boundaries, not just different providers.
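A sketch of how transfer could be measured between two verification models; model_a, model_b, and craft_adversarial are hypothetical stand-ins for your own models and attack of choice, not a specific library:

```python
# Fraction of adversarial examples crafted against model A that also fool model B.
import numpy as np

def transfer_rate(model_a, model_b, craft_adversarial, inputs, labels):
    adv = craft_adversarial(model_a, inputs, labels)   # e.g., an FGSM/PGD-style attack run against A only
    fools_a = model_a.predict(adv) != labels
    fools_b = model_b.predict(adv) != labels
    if not fools_a.any():
        return 0.0
    # High transfer => similar decision boundaries => correlated blind spots,
    # even if the two models come from different providers.
    return float((fools_a & fools_b).sum() / fools_a.sum())
```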
Tramèr et al. (2017) - “Ensemble Adversarial Training”
- Training on adversarial examples from multiple models improves robustness
- Insight for AI: Verification layers trained adversarially against each other may have lower correlation.
Certified Robustness
Cohen et al. (2019) - “Certified Adversarial Robustness via Randomized Smoothing”
- Provides provable guarantees against adversarial perturbations
- Insight for AI: For some verification tasks, we can prove bounds on failure rates—but these bounds may still be correlated across components.
Organizational Theory
Human institutions have grappled with entanglement for centuries.
Regulatory Capture
Stigler (1971) - “The Theory of Economic Regulation”
- Regulation is often designed for the benefit of the regulated, not the public
- Industries “capture” their regulators through information asymmetry and revolving doors
- Insight for AI: Verifiers working closely with agents may be captured by agent interests.
Peltzman (1976) - “Toward a More General Theory of Regulation”
- Extended Stigler’s model: regulators balance multiple interest groups
- Insight for AI: Capture is not binary—it’s a spectrum of influence.
Carpenter & Moss (2013) - “Preventing Regulatory Capture”
- Comprehensive modern treatment
- Strategies: transparency, adversarial structure, rotation, external review
- Insight for AI: These same strategies apply to AI verification systems.
Separation of Powers
Madison, Hamilton, Jay (1788) - The Federalist Papers
- Classic arguments for separation of powers and checks and balances
- “Ambition must be made to counteract ambition” (Federalist No. 51)
- Insight for AI: Adversarial collaboration patterns have deep historical roots.
Persson, Roland, Tabellini (1997) - “Separation of Powers and Political Accountability”
- Formal analysis of how separation of powers affects outcomes
- Insight for AI: Multiple principals with different objectives can prevent capture by any single interest.
Audit Independence
Moore et al. (2006) - “Conflicts of Interest and the Case of Auditor Independence”
- Psychological and economic barriers to auditor independence
- Even well-intentioned auditors are influenced by relationships
- Insight for AI: Independence is not a fixed property but requires active maintenance.
Public Company Accounting Oversight Board (PCAOB) - Audit rotation requirements
- Mandatory rotation of audit partners every 5 years
- Insight for AI: Forced rotation prevents relationship-based capture.
Complex Systems & Network Science
Entanglements often exhibit complex system behavior.
Cascading Failures
Buldyrev et al. (2010) - “Catastrophic Cascade of Failures in Interdependent Networks”
- Interdependent networks are more fragile than independent ones
- Small failures can cascade across network boundaries
- Insight for AI: Shared infrastructure creates interdependency. Failures propagate.
Watts (2002) - “A Simple Model of Global Cascades”
- How local failures become global through network structure
- Insight for AI: Understanding cascade topology helps design circuit breakers.
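To make the cascade mechanism concrete, here is a toy threshold-cascade simulation in the spirit of Watts’s model, assuming the dependency graph between components is available; the random graphs below are stand-ins for a real system map:

```python
# A node fails once more than `threshold` of its neighbours have failed.
import networkx as nx

def cascade_size(g: nx.Graph, seed_node=0, threshold=0.3) -> int:
    failed, changed = {seed_node}, True
    while changed:
        changed = False
        for node in g.nodes:
            if node in failed:
                continue
            nbrs = list(g.neighbors(node))
            if nbrs and sum(n in failed for n in nbrs) / len(nbrs) > threshold:
                failed.add(node)
                changed = True
    return len(failed)

sparse = nx.erdos_renyi_graph(500, 3 / 500, seed=1)   # average degree ~3
dense = nx.erdos_renyi_graph(500, 8 / 500, seed=1)    # average degree ~8
print("sparse graph cascade:", cascade_size(sparse))  # typically spreads much further
print("dense graph cascade:", cascade_size(dense))    # typically stays local
```

The point of the comparison is Watts’s: whether a single failure stays local or goes global depends on topology, which is why circuit breakers must be placed with the dependency structure in mind.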
Emergence and Self-Organization
Holland (1998) - “Emergence: From Chaos to Order”
- Complex behaviors emerge from simple rules
- Insight for AI: Collusion and coordination may emerge without explicit design. Watch for emergent entanglements.
Kauffman (1993) - “The Origins of Order”
- Self-organization in complex adaptive systems
- Insight for AI: Multi-agent AI systems may self-organize in unexpected ways.
Decision Theory
The TDT-style coordination problem connects to foundational questions in decision theory.
Timeless Decision Theory
Yudkowsky (2010) - “Timeless Decision Theory”
- Agents reason about their decision procedure, not just their actions
- Similar agents reach similar conclusions through logical correlation
- Insight for AI: Agents with similar training may coordinate without communication—a form of entanglement that can’t be blocked by isolating communication channels.
Functional Decision Theory
Yudkowsky & Soares (2017) - “Functional Decision Theory”
- Refined version of TDT
- Insight for AI: The relevance to deployed AI systems is debated, but the implications are concerning: if AI agents engage in FDT-style reasoning, standard independence assumptions break down.
Logical Uncertainty
Garrabrant et al. (2016) - “Logical Induction”
- Framework for reasoning under logical uncertainty
- Insight for AI: Agents reasoning about what other similar agents will do face logical uncertainty. This connects to coordination without communication.
Interpretability Research
Understanding entanglement at the model level requires interpretability.
Representation Similarity
Kornblith et al. (2019) - “Similarity of Neural Network Representations Revisited”
- Centered Kernel Alignment (CKA) for comparing representations
- Insight for AI: High CKA between verification models suggests they may share blind spots.
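A compact sketch of linear CKA as formulated in that paper; the activation matrices below are synthetic stand-ins for activations collected from two verifier models on the same prompts:

```python
# Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), with features centered.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    x = x - x.mean(axis=0)                       # center each feature column
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2   # cross-similarity of the two representations
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(hsic / (norm_x * norm_y))

# Hypothetical usage: rows are the same 1,000 prompts, columns are hidden units.
x = np.random.randn(1000, 512)
y = x @ np.random.randn(512, 768) + 0.1 * np.random.randn(1000, 768)
print(f"CKA ≈ {linear_cka(x, y):.3f}")           # values near 1.0 indicate highly similar representations
```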
Activation Patching
Meng et al. (2022) - “Locating and Editing Factual Associations in GPT”
- Techniques for understanding which components of a model affect which outputs
- Insight for AI: Could be used to trace how context from one component influences another’s decisions.
Circuits and Features
Elhage et al. (2022) - “Toy Models of Superposition”
Anthropic (2023-2024) - Circuits work
- Understanding what features models represent and how they compute
- Insight for AI: If we understood what “features” verification models use, we could assess whether they’re truly diverse.
Summary: Key Takeaways from the Literature
| Field | Key Insight for Entanglement |
|---|---|
| Principal-Agent Theory | Some agency costs are unavoidable; design to minimize total cost |
| Mechanism Design | Not all desirable properties are achievable; some are fundamentally impossible |
| Game Theory | Repeated interaction enables collusion; rotation disrupts it |
| Information Theory | Entanglement is quantifiable; lossy channels create irreversible information loss |
| Causal Inference | Influence detection requires intervention, not just correlation |
| Adversarial ML | Model similarity predicts failure correlation; diversity means different decision boundaries |
| Organizational Theory | Capture is a spectrum; independence requires active maintenance |
| Complex Systems | Entanglements may emerge without design; cascades require structural understanding |
| Decision Theory | Logical correlation may be irreducible for similar reasoning systems |
| Interpretability | Understanding what models compute could reveal shared failure modes |
Further Reading
Textbooks
- Laffont & Martimort (2002) - “The Theory of Incentives: The Principal-Agent Model”
- Cover & Thomas (2006) - “Elements of Information Theory”
- Pearl (2009) - “Causality: Models, Reasoning, and Inference”
- Mas-Colell, Whinston, Green (1995) - “Microeconomic Theory” (mechanism design chapters)
Survey Papers
- Bolton & Dewatripont (2005) - “Contract Theory”
- Chakraborty & Yılmaz (2017) - “Authority, Consensus, and Governance”
- Goodfellow et al. (2018) - “Making Machine Learning Robust Against Adversarial Inputs”
AI Safety Specific
- Christiano et al. (2017) - “Deep Reinforcement Learning from Human Preferences”
- Irving et al. (2018) - “AI Safety via Debate”
- Hubinger et al. (2019) - “Risks from Learned Optimization”
See also:
- Formal Definitions - Mathematical formalization of entanglement
- Quantitative Metrics - Measuring entanglement in practice
- Historical Case Studies - Real-world examples of entanglement failures