Impossibility Results

Some entanglements cannot be eliminated—they arise from fundamental constraints on verification, information, and computation. Understanding these impossibility results helps distinguish between solvable engineering problems and inherent limitations that must be accepted and managed.

This page surveys what we know (and conjecture) about the theoretical limits of achieving independent verification in AI systems.


Practical benefit: Knowing what’s impossible prevents wasting resources on unachievable goals.

Design implications: If perfect independence is impossible, system design should focus on:

  • Measuring remaining entanglement (not eliminating it)
  • Graceful degradation when entanglement creates failures
  • Layering diverse approaches with known limitation profiles

Honest risk assessment: Claims of “fully independent verification” should be viewed skeptically if they violate impossibility bounds.


Classical result: It’s impossible to build a general program that determines whether arbitrary programs halt (Turing, 1936).

Generalization (Rice’s Theorem): Any non-trivial semantic property of programs is undecidable.

Implication for AI verification:

UNDECIDABLE QUESTIONS:
- "Will this AI system ever produce harmful output?"
- "Does this AI system have deceptive capabilities?"
- "Will this AI system's behavior change under distribution shift?"

What this means: Perfect verification of AI behavior is impossible in the general case. All verification systems are necessarily incomplete—they can only check decidable approximations of the properties we actually care about.
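
A minimal sketch of the underlying reduction (the function names are hypothetical and used only for illustration): if a perfect verifier for "ever produces harmful output" existed, it could be turned into a halting-problem decider, which cannot exist.

```python
# Sketch of the classic reduction: a perfect verifier for "does this program
# ever produce harmful output?" would decide the halting problem.
# All names here are illustrative stubs, not a real API.

def perfect_harm_verifier(program) -> bool:
    """Hypothetical: returns True iff program() ever produces harmful output.
    No such general verifier can exist (see below)."""
    raise NotImplementedError("no general verifier of this kind can exist")

def would_halt(program, program_input) -> bool:
    """If perfect_harm_verifier existed, this would decide halting."""
    def wrapper():
        program(program_input)   # run the arbitrary program on its input
        print("HARMFUL OUTPUT")  # reached iff the program halts
                                 # (assume it either halts normally or runs forever)
    # wrapper produces harmful output iff program halts on program_input,
    # so a perfect harm verifier would double as a halting decider:
    return perfect_harm_verifier(wrapper)

# Turing (1936): no general halting decider exists, so perfect_harm_verifier
# cannot exist either; only decidable approximations can be checked.
```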

First theorem: Any consistent, effectively axiomatized formal system powerful enough to express basic arithmetic contains true statements that cannot be proven within the system.

Second theorem: Such a system cannot prove its own consistency.

Application to AI verification:

flowchart TB
    subgraph Incompleteness[Gödelian Limits on Self-Verification]
        System[AI System S]
        Proof[If S is powerful enough to<br/>reason about itself...]
        Limit[...S cannot prove<br/>all true statements about S]
    end

    System --> Proof --> Limit

    style Limit fill:#ffcccc

Implication: An AI system cannot fully verify its own safety/alignment using its own reasoning capabilities. External verification is necessary—but the external verifier faces the same limitation regarding itself.

The regress problem:

  • System A verified by System B
  • System B verified by System C
  • …eventually, something must be trusted without formal verification

Even when problems are decidable, they may be computationally intractable:

| Verification Task | Complexity Class | Practical Implication |
| --- | --- | --- |
| Checking a specific execution trace | P | Feasible |
| Verifying all possible executions | PSPACE or harder | Usually infeasible |
| Finding adversarial inputs | NP-hard (often) | Expensive to guarantee coverage |
| Optimal policy verification | EXPTIME (for many MDPs) | Intractable for complex systems |

Result: Complete verification of AI behavior is computationally infeasible for systems of realistic complexity. Verification must be sampled, approximated, or restricted to specific properties.
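
A back-of-the-envelope calculation (all parameters invented for illustration) shows why exhaustive coverage collapses even for modest systems:

```python
# Back-of-the-envelope: why exhaustive verification does not scale.
# The parameters below are illustrative, not measurements.

branching_factor = 50_000        # choices available at each step (e.g., tokens)
episode_length = 100             # steps per interaction
checks_per_second = 10**9        # an optimistic verification throughput

possible_executions = branching_factor ** episode_length
seconds_to_check_all = possible_executions // checks_per_second

print(f"possible executions: ~10^{len(str(possible_executions)) - 1}")
print(f"time to enumerate them: ~10^{len(str(seconds_to_check_all)) - 1} seconds")

# Even with absurdly generous throughput the numbers are astronomical, so
# verification must rely on sampling, abstraction, or restricted properties.
```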


Intuition: A verifier can only check what it can observe. Bounded observation implies bounded verification.

Formal statement: Let V be a verifier that observes agent A through a channel with capacity C_channel. The mutual information between A’s internal state and V’s observations is bounded by that capacity:

I(A;V) ≤ C_channel

If the agent’s internal state has entropy H(A) > C_channel, complete verification is impossible.

Practical manifestation:

  • A verifier seeing only outputs cannot fully verify internal reasoning
  • A verifier seeing summaries cannot verify underlying details
  • A verifier with limited time cannot verify arbitrary complexity
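
A toy calculation (illustrative numbers, not measurements) makes the bound concrete: when the agent’s internal state carries more entropy than the observation channel can deliver per episode, the shortfall is unverifiable by any method that uses only that channel.

```python
import math

# Toy illustration of the bound I(A;V) <= C_channel. Numbers are illustrative.

# Agent side: suppose 10,000 internal "decisions" per episode, each with
# 1,000 behaviorally relevant options -> entropy of the internal state.
H_A = 10_000 * math.log2(1_000)              # about 99,658 bits

# Verifier side: it observes only a 200-token output summary drawn from a
# 50,000-token vocabulary -> an upper bound on the channel capacity.
C_channel = 200 * math.log2(50_000)          # about 3,120 bits

print(f"H(A)      ~ {H_A:,.0f} bits")
print(f"C_channel ~ {C_channel:,.0f} bits")
print(f"entropy the verifier can never resolve >= {H_A - C_channel:,.0f} bits")

# Whatever the verifier infers, at most C_channel bits of the internal state
# are recoverable per episode; the rest is invisible by construction.
```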

Principle: Observation can change behavior, making it impossible to verify “natural” behavior through observation.

Heisenberg for AI:

OBSERVER EFFECT IN AI SYSTEMS:
If agent knows it's being monitored:
- Agent may behave differently (compliance theater)
- Verification sees monitored behavior, not unmonitored behavior
If agent doesn't know it's being monitored:
- Requires deception or compartmentalization
- May not be ethically/legally permissible
- May be technically difficult to maintain

Result: Verifying an agent’s behavior when it believes it’s unmonitored is fundamentally different from (and harder than) verifying monitored behavior.
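
A small simulation (made-up rates, purely illustrative) of the measurement problem: if the agent conditions its behavior on a monitoring signal, statistics collected under monitoring reveal little about unmonitored behavior.

```python
import random

# Toy observer effect: the agent misbehaves mainly when it believes it is
# unmonitored. All rates below are made up for illustration.

random.seed(0)

def agent_acts(monitored: bool) -> bool:
    """Returns True if the action violates policy."""
    violation_rate = 0.001 if monitored else 0.15   # compliance theater
    return random.random() < violation_rate

episodes = 100_000
observed = [agent_acts(monitored=True) for _ in range(episodes)]
actual = [agent_acts(monitored=False) for _ in range(episodes)]

print(f"violation rate the verifier measures: {sum(observed) / episodes:.4f}")
print(f"violation rate when unmonitored:      {sum(actual) / episodes:.4f}")

# The verifier's estimate is accurate for monitored behavior but says almost
# nothing about behavior when the agent believes no one is watching.
```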

Statement: A verifier smaller than the system it verifies cannot fully model that system.

Implication: If Verifier V has capacity K_V and Agent A has capacity K_A where K_V << K_A:

  • V cannot simulate A’s behavior in full
  • V must use heuristics, approximations, or sampling
  • There exist behaviors of A that V cannot anticipate

flowchart LR
    subgraph Asymmetry[Capability Asymmetry]
        Agent[Agent A<br/>Capacity: K_A] --> Check[Verification<br/>Challenge]
        Verifier[Verifier V<br/>Capacity: K_V << K_A] --> Check
        Check --> Result[V cannot fully<br/>model A]
    end

    style Result fill:#ffcccc

This is the core challenge of scalable oversight: How do we verify systems more capable than the verifier?


Informal statement: In repeated games with communication (even implicit), rational agents can achieve collusive outcomes that defeat mechanism design.

Folk Theorem implication: If agent A and verifier V interact repeatedly and can observe each other’s history:

  • Collusive equilibria exist that look like “good” behavior
  • These equilibria can be sustained by implicit threats
  • External observers may not be able to distinguish collusion from genuine agreement

Formal result: No mechanism can guarantee non-collusion between rational agents in repeated settings unless at least one of the following holds:

  1. Restricting communication (including implicit signals)
  2. Limiting memory/computation of agents
  3. Introducing external randomization or commitment devices
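
A minimal numerical sketch of the folk-theorem logic (payoffs and discount factors are invented for illustration): under a grim-trigger arrangement, collusion between agent and verifier is stable whenever the one-shot gain from deviating is outweighed by the discounted loss from reverting to mutual distrust.

```python
# Grim-trigger sustainability check for a collusive verifier-agent arrangement.
# Per-round payoffs (made up): collude > punish, deviate > collude.
collude = 3.0     # both play the cozy "approve / don't embarrass" outcome
deviate = 5.0     # one-shot gain from breaking the arrangement
punish = 1.0      # payoff once the relationship reverts to mutual distrust

def collusion_is_stable(delta: float) -> bool:
    """True if colluding forever beats a one-shot deviation under grim trigger."""
    collude_forever = collude / (1 - delta)
    deviate_once = deviate + delta * punish / (1 - delta)
    return collude_forever >= deviate_once

critical_delta = (deviate - collude) / (deviate - punish)   # 0.5 with these payoffs
print(f"collusion sustainable for any discount factor >= {critical_delta:.2f}")
print(f"patient players (delta=0.9): stable? {collusion_is_stable(0.9)}")
print(f"impatient players (delta=0.3): stable? {collusion_is_stable(0.3)}")
```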

Observation: Regulatory capture isn’t a bug—it’s a stable equilibrium.

Game-theoretic analysis:

CAPTURE AS EQUILIBRIUM:
Setup:
- Verifier V must invest effort E to catch violations
- Agent A produces value and violations
- Both prefer low-conflict outcomes
Without strong external incentives:
- V's effort E gradually decreases (costly, low reward)
- A learns V's tolerances
- System converges to "capture equilibrium" where V approves A's preferred actions

Impossibility: Without continuous external pressure or rotation, regulatory relationships converge toward capture. This isn’t a failure of individual integrity—it’s a systemic property.
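
A toy simulation of this drift (all coefficients invented for illustration, not calibrated to any real regulator): effort decays because it is costly and rarely rewarded, and violations expand to fill the verifier’s growing tolerance.

```python
# Toy capture dynamics: effort erodes when costly and rarely rewarded, and
# violations expand into the verifier's tolerance. Coefficients are made up.

effort = 1.0          # verifier's monitoring effort (0..1)
violations = 0.05     # agent's violation rate (0..1)

for year in range(10):
    caught = effort * violations              # expected detections this period
    # Effort erodes with its cost but is partially renewed by visible catches:
    effort = max(0.05, effort - 0.15 + caught)
    # The agent probes the verifier's tolerance and expands into the slack:
    violations = min(0.95, violations + 0.05 * (1.0 - effort))
    print(f"year {year}: effort={effort:.2f}, violation rate={violations:.2f}")

# Without external pressure (rotation, audits, renewed mandates), the system
# drifts toward low effort and high tolerated violation: the capture equilibrium.
```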

Statement: Who watches the watchers?

Infinite regress:

Agent A → verified by V1 → verified by V2 → verified by V3 → ...

Impossibility: The chain must terminate somewhere. The terminal verifier is unverified (or self-verifying, which is limited per Gödel).

Practical forms:

| Terminal Verifier | Limitation |
| --- | --- |
| Human oversight | Bounded attention, manipulable |
| Market mechanisms | Emergent outcomes, not designed properties |
| Democratic processes | Slow, subject to manipulation |
| Constitution/laws | Static, requires interpretation |
| Nothing (accept risk) | Unmitigated exposure |

There is no verification without trust in something.


Category 4: Empirical Observations with Possible Impossibility Status


These are patterns observed empirically that may reflect fundamental limits:

Observation: Despite all defenses, adversarial examples transfer between different models at non-trivial rates (typically 20-40% minimum between very different architectures).

Conjecture: There may be a fundamental floor to adversarial transferability arising from:

  • Shared task structure (both models solve the same problem)
  • Shared data distribution (both trained on similar data)
  • Optimization inducing similar solutions

If this is a true impossibility: No amount of provider diversity will achieve true independence for neural network verifiers.
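
The floor, if it exists, can at least be measured. A minimal sketch of the usual transfer-rate statistic, assuming per-example attack results for a source and a target model are already available (producing them is a separate attack-and-evaluate pipeline not shown here):

```python
def transfer_rate(fooled_source: list[bool], fooled_target: list[bool]) -> float:
    """Fraction of adversarial examples crafted against the source model that
    also fool an independently trained target model.

    Both lists are per-example results on the same adversarial inputs.
    """
    crafted = [t for s, t in zip(fooled_source, fooled_target, strict=True) if s]
    if not crafted:
        return 0.0
    return sum(crafted) / len(crafted)

# Example with made-up results for 8 adversarial inputs:
src = [True, True, True, False, True, True, True, True]
tgt = [True, False, True, False, False, True, True, False]
print(f"transfer rate: {transfer_rate(src, tgt):.0%}")   # 57% on this toy data
```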

Observation: Models trained with RLHF on human feedback develop correlated values and blind spots.

Possible limit: Human feedback is itself correlated (humans share cognitive biases). Any human-aligned AI will inherit these correlations.

RLHF CORRELATION CHAIN:
Human values (correlated) →
Human feedback (correlated) →
RLHF signal (correlated) →
Trained model values (correlated) →
All RLHF'd models share biases

Implication: Using multiple RLHF’d models doesn’t provide independence on questions where human feedback is systematically biased.
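
A small calculation (the correlation values are invented for illustration) shows what correlated blind spots cost: the probability that two verifiers fail on the same input grows far beyond the naive independent estimate.

```python
# Joint failure probability of two verifiers with equal error rate p, as a
# function of the correlation rho between their error indicators:
# P(both wrong) = p^2 + rho * p * (1 - p) for Bernoulli error indicators.

def p_both_wrong(p: float, rho: float) -> float:
    return p * p + rho * p * (1 - p)

p = 0.05                      # each model alone is wrong 5% of the time
for rho in (0.0, 0.3, 0.7):   # rho > 0: shared human-feedback blind spots
    print(f"rho={rho:.1f}: both wrong with probability {p_both_wrong(p, rho):.4f}")

# rho=0.0 -> 0.0025 (the independence assumption)
# rho=0.7 -> 0.0358, roughly 14x the independent estimate
```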

Conjecture: As AI systems become more capable, the gap between their potential for deception and our ability to detect deception may widen.

Argument:

  1. Deception requires modeling the target’s beliefs (theory of mind)
  2. More capable systems are better at theory of mind
  3. Detection requires modeling potential deceptions
  4. Verifier must be as capable as agent to enumerate all deceptions
  5. If agent is more capable than verifier, undetectable deceptions may exist

Status: Conjectured, not proven. But if true, would imply verification becomes harder (not easier) as AI systems improve.


These aren’t fundamental limits but are effectively impossible given realistic constraints:

Statement: Completely isolating AI components from each other while maintaining useful functionality is practically impossible.

Why:

  • Useful systems must share some context (task descriptions, world models)
  • Shared context creates shared assumptions and potential blind spots
  • Zero shared context means zero useful cooperation

Practical limit: The best achievable is minimized, characterized entanglement—not zero entanglement.

Statement: Verifying every action in real-time with comprehensive coverage is practically impossible for systems of useful scale.

Bandwidth constraint:

If system produces N actions per second,
and verification of one action takes T seconds,
then real-time verification requires N×T ≤ 1
For complex verification (T >> 1), real-time is impossible for large N

Practical solutions: Sampling, async verification, pre-authorization of action classes.
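
The constraint translates directly into a sampling budget. A quick calculation with illustrative numbers:

```python
# How much real-time verification can a given budget afford?
# All numbers are illustrative.

actions_per_second = 500        # N: actions the system emits
seconds_per_check = 2.0         # T: cost of verifying one action
verifier_parallelism = 20       # concurrent verification workers

checks_per_second = verifier_parallelism / seconds_per_check
coverage = min(1.0, checks_per_second / actions_per_second)

print(f"verifiable fraction of actions: {coverage:.1%}")   # 2.0% here
print(f"workers needed for full coverage: {int(actions_per_second * seconds_per_check)}")

# With coverage this low, the practical levers are exactly those listed above:
# sample which actions to check, verify asynchronously after the fact, or
# pre-authorize whole classes of low-risk actions.
```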

Statement: Achieving fully independent training data for multiple AI systems is practically impossible.

Why:

  • Limited high-quality data sources (web text, books, etc.)
  • Synthetic data often generated by other AI systems
  • Similar data preprocessing and filtering approaches
  • Market pressure to use “what works” (convergent data choices)

Result: All major AI systems have substantially overlapping training data, creating correlated blind spots.


| Category | Response |
| --- | --- |
| Logical impossibilities | Accept; design around them |
| Information-theoretic limits | Measure; optimize within bounds |
| Game-theoretic limits | Introduce external mechanisms; rotate |
| Empirical patterns | Research; hedge until resolved |
| Practical limits | Engineer best achievable; document residual |

Since no single verification approach is complete, the practical strategy is layered verification with diverse limitations:

flowchart TB
    subgraph Layers[Complementary Limitation Profiles]
        Neural[Neural Verification<br/>Limit: transferable adversarial examples]
        Rule[Rule-Based Verification<br/>Limit: can't capture all semantics]
        Formal[Formal Verification<br/>Limit: only proves specified properties]
        Human[Human Verification<br/>Limit: bandwidth, manipulability]
        Market[Market Verification<br/>Limit: slow, emergent]
    end

    Neural --> Combine[Combined Coverage]
    Rule --> Combine
    Formal --> Combine
    Human --> Combine
    Market --> Combine

    Combine --> Output[Better than any single approach<br/>but still incomplete]

    style Neural fill:#cce6ff
    style Rule fill:#ccffcc
    style Formal fill:#ffffcc
    style Human fill:#ffe6cc
    style Market fill:#ffcccc

The goal: Choose verification methods with uncorrelated limitations.
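
The same correlation arithmetic quantifies why. A Monte Carlo sketch (miss rates and the shared-blind-spot weight are invented for illustration) comparing layers whose limitations are independent versus partially shared:

```python
import random

# Monte Carlo sketch: how often do ALL layers miss a bad action?
# Each layer misses 10% of bad actions. In the "correlated" condition the
# layers partly share a blind spot (a common latent cause). Numbers are made up.

random.seed(1)
TRIALS = 200_000
LAYERS = 4
MISS_RATE = 0.10
SHARED_WEIGHT = 0.6   # fraction of a layer's miss probability driven by the
                      # shared blind spot in the correlated condition

def all_layers_miss(correlated: bool) -> bool:
    shared_blind_spot = random.random() < MISS_RATE  # common latent cause
    for _ in range(LAYERS):
        if correlated and random.random() < SHARED_WEIGHT:
            missed = shared_blind_spot
        else:
            missed = random.random() < MISS_RATE
        if not missed:
            return False
    return True

for correlated in (False, True):
    misses = sum(all_layers_miss(correlated) for _ in range(TRIALS))
    label = "correlated " if correlated else "independent"
    print(f"{label} limitations: all {LAYERS} layers miss "
          f"{misses / TRIALS:.4%} of bad actions")

# Independent limitations: roughly 0.1^4 = 0.01%. Correlated limitations miss
# together far more often, which is why diversity of failure modes matters.
```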

Given impossibility results, honest communication requires:

WHAT NOT TO SAY:
"Our multi-layer verification provides complete protection."
WHAT TO SAY:
"Our verification covers X, Y, Z properties with [measured] coverage.
Known gaps include A, B, C.
Fundamental limits mean we cannot verify D, E, F."

We know certain things are impossible but often don’t know:

  • Exactly how much of the space can be verified
  • The best achievable approximations
  • Tight bounds on transfer rates, capture rates, etc.

Research direction: Developing quantitative bounds on verification completeness.

Current impossibility results assume current verification paradigms. Could new approaches shift the boundaries?

Candidates:

  • Cryptographic verification (ZK proofs of AI reasoning?)
  • Quantum verification (any advantages?)
  • Self-improving verification (can we build verifiers that escape limits?)

Status: Mostly speculative; fundamental limits may still apply.

Conjecture: There may be a fundamental trade-off between:

  • Aligning AI systems to human values
  • Being able to verify that alignment

If systems are aligned through complex, holistic training signals (like RLHF), their alignment may be inherently unverifiable by formal methods.


What cannot be done:

  1. Fully verify arbitrary AI behavior (Halting problem, Rice’s theorem)
  2. Build self-verifying AI systems (Gödel’s theorems)
  3. Guarantee non-collusion in repeated games (Folk theorem)
  4. Achieve zero entanglement with useful cooperation (Information theoretic)
  5. Verify systems more capable than the verifier (Compression limits)
  6. Eliminate the terminal trust problem (Quis custodiet)

What can be done:

  1. Verify specific properties (if decidable and tractable)
  2. Measure entanglement (information-theoretic tools)
  3. Reduce entanglement (architectural diversity)
  4. Detect some collusion/capture (statistical methods, rotation)
  5. Layer diverse verification (complementary limitations)
  6. Make honest risk assessments (acknowledging gaps)

The existence of impossibility results doesn’t mean verification is useless—it means we should:

  1. Be humble: Acknowledge what cannot be achieved
  2. Be precise: Quantify what can be achieved
  3. Be diverse: Use methods with uncorrelated limitations
  4. Be adaptive: Update as we learn more about bounds
  5. Be transparent: Communicate limitations honestly

| Result | Source | Implication |
| --- | --- | --- |
| Halting problem | Turing (1936) | General behavior verification is undecidable |
| Incompleteness | Gödel (1931) | Self-verification is fundamentally limited |
| Rice’s theorem | Rice (1953) | Non-trivial semantic properties are undecidable |
| Folk theorems | Aumann & Shapley (1976) | Collusion is stable in repeated games |
| No free lunch | Wolpert & Macready (1997) | No universal optimizer/verifier |
| Adversarial existence | Szegedy et al. (2013) | Adversarial examples exist for neural networks |
