
Quantitative Metrics

This page explains how to measure entanglement in practice. For theoretical foundations, see Formal Definitions.


| Category | What It Captures | When to Use |
| --- | --- | --- |
| Correlation | Shared failures (passive entanglement) | Detect correlated blind spots |
| Influence | Directional impact (active entanglement) | Detect context contamination |
| Capture | Alignment drift (adversarial entanglement) | Detect verifier capture |
| Diversity | Structural independence | Assess design quality |
| Temporal | Changes over time | Early warning |

Failure correlation is the simplest measure: when component A fails, how often does component B also fail?

How to measure: Run the same inputs through both components, record which inputs each one fails on, and compute the correlation between the two failure indicators.
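
A minimal sketch of the computation, assuming failures are logged as 0/1 flags per input (the function and variable names are illustrative):

```python
import numpy as np

def failure_correlation(a_failed, b_failed) -> float:
    """Pearson correlation between two 0/1 failure indicators
    (the phi coefficient for binary data)."""
    a = np.asarray(a_failed, dtype=float)
    b = np.asarray(b_failed, dtype=float)
    if a.std() == 0 or b.std() == 0:
        raise ValueError("one component never fails (or always fails) on this input set")
    return float(np.corrcoef(a, b)[0, 1])

# 1 = failed, 0 = passed, one entry per test input
a_failed = [1, 0, 1, 1, 0, 0, 1, 0]
b_failed = [1, 0, 1, 0, 0, 0, 1, 0]
print(failure_correlation(a_failed, b_failed))  # ~0.77: high correlation, shared blind spots
```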

Interpretation:

| Correlation | Meaning | Action |
| --- | --- | --- |
| < 0 | Negative correlation (failures are anti-correlated) | Great; maintain this diversity |
| 0 to 0.3 | Low correlation | Acceptable |
| 0.3 to 0.5 | Moderate correlation | Investigate shared dependencies |
| > 0.5 | High correlation | Redesign needed; layers share blind spots |

A related measure is the conditional failure ratio: if A fails, how much more likely is B to fail? Compute it as P(B fails | A fails) ÷ P(B fails).

Interpretation:

| Ratio | Meaning |
| --- | --- |
| = 1 | Independent (A failing doesn’t affect B’s failure odds) |
| > 1 | Positive correlation (A failing makes B failing more likely) |
| < 1 | Negative correlation (A failing makes B failing less likely) |
| >> 1 | Strong entanglement (when A fails, B almost always fails too) |
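
A sketch of the ratio, under the same 0/1 failure-log assumption as above:

```python
import numpy as np

def conditional_failure_ratio(a_failed, b_failed) -> float:
    """P(B fails | A fails) divided by P(B fails).
    1.0 means independent; values well above 1 mean entanglement."""
    a = np.asarray(a_failed, dtype=bool)
    b = np.asarray(b_failed, dtype=bool)
    p_b = b.mean()                # B's overall failure rate
    p_b_given_a = b[a].mean()     # B's failure rate on inputs where A failed
    return float(p_b_given_a / p_b)

print(conditional_failure_ratio([1, 0, 1, 1, 0, 0, 1, 0],
                                [1, 0, 1, 0, 0, 0, 1, 0]))  # 2.0: B fails twice as often when A fails
```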

The entanglement tax is a direct measure of how much worse reality is than the independence assumption.

Calculation:

Entanglement Tax = (Actual joint failure rate) ÷ (What you’d expect if independent)

Example:

  • Component A fails 10% of the time
  • Component B fails 10% of the time
  • If independent, both would fail together 1% of the time (0.1 × 0.1)
  • You actually observe both failing 5% of the time
  • Entanglement Tax = 5% ÷ 1% = 5

Your system is 5× riskier than the independent model suggests.
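
The worked example as a tiny function (illustrative, not a library API):

```python
def entanglement_tax(p_a: float, p_b: float, p_joint_observed: float) -> float:
    """Observed joint failure rate divided by the rate expected
    if the two components failed independently."""
    expected_if_independent = p_a * p_b
    return p_joint_observed / expected_if_independent

print(entanglement_tax(0.10, 0.10, 0.05))  # ~5.0: five times riskier than the independent model predicts
```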


Frame sensitivity measures how much A’s framing of an input affects B’s decisions.

How to test:

  1. Take inputs where you know the correct answer
  2. Have A summarize/frame them two ways: neutral and suggestive
  3. See how much B’s accuracy differs between framings

Interpretation:

| Sensitivity | Meaning | Action |
| --- | --- | --- |
| < 5% difference | Low sensitivity | Good; B isn’t easily swayed |
| 5-15% difference | Moderate sensitivity | Consider independent information channels |
| > 15% difference | High sensitivity | B is too dependent on A’s framing |
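
A minimal sketch of the step-3 comparison, assuming you have B’s answers under both framings plus ground-truth labels (names are illustrative):

```python
import numpy as np

def frame_sensitivity(b_answers_neutral, b_answers_suggestive, ground_truth) -> float:
    """Absolute difference in B's accuracy between neutral and suggestive framings of A."""
    truth = np.asarray(ground_truth)
    acc_neutral = np.mean(np.asarray(b_answers_neutral) == truth)
    acc_suggestive = np.mean(np.asarray(b_answers_suggestive) == truth)
    return float(abs(acc_neutral - acc_suggestive))

# A result above 0.15 (15 percentage points) means B leans heavily on A's framing.
```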

A second influence check: does A’s behavior predict B’s behavior, beyond what B’s own history predicts?

When to worry: If B’s outputs correlate with A’s outputs over time, information is leaking through some channel—even if you think they’re isolated.
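
One way to operationalize “beyond what B’s own history predicts” is a Granger-style check: predict B from its own lag, then add A’s lag and see whether prediction error drops (a sketch, assuming outputs are encoded numerically, e.g. approve = 1, reject = 0):

```python
import numpy as np

def a_predicts_b_beyond_history(a_series, b_series, lag: int = 1) -> float:
    """Fractional reduction in B's prediction error when A's lagged outputs are added
    to a regression on B's own lagged outputs. Values well above 0 suggest
    information is flowing from A to B through some channel."""
    a = np.asarray(a_series, dtype=float)
    b = np.asarray(b_series, dtype=float)
    y, b_lag, a_lag = b[lag:], b[:-lag], a[:-lag]
    ones = np.ones_like(y)

    # Model 1: B_t ~ const + B_{t-lag}
    X1 = np.column_stack([ones, b_lag])
    r1 = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]

    # Model 2: B_t ~ const + B_{t-lag} + A_{t-lag}
    X2 = np.column_stack([ones, b_lag, a_lag])
    r2 = y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]

    sse1, sse2 = float(np.sum(r1**2)), float(np.sum(r2**2))
    return (sse1 - sse2) / sse1
```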


Approval rate vs. baseline: compare how often a given agent’s outputs get approved against the baseline approval rate across all agents.

Interpretation:

| Ratio vs. Baseline | Meaning |
| --- | --- |
| 0.9 - 1.1 | Normal variation |
| 1.1 - 1.3 | Slightly elevated; monitor |
| > 1.3 | Significantly elevated; investigate for capture |
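
A sketch of the ratio computed from a decision log (names are illustrative):

```python
from collections import defaultdict

def approval_ratios(decisions):
    """decisions: iterable of (agent_id, approved) pairs.
    Returns each agent's approval rate divided by the overall baseline rate."""
    totals, approvals = defaultdict(int), defaultdict(int)
    for agent, approved in decisions:
        totals[agent] += 1
        approvals[agent] += int(approved)
    baseline = sum(approvals.values()) / sum(totals.values())
    return {agent: (approvals[agent] / totals[agent]) / baseline for agent in totals}

# Ratios above ~1.3 for a specific agent are worth investigating for capture.
```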

Preference alignment asks: does the verifier approve more when the agent wants approval?

How to measure: Track cases where you know the agent’s preference. Compare approval rates.

Interpretation:

| Alignment Score | Meaning |
| --- | --- |
| Negative | Healthy adversarial relationship |
| Near zero | Independent (verifier ignores agent preferences) |
| 0.1 - 0.3 | Some alignment; monitor for capture |
| > 0.3 | Significant alignment; capture likely |
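
A minimal sketch, treating the alignment score as the correlation between the agent’s preferred outcome and the verifier’s decision over cases where the preference is known (an assumed scoring choice; names are illustrative):

```python
import numpy as np

def preference_alignment(agent_prefers_approval, verifier_approved) -> float:
    """Correlation between the agent's preferred outcome (1 = wants approval)
    and the verifier's decision (1 = approved)."""
    prefers = np.asarray(agent_prefers_approval, dtype=float)
    approved = np.asarray(verifier_approved, dtype=float)
    return float(np.corrcoef(prefers, approved)[0, 1])

# Near zero or negative is healthy; anything above 0.3 suggests capture.
```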

Verifier consistency asks: does the verifier treat all agents the same?

If approval rates vary significantly between agents, some may have captured the verifier.
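
A quick consistency check is the spread of per-agent approval rates (reusing the kind of decision log from the approval-ratio sketch above; illustrative):

```python
import numpy as np

def verifier_consistency(per_agent_approval_rates) -> float:
    """Standard deviation of approval rates across agents.
    A large spread relative to the mean suggests some agents get special treatment."""
    rates = np.array(list(per_agent_approval_rates.values()), dtype=float)
    return float(rates.std())
```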


The diversity score rates your verification system on five dimensions:

| Dimension | 0 Points | 1 Point | 2 Points |
| --- | --- | --- | --- |
| Provider | All same provider | 2 providers | 3+ providers |
| Architecture | All same type | 2 types | 3+ types (e.g., neural + rule-based + symbolic) |
| Training data | Same data | Overlapping data | Independent data |
| Methodology | All ML | ML + rules | ML + rules + formal verification |
| Timing | All synchronous | Pre + post | Pre + concurrent + post |

Scoring:

| Total (0-10) | Assessment |
| --- | --- |
| 0-3 | Low diversity; high entanglement risk |
| 4-6 | Moderate diversity |
| 7-10 | High diversity; lower entanglement risk |
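
The rubric as a small helper, with hypothetical dimension names (score each dimension 0-2 per the table above):

```python
def diversity_score(points):
    """Sum per-dimension points (0-2 each) and map the total to an assessment."""
    total = sum(points.values())
    if total <= 3:
        assessment = "Low diversity: high entanglement risk"
    elif total <= 6:
        assessment = "Moderate diversity"
    else:
        assessment = "High diversity: lower entanglement risk"
    return total, assessment

print(diversity_score({"provider": 1, "architecture": 2, "training_data": 1,
                       "methodology": 1, "timing": 0}))  # (5, 'Moderate diversity')
```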

Track metrics weekly and watch for trends.

Concerning patterns:

  • Correlation increasing over time → shared dependencies growing
  • Approval rates creeping up → possible capture developing
  • Frame sensitivity increasing → verifier becoming more dependent on agent framing

Alert thresholds:

| Metric | Normal | Monitor | Investigate |
| --- | --- | --- | --- |
| Correlation drift | < 0.01/week | 0.01-0.02/week | > 0.02/week |
| Approval rate drift | < 0.5%/month | 0.5-2%/month | > 2%/month |
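
A sketch of a drift check: fit a least-squares slope to weekly measurements and compare it to the thresholds above (illustrative):

```python
import numpy as np

def weekly_drift(weekly_values) -> float:
    """Least-squares slope of a metric measured once per week (change per week)."""
    y = np.asarray(weekly_values, dtype=float)
    weeks = np.arange(len(y))
    slope, _intercept = np.polyfit(weeks, y, 1)
    return float(slope)

print(weekly_drift([0.20, 0.22, 0.23, 0.26]))  # ~0.019/week: near the investigate threshold
```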

Quick reference: which metric to check for each suspected failure mode.

| If You Suspect… | Measure | Alert If |
| --- | --- | --- |
| Shared blind spots | Failure correlation | > 0.3 |
| Context contamination | Frame sensitivity | > 10% difference |
| Verifier capture | Preference alignment | > 0.15 |
| Favoritism | Verifier consistency across agents | High variance |
| Growing problems | Weekly drift rates | Positive trend |
| Low structural diversity | Diversity score | < 5 |

Start simple: Failure correlation is the most important metric. If you can only measure one thing, measure this.

Use ground truth: Many metrics require knowing the “correct” answer. Invest in human labeling of samples.

Log everything: You can’t measure what you don’t record. Log all verification decisions with context.

Track over time: A single measurement tells you the current state. Trends tell you if it’s getting worse.

Test under stress: Conditional entanglement only appears under unusual conditions. Test with adversarial inputs and high load.


A healthy verification system:

  • Failure correlation < 0.3
  • Frame sensitivity < 10%
  • No verifier showing elevated approval rates for specific agents
  • Diversity score ≥ 5
  • Stable or improving metrics over time

If your system doesn’t meet these benchmarks, see Solutions & Mitigations.


See also: