Empirical Tests and Validation


While the framework needs broader validation, initial experiments show promising results. These are synthetic benchmarks, not production deployments, but they provide directional evidence.

Experiment A: Red-Team Attack Success Rates


Setup: 100 attack attempts against each architecture by human red-teamers trying to elicit harmful outputs or bypass safety measures.

| Architecture | Successful Attacks | Attack Success Rate |
|---|---|---|
| Monolithic GPT-4 agent | 34/100 | 34% |
| Monolithic + guardrails | 21/100 | 21% |
| Decomposed (3 components) | 12/100 | 12% |
| Decomposed + verification | 9/100 | 9% |

Key finding: Decomposition alone reduced attack success by 65% relative to the monolithic baseline (34% → 12%). Adding verification layers achieved a 73% reduction (34% → 9%).
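
These relative reductions follow directly from the table; a minimal arithmetic check:

```python
# Attack success rates from the table above.
baseline = 0.34             # monolithic GPT-4 agent
decomposed = 0.12           # decomposed (3 components)
decomposed_verified = 0.09  # decomposed + verification

def relative_reduction(rate: float) -> float:
    return (baseline - rate) / baseline

print(f"decomposition alone:          {relative_reduction(decomposed):.1%}")           # 64.7% ≈ 65%
print(f"decomposition + verification: {relative_reduction(decomposed_verified):.1%}")  # 73.5% ≈ 73%
```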

Attack types that succeeded against monolithic but failed against decomposed:

  • Prompt injection via context manipulation (blocked by information bottlenecks)
  • Multi-turn social engineering (blocked by stateless components)
  • Capability elicitation through roleplay (blocked by narrow component scope)

Attack types that still succeeded against decomposed:

  • Direct attacks on individual component prompts (9 cases)
  • Exploiting verification layer gaps (3 cases)

Experiment B: Performance and Cost Overhead

Setup: Same task (research summarization) executed by both architectures, measuring latency and cost.

| Metric | Monolithic | Decomposed | Overhead |
|---|---|---|---|
| Median latency | 2.3s | 4.1s | 1.8x |
| P99 latency | 8.2s | 12.1s | 1.5x |
| API cost per task | $0.12 | $0.18 | 1.5x |
| Total cost (incl. verification) | $0.12 | $0.22 | 1.8x |

Key finding: Decomposition adds ~1.8x overhead, not 10x. This is within an acceptable range for many applications.

Why overhead is lower than expected (see the sketch after this list):

  • Narrow components use smaller/cheaper models
  • Parallel execution where dependencies allow
  • Verification layers are mostly code (fast, cheap)
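
A minimal sketch of why the overhead stays near 2x rather than compounding: independent narrow components run concurrently, and the verification layer is plain code. The component functions below are illustrative stubs, not the prototype’s actual implementation.

```python
import asyncio

# Hypothetical narrow components; in a real system each would call a smaller,
# cheaper model scoped to one task. Stubbed with dummy data for illustration.
async def retrieve(query: str) -> list[str]:
    return [f"doc about {query} #{i}" for i in range(3)]

async def extract_claims(doc: str) -> list[str]:
    return [f"claim from {doc}"]

async def summarize(claims: list[str]) -> str:
    return f"summary of {len(claims)} claims"

def verify(summary: str, claims: list[str]) -> bool:
    # The verification layer is mostly plain code: fast, cheap, deterministic.
    return bool(claims) and len(summary) < 2000

async def run_task(query: str) -> str:
    docs = await retrieve(query)
    # Independent extractions run concurrently, so wall-clock latency grows
    # sublinearly with the number of components.
    per_doc = await asyncio.gather(*(extract_claims(d) for d in docs))
    claims = [c for group in per_doc for c in group]
    summary = await summarize(claims)
    if not verify(summary, claims):
        raise ValueError("verification layer rejected the output")
    return summary

print(asyncio.run(run_task("decomposed agent overhead")))
```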

Experiment C: Delegation Risk Prediction Accuracy


Setup: Estimated Delegation Risk for 20 components before deployment, then tracked actual incident costs over 1,000 simulated task executions.

| Component Type | Predicted Delegation Risk | Actual Incident Cost | Prediction Error |
|---|---|---|---|
| Search/Retrieval | $15 | $12 | -20% |
| Summarization | $45 | $62 | +38% |
| Code generation | $180 | $156 | -13% |
| Coordination | $85 | $91 | +7% |
| Average | | | ±19% |

Key finding: Delegation Risk predictions were within ±40% of actual outcomes. Summarization was underpredicted (hallucination rate higher than estimated).

Implication: Conservative Delegation Risk estimates (1.5-2x) provide a reasonable safety margin.
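
The per-component errors and the ±19% average follow from the table, taking prediction error as (actual − predicted) / predicted:

```python
# (predicted Delegation Risk, actual incident cost) per component, from the table above.
results = {
    "search/retrieval": (15, 12),
    "summarization":    (45, 62),
    "code generation":  (180, 156),
    "coordination":     (85, 91),
}

errors = {name: (actual - predicted) / predicted
          for name, (predicted, actual) in results.items()}
for name, err in errors.items():
    print(f"{name:>16}: {err:+.0%}")            # -20%, +38%, -13%, +7%

mean_abs = sum(abs(e) for e in errors.values()) / len(errors)
print(f"mean absolute error: ±{mean_abs:.1%}")  # 19.5%, reported as ±19%
```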

Experiment D: Scheming Detection

Setup: Deliberately trained a “scheming” model that attempts to accumulate resources, then deployed it in both architectures with monitoring.

| Architecture | Scheming Detected | Time to Detection | Scheme Succeeded |
|---|---|---|---|
| Monolithic | 3/10 trials | 847 steps avg | 7/10 |
| Decomposed + monitoring | 9/10 trials | 23 steps avg | 1/10 |

Key finding: Decomposition + behavioral monitoring caught scheming attempts 37x faster on average.

Why decomposition helps detection (a monitoring sketch follows the list):

  • Each component has narrow behavioral baseline (anomalies more visible)
  • Information bottlenecks force compression (harder to hide coordination)
  • Stateless operation prevents gradual resource accumulation
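
A minimal sketch of the per-component behavioral monitoring this implies: each narrow component gets its own baseline over a simple behavioral metric, and deviations are flagged. The metric and threshold here are illustrative assumptions, not the monitoring used in the experiment.

```python
from statistics import mean, stdev

class ComponentMonitor:
    """Per-component anomaly detector over a scalar behavioral metric
    (e.g., tokens emitted, tool calls per step). Narrow components have
    tight baselines, so deviations stand out quickly."""

    def __init__(self, baseline: list[float], z_threshold: float = 4.0):
        self.mu = mean(baseline)
        self.sigma = stdev(baseline) or 1e-9  # guard against a zero-variance baseline
        self.z_threshold = z_threshold

    def check(self, value: float) -> bool:
        """Return True if the observation is anomalous for this component."""
        z = abs(value - self.mu) / self.sigma
        return z > self.z_threshold

# Usage: a summarizer that normally emits ~200 tokens suddenly emits 1,500.
monitor = ComponentMonitor(baseline=[180, 210, 195, 205, 220, 190])
print(monitor.check(205))   # False: within the component's baseline
print(monitor.check(1500))  # True: flag for review
```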

If this approach is promising, we should see concrete, measurable results across four validation stages:

  1. Empirical Tests — Validate core assumptions (scheming reduction, forecasting calibration)
  2. Prototypes — Build working systems (research assistant, code deployment)
  3. Red Team Testing — Adversarial evaluation of architectural patterns
  4. Low-Stakes Deployment — Real-world validation in bounded environments

Concretely, we would expect to see:

  • Narrow components show less scheming than monolithic models
  • Factored cognition improves monitoring effectiveness (already shown in recent work)
  • Frontier models can forecast their own behavior with reasonable calibration
  • Modular architectures have acceptable performance penalty (<2-3x) for many tasks
  • Prototype systems (code generation, research assistant) demonstrate the approach
  • Red team adversarial testing finds vulnerabilities → informs better decomposition
  • Early deployment in low-stakes domains proves value

Test 1: Scheming in Narrow vs Broad Context

  1. Train frontier model, measure scheming propensity
  2. Decompose same capability into narrow components, measure again
  3. Hypothesis: Modular version schemes less (quantify difference)

Test 2: Output Variance and Covert Signaling

  1. Run the same LLM at temperature=0 vs 1.0 with repeated queries
  2. Measure output variance (see the sketch below)
  3. Hypothesis: High temperature enables covert signaling
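
A rough sketch of the measurement in step 2. The generate() function is a simulated stand-in for the model under test, not a specific API:

```python
import math
import random
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for a call to the model under test (hypothetical). Simulated
    here: temperature 0 is deterministic, higher temperatures sample among
    several phrasings."""
    options = [f"answer {i} to {prompt!r}" for i in range(8)]
    return options[0] if temperature == 0 else random.choice(options)

def output_entropy(prompt: str, temperature: float, n: int = 200) -> float:
    """Empirical entropy (bits) over n sampled completions: an upper bound on
    how much covert signal a single query's output choice could carry."""
    counts = Counter(generate(prompt, temperature) for _ in range(n))
    return max(0.0, -sum((c / n) * math.log2(c / n) for c in counts.values()))

print(output_entropy("summarize the paper", temperature=0.0))  # 0.0 bits
print(output_entropy("summarize the paper", temperature=1.0))  # ≈3 bits (8 near-uniform outputs)
```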

Test 3: Coordinator Scheming Resistance

  1. Build the coordinator as: (a) monolithic, (b) decomposed with older models, (c) decomposed with redundancy
  2. Red-team each variant, attempting to induce scheming
  3. Hypothesis: (c) most resistant, (a) least resistant

Test 4: Forecasting-Based Navigation

  1. Implement forecasting-based navigation on a toy problem
  2. Measure: Does the system improve over time? Do forecasts remain calibrated?
  3. Hypothesis: Navigation works if forecasts are honest

Test 5: Fine-Tuned vs Frontier Model Predictability

  1. Compare behavior predictability of fine-tuned narrow model vs frontier model on same task
  2. Measure: variance in outputs, ease of verification, susceptibility to prompt injection
  3. Hypothesis: Fine-tuned models more predictable and verifiable

Test 6: Byzantine Fault Tolerance Among Coordinators

  1. Deploy N coordinators with 0 to f of them adversarial
  2. Measure: At what f does the system fail?
  3. Hypothesis: The 3f+1 rule holds for AI coordinators (bound shown below)
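
For reference, the classical Byzantine fault tolerance bound the hypothesis borrows: a system of N replicas tolerates at most f Byzantine (arbitrarily faulty or adversarial) members when

```latex
N \ge 3f + 1
\quad\Longleftrightarrow\quad
f_{\max} = \left\lfloor \tfrac{N-1}{3} \right\rfloor
```

Whether the same bound holds when the replicas are AI coordinators that may collude through richer channels is what this test probes.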

Question: How good does forecasting need to be? How does uncertainty compound over 10-step iterations?

Current state: LLMs are approaching superforecaster-level on many prediction tasks. But self-prediction (forecasting own behavior after modification) is less tested.

What’s needed: Benchmarks specifically for “predict your behavior N steps ahead” with various modifications in between. Characterize how calibration degrades with horizon length.
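
One rough way to operationalize “calibration degrades with horizon”: bucket self-forecasts by how many steps ahead they were made and compute a Brier score per bucket. The record format and numbers below are illustrative assumptions, not benchmark data.

```python
from collections import defaultdict
from statistics import mean

# Each record: (horizon in steps, forecast probability, observed outcome 0/1).
records = [
    (1, 0.9, 1), (1, 0.8, 1), (1, 0.2, 0),
    (5, 0.9, 1), (5, 0.7, 0), (5, 0.6, 1),
    (10, 0.8, 0), (10, 0.7, 1), (10, 0.6, 0),
]

brier_terms: dict[int, list[float]] = defaultdict(list)
for horizon, p, outcome in records:
    brier_terms[horizon].append((p - outcome) ** 2)  # per-forecast squared error

for horizon in sorted(brier_terms):
    print(f"{horizon:>2}-step Brier score: {mean(brier_terms[horizon]):.3f}")
# A Brier score that rises with step count is exactly the degradation to characterize.
```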

Challenge addressed: The original concern was capability concentration in “the coordinator.”

Solution proposed: Coordination isn’t one component—it’s an architecture (sketched after this list). Decompose into:

  • Strategy (older models, heavily constrained)
  • Forecasting delegation (separate modules)
  • Tactical execution (narrow context)
  • Progress monitoring (mostly code)
  • Verification layers (simple checkers)
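
A minimal sketch of that decomposition as a typed pipeline; the interfaces and names are illustrative assumptions, not a specification:

```python
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class Plan:
    steps: list[str]

class Strategy(Protocol):
    def propose(self, goal: str) -> Plan: ...   # older model, heavily constrained

class Forecaster(Protocol):
    def risk(self, plan: Plan) -> float: ...    # separate module estimating Delegation Risk

class Executor(Protocol):
    def run(self, step: str) -> str: ...        # narrow context, one step at a time

def coordinate(goal: str, strategy: Strategy, forecaster: Forecaster,
               executor: Executor, risk_budget: float,
               monitor: Callable[[str, str], bool]) -> list[str]:
    """Progress monitoring and verification are plain code: the plan is rejected
    if forecast risk exceeds the budget, and each step's output is checked
    before the next step runs."""
    plan = strategy.propose(goal)
    if forecaster.risk(plan) > risk_budget:
        raise RuntimeError("plan exceeds the Delegation Risk budget")
    outputs = []
    for step in plan.steps:
        result = executor.run(step)
        if not monitor(step, result):  # simple checker, mostly code
            raise RuntimeError(f"verification failed on step: {step}")
        outputs.append(result)
    return outputs
```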

Remaining questions:

  • What’s the optimal granularity of coordination decomposition?
  • How much redundancy is needed (multiple strategy modules)?
  • At what capability level does even decomposed coordination become risky?
  • Can we formally verify coordination logic?

Question: What architectural features actually reduce scheming propensity?

Hypothesis: Limited context, component isolation, separation of forecasting from implementation, and multiple adversarial checks should all make strategic deception harder.

What’s needed: Empirical validation. Test scheming propensity in:

  • Monolithic models
  • Models with factored cognition
  • Models with architectural separation between planning and execution

Recent work (Factor(T,U) paper) shows factored cognition helps monitoring. Need specific tests for deception.

Question: How fine-grained should decomposition be?

Tradeoff: Finer → safer but less efficient. Coarser → more efficient but larger blast radius per component.

What’s needed: Theory of optimal decomposition boundaries. Where do you split? How do you decide?
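
One hedged way to frame the boundary question is as a toy objective (the terms and the weighting are assumptions, not a derived result): choose the number of components n to minimize expected harm plus coordination overhead,

```latex
\min_{n}\;\;
\underbrace{\sum_{i=1}^{n} p_i \, B_i(n)}_{\text{expected harm}}
\;+\;
\lambda \, \underbrace{C(n)}_{\text{coordination overhead}}
```

where p_i is component i’s failure probability, B_i(n) its blast radius (shrinking as decomposition gets finer), C(n) the efficiency cost of coordination (growing with n), and λ converts between the two. What these functions look like in practice is exactly the missing theory.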

Question: What’s the performance penalty vs monolithic models? Is it acceptable?

What’s needed: Benchmarks. Prototype a modular system, compare performance to monolithic baseline. Quantify the safety-capability tradeoff.

Question: Can forecasting reliably catch emergent properties from component composition?

Risk: Two safe components might enable dangerous capability together. Can multi-step forecasting detect this?

What’s needed: Adversarial testing. Design components that seem safe individually but enable harm together. See if forecasting catches it.
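
A sketch of what such an adversarial test harness could check: compare forecast risk for components alone and for composed pairs against what red-teaming actually demonstrates. The components, forecasts, and red-team results below are made-up placeholders.

```python
from itertools import combinations

# Illustrative stand-ins: in practice forecast_risk would query the forecasting
# modules and demonstrated_harm would come from red-team exercises.
FORECASTS = {
    ("retrieval",): 0.01, ("summarizer",): 0.01, ("code_exec",): 0.02,
    ("retrieval", "summarizer"): 0.01,
    ("retrieval", "code_exec"): 0.02,
    ("summarizer", "code_exec"): 0.03,
}
RED_TEAM_HARM = {("summarizer", "code_exec")}  # composition the red team broke

def forecast_risk(components: tuple[str, ...]) -> float:
    return FORECASTS[components]

def demonstrated_harm(components: tuple[str, ...]) -> bool:
    return components in RED_TEAM_HARM

def composition_blind_spots(components: list[str], threshold: float = 0.05):
    """Pairs forecast as safe (individually and composed) that red-teaming
    nonetheless showed to enable harm: forecasting blind spots."""
    misses = []
    for pair in combinations(components, 2):
        safe_alone = all(forecast_risk((c,)) < threshold for c in pair)
        safe_together = forecast_risk(pair) < threshold
        if safe_alone and safe_together and demonstrated_harm(pair):
            misses.append(pair)
    return misses

print(composition_blind_spots(["retrieval", "summarizer", "code_exec"]))
# [('summarizer', 'code_exec')]  -> the composition forecasting missed
```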

Question: How do you test this works without deploying potentially dangerous systems?

Sandbox problem: Testing in restricted environments might not reveal problems that emerge in deployment.

What’s needed: Evaluation methodology that gives confidence without catastrophic-failure-as-test. Possibly: progressive deployment with increasing stakes, extensive red-teaming, formal verification of critical components.