# Example: Autonomous Code Deployment

A second worked example showing how the framework applies to a higher-stakes domain: autonomous code review and deployment.

“Review pull requests, run tests, and deploy approved changes to production”

Unlike the research assistant (where failures waste time), code deployment has:

  • Immediate production impact — bad deploys cause outages
  • Security implications — malicious code can compromise systems
  • Financial stakes — downtime = revenue loss

This forces us to be more conservative with trust allocations.

```mermaid
flowchart TB
    subgraph Input
        PR[Pull Request]
    end
    subgraph Review["Code Review"]
        Static[Static Analysis<br/>deterministic]
        Sec[Security Scanner<br/>narrow model]
        Rev[Code Reviewer<br/>frontier LLM]
    end
    subgraph Test["Testing"]
        Unit[Unit Tests<br/>code]
        Int[Integration Tests<br/>code]
        Fuzz[Fuzzing<br/>code]
    end
    subgraph Deploy["Deployment"]
        Stage[Staging Deploy<br/>code]
        Canary[Canary Analysis<br/>narrow model]
        Prod[Production Deploy<br/>code + human]
    end
    subgraph Coord["Coordination"]
        Orch[Orchestrator<br/>verified code]
    end
    PR --> Static
    PR --> Sec
    PR --> Rev
    Static --> Orch
    Sec --> Orch
    Rev --> Orch
    Orch --> Unit
    Unit --> Int
    Int --> Fuzz
    Fuzz --> Stage
    Stage --> Canary
    Canary --> Prod
```

Trust allocations for each component:

| Component | Implementation | Delegation Risk Budget | Rationale |
| --- | --- | --- | --- |
| Static Analysis | Verified code | $50/month | Deterministic, auditable |
| Security Scanner | Fine-tuned 7B model | $200/month | Narrow task, well-tested |
| Code Reviewer | Frontier LLM + constraints | $500/month | Creative, but can’t deploy |
| Unit/Integration Tests | Code | $100/month | Deterministic |
| Fuzzer | Code + seeded PRNG | $150/month | Reproducible randomness |
| Staging Deploy | Verified code | $50/month | Isolated environment |
| Canary Analysis | Fine-tuned model | $300/month | Critical gate, narrow task |
| Orchestrator | Verified code only | $100/month | No LLM in critical path |
| Production Deploy | Code + human approval | $500/month | Human in loop |

System Total: $1,950/month
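
As a sanity check, the per-component budgets can be summed against the system’s acceptable Delegation Risk ($2,000/month, from the comparison table below). A minimal Python sketch; the dollar figures come from the table above, while the dictionary keys are illustrative:

```python
# Hypothetical sketch: verify per-component Delegation Risk budgets
# stay within the system's acceptable total.
BUDGETS = {
    "static_analysis": 50,
    "security_scanner": 200,
    "code_reviewer": 500,
    "unit_integration_tests": 100,
    "fuzzer": 150,
    "staging_deploy": 50,
    "canary_analysis": 300,
    "orchestrator": 100,
    "production_deploy": 500,
}

ACCEPTABLE_TOTAL = 2_000  # $/month, from the comparison table below

total = sum(BUDGETS.values())
assert total <= ACCEPTABLE_TOTAL, f"over budget: ${total}/month"
print(f"System total: ${total:,}/month (headroom: ${ACCEPTABLE_TOTAL - total})")
```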

Unlike the research assistant, the deployment orchestrator is pure verified code:

```python
if all_checks_passed and human_approved:
    deploy()
else:
    reject()
```

Why: Orchestration in deployment is simple enough to specify formally. Using an LLM would add risk without benefit.
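
A slightly fuller sketch of the same gate, with hypothetical check-result types (none of these names are the framework’s API). The point is that the decision is a pure boolean function with no model call anywhere in the path:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool

def should_deploy(checks: list[CheckResult], human_approved: bool) -> bool:
    """Deterministic gate: every check must pass AND a human must approve."""
    return all(c.passed for c in checks) and human_approved

# One failing check blocks the deploy regardless of approval.
checks = [CheckResult("static_analysis", True),
          CheckResult("unit_tests", True),
          CheckResult("canary", False)]
assert should_deploy(checks, human_approved=True) is False
```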

The LLM code reviewer provides advisory comments, not blocking decisions:

```mermaid
flowchart LR
    Rev[Code Reviewer] -->|"comments, suggestions"| Human
    Static[Static Analysis] -->|"pass/fail"| Orch[Orchestrator]
    Human -->|"approve/reject"| Orch
```

  • Reviewer output: “Consider refactoring this function” (suggestion)
  • Static analysis: “Security vulnerability CVE-2024-XXX” (blocking)

Why: LLMs are good at nuanced review but shouldn’t have veto power. Humans decide.
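
One way to make this structural rather than merely conventional is to give advisory and blocking outputs different types, so reviewer output has no code path into the deployment gate. A hypothetical sketch (the type names are mine, not the framework’s):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Suggestion:
    """LLM reviewer output: surfaced to the human, never to the gate."""
    text: str

@dataclass(frozen=True)
class Finding:
    """Static analysis output: the only machine input the gate accepts."""
    rule_id: str
    blocking: bool

def gate(findings: list[Finding], human_approved: bool) -> bool:
    # Suggestions never appear in this signature, so a compromised
    # reviewer cannot block or approve a deploy.
    return human_approved and not any(f.blocking for f in findings)
```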

The canary analyzer checks specific metrics:

  • Error rate comparison (new vs baseline)
  • Latency percentiles (p50, p95, p99)
  • Resource usage (CPU, memory)

Implementation: a fine-tuned model trained only on the question “is this canary healthy?” (a threshold-based sketch of the same check appears after the list below).

Not allowed:

  • Reasoning about business logic
  • Predicting user behavior
  • Strategic recommendations
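
For intuition, the sketch below reduces the health check to plain threshold comparisons. The real component is a fine-tuned model; the metric names and thresholds here are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests
    p99_latency_ms: float
    cpu_util: float         # 0.0 - 1.0

def canary_healthy(new: Metrics, baseline: Metrics) -> bool:
    """Metrics-only comparison: the canary sees numbers,
    never code content or business context."""
    return (new.error_rate <= baseline.error_rate * 1.10              # errors
            and new.p99_latency_ms <= baseline.p99_latency_ms * 1.20  # latency
            and new.cpu_util <= max(baseline.cpu_util * 1.25, 0.80))  # resources

baseline = Metrics(error_rate=0.010, p99_latency_ms=250.0, cpu_util=0.50)
canary = Metrics(error_rate=0.010, p99_latency_ms=310.0, cpu_util=0.55)
print(canary_healthy(canary, baseline))  # False: p99 regressed more than 20%
```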

No fully autonomous production deployment:

Production Deploy = Code execution + Human approval + Time lock
  • Code: Executes the mechanical deployment
  • Human: Must click “approve” after reviewing canary
  • Time lock: 15-minute wait between canary success and production
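
A sketch of how the three conditions compose; the 15-minute constant comes from the description above, while the function and parameter names are hypothetical:

```python
import time

TIME_LOCK_SECONDS = 15 * 60  # wait between canary success and production

def production_gate(canary_passed_at: float | None,
                    human_approved: bool) -> bool:
    """All three must hold: canary passed, a human approved, and the
    time lock has elapsed (leaving a window in which to cancel)."""
    return (human_approved
            and canary_passed_at is not None
            and time.time() - canary_passed_at >= TIME_LOCK_SECONDS)

# Canary passed 20 minutes ago and a human approved: the gate opens.
print(production_gate(time.time() - 20 * 60, human_approved=True))  # True
```
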
The trust relationships between the components:

```mermaid
flowchart LR
    H[Human] -->|"0.99"| Orch[Orchestrator]
    Orch -->|"0.95"| Static[Static Analysis]
    Orch -->|"0.80"| Rev[Code Reviewer]
    Orch -->|"0.90"| Canary
    H -->|"1.0"| Prod[Prod Deploy]
```

Effective trust in Code Reviewer’s impact on production:

```text
Trust(Human → Production via Reviewer) = 0.99 × 0.80 × ...
But: Reviewer can't directly affect production!
```
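
Effective trust along a delegation chain is the product of its edge weights, and a component with no outgoing edge toward production contributes nothing. A sketch using the weights from the diagram (the helper function is illustrative):

```python
import math

# Trust edges from the diagram above.
TRUST = {
    ("human", "orchestrator"): 0.99,
    ("orchestrator", "static_analysis"): 0.95,
    ("orchestrator", "code_reviewer"): 0.80,
    ("orchestrator", "canary"): 0.90,
    ("human", "prod_deploy"): 1.0,
}

def chain_trust(path: list[str]) -> float:
    """Multiply edge weights along a path; a missing edge means no
    delegation channel exists at all, so the chain's trust is zero."""
    return math.prod(TRUST.get(edge, 0.0) for edge in zip(path, path[1:]))

print(chain_trust(["human", "orchestrator", "code_reviewer"]))  # ~0.792
# No reviewer -> production edge exists, so this chain is 0.0:
print(chain_trust(["human", "orchestrator", "code_reviewer", "prod_deploy"]))
```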

The architecture ensures that even if the Code Reviewer is compromised:

  1. It can only add comments (low damage)
  2. Comments don’t automatically block/approve (no authority)
  3. Human reviews all suggestions (filtering)

Result: Code Reviewer has high capability but near-zero production impact.

How the framework’s principles apply to this architecture:

| Principle | Application |
| --- | --- |
| Least Intelligence | Orchestrator uses no ML. Canary uses narrow model, not frontier. |
| Least Privilege | Code Reviewer can read code but not write to repo or trigger deploys. |
| Least Context | Canary sees only metrics, not code content or business context. |
| Least Persistence | All components stateless. No “learning” across deploys. |
| Least Autonomy | Production deploy requires human approval every time. |
| Least Surprise | Static analysis is deterministic. Fuzzer uses logged seeds. |
| Least Connectivity | Reviewer → Orchestrator is one-way. Reviewer can’t query other components. |
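
The Least Surprise row is worth a concrete note: seeding the fuzzer’s PRNG and logging the seed makes every failing input replayable. A minimal sketch (the seed value and logging convention are illustrative):

```python
import random

def fuzz_inputs(seed: int, n: int = 100) -> list[bytes]:
    """Generate n random test inputs from a single seed, so any
    crash found during testing can be reproduced exactly."""
    rng = random.Random(seed)  # seeded PRNG: reproducible randomness
    return [rng.randbytes(rng.randint(1, 64)) for _ in range(n)]

seed = 424242  # in practice, drawn once per run and written to the deploy log
assert fuzz_inputs(seed) == fuzz_inputs(seed)  # same seed, same inputs
```
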
Failure modes and their mitigations:

| Failure | Mitigation | Residual Risk |
| --- | --- | --- |
| Static analysis misses vulnerability | Defense in depth: security scanner + fuzzer + canary | Medium |
| Code reviewer suggests malicious pattern | Human reviews suggestions; suggestions aren’t executable | Low |
| Canary analysis misjudges health | Conservative thresholds + human approval + time lock | Low |
| Orchestrator bug allows bad deploy | Verified code + formal specification + extensive testing | Very Low |
| Human approves bad deploy | Time lock allows cancellation; rollback automated | Medium |

Compared with the research assistant example:

| Aspect | Research Assistant | Code Deployment |
| --- | --- | --- |
| Primary risk | Wasted effort | Production outage |
| LLM in coordinator | Yes (constrained) | No (verified code) |
| Human in loop | Strategic decisions | Every production deploy |
| Autonomy level | Medium (can run experiments) | Low (advisory only) |
| Acceptable Delegation Risk | $2,000/month | $2,000/month |
| Conservative factor | 1x | 2-3x |

Key insight: the Delegation Risk budget is the same, but the higher-stakes domain means:

  • More verification layers
  • Less LLM in critical path
  • More human involvement
  • Lower autonomy at each step

This architecture is appropriate when:

  • ✅ Actions have immediate, hard-to-reverse consequences
  • ✅ Security is a primary concern
  • ✅ Human oversight is feasible and valuable
  • ✅ Tasks are well-specified enough for verified code

Consider more autonomy (like research assistant) when:

  • Tasks require creativity or judgment
  • Failures are recoverable
  • Human oversight doesn’t scale
  • Speed matters more than caution