Example: Autonomous Code Deployment
A second worked example showing how the framework applies to a higher-stakes domain: autonomous code review and deployment.
“Review pull requests, run tests, and deploy approved changes to production”
Why This Example
Unlike the research assistant (where failures waste time), code deployment has:
- Immediate production impact — bad deploys cause outages
- Security implications — malicious code can compromise systems
- Financial stakes — downtime = revenue loss
This forces us to be more conservative with trust allocations.
Component Architecture
```mermaid
flowchart TB
subgraph Input
PR[Pull Request]
end
subgraph Review["Code Review"]
Static[Static Analysis<br/>deterministic]
Sec[Security Scanner<br/>narrow model]
Rev[Code Reviewer<br/>frontier LLM]
end
subgraph Test["Testing"]
Unit[Unit Tests<br/>code]
Int[Integration Tests<br/>code]
Fuzz[Fuzzing<br/>code]
end
subgraph Deploy["Deployment"]
Stage[Staging Deploy<br/>code]
Canary[Canary Analysis<br/>narrow model]
Prod[Production Deploy<br/>code + human]
end
subgraph Coord["Coordination"]
Orch[Orchestrator<br/>verified code]
end
PR --> Static
PR --> Sec
PR --> Rev
Static --> Orch
Sec --> Orch
Rev --> Orch
Orch --> Unit
Unit --> Int
Int --> Fuzz
Fuzz --> Stage
Stage --> Canary
Canary --> Prod
```
Component Delegation Risk Budgets
| Component | Implementation | Delegation Risk Budget | Rationale |
|---|---|---|---|
| Static Analysis | Verified code | $50/month | Deterministic, auditable |
| Security Scanner | Fine-tuned 7B model | $200/month | Narrow task, well-tested |
| Code Reviewer | Frontier LLM + constraints | $500/month | Creative, but can’t deploy |
| Unit/Integration Tests | Code | $100/month | Deterministic |
| Fuzzer | Code + seeded PRNG | $150/month | Reproducible randomness |
| Staging Deploy | Verified code | $50/month | Isolated environment |
| Canary Analysis | Fine-tuned model | $300/month | Critical gate, narrow task |
| Orchestrator | Verified code only | $100/month | No LLM in critical path |
| Production Deploy | Code + human approval | $500/month | Human in loop |
System Total: $1,950/month
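For a quick sanity check, the per-component budgets above can be totalled and compared against the $2,000/month system budget used later in the comparison table. The dictionary below is just a restatement of the table; the constant names are illustrative, not part of any real pipeline.

```python
# Sketch: restate the per-component Delegation Risk budgets from the table
# and verify the system total stays under the $2,000/month cap.
BUDGETS_USD_PER_MONTH = {
    "static_analysis": 50,
    "security_scanner": 200,
    "code_reviewer": 500,
    "unit_integration_tests": 100,
    "fuzzer": 150,
    "staging_deploy": 50,
    "canary_analysis": 300,
    "orchestrator": 100,
    "production_deploy": 500,
}

SYSTEM_CAP = 2_000  # acceptable Delegation Risk for the whole system, $/month

total = sum(BUDGETS_USD_PER_MONTH.values())
assert total <= SYSTEM_CAP, f"over budget: ${total:,}/month > ${SYSTEM_CAP:,}/month"
print(f"System total: ${total:,}/month (cap ${SYSTEM_CAP:,}/month)")  # $1,950/month
```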
Key Design Decisions
1. No LLM in the Orchestrator
Unlike the research assistant, the deployment orchestrator is pure verified code:
```
if all_checks_passed AND human_approved:
    deploy()
else:
    reject()
```
Why: Orchestration in deployment is simple enough to specify formally. Using an LLM would add risk without benefit.
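A minimal Python sketch of what such a deterministic gate could look like. The check names and the `DeployDecision` type are illustrative assumptions; the point is that the decision is a pure function of boolean check results, with no model in the path.

```python
# Sketch of a purely deterministic deployment gate (no ML in the critical path).
from dataclasses import dataclass

@dataclass(frozen=True)
class DeployDecision:
    approved: bool
    reason: str

BLOCKING_CHECKS = ("static_analysis", "security_scan", "unit_tests",
                   "integration_tests", "fuzzing", "canary")

def decide(check_results: dict[str, bool], human_approved: bool) -> DeployDecision:
    """Deploy only if every blocking check passed AND a human approved."""
    failed = [name for name in BLOCKING_CHECKS if not check_results.get(name, False)]
    if failed:
        return DeployDecision(False, f"blocking checks failed: {failed}")
    if not human_approved:
        return DeployDecision(False, "awaiting human approval")
    return DeployDecision(True, "all checks passed and human approved")
```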
2. Code Reviewer Can’t Block Deploys
The LLM code reviewer provides advisory comments, not blocking decisions:
```mermaid
flowchart LR
Rev[Code Reviewer] -->|"comments, suggestions"| Human
Static[Static Analysis] -->|"pass/fail"| Orch[Orchestrator]
Human -->|"approve/reject"| Orch
```
- Reviewer output: “Consider refactoring this function” (suggestion)
- Static analysis: “Security vulnerability CVE-2024-XXX” (blocking)
Why: LLMs are good at nuanced review but shouldn’t have veto power. Humans decide.
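To make the split concrete, here is a small sketch of how advisory and blocking signals could be routed. The objects and method names (`llm_reviewer.review`, `static_analyzer.scan`, `orchestrator.record_check`) are hypothetical placeholders, not a real API.

```python
# Sketch of the advisory-vs-blocking split: LLM reviewer output becomes comments
# for the human, while only deterministic checks feed the orchestrator's gate.
def route_review_signals(pr, llm_reviewer, static_analyzer, orchestrator):
    # Advisory path: reviewer comments go to the human, never to the gate.
    for comment in llm_reviewer.review(pr):     # e.g. "Consider refactoring this function"
        pr.add_comment(comment)                 # visible to the human reviewer only

    # Blocking path: static analysis produces a pass/fail the orchestrator consumes.
    findings = static_analyzer.scan(pr)         # e.g. known CVEs, lint violations
    orchestrator.record_check("static_analysis", passed=not findings)
```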
3. Canary Analysis is Narrow
The canary analyzer checks specific metrics:
- Error rate comparison (new vs baseline)
- Latency percentiles (p50, p95, p99)
- Resource usage (CPU, memory)
Implementation: A fine-tuned model trained only to answer "is this canary healthy?" (a threshold-check sketch follows the list below)
Not allowed:
- Reasoning about business logic
- Predicting user behavior
- Strategic recommendations
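As referenced above, here is a deterministic sketch of the kind of narrow comparison the canary gate is responsible for. The real component is a fine-tuned model; the metric names and conservative thresholds below are illustrative assumptions, not tuned values.

```python
# Sketch of a narrow canary health check: compare a small, fixed set of metrics
# against the baseline with conservative thresholds. No business logic involved.
def canary_healthy(canary: dict, baseline: dict) -> bool:
    """Return True only if the canary's metrics stay close to the baseline."""
    checks = [
        # Error rate: allow at most a 10% relative increase over baseline.
        canary["error_rate"] <= baseline["error_rate"] * 1.10,
        # Latency percentiles: allow at most a 20% regression.
        canary["p50_ms"] <= baseline["p50_ms"] * 1.20,
        canary["p95_ms"] <= baseline["p95_ms"] * 1.20,
        canary["p99_ms"] <= baseline["p99_ms"] * 1.20,
        # Resource usage: CPU and memory must stay within 25% of baseline.
        canary["cpu_util"] <= baseline["cpu_util"] * 1.25,
        canary["mem_bytes"] <= baseline["mem_bytes"] * 1.25,
    ]
    return all(checks)
```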
4. Production Deploy Requires Human
No fully autonomous production deployment:
```
Production Deploy = Code execution + Human approval + Time lock
```
- Code: Executes the mechanical deployment
- Human: Must click “approve” after reviewing canary
- Time lock: 15-minute wait between canary success and production
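A minimal sketch of how the approval-plus-time-lock gate could be coded. The function signature, the cancellation flag, and the `deploy` callable are assumptions for illustration.

```python
# Sketch of the production gate: mechanical deploy code runs only after a human
# approval AND a 15-minute time lock measured from canary success.
import time

TIME_LOCK_SECONDS = 15 * 60  # wait between canary success and production deploy

def production_deploy(canary_passed_at: float, human_approved: bool,
                      cancelled: bool, deploy) -> bool:
    """Run the deploy only if approved, not cancelled, and the time lock has elapsed."""
    if not human_approved or cancelled:
        return False
    if time.time() - canary_passed_at < TIME_LOCK_SECONDS:
        return False  # still inside the cancellation window
    deploy()          # mechanical, code-only deployment step
    return True
```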
Risk Inheritance
```mermaid
flowchart LR
H[Human] -->|"0.99"| Orch[Orchestrator]
Orch -->|"0.95"| Static[Static Analysis]
Orch -->|"0.80"| Rev[Code Reviewer]
Orch -->|"0.90"| Canary
H -->|"1.0"| Prod[Prod Deploy]
```
Effective trust in Code Reviewer’s impact on production:
```
Trust(Human → Production via Reviewer) = 0.99 × 0.80 × ...
```
But: the Reviewer can't directly affect production! The architecture ensures that even if the Code Reviewer is compromised:
- It can only add comments (low damage)
- Comments don’t automatically block/approve (no authority)
- Human reviews all suggestions (filtering)
Result: Code Reviewer has high capability but near-zero production impact.
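A small sketch of the underlying arithmetic: effective trust along a delegation chain is the product of the pairwise trust values on each edge, so it can only shrink with each hop. The edge values below come from the diagram above; the helper function is illustrative.

```python
# Sketch: effective trust composes multiplicatively along a delegation path.
from math import prod

def effective_trust(edge_trusts: list[float]) -> float:
    """Multiply pairwise trust along a delegation chain."""
    return prod(edge_trusts)

# Human -> Orchestrator -> Code Reviewer, per the risk-inheritance diagram
print(f"{effective_trust([0.99, 0.80]):.3f}")  # 0.792, before any further downstream factors
```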
Applying “Least X” Principles
| Principle | Application |
|---|---|
| Least Intelligence | Orchestrator uses no ML. Canary uses narrow model, not frontier. |
| Least Privilege | Code Reviewer can read code but not write to repo or trigger deploys. |
| Least Context | Canary sees only metrics, not code content or business context. |
| Least Persistence | All components stateless. No “learning” across deploys. |
| Least Autonomy | Production deploy requires human approval every time. |
| Least Surprise | Static analysis is deterministic. Fuzzer uses logged seeds. |
| Least Connectivity | Reviewer → Orchestrator is one-way. Reviewer can’t query other components. |
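One way to make Least Privilege and Least Connectivity concrete is an explicit allow-list per component, denying everything else by default. The permission names below are hypothetical; the point is that the Code Reviewer gets read-only access and no deploy or write capability, matching the table above.

```python
# Sketch of a deny-by-default permission map per component (names are hypothetical).
PERMISSIONS = {
    "code_reviewer":     {"repo:read"},                    # comments flow via the human, no write
    "static_analysis":   {"repo:read"},
    "orchestrator":      {"checks:read", "staging:deploy"},
    "production_deploy": {"prod:deploy"},                  # gated by human approval + time lock
}

def allowed(component: str, action: str) -> bool:
    """Deny by default; grant only actions explicitly listed for the component."""
    return action in PERMISSIONS.get(component, set())

assert not allowed("code_reviewer", "prod:deploy")  # Least Privilege in action
```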
Failure Mode Analysis
| Failure | Mitigation | Residual Risk |
|---|---|---|
| Static analysis misses vulnerability | Defense in depth: security scanner + fuzzer + canary | Medium |
| Code reviewer suggests malicious pattern | Human reviews suggestions; suggestions aren’t executable | Low |
| Canary analysis misjudges health | Conservative thresholds + human approval + time lock | Low |
| Orchestrator bug allows bad deploy | Verified code + formal specification + extensive testing | Very Low |
| Human approves bad deploy | Time lock allows cancellation; rollback automated | Medium |
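A back-of-the-envelope sketch of why the defense-in-depth mitigation lowers the residual risk of a missed vulnerability, under the simplifying assumption that the layers fail independently. The miss rates below are invented for illustration, not measured values.

```python
# Sketch: if verification layers miss a vulnerability independently (an assumption),
# the probability that all of them miss is the product of their miss rates.
miss_rates = {
    "static_analysis":  0.20,  # illustrative, not measured
    "security_scanner": 0.30,
    "fuzzer":           0.50,
    "canary_analysis":  0.40,
}

residual = 1.0
for layer, rate in miss_rates.items():
    residual *= rate

print(f"P(all layers miss) ≈ {residual:.3f}")  # ≈ 0.012 under these assumed rates
```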
Comparison to Research Assistant
| Aspect | Research Assistant | Code Deployment |
|---|---|---|
| Primary risk | Wasted effort | Production outage |
| LLM in coordinator | Yes (constrained) | No (verified code) |
| Human in loop | Strategic decisions | Every production deploy |
| Autonomy level | Medium (can run experiments) | Low (advisory only) |
| Acceptable Delegation Risk | $2,000/month | $2,000/month |
| Conservative factor | 1x | 2-3x |
Key insight: Same Delegation Risk budget, but higher-stakes domain means:
- More verification layers
- Less LLM in critical path
- More human involvement
- Lower autonomy at each step
When to Use This Pattern
This architecture is appropriate when:
- ✅ Actions have immediate, hard-to-reverse consequences
- ✅ Security is a primary concern
- ✅ Human oversight is feasible and valuable
- ✅ Tasks are well-specified enough for verified code
Consider more autonomy (like research assistant) when:
- Tasks require creativity or judgment
- Failures are recoverable
- Human oversight doesn’t scale
- Speed matters more than caution