Other industries have decades of experience quantifying and managing risk. This page provides reference points for calibrating AI delegation risk estimates.
AI risk quantification is new. Established industries provide:
- **Calibration anchors**: What risk levels are achievable?
- **Methodological guidance**: How do mature domains quantify risk?
- **Regulatory precedents**: What standards do regulators accept?
- **Cost-benefit data**: What does risk reduction cost?
| Metric | Definition | Typical Target | Best Achieved |
|---|---|---|---|
| CDF | Core Damage Frequency (per reactor-year) | < 10⁻⁴ | ~10⁻⁵ |
| LERF | Large Early Release Frequency | < 10⁻⁵ | ~10⁻⁶ |
| SCRAM rate | Emergency shutdowns per year | < 1 | ~0.5 |
// Fault tree analysis: AND/OR gates
// Top event probability: multiply for AND gates, sum (rare-event approximation) for OR gates
// Example: coolant system failure
const pumps_fail = 1e-3    // per demand
const valves_fail = 1e-4   // per demand
const backup_fails = 1e-2  // conditional on primary failure
// With redundancy: P(primary fails AND backup fails), OR'd across primary causes
const coolant_failure = pumps_fail * backup_fails + valves_fail * backup_fails
// ≈ 1.1 × 10⁻⁵ per demand
| Nuclear Concept | AI Analog | Typical AI Value |
|---|---|---|
| CDF (~10⁻⁵/year) | Critical system failure | ~10⁻²/year (needs improvement) |
| SCRAM rate (~0.5/year) | Emergency human takeover | ~10/year (much higher) |
| Defense in depth | Layered verification | Partially implemented |
| Common cause failure | Correlated AI errors | Major concern |
**Insight**: Nuclear achieves 10⁻⁵ failure rates through:
- Extensive redundancy
- Formal safety analysis
- Continuous monitoring
- Strong safety culture
AI systems are ~1000× less reliable for critical failures.
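Redundancy is the biggest of these levers, but it only helps to the extent the redundant checks fail independently; the common cause failure row in the mapping above is the catch. A minimal sketch using the standard beta-factor model for common cause failure (the function name and rates are illustrative assumptions, not measured AI figures):

```typescript
// Redundancy vs. common cause: probability that all channels fail on a demand.
// Independent failures shrink geometrically with added channels;
// a common-cause fraction (beta) puts a floor under the benefit.
function redundantFailure(pSingle: number, channels: number, beta: number): number {
  const independent = Math.pow((1 - beta) * pSingle, channels); // all channels fail separately
  const commonCause = beta * pSingle;                           // one shared fault fails them all
  return independent + commonCause;
}

const pSingle = 1e-2; // assumed per-demand failure rate of one AI check
console.log(redundantFailure(pSingle, 3, 0));   // ~1e-6: idealized, fully independent channels
console.log(redundantFailure(pSingle, 3, 0.1)); // ~1e-3: a 10% common-cause fraction dominates
```

Correlated AI errors (same model, same prompt, same blind spots) push the common-cause fraction up, which is why the table flags them as a major concern.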
| Metric | Definition | Current Level |
|---|---|---|
| Fatal accident rate | Per million flights | ~0.2 |
| Catastrophic failure | Per flight hour | < 10⁻⁹ |
| Software criticality | DO-178C Level A | ~10⁻⁹ per hour |
| Level | Failure Condition | Probability Target | AI Equivalent |
|---|---|---|---|
| A | Catastrophic | < 10⁻⁹ | Safety-critical autonomous |
| B | Hazardous | < 10⁻⁷ | Mission-critical AI |
| C | Major | < 10⁻⁵ | Important AI functions |
| D | Minor | < 10⁻³ | Convenience features |
| E | No effect | No requirement | Internal tools |
| Level | Structural Coverage | Reviews | Testing |
|---|---|---|---|
| A | MC/DC | Formal | Exhaustive |
| B | Decision | Independent | Extensive |
| C | Statement | Peer | Moderate |
| D | None required | Self | Basic |
**AI Implication**: Most AI systems would not pass Level C requirements, let alone Level A.
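As a rough calibration exercise, one can ask which of these design-assurance levels a demonstrated failure rate would even be eligible for. A hypothetical sketch using the probability targets from the table above (the failure rate plugged in is an assumed, AI-like figure):

```typescript
// Highest DO-178C-style level whose probability target a demonstrated
// per-hour failure rate satisfies. Thresholds are the table's targets.
const levelTargets: Array<[string, number]> = [
  ["A", 1e-9],
  ["B", 1e-7],
  ["C", 1e-5],
  ["D", 1e-3],
];

function highestLevelMet(failuresPerHour: number): string {
  for (const [level, target] of levelTargets) {
    if (failuresPerHour < target) return level;
  }
  return "none (exceeds every quantitative target)";
}

console.log(highestLevelMet(1e-4)); // assumed AI-like rate → "D" (convenience features only)
```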
// Commercial aviation: ~0.2 fatal accidents per million flights
// Flight duration: ~2 hours average
// Fatal accident rate: ~10⁻⁷ per flight hour
// AI task: ~1000 tokens, ~30 seconds
// To match aviation safety:
// AI critical failure rate should be < 10⁻⁷ × (30/3600) ≈ 10⁻⁹ per task
// This is far below current AI reliability
// Realistic AI critical failure: ~10⁻⁴ per task
// Gap: ~100,000× worse than aviation
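The same arithmetic as a small helper, which makes it easy to try other task durations or baselines (the function name and the 30-second task length mirror the assumptions in the comments above):

```typescript
// Convert an acceptable per-hour failure rate into an equivalent per-task budget.
function perTaskBudget(perHourRate: number, taskSeconds: number): number {
  return perHourRate * (taskSeconds / 3600);
}

const aviationPerHour = 1e-7;                      // fatal accident rate per flight hour
const target = perTaskBudget(aviationPerHour, 30); // ≈ 8×10⁻¹⁰ per 30-second task
const currentAI = 1e-4;                            // assumed critical failure rate per task
console.log(currentAI / target);                   // ≈ 1.2×10⁵ → the ~100,000× gap
```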
// VaR: maximum loss at a given confidence level over a time horizon
// Example: a 1-day 99% VaR of $1M means
// P(loss > $1M in one day) < 1%
// For AI delegation risk:
function delegationVaR(confidence, horizonMonths) {
  // Quantile of the per-month loss distribution, scaled to the horizon
  // (linear scaling with horizon is a simplifying assumption)
  return quantile(riskDistribution, confidence) * horizonMonths
}
// Example: 30-day 95% VaR for an AI system
// If monthly loss ~ lognormal(log(1000), 1.0),
// the 95th percentile ≈ $5,000/month
// ES: Average loss given loss exceeds VaR
// Better captures tail risk than VaR
// ES = E[Loss | Loss > VaR]
// For heavy-tailed AI damages:
// ES_95 might be $15,000 (tail is 3× the threshold)
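Both quantities are easy to estimate by Monte Carlo from whatever loss model is in use; a sketch using the illustrative lognormal from the VaR example above (for the heavier-tailed damage models the text describes, ES would come out correspondingly larger than this):

```typescript
// Monte Carlo VaR and Expected Shortfall for a monthly loss distribution.
function sampleLognormal(mu: number, sigma: number): number {
  const u1 = 1 - Math.random(), u2 = Math.random();           // Box-Muller transform
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}

const losses = Array.from({ length: 100_000 }, () => sampleLognormal(Math.log(1000), 1.0))
  .sort((a, b) => a - b);

const p = 0.95;
const cutoff = Math.floor(p * losses.length);
const var95 = losses[cutoff];                                   // ≈ $5,000/month
const tail = losses.slice(cutoff);
const es95 = tail.reduce((sum, x) => sum + x, 0) / tail.length; // ≈ $8,500/month for this lognormal
console.log({ var95, es95 });
```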
| Principle | Financial Application | AI Application |
|---|---|---|
| Risk parity | Equal risk per asset | Equal risk per component |
| Marginal contribution | Risk added by position | Risk added by capability |
| Euler allocation | Total = sum of contributions | Budget = sum of component risks |
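For standard-deviation-style risk measures, the Euler allocation is a one-line computation; a sketch with made-up exposures and a made-up covariance between AI components (every number here is an assumption for illustration):

```typescript
// Euler allocation: with risk R(w) = sqrt(wᵀ Σ w), component i contributes
// w_i * (Σ w)_i / R, and the contributions sum exactly to the total.
const w = [1.0, 2.0, 0.5];   // assumed exposure of each AI capability
const cov = [                // assumed covariance of their loss rates
  [0.04, 0.01, 0.00],
  [0.01, 0.09, 0.02],
  [0.00, 0.02, 0.16],
];

const sigmaW = cov.map(row => row.reduce((s, c, j) => s + c * w[j], 0)); // Σw
const totalRisk = Math.sqrt(w.reduce((s, wi, i) => s + wi * sigmaW[i], 0));
const contributions = w.map((wi, i) => (wi * sigmaW[i]) / totalRisk);

console.log(totalRisk, contributions.reduce((s, c) => s + c, 0)); // the two numbers match
```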
| Category | Rate | Source |
|---|---|---|
| Diagnostic errors | 10-15% | Meta-analyses |
| Medication errors | 1-2% of administrations | Hospital data |
| Surgical complications | 3-17% | Procedure-dependent |
| Preventable adverse events | 2-4% of hospitalizations | IOM estimates |
// Human diagnostic accuracy: ~85-90%
// AI triage accuracy should match or exceed this: > 90%
// Human medication error rate: 1-2%; AI dosing recommendation error rate should be: < 0.5%
// False negative rate (missing a serious condition):
// AI should be: < 2% (a higher bar, because misses are safety-critical)
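A false-negative target is only interpretable alongside prevalence; a small sketch of the standard sensitivity/specificity arithmetic (the prevalence and rates are illustrative, not clinical figures):

```typescript
// Predictive values from sensitivity, specificity, and prevalence (Bayes' rule).
function predictiveValues(sens: number, spec: number, prevalence: number) {
  const tp = sens * prevalence;
  const fn = (1 - sens) * prevalence;
  const fp = (1 - spec) * (1 - prevalence);
  const tn = spec * (1 - prevalence);
  return {
    ppv: tp / (tp + fp),       // P(condition | positive)
    npv: tn / (tn + fn),       // P(no condition | negative)
    missRate: fn / prevalence, // false negatives among true cases = 1 - sensitivity
  };
}

// Example: 98% sensitivity (2% miss rate), 90% specificity, 5% prevalence
console.log(predictiveValues(0.98, 0.90, 0.05));
// → ppv ≈ 0.34, npv ≈ 0.999: few missed cases, but most positives are false alarms
```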
| Standard | Requirement | AI Mapping |
|---|---|---|
| FDA 510(k) | Substantial equivalence | Match human performance |
| FDA De Novo | Novel with reasonable assurance | Exceed human + demonstrate safety |
| FDA PMA | Highest scrutiny | Extensive clinical evidence |
| Metric | Value | Source |
|---|---|---|
| Average breach cost | $4.45M | IBM 2023 |
| Mean time to identify | 204 days | IBM 2023 |
| Mean time to contain | 73 days | IBM 2023 |
| Breaches with AI involvement | Growing | Emerging |
// Phishing click rate: 3-5%
// Successful exploitation given a click: 20-40%
// End-to-end compromise rate: ~1-2%
// AI-assisted attacks may have higher success rates:
aiAssistedAttack_success = baseRate * uplift // uplift estimated in the 2-5× range
// Defense evasion by AI:
// Traditional filters catch 80-90% of phishing
// Against AI-crafted content: 40-60%
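One way to combine the breach-cost table with these rates is a standard annualized-loss-expectancy calculation; a rough sketch in which the attempt volume and uplift factor are assumptions, not measured values:

```typescript
// Annualized loss expectancy (ALE) ≈ expected compromises per year × cost per breach.
function annualizedLossExpectancy(attemptsPerYear: number, compromiseRate: number, breachCost: number): number {
  return attemptsPerYear * compromiseRate * breachCost;
}

const breachCost = 4_450_000; // average breach cost from the table above
const attempts = 50;          // assumed targeted campaigns per year
console.log(annualizedLossExpectancy(attempts, 0.015, breachCost));     // baseline ≈ $3.3M
console.log(annualizedLossExpectancy(attempts, 0.015 * 3, breachCost)); // with ~3× AI uplift ≈ $10M
```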
| Quality Level | Defects per KLOC | Industry |
|---|---|---|
| Typical commercial | 15-50 | General |
| High-quality | 5-10 | Well-tested |
| Safety-critical | 0.5-2 | Aerospace, medical |
| Best-in-class | < 0.1 | NASA, formal methods |
// Defect detection rates by method (beta distributions express uncertainty around each rate):
codeReview_detection = beta(50, 50)      // mean ~50%
unitTest_detection = beta(40, 60)        // mean ~40%
integrationTest_detection = beta(45, 55) // mean ~45%
systemTest_detection = beta(35, 65)      // mean ~35%
// Combined, assuming the methods detect defects independently:
combined_detection = 1 - (1-0.5)*(1-0.4)*(1-0.45)*(1-0.35) // ≈ 0.89
// Residual defects: ~11% escape to production
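Combining the defect-density table with this pipeline gives an order-of-magnitude estimate of what ships; a sketch with an assumed codebase size and injection rate:

```typescript
// Expected defects escaping to production = KLOC × injected defects per KLOC × residual fraction.
function shippedDefects(kloc: number, defectsPerKloc: number, detectionRate: number): number {
  return kloc * defectsPerKloc * (1 - detectionRate);
}

// Assumed: a 100 KLOC system with typical-commercial injection (~30/KLOC)
// and the ~89% combined detection estimated above.
console.log(shippedDefects(100, 30, 0.89)); // ≈ 330 defects reaching production
```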
| Domain | Critical Failure Target | Achieved | AI Current | Gap |
|---|---|---|---|---|
| Nuclear | 10⁻⁵/year | 10⁻⁵ | 10⁻²/year | 1000× |
| Aviation | 10⁻⁹/hour | 10⁻⁷ | 10⁻⁴/hour | 1000× |
| Medical devices | 10⁻⁶/use | 10⁻⁵ | 10⁻³/use | 100× |
| Financial trading | 10⁻⁴/day | 10⁻³ | 10⁻²/day | 10× |
| General software | 10⁻³/operation | 10⁻² | 10⁻¹/operation | 10× |
- AI reliability similar to general software (~10⁻²)
  - Appropriate for low-stakes, human-in-the-loop applications
  - Needs extensive monitoring and fallback
- Target financial-grade reliability (~10⁻³ to 10⁻⁴)
  - Suitable for moderate-stakes use with verification
  - Formal methods for critical paths
- Target medical-device-grade reliability (~10⁻⁵ to 10⁻⁶)
  - Required for safety-critical autonomous systems
  - May require fundamental advances
- Aviation/nuclear-grade reliability (~10⁻⁹)
  - Uncertain whether achievable with current approaches
  - May require formal verification of AI systems
From nuclear (probabilistic risk assessment):
- Fault trees and event trees
- Common cause failure analysis
- Importance measures (Fussell-Vesely, Risk Achievement Worth); see the sketch after this list

From aviation (DO-178C):
- Criticality classification
- Verification requirements that scale with criticality
- Independence requirements

From finance:
- Risk aggregation
- Diversification benefits
- Coherent risk measures

From medicine:
- Prospective studies
- Sensitivity/specificity tradeoffs
- Subgroup analysis

From cybersecurity:
- Attack surface analysis
- Defense in depth
- Assume-breach mentality
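For the importance measures named above, a minimal sketch on the two-cause coolant fault tree from the nuclear example (rates reused from that block; the function names are illustrative):

```typescript
// Fussell-Vesely (FV): share of top-event risk that involves a given basic event.
// Risk Achievement Worth (RAW): factor by which risk rises if the event is certain to occur.
type Rates = { pumps: number; valves: number; backup: number };
const rates: Rates = { pumps: 1e-3, valves: 1e-4, backup: 1e-2 };

// (pumps AND backup) OR (valves AND backup), rare-event approximation for OR
const topEvent = (p: Rates) => p.pumps * p.backup + p.valves * p.backup;
const base = topEvent(rates);

const fussellVesely = (e: keyof Rates) => 1 - topEvent({ ...rates, [e]: 0 }) / base;
const raw = (e: keyof Rates) => topEvent({ ...rates, [e]: 1 }) / base;

console.log(fussellVesely("backup"), raw("backup")); // 1.0, ~100: the shared backup path dominates
console.log(fussellVesely("pumps"), raw("pumps"));   // ~0.91, ~909
```

The high values for the shared backup illustrate why correlated verification failures matter most: improving the component with the largest importance measures buys the most risk reduction.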