
Cross-Domain Benchmarks

Other industries have decades of experience quantifying and managing risk. This page provides reference points for calibrating AI delegation risk estimates.

AI risk quantification is new. Established industries provide:

  1. Calibration anchors: What risk levels are achievable?
  2. Methodological guidance: How do mature domains quantify risk?
  3. Regulatory precedents: What standards do regulators accept?
  4. Cost-benefit data: What does risk reduction cost?

Nuclear Power

| Metric | Definition | Typical Target | Best Achieved |
| --- | --- | --- | --- |
| CDF | Core Damage Frequency (per reactor-year) | < 10⁻⁴ | ~10⁻⁵ |
| LERF | Large Early Release Frequency (per reactor-year) | < 10⁻⁵ | ~10⁻⁶ |
| SCRAM rate | Emergency shutdowns per reactor-year | < 1 | ~0.5 |

// Fault tree analysis: AND/OR gates
// OR gates sum basic-event probabilities (rare-event approximation);
// AND gates multiply them
// Example: coolant system failure
pumps_fail = 1e-3    // per demand
valves_fail = 1e-4   // per demand
backup_fails = 1e-2  // conditional on primary failure
// Primary fails if pumps OR valves fail; the top event needs primary AND backup to fail
coolant_failure = (pumps_fail + valves_fail) * backup_fails
// ≈ 1.1e-5 per demand

| Nuclear Concept | AI Analog | Typical AI Value |
| --- | --- | --- |
| CDF ~10⁻⁵/year | Critical system failure | ~10⁻²/year (needs improvement) |
| SCRAM ~0.5/year | Emergency human takeover | ~10/year (much higher) |
| Defense in depth | Layered verification | Partially implemented |
| Common cause failure | Correlated AI errors | Major concern |

Insight: Nuclear achieves 10⁻⁵ failure rates through:

  • Extensive redundancy
  • Formal safety analysis
  • Continuous monitoring
  • Strong safety culture

AI systems are ~1000× less reliable for critical failures.
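
The common-cause row deserves emphasis: redundant AI checks help only insofar as their errors are uncorrelated. A minimal Python sketch, using an illustrative beta-factor common-cause model (the single-check failure rate and the beta values are assumed, not measured):

```python
# Sketch: why correlated errors limit what redundancy buys.
# Beta-factor common-cause model: a fraction `beta` of each check's failure
# probability is a shared cause that defeats every redundant copy at once.
# All numbers are illustrative assumptions, not measured values.

def redundant_failure_prob(p_single: float, n_copies: int, beta: float) -> float:
    """P(all n redundant checks fail) under a beta-factor common-cause model."""
    p_common = beta * p_single              # shared failure cause hits every copy
    p_independent = (1 - beta) * p_single   # each copy's independent failure
    return p_common + (1 - p_common) * p_independent ** n_copies

p_single = 1e-2  # assumed per-task failure rate of one AI check

for beta in (0.0, 0.1, 0.5):
    p = redundant_failure_prob(p_single, n_copies=3, beta=beta)
    print(f"beta={beta:.1f}: triple-redundant failure ≈ {p:.1e}")

# With independent copies (beta=0) three checks reach ~1e-6, but even 10%
# common cause keeps the floor near beta * p_single ≈ 1e-3.
```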


Aviation

| Metric | Definition | Value |
| --- | --- | --- |
| Fatal accident rate | Per million flights | ~0.2 |
| Catastrophic failure condition | Certification target per flight hour | < 10⁻⁹ |
| Software criticality | DO-178C Level A | ~10⁻⁹ per hour |

DO-178C assigns design assurance levels by the severity of a failure condition:

| Level | Failure Condition | Probability Target (per flight hour) | AI Equivalent |
| --- | --- | --- | --- |
| A | Catastrophic | < 10⁻⁹ | Safety-critical autonomous systems |
| B | Hazardous | < 10⁻⁷ | Mission-critical AI |
| C | Major | < 10⁻⁵ | Important AI functions |
| D | Minor | < 10⁻³ | Convenience features |
| E | No effect | No requirement | Internal tools |
Verification rigor scales with level:

| Level | Structural Coverage | Reviews | Testing |
| --- | --- | --- | --- |
| A | MC/DC (100%) | Formal | Exhaustive |
| B | Decision (100%) | Independent | Extensive |
| C | Statement | Peer | Moderate |
| D | None required | Self | Basic |

AI Implication: Most AI systems would not pass Level C requirements, let alone Level A.
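
One way to operationalize this mapping is a simple lookup keyed to the worst credible outcome of a failure. A Python sketch, where the level data follow the tables above and the example use-case classification is an assumption:

```python
# Sketch: look up the assurance regime implied by the worst credible outcome of
# an AI function's failure. Level data follow the tables above; the example
# use-case classification is an assumption for illustration.

DAL = {
    "catastrophic": ("A", 1e-9, "MC/DC coverage, formal reviews, exhaustive testing"),
    "hazardous":    ("B", 1e-7, "decision coverage, independent reviews"),
    "major":        ("C", 1e-5, "statement coverage, peer reviews"),
    "minor":        ("D", 1e-3, "basic testing, self-review"),
    "no effect":    ("E", None, "no requirement"),
}

def required_assurance(worst_credible_outcome: str):
    """Return (level, per-flight-hour probability target, verification regime)."""
    return DAL[worst_credible_outcome]

# Example (assumed classification): an agent that can move money unsupervised
print(required_assurance("hazardous"))
# ('B', 1e-07, 'decision coverage, independent reviews')
```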

// Commercial aviation: ~0.2 fatal accidents per million flights
// Flight duration: ~2 hours average
// Fatal accident rate: ~10⁻⁷ per flight hour
// AI task: ~1000 tokens, ~30 seconds
// To match aviation safety:
// AI critical failure rate should be < 10⁻⁷ × (30/3600) ≈ 10⁻⁹ per task
// This is far below current AI reliability
// Realistic AI critical failure: ~10⁻⁴ per task
// Gap: ~100,000× worse than aviation
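
The same arithmetic as a runnable Python sketch; the 30-second task length and the ~10⁻⁴ current failure rate are the rough figures assumed above:

```python
# Sketch: turn the aviation benchmark into a per-task failure budget.
# Flight statistics follow the comments above; the 30-second task length and
# the ~1e-4 current AI failure rate are the rough figures assumed in the text.
fatal_per_million_flights = 0.2
avg_flight_hours = 2.0
per_flight_hour = fatal_per_million_flights / 1e6 / avg_flight_hours  # ≈ 1e-7

task_seconds = 30
per_task_budget = per_flight_hour * task_seconds / 3600               # ≈ 8e-10

current_ai_failure = 1e-4
print(f"budget ≈ {per_task_budget:.1e} per task, "
      f"gap ≈ {current_ai_failure / per_task_budget:,.0f}×")
# ≈ 8.3e-10 per task; current AI is on the order of 100,000× above budget.
```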

Finance

// VaR: maximum loss at a given confidence level over a time horizon
// Example: a 1-day 99% VaR of $1M means
// P(loss > $1M in one day) < 1%
// For AI delegation risk (riskDistribution = loss per period; scaling the
// quantile linearly by the horizon is a rough approximation):
delegationVaR(confidence, horizon) = quantile(riskDistribution, confidence) * horizon
// Example: 30-day 95% VaR for an AI system
// If monthly loss ~ lognormal(log(1000), 1.0),
// the 95th percentile ≈ $5,000/month
// ES (expected shortfall): average loss given that the loss exceeds VaR
// Captures tail risk better than VaR
// ES = E[Loss | Loss > VaR]
// For heavy-tailed AI damage distributions:
// if VaR_95 = $5,000,
// ES_95 might be $15,000 (the tail mean is ~3× the threshold)
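
A Python sketch of the same VaR/ES calculation by Monte Carlo, using the illustrative lognormal from the example above:

```python
# Sketch: Monte Carlo VaR and expected shortfall for a monthly loss distribution.
# The lognormal parameters mirror the example above and are illustrative.
import numpy as np

rng = np.random.default_rng(0)
monthly_loss = rng.lognormal(mean=np.log(1000), sigma=1.0, size=1_000_000)

var_95 = np.quantile(monthly_loss, 0.95)            # 95% VaR
es_95 = monthly_loss[monthly_loss > var_95].mean()  # mean loss beyond VaR

print(f"VaR_95 ≈ ${var_95:,.0f}/month, ES_95 ≈ ${es_95:,.0f}/month")
# For this lognormal, ES_95 comes out near 1.7× VaR_95 (≈ $8,600 vs ≈ $5,200);
# heavier-tailed damage models are what push ES toward 3× VaR or more.
```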

Risk-allocation principles from portfolio management map onto AI risk budgeting:

| Principle | Financial Application | AI Application |
| --- | --- | --- |
| Risk parity | Equal risk per asset | Equal risk per component |
| Marginal contribution | Risk added by a position | Risk added by a capability |
| Euler allocation | Total risk = sum of contributions | Risk budget = sum of component risks |
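
A minimal Python sketch of Euler allocation for a standard-deviation risk measure; the exposures and covariance matrix are assumptions chosen only to show that the contributions sum to the total:

```python
# Sketch: Euler allocation of a standard-deviation risk measure.
# Each component's contribution is w_i * (cov @ w)_i / total, and the
# contributions sum exactly to the total. Exposures and covariance are
# illustrative assumptions.
import numpy as np

w = np.array([1.0, 1.0, 1.0])        # exposure to each AI capability/component
cov = np.array([[4.0, 1.0, 0.5],     # covariance of per-component losses
                [1.0, 2.0, 0.2],
                [0.5, 0.2, 1.0]])

total_risk = np.sqrt(w @ cov @ w)
contributions = w * (cov @ w) / total_risk

print(f"total risk: {total_risk:.3f}")
print("per-component contributions:", np.round(contributions, 3),
      "sum:", round(float(contributions.sum()), 3))
# The contributions sum to the total, so a risk budget can be split by component.
```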

Healthcare

| Category | Rate | Source |
| --- | --- | --- |
| Diagnostic errors | 10-15% | Meta-analyses |
| Medication errors | 1-2% of administrations | Hospital data |
| Surgical complications | 3-17% | Procedure-dependent |
| Preventable adverse events | 2-4% of hospitalizations | IOM estimates |

// Human diagnostic accuracy: ~85-90%
// AI triage should match or exceed: > 90%
// Medication error rate (human): ~1-2%; AI dosing error rate should be: < 0.5%
// False negative rate (missing a serious condition)
// Human: ~5-10%
// AI should be: < 2% (a higher bar for safety-critical use)
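
A Python sketch of how a triage model could be checked against these thresholds; the confusion-matrix counts are invented purely for illustration:

```python
# Sketch: checking a triage model against the thresholds above.
# The confusion-matrix counts are invented purely for illustration.

def triage_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),          # serious cases caught
        "false_negative_rate": fn / (tp + fn),  # serious cases missed
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

m = triage_metrics(tp=485, fn=15, tn=9300, fp=200)
meets_bar = m["false_negative_rate"] < 0.02 and m["accuracy"] > 0.90
print(m, "meets thresholds:", meets_bar)
# Here the false-negative rate is 15/500 = 3%, so this hypothetical model
# fails the < 2% bar even though its overall accuracy is ~98%.
```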

Regulatory precedents from medical devices suggest how such thresholds get enforced:

| Standard | Requirement | AI Mapping |
| --- | --- | --- |
| FDA 510(k) | Substantial equivalence | Match human performance |
| FDA De Novo | Novel device, reasonable assurance of safety | Exceed human performance and demonstrate safety |
| FDA PMA | Highest scrutiny (premarket approval) | Extensive clinical evidence |

Cybersecurity

| Metric | Value | Source |
| --- | --- | --- |
| Average breach cost | $4.45M | IBM 2023 |
| Mean time to identify a breach | 204 days | IBM 2023 |
| Mean time to contain a breach | 73 days | IBM 2023 |
| Breaches involving AI | Growing | Emerging data |

// Phishing click rate: 3-5%
// Successful exploitation given a click: 20-40%
// End-to-end compromise rate: ~1-2%
// AI-assisted attacks may have higher success (estimated 2-5× multiplier):
aiAssistedAttack_success = baseRate * (2 to 5)
// ~2-10% success rate
// Defense evasion by AI:
// Traditional filters catch 80-90% of conventional phishing
// Against AI-crafted content: 40-60%
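
A Python sketch of the same attack chain by Monte Carlo; uniform distributions over the stated ranges, and the 2-5× multiplier, are assumptions:

```python
# Sketch: Monte Carlo over the attack chain above. Uniform distributions over
# the stated ranges, and the 2-5x multiplier, are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

click = rng.uniform(0.03, 0.05, n)       # phishing click rate
exploit = rng.uniform(0.20, 0.40, n)     # exploitation success given a click
multiplier = rng.uniform(2.0, 5.0, n)    # assumed AI-assistance multiplier

baseline = click * exploit
ai_assisted = np.clip(baseline * multiplier, 0.0, 1.0)

print(f"baseline compromise ≈ {baseline.mean():.1%}, "
      f"AI-assisted ≈ {ai_assisted.mean():.1%}")
# Roughly 1% baseline vs 4% with AI assistance under these assumptions.
```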

Software Engineering

| Quality Level | Defects per KLOC | Industry |
| --- | --- | --- |
| Typical commercial | 15-50 | General |
| High-quality | 5-10 | Well-tested systems |
| Safety-critical | 0.5-2 | Aerospace, medical |
| Best-in-class | < 0.1 | NASA, formal methods |

// Defect detection rates by verification method:
codeReview_detection = beta(50, 50)        // ~50%
unitTest_detection = beta(40, 60)          // ~40%
integrationTest_detection = beta(45, 55)   // ~45%
systemTest_detection = beta(35, 65)        // ~35%
// Combined, assuming each method detects defects independently:
combined_detection = 1 - (1-0.5)*(1-0.4)*(1-0.45)*(1-0.35)
// ~89% detection rate
// Residual defects: ~11% escape to production
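
A Python sketch that propagates the beta-distributed detection rates through the same independence assumption, putting an uncertainty band on the escape rate:

```python
# Sketch: propagate the beta-distributed detection rates above through the same
# independence assumption to get an uncertainty band on the escape rate.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

methods = [(50, 50), (40, 60), (45, 55), (35, 65)]  # beta parameters from above
escape = np.ones(n)
for a, b in methods:
    escape *= 1 - rng.beta(a, b, n)                  # defect slips past this method

print(f"median escape rate: {np.median(escape):.1%}, "
      f"90% interval: {np.quantile(escape, 0.05):.1%}-{np.quantile(escape, 0.95):.1%}")
# Centered near ~11%, consistent with the point estimate above.
```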

Summary: Reliability Gaps Across Domains

| Domain | Critical Failure Target | Best Achieved | AI Current | Gap (AI vs. achieved) |
| --- | --- | --- | --- | --- |
| Nuclear | 10⁻⁵/year | 10⁻⁵ | 10⁻²/year | 1000× |
| Aviation | 10⁻⁹/hour | 10⁻⁷ | 10⁻⁴/hour | 1000× |
| Medical devices | 10⁻⁶/use | 10⁻⁵ | 10⁻³/use | 100× |
| Financial trading | 10⁻⁴/day | 10⁻³ | 10⁻²/day | 10× |
| General software | 10⁻³/operation | 10⁻² | 10⁻¹/operation | 10× |
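
The Gap column is the current AI failure rate divided by the best rate the mature domain has achieved; a few lines of Python reproduce it from the table's order-of-magnitude figures:

```python
# Sketch: the Gap column is the current AI rate divided by the best achieved
# rate, using the order-of-magnitude figures from the table.
benchmarks = {
    "Nuclear":           {"achieved": 1e-5, "ai_current": 1e-2},
    "Aviation":          {"achieved": 1e-7, "ai_current": 1e-4},
    "Medical devices":   {"achieved": 1e-5, "ai_current": 1e-3},
    "Financial trading": {"achieved": 1e-3, "ai_current": 1e-2},
    "General software":  {"achieved": 1e-2, "ai_current": 1e-1},
}

for domain, b in benchmarks.items():
    print(f"{domain}: ~{b['ai_current'] / b['achieved']:,.0f}× gap")
```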

Implications for AI delegation, read as an ascending ladder of reliability targets:

  1. Today: AI reliability is similar to general software (~10⁻² critical failures per operation); appropriate for low-stakes, human-in-the-loop applications; needs extensive monitoring and fallback.
  2. Next: target financial-grade reliability (~10⁻³ to 10⁻⁴); suitable for moderate-stakes use with verification; apply formal methods to critical paths.
  3. Then: target medical-device-grade reliability (~10⁻⁵ to 10⁻⁶); required for safety-critical autonomous systems; may require fundamental advances.
  4. Eventually: aviation/nuclear-grade reliability (~10⁻⁹); uncertain whether achievable with current approaches; may require formal verification of AI systems.

From Nuclear: Probabilistic Risk Assessment

  • Fault trees and event trees
  • Common cause failure analysis
  • Importance measures (Fussell-Vesely, Risk Achievement Worth)

From Aviation: Design Assurance Levels

  • Criticality classification
  • Verification requirements scaling
  • Independence requirements

From Finance: Risk Measurement and Allocation

  • Risk aggregation
  • Diversification benefits
  • Coherent risk measures

From Healthcare: Clinical Validation

  • Prospective studies
  • Sensitivity/specificity tradeoffs
  • Subgroup analysis

From Cybersecurity: Adversarial Thinking

  • Attack surface analysis
  • Defense in depth
  • Assume breach mentality