# Trust Calibration
This page explains how to convert observed track records into calibrated trust estimates using Bayesian updating.
## The Trust Update Problem

You have:
- A prior belief about component reliability
- Observations of successes and failures
- A need for a posterior belief to feed risk calculations
## Bayesian Framework

### Model Setup
Let θ = true reliability (probability of success per task).
```
// Prior: What we believe before observations
// Beta distribution is the conjugate prior for the binomial likelihood
prior = beta(alpha_0, beta_0)

// Observation model
// P(success | θ) = θ
// P(failure | θ) = 1 - θ

// Posterior after n successes, m failures
posterior = beta(alpha_0 + n, beta_0 + m)
```
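The conjugate update above is simple enough to sketch directly. The function names and example counts below are illustrative, not part of any framework API:

```python
# Minimal sketch of the Beta-Binomial conjugate update (illustrative names).
def beta_update(alpha_0, beta_0, successes, failures):
    """Return posterior Beta parameters after the observed counts."""
    return alpha_0 + successes, beta_0 + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Example: optimistic Beta(9, 1) prior, then 50 successes and 3 failures
a, b = beta_update(9, 1, 50, 3)
print(a, b, round(beta_mean(a, b), 3))  # 59 4 0.937
```

Because the Beta prior is conjugate to the binomial likelihood, the update is just addition of counts; no numerical integration is needed.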
### Prior Selection

The prior encodes what we believe before seeing any data:
```
// Uninformative prior (know nothing)
uninformative = beta(1, 1)  // Uniform over [0, 1]

// Skeptical prior (expect ~50% reliability)
skeptical = beta(5, 5)  // Mean: 50%, concentrated around middle

// Optimistic prior (expect ~90% reliability)
optimistic = beta(9, 1)  // Mean: 90%, assumes competence

// Pessimistic prior (expect ~10% reliability)
pessimistic = beta(1, 9)  // Mean: 10%, assumes unreliability

// Informative prior based on component type
prior_deterministicCode = beta(100, 1)  // ~99% reliable
prior_narrowML = beta(19, 1)            // ~95% reliable
prior_generalLLM = beta(9, 1)           // ~90% reliable
prior_RLAgent = beta(4, 1)              // ~80% reliable
```
### Prior Strength

The sum α₀ + β₀ determines how much data is needed to move the posterior:
| Prior Strength | α₀ + β₀ | Data to Move Significantly |
|---|---|---|
| Very weak | 2 | 5-10 observations |
| Weak | 10 | 20-50 observations |
| Moderate | 50 | 100-200 observations |
| Strong | 200 | 500+ observations |
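The effect of prior strength is easy to see with the same evidence applied to two priors of equal mean. The counts below are illustrative:

```python
# Sketch: the same evidence moves a weak prior far more than a strong one.
# Both priors have mean 50%; only their strengths (alpha_0 + beta_0) differ.
def posterior_mean(alpha_0, beta_0, successes, failures):
    return (alpha_0 + successes) / (alpha_0 + beta_0 + successes + failures)

evidence = (18, 2)  # 18 successes, 2 failures (90% observed rate)

weak = posterior_mean(1, 1, *evidence)        # Beta(1, 1): strength 2
strong = posterior_mean(100, 100, *evidence)  # Beta(100, 100): strength 200

print(round(weak, 3), round(strong, 3))  # 0.864 0.536
```

Twenty observations pull the weak prior most of the way to the observed 90% rate, while the strong prior barely moves from 50%.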
## Updating Examples

### Example 1: New LLM-Based Component
```
// Prior: General LLM, expect ~90% reliability
prior = beta(9, 1)

// After 50 successes, 3 failures
posterior = beta(9 + 50, 1 + 3)
// = beta(59, 4)
// Mean: 93.7%
// 90% CI: [86%, 98%]
```

Interpretation: the prior mean was 90%, the observed success rate was 94% (50/53), and the posterior mean is 93.7% with tightened uncertainty.
### Example 2: Critical Security Component

```
// Prior: Skeptical for security-critical (trust must be earned)
prior = beta(5, 5)
// Mean: 50%, wide uncertainty

// After 100 successes, 0 failures
posterior = beta(5 + 100, 5 + 0)
// = beta(105, 5)
// Mean: 95.5%
// 90% CI: [90%, 98%]

// After 100 successes, 2 failures
posterior_with_failures = beta(5 + 100, 5 + 2)
// = beta(105, 7)
// Mean: 93.8%
// 90% CI: [88%, 97%]
```
### Example 3: Established Production System

```
// Prior: Strong optimistic prior (long track record)
prior = beta(900, 100)
// Mean: 90%, narrow uncertainty

// After 990 successes and 10 failures in 1000 new tasks
posterior = beta(900 + 990, 100 + 10)
// = beta(1890, 110)
// Mean: 94.5%

// The strong prior resists the data: with 1000 pseudo-observations of prior
// weight against 1000 real observations, the posterior mean lands midway
// between the prior mean (90%) and the observed rate (99%)
```
## Converting Track Record to Risk Modifier

The framework uses a track record modifier in risk calculations:
```
// Base risk from component type
baseRisk = probabilityPrior * damageEstimate

// Track record modifier (0 to 2, where 1 = no adjustment)
trackRecordModifier(posterior, prior) = {
  posteriorMean = posterior.alpha / (posterior.alpha + posterior.beta)
  priorMean = prior.alpha / (prior.alpha + prior.beta)

  // Ratio adjustment
  return priorMean / posteriorMean
  // > 1 if posterior is worse than prior (bad track record)
  // < 1 if posterior is better than prior (good track record)
}

// Adjusted risk
adjustedRisk = baseRisk * trackRecordModifier(posterior, prior)
```
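The modifier can be sketched in a few lines. Note that a raw mean ratio is unbounded, so this sketch clamps it to the stated [0, 2] range; the clamp and the function names are assumptions for illustration:

```python
# Sketch of the track-record modifier (illustrative names). The clamp to
# [0, 2] enforces the stated range; the raw ratio itself is unbounded.
def track_record_modifier(post_a, post_b, prior_a, prior_b):
    posterior_mean = post_a / (post_a + post_b)
    prior_mean = prior_a / (prior_a + prior_b)
    return min(2.0, max(0.0, prior_mean / posterior_mean))

# Good track record: 90% prior, 98% observed -> ~0.92 (risk reduced)
print(round(track_record_modifier(98, 2, 9, 1), 2))  # 0.92
# Bad track record: 90% prior, 70% observed -> ~1.29 (risk increased)
print(round(track_record_modifier(70, 30, 9, 1), 2))  # 1.29
```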
### Example Track Record Modifiers

| Scenario | Prior | Observed | Modifier | Effect |
|---|---|---|---|---|
| New, untested | 90% | - | 1.0 | Baseline |
| Good track record | 90% | 98% | 0.92 | 8% reduction |
| Excellent record | 90% | 99.5% | 0.90 | 10% reduction |
| Mediocre record | 90% | 85% | 1.06 | 6% increase |
| Bad track record | 90% | 70% | 1.29 | 29% increase |
## Time-Weighted Observations

Recent observations should count more than old ones:
### Exponential Decay

```
// Weight observation by recency
weight(age_days, halfLife_days) = 0.5^(age_days / halfLife_days)

// Example: 30-day half-life
// Today's observation: weight = 1.0
// 30 days ago: weight = 0.5
// 60 days ago: weight = 0.25

// Effective observation count
effectiveSuccesses = sum(weight(age) for success in successes)
effectiveFailures = sum(weight(age) for failure in failures)

posterior = beta(
  alpha_0 + effectiveSuccesses,
  beta_0 + effectiveFailures
)
```
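A decay-weighted update can be sketched as below. The observation log, helper names, and ages are illustrative assumptions:

```python
# Sketch of a decay-weighted posterior update (illustrative names and data).
def recency_weight(age_days, half_life_days=30):
    """Exponential decay: an observation loses half its weight per half-life."""
    return 0.5 ** (age_days / half_life_days)

# Hypothetical observation log: (age_days, succeeded)
observations = [(0, True), (10, True), (35, False), (70, True), (100, False)]

eff_succ = sum(recency_weight(a) for a, ok in observations if ok)
eff_fail = sum(recency_weight(a) for a, ok in observations if not ok)

alpha_0, beta_0 = 9, 1  # the optimistic general-LLM prior from earlier
post_mean = (alpha_0 + eff_succ) / (alpha_0 + beta_0 + eff_succ + eff_fail)
print(round(post_mean, 3))  # 0.877
```

Old failures decay toward zero weight, so the posterior is dominated by the prior plus recent behavior.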
### Sliding Window

```
// Only consider last N observations
windowSize = 100

recentSuccesses = count(successes in last windowSize)
recentFailures = count(failures in last windowSize)

posterior = beta(
  alpha_0 + recentSuccesses,
  beta_0 + recentFailures
)
```
## Observation Quality Adjustments

Not all observations are equally informative:
### Task Complexity Weighting

```
// Complex tasks are more informative
complexityWeight(complexity) = {
  if complexity <= 3: 0.5   // Easy tasks tell us less
  if complexity <= 6: 1.0   // Standard weight
  if complexity <= 9: 1.5   // Hard tasks more informative
  else: 2.0                 // Very hard tasks most informative
}

weightedSuccesses = sum(complexityWeight(task.complexity) for task in successes)
```
### Adversarial vs. Random Errors

```
// Adversarial failures are more concerning
// Use different update for adversarial vs. benign failures

// Standard failure
standardFailure_weight = 1.0

// Adversarial failure (e.g., red team found exploit)
adversarialFailure_weight = 3.0  // Counts as 3 failures

// Random/benign error
benignFailure_weight = 0.7  // Less concerning
```
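The two weighting schemes combine naturally: weight each observation by task complexity, and scale failures further by their kind. The log format and weight tables below are illustrative assumptions:

```python
# Sketch of a quality-weighted count: complexity weight times failure-kind
# weight. The record schema and example log are illustrative.
COMPLEXITY_WEIGHTS = [(3, 0.5), (6, 1.0), (9, 1.5)]  # (threshold, weight)
FAILURE_WEIGHTS = {"standard": 1.0, "adversarial": 3.0, "benign": 0.7}

def complexity_weight(complexity):
    for threshold, w in COMPLEXITY_WEIGHTS:
        if complexity <= threshold:
            return w
    return 2.0  # very hard tasks are most informative

# (complexity, outcome) where outcome is "success" or a failure kind
log = [(2, "success"), (7, "success"), (8, "adversarial"), (5, "benign")]

weighted_succ = sum(complexity_weight(c) for c, o in log if o == "success")
weighted_fail = sum(complexity_weight(c) * FAILURE_WEIGHTS[o]
                    for c, o in log if o != "success")
print(weighted_succ, weighted_fail)  # 2.0 5.2
```

The single adversarial failure on a hard task contributes 4.5 effective failures, dominating the update, which is the intended behavior for security-relevant evidence.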
## Multi-Component Trust

When multiple components form a pipeline:
### Independent Components

```
// Pipeline reliability = product of component reliabilities
pipelineReliability = product(componentReliabilities)

// Update each component independently based on its observations
```
### Correlated Components

```
// Components may share failure modes
// Use hierarchical model

// Global factor affecting all components
globalReliability = beta(alpha_global, beta_global)

// Component-specific deviation
componentReliability[i] = globalReliability * componentFactor[i]

// Observations from any component inform global estimate
```
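The correlated case is awkward analytically but easy to simulate: sample the shared global factor, multiply in per-component factors, and average. The Beta parameters and component factors below are illustrative assumptions:

```python
import random

# Monte Carlo sketch of a two-component pipeline with a shared reliability
# factor. Parameters and factors are illustrative, not calibrated values.
random.seed(0)

def simulate_pipeline_reliability(n_samples=50_000):
    total = 0.0
    for _ in range(n_samples):
        # Shared global factor: a bad draw hurts every component at once
        global_rel = random.betavariate(90, 10)  # mean ~90%
        comp_factors = [0.99, 0.97]              # per-component multipliers
        pipeline_rel = 1.0
        for f in comp_factors:
            pipeline_rel *= min(1.0, global_rel * f)
        total += pipeline_rel
    return total / n_samples

print(round(simulate_pipeline_reliability(), 3))
```

Because both components depend on the same draw of `global_rel`, pipeline reliability is lower than the naive product of independent means would suggest in bad states, and correlated failures show up naturally in the samples.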
## Calibration Checks

### Prediction Intervals

After updating, check if predictions match reality:
```
// The posterior predicts X failures in the next 100 tasks, where
// X ~ BetaBinomial(100, posterior.beta, posterior.alpha)
// (failure pseudo-counts come first when modeling failures)

// If actual failures fall outside the 90% prediction interval,
// the model may be miscalibrated
```
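A Beta-Binomial prediction interval can be estimated by plain Monte Carlo: draw a reliability from the posterior, then simulate the next batch of tasks. The posterior here is the Example 1 value; the function name and sample sizes are illustrative:

```python
import random

# Monte Carlo sketch of a 90% prediction interval for the number of
# failures in the next 100 tasks, under a Beta posterior over reliability.
random.seed(1)

def predicted_failures(alpha, beta, n_tasks=100, n_samples=10_000):
    """Sample failure counts: draw theta from the posterior, then simulate."""
    counts = []
    for _ in range(n_samples):
        theta = random.betavariate(alpha, beta)  # sampled reliability
        failures = sum(random.random() > theta for _ in range(n_tasks))
        counts.append(failures)
    return sorted(counts)

samples = predicted_failures(59, 4)  # posterior from Example 1
lo = samples[int(0.05 * len(samples))]
hi = samples[int(0.95 * len(samples))]
print(lo, hi)  # 90% prediction interval endpoints
```

If the observed failure count in the next batch lands outside `[lo, hi]` repeatedly, the posterior is a poor predictive model and the priors or weighting scheme deserve review.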
### Brier Score

```
// Track prediction accuracy over time
brierScore = mean((predicted_prob - actual_outcome)^2)

// For a well-calibrated constant prediction p, the expected Brier score
// is p * (1 - p)
```
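The Brier check is a one-liner; the predictions and outcomes below are illustrative:

```python
# Sketch of a Brier-score check (illustrative predictions and outcomes).
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# A component predicted to succeed with 90% probability on each task
preds = [0.9] * 10
actual = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # 9 successes, 1 failure

score = brier_score(preds, actual)
print(round(score, 3))  # 0.09, matching p * (1 - p) = 0.9 * 0.1
```

A score persistently above `p * (1 - p)` suggests overconfidence; persistently below suggests the predictions could be sharpened.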
## Practical Recommendations

### For New Components

- Start with an informative prior based on component type
- Use weak prior strength (α₀ + β₀ ≈ 10) to allow quick updates
- Monitor first 50-100 tasks carefully
- Tighten prior as track record accumulates
### For Established Components

- Use moderate prior strength based on historical volume
- Apply time-weighting (30-90 day half-life)
- Weight complex tasks higher
- Review calibration quarterly
### For Critical Components

- Use a skeptical prior (trust must be earned)
- Weight adversarial failures heavily
- Consider worst-case bounds, not just mean
- Require statistical significance before trust increases
## Reference Tables

### Quick Prior Lookup

| Component Type | Prior | Mean | 90% CI |
|---|---|---|---|
| Deterministic code | beta(99, 1) | 99% | [95%, 100%] |
| Narrow ML | beta(19, 1) | 95% | [82%, 99%] |
| General LLM | beta(9, 1) | 90% | [67%, 99%] |
| RL/Agentic | beta(4, 1) | 80% | [45%, 98%] |
| New/Unknown | beta(1, 1) | 50% | [5%, 95%] |
| Pessimistic | beta(1, 9) | 10% | [1%, 33%] |
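The means in the table follow directly from the Beta parameters, and the intervals can be approximated by sampling. The sketch below uses Monte Carlo percentiles, which may differ slightly from the table's rounded values:

```python
import random

# Sketch reproducing the lookup table: exact means plus Monte Carlo 90%
# intervals (illustrative; sampled intervals vary slightly from the table).
random.seed(2)

PRIORS = {"Deterministic code": (99, 1), "Narrow ML": (19, 1),
          "General LLM": (9, 1), "RL/Agentic": (4, 1), "New/Unknown": (1, 1)}

for name, (a, b) in PRIORS.items():
    mean = a / (a + b)
    draws = sorted(random.betavariate(a, b) for _ in range(20_000))
    lo, hi = draws[1000], draws[19000]  # 5th and 95th percentile draws
    print(f"{name}: mean={mean:.0%}, 90% interval=[{lo:.0%}, {hi:.0%}]")
```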
### Observations Needed for Confidence

| Starting Prior | Target Confidence | Successes Needed (0 failures) |
|---|---|---|
| beta(1, 1) | 90% ± 5% | ~35 |
| beta(1, 1) | 95% ± 3% | ~60 |
| beta(1, 1) | 99% ± 1% | ~200 |
| beta(9, 1) | 95% ± 3% | ~30 |
| beta(9, 1) | 99% ± 1% | ~100 |