Trust Calibration

This page explains how to convert observed track records into calibrated trust estimates using Bayesian updating.

You have:

  • A prior belief about component reliability
  • Observations of successes and failures
  • A need for a posterior belief to feed into risk calculations

Let θ = true reliability (probability of success per task).

// Prior: What we believe before observations
// Beta distribution is conjugate prior for binomial likelihood
prior = beta(alpha_0, beta_0)
// Observation model
// P(success | θ) = θ
// P(failure | θ) = 1 - θ
// Posterior after n successes, m failures
posterior = beta(alpha_0 + n, beta_0 + m)
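As a minimal runnable sketch of this update (the `Beta` class and `update` helper here are illustrative, not part of any framework API), the conjugate posterior is just the prior with the observed counts added:

from dataclasses import dataclass

@dataclass
class Beta:
    alpha: float
    beta: float

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

def update(prior: Beta, successes: int, failures: int) -> Beta:
    # Beta-Binomial conjugacy: successes add to alpha, failures to beta
    return Beta(prior.alpha + successes, prior.beta + failures)

posterior = update(Beta(9, 1), successes=50, failures=3)
print(posterior, posterior.mean())   # Beta(alpha=59, beta=4), ~0.937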

The prior encodes what we believe before seeing any data:

// Uninformative prior (know nothing)
uninformative = beta(1, 1)
// Uniform over [0, 1]
// Skeptical prior (expect ~50% reliability)
skeptical = beta(5, 5)
// Mean: 50%, concentrated around middle
// Optimistic prior (expect ~90% reliability)
optimistic = beta(9, 1)
// Mean: 90%, assumes competence
// Pessimistic prior (expect ~10% reliability)
pessimistic = beta(1, 9)
// Mean: 10%, assumes unreliability
// Informative prior based on component type
prior_deterministicCode = beta(99, 1) // ~99% reliable
prior_narrowML = beta(19, 1) // ~95% reliable
prior_generalLLM = beta(9, 1) // ~90% reliable
prior_RLAgent = beta(4, 1) // ~80% reliable
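To see what any of these priors implies, one can compute its mean and an equal-tailed 90% credible interval from the Beta quantile function; a sketch using scipy (the `summarize` helper is hypothetical):

from scipy.stats import beta

def summarize(a: float, b: float) -> str:
    # Equal-tailed 90% credible interval from the Beta quantiles
    lo, hi = beta.ppf([0.05, 0.95], a, b)
    return f"beta({a}, {b}): mean {a / (a + b):.1%}, 90% CI [{lo:.1%}, {hi:.1%}]"

print(summarize(9, 1))   # general-LLM prior: mean 90.0%
print(summarize(1, 1))   # uninformative prior: mean 50.0%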

The sum α₀ + β₀ determines how much data is needed to move the posterior:

| Prior Strength | α₀ + β₀ | Data to Move Significantly |
|---|---|---|
| Very weak | 2 | 5-10 observations |
| Weak | 10 | 20-50 observations |
| Moderate | 50 | 100-200 observations |
| Strong | 200 | 500+ observations |
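A quick illustration of the table: feed the same failure-free record of 20 tasks to a very weak prior and a strong prior, both centered at 50%, and compare how far each posterior mean moves.

# Same evidence, different prior strengths (both priors centered at 50%)
for a0, b0, label in [(1, 1, "very weak (strength 2)"), (100, 100, "strong (strength 200)")]:
    a, b = a0 + 20, b0 + 0   # 20 successes, 0 failures
    print(f"{label}: prior mean {a0 / (a0 + b0):.0%} -> posterior mean {a / (a + b):.1%}")
# very weak: 50% -> 95.5%; strong: 50% -> 54.5%

The worked examples below apply the same mechanics to realistic priors.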

// Prior: General LLM, expect ~90% reliability
prior = beta(9, 1)
// After 50 successes, 3 failures
posterior = beta(9 + 50, 1 + 3)
// = beta(59, 4)
// Mean: 93.7%
// 90% CI: [86%, 98%]

Interpretation: we started with a 90% prior mean, observed a ~94% success rate (50/53), and the posterior mean lands at 93.7% with tightened uncertainty.

// Prior: Skeptical for security-critical (trust must be earned)
prior = beta(5, 5)
// Mean: 50%, wide uncertainty
// After 100 successes, 0 failures
posterior = beta(5 + 100, 5 + 0)
// = beta(105, 5)
// Mean: 95.5%
// 90% CI: [90%, 98%]
// After 100 successes, 2 failures
posterior_with_failures = beta(5 + 100, 5 + 2)
// = beta(105, 7)
// Mean: 93.8%
// 90% CI: [88%, 97%]

// Prior: Strong optimistic prior (long track record)
prior = beta(900, 100)
// Mean: 90%, narrow uncertainty
// After 1000 new tasks: 990 successes, 10 failures
posterior = beta(900 + 990, 100 + 10)
// = beta(1890, 110)
// Mean: 94.5%
// The observed rate was 99%, but the strong prior pulls the posterior up only to 94.5%

The framework uses a track record modifier in risk calculations:

// Base risk from component type
baseRisk = probabilityPrior * damageEstimate
// Track record modifier (0 to 2, where 1 = no adjustment)
trackRecordModifier(posterior, prior) = {
  posteriorMean = posterior.alpha / (posterior.alpha + posterior.beta)
  priorMean = prior.alpha / (prior.alpha + prior.beta)
  // Ratio adjustment:
  //   > 1 if posterior is worse than prior (bad track record)
  //   < 1 if posterior is better than prior (good track record)
  return priorMean / posteriorMean
}
// Adjusted risk
adjustedRisk = baseRisk * trackRecordModifier(posterior, prior)

| Scenario | Prior | Observed | Modifier | Effect |
|---|---|---|---|---|
| New, untested | 90% | - | 1.0 | Baseline |
| Good track record | 90% | 98% | 0.92 | 8% reduction |
| Excellent record | 90% | 99.5% | 0.90 | 10% reduction |
| Mediocre record | 90% | 85% | 1.06 | 6% increase |
| Bad track record | 90% | 70% | 1.29 | 29% increase |
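A sketch of the modifier in Python (the function name is illustrative), reproducing two rows of the table:

def track_record_modifier(prior_mean: float, posterior_mean: float) -> float:
    # > 1 penalizes a record worse than the prior; < 1 rewards a better one
    return prior_mean / posterior_mean

print(round(track_record_modifier(0.90, 0.98), 2))   # 0.92 -> 8% risk reduction
print(round(track_record_modifier(0.90, 0.70), 2))   # 1.29 -> 29% risk increase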

Recent observations should count more than old ones:

// Approach 1: exponential decay, weighting each observation by its recency
weight(age_days, halfLife_days) = 0.5^(age_days / halfLife_days)
// Example with a 30-day half-life:
//   today's observation: weight = 1.0
//   30 days ago: weight = 0.5
//   60 days ago: weight = 0.25
// Effective (fractional) observation counts
effectiveSuccesses = sum(weight(s.age_days, halfLife) for s in successes)
effectiveFailures = sum(weight(f.age_days, halfLife) for f in failures)
posterior = beta(
  alpha_0 + effectiveSuccesses,
  beta_0 + effectiveFailures
)

// Approach 2: sliding window, considering only the last N observations
windowSize = 100
recentSuccesses = count(successes in last windowSize)
recentFailures = count(failures in last windowSize)
posterior = beta(
  alpha_0 + recentSuccesses,
  beta_0 + recentFailures
)

Not all observations are equally informative:

// Complex tasks are more informative
complexityWeight(complexity) = {
  if complexity <= 3: 0.5   // Easy tasks tell us less
  if complexity <= 6: 1.0   // Standard weight
  if complexity <= 9: 1.5   // Hard tasks more informative
  else: 2.0                 // Very hard tasks most informative
}
weightedSuccesses = sum(complexityWeight(task.complexity) for task in successes)

// Adversarial failures are more concerning, so use a different
// update weight for each failure type
standardFailure_weight = 1.0      // Standard failure
adversarialFailure_weight = 3.0   // Adversarial failure (e.g., red team found exploit): counts as 3 failures
benignFailure_weight = 0.7        // Random/benign error: less concerning
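Putting both weightings together, a sketch (the thresholds and weights mirror the pseudocode above; the task-record tuple layout is an assumption):

FAILURE_WEIGHTS = {"standard": 1.0, "adversarial": 3.0, "benign": 0.7}

def complexity_weight(complexity: int) -> float:
    if complexity <= 3:
        return 0.5    # easy tasks tell us less
    if complexity <= 6:
        return 1.0    # standard weight
    if complexity <= 9:
        return 1.5    # hard tasks more informative
    return 2.0        # very hard tasks most informative

def weighted_counts(tasks):
    # tasks: (complexity, succeeded, failure_kind or None) triples
    s = sum(complexity_weight(c) for c, ok, _ in tasks if ok)
    f = sum(complexity_weight(c) * FAILURE_WEIGHTS[kind]
            for c, ok, kind in tasks if not ok)
    return s, f   # add to alpha_0 and beta_0 as before

print(weighted_counts([(2, True, None), (8, True, None), (7, False, "adversarial")]))
# (0.5 + 1.5 successes, 1.5 * 3.0 = 4.5 failures)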

When multiple components form a pipeline:

// Case 1: independent components
// Pipeline reliability = product of component reliabilities
pipelineReliability = product(componentReliabilities)
// Update each component independently based on its observations

// Case 2: components may share failure modes; use a hierarchical model
// Global factor affecting all components
globalReliability = beta(alpha_global, beta_global)
// Component-specific deviation
componentReliability[i] = globalReliability * componentFactor[i]
// Observations from any component inform the global estimate
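For the independent case, posterior uncertainty can be pushed through the product by Monte Carlo sampling; a sketch with illustrative component posteriors:

import numpy as np

rng = np.random.default_rng(0)
component_posteriors = [(59, 4), (105, 7), (19, 1)]   # (alpha, beta) per stage
# Draw reliability samples per component and multiply across the pipeline
samples = np.prod([rng.beta(a, b, size=100_000) for a, b in component_posteriors], axis=0)
print(f"pipeline mean {samples.mean():.1%}, 5th percentile {np.percentile(samples, 5):.1%}")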

After updating, check if predictions match reality:

// The posterior predicts X failures in the next 100 tasks, where
// X ~ BetaBinomial(100, posterior.alpha, posterior.beta)
// If actual failures fall outside the 90% prediction interval,
// the model may be miscalibrated

// Track prediction accuracy over time
brierScore = mean((predicted_prob - actual_outcome)^2)
// For a well-calibrated forecaster predicting probability p,
// the expected Brier score is p * (1 - p)
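A sketch of the predictive check by simulation (the posterior parameters and the observed failure count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta_ = 59, 4                          # current posterior
theta = rng.beta(alpha, beta_, size=100_000)  # reliability draws
failures = rng.binomial(100, 1 - theta)       # Beta-Binomial: failures in next 100 tasks
lo, hi = np.percentile(failures, [5, 95])
observed = 12
verdict = "consistent" if lo <= observed <= hi else "possible miscalibration"
print(f"90% prediction interval [{lo:.0f}, {hi:.0f}], observed {observed}: {verdict}")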

For a new component with no track record:

  1. Start with an informative prior based on component type
  2. Use weak prior strength (α₀ + β₀ ≈ 10) to allow quick updates
  3. Monitor the first 50-100 tasks carefully
  4. Tighten the prior as the track record accumulates

For an established component:

  1. Use moderate prior strength based on historical volume
  2. Apply time-weighting (30-90 day half-life)
  3. Weight complex tasks higher
  4. Review calibration quarterly

For a security-critical component:

  1. Use a skeptical prior (trust must be earned)
  2. Weight adversarial failures heavily
  3. Consider worst-case bounds, not just the mean (see the sketch after this list)
  4. Require statistical significance before trust increases
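For the worst-case bound in the security-critical checklist, one option is the lower tail of the posterior rather than its mean; a sketch (the numbers reuse the skeptical-prior example above):

from scipy.stats import beta

a, b = 105, 7   # posterior after 100 successes, 2 failures on a beta(5, 5) prior
print(f"posterior mean: {a / (a + b):.1%}")                       # ~93.8%
print(f"5th-percentile reliability: {beta.ppf(0.05, a, b):.1%}")  # worst-case bound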

Reference priors by component type:

| Component Type | Prior | Mean | 90% CI |
|---|---|---|---|
| Deterministic code | beta(99, 1) | 99% | [95%, 100%] |
| Narrow ML | beta(19, 1) | 95% | [82%, 99%] |
| General LLM | beta(9, 1) | 90% | [67%, 99%] |
| RL/Agentic | beta(4, 1) | 80% | [45%, 98%] |
| New/Unknown | beta(1, 1) | 50% | [5%, 95%] |
| Skeptical | beta(1, 9) | 10% | [1%, 33%] |
Approximate successes needed (with zero failures) to reach a target confidence:

| Starting Prior | Target Confidence | Successes Needed (0 failures) |
|---|---|---|
| beta(1, 1) | 90% ± 5% | ~35 |
| beta(1, 1) | 95% ± 3% | ~60 |
| beta(1, 1) | 99% ± 1% | ~200 |
| beta(9, 1) | 95% ± 3% | ~30 |
| beta(9, 1) | 99% ± 1% | ~100 |
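As a closing sketch, counts like these can be found by searching for the smallest failure-free run that satisfies a target. The stopping criterion below (posterior mean at or above the target, 90% interval half-width within tolerance) is one possible operationalization; the table's own criterion is unspecified, so exact counts may differ.

from scipy.stats import beta

def successes_needed(a0, b0, target_mean, half_width, max_n=10_000):
    for n in range(max_n):
        a, b = a0 + n, b0   # n successes, 0 failures
        lo, hi = beta.ppf([0.05, 0.95], a, b)
        if a / (a + b) >= target_mean and (hi - lo) / 2 <= half_width:
            return n
    return None

print(successes_needed(1, 1, 0.90, 0.05))   # uninformative prior, target 90% ± 5%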