Calibration Explained: Why 70 Percent Should Mean 70 Percent
What calibration means
Calibration is the match between the probabilities you state and the frequencies you actually observe. If you repeatedly assign 70% to similar events, then roughly 70% of those events should resolve YES.
Calibration is not about being right on one question. It is about whether your probabilities are honest and statistically consistent over many questions.
The simple test
Imagine you make 100 forecasts at:
• p = 0.70
If you are calibrated, you should see about:
• 70 YES outcomes
• 30 NO outcomes
If you see 55 YES outcomes, you are overconfident at that level. If you see 85 YES outcomes, you are underconfident at that level.
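To make the test concrete, here is a small Python sketch (illustrative only, with a fixed random seed) that simulates 100 events where 70% really is the right probability and counts how many resolve YES.

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

P = 0.70           # the probability you keep assigning
N_FORECASTS = 100  # number of forecasts made at that level

# Simulate a world in which your 70% forecasts are exactly right:
# each event independently resolves YES with probability 0.70.
outcomes = [random.random() < P for _ in range(N_FORECASTS)]

yes_count = sum(outcomes)
print(f"YES outcomes: {yes_count} / {N_FORECASTS}")
print(f"Realized frequency: {yes_count / N_FORECASTS:.2f} (predicted {P:.2f})")
```

Even a perfectly calibrated forecaster will rarely land on exactly 70 out of 100; counts anywhere from the mid 60s to the mid 70s are routine. That natural variation is why sample size matters when you read calibration tables later in this article.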
Why calibration matters
• It is a core property of good probabilistic forecasting, not a bonus.
• It is directly connected to the Brier score and its decomposition: the reliability term of that decomposition measures miscalibration, so poor calibration usually shows up as a worse score.
• In decision making, miscalibration causes systematic mistakes: you bet too big when you are overconfident and you miss value when you are underconfident.
How to measure calibration
The common approach is to group forecasts into probability ranges (often called buckets) and compute:
• average predicted probability in the bucket
• realized frequency (share of YES outcomes) in the bucket
This is typically presented as a calibration table.
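As a sketch of that bucketing step, the Python snippet below (the forecast history is made up) groups forecasts into 0.1-wide buckets and prints the count, average predicted probability, and realized YES frequency for each bucket.

```python
from collections import defaultdict

def bucket_label(prob):
    """Return a label like '0.7-0.8' for the 0.1-wide bucket containing prob."""
    edges = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    for lo, hi in zip(edges, edges[1:]):
        if lo <= prob < hi or (hi == 1.0 and prob == 1.0):
            return f"{lo:.1f}-{hi:.1f}"
    raise ValueError(f"probability out of range: {prob}")

def calibration_table(history):
    """history: iterable of (predicted probability, outcome), outcome 1=YES, 0=NO."""
    buckets = defaultdict(list)
    for prob, outcome in history:
        buckets[bucket_label(prob)].append((prob, outcome))

    rows = []
    for label in sorted(buckets):
        pairs = buckets[label]
        n = len(pairs)
        avg_pred = sum(p for p, _ in pairs) / n
        realized = sum(o for _, o in pairs) / n
        rows.append((label, n, avg_pred, realized))
    return rows

# Hypothetical forecast history.
history = [(0.72, 1), (0.68, 1), (0.71, 0), (0.55, 1), (0.30, 0),
           (0.81, 1), (0.79, 0), (0.62, 1), (0.35, 1), (0.90, 1)]

print(f"{'bucket':>8} {'N':>3} {'avg pred':>9} {'realized':>9}")
for label, n, avg_pred, realized in calibration_table(history):
    print(f"{label:>8} {n:>3} {avg_pred:>9.2f} {realized:>9.2f}")
```

With a real history of a few hundred forecasts, each row of this table is one point on the comparison between what you predicted and what happened.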
How to read a calibration table
Look for these three things:
1) Counts per bucket
Calibration is noisy with small samples. A bucket with N = 6 can look awful purely due to variance. Always check the sample size and, when available, a confidence interval; a sketch of this check follows the list.
2) Overconfidence patterns
Overconfidence means your realized frequency falls below your predicted probability in the higher buckets (and above it in the lower buckets): outcomes are less extreme than your forecasts.
Example: your 0.80 bucket resolves YES only 0.62 of the time.
See: Overconfidence.
3) Underconfidence patterns
Underconfidence means your realized frequency rises above your predicted probability in the higher buckets (and falls below it in the lower buckets): outcomes are more extreme than your forecasts.
Example: your 0.60 bucket resolves YES 0.75 of the time.
See: Underconfidence.
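On the sample-size point from item 1: before reading anything into a bucket, attach an interval to its realized frequency. The Python sketch below uses the standard Wilson score interval on two hypothetical 0.80-level buckets, one with N = 6 and one with N = 120.

```python
import math

def wilson_interval(yes, n, z=1.96):
    """95% Wilson score interval for a realized YES frequency of yes out of n."""
    if n == 0:
        raise ValueError("empty bucket")
    p_hat = yes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

# Hypothetical buckets: (predicted level, YES count, total count).
buckets = [(0.80, 3, 6),     # tiny bucket: realized 0.50, but huge uncertainty
           (0.80, 75, 120)]  # larger bucket: realized 0.62, much tighter

for pred, yes, n in buckets:
    lo, hi = wilson_interval(yes, n)
    print(f"pred={pred:.2f}  N={n:3d}  realized={yes / n:.2f}  "
          f"95% interval=({lo:.2f}, {hi:.2f})")
```

In this made-up example the N = 6 bucket's interval is far too wide to support any conclusion, while the N = 120 bucket's interval sits clearly below 0.80, which does look like genuine overconfidence.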
Calibration and sharpness are different
You can be calibrated but still not very useful if you always predict something close to the base rate (for a 50/50 question mix, around 0.50). That is low sharpness.
The goal is not to be extreme. The goal is to be calibrated at whatever confidence you choose to express.
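A standard way to separate the two ideas is the Murphy decomposition of the Brier score, Brier = reliability - resolution + uncertainty: reliability measures miscalibration (lower is better), and resolution measures how well your forecasts separate YES-prone from NO-prone events, which is sharpness that actually pays off. The Python sketch below (hypothetical outcomes, forecasts grouped by their exact value) compares a forecaster who always states the base rate with one who takes informative stands.

```python
from collections import defaultdict

def murphy_decomposition(history):
    """Murphy decomposition of the Brier score.

    history: list of (forecast, outcome) pairs, outcome 1 = YES, 0 = NO.
    Forecasts are grouped by exact value, so the identity
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n_total = len(history)
    base_rate = sum(o for _, o in history) / n_total

    groups = defaultdict(list)
    for forecast, outcome in history:
        groups[forecast].append(outcome)

    reliability = resolution = 0.0
    for forecast, outcomes in groups.items():
        n_k = len(outcomes)
        observed = sum(outcomes) / n_k
        reliability += n_k * (forecast - observed) ** 2
        resolution += n_k * (observed - base_rate) ** 2
    reliability /= n_total
    resolution /= n_total

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in history) / n_total
    return brier, reliability, resolution, uncertainty

# Hypothetical question set with a 60% base rate (6 YES out of 10).
outcomes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Forecaster A always states the base rate: calibrated but not sharp.
flat = [(0.6, o) for o in outcomes]

# Forecaster B takes stands and is right proportionally often at each
# level: calibrated and sharp.
sharp = [(0.8, 1), (0.8, 1), (0.8, 1), (0.8, 1), (0.8, 0),
         (0.4, 1), (0.4, 1), (0.4, 0), (0.4, 0), (0.4, 0)]

for name, history in [("always base rate", flat), ("sharper forecasts", sharp)]:
    brier, rel, res, unc = murphy_decomposition(history)
    print(f"{name:>17}: Brier={brier:.3f}  reliability={rel:.3f}  "
          f"resolution={res:.3f}  uncertainty={unc:.3f}")
```

In this toy example both forecasters come out perfectly calibrated (reliability = 0), but only the second has any resolution, and its Brier score is lower (0.20 versus 0.24). Calibration is necessary; sharpness is what makes the forecasts worth paying attention to.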
Common causes of miscalibration
• Base rate neglect: ignoring the base rate leads to overly extreme forecasts.
• Mixing different event types: pooling very different categories can make a calibration table misleading, hiding real problems in one category or creating apparent ones; a sketch follows this list.
• Small N: apparent problems that disappear with more data.
• Drift: changes over time can break a previously calibrated process. See Calibration Drift.
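As a sketch of the mixing problem, the Python snippet below (hypothetical categories and counts) looks at 0.7-level forecasts pooled across two question types and then per category: the pooled numbers look perfectly calibrated while each category is off in a different direction.

```python
from collections import defaultdict

# Hypothetical 0.7-level forecasts tagged by question category:
# (category, outcome) pairs with outcome 1 = YES, 0 = NO.
forecasts = ([("sports", 1)] * 9 + [("sports", 0)] * 1
             + [("politics", 1)] * 5 + [("politics", 0)] * 5)

by_category = defaultdict(list)
for category, outcome in forecasts:
    by_category[category].append(outcome)

pooled = [outcome for _, outcome in forecasts]
print(f"pooled   : N={len(pooled):2d}  realized={sum(pooled) / len(pooled):.2f}"
      "  (predicted 0.70)")
for category, outcomes in sorted(by_category.items()):
    realized = sum(outcomes) / len(outcomes)
    print(f"{category:<9}: N={len(outcomes):2d}  realized={realized:.2f}"
          "  (predicted 0.70)")
```

All numbers here are invented, but the pattern is the one to watch for: the pooled 0.7 bucket resolves YES 70% of the time, yet the sports questions resolve YES far more often than predicted and the politics questions far less. Splitting by category is what reveals it.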
Practical ways to improve calibration
1) Start from base rates
Use the base rate as a prior, then move away from it only when you have real evidence.
2) Reduce extreme probabilities
If your misses are mostly from very confident wrong calls, you may need to compress probabilities toward the middle. This is often the fastest win for reducing squared error; a sketch follows this list.
3) Review your buckets monthly
Track calibration by time window. If the pattern changes, you may have drift. See Rolling Window and Forecast Drift.
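Here is a minimal Python sketch of item 2, with the review habit from item 3 in mind: it applies a shrinkage factor lam to past forecasts, p -> 0.5 + lam * (p - 0.5), and compares Brier scores across a few values of lam on a made-up history.

```python
# Hypothetical past forecasts and outcomes (outcome 1 = YES, 0 = NO).
history = [(0.95, 0), (0.90, 1), (0.85, 1), (0.80, 0), (0.75, 1),
           (0.70, 1), (0.65, 0), (0.60, 1), (0.40, 0), (0.20, 0)]

def brier(history):
    """Mean squared error between forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in history) / len(history)

def shrink(history, lam):
    """Compress each forecast toward 0.5: p -> 0.5 + lam * (p - 0.5)."""
    return [(0.5 + lam * (p - 0.5), o) for p, o in history]

print(f"original Brier: {brier(history):.3f}")

# Sweep shrinkage factors; lam = 1.0 leaves forecasts unchanged,
# smaller lam pulls them toward 0.5.
for lam in (1.0, 0.9, 0.8, 0.7, 0.6):
    print(f"lam={lam:.1f}: Brier={brier(shrink(history, lam)):.3f}")
```

On this invented history the shrunken forecasts score better than the originals, which is the signature of overconfident inputs. In practice, fit lam on one review period and check it on the next rather than trusting a single backtest, and rerun the comparison in each rolling window so you notice when the pattern drifts.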
Takeaway
Calibration means your probabilities are honest. If you say 70% repeatedly, those events should happen about 70% of the time. Use calibration tables with sample size context, avoid confusing sharpness with calibration, and apply simple corrections when patterns repeat.