
Calibration Explained: Why 70 Percent Should Mean 70 Percent

January 1, 2026 · Calibration

What calibration means

Calibration is the match between the probabilities you state and how often those events actually happen. If you repeatedly assign 70% to similar events, then roughly 70% of those events should resolve YES.

Calibration is not about being right on one question. It is about whether your probabilities are honest and statistically consistent over many questions.

The simple test

Imagine you make 100 forecasts at:

• p = 0.70

If you are calibrated, you should see about:

• 70 YES outcomes

• 30 NO outcomes

If you see 55 YES outcomes, you are overconfident at that level. If you see 85 YES outcomes, you are underconfident at that level.
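Even a perfectly calibrated forecaster will not land on exactly 70 YES outcomes out of 100. A quick way to see how much deviation chance alone allows is the binomial standard deviation; the minimal Python sketch below uses the same 100 forecasts at 0.70 as the example above.

```python
import math
import random

n, p = 100, 0.70  # 100 forecasts, all made at 70%

# Standard deviation of the YES count under perfect calibration.
sd = math.sqrt(n * p * (1 - p))  # about 4.6
lo, hi = n * p - 2 * sd, n * p + 2 * sd
print(f"Roughly 95% of the time the YES count falls in [{lo:.0f}, {hi:.0f}]")
# -> roughly [61, 79]

# Simulate one calibrated forecaster to see the variation directly.
random.seed(0)
yes_count = sum(random.random() < p for _ in range(n))
print("Simulated YES count for a calibrated forecaster:", yes_count)
```

By this rough band, 55 or 85 YES outcomes sit well outside what chance would produce for a calibrated forecaster, which is why they point to over- or underconfidence rather than bad luck.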

Why calibration matters

• It is a core property of good probabilistic forecasting, not a bonus.

• It is directly connected to the Brier score and its decomposition: the decomposition's reliability term is a direct penalty for miscalibration, so poor calibration usually shows up as worse scores (see the sketch after this list).

• In decision making, miscalibration causes systematic mistakes: you bet too big when you are overconfident and you miss value when you are underconfident.
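For readers who want the Brier connection spelled out, here is a minimal Python sketch of the standard reliability/resolution/uncertainty decomposition (often attributed to Murphy). It assumes forecasts are grouped by their exact stated probability; if you group into rounded buckets instead, the decomposition is only approximate. The example data are hypothetical.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.

    forecasts: stated probabilities in [0, 1]
    outcomes:  1 for YES, 0 for NO
    Exact when identical forecasts are grouped together; approximate if
    forecasts are first rounded into buckets.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)                 # forecast value -> outcomes
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in groups.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for f, os in groups.items()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Tiny hypothetical data set for illustration.
forecasts = [0.7, 0.7, 0.7, 0.7, 0.7, 0.3, 0.3, 0.3, 0.3, 0.3]
outcomes  = [1,   1,   1,   0,   1,   0,   0,   1,   0,   0]
rel, res, unc = brier_decomposition(forecasts, outcomes)
print(f"reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
print(f"Brier score = {rel - res + unc:.3f}")
```

The reliability term is the numeric penalty for miscalibration: being calibrated, in score terms, means driving it toward zero.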

How to measure calibration

The common approach is to group forecasts into probability ranges (often called buckets) and compute:

• average predicted probability in the bucket

• realized frequency (share of YES outcomes) in the bucket

This is typically presented as a calibration table.
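Here is a minimal Python sketch of that bucketing, assuming you have a list of (predicted probability, outcome) pairs; ten equal-width buckets are a common choice, not the only one.

```python
from collections import defaultdict

def calibration_table(pairs, n_buckets=10):
    """pairs: iterable of (predicted probability, outcome) with outcome 0 or 1."""
    buckets = defaultdict(list)
    for p, outcome in pairs:
        # Bucket index 0..n_buckets-1; p = 1.0 goes in the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, outcome))

    rows = []
    for idx in sorted(buckets):
        entries = buckets[idx]
        n = len(entries)
        avg_pred = sum(p for p, _ in entries) / n
        realized = sum(o for _, o in entries) / n
        rows.append((f"{idx / n_buckets:.1f}-{(idx + 1) / n_buckets:.1f}",
                     n, round(avg_pred, 3), round(realized, 3)))
    return rows

# Example with a few hypothetical forecasts.
pairs = [(0.72, 1), (0.68, 0), (0.75, 1), (0.31, 0), (0.28, 1), (0.65, 1)]
for bucket, n, avg_pred, realized in calibration_table(pairs):
    print(f"{bucket}  N={n}  avg_pred={avg_pred}  realized={realized}")
```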

How to read a calibration table

Look for these three things:

1) Counts per bucket

Calibration is noisy with small samples. A bucket with N = 6 can look awful just due to variance. Always check sample size and, when available, a confidence interval.
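One simple way to put an interval around a bucket's realized frequency is the Wilson score interval (one reasonable choice among several). The sketch below shows how little an N = 6 bucket can tell you.

```python
import math

def wilson_interval(yes, n, z=1.96):
    """Approximate 95% Wilson score interval for the realized frequency yes/n."""
    if n == 0:
        return (0.0, 1.0)
    phat = yes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# A bucket with N = 6 and 3 YES outcomes looks like 0.50, but the interval
# is so wide that it says almost nothing about calibration at that level.
print(wilson_interval(3, 6))     # roughly (0.19, 0.81)
print(wilson_interval(300, 600)) # roughly (0.46, 0.54)
```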

2) Overconfidence patterns

Overconfidence means your forecasts are too extreme: in the higher buckets, your realized frequency falls below your predicted probability (and in the lower buckets it sits above it).

Example: your 0.80 bucket resolves YES only 0.62 of the time.

See: Overconfidence.

3) Underconfidence patterns

Underconfidence means your forecasts are too timid: in the higher buckets, your realized frequency sits above your predicted probability (and in the lower buckets it falls below it).

Example: your 0.60 bucket resolves YES 0.75 of the time.

See: Underconfidence.

Calibration and sharpness are different

You can be calibrated but still not very useful if you always predict near 0.50. That is low sharpness.

The goal is not to be extreme. The goal is to be calibrated at whatever confidence you choose to express.
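To make the distinction concrete: a forecaster who always says 0.50 on events with a 50% base rate is perfectly calibrated and completely unsharp. One simple sharpness summary (not the only one) is the average distance of forecasts from 0.50, sketched below with hypothetical forecasts.

```python
def sharpness(forecasts):
    """Average distance from 0.50; one simple (not the only) sharpness summary."""
    return sum(abs(p - 0.5) for p in forecasts) / len(forecasts)

hedger    = [0.5, 0.5, 0.5, 0.5]   # can be perfectly calibrated, says little
committed = [0.9, 0.1, 0.8, 0.2]   # useful only if it is also calibrated
print(sharpness(hedger))     # 0.0
print(sharpness(committed))  # 0.35
```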

Common causes of miscalibration

• Base rate neglect: ignoring the base rate leads to overly extreme forecasts.

• Mixing different event types: pooling very different categories into one table can create apparent calibration problems that do not exist within any single category.

• Small N: apparent problems that disappear with more data.

• Drift: changes over time can break a previously calibrated process. See Calibration Drift.

Practical ways to improve calibration

1) Start from base rates

Use the base rate as a prior, then move away from it only when you have real evidence.
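A minimal sketch of what "start from the base rate, then move on evidence" can look like in odds form, assuming you can summarize your evidence as a rough likelihood ratio; the numbers are hypothetical.

```python
def update_from_base_rate(base_rate, likelihood_ratio):
    """Bayes in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = base_rate / (1 - base_rate)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Base rate 30%, and evidence you judge to be about 2x more likely if the
# event happens than if it does not.
print(round(update_from_base_rate(0.30, 2.0), 3))  # about 0.462
```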

2) Reduce extreme probabilities

If your misses are mostly from very confident wrong calls, you may need to compress probabilities toward the middle. This is often the fastest win for reducing squared error.
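A minimal sketch of that compression, assuming a single shrinkage weight estimated from your own past results (the 0.8 below is purely illustrative). Shrinking toward your base rate instead of 0.50 is a common variant.

```python
def shrink(p, weight=0.8, anchor=0.5):
    """Pull a stated probability part of the way toward an anchor.

    weight=1.0 keeps the original forecast; weight=0.0 collapses to the anchor.
    """
    return weight * p + (1 - weight) * anchor

for p in (0.95, 0.80, 0.50, 0.10):
    print(p, "->", round(shrink(p), 2))
# 0.95 -> 0.86, 0.80 -> 0.74, 0.50 -> 0.5, 0.10 -> 0.18
```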

3) Review your buckets monthly

Track calibration by time window. If the pattern changes, you may have drift. See Rolling Window and Forecast Drift.
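A minimal sketch of a monthly review, assuming each resolved forecast carries a resolution date; it reuses the same bucket-style summary, keyed by month. The records are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical resolved forecasts: (resolution date, predicted prob, outcome).
records = [
    (date(2025, 11, 3), 0.7, 1),
    (date(2025, 11, 20), 0.8, 0),
    (date(2025, 12, 5), 0.7, 1),
    (date(2025, 12, 28), 0.6, 1),
]

by_month = defaultdict(list)
for d, p, outcome in records:
    by_month[(d.year, d.month)].append((p, outcome))

for month in sorted(by_month):
    entries = by_month[month]
    avg_pred = sum(p for p, _ in entries) / len(entries)
    realized = sum(o for _, o in entries) / len(entries)
    print(month, f"N={len(entries)}",
          f"avg_pred={avg_pred:.2f}", f"realized={realized:.2f}")
```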

Takeaway

Calibration means your probabilities are honest. If you say 70% repeatedly, those events should happen about 70% of the time. Use calibration tables with sample size context, avoid confusing sharpness with calibration, and apply simple corrections when patterns repeat.

Related

Calibration

Calibration Table

Overconfidence

Underconfidence

Sharpness

How to Read a Forecast Scorecard