How to Read a Calibration Curve and Table
Two ways to view calibration
Calibration is usually shown in two formats:
• a calibration table with bucket statistics
• a calibration curve (also called a reliability diagram)
Both are built from the same data. The table is better for details. The curve is better for pattern recognition.
Step 1: understand probability buckets
Calibration diagnostics group forecasts into ranges, for example:
• 0.00 to 0.10
• 0.10 to 0.20
• ...
• 0.90 to 1.00
Each row of the table reports:
• how many forecasts fell into that bucket
• the average predicted probability in the bucket
• the realized frequency (share of YES outcomes) in the bucket
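A minimal sketch of how those three columns can be computed from a log of forecasts and resolved outcomes; the function name, bucket edges, and column labels here are illustrative, not any particular tool's output.

```python
import numpy as np

def calibration_table(probs, outcomes, n_buckets=10):
    """One row per probability bucket: count, mean predicted probability,
    and realized frequency (share of YES outcomes)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # 1 = YES, 0 = NO
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    # digitize places 1.0 past the last edge, so clip it back into the top bucket
    bucket = np.clip(np.digitize(probs, edges) - 1, 0, n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        in_b = bucket == b
        n = int(in_b.sum())
        rows.append({
            "bucket": f"{edges[b]:.2f} to {edges[b + 1]:.2f}",
            "count": n,
            "mean_predicted": round(float(probs[in_b].mean()), 3) if n else None,
            "realized_freq": round(float(outcomes[in_b].mean()), 3) if n else None,
        })
    return rows
```

Comparing mean_predicted against realized_freq row by row is exactly the reading exercise in the next two steps.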
Step 2: read the calibration table
Start from the counts.
Bucket counts and stability
Small buckets are noisy. A bucket with N = 8 can look wildly miscalibrated purely from sampling variance.
That is why good scorecards show sample size per bucket and, when possible, a confidence interval.
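One common choice for that interval is a Wilson score interval on the realized frequency. A sketch, assuming binary YES/NO outcomes; z = 1.96 corresponds to a roughly 95 percent interval.

```python
import math

def wilson_interval(yes_count, n, z=1.96):
    """Wilson score interval for the realized frequency in one bucket."""
    if n == 0:
        return (0.0, 1.0)
    p = yes_count / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# A small bucket, e.g. 5 YES out of N = 8 forecasts:
print(wilson_interval(5, 8))  # roughly (0.31, 0.86) -- far too wide to call miscalibration
```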
Overconfidence pattern
Overconfidence means your probabilities are more extreme than reality: in the higher buckets, the realized frequency falls short of the average predicted probability.
Example:
• predicted average = 0.80
• realized frequency = 0.62
See Overconfidence.
Underconfidence pattern
Underconfidence means your probabilities are too timid: in the higher buckets, the realized frequency exceeds the average predicted probability.
Example:
• predicted average = 0.60
• realized frequency = 0.75
See Underconfidence.
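A tiny helper that applies the two patterns above to one table row; the 0.05 tolerance is an arbitrary illustrative threshold, not a standard.

```python
def bucket_verdict(mean_predicted, realized_freq, tolerance=0.05):
    """Label one bucket by the gap between prediction and reality."""
    gap = mean_predicted - realized_freq
    if gap > tolerance:
        return "overconfident"    # e.g. predicted 0.80, realized 0.62
    if gap < -tolerance:
        return "underconfident"   # e.g. predicted 0.60, realized 0.75
    return "roughly calibrated"
```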
Step 3: read the calibration curve
A calibration curve plots:
• x axis: average predicted probability per bucket
• y axis: realized frequency per bucket
The diagonal line y = x is perfect calibration.
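A minimal plotting sketch, assuming scikit-learn and matplotlib are available; the synthetic forecasts are placeholders for your own data and are generated to be slightly overconfident.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Placeholder data: replace with your own forecasts and 0/1 resolutions.
rng = np.random.default_rng(0)
forecasts = rng.uniform(0.0, 1.0, 500)
outcomes = (rng.uniform(0.0, 1.0, 500) < forecasts ** 1.3).astype(int)  # YES resolves less often than predicted

prob_true, prob_pred = calibration_curve(outcomes, forecasts, n_bins=10)

plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration (y = x)")
plt.plot(prob_pred, prob_true, marker="o", label="observed")
plt.xlabel("average predicted probability per bucket")
plt.ylabel("realized frequency per bucket")
plt.legend()
plt.show()
```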
What the shape tells you
Curve below the diagonal: overconfidence.
Curve above the diagonal: underconfidence.
S shape: a common mixed pattern in which mid-range probabilities are too timid and high probabilities are too extreme.
Do not confuse calibration with sharpness
You can be perfectly calibrated but still not very informative if you never leave the 0.40 to 0.60 range. That is low sharpness.
That is why calibration is often paired with a look at the forecast distribution.
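A small sketch of a sharpness check to run alongside the table; the 0.40 to 0.60 band mirrors the example above and is illustrative, not a standard cutoff.

```python
import numpy as np

def sharpness_summary(forecasts, lo=0.40, hi=0.60):
    """How often forecasts sit in the hedging zone, and how far they
    sit from 0.5 on average (a larger distance means sharper forecasts)."""
    forecasts = np.asarray(forecasts, dtype=float)
    return {
        "share_in_hedging_zone": float(np.mean((forecasts >= lo) & (forecasts <= hi))),
        "mean_distance_from_half": float(np.mean(np.abs(forecasts - 0.5))),
    }
```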
Common mistakes when reading calibration
Mistake 1: ignoring bucket counts
Always read calibration together with N. A dramatic-looking point with N = 4 should not drive conclusions.
Mistake 2: mixing unlike questions
If you combine categories with very different base rates, you can create the appearance of miscalibration where none exists. Segment by category when needed.
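A quick way to check is to compare mean forecast and realized frequency per category before pooling; a sketch assuming a pandas DataFrame with hypothetical column names.

```python
import pandas as pd

# Hypothetical columns: "category", "forecast", "outcome" (1 = YES, 0 = NO).
df = pd.DataFrame({
    "category": ["sports", "sports", "politics", "politics"],
    "forecast": [0.70, 0.60, 0.20, 0.30],
    "outcome":  [1, 1, 0, 1],
})

# Per-category view: pooling categories with very different base rates can
# distort the combined calibration table.
per_category = df.groupby("category").agg(
    n=("outcome", "size"),
    mean_forecast=("forecast", "mean"),
    realized_freq=("outcome", "mean"),
)
print(per_category)
```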
Mistake 3: using too many buckets
More buckets means fewer samples per bucket. If your total N is small, use fewer buckets or wider ranges.
Mistake 4: forgetting drift
Calibration is not fixed: it can shift over time, a pattern known as calibration drift. Use a rolling window to track stability.
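A sketch of one possible rolling check, assuming forecasts are ordered by resolution date; the window size of 100 is arbitrary.

```python
import numpy as np

def rolling_calibration_gap(forecasts, outcomes, window=100):
    """Mean (predicted - realized) over a sliding window of recent forecasts.
    A gap that trends away from zero is a sign of calibration drift."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    gaps = []
    for end in range(window, len(forecasts) + 1):
        recent_f = forecasts[end - window:end]
        recent_o = outcomes[end - window:end]
        gaps.append(float(recent_f.mean() - recent_o.mean()))
    return gaps
```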
Quick checklist
• Are bucket counts healthy?
• Is the curve mostly above or below the diagonal?
• Are deviations consistent across multiple buckets, or isolated noise?
• Does forecast distribution show enough sharpness?
• Is there evidence of drift over time?
Takeaway
Calibration tables and curves show whether your probabilities mean what they say. The most common evaluation failure is overinterpreting noise in small buckets. Always read calibration with bucket counts, and pair it with sharpness and rolling stability checks.
Related
• Calibration Explained: Why 70 Percent Should Mean 70 Percent