How to Read a Calibration Curve and Table

January 1, 2026 · Calibration

Two ways to view calibration

Calibration is usually shown in two formats:

• a calibration table with bucket statistics

• a calibration curve (also called a reliability diagram)

Both are built from the same data. The table is better for details. The curve is better for pattern recognition.

Step 1: understand probability buckets

Calibration diagnostics group forecasts into ranges, for example:

• 0.00 to 0.10

• 0.10 to 0.20

• ...

• 0.90 to 1.00

Each row of the table reports:

• how many forecasts fell into that bucket

• the average predicted probability in the bucket

• the realized frequency (share of YES outcomes) in the bucket
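The three statistics above can be computed directly from resolved forecasts. A minimal sketch, assuming forecasts arrive as (probability, outcome) pairs with outcome 0 or 1; the function name `bucket_stats` and the sample data are illustrative, not a standard API.

```python
# Group (probability, outcome) pairs into fixed-width buckets and report
# count, mean predicted probability, and realized frequency per bucket.
def bucket_stats(forecasts, n_buckets=10):
    buckets = [[] for _ in range(n_buckets)]
    for p, outcome in forecasts:
        i = min(int(p * n_buckets), n_buckets - 1)  # p = 1.0 goes in the top bucket
        buckets[i].append((p, outcome))
    rows = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        n = len(items)
        rows.append({
            "bucket": (i / n_buckets, (i + 1) / n_buckets),
            "n": n,
            "mean_pred": sum(p for p, _ in items) / n,
            "realized": sum(o for _, o in items) / n,
        })
    return rows

data = [(0.05, 0), (0.12, 0), (0.18, 1), (0.81, 1), (0.85, 0), (0.88, 1)]
for row in bucket_stats(data):
    print(row)
```

Each printed row corresponds to one line of a calibration table.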

Step 2: read the calibration table

Start from the counts.

Bucket counts and stability

Small buckets are noisy. A bucket with N = 8 can look wildly miscalibrated due to variance.

That is why good scorecards show sample size per bucket and, when possible, a confidence interval.
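One common way to put a confidence interval on a bucket's realized frequency is the Wilson score interval, which behaves reasonably at small N. A sketch; the N = 8 bucket below is an invented example matching the text.

```python
import math

# Wilson score interval for a binomial proportion (realized frequency in a bucket).
def wilson_interval(successes, n, z=1.96):
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# A bucket with N = 8 and 5 YES outcomes: the interval is very wide,
# so an apparently miscalibrated point there may be pure noise.
lo, hi = wilson_interval(5, 8)
print(f"[{lo:.2f}, {hi:.2f}]")
```

With N = 8 the interval spans more than half the probability axis, which is exactly why small buckets should not drive conclusions.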

Overconfidence pattern

Overconfidence means the realized frequency is lower than the predicted probability in the higher buckets.

Example:

• predicted average = 0.80

• realized frequency = 0.62

See Overconfidence.

Underconfidence pattern

Underconfidence means the realized frequency is higher than the predicted probability in the higher buckets.

Example:

• predicted average = 0.60

• realized frequency = 0.75

See Underconfidence.
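Both patterns above reduce to the sign of the gap between realized frequency and predicted probability. A minimal sketch; the function name and the 0.05 tolerance are illustrative choices, not a standard.

```python
# Label one bucket's deviation from perfect calibration.
def bucket_pattern(mean_pred, realized, tol=0.05):
    gap = realized - mean_pred
    if gap < -tol:
        return "overconfident"   # outcomes happen less often than predicted
    if gap > tol:
        return "underconfident"  # outcomes happen more often than predicted
    return "well calibrated"

print(bucket_pattern(0.80, 0.62))  # the overconfidence example above
print(bucket_pattern(0.60, 0.75))  # the underconfidence example above
```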

Step 3: read the calibration curve

A calibration curve plots:

• x axis: average predicted probability per bucket

• y axis: realized frequency per bucket

The diagonal line y = x is perfect calibration.

What the shape tells you

Curve below the diagonal: overconfidence.

Curve above the diagonal: underconfidence.

S shape: a common pattern where mid-range probabilities are too timid and high probabilities are too extreme.
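The visual reading above can be turned into a quick summary: count how many bucket points fall below versus above the diagonal. A sketch under illustrative names and sample points.

```python
# Given (mean_pred, realized) points per bucket, summarize where the
# curve sits relative to the diagonal y = x.
def curve_position(points):
    below = sum(1 for x, y in points if y < x)
    above = sum(1 for x, y in points if y > x)
    if below > above:
        return "mostly below diagonal (overconfidence)"
    if above > below:
        return "mostly above diagonal (underconfidence)"
    return "mixed"

points = [(0.2, 0.15), (0.5, 0.44), (0.8, 0.62)]
print(curve_position(points))
```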

Do not confuse calibration with sharpness

You can be perfectly calibrated yet not very informative if your forecasts never leave the 0.40 to 0.60 range. That is low sharpness.

That is why calibration is often paired with the forecast distribution.
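One simple sharpness summary is the average distance of forecasts from the uninformative 0.5. This is an illustrative measure, not a standard metric name; the sample forecast lists are invented.

```python
# Mean distance of forecasts from 0.5: near zero for timid forecasters,
# large for forecasters who commit to extremes.
def sharpness(probs):
    return sum(abs(p - 0.5) for p in probs) / len(probs)

timid = [0.45, 0.50, 0.55, 0.48, 0.52]  # never leaves 0.40-0.60
bold = [0.05, 0.90, 0.10, 0.95, 0.85]   # commits to extremes

print(sharpness(timid))  # low
print(sharpness(bold))   # high
```

Both lists could be perfectly calibrated, but only the second carries much information.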

Common mistakes when reading calibration

Mistake 1: ignoring bucket counts

Always read calibration together with N. A dramatic looking point with N = 4 should not drive conclusions.

Mistake 2: mixing unlike questions

If you combine categories with very different base rates, you can create fake miscalibration. Segment by category when needed.

Mistake 3: using too many buckets

More buckets means fewer samples per bucket. If your total N is small, use fewer buckets or wider ranges.
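One way to make that trade-off concrete is to pick the bucket count from total N. The 30-per-bucket floor below is a rule of thumb assumed for illustration, not a standard.

```python
# Choose a bucket count that keeps roughly min_per_bucket samples
# per bucket, capped at the usual 10 deciles.
def choose_buckets(total_n, min_per_bucket=30, max_buckets=10):
    return max(1, min(max_buckets, total_n // min_per_bucket))

print(choose_buckets(80))    # small N: fewer, wider buckets
print(choose_buckets(1000))  # large N: full 10 buckets
```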

Mistake 4: forgetting drift

Your calibration can change over time; this is calibration drift. Use a rolling window to track stability.
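A rolling window check can be sketched as the mean gap between outcome and prediction over the most recent forecasts, recomputed as new forecasts resolve. Function name, window size, and the toy history are illustrative.

```python
# Rolling calibration gap: mean (outcome - prediction) over the last
# `window` resolved forecasts. Negative gaps suggest overconfidence.
def rolling_gap(forecasts, window=50):
    gaps = []
    for end in range(window, len(forecasts) + 1):
        chunk = forecasts[end - window:end]
        gaps.append(sum(o - p for p, o in chunk) / window)
    return gaps

# Toy history: early 80% forecasts all miss, later ones all hit.
history = [(0.8, 0)] * 4 + [(0.8, 1)] * 4
print(rolling_gap(history, window=4))
```

The gap moving from strongly negative toward positive would be evidence of drift rather than a single stable bias.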

Quick checklist

• Are bucket counts healthy?

• Is the curve mostly above or below the diagonal?

• Are deviations consistent across multiple buckets, or isolated noise?

• Does forecast distribution show enough sharpness?

• Is there evidence of drift over time?

Takeaway

Calibration tables and curves show whether your probabilities mean what they say. The most common evaluation failure is overinterpreting noise in small buckets. Always read calibration with bucket counts, and pair it with sharpness and rolling stability checks.

Related

Calibration

Calibration Table

Calibration Curve

Reliability Diagram

Sample Size

Confidence Interval

Calibration Explained: Why 70 Percent Should Mean 70 Percent