Confidence Intervals for Calibration Buckets
Why calibration needs uncertainty
A calibration table compares predicted probabilities to realized frequencies in buckets.
But the realized frequency is an estimate from a finite sample. With a small sample, it can swing a lot even if the forecaster is perfectly calibrated.
Confidence intervals make that uncertainty visible.
What you are estimating in a bucket
In a bucket, you have:
• N forecasts
• k YES outcomes
The realized frequency is:
p_hat = k / N
A confidence interval gives a plausible range for the true underlying rate.
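As a minimal sketch (the function name realized_frequency is ours, not from any particular library):

```python
def realized_frequency(n_yes: int, n_forecasts: int) -> float:
    """Point estimate p_hat = k / N for one calibration bucket."""
    if n_forecasts == 0:
        raise ValueError("empty bucket: nothing to estimate from")
    return n_yes / n_forecasts

print(realized_frequency(12, 20))  # 0.6
```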
A practical 95 percent interval (binomial)
Bucket outcomes are binary, so the natural model is binomial.
From k and N you can compute a 95% interval for the true underlying rate.
Implementation note: there are multiple interval formulas. The Wilson score and Jeffreys intervals behave noticeably better than the simplest normal (Wald) approximation, especially for small N or rates near 0 or 1.
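One dependency-free option, as a sketch, is the Wilson score interval (the helper name wilson_interval is ours):

```python
import math

def wilson_interval(n_yes: int, n_forecasts: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for the true rate in a calibration bucket.

    Behaves better than the normal (Wald) approximation when N is small
    or the rate is near 0 or 1; z = 1.96 gives roughly 95% confidence.
    """
    if n_forecasts == 0:
        raise ValueError("empty bucket")
    p_hat = n_yes / n_forecasts
    z2 = z * z
    denom = 1.0 + z2 / n_forecasts
    center = (p_hat + z2 / (2.0 * n_forecasts)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n_forecasts + z2 / (4.0 * n_forecasts ** 2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

If you prefer a library, statsmodels exposes the same family of intervals via proportion_confint(count, nobs, method="wilson").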
Interpretation: what the interval means
If you repeat the same forecasting process many times, 95% of such intervals would contain the true underlying rate (under the binomial model).
It does not mean there is a 95% chance that the true rate is in the interval for this one bucket. It is a long-run coverage guarantee, not a statement about any single interval.
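A quick simulation makes the coverage idea concrete: fix a true rate, draw many buckets of size N, and count how often the interval captures that rate. This sketch reuses the wilson_interval helper above.

```python
import random

def coverage(true_rate: float, n_forecasts: int, trials: int = 50_000) -> float:
    """Fraction of simulated buckets whose 95% interval contains the true rate."""
    hits = 0
    for _ in range(trials):
        n_yes = sum(random.random() < true_rate for _ in range(n_forecasts))
        lo, hi = wilson_interval(n_yes, n_forecasts)
        hits += lo <= true_rate <= hi
    return hits / trials

print(coverage(0.75, 20))  # typically near 0.95; exact coverage varies with N and the rate
```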
How to use intervals on a scorecard
Rule 1: show N per bucket
An interval is hard to interpret without its N. Put N in the table so users do not overinterpret small buckets.
Rule 2: compare the predicted bucket mean to the interval
Let:
• p_bar be the average predicted probability in the bucket
• p_hat be the realized frequency k / N
If p_bar falls inside the interval, the bucket is consistent with calibration given the sample size. If it falls well outside, you have real evidence of miscalibration in that probability range.
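A single scorecard row can carry all of this; a sketch building on wilson_interval (column names are our own):

```python
def bucket_row(p_bar: float, n_yes: int, n_forecasts: int) -> dict:
    """One scorecard row: N, predicted mean, realized frequency, interval, consistency flag."""
    lo, hi = wilson_interval(n_yes, n_forecasts)
    return {
        "N": n_forecasts,
        "p_bar": round(p_bar, 3),
        "p_hat": round(n_yes / n_forecasts, 3),
        "ci_95": (round(lo, 3), round(hi, 3)),
        "consistent": lo <= p_bar <= hi,  # inside the interval, given this sample size
    }
```

The flag means "consistent with calibration given the sample size", not "well calibrated"; a tiny bucket will pass almost any p_bar.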
Rule 3: treat extreme buckets carefully
When p_hat is near 0 or 1 and N is small, intervals can be wide and asymmetric. That is normal. Do not “fix” calibration based on tiny extreme buckets.
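For instance, 1 YES out of 6 forecasts gives p_hat of roughly 0.17, but the interval around it is wide and lopsided (reusing wilson_interval from above):

```python
print(wilson_interval(1, 6))  # roughly (0.03, 0.56): wide and asymmetric around p_hat of about 0.17
```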
Worked intuition example
Suppose your 0.70 to 0.80 bucket has:
• N = 20
• k = 12 YES outcomes
• p_hat = 0.60
It might look like you are overconfident if p_bar is 0.75.
But with N = 20 the interval is wide: a 95% Wilson interval for 12 out of 20 runs from roughly 0.39 to 0.78, which comfortably contains 0.75. The correct conclusion is often "insufficient evidence to diagnose", not "definitely overconfident".
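Running those numbers through the wilson_interval sketch above:

```python
p_bar = 0.75                                  # average predicted probability in the bucket
lo, hi = wilson_interval(n_yes=12, n_forecasts=20)
print(round(lo, 3), round(hi, 3))             # roughly 0.387 0.781
print(lo <= p_bar <= hi)                      # True: consistent with calibration at this N
```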
How intervals change behavior
Adding intervals tends to improve scorecard quality because:
• users stop drawing stories from N=5 buckets
• platform support questions drop (“why is my 0.80 bucket bad”) because uncertainty is visible
• calibration fixes become data-driven and less reactive
Best practice: combine intervals with adaptive buckets
Intervals help, but bucket design still matters. If many buckets hold only a handful of forecasts, use fewer buckets or equal-count (quantile) buckets, as sketched below.
See Probability Buckets: How Many and How Wide.
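A sketch of equal-count bucketing (sort by predicted probability and split into groups of roughly equal N; the function name and data layout are our own assumptions):

```python
def equal_count_buckets(
    forecasts: list[tuple[float, bool]], n_buckets: int = 5
) -> list[list[tuple[float, bool]]]:
    """Split (predicted_probability, outcome) pairs into buckets of roughly equal size."""
    ordered = sorted(forecasts, key=lambda f: f[0])
    size = max(1, len(ordered) // n_buckets)
    buckets = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    if len(buckets) > n_buckets:               # fold a small remainder into the last bucket
        buckets[n_buckets - 1].extend(buckets.pop())
    return buckets
```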
Takeaway
Calibration bucket hit rates are estimates with uncertainty. Confidence intervals stop you from overreacting to noise and help you separate real miscalibration from random variation. For honest scorecards, always show N and an interval per bucket.