Confidence Intervals for Calibration Buckets
Why calibration needs uncertainty
A calibration table compares predicted probabilities to realized frequencies in buckets.
But the realized frequency is an estimate from a finite sample. With a small sample, it can swing a lot even if the forecaster is perfectly calibrated.
Confidence intervals make that uncertainty visible.
What you are estimating in a bucket
In a bucket, you have:
• N forecasts
• k YES outcomes
The realized frequency is:
p_hat = k / N
A confidence interval gives a plausible range for the true underlying rate.
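As a minimal sketch (the function name realized_frequency is ours, not from any particular library):

```python
def realized_frequency(n_yes: int, n_forecasts: int) -> float:
    """Point estimate p_hat = k / N for one calibration bucket."""
    if n_forecasts == 0:
        raise ValueError("empty bucket: nothing to estimate from")
    return n_yes / n_forecasts

print(realized_frequency(12, 20))  # 0.6
```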
A practical 95 percent interval (binomial)
Bucket outcomes are binary, so the natural model is binomial.
From k and N you can compute a 95% interval for the true underlying rate.
Implementation note: there are multiple interval formulas. The Wilson score and Jeffreys intervals behave noticeably better than the simplest normal (Wald) approximation, especially for small N or rates near 0 or 1.
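One dependency-free option, as a sketch, is the Wilson score interval (the helper name wilson_interval is ours):

```python
import math

def wilson_interval(n_yes: int, n_forecasts: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for the true rate in a calibration bucket.

    Behaves better than the normal (Wald) approximation when N is small
    or the rate is near 0 or 1; z = 1.96 gives roughly 95% confidence.
    """
    if n_forecasts == 0:
        raise ValueError("empty bucket")
    p_hat = n_yes / n_forecasts
    z2 = z * z
    denom = 1.0 + z2 / n_forecasts
    center = (p_hat + z2 / (2.0 * n_forecasts)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n_forecasts + z2 / (4.0 * n_forecasts ** 2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

If you prefer a library, statsmodels exposes the same family of intervals via proportion_confint(count, nobs, method="wilson").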
Interpretation: what the interval means
If you repeat the same forecasting process many times, 95% of such intervals would contain the true underlying rate (under the binomial model).
It does not mean there is a 95% chance that the true rate is in the interval for this one bucket. It is a long-run coverage guarantee, not a statement about any single interval.
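A quick simulation makes the coverage idea concrete: fix a true rate, draw many buckets of size N, and count how often the interval captures that rate. This sketch reuses the wilson_interval helper above.

```python
import random

def coverage(true_rate: float, n_forecasts: int, trials: int = 50_000) -> float:
    """Fraction of simulated buckets whose 95% interval contains the true rate."""
    hits = 0
    for _ in range(trials):
        n_yes = sum(random.random() < true_rate for _ in range(n_forecasts))
        lo, hi = wilson_interval(n_yes, n_forecasts)
        hits += lo <= true_rate <= hi
    return hits / trials

print(coverage(0.75, 20))  # typically near 0.95; exact coverage varies with N and the rate
```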
How to use intervals on a scorecard
Rule 1: show N per bucket
An interval is hard to interpret without its N. Put N in the table so users do not overinterpret small buckets.
Rule 2: compare the predicted bucket mean to the interval
Let:
• p_bar be the average predicted probability in the bucket
• p_hat be the realized frequency k / N
If p_bar falls inside the interval, the bucket is consistent with calibration given the sample size. If it falls well outside, you have real evidence of miscalibration in that probability range.
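A single scorecard row can carry all of this; a sketch building on wilson_interval (column names are our own):

```python
def bucket_row(p_bar: float, n_yes: int, n_forecasts: int) -> dict:
    """One scorecard row: N, predicted mean, realized frequency, interval, consistency flag."""
    lo, hi = wilson_interval(n_yes, n_forecasts)
    return {
        "N": n_forecasts,
        "p_bar": round(p_bar, 3),
        "p_hat": round(n_yes / n_forecasts, 3),
        "ci_95": (round(lo, 3), round(hi, 3)),
        "consistent": lo <= p_bar <= hi,  # inside the interval, given this sample size
    }
```

The flag means "consistent with calibration given the sample size", not "well calibrated"; a tiny bucket will pass almost any p_bar.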
Rule 3: treat extreme buckets carefully
When p_hat is near 0 or 1 and N is small, intervals can be wide and asymmetric. That is normal. Do not “fix” calibration based on tiny extreme buckets.
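For instance, 1 YES out of 6 forecasts gives p_hat of roughly 0.17, but the interval around it is wide and lopsided (reusing wilson_interval from above):

```python
print(wilson_interval(1, 6))  # roughly (0.03, 0.56): wide and asymmetric around p_hat of about 0.17
```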
Worked intuition example
Suppose your 0.70 to 0.80 bucket has:
• N = 20
• k = 12 YES outcomes
• p_hat = 0.60
It might look like you are overconfident if p_bar is 0.75.
But with N = 20 the interval is wide: a 95% Wilson interval for 12 out of 20 runs from roughly 0.39 to 0.78, which comfortably contains 0.75. The correct conclusion is often "insufficient evidence to diagnose", not "definitely overconfident".
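Running those numbers through the wilson_interval sketch above:

```python
p_bar = 0.75                                  # average predicted probability in the bucket
lo, hi = wilson_interval(n_yes=12, n_forecasts=20)
print(round(lo, 3), round(hi, 3))             # roughly 0.387 0.781
print(lo <= p_bar <= hi)                      # True: consistent with calibration at this N
```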
How intervals change behavior
Adding intervals tends to improve scorecard quality because:
• users stop drawing stories from N=5 buckets
• platform support questions drop (“why is my 0.80 bucket bad”) because uncertainty is visible
• calibration fixes become data-driven and less reactive
Best practice: combine intervals with adaptive buckets
Intervals help, but bucket design still matters. If many buckets hold only a handful of forecasts, use fewer buckets or equal-count (quantile) buckets, as sketched below.
See Probability Buckets: How Many and How Wide.
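A sketch of equal-count bucketing (sort by predicted probability and split into groups of roughly equal N; the function name and data layout are our own assumptions):

```python
def equal_count_buckets(
    forecasts: list[tuple[float, bool]], n_buckets: int = 5
) -> list[list[tuple[float, bool]]]:
    """Split (predicted_probability, outcome) pairs into buckets of roughly equal size."""
    ordered = sorted(forecasts, key=lambda f: f[0])
    size = max(1, len(ordered) // n_buckets)
    buckets = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    if len(buckets) > n_buckets:               # fold a small remainder into the last bucket
        buckets[n_buckets - 1].extend(buckets.pop())
    return buckets
```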
Takeaway
Calibration bucket hit rates are estimates with uncertainty. Confidence intervals stop you from overreacting to noise and help you separate real miscalibration from random variation. For honest scorecards, always show N and an interval per bucket.