Probability Buckets: How Many and How Wide
Why buckets matter
Calibration is usually evaluated by grouping forecasts into probability buckets and comparing two numbers for each bucket:
• average predicted probability
• realized frequency (YES rate)
This produces a calibration table and a calibration curve.
But bucket design is a methodology choice. Buckets that are too narrow create fake patterns out of noise, and buckets that are too wide hide real problems.
The default bucket scheme
A common default is 10 buckets of width 0.10:
• 0.00 to 0.10
• 0.10 to 0.20
• ...
• 0.90 to 1.00
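This is easy to explain and works well when you have enough volume. Adjacent ranges share an endpoint, so pick one boundary convention (for example, each bucket includes its lower bound and excludes its upper bound, with 1.00 assigned to the last bucket) and apply it consistently.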
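As a minimal sketch of how a calibration table could be computed with equal-width buckets, the Python function below groups forecasts and reports the two numbers being compared. The function name `calibration_table`, its parameters, and the half-open boundary convention are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_buckets=10):
    """Group forecasts into equal-width buckets and summarize each one.

    probs:    forecast probabilities in [0, 1]
    outcomes: 1 for YES, 0 for NO (same length as probs)
    """
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # Assumed convention: half-open buckets [lo, hi);
        # a forecast of exactly 1.0 is placed in the last bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        groups[idx].append((p, y))

    rows = []
    for idx in sorted(groups):  # empty buckets are simply omitted
        pairs = groups[idx]
        n = len(pairs)
        rows.append({
            "bucket": (idx / n_buckets, (idx + 1) / n_buckets),
            "n": n,
            "avg_predicted": sum(p for p, _ in pairs) / n,
            "yes_rate": sum(y for _, y in pairs) / n,
        })
    return rows

# Tiny illustrative input (made-up numbers, far too few for real conclusions).
for row in calibration_table([0.05, 0.08, 0.95, 1.0], [0, 1, 1, 1]):
    print(row)
```

Plotting `avg_predicted` against `yes_rate` for each row gives the calibration curve.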
Bucket count depends on sample size
If your total sample size is small, 10 buckets is often too many.
Rule of thumb:
• aim for at least 20 to 30 forecasts per bucket for stable calibration signals
If you have N = 120 forecasts in total and use 10 buckets, you will average only 12 per bucket. That is noisy. Use fewer, wider buckets: with 4 to 6 buckets the average rises to 20 to 30 per bucket.
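The arithmetic behind this rule of thumb can be sketched as a small helper. The name `suggested_bucket_count` and the default of 25 forecasts per bucket (a midpoint of the 20 to 30 range) are assumptions made for illustration.

```python
def suggested_bucket_count(n_total, min_per_bucket=25, max_buckets=10):
    """Largest bucket count, capped at max_buckets, whose AVERAGE
    bucket size is at least min_per_bucket."""
    if n_total < 1:
        return 0
    # Integer division gives how many buckets of min_per_bucket fit in n_total;
    # always return at least one bucket when there is any data.
    return max(1, min(max_buckets, n_total // min_per_bucket))

print(suggested_bucket_count(120))   # 4
print(suggested_bucket_count(1000))  # 10
```

This only controls the average. Forecasts are rarely spread evenly, so individual buckets can still be much thinner and should be checked directly.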
Wide buckets vs narrow buckets
Wide buckets (fewer buckets)
Pros:
• more stable realized frequencies
• fewer false alarms from noise
Cons:
• hides subtle miscalibration patterns
Narrow buckets (more buckets)
Pros:
• more detail and diagnostic power
Cons:
• high variance, easy to overinterpret
• creates a misleading sense of precision
Adaptive buckets: a practical alternative
Instead of fixed-width buckets, use equal-count buckets.
Example:
• sort forecasts by probability
• split into 10 groups with equal counts
This gives each bucket a similar N, which stabilizes comparisons.
The tradeoff is interpretability: bucket ranges are not uniform.
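The sort-and-split procedure above might look like the following sketch. The function name `equal_count_buckets` and the output fields are hypothetical. Note that forecasts with identical probabilities can end up on both sides of a bucket boundary in this simple version.

```python
def equal_count_buckets(probs, outcomes, n_buckets=10):
    """Sort forecasts by probability and split them into groups
    whose sizes differ by at most one."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    n_buckets = min(n_buckets, n)  # never create more buckets than forecasts

    rows = []
    for i in range(n_buckets):
        # Integer arithmetic spreads any remainder across the buckets.
        chunk = pairs[i * n // n_buckets:(i + 1) * n // n_buckets]
        size = len(chunk)
        rows.append({
            "range": (chunk[0][0], chunk[-1][0]),  # observed min and max probability
            "n": size,
            "avg_predicted": sum(p for p, _ in chunk) / size,
            "yes_rate": sum(y for _, y in chunk) / size,
        })
    return rows

# Made-up data: 20 forecasts split into 4 buckets of 5 each.
probs = [i / 100 for i in range(1, 21)]
outcomes = [i % 2 for i in range(20)]
for row in equal_count_buckets(probs, outcomes, n_buckets=4):
    print(row["range"], row["n"])
```

Reporting the observed range of each bucket alongside N helps offset the loss of uniform bucket edges.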
Special handling near 0 and 1
Many forecasters rarely use extremes. If your 0.90 to 1.00 bucket is tiny, you can:
• merge the high buckets into one (for example, 0.80 to 1.00)
• or report extremes separately, with a clear “low N” warning
Never present a strong story about calibration in a bucket with N = 5.
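One way to automate the merging option is to combine adjacent rows until each reaches a minimum N. The sketch below assumes rows shaped like a calibration table (keys `bucket`, `n`, `avg_predicted`, `yes_rate`, each with n of at least 1); the names `merge_thin_buckets` and `min_n` and the threshold of 20 are illustrative choices.

```python
def _combine(a, b):
    """Merge two adjacent rows, weighting the averages by their counts."""
    n = a["n"] + b["n"]
    return {
        "bucket": (a["bucket"][0], b["bucket"][1]),
        "n": n,
        "avg_predicted": (a["avg_predicted"] * a["n"] + b["avg_predicted"] * b["n"]) / n,
        "yes_rate": (a["yes_rate"] * a["n"] + b["yes_rate"] * b["n"]) / n,
    }

def merge_thin_buckets(rows, min_n=20):
    """Walk buckets from low to high, accumulating rows until each
    merged bucket has at least min_n forecasts."""
    merged, pending = [], None
    for row in rows:
        pending = row if pending is None else _combine(pending, row)
        if pending["n"] >= min_n:
            merged.append(pending)
            pending = None
    if pending is not None:
        # A thin bucket left over at the top is folded into its lower neighbor.
        if merged:
            merged[-1] = _combine(merged[-1], pending)
        else:
            merged.append(pending)
    return merged

# Made-up rows: the 0.90 to 1.00 bucket has only 5 forecasts.
rows = [
    {"bucket": (0.7, 0.8), "n": 40, "avg_predicted": 0.75, "yes_rate": 0.70},
    {"bucket": (0.8, 0.9), "n": 22, "avg_predicted": 0.85, "yes_rate": 0.80},
    {"bucket": (0.9, 1.0), "n": 5, "avg_predicted": 0.95, "yes_rate": 1.00},
]
for row in merge_thin_buckets(rows):
    print(row["bucket"], row["n"])
```

Here the top two buckets collapse into a single 0.80 to 1.00 bucket with N = 27.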
Show uncertainty, not just point estimates
A calibration table is stronger when it includes:
• N per bucket
• a confidence interval for realized frequency
This prevents false conclusions from noise. See Confidence Intervals for Calibration Buckets.
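As an illustration, the Wilson score interval is one common choice for a binomial proportion and behaves reasonably at small N. Using it here is an assumption for the sake of the example, not necessarily the method the linked article recommends; the function name `wilson_interval` is also hypothetical.

```python
import math

def wilson_interval(yes_count, n, z=1.96):
    """Wilson score interval for a bucket's realized YES rate.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p_hat = yes_count / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return (max(0.0, center - half), min(1.0, center + half))

# A bucket where 5 of 5 forecasts resolved YES.
low, high = wilson_interval(5, 5)
print(round(low, 3), round(high, 3))
```

For a bucket with 5 YES outcomes out of 5, the lower bound comes out near 0.57, so a perfect-looking YES rate is still consistent with a true rate far below 1.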
Bucket choices can be gamed
If users know you publish bucket-based calibration, some will avoid buckets where they look bad by clustering their forecasts in a different range.
This is another reason to publish:
• forecast distribution
• coverage and sample size
• rolling windows
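A forecast distribution can be as simple as the count and share of forecasts in each bucket, which makes clustering visible. The sketch below uses a hypothetical helper name, `forecast_distribution`, and assumes the same half-open bucket convention with 1.0 placed in the last bucket.

```python
from collections import Counter

def forecast_distribution(probs, n_buckets=10):
    """Count and share of forecasts in each equal-width bucket,
    including buckets that received no forecasts."""
    counts = Counter(min(int(p * n_buckets), n_buckets - 1) for p in probs)
    total = len(probs)
    return [
        {
            "bucket": (i / n_buckets, (i + 1) / n_buckets),
            "n": counts.get(i, 0),
            "share": counts.get(i, 0) / total if total else 0.0,
        }
        for i in range(n_buckets)
    ]

# Made-up forecasts that cluster in the 0.50 to 0.60 bucket.
for row in forecast_distribution([0.55, 0.55, 0.58, 0.95]):
    if row["n"]:
        print(row["bucket"], row["n"], row["share"])
```

Computing the same table over rolling windows would show whether a forecaster's distribution shifts over time.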
Takeaway
Bucket design should match sample size. Ten equal-width buckets are a good default for large N but too noisy for small N. If your buckets are thin, use fewer buckets or equal-count buckets, and always show N and uncertainty so you do not mistake noise for miscalibration.