Probability Buckets: How Many and How Wide

Why buckets matter

Calibration is usually evaluated by grouping forecasts into probability buckets and comparing:

• average predicted probability

• realized frequency (YES rate)

This produces a calibration table and a calibration curve.

But bucket design is a methodology choice. Bad buckets create fake patterns or hide real problems.

The default bucket scheme

A common default is 10 buckets of width 0.10:

• 0.00 to 0.10

• 0.10 to 0.20

• ...

• 0.90 to 1.00

This is easy to explain and works well when you have enough volume.
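A minimal sketch of how such a table could be built from the default scheme. The function name calibration_table and the row keys are illustrative, not from any particular library, and the boundary convention (lower edge inclusive, 1.00 folded into the top bucket) is an assumption you may want to change.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_buckets=10):
    """Group forecasts into equal-width buckets and summarize each one.

    probs: forecast probabilities in [0, 1]
    outcomes: 1 for YES, 0 for NO, in the same order as probs
    """
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # Assumed convention: lower edge inclusive; p == 1.0 goes in the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        groups[idx].append((p, y))

    rows = []
    for idx in sorted(groups):
        pairs = groups[idx]
        n = len(pairs)
        rows.append({
            "low": idx / n_buckets,
            "high": (idx + 1) / n_buckets,
            "n": n,
            "avg_predicted": sum(p for p, _ in pairs) / n,
            "yes_rate": sum(y for _, y in pairs) / n,
        })
    return rows
```

Empty buckets are simply omitted from the output, so a missing row means no forecasts fell in that range.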

Bucket count depends on sample size

If your total sample size is small, 10 buckets is often too many.

Rule of thumb:

• aim for at least 20 to 30 forecasts per bucket for stable calibration signals

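If you have N = 120 forecasts in total, 10 buckets average only 12 forecasts each. That is noisy, and fixed-width buckets are rarely filled evenly, so some will be thinner still. Use fewer buckets or wider ranges.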
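The arithmetic behind the rule of thumb can be sketched as a small helper. The name max_bucket_count and the cap of 10 are assumptions for illustration; the helper only controls the average bucket size, not the size of each individual bucket.

```python
def max_bucket_count(total_n, min_per_bucket=20, cap=10):
    """Largest bucket count that keeps the average bucket at or above min_per_bucket."""
    return max(1, min(cap, total_n // min_per_bucket))

print(max_bucket_count(120))                     # 6 buckets at 20 per bucket
print(max_bucket_count(120, min_per_bucket=30))  # 4 buckets at 30 per bucket
print(max_bucket_count(1000))                    # capped at 10
```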

Wide buckets vs narrow buckets

Wide buckets (fewer buckets)

Pros:

• more stable realized frequencies

• fewer false alarms from noise

Cons:

• hides subtle miscalibration patterns

Narrow buckets (more buckets)

Pros:

• more detail and diagnostic power

Cons:

• high variance, easy to overinterpret

• creates a misleading sense of precision

Adaptive buckets: a practical alternative

Instead of fixed-width buckets, use equal-count buckets.

Example:

• sort forecasts by probability

• split into 10 groups with equal counts

This gives each bucket a similar N, which stabilizes comparisons.

The tradeoff is interpretability: bucket ranges are not uniform.
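The sort-and-split steps above might look like the following sketch. The function name equal_count_buckets and the row keys are hypothetical. Note one simplification: forecasts with identical probabilities can end up on both sides of a bucket boundary, which you may want to handle differently.

```python
def equal_count_buckets(probs, outcomes, n_buckets=10):
    """Sort forecasts by probability and split them into groups of near-equal size."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_buckets):
        # Integer boundaries keep group sizes within one of each other.
        start = b * n // n_buckets
        end = (b + 1) * n // n_buckets
        chunk = pairs[start:end]
        if not chunk:
            continue  # fewer forecasts than buckets
        size = len(chunk)
        rows.append({
            "min_p": chunk[0][0],   # bucket range is data-driven, not uniform
            "max_p": chunk[-1][0],
            "n": size,
            "avg_predicted": sum(p for p, _ in chunk) / size,
            "yes_rate": sum(y for _, y in chunk) / size,
        })
    return rows
```

Reporting min_p and max_p for each row makes the non-uniform ranges explicit to the reader.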

Special handling near 0 and 1

Many forecasters rarely use extremes. If your 0.90 to 1.00 bucket is tiny, you can:

• merge high buckets (0.80 to 1.00)

• or report extremes separately but with a clear “low N” warning

Never present a strong story about calibration in a bucket with N = 5.
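Both options can be sketched in a few lines. The helpers merge_thin_top_buckets and flag_low_n are hypothetical names, the threshold of 20 is borrowed from the rule of thumb above, and the example rows are made-up numbers for illustration only.

```python
def merge_thin_top_buckets(rows, min_n=20):
    """Merge the highest buckets downward until the top bucket has at least min_n forecasts.

    rows: list of dicts sorted by "low", each with "low", "high", "n", "yes"
    (where "yes" is the count of YES outcomes in the bucket).
    """
    merged = [dict(r) for r in rows]  # copy so the input is not modified
    while len(merged) > 1 and merged[-1]["n"] < min_n:
        top = merged.pop()
        below = merged[-1]
        below["high"] = top["high"]
        below["n"] += top["n"]
        below["yes"] += top["yes"]
    return merged

def flag_low_n(rows, min_n=20):
    """Alternative: keep every bucket but attach a low-N warning flag."""
    return [dict(r, low_n=r["n"] < min_n) for r in rows]

# Illustrative data only.
rows = [
    {"low": 0.7, "high": 0.8, "n": 40, "yes": 29},
    {"low": 0.8, "high": 0.9, "n": 18, "yes": 15},
    {"low": 0.9, "high": 1.0, "n": 5, "yes": 5},
]
print(merge_thin_top_buckets(rows)[-1])
# {'low': 0.8, 'high': 1.0, 'n': 23, 'yes': 20}
```

The same idea applies in mirror image to thin buckets near 0.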

Show uncertainty, not just point estimates

A calibration table is stronger when it includes:

• N per bucket

• a confidence interval for realized frequency

This prevents false conclusions from noise. See Confidence Intervals for Calibration Buckets.
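As one possible way to compute the interval, here is a sketch using the Wilson score interval. The choice of Wilson is an assumption made for this example (it behaves reasonably at small N and near 0 or 1); the referenced article may use a different method.

```python
import math

def wilson_interval(yes, n, z=1.96):
    """Approximate 95% Wilson score interval for a bucket's realized frequency.

    yes: number of YES outcomes in the bucket
    n: number of forecasts in the bucket
    """
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p_hat = yes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

low, high = wilson_interval(5, 5)
print(f"{low:.2f} to {high:.2f}")
```

For a bucket with 5 YES outcomes out of 5, the interval runs from roughly 0.57 to 1.00, which shows how little a tiny bucket can tell you.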

Bucket choices can be gamed

If users know you publish bucket-based calibration, some will avoid the buckets where they look bad by clustering their forecasts in a different range.

This is another reason to publish:

• forecast distribution

• coverage and sample size

• rolling windows

Takeaway

Bucket design should match sample size. Ten equal-width buckets are a good default for large N but too noisy for small N. If your buckets are thin, use fewer buckets or equal-count buckets, and always show N and uncertainty so you do not mistake noise for miscalibration.
