
Rolling Windows: Tracking Improvement Over Time

January 1, 2026 · Scorecards

What a rolling window is

A rolling window is a moving slice of your history used to compute metrics repeatedly over time.

Example: compute Brier score over the last 30 days, then move the window forward by one day and compute again.
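That loop can be sketched in plain Python. This is a minimal illustration, assuming each resolved forecast is stored as a `(resolution_date, probability, outcome)` tuple; the function names are hypothetical, not from any particular library:

```python
from datetime import date, timedelta

def brier(pairs):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

def rolling_brier(history, window_days=30):
    """Brier score over a trailing time window, stepped forward one day at a time.

    `history` is a list of (resolution_date, probability, outcome) tuples.
    Returns (as_of_date, score) points; days with an empty window are skipped.
    """
    if not history:
        return []
    start = min(d for d, _, _ in history)
    end = max(d for d, _, _ in history)
    points = []
    day = start
    while day <= end:
        # Keep forecasts that resolved within the trailing window ending today.
        window = [(p, o) for d, p, o in history
                  if day - timedelta(days=window_days) < d <= day]
        if window:
            points.append((day, brier(window)))
        day += timedelta(days=1)
    return points
```

Each point in the returned series is one frame of the rolling scorecard; plotting them gives the trend line discussed below.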

Why rolling windows matter

Single-period scores can be misleading because forecasting is noisy.

• One extreme miss can dominate a short period, especially under log loss.

• A short hot streak can look like skill when it is variance.

• Your process can change over time due to learning or drift.

Rolling windows turn performance into a time series so you can see stability, trends, and breaks.

Two common rolling window styles

1) Time-based window

You include all forecasts within a time range.

• last 14 days

• last 30 days

• last 90 days

This is good when forecast volume is fairly consistent.

2) Count-based window

You include the last N forecasts.

• last 100 forecasts

• last 500 forecasts

This is good when activity is bursty, because each point uses a similar sample size.
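A count-based window is naturally a fixed-length buffer: append each new squared error and let the oldest one fall out. A minimal sketch, assuming forecasts arrive as `(probability, outcome)` pairs in resolution order:

```python
from collections import deque

def rolling_brier_by_count(forecasts, n=100):
    """Brier score over the last `n` resolved forecasts, recomputed after
    each new resolution.

    `forecasts` is a list of (probability, outcome) pairs in resolution
    order. Returns one score per forecast once the window is full, so every
    point averages exactly `n` squared errors.
    """
    window = deque(maxlen=n)  # oldest error drops out automatically
    scores = []
    for p, o in forecasts:
        window.append((p - o) ** 2)
        if len(window) == n:
            scores.append(sum(window) / n)
    return scores
```

Because every point averages the same number of forecasts, quiet weeks do not produce artificially jumpy scores the way a time-based window can.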

Choosing a window size

Pick a window that balances sensitivity and stability:

• smaller windows react faster but are noisier

• larger windows are smoother but can hide recent changes

A practical default

If your volume is moderate, start with:

• 30 day window for headline trends

• 90 day window for stability checks

If your volume is low, use count-based windows (for example, the last 200 forecasts) so you do not publish meaningless swings.

What to plot in a rolling scorecard

At minimum, plot:

• rolling Brier score

• rolling Brier skill score vs a stable benchmark

Then add diagnostics when possible:

• rolling calibration deviation (or bucket-based calibration summary)

• rolling coverage and activity counts
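The skill-score series can be computed the same way as the raw Brier series, scoring the benchmark over exactly the same forecasts in each window. A sketch, assuming parallel lists of user probabilities, benchmark probabilities (e.g. a fixed base rate), and 0/1 outcomes:

```python
def brier_skill_score(user, benchmark, outcomes):
    """Brier skill score of `user` against `benchmark` on the same outcomes.

    1.0 is perfect, 0.0 means no better than the benchmark, and negative
    values mean worse than the benchmark.
    """
    bs_user = sum((p - o) ** 2 for p, o in zip(user, outcomes)) / len(outcomes)
    bs_ref = sum((p - o) ** 2 for p, o in zip(benchmark, outcomes)) / len(outcomes)
    return 1 - bs_user / bs_ref

def rolling_bss(user, benchmark, outcomes, n=100):
    """Skill score over each trailing window of `n` forecasts."""
    return [
        brier_skill_score(user[i - n:i], benchmark[i - n:i], outcomes[i - n:i])
        for i in range(n, len(outcomes) + 1)
    ]
```

Keeping the benchmark inside each window is what makes the series fair: both forecaster and benchmark are always scored on the same questions.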

How to interpret the trend

Trend up or down

For Brier score, lower is better, so a downward trend is improvement.

For Brier skill score, higher is better, so an upward trend is improvement.

Step change

A sudden jump can signal a process change or a dataset change.

Common causes:

• you started forecasting a different category mix

• you changed your update behavior

• your benchmark definition changed (do not do this)

Wide swings

Wide swings usually mean the window is too small or the forecast volume is too low. Increase the window size or switch to count-based windows.

Detecting drift

Rolling windows are the easiest way to detect:

• forecast drift, where your probabilities shift over time

• calibration drift, where the same confidence levels stop matching reality

If your score is stable but calibration worsens, you likely need to adjust the mapping from stated confidence to probability, not to find new signal.
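Calibration deviation within a window can be summarized with probability buckets: group forecasts by stated probability, compare each bucket's average probability to its realized frequency, and weight the gaps by bucket size. A hypothetical sketch, assuming the window's forecasts are `(probability, outcome)` pairs:

```python
def calibration_deviation(forecasts, buckets=10):
    """Size-weighted mean absolute gap between stated probability and
    realized frequency, using equal-width probability buckets.

    `forecasts` is a list of (probability, outcome) pairs from one window.
    0.0 means perfectly calibrated within bucket resolution.
    """
    binned = [[] for _ in range(buckets)]
    for p, o in forecasts:
        # Clamp p == 1.0 into the top bucket.
        idx = min(int(p * buckets), buckets - 1)
        binned[idx].append((p, o))
    total = len(forecasts)
    deviation = 0.0
    for group in binned:
        if not group:
            continue
        avg_p = sum(p for p, _ in group) / len(group)
        freq = sum(o for _, o in group) / len(group)
        deviation += (len(group) / total) * abs(avg_p - freq)
    return deviation
```

Computing this per rolling window gives the calibration-drift series: a rising deviation with a flat Brier score is the signature of the mapping problem described above.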

Fairness: keep methodology consistent

Rolling charts only mean something if your methodology is stable:

• same benchmark definition

• same evaluation checkpoint rule

• same bucket scheme for calibration

If any of those change, the time series breaks and comparisons become unreliable.

Takeaway

Rolling windows turn one score into a trend you can trust. Use them to separate real improvement from noise, detect drift early, and keep your scorecards honest. If the chart is too jumpy, the fix is usually a larger window, not a new story.

Related

Rolling Window

Sample Size

Forecast Drift

Calibration Drift

How to Read a Forecast Scorecard