Rolling Windows: Tracking Improvement Over Time
What a rolling window is
A rolling window is a moving slice of your forecast history over which you recompute a metric as time moves forward.
Example: compute the mean Brier score over the last 30 days, then slide the window forward by one day and compute it again.
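A minimal sketch in Python with pandas, assuming a hypothetical forecast log with columns date, prob, and outcome (one row per resolved forecast; the column names are illustrative, not from any particular tool):

    import pandas as pd

    # Hypothetical log: one row per resolved forecast.
    df = pd.DataFrame({
        "date": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-20", "2024-02-02", "2024-02-11"]
        ),
        "prob": [0.7, 0.2, 0.6, 0.9, 0.4],
        "outcome": [1, 0, 1, 1, 0],
    }).sort_values("date")

    # Per-forecast Brier score: squared error of the stated probability.
    df["brier"] = (df["prob"] - df["outcome"]) ** 2
    brier = df.set_index("date")["brier"]

    # Recompute the 30-day mean Brier at a daily step: for each day,
    # average every forecast dated inside the trailing 30-day window.
    days = pd.date_range(brier.index.min(), brier.index.max(), freq="D")
    rolling_brier = pd.Series(
        [brier.loc[day - pd.Timedelta(days=29): day].mean() for day in days],
        index=days,
    )

Plotting rolling_brier gives one point per day, each point summarizing the trailing month.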
Why rolling windows matter
Single-period scores can be misleading because forecasting is noisy.
• One extreme miss can dominate a short period, especially under log loss.
• A short hot streak can look like skill when it is only variance.
• Your process can change over time due to learning or drift.
Rolling windows turn performance into a time series so you can see stability, trends, and breaks.
Two common rolling window styles
1) Time-based window
You include all forecasts within a time range.
• last 14 days
• last 30 days
• last 90 days
This is good when forecast volume is fairly consistent.
2) Count-based window
You include the last N forecasts.
• last 100 forecasts
• last 500 forecasts
This is good when activity is bursty, because every point in the series rests on the same sample size.
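Both styles are short expressions in pandas, reusing the date-indexed brier series from the sketch above. Note that pandas evaluates these at each forecast timestamp rather than on a fixed daily grid; use a daily loop like the earlier sketch if you want one point per calendar day.

    # Time-based: every forecast dated within the trailing 30 days.
    # The number of forecasts behind each point can vary.
    time_based = brier.rolling("30D").mean()

    # Count-based: always exactly the last 100 forecasts, so every point
    # rests on the same sample size. NaN until 100 forecasts exist.
    count_based = brier.rolling(100).mean()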
Choosing a window size
Pick a window that balances sensitivity and stability:
• smaller windows react faster but are noisier
• larger windows are smoother but can hide recent changes
A practical default
If your volume is moderate, start with:
• 30 day window for headline trends
• 90 day window for stability checks
If your volume is low, use count-based windows (for example, the last 200 forecasts) so you do not publish meaningless swings.
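One way to feel the tradeoff concretely is to compute the same rolling series at several window sizes and compare how much each one swings, again with the brier series from earlier (on a five-row toy log the numbers are purely illustrative):

    # Shorter windows react faster, but the resulting series is jumpier,
    # which shows up as a larger point-to-point standard deviation.
    for window in ("14D", "30D", "90D"):
        series = brier.rolling(window).mean()
        print(window, round(series.std(), 4))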
What to plot in a rolling scorecard
At minimum, plot:
• rolling Brier score
• rolling Brier skill score vs a stable benchmark (sketched below)
Then add diagnostics when possible:
• rolling calibration deviation (or a bucket-based calibration summary)
• rolling coverage and activity counts
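A sketch of the headline pair, reusing the forecast log df from earlier. The constant 0.5 benchmark here is only a stand-in for whatever stable benchmark you actually report against:

    # Benchmark Brier on the same questions: a constant "always 0.5"
    # forecast, purely as a placeholder for your real benchmark.
    df["bench_brier"] = (0.5 - df["outcome"]) ** 2

    # Rolling means of both, then the skill score: 1 - yours / benchmark.
    # Positive values mean you beat the benchmark over that window.
    roll = df.set_index("date")[["brier", "bench_brier"]].rolling("30D").mean()
    rolling_bss = 1 - roll["brier"] / roll["bench_brier"]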
How to interpret the trend
Trend up or down
For Brier score, lower is better, so a downward trend is improvement.
For Brier skill score, higher is better, so an upward trend is improvement.
Step change
A sudden jump can signal a process change or a dataset change.
Common causes:
• you started forecasting a different category mix
• you changed your update behavior
• your benchmark definition changed (do not do this)
Wide swings
Wide swings usually mean the window is too small or forecast volume is too low. Increase the window or switch to count-based windows.
Detecting drift
Rolling windows are the easiest way to detect:
• forecast drift, where your probabilities shift over time
• calibration drift, where the same confidence levels stop matching reality
If your score is stable but calibration worsens, you likely need to recalibrate your probability mapping rather than hunt for new signal.
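One way to watch for calibration drift is a rolling calibration deviation, sketched below on the same hypothetical df log: inside each window, bucket forecasts by stated probability and measure the gap between stated probability and observed outcome rate.

    import numpy as np

    def calibration_deviation(window, n_buckets=5):
        """Weighted mean absolute gap between stated probability and
        observed outcome rate, across probability buckets."""
        if window.empty:
            return float("nan")
        buckets = np.minimum((window["prob"] * n_buckets).astype(int), n_buckets - 1)
        gaps, weights = [], []
        for _, grp in window.groupby(buckets):
            gaps.append(abs(grp["prob"].mean() - grp["outcome"].mean()))
            weights.append(len(grp))
        return float(np.average(gaps, weights=weights))

    # Recompute the summary over each trailing 90-day window, at a daily step.
    dated = df.set_index("date").sort_index()
    days = pd.date_range(dated.index.min(), dated.index.max(), freq="D")
    rolling_calib = pd.Series(
        [calibration_deviation(dated.loc[d - pd.Timedelta(days=89): d]) for d in days],
        index=days,
    )

If this series drifts upward while rolling Brier stays flat, that is the calibration-drift signature described above.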
Fairness: keep methodology consistent
Rolling charts only mean something if your methodology is stable:
• same benchmark definition
• same evaluation checkpoint rule
• same bucket scheme for calibration
If any of those change, the time series breaks and comparisons become unreliable.
Takeaway
Rolling windows turn one score into a trend you can trust. Use them to separate real improvement from noise, detect drift early, and keep your scorecards honest. If the chart is too jumpy, the fix is usually a larger window, not a new story.