Out of Sample Testing for Forecasters
What out of sample means
Out of sample means you evaluate performance on data that was not used to shape your forecasting process, rules, or model.
If you tune your approach on the same set of questions you score, you can overfit without realizing it and end up with a score that looks strong but will not repeat on new data.
Why forecasters overfit without noticing
Overfitting is not just a machine learning problem. Human forecasters overfit too:
• you learn which question types you are good at and avoid the rest
• you learn the platform patterns and only forecast when outcomes are almost decided
• you adjust your calibration mapping based on yesterday's results and assume the adjustment will hold tomorrow
This creates a nice-looking track record that does not generalize.
The simplest out of sample setup: time split
For forecasting, time based splits are usually the right default.
Example:
• train or tune on months 1 to 3
• evaluate on month 4
This prevents you from using future information by accident.
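A minimal sketch of that split in Python, assuming each forecast record carries its own timestamp (the record layout and dates are illustrative):

```python
from datetime import datetime

# Illustrative records: (forecast_time, forecast_probability, outcome)
forecasts = [
    (datetime(2024, 1, 15), 0.70, 1),
    (datetime(2024, 2, 10), 0.40, 0),
    (datetime(2024, 3, 20), 0.85, 1),
    (datetime(2024, 4, 5),  0.30, 0),
]

# Months 1-3 are for tuning; month 4 is the out of sample period.
split_date = datetime(2024, 4, 1)

tune_set = [f for f in forecasts if f[0] < split_date]
eval_set = [f for f in forecasts if f[0] >= split_date]
```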
Use checkpoints so timing does not dominate
If you score the last update before settlement, you often reward late forecasting, not skill.
Use an evaluation checkpoint rule:
• score the forecast that exists at T-24h, i.e. 24 hours before the question resolves
• or score the forecast at a fixed market close time
Then apply the same rule to both in-sample and out-of-sample periods.
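A minimal sketch of the T-24h rule, assuming you have a list of timestamped updates per question (the function and field names are illustrative):

```python
from datetime import timedelta

def forecast_at_checkpoint(updates, resolution_time, lead=timedelta(hours=24)):
    """Return the probability on record at T-24h, or None if none existed.

    `updates` is a list of (timestamp, probability) pairs for one question.
    A None result should count against coverage, not be silently skipped.
    """
    checkpoint = resolution_time - lead
    eligible = [(t, p) for t, p in updates if t <= checkpoint]
    if not eligible:
        return None
    # Most recent forecast made at or before the checkpoint.
    return max(eligible, key=lambda tp: tp[0])[1]
```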
Two common out of sample designs
1) Holdout period
You choose one block of future time as the holdout and do not touch it until evaluation.
This is simple and easy to explain on a scorecard.
2) Rolling walk forward
You evaluate repeatedly as time moves:
• tune on days 1 to 30, test on days 31 to 45
• tune on days 1 to 45, test on days 46 to 60
• repeat
This pairs well with rolling window scorecards.
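A sketch of that walk-forward schedule, with an expanding tune window and a fixed-length test window (the window lengths are illustrative):

```python
def walk_forward_windows(n_days, first_tune_end=30, test_len=15):
    """Yield (tune_days, test_days) ranges for an expanding walk-forward.

    Produces (1-30, 31-45), (1-45, 46-60), ... until the data runs out.
    """
    tune_end = first_tune_end
    while tune_end + test_len <= n_days:
        yield (1, tune_end), (tune_end + 1, tune_end + test_len)
        tune_end += test_len

for tune, test in walk_forward_windows(90):
    print(f"tune on days {tune[0]}-{tune[1]}, test on days {test[0]}-{test[1]}")
```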
Leakage: the silent killer
Data leakage means your evaluation uses information that was not available at the time of the forecast.
In forecasting, leakage often happens via:
• using market prices from after the forecast as a feature
• mixing time zones and accidentally shifting timestamps
• scoring against a consensus snapshot taken later than your forecast time
The fix is strict timestamp integrity and an audit trail of forecast times and benchmark times.
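One way to enforce that is a guard that runs before scoring and fails loudly if any input postdates the forecast. A sketch, assuming all timestamps are timezone-aware UTC datetimes (which also sidesteps the time zone shifting problem):

```python
def assert_no_leakage(forecast_time, benchmark_time, feature_times):
    """Reject a scoring run if any input uses post-forecast information."""
    if benchmark_time > forecast_time:
        raise ValueError("benchmark snapshot taken after the forecast time")
    late = [t for t in feature_times if t > forecast_time]
    if late:
        raise ValueError(f"{len(late)} feature(s) use post-forecast information")
```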
What to report in an out of sample scorecard
At minimum:
• Brier score (BS) in sample and out of sample
• Brier skill score (BSS) versus a stable benchmark
• sample size (N) in both periods
• coverage or participation in both periods
If your out of sample results collapse, that is valuable information. It means you were not measuring general skill.
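A minimal sketch of those numbers, using the standard definitions: BS is the mean of (probability minus outcome) squared, and BSS = 1 - BS / BS_benchmark (the function names and row layout are illustrative):

```python
def brier_score(probs, outcomes):
    """Mean squared error of probabilities against 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes, benchmark_probs):
    """BSS = 1 - BS / BS_benchmark; positive means you beat the benchmark."""
    return 1 - brier_score(probs, outcomes) / brier_score(benchmark_probs, outcomes)

def scorecard_row(period, probs, outcomes, benchmark_probs, n_eligible):
    """One scorecard line; compute it for both the in sample and out of sample periods."""
    return {
        "period": period,
        "N": len(probs),
        "coverage": len(probs) / n_eligible,
        "BS": round(brier_score(probs, outcomes), 4),
        "BSS": round(brier_skill_score(probs, outcomes, benchmark_probs), 4),
    }
```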
Practical rules that make the test harder to game
• require a minimum N and minimum active days in the test period
• define the eligible market set and compute coverage against it
• use fixed checkpoints so waiting does not win
• document benchmark definition and keep it stable
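A sketch of such an eligibility check; the thresholds below are illustrative and should be fixed for your platform before the test period starts:

```python
def passes_test_period_rules(n_forecasts, active_days, n_eligible_markets,
                             min_n=50, min_days=20, min_coverage=0.5):
    """Reject test periods too thin to mean anything (thresholds are illustrative)."""
    coverage = n_forecasts / n_eligible_markets
    return (n_forecasts >= min_n
            and active_days >= min_days
            and coverage >= min_coverage)
```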
Takeaway
Out of sample testing is the simplest way to tell real forecasting skill from overfit. Use time splits, fixed checkpoints, and strict timestamp rules to avoid leakage. Then compare BS and BSS in sample versus out of sample with N and coverage, so your scorecard stays honest.
Related
• Selection Bias and Coverage: How People Accidentally Fake Skill