
Out of Sample Testing for Forecasters

January 1, 2026 · Data and Methodology

What out of sample means

Out of sample means you evaluate performance on data that was not used to shape your forecasting process, rules, or model.

If you tune your approach on the same set you score, you can accidentally overfit and end up with a score that looks strong but does not repeat.

Why forecasters overfit without noticing

Overfitting is not just a machine learning problem. Human forecasters overfit too:

• you learn which question types you are good at and avoid the rest

• you learn the platform's patterns and only forecast when outcomes are almost decided

• you change your calibration mapping based on yesterday's results and assume it will hold tomorrow

This creates a nice looking track record that does not generalize.

The simplest out of sample setup: time split

For forecasting, time-based splits are usually the right default.

Example:

• train or tune on months 1 to 3

• evaluate on month 4

This prevents you from using future information by accident.
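A minimal sketch of that split in Python, assuming each resolved forecast is recorded as a timestamp, a predicted probability, and an outcome (the dates and values here are made up for illustration):

from datetime import datetime, timezone

# Hypothetical records: (forecast_time, predicted_probability, outcome)
forecasts = [
    (datetime(2025, 1, 10, tzinfo=timezone.utc), 0.70, 1),
    (datetime(2025, 2, 14, tzinfo=timezone.utc), 0.40, 0),
    (datetime(2025, 3, 20, tzinfo=timezone.utc), 0.85, 1),
    (datetime(2025, 4, 5, tzinfo=timezone.utc), 0.30, 1),
]

# Everything before the cutoff is used to tune; everything at or after it is held out.
cutoff = datetime(2025, 4, 1, tzinfo=timezone.utc)
tune_set = [f for f in forecasts if f[0] < cutoff]
test_set = [f for f in forecasts if f[0] >= cutoff]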

Use checkpoints so timing does not dominate

If you score the last update before settlement, you often reward late forecasting, not skill.

Use an evaluation checkpoint rule:

• score the forecast that exists at T-24h

• or score the forecast at a fixed market close time

Then apply the same rule to both in-sample and out-of-sample periods.
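A minimal sketch of that checkpoint rule, assuming each question has a settlement time and a list of timestamped probability updates (the function name and data are illustrative):

from datetime import datetime, timedelta, timezone

def forecast_at_checkpoint(updates, settlement_time, lead=timedelta(hours=24)):
    """Return the last forecast made at or before the checkpoint (T-24h by default).

    updates is a list of (timestamp, probability) pairs sorted by time.
    Returns None if no forecast existed by the checkpoint.
    """
    checkpoint = settlement_time - lead
    eligible = [p for t, p in updates if t <= checkpoint]
    return eligible[-1] if eligible else None

# Only the 0.60 update exists by T-24h, so it is the one that gets scored.
settlement = datetime(2025, 5, 10, 12, 0, tzinfo=timezone.utc)
updates = [
    (datetime(2025, 5, 7, 9, 0, tzinfo=timezone.utc), 0.60),
    (datetime(2025, 5, 10, 11, 0, tzinfo=timezone.utc), 0.95),  # too late to count
]
print(forecast_at_checkpoint(updates, settlement))  # 0.6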

Two common out of sample designs

1) Holdout period

You choose one block of future time as the holdout and do not touch it until evaluation.

This is simple and easy to explain on a scorecard.

2) Rolling walk forward

You evaluate repeatedly as time moves:

• tune on days 1 to 30, test on days 31 to 45

• tune on days 1 to 45, test on days 46 to 60

• repeat

This pairs well with rolling window scorecards.
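A rough sketch of how those expanding windows can be generated; the window lengths and dates are illustrative:

from datetime import date, timedelta

def walk_forward_windows(start, end, initial_tune_days=30, test_days=15):
    """Yield (tune_start, tune_end, test_start, test_end) windows.

    The tune window expands each step and the test window always follows it,
    matching the pattern above: tune on days 1-30, test on 31-45, then tune
    on days 1-45, test on 46-60, and so on.
    """
    tune_end = start + timedelta(days=initial_tune_days)
    while tune_end + timedelta(days=test_days) <= end:
        test_end = tune_end + timedelta(days=test_days)
        yield start, tune_end, tune_end, test_end
        tune_end = test_end

for window in walk_forward_windows(date(2025, 1, 1), date(2025, 4, 1)):
    print(window)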

Leakage: the silent killer

Data leakage means your evaluation uses information that was not available at the time of the forecast.

In forecasting, leakage often happens via:

• using market prices from after the forecast as a feature

• mixing time zones and accidentally shifting timestamps

• scoring against a consensus snapshot taken later than your forecast time

The fix is strict timestamp integrity and an audit trail of forecast times and benchmark times.
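A small sketch of such a check, assuming all timestamps are stored as timezone-aware UTC (which also sidesteps the time zone problem above); the function name is illustrative:

def check_no_leakage(forecast_time, feature_times, benchmark_time):
    """Raise if any input used to make or score the forecast postdates it.

    feature_times are the timestamps of inputs (e.g. market prices) used by
    the forecast; benchmark_time is when the benchmark snapshot was taken.
    """
    late_features = [t for t in feature_times if t > forecast_time]
    if late_features:
        raise ValueError(f"{len(late_features)} feature(s) postdate the forecast time")
    if benchmark_time > forecast_time:
        raise ValueError("benchmark snapshot is later than the forecast time")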

What to report in an out of sample scorecard

At minimum:

• Brier score in sample and out of sample

• Brier skill score versus a stable benchmark

• sample size (N) in both periods

• coverage or participation in both periods

If your out of sample results collapse, that is valuable information. It means you were not measuring general skill.
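A minimal sketch of computing those numbers side by side; the probabilities, outcomes, and benchmark values here are made up for illustration:

def brier_score(forecasts):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

def brier_skill_score(forecasts, benchmark):
    """BSS = 1 - BS_forecaster / BS_benchmark; positive means beating the benchmark."""
    return 1 - brier_score(forecasts) / brier_score(benchmark)

# Hypothetical (probability, outcome) pairs, with the benchmark's probabilities
# on the same questions in each period.
in_sample = [(0.8, 1), (0.3, 0), (0.6, 1)]
out_of_sample = [(0.7, 1), (0.4, 1)]
benchmark_in = [(0.5, 1), (0.5, 0), (0.5, 1)]
benchmark_out = [(0.5, 1), (0.5, 1)]

for label, fc, bench in [("in sample", in_sample, benchmark_in),
                         ("out of sample", out_of_sample, benchmark_out)]:
    print(label, "N =", len(fc),
          "BS =", round(brier_score(fc), 3),
          "BSS =", round(brier_skill_score(fc, bench), 3))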

Practical rules that make the test harder to game

• require a minimum N and minimum active days in the test period

• define the eligible market set and compute coverage against it

• use fixed checkpoints so waiting does not win

• document benchmark definition and keep it stable
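A small sketch of how these gates might be checked before a test-period score is published; the thresholds and names are illustrative, not recommendations:

def passes_test_period_rules(forecasted_ids, active_days, eligible_ids,
                             min_n=30, min_active_days=10, min_coverage=0.25):
    """Apply the anti-gaming gates: minimum N, minimum active days, and
    coverage computed against the full eligible market set."""
    n = len(forecasted_ids)
    coverage = len(set(forecasted_ids) & set(eligible_ids)) / len(eligible_ids)
    return n >= min_n and active_days >= min_active_days and coverage >= min_coverage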

Takeaway

Out of sample testing is the simplest way to separate real forecasting skill from overfitting. Use time splits, fixed checkpoints, and strict timestamp rules to avoid leakage. Then compare Brier score and Brier skill score in sample versus out of sample, alongside N and coverage, so your scorecard stays honest.

Related

Out of Sample

Benchmark

Brier Skill Score

Evaluation Checkpoint

Selection Bias and Coverage: How People Accidentally Fake Skill