Evaluation Checkpoints: How to Score Forecasts Fairly
The fairness problem
If you let people update forecasts until the last minute and then score the final probability, you usually measure timing, not forecasting skill.
Why:
• late forecasts have more information
• some users only forecast when the answer is almost known
• others forecast early and take on real uncertainty
This is why scoreboards often become a game of waiting.
What an evaluation checkpoint is
An evaluation checkpoint is a fixed rule that says which forecast gets scored.
Examples:
• score the last forecast at T-24h before settlement
• score the last forecast at market close time
• score the first forecast after market open
The key is that everyone is evaluated at the same forecast horizon.
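In practice, the selection rule can be as small as "take the latest forecast made at or before the checkpoint time". Here is a minimal sketch; the function name, data layout, and timestamps are illustrative, not from any particular platform.

```python
from datetime import datetime, timezone

def forecast_at_checkpoint(forecasts, checkpoint):
    """Return the latest forecast made at or before the checkpoint, or None.

    `forecasts` is a list of (timestamp, probability) tuples; timestamps and
    `checkpoint` are timezone-aware datetimes. Field layout is illustrative.
    """
    eligible = [(t, p) for t, p in forecasts if t <= checkpoint]
    if not eligible:
        return None  # no forecast existed yet: treat as missing, not as a default
    return max(eligible, key=lambda tp: tp[0])[1]

# Example: only the 09:00 forecast counts at a 12:00 UTC checkpoint.
forecasts = [
    (datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc), 0.60),
    (datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc), 0.90),  # after checkpoint
]
checkpoint = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(forecast_at_checkpoint(forecasts, checkpoint))  # 0.6
```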
Why checkpoints improve leaderboards
Checkpoints reduce gaming and improve comparability:
• less advantage to waiting
• easier to compare forecasters fairly
• more meaningful calibration diagnostics by horizon
• better out-of-sample testing because the scoring rule is stable
Three common checkpoint designs
1) Fixed time before settlement
Example: T-24h.
Pros:
• easy to explain
• normalizes difficulty across markets
Cons:
• requires correct settlement timestamps
• tricky if markets resolve early or get suspended
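A minimal sketch of the T-24h rule, assuming each market stores an open and a settlement timestamp; the field names and the 24-hour horizon are illustrative. Markets that were open for less than the horizon return None so they can be excluded under a documented rule.

```python
from datetime import datetime, timedelta, timezone

def fixed_horizon_checkpoint(settlement_time, open_time, horizon=timedelta(hours=24)):
    """Checkpoint at a fixed horizon before settlement (e.g. T-24h).

    Returns None when the market was open for less than the horizon,
    in which case the market should be excluded or handled by a documented rule.
    """
    checkpoint = settlement_time - horizon
    if checkpoint < open_time:
        return None
    return checkpoint

opened = datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc)
settled = datetime(2024, 6, 10, 18, 0, tzinfo=timezone.utc)
print(fixed_horizon_checkpoint(settled, opened))  # 2024-06-09 18:00:00+00:00
```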
2) Daily snapshot checkpoint
Example: every day at 18:00 UTC, score the latest forecast for open markets.
Pros:
• works well for long-running questions
• creates a consistent audit trail and time series
Cons:
• can miss meaningful intraday updates unless you keep multiple snapshots
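A sketch of how the snapshot timestamps could be generated, assuming one snapshot per day at a fixed UTC time while the market is open; the function name and the 18:00 UTC default are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

def daily_snapshots(open_time, close_time, snapshot_utc=time(18, 0)):
    """Yield one snapshot timestamp per day at a fixed UTC time while the market is open."""
    day = open_time.date()
    while True:
        snap = datetime.combine(day, snapshot_utc, tzinfo=timezone.utc)
        if snap > close_time:
            break
        if snap >= open_time:
            yield snap
        day += timedelta(days=1)

opened = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
closed = datetime(2024, 3, 4, 12, 0, tzinfo=timezone.utc)
print(list(daily_snapshots(opened, closed)))  # 18:00 UTC on Mar 1, 2 and 3
```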
3) Market close checkpoint
Score the last forecast at the moment a market closes to new trading or new entries.
Pros:
• aligns with platform mechanics
• simple for free-to-play tournaments
Cons:
• if the close happens very late, you still reward late information
Edge cases you must define
No forecast before the checkpoint
Decide whether to:
• treat as missing and reduce coverage
• or fill with a default benchmark such as the base rate
Most scorecards should treat it as missing so coverage stays meaningful.
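A small illustration of why "missing stays missing" matters: coverage is the fraction of eligible markets that actually have a scored forecast at the checkpoint. The mapping layout below is an assumption for the sketch.

```python
def coverage(checkpoint_forecasts):
    """Fraction of eligible markets with a forecast at the checkpoint.

    `checkpoint_forecasts` maps market id -> probability, or None when no
    forecast existed at the checkpoint. Missing entries lower coverage
    instead of being silently filled with a default.
    """
    if not checkpoint_forecasts:
        return 0.0
    scored = sum(1 for p in checkpoint_forecasts.values() if p is not None)
    return scored / len(checkpoint_forecasts)

print(coverage({"m1": 0.7, "m2": None, "m3": 0.4}))  # 0.666...
```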
Multiple updates
Use a simple rule: the forecast that exists at the checkpoint is the one that is scored. Later updates do not matter for that checkpoint.
Market suspension, early resolution, disputes
Define what happens if a market freezes, resolves early, or enters dispute. If the checkpoint timestamp becomes ambiguous, exclude the market or fall back to a documented rule.
Avoiding look-ahead bias
The scoring system must not use future information.
Two hard rules:
• the forecast time must be at or before the checkpoint time
• any benchmark snapshot (for example market consensus) must also be at or before the checkpoint time
If you violate either rule, you introduce look-ahead bias and the leaderboard stops measuring real skill.
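Both rules are easy to enforce mechanically with a validation pass over the scoring inputs before any score is computed. The record fields below are assumptions about how the data might be stored.

```python
def validate_no_lookahead(records):
    """Raise if any scored forecast or benchmark snapshot postdates its checkpoint.

    Each record is a dict with 'checkpoint', 'forecast_time', and
    'benchmark_time' keys (timezone-aware datetimes); the layout is illustrative.
    """
    for r in records:
        if r["forecast_time"] > r["checkpoint"]:
            raise ValueError(f"look-ahead bias: forecast after checkpoint in {r}")
        if r["benchmark_time"] > r["checkpoint"]:
            raise ValueError(f"look-ahead bias: benchmark after checkpoint in {r}")
```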
What to show on the scorecard
To make checkpoint scoring trustworthy, publish:
• checkpoint definition (for example T-24h)
• sample size and coverage
• Brier score and Brier skill score at that checkpoint
• an audit trail (timestamps for forecasts and benchmarks)
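For reference, the Brier score at a checkpoint is the mean squared error between the forecast probabilities and the binary outcomes, and the Brier skill score compares it to a benchmark (for example the market consensus snapshotted at the same checkpoint). A minimal sketch with illustrative numbers:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, benchmark_probs, outcomes):
    """1 - BS_forecaster / BS_benchmark; positive means better than the benchmark."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(benchmark_probs, outcomes)

probs = [0.8, 0.3, 0.6]
consensus = [0.7, 0.5, 0.5]   # benchmark snapshot taken at the same checkpoint
outcomes = [1, 0, 1]
print(brier_score(probs, outcomes))                      # ~0.097
print(brier_skill_score(probs, consensus, outcomes))     # ~0.51, better than consensus
```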
Takeaway
Evaluation checkpoints are the simplest way to make forecasting scores fair. Pick a checkpoint rule, apply it consistently, and document it. Without checkpoints, you mostly reward late information, not forecasting skill.
Related
• Forecast Horizon: Why Early Predictions Are Harder
• Selection Bias and Coverage: How People Accidentally Fake Skill