Evaluation Checkpoints: How to Score Forecasts Fairly
The fairness problem
If you let people update forecasts until the last minute and then score the final probability, you usually measure timing, not forecasting skill.
Why:
• late forecasts have more information
• some users only forecast when the answer is almost known
• others forecast early and take on real uncertainty
This is why scoreboards often become a game of waiting.
What an evaluation checkpoint is
An evaluation checkpoint is a fixed rule that says which forecast gets scored.
Examples:
• score the last forecast at T-24h before settlement
• score the last forecast at market close time
• score the first forecast after market open
The key is that everyone is evaluated at the same forecast horizon.
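In practice, the selection rule can be as small as "take the latest forecast made at or before the checkpoint time". Here is a minimal sketch; the function name, data layout, and timestamps are illustrative, not from any particular platform.

```python
from datetime import datetime, timezone

def forecast_at_checkpoint(forecasts, checkpoint):
    """Return the latest forecast made at or before the checkpoint, or None.

    `forecasts` is a list of (timestamp, probability) tuples; timestamps and
    `checkpoint` are timezone-aware datetimes. Field layout is illustrative.
    """
    eligible = [(t, p) for t, p in forecasts if t <= checkpoint]
    if not eligible:
        return None  # no forecast existed yet: treat as missing, not as a default
    return max(eligible, key=lambda tp: tp[0])[1]

# Example: only the 09:00 forecast counts at a 12:00 UTC checkpoint.
forecasts = [
    (datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc), 0.60),
    (datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc), 0.90),  # after checkpoint
]
checkpoint = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(forecast_at_checkpoint(forecasts, checkpoint))  # 0.6
```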
Why checkpoints improve leaderboards
Checkpoints reduce gaming and improve comparability:
• less advantage to waiting
• easier to compare forecasters fairly
• more meaningful calibration diagnostics by horizon
• better out-of-sample testing because the scoring rule is stable
Three common checkpoint designs
1) Fixed time before settlement
Example: T-24h.
Pros:
• easy to explain
• normalizes difficulty across markets
Cons:
• requires correct settlement timestamps
• tricky if markets resolve early or get suspended
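A minimal sketch of the T-24h rule, assuming each market stores an open and a settlement timestamp; the field names and the 24-hour horizon are illustrative. Markets that were open for less than the horizon return None so they can be excluded under a documented rule.

```python
from datetime import datetime, timedelta, timezone

def fixed_horizon_checkpoint(settlement_time, open_time, horizon=timedelta(hours=24)):
    """Checkpoint at a fixed horizon before settlement (e.g. T-24h).

    Returns None when the market was open for less than the horizon,
    in which case the market should be excluded or handled by a documented rule.
    """
    checkpoint = settlement_time - horizon
    if checkpoint < open_time:
        return None
    return checkpoint

opened = datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc)
settled = datetime(2024, 6, 10, 18, 0, tzinfo=timezone.utc)
print(fixed_horizon_checkpoint(settled, opened))  # 2024-06-09 18:00:00+00:00
```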
2) Daily snapshot checkpoint
Example: every day at 18:00 UTC, score the latest forecast for open markets.
Pros:
• works well for long-running questions
• creates a consistent audit trail and time series
Cons:
• can miss meaningful intraday updates unless you keep multiple snapshots
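A sketch of how the snapshot timestamps could be generated, assuming one snapshot per day at a fixed UTC time while the market is open; the function name and the 18:00 UTC default are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

def daily_snapshots(open_time, close_time, snapshot_utc=time(18, 0)):
    """Yield one snapshot timestamp per day at a fixed UTC time while the market is open."""
    day = open_time.date()
    while True:
        snap = datetime.combine(day, snapshot_utc, tzinfo=timezone.utc)
        if snap > close_time:
            break
        if snap >= open_time:
            yield snap
        day += timedelta(days=1)

opened = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
closed = datetime(2024, 3, 4, 12, 0, tzinfo=timezone.utc)
print(list(daily_snapshots(opened, closed)))  # 18:00 UTC on Mar 1, 2 and 3
```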
3) Market close checkpoint
Score the last forecast at the moment a market closes to new trading or new entries.
Pros:
• aligns with platform mechanics
• simple for free-to-play tournaments
Cons:
• if the close happens very late, you still reward late information
Edge cases you must define
No forecast before the checkpoint
Decide whether to:
• treat as missing and reduce coverage
• or fill with a default benchmark such as the base rate
Most scorecards should treat it as missing so coverage stays meaningful.
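A small illustration of why "missing stays missing" matters: coverage is the fraction of eligible markets that actually have a scored forecast at the checkpoint. The mapping layout below is an assumption for the sketch.

```python
def coverage(checkpoint_forecasts):
    """Fraction of eligible markets with a forecast at the checkpoint.

    `checkpoint_forecasts` maps market id -> probability, or None when no
    forecast existed at the checkpoint. Missing entries lower coverage
    instead of being silently filled with a default.
    """
    if not checkpoint_forecasts:
        return 0.0
    scored = sum(1 for p in checkpoint_forecasts.values() if p is not None)
    return scored / len(checkpoint_forecasts)

print(coverage({"m1": 0.7, "m2": None, "m3": 0.4}))  # 0.666...
```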
Multiple updates
Use a simple rule: the forecast that exists at the checkpoint is the one that is scored. Later updates do not matter for that checkpoint.
Market suspension, early resolution, disputes
Define what happens if a market freezes, resolves early, or enters dispute. If the checkpoint timestamp becomes ambiguous, exclude the market or fall back to a documented rule.
Avoiding look-ahead bias
The scoring system must not use future information.
Two hard rules:
• the forecast time must be at or before the checkpoint time
• any benchmark snapshot (for example market consensus) must also be at or before the checkpoint time
If you violate either rule, you introduce look-ahead bias and the leaderboard stops measuring real skill.
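Both rules are easy to enforce mechanically with a validation pass over the scoring inputs before any score is computed. The record fields below are assumptions about how the data might be stored.

```python
def validate_no_lookahead(records):
    """Raise if any scored forecast or benchmark snapshot postdates its checkpoint.

    Each record is a dict with 'checkpoint', 'forecast_time', and
    'benchmark_time' keys (timezone-aware datetimes); the layout is illustrative.
    """
    for r in records:
        if r["forecast_time"] > r["checkpoint"]:
            raise ValueError(f"look-ahead bias: forecast after checkpoint in {r}")
        if r["benchmark_time"] > r["checkpoint"]:
            raise ValueError(f"look-ahead bias: benchmark after checkpoint in {r}")
```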
What to show on the scorecard
To make checkpoint scoring trustworthy, publish:
• checkpoint definition (for example T-24h)
• sample size and coverage
• Brier score and Brier skill score at that checkpoint
• an audit trail (timestamps for forecasts and benchmarks)
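For reference, the Brier score at a checkpoint is the mean squared error between the forecast probabilities and the binary outcomes, and the Brier skill score compares it to a benchmark (for example the market consensus snapshotted at the same checkpoint). A minimal sketch with illustrative numbers:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, benchmark_probs, outcomes):
    """1 - BS_forecaster / BS_benchmark; positive means better than the benchmark."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(benchmark_probs, outcomes)

probs = [0.8, 0.3, 0.6]
consensus = [0.7, 0.5, 0.5]   # benchmark snapshot taken at the same checkpoint
outcomes = [1, 0, 1]
print(brier_score(probs, outcomes))                      # ~0.097
print(brier_skill_score(probs, consensus, outcomes))     # ~0.51, better than consensus
```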
Takeaway
Evaluation checkpoints are the simplest way to make forecasting scores fair. Pick a checkpoint rule, apply it consistently, and document it. Without checkpoints, you mostly reward late information, not forecasting skill.
Related
• Forecast Horizon: Why Early Predictions Are Harder
• Selection Bias and Coverage: How People Accidentally Fake Skill