Backtesting Forecasters: A Minimal, Repeatable Template
What “backtesting forecasters” means
Backtesting a forecaster means measuring their historical performance using a fixed scoring methodology.
The goal is to answer:
• Is this performance real skill?
• Does it generalize over time?
• How does it compare to a baseline such as the base rate or market consensus?
The trap: changing rules after you see results
The easiest way to fake a great backtest is to tweak settings until it looks good.
To avoid that, you need a template with fixed definitions that you apply repeatedly.
That is why methodology disclosure matters. See Scorecard Methodology.
The minimal backtest template
Use this template as a checklist. If you cannot fill a field, your backtest is not ready.
1) Define the dataset
• date range (start and end)
• eligible questions or markets (categories, exclusions)
• event type (binary only or multi-class)
• how you handle voids and disputes
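Here is a minimal sketch of this step in code, assuming a hypothetical Market record; the field names (category, status, resolved_at) are illustrative, so adapt them to whatever your data source actually exports.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Market:
    """Hypothetical market record; field names are illustrative only."""
    market_id: str
    category: str
    is_binary: bool
    status: str                  # e.g. "resolved", "voided", "disputed"
    resolved_at: datetime


def is_eligible(m: Market, start: datetime, end: datetime,
                categories: set[str]) -> bool:
    """Dataset definition: date range, eligible categories,
    binary events only, voids and disputes excluded."""
    return (
        m.is_binary
        and m.status == "resolved"           # drop voids and disputes
        and m.category in categories
        and start <= m.resolved_at <= end
    )
```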
2) Define the forecast selection rule
This is the most important part.
• score the final forecast before settlement
• or score the forecast at an evaluation checkpoint (recommended)
Example: “score the last forecast at T-24h.”
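A minimal sketch of that checkpoint rule, assuming the forecaster's history is a list of (timestamp, probability) pairs; returning None when nothing was on record by the checkpoint lets you count the market against coverage instead of silently dropping it.

```python
from datetime import datetime, timedelta


def forecast_at_checkpoint(history: list[tuple[datetime, float]],
                           settlement: datetime,
                           hours_before: int = 24) -> float | None:
    """Return the last forecast made at or before the checkpoint
    (settlement minus `hours_before` hours), or None if there was none."""
    checkpoint = settlement - timedelta(hours=hours_before)
    on_record = [p for t, p in sorted(history) if t <= checkpoint]
    return on_record[-1] if on_record else None
```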
3) Define the metric
Pick one primary metric:
• Brier score for binary events
• log loss as an optional secondary metric
State whether each market is equally weighted.
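A minimal sketch of both metrics for binary events, with each market weighted equally; the epsilon clipping in the log loss is an assumption to avoid log(0) on hard 0/1 forecasts.

```python
import math


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes,
    equal weight per market."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_loss(probs: list[float], outcomes: list[int],
             eps: float = 1e-12) -> float:
    """Optional secondary metric; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(probs)
```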
4) Define the benchmark
To make results comparable, define a benchmark and compute skill:
• base rate benchmark (default)
• market consensus benchmark (only with liquidity filters)
Then compute the Brier skill score, where BS_ref is the benchmark's Brier score on the same set of markets:
BSS = 1 - (BS / BS_ref)
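A minimal sketch of the skill computation; base_rate_brier scores a benchmark that forecasts the observed base rate for every market, and in practice you would compute it per category rather than over the whole pool.

```python
def base_rate_brier(outcomes: list[int]) -> float:
    """Brier score of a benchmark that forecasts the base rate everywhere."""
    rate = sum(outcomes) / len(outcomes)
    return sum((rate - o) ** 2 for o in outcomes) / len(outcomes)


def brier_skill_score(bs: float, bs_ref: float) -> float:
    """BSS = 1 - BS / BS_ref: positive beats the benchmark, zero matches it,
    negative is worse."""
    return 1.0 - bs / bs_ref


# Usage, with brier_score from the metric sketch above:
# bss = brier_skill_score(brier_score(probs, outcomes), base_rate_brier(outcomes))
```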
5) Define quality and anti-gaming rules
• minimum sample size (N)
• minimum coverage of the eligible universe (share of eligible markets actually scored)
• horizon splits or checkpoint only
• liquidity filters if you use market consensus
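These rules can be enforced as explicit gates before any number is published; a minimal sketch, assuming you already know how many markets were eligible and how many you actually scored.

```python
def passes_quality_gates(n_scored: int, n_eligible: int,
                         min_n: int = 50,
                         min_coverage: float = 0.30) -> bool:
    """Reject the run if the sample is too small or if coverage of the
    eligible universe is too low (which invites cherry-picking)."""
    if n_eligible == 0:
        return False
    return n_scored >= min_n and n_scored / n_eligible >= min_coverage
```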
6) Define reporting outputs
Minimum outputs:
• BS overall
• BSS vs base rate
• N and coverage
• calibration table with buckets and N per bucket
Optional but strong:
• rolling-window trend
• breakdown by category
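A minimal sketch of the calibration table from the minimum outputs, using ten equal-width probability buckets; the bucket width is a convention, not a requirement.

```python
def calibration_table(probs: list[float], outcomes: list[int],
                      n_buckets: int = 10) -> list[dict]:
    """Per bucket: its probability range, N, mean forecast, observed frequency."""
    buckets: list[list[tuple[float, int]]] = [[] for _ in range(n_buckets)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the top bucket
        buckets[idx].append((p, o))
    rows = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        rows.append({
            "bucket": f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}",
            "n": len(items),
            "mean_forecast": sum(p for p, _ in items) / len(items),
            "observed_freq": sum(o for _, o in items) / len(items),
        })
    return rows
```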
A concrete example you can copy
• Dataset: all binary markets in categories A and B from Oct 1 to Dec 31
• Checkpoint: T-24h before settlement
• Metric: Brier score, equal weight per market
• Benchmark: category base rate for BSS
• Filters: require N at least 50 and coverage at least 30%
• Output: overall BS and BSS, plus bucket table and rolling 30 day chart
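To show how the rolling chart in this example could be produced, here is a minimal sketch assuming scored markets arrive as (settlement_date, probability, outcome) tuples; it recomputes each window from scratch, which is fine at this scale.

```python
from datetime import date, timedelta


def rolling_brier(scored: list[tuple[date, float, int]],
                  window_days: int = 30) -> list[tuple[date, float]]:
    """For each settlement date, the Brier score over all markets settled
    in the trailing `window_days` window ending on that date."""
    scored = sorted(scored)
    points = []
    for end_date, _, _ in scored:
        window = [(p, o) for d, p, o in scored
                  if end_date - timedelta(days=window_days) < d <= end_date]
        bs = sum((p - o) ** 2 for p, o in window) / len(window)
        points.append((end_date, bs))
    return points
```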
Common mistakes
Using final forecasts: rewards late updates rather than foresight. If you want to measure skill, score at checkpoints.
No coverage reporting: invites cherry-picking and selection bias.
Benchmark drift: changing the benchmark definition mid-stream breaks comparability.
Ignoring thin markets: market-consensus baselines can be noisy without liquidity filters.
Takeaway
A good backtest is boring: fixed dataset rules, fixed checkpoint rule, a primary metric, a benchmark, and clear reporting of N and coverage. If you can run the same template every month without changing definitions, your scorecard becomes credible.