Selection Bias and Coverage: How People Accidentally Fake Skill
The leaderboard problem
If you let users choose what to forecast, you create a measurement problem: people do not forecast the same set of questions.
That means differences in score can come from:
• real forecasting skill
• picking easier questions
• avoiding high-variance questions
• forecasting only when the answer is already obvious
This is selection bias.
What selection bias looks like in practice
Common patterns:
• only forecasting near-certain events (probabilities of 0.90 to 0.99) to avoid looking wrong
• only forecasting after a big news update, just before resolution
• skipping categories with low signal and forecasting only in your comfort zone
• deleting or avoiding hard questions after early misses
The result is a scorecard that looks strong but does not generalize.
Coverage is the simplest defense
Coverage is the share of eligible questions a user actually forecasts.
A scorecard should never show performance without coverage context. A great Brier skill score with very low coverage is a red flag.
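As a concrete illustration, here is a minimal sketch of the coverage calculation, assuming forecasts and the eligible pool are available as simple sets of question IDs (all names and numbers are made up):

```python
# Minimal sketch of a coverage calculation; question IDs and data are illustrative.
eligible_questions = {"q1", "q2", "q3", "q4", "q5"}   # the defined eligible pool
forecasted_questions = {"q1", "q3"}                   # questions this user actually forecast

def coverage(forecasted: set, eligible: set) -> float:
    """Share of the eligible pool the user actually forecast."""
    if not eligible:
        return 0.0
    # Count only forecasts on questions that belong to the eligible pool.
    return len(forecasted & eligible) / len(eligible)

print(f"coverage = {coverage(forecasted_questions, eligible_questions):.0%}")  # 40%
```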
Why coverage matters even with Brier score
The Brier score is a proper scoring rule: it rewards honest probabilities on the questions you do answer. It says nothing about which questions you chose to answer, so it does not solve selection bias by itself.
If you only forecast easy questions, your squared errors will be small. A low score on a hand-picked sample does not prove you can forecast broadly.
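A small sketch makes the gap visible. Both the probabilities and the outcomes below are invented for illustration; the point is only that hand-picked easy questions produce a flattering Brier score:

```python
# Sketch of why a strong Brier score on hand-picked easy questions proves little;
# all probabilities and outcomes are invented for illustration.

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# User A forecasts only near-certain questions and gets them right.
easy_only = brier([0.97, 0.95, 0.98], [1, 1, 1])

# User B forecasts genuinely uncertain questions with honest probabilities.
broad = brier([0.6, 0.4, 0.7, 0.3], [1, 0, 0, 1])

print(f"easy-only Brier: {easy_only:.3f}")  # ~0.001, looks far 'better'
print(f"broad Brier:     {broad:.3f}")      # ~0.325, but earned on harder questions
```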
Three practical ways to reduce selection bias
1) Define an eligible question set
Create a clear pool of eligible markets or questions and measure coverage against that pool.
Example:
• all binary markets that were open for at least 24 hours
• all markets in selected categories
• all markets that meet minimum liquidity
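A minimal sketch of such an eligibility filter, assuming hypothetical Market fields (is_binary, open_hours, category, liquidity) rather than any real market API:

```python
# Sketch of defining an eligible pool; the Market fields below are assumptions,
# not a real market API.
from dataclasses import dataclass

@dataclass
class Market:
    id: str
    is_binary: bool
    open_hours: float
    category: str
    liquidity: float

ELIGIBLE_CATEGORIES = {"politics", "economics", "science"}

def is_eligible(m: Market) -> bool:
    """Apply the pool rules: binary, open >= 24h, allowed category, minimum liquidity."""
    return (
        m.is_binary
        and m.open_hours >= 24
        and m.category in ELIGIBLE_CATEGORIES
        and m.liquidity >= 1_000
    )

markets = [
    Market("m1", True, 72, "politics", 5_000),
    Market("m2", True, 6, "politics", 5_000),    # open less than 24 hours
    Market("m3", False, 48, "science", 2_000),   # not binary
]
eligible_pool = [m for m in markets if is_eligible(m)]
print([m.id for m in eligible_pool])  # ['m1']
```

Coverage for every user is then measured against this same pool, so rankings compare like with like.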
2) Use evaluation checkpoints
Many leaderboard systems accidentally reward late forecasting.
Use an evaluation checkpoint rule, such as scoring the forecast that existed at T-24h (24 hours before resolution), so users cannot win by waiting until the last minute.
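One way to implement this is to score the latest forecast a user had standing at the checkpoint. The sketch below assumes forecasts arrive as (timestamp, probability) pairs; the shape and names are illustrative:

```python
# Sketch of a T-24h checkpoint rule: score the latest forecast standing 24 hours
# before resolution. The (timestamp, probability) shape is an assumption.
from datetime import datetime, timedelta

def forecast_at_checkpoint(forecasts, resolution_time, hours_before=24):
    """Return the probability of the latest forecast made on or before the checkpoint.

    `forecasts` is a list of (timestamp, probability) pairs in any order.
    Returns None if the user had no forecast standing at the checkpoint.
    """
    checkpoint = resolution_time - timedelta(hours=hours_before)
    standing = [(t, p) for t, p in forecasts if t <= checkpoint]
    if not standing:
        return None
    return max(standing, key=lambda tp: tp[0])[1]

resolution = datetime(2024, 6, 1, 12, 0)
history = [
    (datetime(2024, 5, 28, 9, 0), 0.55),
    (datetime(2024, 5, 31, 20, 0), 0.95),  # last-minute update, after the checkpoint
]
print(forecast_at_checkpoint(history, resolution))  # 0.55; the late 0.95 is ignored
```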
3) Require minimum activity
For a credible track record, set minimums like:
• minimum N forecasts
• minimum number of active days
• minimum coverage percentage
These are simple guardrails that make gaming harder.
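A sketch of such a guardrail check; the threshold values are illustrative, not recommendations:

```python
# Sketch of minimum-activity guardrails; thresholds are illustrative.
MIN_FORECASTS = 30
MIN_ACTIVE_DAYS = 14
MIN_COVERAGE = 0.25

def is_track_record_credible(n_forecasts: int, active_days: int, coverage: float) -> bool:
    """Return True only when every minimum-activity guardrail is cleared."""
    return (
        n_forecasts >= MIN_FORECASTS
        and active_days >= MIN_ACTIVE_DAYS
        and coverage >= MIN_COVERAGE
    )

print(is_track_record_credible(n_forecasts=40, active_days=20, coverage=0.30))  # True
print(is_track_record_credible(n_forecasts=40, active_days=20, coverage=0.05))  # False
```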
Market consensus can help, but not by itself
If you benchmark against market consensus, you still have selection bias if users pick only the markets where the market is already very sure.
That is why market benchmarking should be paired with:
• coverage reporting
• horizon or checkpoint rules
• minimum N requirements
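Putting these together, the sketch below reports a market-relative Brier skill score only when the guardrails are met. The data shapes, thresholds, and function names are assumptions for illustration:

```python
# Sketch of benchmarking against market consensus only when guardrails are met;
# data shapes and thresholds are assumptions.

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def skill_vs_market(user_probs, market_probs, outcomes, coverage,
                    min_n=30, min_coverage=0.25):
    """Brier skill score vs the market: 1 - Brier_user / Brier_market.

    Positive means the user beat consensus at the checkpoint. Returns None when
    the guardrails (minimum N, minimum coverage) are not met, so no rank is reported.
    """
    if len(user_probs) < min_n or coverage < min_coverage:
        return None
    return 1 - brier(user_probs, outcomes) / brier(market_probs, outcomes)

user_probs = [0.7, 0.3, 0.6]
market_probs = [0.6, 0.4, 0.5]
outcomes = [1, 0, 1]
print(skill_vs_market(user_probs, market_probs, outcomes, coverage=0.4, min_n=3))  # ~0.40
```

Returning None instead of a number makes it explicit that thin or cherry-picked track records are not ranked at all.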
How to read a scorecard with bias in mind
When you see a scorecard, check:
• sample size (N)
• coverage percentage
• horizon or checkpoint definition
• category mix
• whether results are stable in a rolling window
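As a sketch, a scorecard record can carry that context explicitly so a score is never displayed without it; the field names here are illustrative, not a real schema:

```python
# Sketch of the context a scorecard row should carry alongside the score;
# field names are illustrative, not a real schema.
from dataclasses import dataclass, field

@dataclass
class ScorecardRow:
    user_id: str
    brier_skill_score: float
    n_forecasts: int                 # sample size (N)
    coverage: float                  # share of the eligible pool forecast
    checkpoint: str                  # e.g. "T-24h"
    category_mix: dict = field(default_factory=dict)           # category -> share of forecasts
    rolling_window_scores: list = field(default_factory=list)  # recent windows, to check stability

def has_required_context(row: ScorecardRow) -> bool:
    """A score missing its context should not be displayed or ranked."""
    return row.n_forecasts > 0 and 0 < row.coverage <= 1 and bool(row.checkpoint)
```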
Common mistakes
Mistake: comparing users with different question sets
Without common eligibility and coverage reporting, rankings are not meaningful.
Mistake: rewarding late forecasts
Ignoring forecast horizon makes the game about timing, not forecasting.
Mistake: ignoring liquidity filters
Thin markets can be noisy. If you benchmark against market consensus, include liquidity filters or flags.
Takeaway
Selection bias is the number one reason forecasting leaderboards mislead. The fix is not a fancier metric. The fix is rules: define eligibility, track coverage, use checkpoints, and require minimum activity so the score reflects skill instead of cherry-picking.
Related
• Coverage