Selection Bias and Coverage: How People Accidentally Fake Skill
The leaderboard problem
If you let users choose what to forecast, you create a measurement problem: people do not forecast the same set of questions.
That means differences in score can come from:
• real forecasting skill
• picking easier questions
• avoiding high-variance questions
• forecasting only when the answer is already obvious
This is selection bias.
What selection bias looks like in practice
Common patterns:
• only forecasting near-certain events (probabilities of 0.90 to 0.99) to avoid looking wrong
• only forecasting after a big news update, just before resolution
• skipping categories with low signal and forecasting only in your comfort zone
• deleting or avoiding hard questions after early misses
The result is a scorecard that looks strong but does not generalize.
Coverage is the simplest defense
Coverage is the share of eligible questions a user actually forecasts.
A scorecard should never show performance without coverage context. A great Brier skill score with very low coverage is a red flag.
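As a concrete illustration, here is a minimal sketch of the coverage calculation, assuming forecasts and the eligible pool are available as simple sets of question IDs (all names and numbers are made up):

```python
# Minimal sketch of a coverage calculation; question IDs and data are illustrative.
eligible_questions = {"q1", "q2", "q3", "q4", "q5"}   # the defined eligible pool
forecasted_questions = {"q1", "q3"}                   # questions this user actually forecast

def coverage(forecasted: set, eligible: set) -> float:
    """Share of the eligible pool the user actually forecast."""
    if not eligible:
        return 0.0
    # Count only forecasts on questions that belong to the eligible pool.
    return len(forecasted & eligible) / len(eligible)

print(f"coverage = {coverage(forecasted_questions, eligible_questions):.0%}")  # 40%
```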
Why coverage matters even with Brier score
The Brier score is a proper scoring rule: it rewards honest probabilities on the questions you do answer. It says nothing about which questions you chose to answer, so it does not solve selection bias by itself.
If you only forecast easy questions, your squared errors will be small. A low score on a hand-picked sample does not prove you can forecast broadly.
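A small sketch makes the gap visible. Both the probabilities and the outcomes below are invented for illustration; the point is only that hand-picked easy questions produce a flattering Brier score:

```python
# Sketch of why a strong Brier score on hand-picked easy questions proves little;
# all probabilities and outcomes are invented for illustration.

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# User A forecasts only near-certain questions and gets them right.
easy_only = brier([0.97, 0.95, 0.98], [1, 1, 1])

# User B forecasts genuinely uncertain questions with honest probabilities.
broad = brier([0.6, 0.4, 0.7, 0.3], [1, 0, 0, 1])

print(f"easy-only Brier: {easy_only:.3f}")  # ~0.001, looks far 'better'
print(f"broad Brier:     {broad:.3f}")      # ~0.325, but earned on harder questions
```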
Three practical ways to reduce selection bias
1) Define an eligible question set
Create a clear pool of eligible markets or questions and measure coverage against that pool.
Example:
• all binary markets that were open for at least 24 hours
• all markets in selected categories
• all markets that meet minimum liquidity
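A minimal sketch of such an eligibility filter, assuming hypothetical Market fields (is_binary, open_hours, category, liquidity) rather than any real market API:

```python
# Sketch of defining an eligible pool; the Market fields below are assumptions,
# not a real market API.
from dataclasses import dataclass

@dataclass
class Market:
    id: str
    is_binary: bool
    open_hours: float
    category: str
    liquidity: float

ELIGIBLE_CATEGORIES = {"politics", "economics", "science"}

def is_eligible(m: Market) -> bool:
    """Apply the pool rules: binary, open >= 24h, allowed category, minimum liquidity."""
    return (
        m.is_binary
        and m.open_hours >= 24
        and m.category in ELIGIBLE_CATEGORIES
        and m.liquidity >= 1_000
    )

markets = [
    Market("m1", True, 72, "politics", 5_000),
    Market("m2", True, 6, "politics", 5_000),    # open less than 24 hours
    Market("m3", False, 48, "science", 2_000),   # not binary
]
eligible_pool = [m for m in markets if is_eligible(m)]
print([m.id for m in eligible_pool])  # ['m1']
```

Coverage for every user is then measured against this same pool, so rankings compare like with like.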
2) Use evaluation checkpoints
Many leaderboard systems accidentally reward late forecasting.
Use an evaluation checkpoint rule, such as scoring the forecast that existed at T-24h (24 hours before resolution), so users cannot win by waiting until the last minute.
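One way to implement this is to score the latest forecast a user had standing at the checkpoint. The sketch below assumes forecasts arrive as (timestamp, probability) pairs; the shape and names are illustrative:

```python
# Sketch of a T-24h checkpoint rule: score the latest forecast standing 24 hours
# before resolution. The (timestamp, probability) shape is an assumption.
from datetime import datetime, timedelta

def forecast_at_checkpoint(forecasts, resolution_time, hours_before=24):
    """Return the probability of the latest forecast made on or before the checkpoint.

    `forecasts` is a list of (timestamp, probability) pairs in any order.
    Returns None if the user had no forecast standing at the checkpoint.
    """
    checkpoint = resolution_time - timedelta(hours=hours_before)
    standing = [(t, p) for t, p in forecasts if t <= checkpoint]
    if not standing:
        return None
    return max(standing, key=lambda tp: tp[0])[1]

resolution = datetime(2024, 6, 1, 12, 0)
history = [
    (datetime(2024, 5, 28, 9, 0), 0.55),
    (datetime(2024, 5, 31, 20, 0), 0.95),  # last-minute update, after the checkpoint
]
print(forecast_at_checkpoint(history, resolution))  # 0.55; the late 0.95 is ignored
```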
3) Require minimum activity
For a credible track record, set minimums like:
• minimum N forecasts
• minimum number of active days
• minimum coverage percentage
These are simple guardrails that make gaming harder.
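A sketch of such a guardrail check; the threshold values are illustrative, not recommendations:

```python
# Sketch of minimum-activity guardrails; thresholds are illustrative.
MIN_FORECASTS = 30
MIN_ACTIVE_DAYS = 14
MIN_COVERAGE = 0.25

def is_track_record_credible(n_forecasts: int, active_days: int, coverage: float) -> bool:
    """Return True only when every minimum-activity guardrail is cleared."""
    return (
        n_forecasts >= MIN_FORECASTS
        and active_days >= MIN_ACTIVE_DAYS
        and coverage >= MIN_COVERAGE
    )

print(is_track_record_credible(n_forecasts=40, active_days=20, coverage=0.30))  # True
print(is_track_record_credible(n_forecasts=40, active_days=20, coverage=0.05))  # False
```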
Market consensus can help, but not by itself
If you benchmark against market consensus, you still have selection bias if users pick only the markets where the market is already very sure.
That is why market benchmarking should be paired with:
• coverage reporting
• horizon or checkpoint rules
• minimum N requirements
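Putting these together, the sketch below reports a market-relative Brier skill score only when the guardrails are met. The data shapes, thresholds, and function names are assumptions for illustration:

```python
# Sketch of benchmarking against market consensus only when guardrails are met;
# data shapes and thresholds are assumptions.

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def skill_vs_market(user_probs, market_probs, outcomes, coverage,
                    min_n=30, min_coverage=0.25):
    """Brier skill score vs the market: 1 - Brier_user / Brier_market.

    Positive means the user beat consensus at the checkpoint. Returns None when
    the guardrails (minimum N, minimum coverage) are not met, so no rank is reported.
    """
    if len(user_probs) < min_n or coverage < min_coverage:
        return None
    return 1 - brier(user_probs, outcomes) / brier(market_probs, outcomes)

user_probs = [0.7, 0.3, 0.6]
market_probs = [0.6, 0.4, 0.5]
outcomes = [1, 0, 1]
print(skill_vs_market(user_probs, market_probs, outcomes, coverage=0.4, min_n=3))  # ~0.40
```

Returning None instead of a number makes it explicit that thin or cherry-picked track records are not ranked at all.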
How to read a scorecard with bias in mind
When you see a scorecard, check:
• sample size (N)
• coverage percentage
• horizon or checkpoint definition
• category mix
• whether results are stable in a rolling window
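As a sketch, a scorecard record can carry that context explicitly so a score is never displayed without it; the field names here are illustrative, not a real schema:

```python
# Sketch of the context a scorecard row should carry alongside the score;
# field names are illustrative, not a real schema.
from dataclasses import dataclass, field

@dataclass
class ScorecardRow:
    user_id: str
    brier_skill_score: float
    n_forecasts: int                 # sample size (N)
    coverage: float                  # share of the eligible pool forecast
    checkpoint: str                  # e.g. "T-24h"
    category_mix: dict = field(default_factory=dict)           # category -> share of forecasts
    rolling_window_scores: list = field(default_factory=list)  # recent windows, to check stability

def has_required_context(row: ScorecardRow) -> bool:
    """A score missing its context should not be displayed or ranked."""
    return row.n_forecasts > 0 and 0 < row.coverage <= 1 and bool(row.checkpoint)
```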
Common mistakes
Mistake: comparing users with different question sets
Without common eligibility and coverage reporting, rankings are not meaningful.
Mistake: rewarding late forecasts
Ignoring forecast horizon makes the game about timing, not forecasting.
Mistake: ignoring liquidity filters
Thin markets can be noisy. If you benchmark against market consensus, include liquidity filters or flags.
Takeaway
Selection bias is the number one reason forecasting leaderboards mislead. The fix is not a fancier metric. The fix is rules: define eligibility, track coverage, use checkpoints, and require minimum activity so the score reflects skill instead of cherry-picking.
Related
• Coverage