
How to Read a Forecast Scorecard

January 1, 2026

What a scorecard should answer

A scorecard should answer two things:

• How accurate are the probabilities?

• Is that performance real, or is it inflated by small samples, weak benchmarks, or cherry-picking?

Start with these four fields

1) Brier score (BS)

Brier score is the average squared error between your forecast probabilities and the outcomes (1 if the event happened, 0 if it did not). Lower is better. It is a clean measure of probability accuracy, but it depends on the question set, so raw Brier scores are not comparable across forecasters who answered different questions.
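
For concreteness, here is a minimal sketch in Python (the data is hypothetical):

# Brier score for binary questions: mean squared error between
# forecast probabilities and realized outcomes (1 = happened, 0 = did not).
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical example: three scored forecasts.
print(brier_score([0.7, 0.2, 0.9], [1, 0, 1]))  # about 0.047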

2) Brier skill score (BSS)

Brier skill score tells you whether you beat a benchmark on the same questions. For comparisons and leaderboards, BSS is usually more informative than BS.
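
As a sketch, BSS rescales your Brier score against the benchmark's Brier score on the same questions (the numbers below are illustrative):

# Brier skill score: 1 is perfect, 0 matches the benchmark,
# negative means you did worse than the benchmark.
def brier_skill_score(bs_forecaster, bs_benchmark):
    return 1.0 - bs_forecaster / bs_benchmark

# Illustrative example against a 50/50 benchmark, whose BS is always 0.25.
print(brier_skill_score(0.18, 0.25))  # 0.28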

3) Sample size (N)

Sample size is how many forecasts were scored. With small N, scores can swing a lot. Always read BS or BSS next to N.
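
A toy simulation makes the point (assumed setup: a forecaster who always says 0.7 on events whose true rate is 0.7):

import random

def simulated_bs(n, p=0.7):
    # Outcomes occur with true rate p; the forecaster always says p.
    outcomes = [1 if random.random() < p else 0 for _ in range(n)]
    return sum((p - o) ** 2 for o in outcomes) / n

# The spread of scores across resamples shrinks roughly as 1/sqrt(N).
for n in (10, 100, 1000):
    scores = [simulated_bs(n) for _ in range(2000)]
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    print(f"N={n:5d}  typical swing (sd) = {sd:.3f}")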

4) Coverage and participation

Coverage is the share of eligible questions you forecasted. Low coverage can hide selection bias. Pair it with participation rate to see consistency over time.
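
Coverage itself is a simple ratio; a sketch with hypothetical counts:

# Coverage: share of eligible questions that received a forecast.
def coverage(num_forecasted, num_eligible):
    return num_forecasted / num_eligible

# Hypothetical: 42 forecasts on 120 eligible questions.
print(f"coverage = {coverage(42, 120):.0%}")  # 35%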

Calibration section: do your probabilities mean what they say

Calibration answers: when you say 70%, does it happen about 70% of the time?

Look for:

• a calibration table with bucket counts and realized frequencies (a minimal sketch follows this list)

• a reliability diagram or calibration curve

• patterns of overconfidence or underconfidence
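
Here is that calibration table sketched in Python (illustrative data, ten equal-width buckets; real scorecards may bucket differently):

def calibration_table(probs, outcomes, n_buckets=10):
    # Group forecasts into equal-width probability buckets, then compare
    # the average forecast in each bucket with the realized frequency.
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(probs, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets[i].append((p, o))
    for i, b in enumerate(buckets):
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        freq = sum(o for _, o in b) / len(b)
        print(f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}: "
              f"n={len(b):3d}  avg forecast={avg_p:.2f}  realized={freq:.2f}")

A bucket whose realized frequency sits well below the average forecast is overconfidence; well above is underconfidence.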

Sharpness and forecast distribution

Calibration alone is not enough. A forecaster who always predicts the base rate can be perfectly calibrated yet never says anything useful, so two forecasters can be similarly calibrated while behaving very differently.

Check:

• forecast distribution (are you stuck near 0.50? see the sketch after this list)

• use of extreme probabilities (valuable when justified, costly when wrong)

• the tradeoff between sharpness and calibration
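
One illustrative check in Python (the thresholds are assumptions, not a standard):

def sharpness_summary(probs, near=0.10, extreme=0.05):
    # Share of forecasts hedged near 0.50, and share committed to
    # near-certain probabilities. Both thresholds are illustrative.
    hedged = sum(abs(p - 0.5) <= near for p in probs) / len(probs)
    extremes = sum(p <= extreme or p >= 1 - extreme for p in probs) / len(probs)
    print(f"near 0.50: {hedged:.0%}   near 0 or 1: {extremes:.0%}")

sharpness_summary([0.50, 0.55, 0.45, 0.90, 0.02, 0.60, 0.48])
# near 0.50: 71%   near 0 or 1: 14%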

Breakdowns that make the score credible

By horizon: split by forecast horizon or use fixed evaluation checkpoints. Early forecasts are harder than last-minute updates.

By category: mixed domains can hide weaknesses. A score is more meaningful when you can see where it comes from.

Over time: use a rolling window to detect forecast drift or calibration drift.
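
A minimal rolling-window sketch (time-ordered inputs assumed; the window size is arbitrary):

def rolling_brier(probs, outcomes, window=50):
    # Brier score over a sliding window of time-ordered forecasts.
    # A curve that trends upward suggests forecast or calibration drift.
    scores = []
    for end in range(window, len(probs) + 1):
        pairs = zip(probs[end - window:end], outcomes[end - window:end])
        scores.append(sum((p - o) ** 2 for p, o in pairs) / window)
    return scores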

Methodology: the part people skip

A scorecard is only comparable if the methodology is clear. Confirm:

• which benchmark was used for BSS (50/50, base rate, or market consensus)

• how consensus was defined (for example, mid-price or VWAP, and over which time window)

• how buckets were built (count and ranges)

• whether probability clipping was applied (a short sketch follows this list)

• which forecast was scored (last update before close, checkpoint rule)
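
As one example from this list, probability clipping bounds forecasts away from 0 and 1 before scoring, so a single certain-but-wrong forecast cannot dominate. A sketch with an assumed epsilon:

def clip_prob(p, eps=0.01):
    # Bound forecasts to [eps, 1 - eps]; eps = 0.01 is an assumed
    # value here, not a standard one.
    return min(max(p, eps), 1 - eps)

print(clip_prob(0.0))    # 0.01
print(clip_prob(0.999))  # 0.99

Whether clipping helps or hurts you depends on how often you use extreme probabilities, which is why the scorecard should say whether it was applied.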

Red flags

• great BSS with very low coverage

• strong claims with tiny N

• no calibration section

• benchmark not defined

• unclear timestamps or no audit trail

Takeaway

Read a scorecard in this order: BSS, BS, N, coverage. Then validate with calibration, horizon splits, and methodology. If those pieces are missing, treat the headline score as marketing, not measurement.
