Brier Score vs Brier Skill Score: When to Use Which
Why this distinction matters
Brier score (BS) is a raw error metric. It tells you how close your probabilities were to outcomes on a specific set of questions.
But raw BS is hard to compare across datasets. If one dataset is easier, has a different base rate, or includes many near-certain events, BS can look better even when the forecaster's skill is no higher.
That is why Brier skill score (BSS) exists: it measures how much you beat a chosen benchmark.
What Brier score tells you
BS is the average squared error:
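BS = (1/N) * sum((p_i - o_i)^2)
Here p_i is the forecast probability for question i, o_i is the outcome (1 if the event happened, 0 if it did not), and N is the number of questions.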
Lower is better. BS is useful for tracking your own progress on a stable question set.
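A minimal Python sketch of the formula above, assuming forecasts are probabilities between 0 and 1 and outcomes are coded 0 or 1. The function name brier_score and the two toy question sets are illustrative and do not come from any particular library. The example also hints at why a stable question set matters: the same no-skill strategy scores very differently on the two sets.

```python
def brier_score(probs, outcomes):
    """Average squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equally long, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical set A: 9 of 10 events occur (many near-certain events).
outcomes_a = [1] * 9 + [0]
# Hypothetical set B: 5 of 10 events occur.
outcomes_b = [1] * 5 + [0] * 5

# Same no-skill strategy on both sets: always forecast the set's base rate.
bs_a = brier_score([0.9] * 10, outcomes_a)
bs_b = brier_score([0.5] * 10, outcomes_b)

print(f"set A: {bs_a:.3f}")  # set A: 0.090
print(f"set B: {bs_b:.3f}")  # set B: 0.250
```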
What Brier skill score tells you
BSS compares your BS to a reference forecast on the same questions:
BSS = 1 - (BS / BS_ref)
Here BS_ref is the Brier score of the benchmark forecast on those same questions.
How to interpret BSS
• BSS = 1.00 means perfect forecasting (BS = 0).
• BSS > 0 means you beat the benchmark.
• BSS = 0.00 means you match the benchmark.
• BSS < 0 means you are worse than the benchmark. There is no lower bound.
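The formula and the interpretation rules can be sketched in a few lines of Python. The function name brier_skill_score is an illustrative choice, and raising an error when BS_ref is zero is an assumption about how to handle a perfect benchmark, where the ratio is undefined.

```python
def brier_skill_score(bs, bs_ref):
    """BSS = 1 - BS / BS_ref, with both scores computed on the same questions."""
    if bs_ref <= 0:
        # A perfect benchmark (BS_ref = 0) leaves the ratio undefined.
        raise ValueError("BS_ref must be positive")
    return 1 - bs / bs_ref


print(brier_skill_score(0.0, 0.25))   # 1.0  (perfect forecasts)
print(brier_skill_score(0.25, 0.25))  # 0.0  (matches the benchmark)
print(brier_skill_score(0.30, 0.25))  # negative: worse than the benchmark
```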
Worked example
You score BS = 0.160.
Your benchmark scores BS_ref = 0.200.
Then:
BSS = 1 - (0.160 / 0.200) = 0.200
That means your Brier score is 20% lower than the benchmark's on this dataset: 0.040 less squared error out of 0.200.
Choosing the benchmark
50/50: good for teaching and sanity checks, often too weak for real evaluation.
Base rate: a strong default. Forecasting the same base rate for every event is the reference forecast known as climatology.
Market consensus: strong when market prices are reliable and liquidity is decent. Define consensus clearly, for example as the mid price or VWAP over a defined window.
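To see how much the benchmark choice moves the result, here is a small Python sketch that scores one set of forecasts against all three benchmarks. All numbers are made up for illustration, the market list stands in for whatever consensus definition you pick, and the base rate is computed from the same outcomes for simplicity, whereas in practice it would usually come from historical data available before the questions resolve.

```python
def brier_score(probs, outcomes):
    """Average squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


# Hypothetical data: 8 resolved questions, outcomes coded 0/1.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]
forecasts = [0.8, 0.2, 0.1, 0.7, 0.3, 0.2, 0.4, 0.6]
# Hypothetical market consensus per question (for example a mid price).
market = [0.7, 0.3, 0.2, 0.6, 0.3, 0.3, 0.5, 0.6]

# Simplification: in-sample base rate (share of events that occurred).
base_rate = sum(outcomes) / len(outcomes)

benchmarks = {
    "50/50": [0.5] * len(outcomes),
    "base rate": [base_rate] * len(outcomes),
    "market": market,
}

bs = brier_score(forecasts, outcomes)
results = {}
for name, ref_probs in benchmarks.items():
    bs_ref = brier_score(ref_probs, outcomes)
    results[name] = (bs_ref, 1 - bs / bs_ref)
    print(f"{name:10s} BS_ref={bs_ref:.3f}  BSS={results[name][1]:.3f}")
```

On this toy data the same forecasts earn the highest BSS against 50/50 and the lowest against the market, which is why a BSS should always be reported together with its benchmark.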
When to use BS vs BSS
Use Brier score when:
• you want a simple absolute error metric
• you track one person over time on a stable question set
• you pair it with calibration diagnostics
Use Brier skill score when:
• you compare forecasters across different datasets
• you want to know if you beat base rate or the market
• you want leaderboards that are more robust to selection bias
Common mistakes
Benchmark drift: changing the benchmark definition makes BSS values not comparable. Document your methodology.
Ignoring coverage: if you only answer easy questions, low coverage can make both BS and BSS look better than they should. Report coverage alongside the scores.
Mixing horizons: compare forecasts at consistent forecast horizons or use checkpoints.
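One way to guard against the coverage and horizon mistakes is to score each checkpoint separately and report coverage next to the scores. The Python sketch below assumes a hypothetical record layout (question id, days to resolution, forecast, benchmark probability, outcome) with None marking a skipped question. The layout and the numbers are illustrative only.

```python
from collections import defaultdict

# Hypothetical records: (question_id, horizon_days, forecast, benchmark, outcome).
# forecast is None where the forecaster skipped the question at that checkpoint.
records = [
    ("q1", 30, 0.6, 0.5, 1),
    ("q1", 7, 0.9, 0.7, 1),
    ("q2", 30, None, 0.4, 0),
    ("q2", 7, 0.1, 0.3, 0),
    ("q3", 30, 0.3, 0.5, 0),
    ("q3", 7, None, 0.2, 0),
]


def mean_sq_err(pairs):
    """Brier score over (probability, outcome) pairs."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Group rows by checkpoint so forecasts are only compared at the same horizon.
by_horizon = defaultdict(list)
for _, horizon, forecast, benchmark, outcome in records:
    by_horizon[horizon].append((forecast, benchmark, outcome))

report = {}
for horizon, rows in sorted(by_horizon.items()):
    answered = [(f, b, o) for f, b, o in rows if f is not None]
    if not answered:
        continue  # nothing to score at this checkpoint
    coverage = len(answered) / len(rows)
    bs = mean_sq_err([(f, o) for f, _, o in answered])
    # The benchmark is scored on the same answered questions only.
    bs_ref = mean_sq_err([(b, o) for _, b, o in answered])
    report[horizon] = (coverage, bs, 1 - bs / bs_ref)
    print(f"{horizon:>3}d  coverage={coverage:.2f}  BS={bs:.3f}  BSS={report[horizon][2]:.3f}")
```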
Takeaway
Brier score measures raw error. Brier skill score measures performance relative to a benchmark. For comparisons and leaderboards, BSS is usually the better headline metric, as long as the benchmark and methodology are defined clearly.
Related
• What Is the Brier Score and What It Measures
• Choosing a Baseline: 50 50 vs Base Rate vs Market Consensus