Brier Score Calculator: Step by Step With Examples
What you are calculating
The Brier score measures the accuracy of probabilistic forecasts for a binary event (YES or NO).
Each forecast is scored by squared error:
(p - o)^2
• p is your predicted probability of YES
• o is the outcome (1 if YES, 0 if NO)
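As a minimal sketch in Python (the function name brier_single is illustrative, not from any particular library), the per-forecast score is:
def brier_single(p, o):
    """Squared error for one binary forecast: p is the forecast P(YES), o is 1 or 0."""
    return (p - o) ** 2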
Single forecast examples
Example 1: you say 0.70 and YES happens
• p = 0.70
• o = 1
• score = (0.70 - 1)^2 = 0.09
Example 2: you say 0.70 and NO happens
• p = 0.70
• o = 0
• score = (0.70 - 0)^2 = 0.49
Example 3: you say 0.55 and NO happens
• score = (0.55 - 0)^2 = 0.3025
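Using the brier_single sketch above, the three examples reproduce (rounded to avoid floating-point noise):
print(round(brier_single(0.70, 1), 4))  # 0.09   (example 1)
print(round(brier_single(0.70, 0), 4))  # 0.49   (example 2)
print(round(brier_single(0.55, 0), 4))  # 0.3025 (example 3)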
Average across many forecasts
If you have N forecasts, the Brier score is the mean of the squared errors:
BS = (1/N) * sum((p_i - o_i)^2)
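The averaging step is short; here is a sketch in plain Python (brier_score is an illustrative name):
def brier_score(probs, outcomes):
    """Mean squared error over paired probability forecasts and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)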
Worked mini dataset
Suppose you made 4 forecasts:
• p = 0.70, outcome YES
• p = 0.40, outcome NO
• p = 0.60, outcome YES
• p = 0.30, outcome NO
Compute each squared error:
• (0.70 - 1)^2 = 0.09
• (0.40 - 0)^2 = 0.16
• (0.60 - 1)^2 = 0.16
• (0.30 - 0)^2 = 0.09
Average:
• BS = (0.09 + 0.16 + 0.16 + 0.09) / 4 = 0.125
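Checking the same dataset with the brier_score sketch above:
probs = [0.70, 0.40, 0.60, 0.30]
outcomes = [1, 0, 1, 0]
print(round(brier_score(probs, outcomes), 3))  # 0.125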
How to interpret the number
Lower is better.
• 0.00 is perfect
• a forecaster who always says 0.5 scores exactly 0.25 on every question, so values near 0.25 on balanced questions usually indicate weak, near coin-flip forecasting; exact interpretation still depends on the dataset
That is why you usually want a baseline and Brier skill score, not just raw BS.
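A quick sanity check of that 0.25 reference point, reusing the brier_score sketch:
coin_flip = [0.5, 0.5, 0.5, 0.5]
print(brier_score(coin_flip, [1, 0, 1, 0]))  # 0.25, regardless of the outcomes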
Add a baseline so the score means something
Raw BS is hard to compare across different question sets. Use a benchmark such as:
• base rate (recommended default)
• market consensus, when the market has real liquidity
Then compute BSS:
BSS = 1 - (BS / BS_ref)
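A sketch of that step, reusing brier_score from above; the base-rate benchmark here forecasts the overall YES frequency on every question, which is one common default:
def brier_skill_score(probs, outcomes, ref_probs):
    """BSS = 1 - BS / BS_ref; positive means better than the benchmark."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score(ref_probs, outcomes)
    return 1 - bs / bs_ref

outcomes = [1, 0, 1, 0]
base_rate = sum(outcomes) / len(outcomes)   # 0.5 on this mini dataset
ref = [base_rate] * len(outcomes)
print(round(brier_skill_score([0.70, 0.40, 0.60, 0.30], outcomes, ref), 3))  # 0.5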
Use checkpoints so timing does not dominate
If users can update forecasts until the last minute, scoring the final update rewards late information.
A practical calculator and scorecard should support an evaluation checkpoint rule such as:
• score each forecaster's last update made at or before T-24h (24 hours before resolution)
That makes comparisons fair across users and markets.
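A sketch of one way to apply that rule, assuming each user's history is a time-sorted list of (timestamp, probability) pairs (the data layout and function name are assumptions, not a fixed API):
from datetime import datetime, timedelta

def forecast_at_checkpoint(history, resolve_time, hours_before=24):
    """Return the last forecast made at or before the checkpoint, or None."""
    checkpoint = resolve_time - timedelta(hours=hours_before)
    eligible = [p for t, p in history if t <= checkpoint]
    return eligible[-1] if eligible else None

# A late update at T-2h is ignored; the standing forecast at T-30h is scored.
resolve = datetime(2024, 6, 1, 12, 0)
history = [(resolve - timedelta(hours=30), 0.60),
           (resolve - timedelta(hours=2), 0.95)]
print(forecast_at_checkpoint(history, resolve))  # 0.6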
Calculator inputs and outputs
Inputs
• a list of forecasts (probability p)
• outcomes (0 or 1)
• optional: category labels and timestamps
• optional: benchmark probabilities (base rate or market consensus)
Outputs
• Brier score (overall)
• Brier skill score vs selected benchmark
• sample size (N) and coverage
• calibration buckets for diagnostics
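Putting the pieces together, a scorecard function might look like this sketch, reusing brier_score and brier_skill_score from above; every name and the bucket layout are illustrative choices, not a standard:
def scorecard(probs, outcomes, ref_probs, n_available=None, n_buckets=5):
    """Overall BS, BSS vs the benchmark, N, coverage, and calibration buckets."""
    n = len(probs)
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(probs, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets[i].append((p, o))
    calibration = [
        {"bucket": f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}",
         "mean_forecast": sum(p for p, _ in b) / len(b),
         "observed_rate": sum(o for _, o in b) / len(b),
         "count": len(b)}
        for i, b in enumerate(buckets) if b
    ]
    return {
        "brier_score": brier_score(probs, outcomes),
        "brier_skill_score": brier_skill_score(probs, outcomes, ref_probs),
        "n": n,
        "coverage": n / n_available if n_available else None,
        "calibration": calibration,
    }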
Common mistakes
Mixing horizons: early and late forecasts have different difficulty. Use checkpoint scoring or horizon splits. See Forecast Horizon.
Ignoring selection bias: if users choose only easy questions, BS improves without real skill. Track coverage. See Selection Bias and Coverage.
No baseline: BS alone is not comparable across datasets. Use BSS vs base rate or market consensus.
Takeaway
Brier score is easy to compute: squared error per forecast, then average. The hard part is making the result meaningful. Add a baseline (BSS), track N and coverage, and use checkpoints so the score reflects forecasting skill rather than timing.
Related
• Brier Score vs Brier Skill Score