Brier Score Calculator: Step by Step With Examples
What you are calculating
The Brier score measures the accuracy of probabilistic forecasts for a binary event (YES or NO).
Each forecast is scored by squared error:
(p - o)^2
• p is your predicted probability of YES
• o is the outcome (1 if YES, 0 if NO)
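As a minimal sketch in Python (the function name brier_single is illustrative, not from any particular library), the per-forecast score is:
def brier_single(p, o):
    """Squared error for one binary forecast: p is the forecast P(YES), o is 1 or 0."""
    return (p - o) ** 2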
Single forecast examples
Example 1: you say 0.70 and YES happens
• p = 0.70
• o = 1
• score = (0.70 - 1)^2 = 0.09
Example 2: you say 0.70 and NO happens
• p = 0.70
• o = 0
• score = (0.70 - 0)^2 = 0.49
Example 3: you say 0.55 and NO happens
• score = (0.55 - 0)^2 = 0.3025
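Using the brier_single sketch above, the three examples reproduce (rounded to avoid floating-point noise):
print(round(brier_single(0.70, 1), 4))  # 0.09   (example 1)
print(round(brier_single(0.70, 0), 4))  # 0.49   (example 2)
print(round(brier_single(0.55, 0), 4))  # 0.3025 (example 3)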
Average across many forecasts
If you have N forecasts, the Brier score is the mean of the squared errors:
BS = (1/N) * sum((p_i - o_i)^2)
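The averaging step is short; here is a sketch in plain Python (brier_score is an illustrative name):
def brier_score(probs, outcomes):
    """Mean squared error over paired probability forecasts and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)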
Worked mini dataset
Suppose you made 4 forecasts:
• p = 0.70, outcome YES
• p = 0.40, outcome NO
• p = 0.60, outcome YES
• p = 0.30, outcome NO
Compute each squared error:
• (0.70 - 1)^2 = 0.09
• (0.40 - 0)^2 = 0.16
• (0.60 - 1)^2 = 0.16
• (0.30 - 0)^2 = 0.09
Average:
• BS = (0.09 + 0.16 + 0.16 + 0.09) / 4 = 0.125
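Checking the same dataset with the brier_score sketch above:
probs = [0.70, 0.40, 0.60, 0.30]
outcomes = [1, 0, 1, 0]
print(round(brier_score(probs, outcomes), 3))  # 0.125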
How to interpret the number
Lower is better.
• 0.00 is perfect
• a forecaster who always says 0.5 scores exactly 0.25 on every question, so values near 0.25 on balanced questions usually indicate weak, near coin-flip forecasting; exact interpretation still depends on the dataset
That is why you usually want a baseline and Brier skill score, not just raw BS.
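A quick sanity check of that 0.25 reference point, reusing the brier_score sketch:
coin_flip = [0.5, 0.5, 0.5, 0.5]
print(brier_score(coin_flip, [1, 0, 1, 0]))  # 0.25, regardless of the outcomes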
Add a baseline so the score means something
Raw BS is hard to compare across different question sets. Use a benchmark such as:
• base rate (recommended default)
• market consensus, when the market has real liquidity
Then compute BSS:
BSS = 1 - (BS / BS_ref)
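A sketch of that step, reusing brier_score from above; the base-rate benchmark here forecasts the overall YES frequency on every question, which is one common default:
def brier_skill_score(probs, outcomes, ref_probs):
    """BSS = 1 - BS / BS_ref; positive means better than the benchmark."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score(ref_probs, outcomes)
    return 1 - bs / bs_ref

outcomes = [1, 0, 1, 0]
base_rate = sum(outcomes) / len(outcomes)   # 0.5 on this mini dataset
ref = [base_rate] * len(outcomes)
print(round(brier_skill_score([0.70, 0.40, 0.60, 0.30], outcomes, ref), 3))  # 0.5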
Use checkpoints so timing does not dominate
If users can update forecasts until the last minute, scoring the final update rewards late information.
A practical calculator and scorecard should support an evaluation checkpoint rule such as:
• score each forecaster's last update made at or before T-24h (24 hours before resolution)
That makes comparisons fair across users and markets.
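A sketch of one way to apply that rule, assuming each user's history is a time-sorted list of (timestamp, probability) pairs (the data layout and function name are assumptions, not a fixed API):
from datetime import datetime, timedelta

def forecast_at_checkpoint(history, resolve_time, hours_before=24):
    """Return the last forecast made at or before the checkpoint, or None."""
    checkpoint = resolve_time - timedelta(hours=hours_before)
    eligible = [p for t, p in history if t <= checkpoint]
    return eligible[-1] if eligible else None

# A late update at T-2h is ignored; the standing forecast at T-30h is scored.
resolve = datetime(2024, 6, 1, 12, 0)
history = [(resolve - timedelta(hours=30), 0.60),
           (resolve - timedelta(hours=2), 0.95)]
print(forecast_at_checkpoint(history, resolve))  # 0.6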
Calculator inputs and outputs
Inputs
• a list of forecasts (probability p)
• outcomes (0 or 1)
• optional: category labels and timestamps
• optional: benchmark probabilities (base rate or market consensus)
Outputs
• Brier score (overall)
• Brier skill score vs selected benchmark
• sample size (N) and coverage
• calibration buckets for diagnostics
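Putting the pieces together, a scorecard function might look like this sketch, reusing brier_score and brier_skill_score from above; every name and the bucket layout are illustrative choices, not a standard:
def scorecard(probs, outcomes, ref_probs, n_available=None, n_buckets=5):
    """Overall BS, BSS vs the benchmark, N, coverage, and calibration buckets."""
    n = len(probs)
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(probs, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets[i].append((p, o))
    calibration = [
        {"bucket": f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}",
         "mean_forecast": sum(p for p, _ in b) / len(b),
         "observed_rate": sum(o for _, o in b) / len(b),
         "count": len(b)}
        for i, b in enumerate(buckets) if b
    ]
    return {
        "brier_score": brier_score(probs, outcomes),
        "brier_skill_score": brier_skill_score(probs, outcomes, ref_probs),
        "n": n,
        "coverage": n / n_available if n_available else None,
        "calibration": calibration,
    }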
Common mistakes
Mixing horizons: early and late forecasts have different difficulty. Use checkpoint scoring or horizon splits. See Forecast Horizon.
Ignoring selection bias: if users choose only easy questions, BS improves without real skill. Track coverage. See Selection Bias and Coverage.
No baseline: BS alone is not comparable across datasets. Use BSS vs base rate or market consensus.
Takeaway
Brier score is easy to compute: squared error per forecast, then average. The hard part is making the result meaningful. Add a baseline (BSS), track N and coverage, and use checkpoints so the score reflects forecasting skill rather than timing.
Related
• Brier Score vs Brier Skill Score