
Data Leakage

Data leakage occurs when training or evaluation accidentally uses information that would not be available at prediction time. It makes measured performance look better than it really is.

Definition

Data leakage is the accidental use of information that “leaks” from the future or from the target outcome into the forecasting process. It can happen in models, pipelines, or manual workflows.

Why it matters

Leakage inflates measured skill. A scorecard built on leaked information is not a real track record, and the apparent skill will not hold up out of sample.
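As a rough illustration of how leakage inflates a score, the sketch below compares the Brier score of forecasts as issued with the same forecasts after they were nudged toward the known outcomes. The numbers are made up for the example, and the Brier score is used here only as one common scoring rule, not necessarily the one a given scorecard uses.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Made-up illustration data: four binary events and how they resolved.
outcomes = [1, 0, 1, 0]

# Forecasts as they stood before the outcomes were known.
honest = [0.6, 0.4, 0.5, 0.5]

# The same forecasts revised after late news, then scored as if original.
leaked = [0.9, 0.1, 0.9, 0.1]

honest_score = brier_score(honest, outcomes)
leaked_score = brier_score(leaked, outcomes)

# Lower is better, so the leaked scorecard looks far more skilled.
print(f"honest: {honest_score:.3f}")
print(f"leaked: {leaked_score:.3f}")
```

The leaked score is much lower (better), yet it says nothing about how the forecaster would perform on events that have not yet resolved.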

Common sources

• Using features that are only known after settlement.

• Joining datasets incorrectly (future rows attached to past forecasts).

• Backfilling or revising forecast probabilities after late news arrives, then scoring them as if they had been made earlier.
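The incorrect-join case can be guarded against with a point-in-time lookup: for each forecast, attach only the latest value published at or before the forecast time. The following is a minimal sketch using the standard library. The function name `latest_known_value` and the sample data are hypothetical, and it assumes each observation carries a reliable publication timestamp.

```python
from bisect import bisect_right
from datetime import datetime


def latest_known_value(observations, as_of):
    """Return the most recent value published at or before `as_of`, else None.

    `observations` is a list of (published_at, value) tuples sorted by time.
    """
    times = [t for t, _ in observations]
    idx = bisect_right(times, as_of)  # count of rows with published_at <= as_of
    return observations[idx - 1][1] if idx > 0 else None


# Hypothetical feed of published values (illustration only).
observations = [
    (datetime(2024, 1, 1), 0.30),
    (datetime(2024, 1, 5), 0.45),
    (datetime(2024, 1, 9), 0.80),  # published after the forecast below
]

forecast_time = datetime(2024, 1, 6)

# Only rows known by forecast_time are eligible; the Jan 9 row is ignored.
feature = latest_known_value(observations, forecast_time)
```

A plain join on event ID alone would have attached the January 9 value to a January 6 forecast, which is exactly the "future rows attached to past forecasts" problem.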

How to reduce it

• Enforce strict timestamps and time-based splits.

• Define evaluation checkpoints in advance and lock the forecasts at each one so they cannot be revised afterward.

• Keep an audit trail that records when each input and forecast was created, and audit the pipeline against it.
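Two of these steps can be sketched in a few lines: a time-based split that keeps everything at or after a cutoff out of training, and a checkpoint "lock" that fingerprints the forecasts so later edits are detectable. The helper names `time_split` and `lock_checkpoint`, the row layout, and the choice of a SHA-256 digest are all assumptions made for illustration. A real pipeline might instead rely on append-only storage or signed records.

```python
import hashlib
import json
from datetime import datetime


def time_split(rows, cutoff):
    """Split rows into (train, test) so nothing at or after `cutoff` is trained on."""
    train = [r for r in rows if r["ts"] < cutoff]
    test = [r for r in rows if r["ts"] >= cutoff]
    return train, test


def lock_checkpoint(forecasts):
    """Return a SHA-256 digest of the forecasts; any later change alters it."""
    payload = json.dumps(forecasts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical rows with timestamps (illustration only).
rows = [
    {"ts": datetime(2024, 1, 1), "y": 0},
    {"ts": datetime(2024, 2, 1), "y": 1},
    {"ts": datetime(2024, 3, 1), "y": 1},
]
train, test = time_split(rows, cutoff=datetime(2024, 2, 15))

# Lock the forecasts at the checkpoint and store the digest in the audit trail.
forecasts = {"event-a": 0.62, "event-b": 0.35}
digest_at_lock = lock_checkpoint(forecasts)

# A later backfill changes the digest, so the audit can flag it.
forecasts["event-a"] = 0.95
tampered = lock_checkpoint(forecasts) != digest_at_lock
```

Storing the digest alongside the checkpoint timestamp gives the audit something concrete to compare against when the scorecard is computed.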

Related

Data leakage overlaps with look-ahead bias but can also occur in subtler ways that time-based splits alone do not catch.