Data Leakage
Data leakage occurs when training or evaluation accidentally uses information that would not be available at prediction time. The resulting performance estimates look better than anything achievable in live use.
Definition
Data leakage is the accidental use of information that “leaks” from the future or from the target outcome into the forecasting process. It can happen in models, pipelines, or manual workflows.
Why it matters
Leakage inflates measured skill. A scorecard built on leaked information is not a real track record, and the apparent skill will not carry over out of sample.
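A small worked example can make the inflation concrete. The sketch below scores the same four questions twice with the Brier score (mean squared error between probability and outcome; lower is better): once with the probabilities as originally recorded, and once with probabilities edited after the outcomes were known. All numbers are hypothetical and chosen only for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical data: outcomes of four settled questions.
outcomes = [1, 0, 1, 0]

# Probabilities as recorded at forecast time.
original = [0.6, 0.4, 0.5, 0.5]

# The same forecasts after being edited with knowledge of the outcome.
backfilled = [0.9, 0.1, 0.9, 0.1]

honest = brier_score(original, outcomes)
leaked = brier_score(backfilled, outcomes)

print(f"honest Brier score: {honest:.3f}")
print(f"leaked Brier score: {leaked:.3f}")
```

Under these assumed numbers the leaked score comes out roughly twenty times better (about 0.01 versus about 0.205), even though no forecasting skill was added.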
Common sources
• Using features that are only known after settlement.
• Joining datasets incorrectly (future rows attached to past forecasts).
• Backfilling probabilities after late news, then scoring the revised values as if they were the original forecasts.
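The join problem in particular has a simple guard: attach to each forecast only the most recent row that was already known at forecast time. The following is a minimal sketch of such a point-in-time lookup, not a reference implementation. The field names known_at and poll_margin, the helper latest_known_before, and the sample rows are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def latest_known_before(feature_rows, forecast_time):
    """Return the newest row with known_at <= forecast_time, or None.

    feature_rows must be sorted by known_at in ascending order.
    """
    known_times = [row["known_at"] for row in feature_rows]
    # bisect_right counts the rows known at or before forecast_time.
    idx = bisect_right(known_times, forecast_time)
    return feature_rows[idx - 1] if idx > 0 else None


# Hypothetical feature history, sorted by the time each value became known.
features = [
    {"known_at": datetime(2024, 1, 1), "poll_margin": 2.0},
    {"known_at": datetime(2024, 1, 10), "poll_margin": 5.0},
    {"known_at": datetime(2024, 1, 20), "poll_margin": -1.0},
]

forecast_time = datetime(2024, 1, 12)
row = latest_known_before(features, forecast_time)

# The January 20 row is in the future relative to the forecast and is ignored.
print(row)
```

A forecast made before any row was known gets None rather than a borrowed future value, which makes the gap visible instead of silently filling it.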
How to reduce it
• Enforce strict timestamps and time-based splits.
• Define evaluation checkpoints and lock them.
• Keep an audit trail of when each forecast and input was recorded, and review the pipeline against it.
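The first of these checks can be sketched in a few lines. The example below assumes each record carries a forecast timestamp, a settlement timestamp, and a per-feature timestamp. The record layout and the helper names leaked_features and time_based_split are hypothetical, and the split rule shown (train only on questions settled before the cutoff, evaluate only on forecasts made at or after it) is one reasonable choice among several.

```python
from datetime import datetime


def leaked_features(record):
    """Return names of features recorded after the forecast was made."""
    return [
        name
        for name, (_value, known_at) in record["features"].items()
        if known_at > record["forecast_at"]
    ]


def time_based_split(records, cutoff):
    """Train on questions settled before cutoff; test on later forecasts."""
    train = [r for r in records if r["settled_at"] < cutoff]
    test = [r for r in records if r["forecast_at"] >= cutoff]
    return train, test


# Hypothetical records: each feature is stored as (value, time it became known).
records = [
    {"id": "q1", "forecast_at": datetime(2024, 1, 5),
     "settled_at": datetime(2024, 2, 1),
     "features": {"poll_margin": (2.0, datetime(2024, 1, 4))}},
    {"id": "q2", "forecast_at": datetime(2024, 2, 20),
     "settled_at": datetime(2024, 4, 1),
     "features": {"poll_margin": (1.0, datetime(2024, 2, 19)),
                  "final_result": (1, datetime(2024, 4, 1))}},
    {"id": "q3", "forecast_at": datetime(2024, 3, 10),
     "settled_at": datetime(2024, 5, 1),
     "features": {"poll_margin": (-3.0, datetime(2024, 3, 9))}},
]

cutoff = datetime(2024, 3, 1)
train, test = time_based_split(records, cutoff)

# Map each offending record id to the features that postdate its forecast.
flagged = {r["id"]: leaked_features(r) for r in records if leaked_features(r)}

print([r["id"] for r in train], [r["id"] for r in test], flagged)
```

Note that q2 lands in neither set: it was forecast before the cutoff but settled after it, so using it for training would expose an outcome from the evaluation period. Dropping such straddling records costs some data but keeps the two periods cleanly separated.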
Related
Data leakage overlaps with look-ahead bias but is broader: it can also occur without any timing error, for example when a feature directly encodes the target outcome.