
🚨 Data Leakage & Evaluation Pitfalls

Level: Foundation · Solves: Prevent "too good to be true" models that fail catastrophically in production

❌ The #1 ML Failure Mode

Data leakage is the leading cause of "the model works great offline but fails in production". It is invisible in offline metrics; it shows up only in business impact.

Leakage Taxonomy

┌─────────────────────────────────────────────────────────────────┐
│                    LEAKAGE TAXONOMY                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. TARGET LEAKAGE                                              │
│  ────────────────                                               │
│  Feature contains direct information about target               │
│                                                                 │
│  Example: Predicting fraud                                      │
│  ❌ Feature: "fraud_investigation_date" (only exists if fraud)  │
│  → Model learns: if investigation_date exists → fraud           │
│  → 99.9% AUC offline, 50% in production                         │
│                                                                 │
│  2. TEMPORAL LEAKAGE                                            │
│  ──────────────────                                             │
│  Using future information to predict past                       │
│                                                                 │
│  Example: Predict next-day stock price                          │
│  ❌ Using data from day T+1 to predict day T                    │
│  ❌ Random train/test split instead of temporal split           │
│                                                                 │
│  3. PREPROCESSING LEAKAGE                                       │
│  ─────────────────────────                                      │
│  Test data influences training preprocessing                    │
│                                                                 │
│  Example: Feature scaling                                       │
│  ❌ scaler.fit_transform(all_data) then split                   │
│  → Test set statistics leak into training                       │
│                                                                 │
│  4. FEATURE LEAKAGE                                             │
│  ─────────────────                                              │
│  Feature derived from target or unavailable at prediction time  │
│                                                                 │
│  Example: Predict customer churn                                │
│  ❌ Feature: "cancellation_reason" (only exists after churn)    │
│  ❌ Feature: "days_until_churn" (literally the target)          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
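The preprocessing-leakage pattern (case 3 above) can be demonstrated in a few lines. This is a minimal sketch with synthetic data, where the test distribution is deliberately shifted so the leak is visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
test = rng.normal(loc=5.0, scale=1.0, size=(50, 1))   # shifted distribution

# ❌ LEAKY: scaler statistics include the test set
leaky = StandardScaler().fit(np.vstack([train, test]))

# ✅ CORRECT: fit on train only, then transform test with train statistics
clean = StandardScaler().fit(train)

# The leaky scaler's mean is pulled toward the test distribution
print(leaky.mean_[0], clean.mean_[0])
```

The leaky scaler's mean sits between the two distributions, so the training data is normalized with statistics the model could never have seen at deployment time.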

Detection Strategies

Red Flags Checklist

| Red Flag | What It Means | Investigation |
|---|---|---|
| AUC > 0.95 on first try | Suspiciously good | Check for target leakage |
| Perfect score on a subset | Data issue | Examine that subset |
| Performance drops in production | Train/deploy mismatch | Audit the feature pipeline |
| Feature importance shows an unexpected winner | Possible proxy for the target | Analyze correlation |
| Model "learns" in 1 epoch | Too easy | Check data splitting |

Statistical Detection

```python
import numpy as np
import pandas as pd


def detect_target_leakage(df: pd.DataFrame, target: str, threshold: float = 0.9):
    """
    Detect features suspiciously correlated with the target.

    Returns numeric features with |correlation| > threshold, plus
    categorical values that only ever occur with one target class.
    """
    suspicious = []

    for col in df.columns:
        if col == target:
            continue

        if pd.api.types.is_numeric_dtype(df[col]):
            corr = abs(df[col].corr(df[target]))
            if pd.notna(corr) and corr > threshold:
                suspicious.append({
                    'feature': col,
                    'correlation': corr,
                    'reason': 'high_correlation'
                })
        else:
            # For categoricals: flag values whose subset has a single target class
            for val in df[col].dropna().unique():
                subset = df[df[col] == val]
                if subset[target].nunique() == 1:
                    suspicious.append({
                        'feature': col,
                        'value': val,
                        'reason': 'single_class_subset'
                    })

    return suspicious


def detect_temporal_leakage(df: pd.DataFrame, time_col: str,
                            train_mask: pd.Series, test_mask: pd.Series):
    """
    Check whether any test samples are timestamped at or before
    the latest training sample.
    """
    train_max = df.loc[train_mask, time_col].max()
    test_min = df.loc[test_mask, time_col].min()

    if test_min <= train_max:
        overlap = df.loc[test_mask & (df[time_col] <= train_max)]
        return {
            'leakage_detected': True,
            'overlapping_samples': len(overlap),
            'train_max': train_max,
            'test_min': test_min
        }
    return {'leakage_detected': False}
```
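The detectors above cover target and temporal leakage. A third common source, entity (group) leakage, where the same user or session lands in both splits, can be screened with a simple overlap check. The helper name and toy IDs here are illustrative:

```python
import pandas as pd

def detect_entity_overlap(train_ids: pd.Series, test_ids: pd.Series) -> set:
    """Entities appearing in both splits leak information across them."""
    return set(train_ids) & set(test_ids)

train_users = pd.Series([1, 2, 3, 4])
test_users = pd.Series([4, 5, 6])
overlap = detect_entity_overlap(train_users, test_users)
print(overlap)  # user 4 appears in both splits
```

Any non-empty overlap means the split must be redone with a group-aware strategy such as `GroupKFold`.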

Feature Availability Audit

┌─────────────────────────────────────────────────────────────────┐
│               FEATURE AVAILABILITY MATRIX                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  For EACH feature, answer:                                      │
│                                                                 │
│  1. WHEN is this feature available?                             │
│     □ At prediction time (T+0)                                  │
│     □ After prediction (T+1, T+7, ...)                          │
│     □ Only after target is known                                │
│                                                                 │
│  2. HOW is this feature computed?                               │
│     □ Raw from source system                                    │
│     □ Aggregated over time window                               │
│     □ Derived from other features                               │
│                                                                 │
│  3. WHAT data does computation depend on?                       │
│     □ Only historical data                                      │
│     □ Current data point                                        │
│     □ Future data (LEAKAGE!)                                    │
│                                                                 │
│  EXAMPLE AUDIT:                                                 │
│  ─────────────                                                  │
│  Feature: avg_purchases_last_30_days                            │
│  ✓ Available at T+0                                             │
│  ✓ Aggregated over past 30 days                                 │
│  ✓ Only uses historical data                                    │
│  → SAFE                                                         │
│                                                                 │
│  Feature: total_purchases_this_month                            │
│  ❌ If predicting mid-month, includes future data               │
│  → LEAKAGE (temporal)                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
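The two audit examples in the matrix can be made concrete with pandas. This is a sketch on synthetic daily data for one hypothetical customer; `shift(1)` keeps day T's feature strictly historical, while a calendar-month total silently includes the future:

```python
import pandas as pd

# Hypothetical daily purchase counts for one customer
purchases = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=60, freq='D'),
    'n_purchases': range(60),
}).set_index('date')

# ✅ SAFE: rolling mean over the past 30 days, shifted so day T's
# feature excludes day T itself (known only after the day ends)
safe = purchases['n_purchases'].shift(1).rolling(30).mean()

# ❌ LEAKY: calendar-month total includes days after the prediction point
leaky = purchases['n_purchases'].groupby(purchases.index.to_period('M')).transform('sum')
```

If predicting on January 1st, `leaky` already contains every purchase made through January 31st.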

Evaluation Pitfalls

Common Mistakes

┌─────────────────────────────────────────────────────────────────┐
│               EVALUATION PITFALLS                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PITFALL 1: WRONG SPLITTING                                     │
│  ───────────────────────────                                    │
│  ❌ Random split for time-series data                           │
│  ❌ No stratification for imbalanced classes                    │
│  ❌ Same entity in train and test (user/session leak)           │
│                                                                 │
│  PITFALL 2: METRIC GAMING                                       │
│  ─────────────────────────                                      │
│  ❌ Optimizing AUC when business needs precision@k              │
│  ❌ Reporting best validation score, not test score             │
│  ❌ Threshold tuning on test set                                │
│                                                                 │
│  PITFALL 3: OFFLINE/ONLINE GAP                                  │
│  ─────────────────────────────                                  │
│  ❌ Batch features in training, realtime in serving             │
│  ❌ Different preprocessing pipelines                           │
│  ❌ Stale features in production                                │
│                                                                 │
│  PITFALL 4: CALIBRATION BLINDNESS                               │
│  ──────────────────────────────                                 │
│  ❌ Good ranking but bad probability estimates                  │
│  ❌ Not checking calibration curves                             │
│  ❌ Using uncalibrated probabilities for thresholds             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
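Pitfall 2 contrasts AUC with precision@k. A minimal precision@k implementation for binary labels and real-valued scores (illustrative; the function name is ours) looks like this:

```python
import numpy as np

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Fraction of true positives among the k highest-scored items."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(y_true[top_k]))

y_true = np.array([0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.9, 0.4, 0.8, 0.3, 0.7])
print(precision_at_k(y_true, y_score, k=3))
```

A model can have excellent AUC yet poor precision@k if its top-ranked items are wrong, which is exactly the gap metric gaming hides.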

Correct Splitting Strategies

| Data Type | Split Strategy | Implementation |
|---|---|---|
| Time-series | Temporal split | Train on T-N:T-1, test on T |
| Users/entities | Group split | All of a user's data in the same split |
| Imbalanced | Stratified | Preserve class ratios |
| Multi-label | Iterative stratification | scikit-multilearn |
| Cross-validation | Time-series CV | sklearn `TimeSeriesSplit` |
```python
from sklearn.model_selection import (
    TimeSeriesSplit,
    GroupKFold,
    StratifiedKFold
)

# Time-series: never train on the future
tscv = TimeSeriesSplit(n_splits=5, gap=7)  # gap of 7 samples (7 days at daily granularity)

# Users: keep all of a user's data together
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=user_ids):
    # No user appears in both train and test
    pass

# Imbalanced: preserve class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

Calibration

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve, CalibratedClassifierCV


def check_calibration(y_true, y_prob, n_bins=10):
    """
    Check whether predicted probabilities match actual frequencies.

    Good calibration: among samples predicted at ~30%, ~30% are positive.
    """
    fraction_pos, mean_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )

    # Perfect calibration follows the diagonal
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect')
    plt.plot(mean_pred, fraction_pos, 's-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.legend()
    plt.title('Calibration Curve')
    plt.show()

    # Unweighted approximation of Expected Calibration Error (ECE);
    # the standard ECE weights each bin by its sample count
    ece = np.abs(fraction_pos - mean_pred).mean()
    print(f"ECE: {ece:.4f}")
    return ece


# Fix calibration with isotonic regression
calibrated_model = CalibratedClassifierCV(
    base_model,
    method='isotonic',  # or 'sigmoid' (Platt scaling)
    cv=5
)
calibrated_model.fit(X_train, y_train)
```

Prevention Strategies

Pipeline Design

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# ✅ CORRECT: preprocessing is fitted only on training data
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])),
    ('model', LogisticRegression())
])

# The Pipeline guarantees scaler.fit() only sees training data
pipeline.fit(X_train, y_train)            # fit on train
score = pipeline.score(X_test, y_test)    # test transformed with train statistics
```

Feature Store Discipline

┌─────────────────────────────────────────────────────────────────┐
│               FEATURE STORE PREVENTS LEAKAGE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TRAINING TIME:                                                 │
│  ──────────────                                                 │
│  get_features(entity_id, timestamp=T)                           │
│  → Returns features as they were at time T                      │
│  → Point-in-time correctness guaranteed                         │
│                                                                 │
│  SERVING TIME:                                                  │
│  ─────────────                                                  │
│  get_features(entity_id)                                        │
│  → Returns current feature values                               │
│  → Same computation as training                                 │
│                                                                 │
│  WHY THIS PREVENTS LEAKAGE:                                     │
│  ──────────────────────────                                     │
│  1. Training features computed with same logic as serving       │
│  2. Point-in-time join prevents future data in training         │
│  3. Feature definitions versioned and auditable                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
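The point-in-time join described above can be sketched with `pandas.merge_asof`, which picks, per user, the latest feature row at or before each label timestamp. The tables and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical label events and a feature table with update timestamps
labels = pd.DataFrame({
    'user_id': [1, 1, 2],
    'event_time': pd.to_datetime(['2024-03-01', '2024-03-15', '2024-03-10']),
})
features = pd.DataFrame({
    'user_id': [1, 1, 2],
    'feature_time': pd.to_datetime(['2024-02-20', '2024-03-10', '2024-03-12']),
    'avg_spend': [10.0, 25.0, 40.0],
})

# Point-in-time join: for each label, take the latest feature value
# computed at or before the label's timestamp, never after it
training = pd.merge_asof(
    labels.sort_values('event_time'),
    features.sort_values('feature_time'),
    left_on='event_time',
    right_on='feature_time',
    by='user_id',
    direction='backward',
)
print(training[['user_id', 'event_time', 'avg_spend']])
```

Note that user 2's label gets a missing feature value, because that user's only feature row was computed after the label's timestamp; a naive join would have leaked it.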

Operational Checklist

Pre-Training Gate

| Check | Required | How to Verify |
|---|---|---|
| Feature availability audit complete | ☐ | Document when each feature is available |
| Temporal split implemented | ☐ | Test timestamps > all train timestamps |
| Entity leakage prevented | ☐ | Same entity not in both train and test |
| Preprocessing fitted on train only | ☐ | Use an sklearn `Pipeline` |
| Target leakage scan passed | ☐ | Run `detect_target_leakage()` |

Pre-Deployment Gate

| Check | Required | How to Verify |
|---|---|---|
| Offline/online parity validated | ☐ | Compare feature values across pipelines |
| Calibration checked | ☐ | ECE < 0.05 |
| Business metric aligned | ☐ | Actual business KPI, not just AUC |
| Holdout evaluation done | ☐ | True holdout, not the validation set |
| Suspicious performance investigated | ☐ | AUC > 0.95 explained |

Production Monitoring

| Check | Required | Trigger |
|---|---|---|
| Feature distribution drift | ☐ | Daily |
| Prediction distribution stability | ☐ | Daily |
| Offline-online metric gap | ☐ | Weekly comparison |
| Calibration stability | ☐ | Monthly |
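The feature-drift check in the table can be sketched with a two-sample Kolmogorov-Smirnov test; the threshold, function name, and synthetic data here are illustrative:

```python
import numpy as np
from scipy import stats

def feature_drift(reference: np.ndarray, current: np.ndarray,
                  alpha: float = 0.01) -> dict:
    """Two-sample KS test between training-time and live feature values."""
    result = stats.ks_2samp(reference, current)
    return {
        'statistic': float(result.statistic),
        'p_value': float(result.pvalue),
        'drifted': bool(result.pvalue < alpha),
    }

rng = np.random.default_rng(42)
ref = rng.normal(0.0, 1.0, 5000)       # distribution at training time
shifted = rng.normal(0.5, 1.0, 5000)   # production distribution has drifted
print(feature_drift(ref, shifted))
```

In practice the reference sample is frozen at training time and the check runs on each day's serving traffic; population stability index (PSI) is a common alternative for binned features.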

📎 Cross-References