
📊 Evaluation Fundamentals

Level: Foundation · Solves: evaluating models accurately, avoiding overfitting, and ensuring generalization

Why Does Evaluation Matter?

💡 Professor Tom

Model evaluation is more than calling model.score(). It is the craft of making sure your model will perform well on data it has never seen. A flawed evaluation strategy can let you deploy a terrible model without ever realizing it.

┌─────────────────────────────────────────────────────────────────┐
│                 EVALUATION HIERARCHY                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              OFFLINE EVALUATION                          │    │
│  │  • Train/Val/Test splits                                 │    │
│  │  • Cross-validation                                      │    │
│  │  • Holdout metrics                                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              ONLINE EVALUATION                           │    │
│  │  • A/B testing                                           │    │
│  │  • Shadow mode                                           │    │
│  │  • Canary deployment                                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              PRODUCTION MONITORING                       │    │
│  │  • Drift detection                                       │    │
│  │  • Performance tracking                                  │    │
│  │  • Business metrics                                      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Splitting Strategies

Basic Train/Val/Test Split

┌─────────────────────────────────────────────────────────────────┐
│                 STANDARD DATA SPLIT                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    FULL DATASET                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│           ┌───────────────┼───────────────┐                     │
│           │               │               │                     │
│           ▼               ▼               ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   TRAIN     │  │ VALIDATION  │  │    TEST     │              │
│  │   (60-70%)  │  │  (15-20%)   │  │  (15-20%)   │              │
│  ├─────────────┤  ├─────────────┤  ├─────────────┤              │
│  │ Model       │  │ Hyperparam  │  │ Final       │              │
│  │ training    │  │ tuning,     │  │ evaluation, │              │
│  │             │  │ model select│  │ report      │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  ⚠️ RULE: Never touch test set until final evaluation!          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
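The 60/20/20 split above can be produced with two chained `train_test_split` calls; a minimal sketch (dataset, sizes, and `random_state` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First cut: set aside 20% as the untouched test set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# Second cut: 25% of the remaining 80% = 20% of the full data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)
# Result: 600 / 200 / 200 = 60% / 20% / 20%
```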

Temporal Split (Time Series)

⚠️ Time Series Data

With time series, NEVER use a random split. Always split along the time axis to avoid data leakage.

┌─────────────────────────────────────────────────────────────────┐
│                 TEMPORAL DATA SPLIT                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Timeline: ──────────────────────────────────────────────────►  │
│                                                                 │
│  ┌──────────────┬──────────────┬──────────────┐                 │
│  │    TRAIN     │  VALIDATION  │     TEST     │                 │
│  │  Jan - Jun   │   Jul - Aug  │  Sep - Oct   │                 │
│  └──────────────┴──────────────┴──────────────┘                 │
│                                                                 │
│  ❌ WRONG: Random split mixes future data into training         │
│  ✅ CORRECT: Temporal split preserves time ordering             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
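A minimal sketch of a temporal split by position, assuming the rows are already in time order (the 70/15/15 cut points and the toy series are illustrative):

```python
import numpy as np

# Hypothetical series of 300 time-ordered observations
n = 300
X = np.arange(n).reshape(-1, 1)
y = np.random.RandomState(0).rand(n)

# Split by position, never by shuffling: train on the past, test on the future
train_end = int(n * 0.70)   # first 70% of the timeline
val_end   = int(n * 0.85)   # next 15%
X_train, y_train = X[:train_end], y[:train_end]
X_val,   y_val   = X[train_end:val_end], y[train_end:val_end]
X_test,  y_test  = X[val_end:], y[val_end:]
```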

Stratified Split

python
# A stratified split preserves the class distribution
from sklearn.model_selection import train_test_split

# ❌ WRONG: a random split can leave the classes imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# ✅ CORRECT: a stratified split keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Group Split

┌─────────────────────────────────────────────────────────────────┐
│                    GROUP-AWARE SPLIT                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Use Case: Multiple samples per entity (user, patient, etc.)   │
│                                                                 │
│  ❌ WRONG: Same user in both train and test                     │
│     Train: [User_A_sample1, User_B_sample1, User_A_sample2]     │
│     Test:  [User_A_sample3, User_C_sample1]                     │
│     → Model memorizes User_A patterns!                          │
│                                                                 │
│  ✅ CORRECT: Users completely separated                         │
│     Train: [User_A_sample1, User_A_sample2, User_A_sample3]     │
│     Test:  [User_B_sample1, User_C_sample1]                     │
│     → Model must generalize to new users                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
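scikit-learn's `GroupShuffleSplit` (or `GroupKFold`) enforces exactly this separation; a sketch with hypothetical per-user samples:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 20 users with 3 samples each
groups = np.repeat(np.arange(20), 3)                 # user id per sample
X = np.random.RandomState(0).rand(60, 4)
y = np.random.RandomState(1).randint(0, 2, 60)

# Every user lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
```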

Cross-Validation

K-Fold Cross-Validation

┌─────────────────────────────────────────────────────────────────┐
│                 5-FOLD CROSS-VALIDATION                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Fold 1: [VAL] [TRAIN] [TRAIN] [TRAIN] [TRAIN] → Score_1        │
│  Fold 2: [TRAIN] [VAL] [TRAIN] [TRAIN] [TRAIN] → Score_2        │
│  Fold 3: [TRAIN] [TRAIN] [VAL] [TRAIN] [TRAIN] → Score_3        │
│  Fold 4: [TRAIN] [TRAIN] [TRAIN] [VAL] [TRAIN] → Score_4        │
│  Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [VAL] → Score_5        │
│                                                                 │
│  Final Score = mean(Score_1, ..., Score_5) ± std                │
│                                                                 │
│  Benefits:                                                      │
│  • Uses all data for both training and validation               │
│  • Provides variance estimate                                   │
│  • More robust than single split                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
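A minimal sketch of the 5-fold procedure with scikit-learn's `cross_val_score` (the dataset and the `LogisticRegression` model are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# One score per fold; report mean ± std as in the diagram above
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
final = f"{scores.mean():.3f} ± {scores.std():.3f}"
```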

CV Variants

Variant             Use Case                       Description
───────             ────────                       ───────────
K-Fold              General purpose                Standard k splits
Stratified K-Fold   Imbalanced classes             Preserves class ratio
Group K-Fold        Grouped data                   Keeps groups together
Time Series Split   Temporal data                  Expanding window
Leave-One-Out       Small datasets                 k = N samples
Nested CV           Model selection + evaluation   Inner + outer loops

Time Series Cross-Validation

┌─────────────────────────────────────────────────────────────────┐
│              TIME SERIES CROSS-VALIDATION                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Expanding Window:                                              │
│  Fold 1: [TRAIN    ] [VAL]                                      │
│  Fold 2: [TRAIN         ] [VAL]                                 │
│  Fold 3: [TRAIN              ] [VAL]                            │
│  Fold 4: [TRAIN                   ] [VAL]                       │
│                                                                 │
│  Sliding Window:                                                │
│  Fold 1: [TRAIN    ] [VAL]                                      │
│  Fold 2:    [TRAIN    ] [VAL]                                   │
│  Fold 3:       [TRAIN    ] [VAL]                                │
│  Fold 4:          [TRAIN    ] [VAL]                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
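Both windows are available through scikit-learn's `TimeSeriesSplit`; a sketch of the expanding-window variant (passing `max_train_size` would cap the window, giving the sliding variant):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

# Expanding window: each fold trains on everything before its validation block
tscv = TimeSeriesSplit(n_splits=4)
n_folds = 0
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()  # training always precedes validation
    n_folds += 1
```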

Classification Metrics

Confusion Matrix

┌─────────────────────────────────────────────────────────────────┐
│                    CONFUSION MATRIX                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                        PREDICTED                                │
│                    Positive  Negative                           │
│                  ┌──────────┬──────────┐                        │
│         Positive │    TP    │    FN    │  ← Actual Positives    │
│  ACTUAL          ├──────────┼──────────┤                        │
│         Negative │    FP    │    TN    │  ← Actual Negatives    │
│                  └──────────┴──────────┘                        │
│                       ↑          ↑                              │
│                  Predicted   Predicted                          │
│                  Positives   Negatives                          │
│                                                                 │
│  TP = True Positive  (Correct positive prediction)              │
│  TN = True Negative  (Correct negative prediction)              │
│  FP = False Positive (Type I error - False alarm)               │
│  FN = False Negative (Type II error - Missed detection)         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
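The matrix above maps onto scikit-learn's `confusion_matrix`, with one caveat: sklearn orders labels ascending, so for binary labels `ravel()` yields `tn, fp, fn, tp` rather than the positive-first layout drawn above (the toy labels are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows = actual [0, 1], columns = predicted [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Here: tp=3, fp=1, fn=1, tn=3
```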

Key Metrics

Metric        Formula                   Use When
──────        ───────                   ────────
Accuracy      (TP+TN)/(TP+TN+FP+FN)     Balanced classes
Precision     TP/(TP+FP)                FP is costly
Recall        TP/(TP+FN)                FN is costly
F1-Score      2×(P×R)/(P+R)             Balance between P and R
Specificity   TN/(TN+FP)                Need true negative rate
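The formulas in the table, computed directly from illustrative confusion-matrix counts:

```python
# Illustrative counts from a hypothetical confusion matrix
tp, fp, fn, tn = 80, 10, 20, 90

accuracy    = (tp + tn) / (tp + tn + fp + fn)        # 0.85
precision   = tp / (tp + fp)                         # FP in the denominator
recall      = tp / (tp + fn)                         # FN in the denominator
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                         # true negative rate
```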

ROC and AUC

┌─────────────────────────────────────────────────────────────────┐
│                    ROC CURVE                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  True Positive Rate (Recall)                                    │
│  1.0 ┤                                    ╭────────             │
│      │                               ╭────╯                     │
│      │                          ╭────╯                          │
│  0.5 ┤                     ╭────╯         Perfect: AUC = 1.0    │
│      │                ╭────╯              Good: AUC > 0.8       │
│      │           ╭────╯                   Random: AUC = 0.5     │
│      │      ╭────╯                                              │
│  0.0 ┼─────╯─────────────────────────────────────────           │
│      0.0        0.5        1.0                                  │
│           False Positive Rate (1 - Specificity)                 │
│                                                                 │
│  AUC = Area Under Curve                                         │
│  • Threshold-independent metric                                 │
│  • Measures ranking quality                                     │
│  • Good for imbalanced datasets                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
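A sketch with scikit-learn's `roc_auc_score`; note that it expects scores or probabilities, not hard labels (the toy values are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # probabilities/scores, NOT labels

# AUC = probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, y_score)
```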

Precision-Recall Curve

💡 When Should You Use the PR Curve Instead of ROC?

For highly imbalanced datasets (e.g., fraud detection with 0.1% positives), the PR curve gives better insight because it focuses on the positive class.
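Both PR quantities come from scikit-learn (toy values illustrative); `average_precision_score` summarizes the whole curve in one number:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Points on the PR curve, one per distinct threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision: threshold-weighted mean of precision over recall steps
ap = average_precision_score(y_true, y_score)
```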

Regression Metrics

Metric   Formula                    Interpretation
──────   ───────                    ──────────────
MAE      mean(|y - ŷ|)              Average absolute error
MSE      mean((y - ŷ)²)             Penalizes large errors
RMSE     √MSE                       Same unit as target
MAPE     mean(|y - ŷ| / |y|) × 100  Percentage error
R²       1 - SS_res/SS_tot          Variance explained
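The table's formulas via scikit-learn and NumPy (toy values illustrative; MAPE is computed by hand here since older scikit-learn versions lack a built-in):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae  = mean_absolute_error(y_true, y_pred)                # mean(|y - ŷ|)
mse  = mean_squared_error(y_true, y_pred)                 # mean((y - ŷ)²)
rmse = np.sqrt(mse)                                       # back in target units
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percentage error
r2   = r2_score(y_true, y_pred)                           # variance explained
```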

Metric Selection Guide

┌─────────────────────────────────────────────────────────────────┐
│              REGRESSION METRIC SELECTION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Question                          Recommended Metric           │
│  ────────                          ──────────────────           │
│                                                                 │
│  "Average error magnitude?"    →   MAE (robust to outliers)     │
│                                                                 │
│  "Penalize large errors?"      →   MSE/RMSE (sensitive)         │
│                                                                 │
│  "Relative error?"             →   MAPE (percentage-based)      │
│                                                                 │
│  "How much variance explained?"→   R² (0 to 1 scale)            │
│                                                                 │
│  "Business-specific cost?"     →   Custom weighted metric       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Calibration

What is Calibration?

📊 Calibration Definition

A model is calibrated when its predicted probabilities match the actual probabilities. For example: out of 100 predictions with P = 0.7, about 70 cases should actually be positive.

Calibration Plot

┌─────────────────────────────────────────────────────────────────┐
│                  CALIBRATION PLOT                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Actual Positive Rate                                           │
│  1.0 ┤                                         ╱                │
│      │                                    ╱                     │
│      │                               ╱    Perfect calibration   │
│  0.5 ┤                          ╱                               │
│      │                     ╱                                    │
│      │                ╱                                         │
│      │           ╱                                              │
│  0.0 ┼──────╱────────────────────────────────────               │
│      0.0        0.5        1.0                                  │
│           Predicted Probability                                 │
│                                                                 │
│  • Points above line: under-confident (predicted < actual)      │
│  • Points below line: over-confident (predicted > actual)       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
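The plot's two axes come straight from scikit-learn's `calibration_curve`, which bins the predictions and returns the actual positive rate and the mean predicted probability per bin (toy values illustrative):

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.6, 0.9, 0.4, 0.8])

# Per probability bin: fraction of actual positives vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
```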

Calibration Methods

Method                Description                          Use Case
──────                ───────────                          ────────
Platt Scaling         Logistic regression on predictions   Binary classification
Isotonic Regression   Non-parametric monotonic fit         More flexible
Temperature Scaling   Single-parameter scaling             Neural networks

python
from sklearn.calibration import CalibratedClassifierCV

# Calibrate model predictions
# (on scikit-learn < 1.2 the parameter is named base_estimator)
calibrated_model = CalibratedClassifierCV(
    estimator=model,
    method='isotonic',  # or 'sigmoid' for Platt scaling
    cv=5
)
calibrated_model.fit(X_train, y_train)

Evaluation Best Practices

Evaluation Checklist

Step   Action                           Why
────   ──────                           ───
1      Define metrics before training   Avoid metric shopping
2      Use appropriate split strategy   Prevent leakage
3      Report confidence intervals      Quantify uncertainty
4      Compare to baseline              Establish improvement
5      Check calibration                Ensure reliable probabilities
6      Evaluate on subgroups            Detect fairness issues

Common Pitfalls

🚨 Evaluation Anti-Patterns

  • Metric shopping: Trying many metrics, reporting best one
  • Test set peeking: Using test set for model selection
  • Ignoring variance: Reporting single number without confidence
  • Wrong baseline: Comparing to weak baseline
  • Leaky evaluation: Future data in training

Statistical Significance

python
# Compare two models with paired t-test
from scipy import stats

# Cross-validation scores for two models
scores_model_a = [0.82, 0.85, 0.83, 0.84, 0.86]
scores_model_b = [0.84, 0.87, 0.85, 0.86, 0.88]

# Paired t-test
t_stat, p_value = stats.ttest_rel(scores_model_a, scores_model_b)

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference")

Operational Checklist

Offline Evaluation Gate (Pre-Deployment)

Check                                  Required   Owner         Notes
─────                                  ────────   ─────         ─────
Correct split strategy used            Yes        ML Engineer   Temporal/Group as needed
Test set not touched during training   Yes        ML Engineer   No peeking!
Business metric computed               Yes        ML Engineer   Not just AUC
Baseline comparison done               Yes        ML Engineer   vs. current solution
Calibration checked                    Yes        ML Engineer   ECE < 0.05
Fairness metrics reviewed              Yes        ML Lead       Subgroup performance
Confidence intervals reported          Yes        ML Engineer   Cross-validation std
Leakage audit passed                   Yes        ML Lead       See leakage.md

Metric Reporting Standards

Metric Type      Report Format                Example
───────────      ─────────────                ───────
Classification   AUC, Precision@K, Recall@K   AUC: 0.85 ± 0.02
Regression       RMSE, MAE, R²                RMSE: 12.3 ± 1.1
Ranking          NDCG@K, MAP                  NDCG@10: 0.72
Calibration      ECE, Brier Score             ECE: 0.03
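ECE, used as a gate above, is not a built-in scikit-learn metric; a minimal sketch under equal-width binning (the helper name is my own):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted average of |actual rate - mean confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # left-closed first bin so a probability of exactly 0.0 is not dropped
        mask = (y_prob <= hi) if lo == 0.0 else (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return ece
```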

📎 Cross-References