
📊 Evaluation Fundamentals

Level: Foundation · Solves: evaluating models accurately, avoiding overfitting, and ensuring generalization

Why Does Evaluation Matter?

💡 Professor Tom

Model evaluation is more than calling model.score(). It is the craft of making sure your model will perform well on data it has never seen. A flawed evaluation strategy can let you deploy a terrible model without ever realizing it.

┌─────────────────────────────────────────────────────────────────┐
│                 EVALUATION HIERARCHY                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              OFFLINE EVALUATION                          │    │
│  │  • Train/Val/Test splits                                 │    │
│  │  • Cross-validation                                      │    │
│  │  • Holdout metrics                                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              ONLINE EVALUATION                           │    │
│  │  • A/B testing                                           │    │
│  │  • Shadow mode                                           │    │
│  │  • Canary deployment                                     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│                           ▼                                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              PRODUCTION MONITORING                       │    │
│  │  • Drift detection                                       │    │
│  │  • Performance tracking                                  │    │
│  │  • Business metrics                                      │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Splitting Strategies

Basic Train/Val/Test Split

┌─────────────────────────────────────────────────────────────────┐
│                 STANDARD DATA SPLIT                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    FULL DATASET                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                           │                                     │
│           ┌───────────────┼───────────────┐                     │
│           │               │               │                     │
│           ▼               ▼               ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   TRAIN     │  │ VALIDATION  │  │    TEST     │              │
│  │   (60-70%)  │  │  (15-20%)   │  │  (15-20%)   │              │
│  ├─────────────┤  ├─────────────┤  ├─────────────┤              │
│  │ Model       │  │ Hyperparam  │  │ Final       │              │
│  │ training    │  │ tuning,     │  │ evaluation, │              │
│  │             │  │ model select│  │ report      │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  ⚠️ RULE: Never touch test set until final evaluation!          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
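The 60/20/20 split above can be produced with two chained `train_test_split` calls; a minimal sketch (dataset, sizes, and `random_state` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First cut: set aside 20% as the untouched test set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# Second cut: 25% of the remaining 80% = 20% of the full data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)
# Result: 600 / 200 / 200 = 60% / 20% / 20%
```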

Temporal Split (Time Series)

⚠️ Time Series Data

With time series, NEVER use a random split. Always split along the time axis to avoid data leakage.

┌─────────────────────────────────────────────────────────────────┐
│                 TEMPORAL DATA SPLIT                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Timeline: ──────────────────────────────────────────────────►  │
│                                                                 │
│  ┌──────────────┬──────────────┬──────────────┐                 │
│  │    TRAIN     │  VALIDATION  │     TEST     │                 │
│  │  Jan - Jun   │   Jul - Aug  │  Sep - Oct   │                 │
│  └──────────────┴──────────────┴──────────────┘                 │
│                                                                 │
│  ❌ WRONG: Random split mixes future data into training         │
│  ✅ CORRECT: Temporal split preserves time ordering             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
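A minimal sketch of a temporal split by position, assuming the rows are already in time order (the 70/15/15 cut points and the toy series are illustrative):

```python
import numpy as np

# Hypothetical series of 300 time-ordered observations
n = 300
X = np.arange(n).reshape(-1, 1)
y = np.random.RandomState(0).rand(n)

# Split by position, never by shuffling: train on the past, test on the future
train_end = int(n * 0.70)   # first 70% of the timeline
val_end   = int(n * 0.85)   # next 15%
X_train, y_train = X[:train_end], y[:train_end]
X_val,   y_val   = X[train_end:val_end], y[train_end:val_end]
X_test,  y_test  = X[val_end:], y[val_end:]
```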

Stratified Split

python
# A stratified split preserves the class distribution
from sklearn.model_selection import train_test_split

# ❌ WRONG: a random split can leave the classes imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# ✅ CORRECT: a stratified split keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Group Split

┌─────────────────────────────────────────────────────────────────┐
│                    GROUP-AWARE SPLIT                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Use Case: Multiple samples per entity (user, patient, etc.)   │
│                                                                 │
│  ❌ WRONG: Same user in both train and test                     │
│     Train: [User_A_sample1, User_B_sample1, User_A_sample2]     │
│     Test:  [User_A_sample3, User_C_sample1]                     │
│     → Model memorizes User_A patterns!                          │
│                                                                 │
│  ✅ CORRECT: Users completely separated                         │
│     Train: [User_A_sample1, User_A_sample2, User_A_sample3]     │
│     Test:  [User_B_sample1, User_C_sample1]                     │
│     → Model must generalize to new users                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
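scikit-learn's `GroupShuffleSplit` (or `GroupKFold`) enforces exactly this separation; a sketch with hypothetical per-user samples:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 20 users with 3 samples each
groups = np.repeat(np.arange(20), 3)                 # user id per sample
X = np.random.RandomState(0).rand(60, 4)
y = np.random.RandomState(1).randint(0, 2, 60)

# Every user lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
```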

Cross-Validation

K-Fold Cross-Validation

┌─────────────────────────────────────────────────────────────────┐
│                 5-FOLD CROSS-VALIDATION                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Fold 1: [VAL] [TRAIN] [TRAIN] [TRAIN] [TRAIN] → Score_1        │
│  Fold 2: [TRAIN] [VAL] [TRAIN] [TRAIN] [TRAIN] → Score_2        │
│  Fold 3: [TRAIN] [TRAIN] [VAL] [TRAIN] [TRAIN] → Score_3        │
│  Fold 4: [TRAIN] [TRAIN] [TRAIN] [VAL] [TRAIN] → Score_4        │
│  Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [VAL] → Score_5        │
│                                                                 │
│  Final Score = mean(Score_1, ..., Score_5) ± std                │
│                                                                 │
│  Benefits:                                                      │
│  • Uses all data for both training and validation               │
│  • Provides variance estimate                                   │
│  • More robust than single split                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
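A minimal sketch of the 5-fold procedure with scikit-learn's `cross_val_score` (the dataset and the `LogisticRegression` model are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# One score per fold; report mean ± std as in the diagram above
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
final = f"{scores.mean():.3f} ± {scores.std():.3f}"
```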

CV Variants

Variant             Use Case                       Description
───────             ────────                       ───────────
K-Fold              General purpose                Standard k splits
Stratified K-Fold   Imbalanced classes             Preserves class ratio
Group K-Fold        Grouped data                   Keeps groups together
Time Series Split   Temporal data                  Expanding window
Leave-One-Out       Small datasets                 k = N samples
Nested CV           Model selection + evaluation   Inner + outer loops

Time Series Cross-Validation

┌─────────────────────────────────────────────────────────────────┐
│              TIME SERIES CROSS-VALIDATION                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Expanding Window:                                              │
│  Fold 1: [TRAIN    ] [VAL]                                      │
│  Fold 2: [TRAIN         ] [VAL]                                 │
│  Fold 3: [TRAIN              ] [VAL]                            │
│  Fold 4: [TRAIN                   ] [VAL]                       │
│                                                                 │
│  Sliding Window:                                                │
│  Fold 1: [TRAIN    ] [VAL]                                      │
│  Fold 2:    [TRAIN    ] [VAL]                                   │
│  Fold 3:       [TRAIN    ] [VAL]                                │
│  Fold 4:          [TRAIN    ] [VAL]                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
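Both windows are available through scikit-learn's `TimeSeriesSplit`; a sketch of the expanding-window variant (passing `max_train_size` would cap the window, giving the sliding variant):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

# Expanding window: each fold trains on everything before its validation block
tscv = TimeSeriesSplit(n_splits=4)
n_folds = 0
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()  # training always precedes validation
    n_folds += 1
```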

Classification Metrics

Confusion Matrix

┌─────────────────────────────────────────────────────────────────┐
│                    CONFUSION MATRIX                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                        PREDICTED                                │
│                    Positive  Negative                           │
│                  ┌──────────┬──────────┐                        │
│         Positive │    TP    │    FN    │  ← Actual Positives    │
│  ACTUAL          ├──────────┼──────────┤                        │
│         Negative │    FP    │    TN    │  ← Actual Negatives    │
│                  └──────────┴──────────┘                        │
│                       ↑          ↑                              │
│                  Predicted   Predicted                          │
│                  Positives   Negatives                          │
│                                                                 │
│  TP = True Positive  (Correct positive prediction)              │
│  TN = True Negative  (Correct negative prediction)              │
│  FP = False Positive (Type I error - False alarm)               │
│  FN = False Negative (Type II error - Missed detection)         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
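The matrix above maps onto scikit-learn's `confusion_matrix`, with one caveat: sklearn orders labels ascending, so for binary labels `ravel()` yields `tn, fp, fn, tp` rather than the positive-first layout drawn above (the toy labels are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows = actual [0, 1], columns = predicted [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Here: tp=3, fp=1, fn=1, tn=3
```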

Key Metrics

Metric        Formula                   Use When
──────        ───────                   ────────
Accuracy      (TP+TN)/(TP+TN+FP+FN)     Balanced classes
Precision     TP/(TP+FP)                FP is costly
Recall        TP/(TP+FN)                FN is costly
F1-Score      2×(P×R)/(P+R)             Balance between P and R
Specificity   TN/(TN+FP)                Need true negative rate
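The formulas in the table, computed directly from illustrative confusion-matrix counts:

```python
# Illustrative counts from a hypothetical confusion matrix
tp, fp, fn, tn = 80, 10, 20, 90

accuracy    = (tp + tn) / (tp + tn + fp + fn)        # 0.85
precision   = tp / (tp + fp)                         # FP in the denominator
recall      = tp / (tp + fn)                         # FN in the denominator
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                         # true negative rate
```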

ROC and AUC

┌─────────────────────────────────────────────────────────────────┐
│                    ROC CURVE                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  True Positive Rate (Recall)                                    │
│  1.0 ┤                                    ╭────────             │
│      │                               ╭────╯                     │
│      │                          ╭────╯                          │
│  0.5 ┤                     ╭────╯         Perfect: AUC = 1.0    │
│      │                ╭────╯              Good: AUC > 0.8       │
│      │           ╭────╯                   Random: AUC = 0.5     │
│      │      ╭────╯                                              │
│  0.0 ┼─────╯─────────────────────────────────────────           │
│      0.0        0.5        1.0                                  │
│           False Positive Rate (1 - Specificity)                 │
│                                                                 │
│  AUC = Area Under Curve                                         │
│  • Threshold-independent metric                                 │
│  • Measures ranking quality                                     │
│  • Good for imbalanced datasets                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
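A sketch with scikit-learn's `roc_auc_score`; note that it expects scores or probabilities, not hard labels (the toy values are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # probabilities/scores, NOT labels

# AUC = probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, y_score)
```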

Precision-Recall Curve

💡 When Should You Use the PR Curve Instead of ROC?

For highly imbalanced datasets (e.g., fraud detection with 0.1% positives), the PR curve gives better insight because it focuses on the positive class.
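Both PR quantities come from scikit-learn (toy values illustrative); `average_precision_score` summarizes the whole curve in one number:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Points on the PR curve, one per distinct threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision: threshold-weighted mean of precision over recall steps
ap = average_precision_score(y_true, y_score)
```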

Regression Metrics

Metric   Formula                    Interpretation
──────   ───────                    ──────────────
MAE      mean(|y - ŷ|)              Average absolute error
MSE      mean((y - ŷ)²)             Penalizes large errors
RMSE     √MSE                       Same unit as target
MAPE     mean(|y - ŷ| / |y|) × 100  Percentage error
R²       1 - SS_res/SS_tot          Variance explained
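The table's formulas via scikit-learn and NumPy (toy values illustrative; MAPE is computed by hand here since older scikit-learn versions lack a built-in):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae  = mean_absolute_error(y_true, y_pred)                # mean(|y - ŷ|)
mse  = mean_squared_error(y_true, y_pred)                 # mean((y - ŷ)²)
rmse = np.sqrt(mse)                                       # back in target units
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percentage error
r2   = r2_score(y_true, y_pred)                           # variance explained
```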

Metric Selection Guide

┌─────────────────────────────────────────────────────────────────┐
│              REGRESSION METRIC SELECTION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Question                          Recommended Metric           │
│  ────────                          ──────────────────           │
│                                                                 │
│  "Average error magnitude?"    →   MAE (robust to outliers)     │
│                                                                 │
│  "Penalize large errors?"      →   MSE/RMSE (sensitive)         │
│                                                                 │
│  "Relative error?"             →   MAPE (percentage-based)      │
│                                                                 │
│  "How much variance explained?"→   R² (0 to 1 scale)            │
│                                                                 │
│  "Business-specific cost?"     →   Custom weighted metric       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Calibration

What is Calibration?

📊 Calibration Definition

A model is calibrated when its predicted probabilities match the actual probabilities. For example: out of 100 predictions with P = 0.7, about 70 cases should actually be positive.

Calibration Plot

┌─────────────────────────────────────────────────────────────────┐
│                  CALIBRATION PLOT                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Actual Positive Rate                                           │
│  1.0 ┤                                         ╱                │
│      │                                    ╱                     │
│      │                               ╱    Perfect calibration   │
│  0.5 ┤                          ╱                               │
│      │                     ╱                                    │
│      │                ╱                                         │
│      │           ╱                                              │
│  0.0 ┼──────╱────────────────────────────────────               │
│      0.0        0.5        1.0                                  │
│           Predicted Probability                                 │
│                                                                 │
│  • Points above line: under-confident (predicted < actual)      │
│  • Points below line: over-confident (predicted > actual)       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
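The plot's two axes come straight from scikit-learn's `calibration_curve`, which bins the predictions and returns the actual positive rate and the mean predicted probability per bin (toy values illustrative):

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.6, 0.9, 0.4, 0.8])

# Per probability bin: fraction of actual positives vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
```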

Calibration Methods

Method                Description                          Use Case
──────                ───────────                          ────────
Platt Scaling         Logistic regression on predictions   Binary classification
Isotonic Regression   Non-parametric monotonic fit         More flexible
Temperature Scaling   Single-parameter scaling             Neural networks

python
from sklearn.calibration import CalibratedClassifierCV

# Calibrate model predictions
# (on scikit-learn < 1.2 the parameter is named base_estimator)
calibrated_model = CalibratedClassifierCV(
    estimator=model,
    method='isotonic',  # or 'sigmoid' for Platt scaling
    cv=5
)
calibrated_model.fit(X_train, y_train)

Evaluation Best Practices

Evaluation Checklist

Step   Action                           Why
────   ──────                           ───
1      Define metrics before training   Avoid metric shopping
2      Use appropriate split strategy   Prevent leakage
3      Report confidence intervals      Quantify uncertainty
4      Compare to baseline              Establish improvement
5      Check calibration                Ensure reliable probabilities
6      Evaluate on subgroups            Detect fairness issues

Common Pitfalls

🚨 Evaluation Anti-Patterns

  • Metric shopping: Trying many metrics, reporting best one
  • Test set peeking: Using test set for model selection
  • Ignoring variance: Reporting single number without confidence
  • Wrong baseline: Comparing to weak baseline
  • Leaky evaluation: Future data in training

Statistical Significance

python
# Compare two models with paired t-test
from scipy import stats

# Cross-validation scores for two models
scores_model_a = [0.82, 0.85, 0.83, 0.84, 0.86]
scores_model_b = [0.84, 0.87, 0.85, 0.86, 0.88]

# Paired t-test
t_stat, p_value = stats.ttest_rel(scores_model_a, scores_model_b)

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference")

Operational Checklist

Offline Evaluation Gate (Pre-Deployment)

Check                                  Required   Owner         Notes
─────                                  ────────   ─────         ─────
Correct split strategy used            Yes        ML Engineer   Temporal/Group as needed
Test set not touched during training   Yes        ML Engineer   No peeking!
Business metric computed               Yes        ML Engineer   Not just AUC
Baseline comparison done               Yes        ML Engineer   vs. current solution
Calibration checked                    Yes        ML Engineer   ECE < 0.05
Fairness metrics reviewed              Yes        ML Lead       Subgroup performance
Confidence intervals reported          Yes        ML Engineer   Cross-validation std
Leakage audit passed                   Yes        ML Lead       See leakage.md

Metric Reporting Standards

Metric Type      Report Format                Example
───────────      ─────────────                ───────
Classification   AUC, Precision@K, Recall@K   AUC: 0.85 ± 0.02
Regression       RMSE, MAE, R²                RMSE: 12.3 ± 1.1
Ranking          NDCG@K, MAP                  NDCG@10: 0.72
Calibration      ECE, Brier Score             ECE: 0.03
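ECE, used as a gate above, is not a built-in scikit-learn metric; a minimal sketch under equal-width binning (the helper name is my own):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted average of |actual rate - mean confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # left-closed first bin so a probability of exactly 0.0 is not dropped
        mask = (y_prob <= hi) if lo == 0.0 else (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return ece
```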

📎 Cross-References