Imbalanced Data & Batch vs Stream: The Reality Your Model Has to Face

🎯 Objectives

After completing this lesson, you will be able to:

Recognize an imbalanced dataset and explain why accuracy is a misleading metric in that setting
Choose the right handling strategy: class weights, SMOTE, undersampling, or threshold tuning
Explain the difference between the two data processing paradigms: batch and stream
Know when to use a batch pipeline, when a stream pipeline is needed, and how the Lambda architecture combines both

Why Does Accuracy Lie?

Imagine you are building a fraud detection system for a bank. Out of 10,000 transactions, only 50 are fraudulent, i.e. 0.5%.

You train a model, run the evaluation, and see:

Accuracy: 99.5% 🎉

Great? No. Your model is predicting "not fraud" for every single transaction:

              Predicted
              Normal  Fraud
Actual
Normal        9950      0
Fraud           50      0

Accuracy: 99.5%   ← looks great!
Recall:   0%       ← catches NO fraud
Precision: N/A    ← never predicted fraud
F1-Score: 0%      ← useless model

⚠️ Accuracy Paradox

The model has learned nothing; it simply guesses the majority class. An accuracy of 99.5% sounds impressive, but this model is completely useless for detecting fraud.

The Right Metrics for Imbalanced Data

Metric	Formula	Meaning	When to prioritize?
Precision	TP / (TP + FP)	Of everything the model flagged as "fraud", how much really is fraud?	When false positives are costly (e.g. wrongly reporting a customer to the police)
Recall	TP / (TP + FN)	Of all real fraud, how much did the model catch?	When misses are costly (fraud slips through)
F1-Score	2 × (P × R) / (P + R)	Harmonic mean of Precision and Recall	When both need to be balanced
AUC-ROC	Area under the ROC curve	Ability to separate the two classes across all thresholds	Overall model assessment
PR-AUC	Area under the Precision-Recall curve	Performance on the minority class	Imbalanced data; usually more informative than AUC-ROC here
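As a quick check, the formulas in the table can be applied to the all-"Normal" confusion matrix from the previous section. This is a minimal sketch in plain Python; the helper name `safe_div` is an illustrative assumption, not part of any library.

```python
# Counts taken from the all-"Normal" confusion matrix shown earlier.
tp, fp, fn, tn = 0, 0, 50, 9950

def safe_div(num, den):
    """Return num / den, or None when the denominator is zero (metric undefined)."""
    return num / den if den else None

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)   # None: the model never predicted "Fraud"
recall = safe_div(tp, tp + fn)      # 0 of 50 frauds were caught

# F1 is conventionally reported as 0 when there are no true positives.
f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision} recall={recall} f1={f1}")
```

Accuracy is the only number here that looks healthy, which is exactly why it should not be the headline metric on imbalanced data.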

Business Impact: Every Mistake Has a Price

Each missed fraudulent transaction = $500 in damage on average

50 missed frauds × $500 = $25,000 lost per batch
365 days × $25,000 = $9,125,000 / year (assuming one such batch of 10,000 transactions per day)

→ A model with 99.5% accuracy that still causes about $9M in losses per year!
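The arithmetic above can be wrapped in a small helper to compare scenarios. This is a sketch only: the function name is hypothetical, it assumes one identical batch per day, and the second scenario (missing 10 of 50 frauds) is an invented comparison point rather than a measured result.

```python
def annual_missed_fraud_cost(missed_per_batch, avg_loss_usd, batches_per_year=365):
    """Yearly loss from missed fraud, assuming every batch looks the same."""
    return missed_per_batch * avg_loss_usd * batches_per_year

# The majority-class model misses all 50 frauds in each daily batch.
cost_all_missed = annual_missed_fraud_cost(missed_per_batch=50, avg_loss_usd=500)

# Hypothetical comparison: a model that still misses 10 of the 50.
cost_ten_missed = annual_missed_fraud_cost(missed_per_batch=10, avg_loss_usd=500)

print(f"All 50 missed:   ${cost_all_missed:,}")
print(f"10 of 50 missed: ${cost_ten_missed:,}")
```

Framing recall in dollars like this makes it easier to argue for a model whose accuracy is lower but whose misses are cheaper.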

💡 Golden Rule

Always ask: "Which error costs more?" A False Positive (false alarm) or a False Negative (miss)? The answer decides which metric you optimize.

Strategy 1: Class Weights, Simple but Effective

Core Idea

Instead of changing the data, we change how the model learns: we tell it to "care more" about the minority class by increasing the penalty for misclassifying that class.

Ordinary loss function:
  Misclassify "Normal"  → penalty = 1
  Misclassify "Fraud"   → penalty = 1

Loss function with class weights (balanced, shown as relative weights):
  Misclassify "Normal"  → penalty = 1
  Misclassify "Fraud"   → penalty = 199  (ratio 9950/50)

→ The model is penalized 199 times more heavily for missing fraud
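To see where the factor 199 comes from, the "balanced" formula `n_samples / (n_classes * class_count)` can be evaluated by hand. The sketch below uses only NumPy and the 9,950 / 50 split from the running example; the variable names are illustrative.

```python
import numpy as np

# Labels for the running example: 9,950 Normal (0) and 50 Fraud (1).
y = np.array([0] * 9950 + [1] * 50)

counts = np.bincount(y)                      # [9950, 50]
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of each class)
weights = n_samples / (n_classes * counts)
ratio = weights[1] / weights[0]              # relative penalty Fraud vs Normal

print(f"weight Normal = {weights[0]:.4f}, weight Fraud = {weights[1]:.1f}")
print(f"relative penalty = {ratio:.0f}x")
```

The absolute weights are about 0.5 and 100 rather than 1 and 199, but only their ratio matters for how the two classes are traded off.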

Code: Class Weights in sklearn

python

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Assumes X_train, X_test, y_train, y_test already exist (stratified split).

# --- Option 1: balanced (sklearn computes the weights) ---
lr_balanced = LogisticRegression(
    class_weight='balanced',  # weight = n_samples / (n_classes * n_class_samples)
    max_iter=1000,
    random_state=42
)
lr_balanced.fit(X_train, y_train)

y_pred = lr_balanced.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

# --- Option 2: custom weights (direct control) ---
# Fraud (class 1) counts 100 times as much as Normal (class 0)
rf_weighted = RandomForestClassifier(
    class_weight={0: 1, 1: 100},
    n_estimators=200,
    random_state=42
)
rf_weighted.fit(X_train, y_train)

y_pred_rf = rf_weighted.predict(X_test)
print(classification_report(y_test, y_pred_rf, target_names=['Normal', 'Fraud']))

When to Use Class Weights?

✅ Always try it first: essentially free, and it does not change the data
✅ Works with most models that expose a weighting option: LogisticRegression, SVM, and RandomForest via class_weight, XGBoost via scale_pos_weight, neural networks via a weighted loss
✅ Does not cause data leakage
⚠️ May not be enough when the imbalance is extremely severe (< 0.01%)

Strategy 2: SMOTE, Synthetic Minority Oversampling

Intuition: Create "synthetic" samples from minority samples

SMOTE (Synthetic Minority Over-sampling Technique) works as follows:

Pick a minority-class sample
Find its k nearest neighbors (k-NN) that also belong to the minority class
Create a new sample by interpolating between the original sample and one randomly chosen neighbor

Feature Space:
● = minority class    ○ = majority class    ★ = synthetic (SMOTE)

BEFORE SMOTE:                     AFTER SMOTE:
○ ○ ○ ○ ○ ○                      ○ ○ ○ ○ ○ ○
○ ○ ○ ○ ○                        ○ ○ ○ ○ ○
  ● ●               →              ● ★ ●
    ●                                 ★ ●
                                        ★

Synthetic sample formula:
  x_new = x_i + λ × (x_nn - x_i)
  where λ ∈ [0, 1] is drawn at random
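The three steps and the interpolation formula can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch` and the toy points are assumptions, and it uses brute-force distances, which is only sensible for small minority sets.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)                         # cannot have more neighbours than points

    # Step 2: pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Step 1: pick base samples x_i; then pick one of their k neighbours x_nn.
    base = rng.integers(0, n, size=n_new)
    pick = rng.integers(0, k, size=n_new)
    nn = neighbours[base, pick]

    # Step 3: x_new = x_i + lambda * (x_nn - x_i), lambda drawn from [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nn] - X_min[base])

# Toy minority class with four 2-D points (made-up values).
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.round(2))
```

Every synthetic point lies on a line segment between two real minority points, so SMOTE can fill in the region the minority already occupies but cannot discover new regions.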

Code: SMOTE with imbalanced-learn

python

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Assumes X, y hold the full dataset (10,000 rows, 50 of them fraud).

# ⚠️ IMPORTANT: split FIRST, SMOTE AFTER
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Check the imbalance before SMOTE
print(f"Before SMOTE: {dict(zip(*np.unique(y_train, return_counts=True)))}")
# → {0: 7960, 1: 40}

# Apply SMOTE ONLY to the training data
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"After SMOTE: {dict(zip(*np.unique(y_train_resampled, return_counts=True)))}")
# → {0: 7960, 1: 7960}  ← balanced!

# Train on the balanced data
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_resampled, y_train_resampled)

# Evaluate on the ORIGINAL test data (no SMOTE!)
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")

🚫 A FATAL MISTAKE

python

# ❌ NEVER DO THIS
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)  # SMOTE BEFORE the split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)
# → Data leakage! Synthetic test samples are derived from training data!

SMOTE must be applied AFTER the train/test split. If you apply it before, synthetic samples in the test set carry information from the training data, so the metrics are artificially inflated.

SMOTE Variants

Variant	Characteristics	When to use?
SMOTE (original)	Uniform k-NN interpolation	Default choice
BorderlineSMOTE	Generates samples only near the decision boundary	When you need to focus on the hard-to-classify region
ADASYN	Generates more samples in regions that are harder to learn	When the minority class is unevenly distributed
SMOTE-NC (SMOTENC in imbalanced-learn)	Supports mixed features (numeric + categorical)	When the data has both numeric and categorical columns

Strategy 3: Undersampling, Shrinking the Majority

Idea

Instead of growing the minority class, shrink the majority class to balance the two.

python

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

print(f"After undersampling: {dict(zip(*np.unique(y_train_under, return_counts=True)))}")
# → {0: 40, 1: 40}  ← balanced, but only 80 samples remain!

⚠️ Risks of Undersampling

You are throwing data away: 7,920 samples are discarded! On a small dataset this can cost the model important information. Use it only when the dataset is large enough that the reduction has no significant effect.
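To make the cost of undersampling tangible, here is a NumPy-only sketch of what a random undersampler does on the 7,960 / 40 training split. The function name `random_undersample` is hypothetical and the feature matrix is a stand-in; real code would normally use `RandomUnderSampler` as shown above.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()                      # preserve the original row order
    return X[keep], y[keep]

# Stand-in training split: 7,960 Normal and 40 Fraud rows.
y = np.array([0] * 7960 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_u, y_u = random_undersample(X, y)
discarded = len(y) - len(y_u)
print(f"kept {len(y_u)} rows, discarded {discarded}")
```

All 40 fraud rows survive, but 99.5% of the normal rows are gone, which is why the technique is best reserved for datasets with plenty of majority samples to spare.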

Combining: SMOTE + Undersampling

A smarter strategy: partly grow the minority and partly shrink the majority.

python

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Step 1: SMOTE grows the minority to 30% of the majority
# Step 2: undersampling shrinks the majority until minority/majority = 0.5
combined_pipeline = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),   # minority = 30% of majority
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42))  # majority = 2x minority
])

X_combined, y_combined = combined_pipeline.fit_resample(X_train, y_train)
print(f"Combined: {dict(zip(*np.unique(y_combined, return_counts=True)))}")
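A float `sampling_strategy` is easy to misread, so it helps to work through the counts. In imbalanced-learn, for a binary problem, the float is the desired minority/majority ratio after that step. The sketch below only does the arithmetic for the 7,960 / 40 training split; the helper names are hypothetical and no resampling is performed.

```python
def after_smote(n_major, n_minor, ratio):
    """Oversampling with a float ratio: minority grows to ratio * majority."""
    return n_major, max(n_minor, int(ratio * n_major))

def after_undersample(n_major, n_minor, ratio):
    """Undersampling with a float ratio: majority shrinks to minority / ratio."""
    return min(n_major, int(n_minor / ratio)), n_minor

counts = (7960, 40)                                  # (majority, minority)
counts_smote = after_smote(*counts, ratio=0.3)
counts_final = after_undersample(*counts_smote, ratio=0.5)

print(f"start:             {counts}")
print(f"after SMOTE 0.3:   {counts_smote}")
print(f"after under 0.5:   {counts_final}")
```

Under these assumptions the pipeline ends with roughly twice as many majority as minority rows, and most of the minority rows are synthetic, which is worth keeping in mind when judging overfitting risk.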

Strategy 4: Threshold Tuning, Adjusting the Decision Threshold

Why Is 0.5 an Arbitrary Threshold?

By default, a classifier uses threshold = 0.5: if P(fraud) > 0.5 → predict "Fraud". But 0.5 is only a convention. It minimizes errors only when the predicted probabilities are well calibrated and both kinds of error cost the same, which is rarely the case for imbalanced problems.

python

import numpy as np
from sklearn.metrics import classification_report, precision_recall_curve

# Get probabilities instead of hard predictions.
# X_val / y_val are assumed to be a held-out validation split: tune the
# threshold there, not on the final test set, to avoid optimistic metrics.
y_proba_val = model.predict_proba(X_val)[:, 1]

# Find the threshold that maximizes F1
precisions, recalls, thresholds = precision_recall_curve(y_val, y_proba_val)
# precisions/recalls have one more entry than thresholds; drop the last point
p, r = precisions[:-1], recalls[:-1]
f1_scores = 2 * (p * r) / (p + r + 1e-8)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"F1 at the optimal threshold: {f1_scores[optimal_idx]:.4f}")

# Apply the new threshold to the test set
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_tuned = (y_proba >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_tuned, target_names=['Normal', 'Fraud']))

💡 Thresholds Driven by Business Logic

Fraud detection: lower the threshold (e.g. 0.2) → catch more fraud, accept more false alarms
Spam filter: raise the threshold (e.g. 0.8) → flag spam only when very sure, to avoid losing important email
Healthcare: lower the threshold → it is better to run extra tests on a healthy patient than to miss a sick one
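One way to turn these business rules into a number is to pick the threshold that minimizes total misclassification cost. The sketch below is illustrative only: the function name, the tiny hand-made scores, and the cost figures (other than the $500 per missed fraud used earlier) are assumptions.

```python
import numpy as np

def pick_threshold(y_true, y_score, cost_fn, cost_fp, grid=None):
    """Return the grid threshold with the lowest total cost, and that cost."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    costs = []
    for t in grid:
        pred = y_score >= t
        fn = np.sum(~pred & (y_true == 1))    # positives we missed
        fp = np.sum(pred & (y_true == 0))     # negatives we flagged
        costs.append(fn * cost_fn + fp * cost_fp)
    best = int(np.argmin(costs))              # first minimum on ties
    return float(grid[best]), float(costs[best])

# Made-up validation labels and model scores (1 = positive class).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.02, 0.07, 0.12, 0.17, 0.22, 0.37, 0.31, 0.57, 0.62, 0.91])

# Fraud-like costs: a miss is far more expensive than a false alarm.
t_fraud, cost_fraud = pick_threshold(y_true, y_score, cost_fn=500, cost_fp=5)
# Spam-like costs: a false alarm (lost email) is far more expensive than a miss.
t_spam, cost_spam = pick_threshold(y_true, y_score, cost_fn=1, cost_fp=50)

print(f"fraud-style threshold: {t_fraud:.2f} (cost {cost_fraud:.0f})")
print(f"spam-style threshold:  {t_spam:.2f} (cost {cost_spam:.0f})")
```

The same scores lead to a low threshold when misses dominate the cost and a high one when false alarms dominate, matching the fraud and spam rules of thumb above.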

Decision Matrix: Which Strategy to Choose?

Criterion	Class Weights	SMOTE	Undersampling	Threshold Tuning
Small dataset (< 10K)	✅ Good	⚠️ Watch for overfitting	❌ Loses precious data	✅ Good
Large dataset (> 100K)	✅ Good	✅ Good	✅ Good	✅ Good
Mild imbalance (10:1)	✅ Sufficient	✅ Good	✅ Good	✅ Good
Severe imbalance (1000:1)	⚠️ May not be enough	✅ Combine with undersampling	⚠️ Loses too much	✅ Very effective
Speed of adoption	🚀 Immediate	🐢 Needs computation	🚀 Immediate	🚀 After training
Data leakage risk	❌ None	⚠️ If applied before the split	❌ None	⚠️ If tuned on the test set
Changes the data	No	Yes (creates synthetic samples)	Yes (removes samples)	No

🧭 Recommended Workflow

Step 1: Class Weights (always try first)
         ↓ (not enough?)
Step 2: Threshold Tuning (zero cost, no data change)
         ↓ (still not good enough?)
Step 3: SMOTE + Undersampling combined
         ↓ (edge case?)
Step 4: Custom loss function or ensemble methods

Batch vs Stream: Two Data Processing Paradigms

In practice, how you process data matters as much as how you train the model. The two main paradigms:

Batch Processing

╔════════════════════════════════════════════════════════════╗
║  BATCH PIPELINE                                            ║
║                                                            ║
║  Data Lake  →  ETL Job  →  Feature Store  →  Model Train   ║
║  (S3/GCS)      (Spark)     (offline)         (daily/weekly)║
║                                                            ║
║  Latency: minutes → hours                                  ║
║  Throughput: very high (TBs of data)                       ║
╚════════════════════════════════════════════════════════════╝

Characteristics:

Processes all the data at once (or partition by partition)
Tools: pandas, Apache Spark, SQL queries, dbt
Latency: minutes to hours, which is acceptable because real-time results are not needed
Use cases: training models, generating reports, computing daily features, ETL pipelines

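python

# Simple batch pipeline example
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Read the whole partition in one go (reading from S3 needs the s3fs package)
df = pd.read_parquet("s3://data-lake/transactions/2024-01-15/")

# Feature engineering over the whole batch.
# compute_features and X_cols are placeholders for your own feature code and column list.
features = compute_features(df)  # mean, std, count, etc.

# Train the model on the full batch
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(class_weight='balanced'))
])
model.fit(features[X_cols], features['label'])

# Save the model for deployment
joblib.dump(model, 'model_v2.pkl')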

Stream Processing

╔════════════════════════════════════════════════════════════════╗
║  STREAM PIPELINE                                               ║
║                                                                ║
║  Event  →  Message Queue  →  Feature Compute  →  Model Serve   ║
║  (API)     (Kafka/SQS)       (real-time)         (< 100ms)     ║
║                                                                ║
║  Latency: milliseconds → seconds                               ║
║  Throughput: moderate (events/second)                          ║
╚════════════════════════════════════════════════════════════════╝

Characteristics:

Processes each event, or a micro-batch of events, as the data arrives
Tools: Apache Kafka, Apache Flink, AWS Kinesis, Google Pub/Sub
Latency: milliseconds to seconds, because an immediate response is required
Use cases: real-time fraud detection, recommendations, monitoring, alerting

python

# Stream processing example (sketch; uses the kafka-python client)
from kafka import KafkaConsumer
import joblib

model = joblib.load('model_v2.pkl')

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='kafka:9092',
    auto_offset_reset='latest'
)

# parse_transaction, compute_realtime_features, trigger_alert, and
# block_transaction are placeholders for application-specific code.
for message in consumer:
    transaction = parse_transaction(message.value)

    # Compute features in real time (no luxury of a mean over the whole dataset!)
    features = compute_realtime_features(transaction)

    # Predict immediately
    fraud_prob = model.predict_proba([features])[0][1]

    if fraud_prob > 0.3:  # tuned threshold
        trigger_alert(transaction, fraud_prob)
        block_transaction(transaction)

Head-to-Head Comparison

BATCH:   Data Lake → ETL → Feature Store → Model Train → Deploy
                                                  (daily)

STREAM:  Event → Kafka → Feature Compute → Model Serve → Response
                          (real-time)          (ms)

Criterion	Batch	Stream
Latency	Minutes → hours	Milliseconds → seconds
Data volume	TBs of data	Events/second
Complexity	Low → medium	High
Fault tolerance	Easy (rerun the job)	Hard (ordering, exactly-once)
State management	Simple (all data is available)	Complex (windowed state)
Cost	Pay per job	Pay continuously (always-on infra)
Feature computation	Mean/std over the full dataset	Rolling windows, incremental stats
Debug	Easy (deterministic)	Hard (race conditions, ordering)
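The "Feature computation" row is where the two paradigms differ most in code. A batch job can call mean/std on the full column, while a stream job has to update its statistics one event at a time. A common way to do that is Welford's online algorithm, sketched below in plain Python; the class name and the sample amounts are illustrative assumptions.

```python
import math
import statistics

class RunningStats:
    """Incremental mean and population std (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.n) if self.n else 0.0

# Made-up transaction amounts arriving one by one.
amounts = [12.0, 250.0, 8.5, 40.0, 1999.0, 15.0]

stats = RunningStats()
for amount in amounts:
    stats.update(amount)          # O(1) work and O(1) memory per event

# The batch equivalent needs the whole list in memory at once.
batch_mean = statistics.fmean(amounts)
batch_std = statistics.pstdev(amounts)
print(f"stream: mean={stats.mean:.2f} std={stats.std:.2f}")
print(f"batch:  mean={batch_mean:.2f} std={batch_std:.2f}")
```

Both paths arrive at the same numbers; the streaming version simply never needs to look at past events again, which is what makes it usable inside a per-event loop.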

Lambda Architecture: Combining Batch + Stream

In practice, many production systems need both:

                    ┌─────────────────┐
                    │   Data Source    │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    ▼                 ▼
           ┌──────────────┐  ┌──────────────┐
           │  Batch Layer  │  │ Stream Layer  │
           │  (accuracy)   │  │  (speed)      │
           │               │  │               │
           │  Spark/SQL    │  │  Kafka/Flink  │
           │  Daily retrain│  │  Real-time    │
           └──────┬───────┘  └──────┬───────┘
                  │                  │
                  ▼                  ▼
           ┌─────────────────────────────┐
           │       Serving Layer          │
           │  (merge batch + stream)      │
           │                              │
           │  Model trained on batch data │
           │  Features from both layers   │
           └─────────────────────────────┘

💡 When to Use What?

Batch only: training models, weekly reports, A/B test analysis
Stream only: real-time fraud alerts, live dashboards, chat applications
Lambda (both): fraud detection (stream) + model retraining (batch); recommendations (stream serving) + feature engineering (batch)
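The serving layer in the diagram merges a precomputed batch view with fresh counts from the stream layer. The following is a deliberately small sketch of that merge using in-memory dictionaries; the feature names, the `BatchView` type, and the `serve_features` function are hypothetical, and a real system would read from a feature store and a low-latency cache instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchView:
    """Per-user features recomputed by the nightly batch job."""
    txn_count_30d: int
    avg_amount_30d: float

EMPTY_VIEW = BatchView(txn_count_30d=0, avg_amount_30d=0.0)

def serve_features(user_id, batch_views, speed_counts):
    """Merge the batch view with events seen since the last batch run."""
    batch = batch_views.get(user_id, EMPTY_VIEW)
    recent = speed_counts.get(user_id, 0)
    return {
        "txn_count_30d": batch.txn_count_30d + recent,   # batch + stream
        "avg_amount_30d": batch.avg_amount_30d,          # batch only
        "txn_count_since_batch": recent,                 # stream only
    }

# Batch layer output (accurate, but hours old).
batch_views = {"u1": BatchView(txn_count_30d=42, avg_amount_30d=31.5)}
# Stream layer output (fresh, covers only events since the last batch run).
speed_counts = {"u1": 3, "u2": 1}

feats_u1 = serve_features("u1", batch_views, speed_counts)
feats_u2 = serve_features("u2", batch_views, speed_counts)   # new user, no batch view yet
print(feats_u1)
print(feats_u2)
```

The design choice to note is that the stream layer only has to cover the gap since the last batch run, so its state stays small and can be discarded whenever a new batch view lands.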

Decision Framework: Batch or Stream?

Ask three questions:

1. "How soon is the result needed?"
   → > 1 minute is acceptable  → Batch
   → < 1 second is mandatory   → Stream

2. "Do the data patterns change continuously?"
   → Static, rarely changing   → Batch is enough
   → Continuously changing     → Stream is needed

3. "What is the infrastructure budget?"
   → Tight                     → Batch (run, then shut down)
   → Can invest                → Stream (always-on infra)
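The three questions can be encoded as a small helper for quick triage. This is one possible reading of the framework, not a definitive rule: the function name, the argument names, and the precedence (a hard sub-second requirement wins, then fast-changing patterns with budget, otherwise batch) are assumptions.

```python
def choose_paradigm(max_latency_s, patterns_change_fast, always_on_budget):
    """Suggest 'stream' or 'batch' from the three framework questions."""
    # Question 1: a hard sub-second requirement leaves no choice.
    if max_latency_s < 1:
        return "stream"
    # Questions 2 and 3: fast-changing patterns justify streaming
    # only if always-on infrastructure is affordable.
    if patterns_change_fast and always_on_budget:
        return "stream"
    # Everything else can run as a scheduled job that shuts down afterwards.
    return "batch"

print(choose_paradigm(0.1, False, False))    # card authorization
print(choose_paradigm(3600, False, False))   # nightly report
print(choose_paradigm(300, True, True))      # fast-drifting patterns, budget available
print(choose_paradigm(300, True, False))     # fast-drifting patterns, tight budget
```

Requirements that fall between one second and one minute are not covered by the framework; under this sketch they default to batch unless the other two answers point to streaming.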

🔥 GPU Paragraph

🔥 The GPU Angle: Weights vs SMOTE on Hardware

Training with class weights is very GPU-friendly: the weight is just a multiplicative factor in the loss function and changes neither the data nor the computation graph. The major frameworks support this on GPU, for example through the weight / pos_weight arguments of PyTorch loss functions or class_weight in Keras model.fit.

SMOTE, by contrast, is CPU-bound in its common imbalanced-learn implementation: the algorithm relies on k-NN search in feature space, which requires distance computations among the minority-class points and does not run on a GPU out of the box.

Hardware recommendations:

Large datasets (>1M samples): use a class-weighted loss on GPU instead of CPU-bound SMOTE
SMOTE is practical mainly on medium-sized datasets (10K-500K samples) on CPU
Stream inference usually runs on CPU because it needs stable latency (GPUs add overhead for data transfer)
GPU batch inference suits high-throughput systems (millions of predictions/second) when per-request latency is not critical

Performance comparison (rough orders of magnitude; hardware-dependent):
┌─────────────────────────────────────────────┐
│  Class Weights (GPU):  ~0 overhead          │
│  SMOTE 100K samples:   ~30-45 s (CPU)       │
│  SMOTE 1M samples:     ~15-25 min (CPU)     │
│  SMOTE 10M samples:    ❌ impractical        │
└─────────────────────────────────────────────┘
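To see why class weighting adds almost no overhead, it helps to write the weighted loss out. The NumPy sketch below implements a weighted binary cross-entropy by hand; the function name and the toy predictions are assumptions, and the weight 199 is the ratio from the class-weights section. Framework loss functions apply the same idea as one extra element-wise multiply.

```python
import numpy as np

def weighted_bce(y_true, p, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Binary cross-entropy where each class has its own multiplier."""
    p = np.clip(p, eps, 1 - eps)                      # avoid log(0)
    per_sample = -(w_pos * y_true * np.log(p)
                   + w_neg * (1 - y_true) * np.log(1 - p))
    return per_sample.mean()

# Three normal transactions and one fraud; the model says "10% fraud" for all.
y_true = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.1, 0.1, 0.1])

plain = weighted_bce(y_true, p)
weighted = weighted_bce(y_true, p, w_pos=199.0)

print(f"unweighted loss: {plain:.3f}")
print(f"weighted loss:   {weighted:.3f}")
```

The data and the forward pass are untouched; only the contribution of the missed fraud to the loss (and therefore to the gradient) is scaled up.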

🧠 Common Beginner Misconceptions

🧠 "Always oversample the minority class": WRONG

Many tutorials teach you to SMOTE everything, but that is a dangerous oversimplification:

SMOTE creates artificial samples: synthetic samples may not reflect the true data distribution. In fraud detection, real fraud patterns can differ substantially from interpolated ones.
Risk of overfitting: SMOTE enlarges the training set without adding information. The model can overfit to synthetic patterns.
Data leakage: the most common mistake is applying SMOTE before the train/test split → synthetic test samples carry information from the training data → inflated metrics.

Correct order:

✅ Right: Split → SMOTE on train → Train → Evaluate on the original test set
❌ Wrong: SMOTE → Split → Train → Evaluate (data leakage!)

Start with class weights (zero cost, no change to the data), then threshold tuning, and only then resampling if needed.

⚡ Fast Exercise: Fraud Detection with Class Weights

⚡ Quick exercise, 10 minutes

Scenario: a bank transaction dataset in which 0.2% of transactions are fraud.

Tasks:

Load the data and check the imbalance ratio
Train a model WITHOUT class weights → evaluate accuracy, recall, F1
Train a model WITH class_weight='balanced' → evaluate the same metrics
Compare the results

python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Create an imbalanced dataset (about 0.2% minority)
X, y = make_classification(
    n_samples=50000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    weights=[0.998, 0.002],
    random_state=42,
    flip_y=0  # no label noise; flip_y > 0 would push the minority share above 0.2%
)

print(f"Class distribution: {dict(zip(*np.unique(y, return_counts=True)))}")

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model 1: no class weights
lr_default = LogisticRegression(max_iter=1000, random_state=42)
lr_default.fit(X_train, y_train)
y_pred_default = lr_default.predict(X_test)

print("=" * 50)
print("MODEL 1: No class weights")
print("=" * 50)
print(classification_report(y_test, y_pred_default, target_names=['Normal', 'Fraud']))
print(f"AUC-ROC: {roc_auc_score(y_test, lr_default.predict_proba(X_test)[:, 1]):.4f}")

# Model 2: with class weights
lr_balanced = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)
lr_balanced.fit(X_train, y_train)
y_pred_balanced = lr_balanced.predict(X_test)

print("=" * 50)
print("MODEL 2: Class weight = balanced")
print("=" * 50)
print(classification_report(y_test, y_pred_balanced, target_names=['Normal', 'Fraud']))
print(f"AUC-ROC: {roc_auc_score(y_test, lr_balanced.predict_proba(X_test)[:, 1]):.4f}")

# 🎯 Expected:
# Model 1: high accuracy (~99.8%) but low recall for Fraud
# Model 2: accuracy may drop, but Fraud recall rises substantially

Questions to reflect on:

Does Model 2's accuracy drop compared with Model 1? Why?
How does recall for the Fraud class change?
In a business context, which model would you deploy?

🪤 Gotcha: SMOTE Before Split = Data Leakage

🪤 Spot the Bug

python

# ❌ BUGGY CODE: find the mistake!
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Step 1: SMOTE on the WHOLE dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Step 2: split afterwards
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

# Step 3: train & evaluate
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
print(f"F1: {f1_score(y_test, model.predict(X_test)):.4f}")  # 0.95+ ← FAKE!

Bug: SMOTE is applied BEFORE the split. Synthetic samples in the test set were generated from data points that also appear in the training set → the model has already "seen" test information during training.

The correct fix:

python

# ✅ Split FIRST → SMOTE ONLY on the training data
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

model = GradientBoostingClassifier()
model.fit(X_train_resampled, y_train_resampled)
print(f"F1: {f1_score(y_test, model.predict(X_test)):.4f}")  # honest F1: lower, but trustworthy

📊 Performance Note: SMOTE at Scale

📊 SMOTE Performance

SMOTE relies on k-NN search among the minority-class samples, which costs O(n²) with brute force or roughly O(n log n) with a KD-Tree in low dimensions (n = number of minority samples). On large datasets this becomes a bottleneck.

Illustrative timings (they vary with hardware, dimensionality, and minority-class size):

Samples	k_neighbors=5	k_neighbors=3	Savings
10K	~2 s	~1.5 s	25%
100K	~45 s	~30 s	33%
1M	~25 min	~15 min	40%
10M	❌ OOM	❌ OOM	n/a

Optimization tips:

python

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Reduce k_neighbors to speed things up
smote = SMOTE(k_neighbors=3, random_state=42)  # instead of the default k=5

# Combine: undersample the majority FIRST, then SMOTE the minority
fast_pipeline = ImbPipeline([
    ('under', RandomUnderSampler(sampling_strategy=0.1, random_state=42)),  # majority cut to 10x the minority
    ('smote', SMOTE(k_neighbors=3, sampling_strategy=0.5, random_state=42))  # far fewer synthetic samples to generate
])

For datasets > 1M samples: consider skipping SMOTE. Class weights + threshold tuning often give comparable results at a small fraction of the compute cost.

🚫 Production Anti-pattern: SMOTE at Inference

🚫 Anti-pattern: SMOTE in the Production Pipeline

python

# ❌ COMPLETELY WRONG: NEVER DO THIS
def predict_fraud(transaction):
    # Append the transaction to a "buffer", run SMOTE, then predict??
    buffer.append(transaction)
    X_balanced = smote.fit_resample(buffer)  # ← NONSENSE
    return model.predict(X_balanced[-1])

Why is it wrong?

SMOTE is a training-time technique: its purpose is to help the model learn better from imbalanced data. In production:

The model was already trained on SMOTE-balanced data → it already "knows" how to recognize the minority class
Incoming data is real data; being imbalanced is its natural state
You classify each transaction directly, with no SMOTE involved

python

# ✅ CORRECT: production inference
def predict_fraud(transaction):
    features = extract_features(transaction)
    fraud_prob = model.predict_proba([features])[0][1]
    return fraud_prob > THRESHOLD  # threshold tuned offline

Remember: SMOTE is for training. At inference, just apply the threshold you tuned offline.

🎮 Playground: The Complete Workflow

🎮 Playground: Detect Imbalance → Class Weights → SMOTE → Compare

python

"""Complete Imbalanced Data Handling Workflow — Detect → Weights → SMOTE → Compare"""
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, f1_score, average_precision_score
from imblearn.over_sampling import SMOTE

np.random.seed(42)

# STEP 1: Create an imbalanced dataset (about 3% minority before label noise; flip_y=0.02 adds roughly one more percentage point)
X, y = make_classification(
    n_samples=20000, n_features=15, n_informative=8, n_redundant=2,
    n_classes=2, weights=[0.97, 0.03], random_state=42, flip_y=0.02
)
unique, counts = np.unique(y, return_counts=True)
print(f"Class distribution: {dict(zip(unique, counts))}")
print(f"Imbalance ratio: {counts[0]/counts[1]:.1f}:1")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def evaluate(name, model, X_tr, y_tr):
    """Train, predict on the shared test split, and print metrics."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    print(f"\n{'=' * 50}\n{name}\n{'=' * 50}")
    print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
    auc = roc_auc_score(y_test, y_proba)
    prauc = average_precision_score(y_test, y_proba)
    f1 = f1_score(y_test, y_pred)
    print(f"AUC-ROC: {auc:.4f} | PR-AUC: {prauc:.4f}")
    return {'Approach': name, 'AUC-ROC': auc, 'PR-AUC': prauc, 'F1 (Fraud)': f1}

# STEPS 2-4: Train 3 approaches
results = []
results.append(evaluate("Baseline", LogisticRegression(max_iter=1000, random_state=42), X_train, y_train))
results.append(evaluate("Class Weights", LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42), X_train, y_train))

smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {dict(zip(*np.unique(y_smote, return_counts=True)))}")
results.append(evaluate("SMOTE", LogisticRegression(max_iter=1000, random_state=42), X_smote, y_smote))

# STEP 5: Compare
df = pd.DataFrame(results)
print(f"\n{'=' * 50}\nCOMPARISON\n{'=' * 50}")
print(df.to_string(index=False))
best = df.loc[df['PR-AUC'].idxmax(), 'Approach']
print(f"\n🎯 Best approach (PR-AUC): {best}")

Summary

IMBALANCED DATA CHEAT SHEET:
═══════════════════════════════════════════

1. DETECT:  np.unique(y, return_counts=True)
            → Ratio > 10:1? Needs handling

2. METRIC:  Drop accuracy. Use:
            → PR-AUC (imbalanced data)
            → F1-Score (balance P & R)
            → Recall (when misses are costly)

3. FIX (in this order):
   ① Class Weights  → zero cost, try first
   ② Threshold Tune → driven by business need
   ③ SMOTE + Under  → combined, medium datasets
   ④ Custom Loss    → advanced, neural nets

4. RULE #1: SMOTE AFTER the split, NEVER before!

5. BATCH vs STREAM:
   → Need it now?   → Stream (Kafka/Flink)
   → Daily?         → Batch (Spark/pandas)
   → Both?          → Lambda Architecture

Next Steps

📚 Further Reading

Lesson 07: Bridge to Classical ML: ties together everything from Phase 1 as you move into Machine Learning
Evaluation Metrics Deep Dive: a closer look at ROC, PR curves, and how to choose the right metric
