🎯 Problem Framing

Level: Foundation

Solves: Defining the ML problem correctly and avoiding common pitfalls such as data leakage and misaligned objectives

Why does Problem Framing matter?

💡 Professor Tom

80% of ML projects fail not because the model is weak, but because the problem was defined incorrectly from the start. A perfect model that solves the wrong problem is still a failure. Time invested at this step saves months of refactoring later.

┌─────────────────────────────────────────────────────────────────┐
│                 ML PROJECT SUCCESS FACTORS                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              PROBLEM FRAMING (60%)                       │    │
│  │  • Clear objective definition                            │    │
│  │  • Correct success metrics                               │    │
│  │  • Leakage prevention                                    │    │
│  │  • Constraint identification                             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              DATA QUALITY (25%)                          │    │
│  │  • Representative data                                   │    │
│  │  • Proper labeling                                       │    │
│  │  • Feature engineering                                   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              MODEL SELECTION (15%)                       │    │
│  │  • Algorithm choice                                      │    │
│  │  • Hyperparameter tuning                                 │    │
│  │  • Architecture design                                   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Objective Definition Framework

Business Objective vs ML Objective

| Aspect | Business Objective | ML Objective |
|---|---|---|
| Focus | Business outcome | Measurable prediction |
| Metric | Revenue, retention | Accuracy, AUC, RMSE |
| Stakeholder | Business team | ML team |
| Example | "Reduce churn" | "Predict churn probability" |

The Translation Process

┌─────────────────────────────────────────────────────────────────┐
│           BUSINESS → ML OBJECTIVE TRANSLATION                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Business Goal          ML Problem Type        ML Objective     │
│  ─────────────          ───────────────        ────────────     │
│                                                                 │
│  "Reduce churn"    →    Classification    →   Predict P(churn) │
│                                                                 │
│  "Increase sales"  →    Recommendation    →   Rank products    │
│                                                                 │
│  "Detect fraud"    →    Anomaly Detection →   Score anomaly    │
│                                                                 │
│  "Forecast demand" →    Regression        →   Predict quantity │
│                                                                 │
│  "Segment users"   →    Clustering        →   Group similar    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Objective Definition Checklist

⚠️ Before coding any model

You should be able to answer ALL of the following questions:

| Category | Question | Why It Matters |
|---|---|---|
| What | What is the prediction target? | Defines the label |
| When | At what point in time is the prediction needed? | Defines feature availability |
| Who | Who or what is the prediction for? | Defines the entity |
| How | How will the prediction be used? | Defines threshold/ranking |
| Why | What is the business impact of the prediction? | Defines the success metric |
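
The five checklist questions can be captured as a small record that reports which answers are still blank. This is a minimal sketch: `ProblemSpec`, its field names, and the churn answers are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProblemSpec:
    """One field per checklist question (hypothetical structure)."""
    what: str  # prediction target -> defines the label
    when: str  # prediction time -> defines feature availability
    who: str   # entity being scored
    how: str   # how the prediction is consumed -> threshold or ranking
    why: str   # business impact -> defines the success metric

    def unanswered(self) -> list[str]:
        """Return the names of checklist questions that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative answers for a churn use case; "why" is deliberately left empty.
churn_spec = ProblemSpec(
    what="Customer cancels within the next 30 days",
    when="Scored every Monday, using data up to Sunday",
    who="Each active subscriber",
    how="Top-ranked customers receive a retention offer",
    why="",
)

missing = churn_spec.unanswered()
print(missing)
```

A non-empty `missing` list is a signal to go back to stakeholders before any modeling starts.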

Constraint Identification

Types of Constraints

┌─────────────────────────────────────────────────────────────────┐
│                    ML PROJECT CONSTRAINTS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   TECHNICAL     │  │    BUSINESS     │  │    ETHICAL      │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Latency       │  │ • Budget        │  │ • Fairness      │  │
│  │ • Throughput    │  │ • Timeline      │  │ • Privacy       │  │
│  │ • Memory        │  │ • Interpretability│ │ • Bias          │  │
│  │ • Compute       │  │ • Maintenance   │  │ • Transparency  │  │
│  │ • Data volume   │  │ • Team skills   │  │ • Consent       │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Constraint Trade-offs Matrix

| Constraint | Impact on Model | Mitigation Strategy |
|---|---|---|
| Low latency (<10 ms) | Simpler models, fewer features | Feature caching, model distillation |
| High interpretability | Linear models, decision trees | SHAP, LIME for complex models |
| Limited labeled data | Overfitting risk | Transfer learning, semi-supervised |
| Real-time features | Feature engineering complexity | Feature stores, streaming pipelines |
| Fairness requirements | Constrained optimization | Fairness-aware training |
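
A latency constraint such as the <10 ms row above is easiest to enforce when it is checked against measured numbers. The sketch below times a prediction callable and compares a tail percentile with the budget. The helper names, the nearest-rank percentile, and the choice of the 99th percentile are assumptions made for illustration, not part of any library.

```python
import math
import time
from typing import Callable, Sequence


def measure_latencies_ms(predict: Callable[[object], object],
                         inputs: Sequence[object]) -> list[float]:
    """Time one predict() call per input and return latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def within_budget(latencies_ms: Sequence[float],
                  budget_ms: float = 10.0, q: float = 99.0) -> bool:
    """True if the q-th percentile latency is strictly under the budget."""
    return percentile(latencies_ms, q) < budget_ms


# Stand-in for a real model: any callable that takes one input works here.
latencies = measure_latencies_ms(lambda x: x * 2, list(range(1000)))
print(within_budget(latencies))
```

Checking a tail percentile rather than the mean is a design choice: a model that is fast on average can still miss the budget for the slowest requests.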

Data Leakage Prevention

What is Data Leakage?

🚨 Data Leakage

Data leakage occurs when information from the future (or any information that is not available at prediction time) "leaks" into the training data, producing a model whose offline performance is deceptively high but which fails in production.

Types of Leakage

┌─────────────────────────────────────────────────────────────────┐
│                    TYPES OF DATA LEAKAGE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. TARGET LEAKAGE                                              │
│     ─────────────────                                           │
│     Feature contains information about the target               │
│                                                                 │
│     Example: Predicting loan default                            │
│     ❌ Feature: "collection_calls_count"                        │
│     → Only exists after the default has occurred!               │
│                                                                 │
│  2. TRAIN-TEST CONTAMINATION                                    │
│     ─────────────────────────                                   │
│     Test-set information leaks into training                    │
│                                                                 │
│     Example: Time series forecasting                            │
│     ❌ Random split instead of a temporal split                 │
│     → Future data in the training set!                          │
│                                                                 │
│  3. FEATURE LEAKAGE                                             │
│     ───────────────                                             │
│     Feature computed with future information                   │
│                                                                 │
│     Example: Customer lifetime value                            │
│     ❌ Feature: "total_purchases_ever"                          │
│     → Includes future purchases!                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
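
The feature-leakage and train-test contamination cases in the diagram can be made concrete with a point-in-time cutoff. The sketch below contrasts a leaky `total_purchases_ever` with a version restricted to data before the prediction date, and adds a temporal split. The purchase log, the dates, and the helper names are all hypothetical.

```python
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount)
purchases = [
    ("c1", date(2024, 1, 5), 20.0),
    ("c1", date(2024, 2, 10), 35.0),
    ("c2", date(2024, 3, 1), 15.0),
    ("c1", date(2024, 4, 2), 50.0),  # happens AFTER the prediction date
]
prediction_date = date(2024, 3, 15)


def total_purchases_ever(customer_id):
    """Leaky feature: sums every purchase, including future ones."""
    return sum(amount for cid, _, amount in purchases if cid == customer_id)


def total_purchases_as_of(customer_id, cutoff):
    """Point-in-time feature: only purchases strictly before the cutoff count."""
    return sum(
        amount for cid, day, amount in purchases
        if cid == customer_id and day < cutoff
    )


def temporal_split(rows, cutoff):
    """Train on rows before the cutoff, test on rows at or after it."""
    train = [row for row in rows if row[1] < cutoff]
    test = [row for row in rows if row[1] >= cutoff]
    return train, test


leaky = total_purchases_ever("c1")                   # includes the April purchase
safe = total_purchases_as_of("c1", prediction_date)  # January and February only
train_rows, test_rows = temporal_split(purchases, prediction_date)
print(leaky, safe, len(train_rows), len(test_rows))
```

The leaky value is larger than the safe one only because it counts a purchase the model could not have seen at prediction time, which is exactly the gap that inflates offline metrics.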

Leakage Detection Checklist

| Check | Question | Red Flag |
|---|---|---|
| Temporal | Is the feature available at prediction time? | Feature comes from after the prediction point |
| Causal | Is the feature a cause or an effect of the target? | Feature is a consequence of the target |
| Proxy | Is the feature a direct proxy for the target? | Near-perfect correlation |
| Processing | Does preprocessing use test data? | Scaler fit on the full dataset |
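
The Proxy check can be partly automated by flagging features whose correlation with the target is suspiciously high. The sketch below uses a hand-written Pearson correlation; the 0.95 cutoff, the function names, and the toy loan data are assumptions. A flagged feature is a prompt to investigate where it comes from, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(features, target, threshold=0.95):
    """Names of features whose absolute correlation with the target meets the threshold."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy loan-default data (1 = defaulted); values are made up for illustration.
defaulted = [0, 0, 1, 1, 0, 1]
features = {
    "collection_calls_count": [0, 0, 3, 4, 0, 5],  # only non-zero after a default
    "income_k": [40, 55, 42, 60, 38, 50],
}

flagged = suspicious_features(features, defaulted)
print(flagged)
```

Correlation only catches linear proxies, so this complements the Temporal and Causal questions rather than replacing them.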

Prevention Strategies

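```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is assumed to be an existing feature matrix (array or DataFrame).

# ❌ WRONG: Leakage through preprocessing
# The scaler's mean and variance are computed on ALL rows, test rows included.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL data
X_train, X_test = train_test_split(X_scaled, random_state=42)

# ✅ CORRECT: Split first, then fit preprocessing on the training split only
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)  # Reuse train statistics on test
```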
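
The same rule applies inside cross-validation: the scaler has to be refit on each training fold. One common way to get this is to wrap preprocessing and model in a scikit-learn pipeline, as sketched below. The synthetic data and the choice of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 5 features, label driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Because the scaler is a pipeline step, cross_val_score fits it on the
# training portion of every fold and only transforms the held-out portion.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Keeping every fitted transformation inside the pipeline also means the exact same steps are applied at prediction time.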

Problem Type Selection

Decision Framework

┌─────────────────────────────────────────────────────────────────┐
│              ML PROBLEM TYPE DECISION TREE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    Do you have labels?                          │
│                          │                                      │
│              ┌───────────┴───────────┐                          │
│              │                       │                          │
│             YES                      NO                         │
│              │                       │                          │
│     Is target continuous?    → UNSUPERVISED                     │
│              │                  • Clustering                    │
│     ┌────────┴────────┐         • Anomaly Detection             │
│     │                 │         • Dimensionality Reduction      │
│    YES               NO                                         │
│     │                 │                                         │
│ REGRESSION      Is it ranking?                                  │
│ • Linear            │                                           │
│ • Tree-based   ┌────┴────┐                                      │
│ • Neural       │         │                                      │
│               YES        NO                                     │
│                │         │                                      │
│           RANKING   CLASSIFICATION                              │
│           • LTR     • Binary                                    │
│           • Rec     • Multi-class                               │
│                     • Multi-label                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
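
The decision tree above can also be written as a small function, which is handy as a sanity check during a framing review. This is a direct transcription of the diagram; the function and parameter names are hypothetical.

```python
from typing import Optional


def choose_problem_type(has_labels: bool,
                        target_is_continuous: Optional[bool] = None,
                        is_ranking: Optional[bool] = None) -> str:
    """Walk the decision tree: labels? -> continuous target? -> ranking?"""
    if not has_labels:
        return "unsupervised"      # clustering, anomaly detection, dimensionality reduction
    if target_is_continuous:
        return "regression"        # linear, tree-based, neural
    if is_ranking:
        return "ranking"           # learning to rank, recommendation
    return "classification"        # binary, multi-class, multi-label


print(choose_problem_type(has_labels=True, target_is_continuous=False, is_ranking=False))
```

Real projects often sit between branches (for example, fraud detection with few labels), so the result is a starting point for discussion rather than a final answer.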

Common Reframing Patterns

| Reframing (from → to) | How | Benefit |
|---|---|---|
| Regression → Classification | Bucket continuous target | Simpler, more robust |
| Classification → Ranking | Rank by probability | Better for recommendations |
| Multi-class → Binary | One-vs-rest | Easier to interpret |
| Point prediction → Distribution | Predict uncertainty | Better decision making |
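
As an illustration of the first pattern, a continuous target can be bucketed into classes with fixed edges. The sketch below assumes a delivery-time target in days; the edges, labels, and the `bucketize` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical bucket edges for a delivery-time target measured in days.
EDGES = [2.0, 5.0]
LABELS = ["fast", "normal", "slow"]  # always one more label than edges


def bucketize(value, edges=EDGES, labels=LABELS):
    """Map a continuous value to a class label.

    Values below edges[0] get labels[0]; a value equal to an edge
    falls into the bucket above it.
    """
    return labels[bisect_right(edges, value)]


delivery_days = [0.5, 2.0, 4.9, 7.3]
classes = [bucketize(d) for d in delivery_days]
print(classes)
```

Bucketing discards the ordering within each class, so the edges should follow the decision the business actually makes (for example, an SLA boundary) rather than arbitrary quantiles.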

Success Metrics Alignment

Metric Selection Framework

| Business Goal | ML Metric | Threshold Consideration |
|---|---|---|
| Minimize false positives | Precision | High precision, lower recall |
| Minimize false negatives | Recall | High recall, lower precision |
| Balance both | F1-Score | Harmonic mean |
| Ranking quality | AUC-ROC, NDCG | Threshold-independent |
| Calibrated probabilities | Brier Score, Log Loss | Probability accuracy |
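
To see how the threshold moves precision and recall against each other, the metrics in the table can be computed by hand on a tiny example. The sketch below uses plain Python so that each definition is visible; the labels and scores are made up.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Made-up labels and model scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

for threshold in (0.5, 0.25):
    y_pred = [int(p >= threshold) for p in y_prob]
    precision, recall, f1 = precision_recall_f1(y_true, y_pred)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"Brier score: {brier_score(y_true, y_prob):.3f}")
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises recall from 0.75 to 1.00 while precision drops from 0.75 to 0.67. The Brier score does not change, because it is computed from the probabilities rather than from thresholded predictions.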

Metric-Business Alignment Example

┌─────────────────────────────────────────────────────────────────┐
│           FRAUD DETECTION: METRIC ALIGNMENT                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Business Context:                                              │
│  • False Positive: Block legitimate transaction → Customer angry│
│  • False Negative: Miss fraud → Financial loss                  │
│                                                                 │
│  Cost Analysis:                                                 │
│  • Cost of FP: $10 (customer service, friction)                 │
│  • Cost of FN: $500 (average fraud amount)                      │
│                                                                 │
│  Optimal Threshold:                                             │
│  • FN is 50x more costly than FP                                │
│  • Optimize for high recall (catch more fraud)                  │
│  • Accept lower precision (more false alarms)                   │
│                                                                 │
│  Recommended Metric: Recall @ Precision = 0.5                   │
│  Or: Custom cost-weighted metric                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
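
The custom cost-weighted metric mentioned in the box can be sketched by pricing each error with the costs above ($10 per false positive, $500 per false negative) and picking the threshold with the lowest total. The transaction scores below are invented, and the break-even formula assumes calibrated probabilities and zero cost for correct decisions.

```python
COST_FP = 10.0   # blocking a legitimate transaction
COST_FN = 500.0  # missing a fraudulent transaction


def total_cost(y_true, y_prob, threshold):
    """Sum of error costs when transactions with score >= threshold are flagged."""
    cost = 0.0
    for label, score in zip(y_true, y_prob):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += COST_FP
        elif not flagged and label == 1:
            cost += COST_FN
    return cost


def best_threshold(y_true, y_prob, candidates):
    """Candidate threshold with the lowest total cost on the given data."""
    return min(candidates, key=lambda th: total_cost(y_true, y_prob, th))


# With calibrated probabilities, flagging pays off once p * COST_FN > (1 - p) * COST_FP.
break_even = COST_FP / (COST_FP + COST_FN)

# Invented validation data: 1 = fraud.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_prob = [0.01, 0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.90]
candidates = [0.1, 0.3, 0.5, 0.7]

chosen = best_threshold(y_true, y_prob, candidates)
print(chosen, total_cost(y_true, y_prob, chosen), round(break_even, 4))
```

On this toy data the cheapest candidate is well below 0.5, which matches the reasoning in the box: when a miss costs 50x more than a false alarm, the threshold should favor recall.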

Best Practices

Problem Framing Document Template

```markdown
## ML Problem Framing Document

### 1. Business Context
- Business objective: [What business outcome?]
- Current solution: [How is it solved today?]
- Success criteria: [What defines success?]

### 2. ML Formulation
- Prediction target: [What to predict?]
- Entity: [Predict for what/whom?]
- Prediction time: [When is prediction needed?]
- Problem type: [Classification/Regression/etc.]

### 3. Data Availability
- Label source: [Where do labels come from?]
- Feature sources: [What data is available?]
- Temporal constraints: [What's available at prediction time?]

### 4. Constraints
- Latency: [Max prediction time?]
- Interpretability: [Need explanations?]
- Fairness: [Protected attributes?]

### 5. Risks
- Leakage risks: [Potential leakage sources?]
- Distribution shift: [How might data change?]
- Feedback loops: [Does prediction affect future data?]
```

Operational Checklist

Problem Framing Gate (Gate 0)

| Check | Required | Owner | Notes |
|---|---|---|---|
| Business objective documented | | PM | Clear success criteria |
| ML objective defined | | ML Lead | Prediction target, entity, timing |
| Problem type selected | | ML Engineer | Classification/Regression/Ranking |
| Constraints identified | | Tech Lead | Latency, interpretability, fairness |
| Data availability confirmed | | Data Engineer | Labels, features, volume |
| Leakage risks assessed | | ML Lead | See leakage.md |
| Success metrics aligned | | PM + ML Lead | Business ↔ ML metric mapping |
| Baseline defined | | ML Engineer | What to compare against |

Pre-Experimentation Gate

| Check | Required | Owner |
|---|---|---|
| Problem framing doc approved | | ML Lead |
| Data access granted | | Data Governance |
| Compute resources allocated | | Platform |
| Experiment tracking configured | | ML Engineer |

📎 Cross-References