
🧪 Experimentation

Level: Advanced | Solves: Managing experiments systematically, ensuring reproducibility and collaboration

Why Does Experimentation Management Matter?

💡 Professor Tom

"If you can't reproduce a result, you don't have a result." Running ML experiments without proper tracking is like doing science without taking notes. A month later, you won't remember which hyperparameters produced the best result.

┌─────────────────────────────────────────────────────────────────┐
│              EXPERIMENTATION CHAOS vs ORDER                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WITHOUT TRACKING:                                              │
│  ─────────────────                                              │
│  • "Which notebook had the best model?"                         │
│  • "What hyperparameters did I use?"                            │
│  • "Why did this model perform better?"                         │
│  • "Can we reproduce last week's results?"                      │
│  • model_v2_final_FINAL_v3.pkl                                  │
│                                                                 │
│  WITH PROPER TRACKING:                                          │
│  ─────────────────────                                          │
│  • Every experiment logged with parameters                      │
│  • Metrics tracked over time                                    │
│  • Artifacts versioned and stored                               │
│  • Easy comparison between runs                                 │
│  • One-click reproduction                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Experiment Tracking Components

What to Track

┌─────────────────────────────────────────────────────────────────┐
│              EXPERIMENT TRACKING COMPONENTS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    PARAMETERS                            │    │
│  │  • Hyperparameters (learning_rate, n_estimators)         │    │
│  │  • Data parameters (train_size, features_used)           │    │
│  │  • Environment (python_version, library_versions)        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                     METRICS                              │    │
│  │  • Training metrics (loss, accuracy per epoch)           │    │
│  │  • Validation metrics (AUC, F1, RMSE)                    │    │
│  │  • Custom business metrics                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    ARTIFACTS                             │    │
│  │  • Model files (model.pkl, model.h5)                     │    │
│  │  • Plots (confusion_matrix.png, roc_curve.png)           │    │
│  │  • Data samples (predictions.csv)                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                     CODE                                 │    │
│  │  • Git commit hash                                       │    │
│  │  • Notebook snapshot                                     │    │
│  │  • Config files                                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Tracking Hierarchy

| Level      | Description                  | Example                       |
|------------|------------------------------|-------------------------------|
| Project    | High-level ML initiative     | "Customer Churn Prediction"   |
| Experiment | Specific approach/hypothesis | "XGBoost with new features"   |
| Run        | Single training execution    | "Run #42 with lr=0.01"        |
| Metric     | Measured value               | "val_auc=0.85"                |

MLflow

MLflow Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MLFLOW COMPONENTS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   MLflow        │  │   MLflow        │  │   MLflow        │  │
│  │   Tracking      │  │   Projects      │  │   Models        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Log params    │  │ • Package code  │  │ • Model format  │  │
│  │ • Log metrics   │  │ • Dependencies  │  │ • Model registry│  │
│  │ • Log artifacts │  │ • Entry points  │  │ • Deployment    │  │
│  │ • Compare runs  │  │ • Reproducible  │  │ • Versioning    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  MLflow Model Registry                   │    │
│  │  • Stage transitions (Staging → Production)              │    │
│  │  • Model versioning                                      │    │
│  │  • Approval workflows                                    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

MLflow Tracking Example

python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test are already defined
# (e.g. via sklearn.model_selection.train_test_split)

# Set experiment (created automatically if it doesn't exist)
mlflow.set_experiment("churn-prediction")

# Start run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "random_state": 42
    }
    mlflow.log_params(params)
    
    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("auc", roc_auc_score(y_test, y_proba))
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")
    
    # Log tags
    mlflow.set_tag("model_type", "random_forest")
    mlflow.set_tag("data_version", "v2.1")

MLflow Model Registry

python
# Register model (run_id comes from a previously completed tracking run)
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-model")

# Transition to staging
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=1,
    stage="Staging"
)

# Load model from registry
model = mlflow.pyfunc.load_model("models:/churn-model/Staging")

Weights & Biases (W&B)

W&B Features

┌─────────────────────────────────────────────────────────────────┐
│                    W&B FEATURES                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   Experiments   │  │   Artifacts     │  │   Sweeps        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Run tracking  │  │ • Dataset       │  │ • Hyperparameter│  │
│  │ • Comparison    │  │   versioning    │  │   optimization  │  │
│  │ • Visualization │  │ • Model storage │  │ • Distributed   │  │
│  │ • Collaboration │  │ • Lineage       │  │   search        │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   Reports       │  │   Tables        │  │   Alerts        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Shareable     │  │ • Data          │  │ • Metric        │  │
│  │   dashboards    │  │   exploration   │  │   thresholds    │  │
│  │ • Markdown      │  │ • Predictions   │  │ • Run failures  │  │
│  │ • Embedding     │  │ • Comparisons   │  │ • Notifications │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

W&B Example

python
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test are already defined

# Initialize run
wandb.init(
    project="churn-prediction",
    name="rf-baseline",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "learning_rate": 0.01,
    }
)

# Access config
config = wandb.config

# Train model
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Log metrics
wandb.log({
    "accuracy": accuracy_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test, preds=y_pred, class_names=["No Churn", "Churn"]
    )
})

# Log model artifact
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)

wandb.finish()

W&B Sweeps (Hyperparameter Tuning)

python
import wandb
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Define sweep config
sweep_config = {
    "method": "bayes",  # or "grid", "random"
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"values": [100, 200, 500]},
        "max_depth": {"min": 3, "max": 15},
        "learning_rate": {"min": 0.001, "max": 0.1, "distribution": "log_uniform_values"},
    }
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="churn-prediction")

# Define training function
def train():
    wandb.init()
    config = wandb.config
    
    model = XGBClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        learning_rate=config.learning_rate
    )
    model.fit(X_train, y_train)
    
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    wandb.log({"val_auc": auc})

# Run sweep
wandb.agent(sweep_id, train, count=50)

Reproducibility

Reproducibility Checklist

┌─────────────────────────────────────────────────────────────────┐
│              REPRODUCIBILITY CHECKLIST                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ☐ CODE VERSIONING                                              │
│    • Git commit hash logged                                     │
│    • No uncommitted changes                                     │
│    • Branch/tag recorded                                        │
│                                                                 │
│  ☐ DATA VERSIONING                                              │
│    • Data version/hash logged                                   │
│    • Data source documented                                     │
│    • Preprocessing steps recorded                               │
│                                                                 │
│  ☐ ENVIRONMENT                                                  │
│    • requirements.txt / environment.yml                         │
│    • Python version                                             │
│    • CUDA/GPU versions (if applicable)                          │
│                                                                 │
│  ☐ RANDOM SEEDS                                                 │
│    • All random seeds set and logged                            │
│    • numpy, random, torch seeds                                 │
│                                                                 │
│  ☐ CONFIGURATION                                                │
│    • All hyperparameters logged                                 │
│    • Config files versioned                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
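Several of the checklist items above can be captured automatically at the start of a run. The sketch below is one minimal way to do it: `capture_run_metadata` is a hypothetical helper (not part of MLflow or W&B) that collects the git commit hash, a dirty-tree flag, and the Python version, in a form that can be passed to `mlflow.set_tags` or merged into `wandb.config`:

```python
import platform
import subprocess

def capture_run_metadata() -> dict:
    """Collect code and environment info for the reproducibility checklist."""
    try:
        # "CODE VERSIONING": git commit hash of the current checkout
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
        # Flag uncommitted changes, which would break reproducibility
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip())
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git repo / git not installed: record that explicitly
        commit, dirty = "unknown", True

    return {
        "git_commit": commit,
        "git_dirty": str(dirty),
        "python_version": platform.python_version(),  # "ENVIRONMENT" item
    }
```

Calling `mlflow.set_tags(capture_run_metadata())` at the top of each run covers the code-versioning and Python-version items with one line.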

Setting Random Seeds

python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    # For CUDA determinism (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    # Set environment variable
    os.environ['PYTHONHASHSEED'] = str(seed)

# Use at the start of every experiment
set_seed(42)

Data Versioning with DVC

bash
# Initialize DVC
dvc init

# Track data file
dvc add data/train.csv

# Commit to git
git add data/train.csv.dvc .gitignore
git commit -m "Add training data v1"

# Configure a remote once (the URL below is a placeholder), then push data to it
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

# Pull data in another environment
dvc pull
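To tie a run to the exact data version it used, the md5 hash that DVC records in the `.dvc` file can be logged as a tag (e.g. the `data_version` tag in the MLflow example above). The sketch below uses a hypothetical helper, `dvc_data_hash`, doing a minimal line-based read of the `.dvc` file; in real projects the `dvc` Python API is the more robust option:

```python
from pathlib import Path

def dvc_data_hash(dvc_file: str) -> str:
    """Read the md5 hash DVC recorded for a tracked file.

    .dvc files are small YAML documents; the hash lives on a line
    like '- md5: 1a2b3c...'.
    """
    for line in Path(dvc_file).read_text().splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("md5:"):
            return line.split(":", 1)[1].strip()
    raise ValueError(f"No md5 entry found in {dvc_file}")

# e.g. mlflow.set_tag("data_version", dvc_data_hash("data/train.csv.dvc"))
```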

Experiment Organization

Project Structure

ml-project/
├── configs/
│   ├── base.yaml           # Base configuration
│   ├── experiment_1.yaml   # Experiment-specific overrides
│   └── experiment_2.yaml
├── data/
│   ├── raw/                # Original data (DVC tracked)
│   ├── processed/          # Processed data
│   └── features/           # Feature store
├── models/
│   └── artifacts/          # Saved models
├── notebooks/
│   ├── exploration/        # EDA notebooks
│   └── experiments/        # Experiment notebooks
├── src/
│   ├── data/               # Data processing
│   ├── features/           # Feature engineering
│   ├── models/             # Model definitions
│   └── evaluation/         # Evaluation utilities
├── tests/
├── mlflow.db               # MLflow tracking database
├── requirements.txt
└── README.md

Configuration Management with Hydra

python
# config/config.yaml
defaults:
  - model: xgboost
  - data: default

experiment:
  name: "baseline"
  seed: 42

# config/model/xgboost.yaml
model:
  name: xgboost
  params:
    n_estimators: 100
    max_depth: 6
    learning_rate: 0.1

# train.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def train(cfg: DictConfig):
    print(f"Experiment: {cfg.experiment.name}")
    print(f"Model: {cfg.model.name}")
    print(f"Params: {cfg.model.params}")
    
    # Training logic here...

if __name__ == "__main__":
    train()

# Run with overrides
# python train.py model.params.n_estimators=200 experiment.name=exp_v2

Best Practices

Experiment Naming Convention

| Component  | Convention           | Example               |
|------------|----------------------|-----------------------|
| Project    | {team}-{problem}     | ml-churn-prediction   |
| Experiment | {approach}-{version} | xgboost-v2            |
| Run        | {date}-{description} | 20240315-new-features |
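The `{date}-{description}` run convention above is easy to enforce with a small helper (`make_run_name` is a hypothetical utility, not part of any tracking library):

```python
import re
from datetime import date
from typing import Optional

def make_run_name(description: str, on: Optional[date] = None) -> str:
    """Build a run name following the {date}-{description} convention."""
    day = (on or date.today()).strftime("%Y%m%d")
    # Normalize free text into a lowercase, dash-separated slug
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{day}-{slug}"

# make_run_name("New Features", on=date(2024, 3, 15))
# → "20240315-new-features", matching the Run example above
```

The same name can then be passed to `mlflow.start_run(run_name=...)` or `wandb.init(name=...)`.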

Comparison and Analysis

python
# MLflow: Compare runs
import mlflow

# Get all runs from experiment
experiment = mlflow.get_experiment_by_name("churn-prediction")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Filter and sort
best_runs = runs[runs["metrics.auc"] > 0.8].sort_values(
    "metrics.auc", ascending=False
)

# Compare parameters of top runs
print(best_runs[["params.n_estimators", "params.max_depth", "metrics.auc"]])

Common Pitfalls

🚨 Experimentation Anti-Patterns

  • No tracking: Running experiments without logging
  • Incomplete logging: Missing parameters or metrics
  • No versioning: Can't reproduce results
  • Messy organization: Experiments scattered across notebooks
  • No comparison: Not analyzing what worked and why

📎 Cross-References