
🧪 Experimentation

Level: Advanced | Solves: Managing experiments systematically, ensuring reproducibility and collaboration

Why Does Experimentation Management Matter?

💡 Professor Tom

"If you can't reproduce a result, you don't have a result." Running ML experiments without proper tracking is like doing science without taking notes. A month later, you won't remember which hyperparameters produced the best result.

┌─────────────────────────────────────────────────────────────────┐
│              EXPERIMENTATION CHAOS vs ORDER                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WITHOUT TRACKING:                                              │
│  ─────────────────                                              │
│  • "Which notebook had the best model?"                         │
│  • "What hyperparameters did I use?"                            │
│  • "Why did this model perform better?"                         │
│  • "Can we reproduce last week's results?"                      │
│  • model_v2_final_FINAL_v3.pkl                                  │
│                                                                 │
│  WITH PROPER TRACKING:                                          │
│  ─────────────────────                                          │
│  • Every experiment logged with parameters                      │
│  • Metrics tracked over time                                    │
│  • Artifacts versioned and stored                               │
│  • Easy comparison between runs                                 │
│  • One-click reproduction                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Experiment Tracking Components

What to Track

┌─────────────────────────────────────────────────────────────────┐
│              EXPERIMENT TRACKING COMPONENTS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    PARAMETERS                            │    │
│  │  • Hyperparameters (learning_rate, n_estimators)         │    │
│  │  • Data parameters (train_size, features_used)           │    │
│  │  • Environment (python_version, library_versions)        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                     METRICS                              │    │
│  │  • Training metrics (loss, accuracy per epoch)           │    │
│  │  • Validation metrics (AUC, F1, RMSE)                    │    │
│  │  • Custom business metrics                               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    ARTIFACTS                             │    │
│  │  • Model files (model.pkl, model.h5)                     │    │
│  │  • Plots (confusion_matrix.png, roc_curve.png)           │    │
│  │  • Data samples (predictions.csv)                        │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                     CODE                                 │    │
│  │  • Git commit hash                                       │    │
│  │  • Notebook snapshot                                     │    │
│  │  • Config files                                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Tracking Hierarchy

| Level      | Description                  | Example                       |
|------------|------------------------------|-------------------------------|
| Project    | High-level ML initiative     | "Customer Churn Prediction"   |
| Experiment | Specific approach/hypothesis | "XGBoost with new features"   |
| Run        | Single training execution    | "Run #42 with lr=0.01"        |
| Metric     | Measured value               | "val_auc=0.85"                |

MLflow

MLflow Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MLFLOW COMPONENTS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   MLflow        │  │   MLflow        │  │   MLflow        │  │
│  │   Tracking      │  │   Projects      │  │   Models        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Log params    │  │ • Package code  │  │ • Model format  │  │
│  │ • Log metrics   │  │ • Dependencies  │  │ • Model registry│  │
│  │ • Log artifacts │  │ • Entry points  │  │ • Deployment    │  │
│  │ • Compare runs  │  │ • Reproducible  │  │ • Versioning    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  MLflow Model Registry                   │    │
│  │  • Stage transitions (Staging → Production)              │    │
│  │  • Model versioning                                      │    │
│  │  • Approval workflows                                    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

MLflow Tracking Example

python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test are already defined
# (e.g. via sklearn.model_selection.train_test_split)

# Set experiment (created automatically if it doesn't exist)
mlflow.set_experiment("churn-prediction")

# Start run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "random_state": 42
    }
    mlflow.log_params(params)
    
    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("auc", roc_auc_score(y_test, y_proba))
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")
    
    # Log tags
    mlflow.set_tag("model_type", "random_forest")
    mlflow.set_tag("data_version", "v2.1")

MLflow Model Registry

python
# Register model (run_id comes from a previously completed tracking run)
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-model")

# Transition to staging
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=1,
    stage="Staging"
)

# Load model from registry
model = mlflow.pyfunc.load_model("models:/churn-model/Staging")

Weights & Biases (W&B)

W&B Features

┌─────────────────────────────────────────────────────────────────┐
│                    W&B FEATURES                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   Experiments   │  │   Artifacts     │  │   Sweeps        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Run tracking  │  │ • Dataset       │  │ • Hyperparameter│  │
│  │ • Comparison    │  │   versioning    │  │   optimization  │  │
│  │ • Visualization │  │ • Model storage │  │ • Distributed   │  │
│  │ • Collaboration │  │ • Lineage       │  │   search        │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   Reports       │  │   Tables        │  │   Alerts        │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • Shareable     │  │ • Data          │  │ • Metric        │  │
│  │   dashboards    │  │   exploration   │  │   thresholds    │  │
│  │ • Markdown      │  │ • Predictions   │  │ • Run failures  │  │
│  │ • Embedding     │  │ • Comparisons   │  │ • Notifications │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

W&B Example

python
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumes X_train, X_test, y_train, y_test are already defined

# Initialize run
wandb.init(
    project="churn-prediction",
    name="rf-baseline",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "learning_rate": 0.01,
    }
)

# Access config
config = wandb.config

# Train model
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Log metrics
wandb.log({
    "accuracy": accuracy_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test, preds=y_pred, class_names=["No Churn", "Churn"]
    )
})

# Log model artifact
artifact = wandb.Artifact("model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)

wandb.finish()

W&B Sweeps (Hyperparameter Tuning)

python
import wandb
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Define sweep config
sweep_config = {
    "method": "bayes",  # or "grid", "random"
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"values": [100, 200, 500]},
        "max_depth": {"min": 3, "max": 15},
        "learning_rate": {"min": 0.001, "max": 0.1, "distribution": "log_uniform_values"},
    }
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="churn-prediction")

# Define training function
def train():
    wandb.init()
    config = wandb.config
    
    model = XGBClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        learning_rate=config.learning_rate
    )
    model.fit(X_train, y_train)
    
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    wandb.log({"val_auc": auc})

# Run sweep
wandb.agent(sweep_id, train, count=50)

Reproducibility

Reproducibility Checklist

┌─────────────────────────────────────────────────────────────────┐
│              REPRODUCIBILITY CHECKLIST                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ☐ CODE VERSIONING                                              │
│    • Git commit hash logged                                     │
│    • No uncommitted changes                                     │
│    • Branch/tag recorded                                        │
│                                                                 │
│  ☐ DATA VERSIONING                                              │
│    • Data version/hash logged                                   │
│    • Data source documented                                     │
│    • Preprocessing steps recorded                               │
│                                                                 │
│  ☐ ENVIRONMENT                                                  │
│    • requirements.txt / environment.yml                         │
│    • Python version                                             │
│    • CUDA/GPU versions (if applicable)                          │
│                                                                 │
│  ☐ RANDOM SEEDS                                                 │
│    • All random seeds set and logged                            │
│    • numpy, random, torch seeds                                 │
│                                                                 │
│  ☐ CONFIGURATION                                                │
│    • All hyperparameters logged                                 │
│    • Config files versioned                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
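Several of the checklist items above can be captured automatically at the start of a run. The sketch below is one minimal way to do it: `capture_run_metadata` is a hypothetical helper (not part of MLflow or W&B) that collects the git commit hash, a dirty-tree flag, and the Python version, in a form that can be passed to `mlflow.set_tags` or merged into `wandb.config`:

```python
import platform
import subprocess

def capture_run_metadata() -> dict:
    """Collect code and environment info for the reproducibility checklist."""
    try:
        # "CODE VERSIONING": git commit hash of the current checkout
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
        # Flag uncommitted changes, which would break reproducibility
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip())
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git repo / git not installed: record that explicitly
        commit, dirty = "unknown", True

    return {
        "git_commit": commit,
        "git_dirty": str(dirty),
        "python_version": platform.python_version(),  # "ENVIRONMENT" item
    }
```

Calling `mlflow.set_tags(capture_run_metadata())` at the top of each run covers the code-versioning and Python-version items with one line.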

Setting Random Seeds

python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    # For CUDA determinism (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    # Set environment variable
    os.environ['PYTHONHASHSEED'] = str(seed)

# Use at the start of every experiment
set_seed(42)

Data Versioning with DVC

bash
# Initialize DVC
dvc init

# Track data file
dvc add data/train.csv

# Commit to git
git add data/train.csv.dvc .gitignore
git commit -m "Add training data v1"

# Configure a remote once (the URL below is a placeholder), then push data to it
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

# Pull data in another environment
dvc pull
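To tie a run to the exact data version it used, the md5 hash that DVC records in the `.dvc` file can be logged as a tag (e.g. the `data_version` tag in the MLflow example above). The sketch below uses a hypothetical helper, `dvc_data_hash`, doing a minimal line-based read of the `.dvc` file; in real projects the `dvc` Python API is the more robust option:

```python
from pathlib import Path

def dvc_data_hash(dvc_file: str) -> str:
    """Read the md5 hash DVC recorded for a tracked file.

    .dvc files are small YAML documents; the hash lives on a line
    like '- md5: 1a2b3c...'.
    """
    for line in Path(dvc_file).read_text().splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("md5:"):
            return line.split(":", 1)[1].strip()
    raise ValueError(f"No md5 entry found in {dvc_file}")

# e.g. mlflow.set_tag("data_version", dvc_data_hash("data/train.csv.dvc"))
```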

Experiment Organization

Project Structure

ml-project/
├── configs/
│   ├── base.yaml           # Base configuration
│   ├── experiment_1.yaml   # Experiment-specific overrides
│   └── experiment_2.yaml
├── data/
│   ├── raw/                # Original data (DVC tracked)
│   ├── processed/          # Processed data
│   └── features/           # Feature store
├── models/
│   └── artifacts/          # Saved models
├── notebooks/
│   ├── exploration/        # EDA notebooks
│   └── experiments/        # Experiment notebooks
├── src/
│   ├── data/               # Data processing
│   ├── features/           # Feature engineering
│   ├── models/             # Model definitions
│   └── evaluation/         # Evaluation utilities
├── tests/
├── mlflow.db               # MLflow tracking database
├── requirements.txt
└── README.md

Configuration Management with Hydra

python
# config/config.yaml
defaults:
  - model: xgboost
  - data: default

experiment:
  name: "baseline"
  seed: 42

# config/model/xgboost.yaml
model:
  name: xgboost
  params:
    n_estimators: 100
    max_depth: 6
    learning_rate: 0.1

# train.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def train(cfg: DictConfig):
    print(f"Experiment: {cfg.experiment.name}")
    print(f"Model: {cfg.model.name}")
    print(f"Params: {cfg.model.params}")
    
    # Training logic here...

if __name__ == "__main__":
    train()

# Run with overrides
# python train.py model.params.n_estimators=200 experiment.name=exp_v2

Best Practices

Experiment Naming Convention

| Component  | Convention           | Example               |
|------------|----------------------|-----------------------|
| Project    | {team}-{problem}     | ml-churn-prediction   |
| Experiment | {approach}-{version} | xgboost-v2            |
| Run        | {date}-{description} | 20240315-new-features |
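The `{date}-{description}` run convention above is easy to enforce with a small helper (`make_run_name` is a hypothetical utility, not part of any tracking library):

```python
import re
from datetime import date
from typing import Optional

def make_run_name(description: str, on: Optional[date] = None) -> str:
    """Build a run name following the {date}-{description} convention."""
    day = (on or date.today()).strftime("%Y%m%d")
    # Normalize free text into a lowercase, dash-separated slug
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{day}-{slug}"

# make_run_name("New Features", on=date(2024, 3, 15))
# → "20240315-new-features", matching the Run example above
```

The same name can then be passed to `mlflow.start_run(run_name=...)` or `wandb.init(name=...)`.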

Comparison and Analysis

python
# MLflow: Compare runs
import mlflow

# Get all runs from experiment
experiment = mlflow.get_experiment_by_name("churn-prediction")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Filter and sort
best_runs = runs[runs["metrics.auc"] > 0.8].sort_values(
    "metrics.auc", ascending=False
)

# Compare parameters of top runs
print(best_runs[["params.n_estimators", "params.max_depth", "metrics.auc"]])

Common Pitfalls

🚨 Experimentation Anti-Patterns

  • No tracking: Running experiments without logging
  • Incomplete logging: Missing parameters or metrics
  • No versioning: Can't reproduce results
  • Messy organization: Experiments scattered across notebooks
  • No comparison: Not analyzing what worked and why

📎 Cross-References