🎓 Training Foundations
Level: Foundation. Solves: A deep understanding of the fundamentals of neural network training, from optimization to regularization.
Why do Training Foundations matter?
💡 Professor Tom
Deep learning is not magic; it is applied optimization. A solid grasp of gradient descent, loss landscapes, and regularization will let you debug training issues 10x faster than copy-pasting code from tutorials.
┌─────────────────────────────────────────────────────────────────┐
│ DEEP LEARNING TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │ → │ Forward │ → │ Loss │ → │Backward │ │
│ │ Batch │ │ Pass │ │ Compute │ │ Pass │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ │ ┌─────────────────────────┐ │ │
│ └─────────│ Weight Update │←───────┘ │
│ │ (Optimizer Step) │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Optimization Fundamentals
Gradient Descent Variants
| Variant | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Full dataset | Stable gradients | Slow, memory intensive |
| Stochastic GD | 1 sample | Fast updates | Noisy gradients |
| Mini-batch GD | 32-512 | Balance speed/stability | Hyperparameter tuning |
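The pipeline diagram and the mini-batch trade-off above can be sketched end to end in PyTorch. This is a minimal sketch on synthetic regression data; the batch size of 32, learning rate, and epoch count are illustrative choices, not prescriptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(256, 10)                              # 256 samples, 10 features
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(20):
    for xb, yb in loader:
        opt.zero_grad()                  # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)    # forward pass + loss compute
        loss.backward()                  # backward pass
        opt.step()                       # weight update (optimizer step)

final_loss = loss_fn(model(X), y).item()
print(final_loss)
```

Each inner iteration is exactly one trip around the pipeline diagram: batch → forward → loss → backward → optimizer step.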
Modern Optimizers
┌─────────────────────────────────────────────────────────────────┐
│ OPTIMIZER EVOLUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SGD → SGD+Momentum → RMSprop → Adam → AdamW │
│ │ │ │ │ │ │
│ │ │ │ │ └─ Weight decay fix │
│ │ │ │ └─ Momentum + Adaptive LR │
│ │ │ └─ Adaptive learning rates │
│ │ └─ Accelerated convergence │
│ └─ Basic gradient descent │
│ │
└─────────────────────────────────────────────────────────────────┘
Optimizer Selection Guide
| Optimizer | Best For | Learning Rate | Notes |
|---|---|---|---|
| SGD+Momentum | CNNs, well-tuned models | 0.01-0.1 | Often best final performance |
| Adam | Transformers, quick prototyping | 1e-4 to 3e-4 | Good default choice |
| AdamW | Transformers with regularization | 1e-4 to 3e-4 | Proper weight decay |
| LAMB | Large batch training | Scaled | For batch sizes > 8K |
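As a sketch, the table's recommendations map to PyTorch like this. The placeholder model, the 500-step warmup, and the 10k total steps are illustrative; LAMB is not in core PyTorch and is omitted:

```python
import math
import torch
from torch import nn

model = nn.Linear(16, 4)  # placeholder model

# SGD + momentum: common for CNNs, LR in the 0.01-0.1 range
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# AdamW: decoupled weight decay, a common default for transformers
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Warmup + cosine decay, a common pairing: the multiplier ramps linearly
# from 0 to 1 over `warmup` steps, then follows a cosine down to 0
warmup, total = 500, 10_000
sched = torch.optim.lr_scheduler.LambdaLR(
    opt,
    lambda step: step / warmup if step < warmup
    else 0.5 * (1 + math.cos(math.pi * (step - warmup) / (total - warmup))))
```

Call `sched.step()` once per optimizer step (not per epoch) for step-based schedules like this one.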
Learning Rate Schedules
┌─────────────────────────────────────────────────────────────────┐
│ LEARNING RATE SCHEDULE PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step Decay Cosine Annealing Warmup + Decay │
│ ────────── ──────────────── ────────────── │
│ LR│ ▄▄▄▄ LR│ ╭──╮ LR│ ╭────╮ │
│ │ ▄▄▄▄ │ ╱ ╲ │ ╱ ╲ │
│ │ ▄▄▄▄ │ ╱ ╲ │ ╱ ╲ │
│ │ ▄▄ │ ╱ ╲ │╱ ╲ │
│ └──────────────→ └───────────→ └────────────→ │
│ Epochs Epochs Epochs │
│ │
│ Use: CNNs, stable Use: Transformers Use: Large models │
│ training fine-tuning from scratch │
│ │
└─────────────────────────────────────────────────────────────────┘
Loss Functions
Classification Losses
| Loss | Formula | Use Case |
|---|---|---|
| Cross-Entropy | -Σ y·log(ŷ) | Multi-class classification |
| Binary CE | -y·log(ŷ) - (1-y)·log(1-ŷ) | Binary classification |
| Focal Loss | -α(1-ŷ)^γ·log(ŷ) | Imbalanced datasets |
| Label Smoothing | CE with soft targets | Regularization, calibration |
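The focal loss row can be sketched for the binary case, following -α(1-ŷ)^γ·log(ŷ) from the table. The α=0.25, γ=2 values are common defaults, not prescriptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, confident examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```

The modulating factor is what makes focal loss useful for imbalance: a confidently correct example contributes almost nothing, so training focuses on the hard minority cases.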
Regression Losses
| Loss | Formula | Characteristics |
|---|---|---|
| MSE (L2) | (y - ŷ)² | Penalizes large errors heavily |
| MAE (L1) | \|y - ŷ\| | Robust to outliers |
| Huber | L2 if small, L1 if large | Best of both worlds |
| Log-Cosh | log(cosh(y - ŷ)) | Smooth approximation of Huber |
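A quick comparison of the table's behavior on one small error and one outlier (δ=1.0 is PyTorch's default for Huber):

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.0, 0.0])
target = torch.tensor([0.5, 10.0])   # one small error, one outlier

mse = F.mse_loss(pred, target, reduction="none")
mae = F.l1_loss(pred, target, reduction="none")
huber = F.huber_loss(pred, target, reduction="none", delta=1.0)

# Small error (|e| <= δ): Huber is quadratic, 0.5·e² = 0.125
# Outlier   (|e| >  δ): Huber is linear, δ·(|e| - 0.5δ) = 9.5 (vs MSE's 100)
print(mse, mae, huber)
```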
Specialized Losses
┌─────────────────────────────────────────────────────────────────┐
│ SPECIALIZED LOSS FUNCTIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CONTRASTIVE LEARNING │
│ ──────────────────── │
│ • InfoNCE: Self-supervised representation learning │
│ • Triplet Loss: Metric learning (anchor, positive, negative) │
│ • NT-Xent: SimCLR-style contrastive loss │
│ │
│ SEQUENCE MODELING │
│ ───────────────── │
│ • CTC Loss: Speech recognition, OCR │
│ • Sequence CE: Language modeling │
│ │
│ GENERATIVE MODELS │
│ ───────────────── │
│ • Reconstruction Loss: Autoencoders │
│ • KL Divergence: VAEs │
│ • Adversarial Loss: GANs │
│ │
└─────────────────────────────────────────────────────────────────┘
Weight Initialization
Why Initialization Matters
🚨 Bad Initialization
Bad initialization can lead to:
- Vanishing gradients: weights too small → gradients → 0
- Exploding gradients: weights too large → gradients → ∞
- Dead neurons: ReLU neurons stuck at 0
- Slow convergence: training takes far longer than it should
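The vanishing case is easy to demonstrate: push a batch through 30 ReLU layers with under-scaled random weights and the signal collapses, while He-scaled weights (std = √(2/n_in), see the table below) keep it alive. The depth, width, and 0.01 scale here are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 256)
depth, n = 30, 256

# Under-scaled weights: the signal (and hence the gradients) decays layer by layer
small = [torch.randn(n, n) * 0.01 for _ in range(depth)]
# He-scaled weights: std = sqrt(2/n_in), designed to preserve variance under ReLU
he = [torch.randn(n, n) * (2.0 / n) ** 0.5 for _ in range(depth)]

def propagate(x, weights):
    for w in weights:
        x = torch.relu(x @ w)
    return x.std().item()

small_std = propagate(x, small)   # collapses toward 0
he_std = propagate(x, he)         # stays on the order of 1
print(small_std, he_std)
```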
Initialization Strategies
| Method | Formula | Best For |
|---|---|---|
| Xavier/Glorot | U(-√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Tanh, Sigmoid activations |
| He/Kaiming | N(0, √(2/n_in)) | ReLU activations |
| Orthogonal | QR decomposition | RNNs, deep networks |
| LSUV | Layer-sequential unit-variance | Very deep networks |
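A sketch of applying the table in PyTorch via `Module.apply`; the zero-bias choice and the two-layer architecture are conventional/illustrative:

```python
import torch
from torch import nn

def init_weights(m):
    # He (Kaiming) for layers feeding ReLU; use xavier_uniform_ instead
    # for tanh/sigmoid layers, per the table above
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)

# He normal: std = sqrt(2 / fan_in); check the first layer empirically
print(model[0].weight.std().item())   # ≈ sqrt(2/128) ≈ 0.125
```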
Initialization by Layer Type
┌─────────────────────────────────────────────────────────────────┐
│ INITIALIZATION BY LAYER TYPE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer Type Recommended Init Notes │
│ ────────── ──────────────── ───── │
│ │
│ Linear + ReLU He (Kaiming) fan_in mode │
│ Linear + Tanh Xavier (Glorot) Uniform or Normal │
│ Conv2d + ReLU He (Kaiming) fan_out for backprop│
│ LSTM/GRU Orthogonal For recurrent weights│
│ Embedding Normal(0, 0.02) Or uniform small │
│ LayerNorm weight=1, bias=0 Standard practice │
│ BatchNorm weight=1, bias=0 Standard practice │
│ │
└─────────────────────────────────────────────────────────────────┘
Regularization Techniques
Overview of Regularization
┌─────────────────────────────────────────────────────────────────┐
│ REGULARIZATION TAXONOMY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ EXPLICIT │ │ IMPLICIT │ │ DATA-BASED │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • L1/L2 penalty │ │ • Early stopping│ │ • Data augment │ │
│ │ • Dropout │ │ • Batch norm │ │ • Mixup/CutMix │ │
│ │ • Weight decay │ │ • Noise inject │ │ • Label smooth │ │
│ │ • Max-norm │ │ • Gradient clip │ │ • Curriculum │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Dropout
| Variant | Description | Use Case |
|---|---|---|
| Standard Dropout | Random zero-out during training | Fully connected layers |
| Spatial Dropout | Drop entire feature maps | CNNs |
| DropConnect | Drop weights instead of activations | Alternative to dropout |
| DropBlock | Drop contiguous regions | CNNs, better than spatial |
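Spatial dropout's "drop entire feature maps" semantics can be verified directly; p = 0.5 and the tensor shape are illustrative:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.ones(1, 8, 4, 4)          # (batch, channels, H, W)

drop2d = nn.Dropout2d(p=0.5)        # spatial dropout: zeroes whole feature maps
drop2d.train()
out = drop2d(x)                     # each channel is all-zero or scaled by 1/(1-p) = 2

drop2d.eval()
same = torch.equal(drop2d(x), x)    # dropout is the identity at inference
print(same)                         # True
```

The 1/(1-p) rescaling ("inverted dropout") is what lets the eval-mode forward pass skip any correction.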
Weight Decay vs L2 Regularization
⚠️ Subtle Difference
Weight decay and L2 regularization are NOT the same thing under adaptive optimizers (Adam):
- L2 Reg: adds (λ/2)·w² to the loss → the gradient includes λ·w
- Weight Decay: decays the weights directly → w = w - lr·λ·w
AdamW implements proper weight decay; Adam with L2 regularization behaves differently!
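The difference can be isolated with a toy experiment: identical parameters, zero data gradient, one decay mechanism each (the values 10.0/0.1 and lr/λ are illustrative):

```python
import torch

w_adam = torch.nn.Parameter(torch.tensor([10.0, 0.1]))
w_adamw = torch.nn.Parameter(torch.tensor([10.0, 0.1]))
adam = torch.optim.Adam([w_adam], lr=0.1, weight_decay=0.1)     # L2 folded into grads
adamw = torch.optim.AdamW([w_adamw], lr=0.1, weight_decay=0.1)  # decoupled decay

# Zero "data" gradient isolates the regularization effect
w_adam.grad = torch.zeros_like(w_adam)
w_adamw.grad = torch.zeros_like(w_adamw)
for _ in range(100):
    adam.step()
    adamw.step()

# AdamW shrinks both weights by the same factor (1 - lr·λ)^100 ≈ 0.366,
# preserving their ratio. Adam routes λ·w through the adaptive scaling,
# turning it into near-constant steps that hit small weights much harder.
print(w_adam.data, w_adamw.data)
```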
Batch Normalization
┌─────────────────────────────────────────────────────────────────┐
│ BATCH NORMALIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Training: │
│ ───────── │
│ 1. Compute batch mean: μ_B = (1/m) Σ x_i │
│ 2. Compute batch var: σ²_B = (1/m) Σ (x_i - μ_B)² │
│ 3. Normalize: x̂_i = (x_i - μ_B) / √(σ²_B + ε) │
│ 4. Scale & shift: y_i = γ·x̂_i + β │
│ │
│ Inference: │
│ ────────── │
│ Use running mean/var computed during training │
│ │
│ Benefits: │
│ • Faster training (higher learning rates) │
│ • Reduces internal covariate shift │
│ • Acts as regularization │
│ │
│ Pitfalls: │
│ • Small batch sizes → noisy statistics │
│ • Different behavior train vs inference │
│ │
└─────────────────────────────────────────────────────────────────┘
Layer Normalization vs Batch Normalization
| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes | No |
| Best for | CNNs | Transformers, RNNs |
| Inference behavior | Uses running stats | Same as training |
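The train/inference asymmetry in the last row can be verified directly (batch and feature sizes are arbitrary):

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 16)              # batch of 8 samples, 16 features

bn = nn.BatchNorm1d(16)             # normalizes each feature across the batch
ln = nn.LayerNorm(16)               # normalizes each sample across its features

bn.train()
out_bn_train = bn(x)                # uses batch statistics, updates running stats
bn.eval()
out_bn_eval = bn(x)                 # uses running statistics: different output

ln.train()
out_ln_train = ln(x)
ln.eval()
out_ln_eval = ln(x)                 # identical: LayerNorm keeps no running stats

print(torch.allclose(out_bn_train, out_bn_eval))  # False
print(torch.allclose(out_ln_train, out_ln_eval))  # True
```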
Gradient Flow
Vanishing/Exploding Gradients
┌─────────────────────────────────────────────────────────────────┐
│ GRADIENT FLOW PROBLEMS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VANISHING GRADIENTS │
│ ─────────────────── │
│ Symptoms: │
│ • Early layers don't learn │
│ • Loss plateaus early │
│ • Gradient norms → 0 │
│ │
│ Causes: │
│ • Sigmoid/Tanh saturation │
│ • Too many layers │
│ • Poor initialization │
│ │
│ Solutions: │
│ • ReLU activations │
│ • Skip connections (ResNet) │
│ • Proper initialization │
│ • Batch/Layer normalization │
│ │
│ EXPLODING GRADIENTS │
│ ─────────────────── │
│ Symptoms: │
│ • NaN losses │
│ • Weights → ∞ │
│ • Unstable training │
│ │
│ Solutions: │
│ • Gradient clipping │
│ • Lower learning rate │
│ • Proper initialization │
│ │
└─────────────────────────────────────────────────────────────────┘
Gradient Clipping
```python
import torch

# model is assumed to be an nn.Module defined elsewhere

# Gradient clipping by norm (recommended): rescales all gradients together
# so their combined L2 norm is at most max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Gradient clipping by value: clamps each gradient element to [-clip_value, clip_value]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```

| Method | When to Use | Typical Values |
|---|---|---|
| Clip by Norm | RNNs, Transformers | 1.0 - 5.0 |
| Clip by Value | Specific gradient control | 0.5 - 1.0 |
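Placement matters: clipping must happen after `backward()` (so gradients exist) and before `step()` (before they are consumed). A minimal sketch with an illustrative model and batch:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                        # gradients now populated
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()                             # consumes the (possibly rescaled) gradients

# clip_grad_norm_ returns the pre-clip norm, handy for logging gradient health
print(total_norm.item())
```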
Training Best Practices
Hyperparameter Priority
┌─────────────────────────────────────────────────────────────────┐
│ HYPERPARAMETER TUNING PRIORITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Priority 1 (Tune First): │
│ • Learning rate │
│ • Batch size │
│ • Number of epochs │
│ │
│ Priority 2 (Tune Second): │
│ • Model architecture (depth, width) │
│ • Optimizer choice │
│ • Learning rate schedule │
│ │
│ Priority 3 (Fine-tune): │
│ • Regularization strength │
│ • Dropout rate │
│ • Weight decay │
│ │
│ Priority 4 (Usually defaults work): │
│ • Optimizer betas (Adam) │
│ • Epsilon values │
│ • Initialization details │
│ │
└─────────────────────────────────────────────────────────────────┘
Training Checklist
✅ Before Training
- Data: Verify data loading, augmentation, normalization
- Model: Check parameter count, forward pass works
- Loss: Verify loss computation on dummy data
- Overfit: Can model overfit small batch? (sanity check)
- Baseline: Compare against simple baseline
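The overfit sanity check can be scripted. The architecture, step count, and loss threshold here are illustrative; the point is only that a healthy model/loss/optimizer setup should drive the loss on one tiny batch toward zero:

```python
import torch
from torch import nn

torch.manual_seed(0)
xb = torch.randn(16, 10)                 # one small batch
yb = torch.randint(0, 3, (16,))          # arbitrary labels to memorize

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=5e-3)

for step in range(500):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    opt.step()

print(loss.item())  # should be near 0; if it is not, debug before training at scale
```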
📎 Cross-References
- 📎 Architectures Overview - Neural network architectures
- 📎 Debugging Training - Troubleshooting training issues
- 📎 Scaling Training - Distributed training strategies
- 📎 ML Evaluation - Evaluation fundamentals
- 📎 ML Feature Engineering - Feature engineering for ML
- 📎 LLM Architecture - LLM training specifics