🎓 Training Foundations
Level: Foundation. Goal: Understand the fundamentals of neural network training in depth, from optimization to regularization.
Why do Training Foundations matter?
💡 Professor Tom
Deep learning is not magic - it is applied optimization. A solid grasp of gradient descent, loss landscapes, and regularization will let you debug training issues 10x faster than just copy-pasting code from tutorials.
┌─────────────────────────────────────────────────────────────────┐
│ DEEP LEARNING TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │ → │ Forward │ → │ Loss │ → │Backward │ │
│ │ Batch │ │ Pass │ │ Compute │ │ Pass │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ │ ┌─────────────────────────┐ │ │
│ └─────────│ Weight Update │←───────┘ │
│ │ (Optimizer Step) │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Optimization Fundamentals
Gradient Descent Variants
| Variant | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Full dataset | Stable gradients | Slow, memory intensive |
| Stochastic GD | 1 sample | Fast updates | Noisy gradients |
| Mini-batch GD | 32-512 | Balance speed/stability | Hyperparameter tuning |
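To make the table concrete, below is a minimal mini-batch training loop in PyTorch. It is a sketch only: the toy dataset, model, and hyperparameters are illustrative placeholders, not part of this course's codebase.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and model, purely for illustration
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64, shuffle=True)   # mini-batch GD
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:                # Data batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # Forward pass + loss compute
        loss.backward()                  # Backward pass
        optimizer.step()                 # Weight update (optimizer step)
```

Setting `batch_size=1` turns this into stochastic GD, and `batch_size=len(dataset)` into full-batch GD, which is exactly the trade-off the table describes.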
Modern Optimizers
┌─────────────────────────────────────────────────────────────────┐
│ OPTIMIZER EVOLUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SGD → SGD+Momentum → RMSprop → Adam → AdamW │
│ │ │ │ │ │ │
│ │ │ │ │ └─ Weight decay fix │
│ │ │ │ └─ Momentum + Adaptive LR │
│ │ │ └─ Adaptive learning rates │
│ │ └─ Accelerated convergence │
│ └─ Basic gradient descent │
│ │
└─────────────────────────────────────────────────────────────────┘
Optimizer Selection Guide
| Optimizer | Best For | Learning Rate | Notes |
|---|---|---|---|
| SGD+Momentum | CNNs, well-tuned models | 0.01-0.1 | Often best final performance |
| Adam | Transformers, quick prototyping | 1e-4 to 3e-4 | Good default choice |
| AdamW | Transformers with regularization | 1e-4 to 3e-4 | Proper weight decay |
| LAMB | Large batch training | Scaled | For batch sizes > 8K |
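As a minimal sketch, the two most common rows of this table translate to the following PyTorch setup; the learning rates and weight decay values are just the heuristic starting points listed above and should be tuned per model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model, for illustration only

# SGD + Momentum: common for CNNs, often the best final accuracy when well tuned
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# AdamW: a solid default for Transformers and quick prototyping
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```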
Learning Rate Schedules
┌─────────────────────────────────────────────────────────────────┐
│ LEARNING RATE SCHEDULE PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step Decay Cosine Annealing Warmup + Decay │
│ ────────── ──────────────── ────────────── │
│ LR│ ▄▄▄▄ LR│ ╭──╮ LR│ ╭────╮ │
│ │ ▄▄▄▄ │ ╱ ╲ │ ╱ ╲ │
│ │ ▄▄▄▄ │ ╱ ╲ │ ╱ ╲ │
│ │ ▄▄ │ ╱ ╲ │╱ ╲ │
│ └──────────────→ └───────────→ └────────────→ │
│ Epochs Epochs Epochs │
│ │
│ Use: CNNs, stable Use: Transformers Use: Large models │
│ training fine-tuning from scratch │
│ │
└─────────────────────────────────────────────────────────────────┘
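A warmup + decay schedule can be built from `torch.optim.lr_scheduler.LambdaLR`; the sketch below uses assumed step counts (100 warmup steps, 1000 total) and an illustrative base learning rate:

```python
import math
import torch

model = torch.nn.Linear(10, 1)   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 100, 1000

def warmup_cosine(step):
    # Linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# In the training loop: call optimizer.step() first, then scheduler.step()
```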
Loss Functions
Classification Losses
| Loss | Formula | Use Case |
|---|---|---|
| Cross-Entropy | -Σ y·log(ŷ) | Multi-class classification |
| Binary CE | -y·log(ŷ) - (1-y)·log(1-ŷ) | Binary classification |
| Focal Loss | -α(1-ŷ)^γ·log(ŷ) | Imbalanced datasets |
| Label Smoothing | CE with soft targets | Regularization, calibration |
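For concreteness, a sketch of two of these losses in PyTorch. The `focal_loss` helper is one common way to implement the formula in the table, not a library function, and `label_smoothing` assumes a reasonably recent PyTorch release:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)               # (batch, num_classes), illustrative
targets = torch.randint(0, 5, (8,))

# Cross-entropy with label smoothing (soft targets)
ce = F.cross_entropy(logits, targets, label_smoothing=0.1)

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights easy examples: -alpha * (1 - p_t)^gamma * log(p_t)
    ce_per_sample = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce_per_sample)        # probability of the true class
    return (alpha * (1 - pt) ** gamma * ce_per_sample).mean()

fl = focal_loss(logits, targets)
```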
Regression Losses
| Loss | Formula | Characteristics |
|---|---|---|
| MSE (L2) | (y - ŷ)² | Penalizes large errors heavily |
| MAE (L1) | \|y - ŷ\| | Robust to outliers |
| Huber | L2 for small errors, L1 for large errors | Best of both worlds |
| Log-Cosh | log(cosh(y - ŷ)) | Smooth approximation of Huber |
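All three standard regression losses are available directly in PyTorch; a quick comparison sketch with illustrative tensors:

```python
import torch
import torch.nn as nn

pred, target = torch.randn(16, 1), torch.randn(16, 1)

mse = nn.MSELoss()(pred, target)               # penalizes large errors heavily
mae = nn.L1Loss()(pred, target)                # robust to outliers
huber = nn.HuberLoss(delta=1.0)(pred, target)  # quadratic near zero, linear far away
```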
Specialized Losses
┌─────────────────────────────────────────────────────────────────┐
│ SPECIALIZED LOSS FUNCTIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CONTRASTIVE LEARNING │
│ ──────────────────── │
│ • InfoNCE: Self-supervised representation learning │
│ • Triplet Loss: Metric learning (anchor, positive, negative) │
│ • NT-Xent: SimCLR-style contrastive loss │
│ │
│ SEQUENCE MODELING │
│ ───────────────── │
│ • CTC Loss: Speech recognition, OCR │
│ • Sequence CE: Language modeling │
│ │
│ GENERATIVE MODELS │
│ ───────────────── │
│ • Reconstruction Loss: Autoencoders │
│ • KL Divergence: VAEs │
│ • Adversarial Loss: GANs │
│ │
└─────────────────────────────────────────────────────────────────┘
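As one concrete example from the metric-learning family, a sketch using PyTorch's built-in triplet loss on random embeddings (the shapes and margin are illustrative):

```python
import torch
import torch.nn as nn

# Embeddings for anchor, positive (same class), negative (different class) samples
anchor, positive, negative = (torch.randn(32, 128) for _ in range(3))

triplet = nn.TripletMarginLoss(margin=1.0, p=2)
loss = triplet(anchor, positive, negative)  # pulls positives closer, pushes negatives away
```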
Weight Initialization
Why Initialization Matters
🚨 Bad Initialization
Bad initialization can lead to:
- Vanishing gradients: weights too small → gradients → 0
- Exploding gradients: weights too large → gradients → ∞
- Dead neurons: ReLU neurons stuck at 0
- Slow convergence: Training takes forever
Initialization Strategies
| Method | Formula | Best For |
|---|---|---|
| Xavier/Glorot | U(-√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Tanh, Sigmoid activations |
| He/Kaiming | N(0, √(2/n_in)) | ReLU activations |
| Orthogonal | QR decomposition | RNNs, deep networks |
| LSUV | Layer-sequential unit-variance | Very deep networks |
Initialization by Layer Type
┌─────────────────────────────────────────────────────────────────┐
│ INITIALIZATION BY LAYER TYPE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer Type Recommended Init Notes │
│ ────────── ──────────────── ───── │
│ │
│ Linear + ReLU He (Kaiming) fan_in mode │
│ Linear + Tanh Xavier (Glorot) Uniform or Normal │
│ Conv2d + ReLU He (Kaiming) fan_out for backprop│
│ LSTM/GRU Orthogonal For recurrent weights│
│ Embedding Normal(0, 0.02) Or uniform small │
│ LayerNorm weight=1, bias=0 Standard practice │
│ BatchNorm weight=1, bias=0 Standard practice │
│ │
└─────────────────────────────────────────────────────────────────┘
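One way to wire the layer-type table above into code is an `init_weights` dispatcher applied with `model.apply`. The sketch below covers the ReLU-facing and normalization cases; Xavier and orthogonal branches for tanh and recurrent layers would be added the same way:

```python
import torch.nn as nn

def init_weights(module):
    # Dispatch initialization by layer type, following the table above
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d)):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)   # recursively visits every submodule
```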
Regularization Techniques
Overview of Regularization
┌─────────────────────────────────────────────────────────────────┐
│ REGULARIZATION TAXONOMY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ EXPLICIT │ │ IMPLICIT │ │ DATA-BASED │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • L1/L2 penalty │ │ • Early stopping│ │ • Data augment │ │
│ │ • Dropout │ │ • Batch norm │ │ • Mixup/CutMix │ │
│ │ • Weight decay │ │ • Noise inject │ │ • Label smooth │ │
│ │ • Max-norm │ │ • Gradient clip │ │ • Curriculum │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Dropout
| Variant | Description | Use Case |
|---|---|---|
| Standard Dropout | Random zero-out during training | Fully connected layers |
| Spatial Dropout | Drop entire feature maps | CNNs |
| DropConnect | Drop weights instead of activations | Alternative to dropout |
| DropBlock | Drop contiguous regions | CNNs, better than spatial |
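A short sketch contrasting standard and spatial dropout, and showing that dropout is only active in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)       # standard dropout, for fully connected layers
drop2d = nn.Dropout2d(p=0.2)   # spatial dropout: zeroes entire feature maps

x = torch.randn(4, 16, 8, 8)   # (batch, channels, H, W), illustrative

drop2d.train()
y_train = drop2d(x)            # random channels zeroed, the rest rescaled by 1/(1-p)
drop2d.eval()
y_eval = drop2d(x)             # identity at inference time
```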
Weight Decay vs L2 Regularization
⚠️ Subtle Difference
Weight decay and L2 regularization are NOT the same thing with adaptive optimizers (Adam):
- L2 Reg: Adds λ·w² to the loss → the gradient gains a λ·w term
- Weight Decay: Decays the weights directly → w = w - lr·λ·w

AdamW implements proper decoupled weight decay; Adam with L2 regularization behaves differently!
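A minimal sketch of the two setups (hyperparameter values are illustrative defaults); the difference is in how the decay interacts with Adam's adaptive scaling, not in the API:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model

# Adam + weight_decay adds lambda*w to the gradient BEFORE the adaptive scaling,
# so the effective penalty differs per parameter (classic L2 behavior)
adam_l2 = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW decays weights directly, decoupled from the adaptive update
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```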
Batch Normalization
┌─────────────────────────────────────────────────────────────────┐
│ BATCH NORMALIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Training: │
│ ───────── │
│ 1. Compute batch mean: μ_B = (1/m) Σ x_i │
│ 2. Compute batch var: σ²_B = (1/m) Σ (x_i - μ_B)² │
│ 3. Normalize: x̂_i = (x_i - μ_B) / √(σ²_B + ε) │
│ 4. Scale & shift: y_i = γ·x̂_i + β │
│ │
│ Inference: │
│ ────────── │
│ Use running mean/var computed during training │
│ │
│ Benefits: │
│ • Faster training (higher learning rates) │
│ • Reduces internal covariate shift │
│ • Acts as regularization │
│ │
│ Pitfalls: │
│ • Small batch sizes → noisy statistics │
│ • Different behavior train vs inference │
│ │
└─────────────────────────────────────────────────────────────────┘
Layer Normalization vs Batch Normalization
| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes | No |
| Best for | CNNs | Transformers, RNNs |
| Inference behavior | Uses running stats | Same as training |
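A small sketch illustrating the table: BatchNorm changes behavior between train and eval, LayerNorm does not (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)    # (batch, features)

bn = nn.BatchNorm1d(16)   # normalizes each feature across the batch
ln = nn.LayerNorm(16)     # normalizes each sample across its features

bn.train()
y_bn_train = bn(x)        # uses batch statistics, updates running mean/var
bn.eval()
y_bn_eval = bn(x)         # uses the stored running statistics instead

y_ln = ln(x)              # same computation in train and eval mode
```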
Gradient Flow
Vanishing/Exploding Gradients
┌─────────────────────────────────────────────────────────────────┐
│ GRADIENT FLOW PROBLEMS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VANISHING GRADIENTS │
│ ─────────────────── │
│ Symptoms: │
│ • Early layers don't learn │
│ • Loss plateaus early │
│ • Gradient norms → 0 │
│ │
│ Causes: │
│ • Sigmoid/Tanh saturation │
│ • Too many layers │
│ • Poor initialization │
│ │
│ Solutions: │
│ • ReLU activations │
│ • Skip connections (ResNet) │
│ • Proper initialization │
│ • Batch/Layer normalization │
│ │
│ EXPLODING GRADIENTS │
│ ─────────────────── │
│ Symptoms: │
│ • NaN losses │
│ • Weights → ∞ │
│ • Unstable training │
│ │
│ Solutions: │
│ • Gradient clipping │
│ • Lower learning rate │
│ • Proper initialization │
│ │
└─────────────────────────────────────────────────────────────────┘
Gradient Clipping
```python
import torch

# Gradient clipping by norm (recommended); `model` is the module being trained
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Gradient clipping by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```

| Method | When to Use | Typical Values |
|---|---|---|
| Clip by Norm | RNNs, Transformers | 1.0 - 5.0 |
| Clip by Value | Specific gradient control | 0.5 - 1.0 |
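For placement, clipping belongs between `backward()` and the optimizer step so that the clipped gradients are what actually get applied; a self-contained sketch with a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
xb, yb = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(xb), yb)
loss.backward()
# Clip after backward() and before step(), so the optimizer applies clipped gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```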
Training Best Practices
Hyperparameter Priority
┌─────────────────────────────────────────────────────────────────┐
│ HYPERPARAMETER TUNING PRIORITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Priority 1 (Tune First): │
│ • Learning rate │
│ • Batch size │
│ • Number of epochs │
│ │
│ Priority 2 (Tune Second): │
│ • Model architecture (depth, width) │
│ • Optimizer choice │
│ • Learning rate schedule │
│ │
│ Priority 3 (Fine-tune): │
│ • Regularization strength │
│ • Dropout rate │
│ • Weight decay │
│ │
│ Priority 4 (Usually defaults work): │
│ • Optimizer betas (Adam) │
│ • Epsilon values │
│ • Initialization details │
│ │
└─────────────────────────────────────────────────────────────────┘
Training Checklist
✅ Before Training
- Data: Verify data loading, augmentation, normalization
- Model: Check parameter count, forward pass works
- Loss: Verify loss computation on dummy data
- Overfit: Can the model overfit a single small batch? (sanity check; see the sketch below)
- Baseline: Compare against simple baseline
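A self-contained sketch of the overfit-one-batch sanity check: a correctly wired model, loss, and optimizer should drive the loss on a single fixed batch close to zero within a few hundred steps (the toy model and data are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One fixed batch of random data; a healthy setup should memorize it
x = torch.randn(32, 20)
y = torch.randint(0, 3, (32,))

for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")  # should approach zero
```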
📎 Cross-References
- 📎 Architectures Overview - Neural network architectures
- 📎 Debugging Training - Troubleshooting training issues
- 📎 Scaling Training - Distributed training strategies
- 📎 ML Evaluation - Evaluation fundamentals
- 📎 ML Feature Engineering - Feature engineering for ML
- 📎 LLM Architecture - LLM training specifics
🧠 Quiz
Question 1: Why does weight initialization matter in deep learning?
- [ ] A) It only affects inference speed
- [x] B) Poor initialization can cause vanishing/exploding gradients, making the model impossible to train
- [ ] C) It only matters for CNNs, not for RNNs
- [ ] D) It has no effect on the final result
💡 Explanation: Weight initialization directly affects gradient flow. Initialization that is too small causes vanishing gradients (the signal dies out across layers); too large causes exploding gradients. Methods like Xavier/Glorot and He initialization are designed to keep activation variance stable across layers.
Question 2: Which of the following regularization techniques is specific to deep learning?
- [ ] A) L1 regularization
- [ ] B) L2 regularization
- [x] C) Dropout
- [ ] D) Early stopping
💡 Explanation: While L1, L2, and early stopping apply to any ML model, Dropout is a technique specific to neural networks - it randomly "switches off" a fraction of neurons during training to force the network to learn more robust representations.
Question 3: Which common learning rate schedule helps a model converge better?
- [ ] A) Keep the learning rate fixed for the entire training run
- [ ] B) Increase the learning rate continuously
- [x] C) Cosine annealing, or warm-up followed by decay
- [ ] D) Pick a random learning rate every epoch
💡 Explanation: Cosine annealing and warm-up + decay are the most widely used schedules. Warm-up stabilizes the early phase of training, after which a gradually decreasing learning rate lets the model fine-tune and converge to a better minimum.