
🎓 Training Foundations

Level: Foundation · Goal: A deep understanding of the fundamentals of neural network training, from optimization to regularization

Tại sao Training Foundations quan trọng?

💡 Professor Tom

Deep learning is not magic - it is applied optimization. A solid grasp of gradient descent, loss landscapes, and regularization will help you debug training issues 10x faster than just copy-pasting code from tutorials.

┌─────────────────────────────────────────────────────────────────┐
│              DEEP LEARNING TRAINING PIPELINE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │  Data   │ →  │ Forward │ →  │  Loss   │ →  │Backward │      │
│  │  Batch  │    │  Pass   │    │ Compute │    │  Pass   │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│       │                                            │            │
│       │         ┌─────────────────────────┐        │            │
│       └─────────│     Weight Update       │←───────┘            │
│                 │   (Optimizer Step)      │                     │
│                 └─────────────────────────┘                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Optimization Fundamentals

Gradient Descent Variants

| Variant | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Full dataset | Stable gradients | Slow, memory intensive |
| Stochastic GD | 1 sample | Fast updates | Noisy gradients |
| Mini-batch GD | 32-512 | Balances speed and stability | Batch size becomes a hyperparameter to tune |
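The mini-batch variant is what the training pipeline diagram above executes in practice. A minimal sketch of one epoch, using a toy linear model and random data as placeholders:

```python
import torch

# Placeholder model, loss, and data for illustration only.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 10)  # toy dataset: 256 samples, 10 features
y = torch.randn(256, 1)
dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for xb, yb in loader:              # one epoch of mini-batch updates
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(xb), yb)  # forward pass + loss compute
    loss.backward()                # backward pass (autograd)
    optimizer.step()               # weight update (optimizer step)
```

Each iteration is one pass through the Data → Forward → Loss → Backward → Update loop from the diagram.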

Modern Optimizers

┌─────────────────────────────────────────────────────────────────┐
│                    OPTIMIZER EVOLUTION                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SGD → SGD+Momentum → RMSprop → Adam → AdamW                   │
│   │         │            │        │       │                     │
│   │         │            │        │       └─ Weight decay fix   │
│   │         │            │        └─ Momentum + Adaptive LR     │
│   │         │            └─ Adaptive learning rates             │
│   │         └─ Accelerated convergence                          │
│   └─ Basic gradient descent                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Optimizer Selection Guide

| Optimizer | Best For | Learning Rate | Notes |
|---|---|---|---|
| SGD+Momentum | CNNs, well-tuned models | 0.01-0.1 | Often best final performance |
| Adam | Transformers, quick prototyping | 1e-4 to 3e-4 | Good default choice |
| AdamW | Transformers with regularization | 1e-4 to 3e-4 | Proper weight decay |
| LAMB | Large-batch training | Scaled | For batch sizes > 8K |
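A sketch of setting up AdamW with the learning rate range from the table. Excluding biases and normalization parameters from weight decay is a common convention, not something the table prescribes:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.LayerNorm(16))

# Matrices get weight decay; 1-D params (biases, norm scales) do not.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,             # within the 1e-4 to 3e-4 range from the table
    betas=(0.9, 0.999),  # defaults are usually fine (see Priority 4 below)
)
```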

Learning Rate Schedules

┌─────────────────────────────────────────────────────────────────┐
│              LEARNING RATE SCHEDULE PATTERNS                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Step Decay          Cosine Annealing      Warmup + Decay       │
│  ──────────          ────────────────      ──────────────       │
│  LR│ ▄▄▄▄            LR│    ╭──╮           LR│   ╭────╮         │
│    │     ▄▄▄▄          │   ╱    ╲            │  ╱      ╲        │
│    │         ▄▄▄▄      │  ╱      ╲           │ ╱        ╲       │
│    │             ▄▄    │ ╱        ╲          │╱          ╲      │
│    └──────────────→    └───────────→         └────────────→     │
│         Epochs              Epochs               Epochs         │
│                                                                 │
│  Use: CNNs, stable     Use: Transformers     Use: Large models  │
│  training              fine-tuning           from scratch       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
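The "Warmup + Decay" pattern on the right can be sketched with `LambdaLR`: linear warmup for the first steps, then cosine decay toward zero. The model and step counts here are placeholders:

```python
import math
import torch

model = torch.nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 10, 100

def lr_lambda(step):
    # Linear warmup, then cosine decay to ~0 (the "Warmup + Decay" shape).
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: optimizer.step(), then scheduler.step(), once per batch.
```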

Loss Functions

Classification Losses

| Loss | Formula | Use Case |
|---|---|---|
| Cross-Entropy | -Σ y·log(ŷ) | Multi-class classification |
| Binary CE | -y·log(ŷ) - (1-y)·log(1-ŷ) | Binary classification |
| Focal Loss | -α(1-ŷ)^γ·log(ŷ) | Imbalanced datasets |
| Label Smoothing | CE with soft targets | Regularization, calibration |
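A minimal binary focal loss following the table's formula, applied symmetrically to the negative class. The α-weighting convention varies across implementations; this is one common choice:

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # -α(1-ŷ)^γ·log(ŷ), where ŷ is the probability of the true class.
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)               # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (-alpha_t * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-8))).mean()
```

The (1-ŷ)^γ factor down-weights easy, well-classified examples, which is why it helps on imbalanced datasets.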

Regression Losses

| Loss | Formula | Characteristics |
|---|---|---|
| MSE (L2) | (y - ŷ)² | Penalizes large errors heavily |
| MAE (L1) | \|y - ŷ\| | Robust to outliers |
| Huber | L2 if small, L1 if large | Best of both worlds |
| Log-Cosh | log(cosh(y - ŷ)) | Smooth approximation of Huber |
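The table's characteristics are easy to see on a toy example with one outlier residual (the values here are made up for illustration):

```python
import torch
import torch.nn.functional as F

y = torch.zeros(5)
pred = torch.tensor([0.1, -0.2, 0.3, 0.1, 10.0])  # last residual is an outlier

mse = F.mse_loss(pred, y)                 # dominated by the outlier (error squared)
mae = F.l1_loss(pred, y)                  # linear in the error: far less sensitive
huber = F.huber_loss(pred, y, delta=1.0)  # quadratic near 0, linear beyond delta
```

MSE blows up on the single bad point while MAE and Huber stay on the scale of the typical error.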

Specialized Losses

┌─────────────────────────────────────────────────────────────────┐
│                  SPECIALIZED LOSS FUNCTIONS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CONTRASTIVE LEARNING                                           │
│  ────────────────────                                           │
│  • InfoNCE: Self-supervised representation learning             │
│  • Triplet Loss: Metric learning (anchor, positive, negative)   │
│  • NT-Xent: SimCLR-style contrastive loss                       │
│                                                                 │
│  SEQUENCE MODELING                                              │
│  ─────────────────                                              │
│  • CTC Loss: Speech recognition, OCR                            │
│  • Sequence CE: Language modeling                               │
│                                                                 │
│  GENERATIVE MODELS                                              │
│  ─────────────────                                              │
│  • Reconstruction Loss: Autoencoders                            │
│  • KL Divergence: VAEs                                          │
│  • Adversarial Loss: GANs                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Weight Initialization

Why Initialization Matters

🚨 Bad Initialization

Bad initialization can lead to:

  • Vanishing gradients: weights too small → gradients → 0
  • Exploding gradients: weights too large → gradients → ∞
  • Dead neurons: ReLU neurons stuck at 0
  • Slow convergence: training takes forever

Initialization Strategies

| Method | Formula | Best For |
|---|---|---|
| Xavier/Glorot | U(-√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Tanh, Sigmoid activations |
| He/Kaiming | N(0, √(2/n_in)) | ReLU activations |
| Orthogonal | QR decomposition | RNNs, deep networks |
| LSUV | Layer-sequential unit-variance | Very deep networks |

Initialization by Layer Type

┌─────────────────────────────────────────────────────────────────┐
│              INITIALIZATION BY LAYER TYPE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Layer Type          Recommended Init       Notes               │
│  ──────────          ────────────────       ─────               │
│                                                                 │
│  Linear + ReLU       He (Kaiming)           fan_in mode         │
│  Linear + Tanh       Xavier (Glorot)        Uniform or Normal   │
│  Conv2d + ReLU       He (Kaiming)           fan_out for backprop│
│  LSTM/GRU            Orthogonal             For recurrent weights│
│  Embedding           Normal(0, 0.02)        Or uniform small    │
│  LayerNorm           weight=1, bias=0       Standard practice   │
│  BatchNorm           weight=1, bias=0       Standard practice   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Regularization Techniques

Overview of Regularization

┌─────────────────────────────────────────────────────────────────┐
│                 REGULARIZATION TAXONOMY                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │   EXPLICIT      │  │   IMPLICIT      │  │   DATA-BASED    │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤  │
│  │ • L1/L2 penalty │  │ • Early stopping│  │ • Data augment  │  │
│  │ • Dropout       │  │ • Batch norm    │  │ • Mixup/CutMix  │  │
│  │ • Weight decay  │  │ • Noise inject  │  │ • Label smooth  │  │
│  │ • Max-norm      │  │ • Gradient clip │  │ • Curriculum    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Dropout

| Variant | Description | Use Case |
|---|---|---|
| Standard Dropout | Random zero-out during training | Fully connected layers |
| Spatial Dropout | Drop entire feature maps | CNNs |
| DropConnect | Drop weights instead of activations | Alternative to dropout |
| DropBlock | Drop contiguous regions | CNNs, better than spatial |
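Standard dropout behaves differently in train and eval mode, which is a frequent source of bugs. A quick demonstration:

```python
import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # roughly half the activations zeroed; survivors scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)   # identity at inference: nothing dropped, no scaling
```

Forgetting `model.eval()` at inference time leaves dropout active and silently degrades predictions.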

Weight Decay vs L2 Regularization

⚠️ Subtle Difference

Weight decay and L2 regularization are NOT the same thing with adaptive optimizers (Adam):

  • L2 Reg: Adds λ·w² to loss → gradient includes λ·w
  • Weight Decay: Directly decays weights → w = w - lr·λ·w

AdamW implements proper decoupled weight decay; Adam with L2 regularization behaves differently!
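The decoupled update can be sketched by hand on a single toy weight. With L2-in-the-loss, λ·w is added to the gradient and then rescaled by Adam's adaptive moments; the decoupled version below touches the weight directly:

```python
import torch

lr, wd = 0.1, 0.01
w = torch.tensor([1.0], requires_grad=True)

loss = (w ** 2).sum()  # toy loss with gradient 2w
loss.backward()

with torch.no_grad():
    w -= lr * w.grad   # plain gradient step: 1.0 -> 0.8
    w -= lr * wd * w   # decoupled weight decay (AdamW-style): w <- w - lr*λ*w
```

In AdamW the decay term never passes through the adaptive scaling, so every weight decays at the same relative rate.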

Batch Normalization

┌─────────────────────────────────────────────────────────────────┐
│                  BATCH NORMALIZATION                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Training:                                                      │
│  ─────────                                                      │
│  1. Compute batch mean: μ_B = (1/m) Σ x_i                       │
│  2. Compute batch var:  σ²_B = (1/m) Σ (x_i - μ_B)²             │
│  3. Normalize: x̂_i = (x_i - μ_B) / √(σ²_B + ε)                 │
│  4. Scale & shift: y_i = γ·x̂_i + β                             │
│                                                                 │
│  Inference:                                                     │
│  ──────────                                                     │
│  Use running mean/var computed during training                  │
│                                                                 │
│  Benefits:                                                      │
│  • Faster training (higher learning rates)                      │
│  • Reduces internal covariate shift                             │
│  • Acts as regularization                                       │
│                                                                 │
│  Pitfalls:                                                      │
│  • Small batch sizes → noisy statistics                         │
│  • Different behavior train vs inference                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
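The four training-time steps in the box translate directly into code. A minimal sketch for a (batch, features) tensor, without the running statistics used at inference:

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                        # 1. batch mean
    var = x.var(dim=0, unbiased=False)        # 2. batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)  # 3. normalize
    return gamma * x_hat + beta               # 4. scale & shift

x = torch.randn(32, 8) * 3 + 5  # toy activations, shifted and scaled
y = batch_norm_forward(x, gamma=torch.ones(8), beta=torch.zeros(8))
# y now has per-feature mean ~0 and variance ~1
```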

Layer Normalization vs Batch Normalization

| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes | No |
| Best for | CNNs | Transformers, RNNs |
| Inference behavior | Uses running stats | Same as training |

Gradient Flow

Vanishing/Exploding Gradients

┌─────────────────────────────────────────────────────────────────┐
│              GRADIENT FLOW PROBLEMS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  VANISHING GRADIENTS                                            │
│  ───────────────────                                            │
│  Symptoms:                                                      │
│  • Early layers don't learn                                     │
│  • Loss plateaus early                                          │
│  • Gradient norms → 0                                           │
│                                                                 │
│  Causes:                                                        │
│  • Sigmoid/Tanh saturation                                      │
│  • Too many layers                                              │
│  • Poor initialization                                          │
│                                                                 │
│  Solutions:                                                     │
│  • ReLU activations                                             │
│  • Skip connections (ResNet)                                    │
│  • Proper initialization                                        │
│  • Batch/Layer normalization                                    │
│                                                                 │
│  EXPLODING GRADIENTS                                            │
│  ───────────────────                                            │
│  Symptoms:                                                      │
│  • NaN losses                                                   │
│  • Weights → ∞                                                  │
│  • Unstable training                                            │
│                                                                 │
│  Solutions:                                                     │
│  • Gradient clipping                                            │
│  • Lower learning rate                                          │
│  • Proper initialization                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
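The symptoms above are easy to check directly: after a backward pass, inspect per-layer gradient norms. A small sketch with a toy Tanh network:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
)
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Vanishing gradients show up as norms -> 0 in the early layers;
# exploding gradients as huge norms or NaNs.
for name, p in model.named_parameters():
    print(f"{name}: grad norm = {p.grad.norm().item():.3e}")
```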

Gradient Clipping

```python
# Gradient clipping by norm (recommended)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Gradient clipping by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```
| Method | When to Use | Typical Values |
|---|---|---|
| Clip by Norm | RNNs, Transformers | 1.0 - 5.0 |
| Clip by Value | Specific gradient control | 0.5 - 1.0 |
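Placement matters: clipping must happen after `backward()` (so gradients exist) and before `step()` (so the optimizer sees the clipped values). A minimal sketch with a placeholder model:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip between backward() and step().
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

`clip_grad_norm_` returns the pre-clip total norm, which is worth logging to spot exploding gradients early.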

Training Best Practices

Hyperparameter Priority

┌─────────────────────────────────────────────────────────────────┐
│           HYPERPARAMETER TUNING PRIORITY                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Priority 1 (Tune First):                                       │
│  • Learning rate                                                │
│  • Batch size                                                   │
│  • Number of epochs                                             │
│                                                                 │
│  Priority 2 (Tune Second):                                      │
│  • Model architecture (depth, width)                            │
│  • Optimizer choice                                             │
│  • Learning rate schedule                                       │
│                                                                 │
│  Priority 3 (Fine-tune):                                        │
│  • Regularization strength                                      │
│  • Dropout rate                                                 │
│  • Weight decay                                                 │
│                                                                 │
│  Priority 4 (Usually defaults work):                            │
│  • Optimizer betas (Adam)                                       │
│  • Epsilon values                                               │
│  • Initialization details                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Training Checklist

✅ Before Training

  1. Data: Verify data loading, augmentation, normalization
  2. Model: Check parameter count, forward pass works
  3. Loss: Verify loss computation on dummy data
  4. Overfit: Can model overfit small batch? (sanity check)
  5. Baseline: Compare against simple baseline
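Step 4 of the checklist, overfitting one small batch, can be sketched in a few lines. The architecture and step count here are arbitrary placeholders:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
xb, yb = torch.randn(8, 10), torch.randn(8, 1)  # one small fixed batch

for _ in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    optimizer.step()
# A healthy setup should drive the loss close to zero on these 8 points.
```

If the loss refuses to approach zero on 8 samples, there is a bug in the data, model, or loss; fix that before touching any hyperparameters.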

📎 Cross-References