🏗️ Architectures Overview
Level: Core · Solves: Understand the common neural network architectures and when to use each of them
Architecture Selection Framework
💡 Professor Tom
Don't pick an architecture because it's "hot"; pick it because it fits your data and your problem. CNNs still beat Transformers on many vision tasks when data is limited. Understanding the trade-offs is key.
┌─────────────────────────────────────────────────────────────────┐
│ ARCHITECTURE SELECTION GUIDE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Data Type Recommended Architecture │
│ ───────── ──────────────────────── │
│ │
│ Images (fixed size) CNN (ResNet, EfficientNet) │
│ Images (variable) ViT, CNN + adaptive pooling │
│ Sequences (short) RNN/LSTM, 1D CNN │
│ Sequences (long) Transformer, State Space Models │
│ Text Transformer (BERT, GPT) │
│ Audio CNN + RNN, Transformer │
│ Graphs GNN (GCN, GAT) │
│ Tabular MLP, TabNet, Tree-based (non-DL) │
│ │
└─────────────────────────────────────────────────────────────────┘
Convolutional Neural Networks (CNNs)
Core Concepts
┌─────────────────────────────────────────────────────────────────┐
│ CNN BUILDING BLOCKS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CONVOLUTION LAYER │
│ ───────────────── │
│ • Learns local patterns (edges, textures, shapes) │
│ • Parameter sharing → translation equivariance │
│ • Key params: kernel_size, stride, padding, channels │
│ │
│ ┌─────────┐ ┌───┐ ┌─────────┐ │
│ │ Input │ * │ K │ = │ Feature │ │
│ │ H×W×C │ │3×3│ │ Map │ │
│ └─────────┘ └───┘ └─────────┘ │
│ │
│ POOLING LAYER │
│ ───────────── │
│ • Reduces spatial dimensions │
│ • Max pooling: takes max value (most common) │
│ • Average pooling: takes mean value │
│ • Global pooling: reduces to 1×1 (before classifier) │
│ │
│ NORMALIZATION │
│ ───────────── │
│ • BatchNorm: normalize across batch (standard for CNNs) │
│ • GroupNorm: normalize across groups (small batch sizes) │
│ │
└─────────────────────────────────────────────────────────────────┘
CNN Architecture Evolution
| Architecture | Year | Key Innovation | Use Case |
|---|---|---|---|
| LeNet | 1998 | First practical CNN | Digit recognition |
| AlexNet | 2012 | ReLU, Dropout, GPU | ImageNet breakthrough |
| VGG | 2014 | Deep, 3×3 convs only | Feature extraction |
| ResNet | 2015 | Skip connections | Very deep networks |
| DenseNet | 2017 | Dense connections | Feature reuse |
| EfficientNet | 2019 | Compound scaling | Efficiency |
| ConvNeXt | 2022 | Modernized CNN | Compete with ViT |
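The conv layer's key params listed above (kernel_size, stride, padding) determine the feature-map size via out = floor((H + 2p - k) / s) + 1. A quick sanity check in plain Python (`conv_out_size` is an illustrative helper, not a library function):

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 conv with padding=1, stride=1 preserves spatial size (the VGG/ResNet default):
print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224
# A stride-2 conv, or a 2x2 max pool with stride 2, halves it:
print(conv_out_size(224, kernel=3, stride=2, padding=1))  # 112
print(conv_out_size(224, kernel=2, stride=2, padding=0))  # 112
```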
ResNet: The Most Important CNN
┌─────────────────────────────────────────────────────────────────┐
│ RESIDUAL BLOCK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ x ─────────────────────────────────┐ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Conv 3×3│ │ Skip Connection │
│ │ BN, ReLU│ │ (Identity) │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Conv 3×3│ │ │
│ │ BN │ │ │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ │ │
│ (+) ←───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ReLU │
│ │ │
│ ▼ │
│ F(x) + x │
│ │
│ Why it works: │
│ • Easier to learn the residual F(x) = 0 than a mapping H(x) = x │
│ • Gradients flow directly through skip connections │
│ • Enables training of 100+ layer networks │
│ │
└─────────────────────────────────────────────────────────────────┘
Recurrent Neural Networks (RNNs)
Core Concepts
┌─────────────────────────────────────────────────────────────────┐
│ RNN FUNDAMENTALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VANILLA RNN │
│ ─────────── │
│ h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b) │
│ │
│ Problem: Vanishing gradients for long sequences │
│ │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ h │ → │ h │ → │ h │ → │ h │ │
│ └─▲─┘ └─▲─┘ └─▲─┘ └─▲─┘ │
│ │ │ │ │ │
│ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
│ │x_1│ │x_2│ │x_3│ │x_4│ │
│ └───┘ └───┘ └───┘ └───┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
LSTM (Long Short-Term Memory)
┌─────────────────────────────────────────────────────────────────┐
│ LSTM CELL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Gates: │
│ • Forget gate (f): What to forget from cell state │
│ • Input gate (i): What new info to store │
│ • Output gate (o): What to output │
│ │
│ Equations: │
│ f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget gate │
│ i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input gate │
│ C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate │
│ C_t = f_t * C_{t-1} + i_t * C̃_t # Cell state │
│ o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output gate │
│ h_t = o_t * tanh(C_t) # Hidden state │
│ │
│ Key insight: Cell state C_t acts as "memory highway" │
│ allowing gradients to flow unchanged │
│ │
└─────────────────────────────────────────────────────────────────┘
GRU (Gated Recurrent Unit)
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More | ~25% fewer |
| Performance | Slightly better on long sequences | Comparable, faster training |
| Use case | Default choice | When efficiency matters |
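The LSTM gate equations above can be run numerically. This is a minimal NumPy sketch (toy sizes, random weights, one weight matrix per gate; `lstm_step` is an illustrative helper, not a library API); a GRU is the same idea with two gates (reset, update) and no separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations above.
    W and b hold a separate weight matrix / bias per gate: f, i, C, o."""
    hx = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])          # input gate
    C_tilde = np.tanh(W["C"] @ hx + b["C"])      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # cell state ("memory highway")
    o_t = sigmoid(W["o"] @ hx + b["o"])          # output gate
    h_t = o_t * np.tanh(C_t)                     # hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
d_x, d_h = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):                               # unroll over a short sequence
    h, C = lstm_step(rng.normal(size=d_x), h, C, W, b)
print(h.shape, C.shape)  # (16,) (16,)
```

Note how the cell state update C_t = f_t * C_{t-1} + i_t * C̃_t is additive: gradients can pass through it without repeated squashing, which is the fix for the vanilla RNN's vanishing gradients.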
Attention Mechanism
Self-Attention Explained
┌─────────────────────────────────────────────────────────────────┐
│ SELF-ATTENTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: Sequence of embeddings X = [x_1, x_2, ..., x_n] │
│ │
│ Step 1: Create Q, K, V │
│ ───────────────────────── │
│ Q = X · W_Q (Query: "What am I looking for?") │
│ K = X · W_K (Key: "What do I contain?") │
│ V = X · W_V (Value: "What do I provide?") │
│ │
│ Step 2: Compute attention scores │
│ ──────────────────────────── │
│ Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V │
│ │
│ ┌─────┐ ┌─────┐ │
│ │ Q │ × │ K^T │ → Attention Scores → softmax → × V │
│ └─────┘ └─────┘ │
│ │
│ Why √d_k? │
│ • Prevents dot products from getting too large │
│ • Keeps softmax in good gradient region │
│ │
└─────────────────────────────────────────────────────────────────┘
Multi-Head Attention
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-HEAD ATTENTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Idea: Run multiple attention operations in parallel │
│ Each "head" can focus on different aspects │
│ │
│ Input X │
│ │ │
│ ┌─────┼─────┬─────┬─────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ H1│ │ H2│ │ H3│ │ H4│ │...│ (h heads) │
│ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ │
│ │ │ │ │ │ │
│ └─────┴─────┴─────┴─────┘ │
│ │ │
│ Concatenate │
│ │ │
│ Linear W_O │
│ │ │
│ Output │
│ │
│ Typical: 8-16 heads, d_model/h dimensions per head │
│ │
└─────────────────────────────────────────────────────────────────┘
Transformers
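The self-attention and multi-head mechanisms from the previous section are the core operations inside every Transformer block. A minimal NumPy sketch, with illustrative helper names and toy sizes (not a library API):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend per head, concatenate, project with W_O."""
    n, d_model = X.shape
    d_head = d_model // h                                     # d_model/h dims per head
    Q = (X @ W_Q).reshape(n, h, d_head).transpose(1, 0, 2)    # (h, n, d_head)
    K = (X @ W_K).reshape(n, h, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_head).transpose(1, 0, 2)
    heads = attention(Q, K, V)                                # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 10, 64, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out = multi_head(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (10, 64)
```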
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ TRANSFORMER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ENCODER (BERT-style) DECODER (GPT-style) │
│ ──────────────────── ─────────────────── │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Multi-Head │ │ Masked Multi- │ │
│ │ Attention │ │ Head Attention │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ Add & Norm Add & Norm │
│ │ │ │
│ ┌────────┴────────┐ ┌────────┴────────┐ │
│ │ Feed Forward │ │ Cross-Attention│ (if enc-dec) │
│ │ Network │ │ (to encoder) │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ Add & Norm Add & Norm │
│ │ │ │
│ Output ┌────────┴────────┐ │
│ │ Feed Forward │ │
│ │ Network │ │
│ └────────┬────────┘ │
│ │ │
│ Add & Norm │
│ │ │
│ Output │
│ │
└─────────────────────────────────────────────────────────────────┘
Positional Encoding
| Method | Description | Use Case |
|---|---|---|
| Sinusoidal | Fixed sin/cos functions | Original Transformer |
| Learned | Trainable embeddings | BERT, GPT |
| Rotary (RoPE) | Rotation-based | LLaMA, modern LLMs |
| ALiBi | Attention bias | Efficient long context |
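The sinusoidal scheme in the table is simple enough to write out. A minimal NumPy sketch following the original Transformer's sin/cos formulation (`sinusoidal_pe` is an illustrative helper):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Original Transformer positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_pos)[:, None]                  # (n_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)     # (n_pos, d_model/2)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dims: sin
    pe[:, 1::2] = np.cos(angle)                      # odd dims: cos
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)            # (128, 64)
print(pe[0, 0], pe[0, 1])  # 0.0 (sin 0) and 1.0 (cos 0)
```

Because each dimension oscillates at a different frequency, every position gets a unique pattern, and relative offsets correspond to fixed linear transformations of the encoding.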
Transformer Variants
| Model | Type | Key Feature | Use Case |
|---|---|---|---|
| BERT | Encoder | Bidirectional | Classification, NER |
| GPT | Decoder | Autoregressive | Text generation |
| T5 | Encoder-Decoder | Text-to-text | Translation, summarization |
| ViT | Encoder | Patches as tokens | Image classification |
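The decoder's autoregressive behavior in the table above comes from causal masking: each position may attend only to itself and earlier positions. A minimal NumPy sketch (illustrative helpers, not a library API):

```python
import numpy as np

def causal_mask(n):
    """True above the diagonal: position i may not attend to positions > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)         # masked entries get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                            # uniform scores for illustration
w = masked_softmax(scores, causal_mask(4))
print(np.round(w, 2))
# Row i spreads its weight uniformly over the first i+1 positions:
# rows are [1,0,0,0], [.5,.5,0,0], [.33,.33,.33,0], [.25,.25,.25,.25]
```

Encoder-style models (BERT, ViT) skip this mask and attend bidirectionally, which is why they suit understanding tasks rather than generation.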
Vision Transformers (ViT)
┌─────────────────────────────────────────────────────────────────┐
│ VISION TRANSFORMER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Patch Embedding │
│ ─────────────────────── │
│ Image (224×224) → 16×16 patches → 196 patches │
│ Each patch → Linear projection → Embedding │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ ┌───┬───┬───┬───┐ │ │
│ │ │ P1│ P2│ P3│...│ → Flatten → │ │
│ │ ├───┼───┼───┼───┤ Linear │ │
│ │ │ P5│ P6│ P7│...│ │ │
│ │ └───┴───┴───┴───┘ │ │
│ └─────────────────────────────────────┘ │
│ │
│ Step 2: Add [CLS] token + Position embeddings │
│ ───────────────────────────────────────────── │
│ [CLS, P1, P2, ..., P196] + [pos_0, pos_1, ..., pos_196] │
│ │
│ Step 3: Transformer Encoder │
│ ─────────────────────────── │
│ Standard transformer encoder blocks │
│ │
│ Step 4: Classification │
│ ───────────────────── │
│ [CLS] token → MLP head → Class prediction │
│ │
└─────────────────────────────────────────────────────────────────┘
CNN vs ViT Trade-offs
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation) | Weak (learns from data) |
| Data efficiency | Better with small data | Needs large datasets |
| Compute | Efficient | Quadratic attention |
| Long-range | Limited receptive field | Global attention |
| Interpretability | Feature maps | Attention maps |
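Step 1 of the ViT pipeline above (image to patch tokens) is just a reshape plus a linear projection. A minimal NumPy sketch with a toy projection width (`patchify` and `W_E` are illustrative names, not a library API):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    n_h, n_w = H // patch, W // patch
    x = image[:n_h * patch, :n_w * patch].reshape(n_h, patch, n_w, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
patches = patchify(img)               # 14 x 14 = 196 patches of 16*16*3 = 768 values
W_E = rng.normal(scale=0.02, size=(768, 192))   # learned linear projection (toy width)
tokens = patches @ W_E                # one embedding per patch
print(patches.shape, tokens.shape)    # (196, 768) (196, 192)
```

From here a [CLS] token and position embeddings are added, and the rest is a standard Transformer encoder.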
Architecture Selection Checklist
✅ Choosing the Right Architecture
- Data size: Small → CNN/pretrained; Large → Transformer
- Sequence length: Short → RNN; Long → Transformer
- Compute budget: Limited → Efficient architectures
- Interpretability: Important → Attention visualization
- Latency: Critical → Smaller models, distillation
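The checklist above can be condensed into a rough decision helper. `suggest_architecture` is purely illustrative heuristics and no substitute for benchmarking on your own data:

```python
def suggest_architecture(data_type, data_size="small", seq_len=None):
    """Rough heuristic mirroring the selection guide above; illustrative only."""
    if data_type == "images":
        return "CNN (pretrained ResNet/EfficientNet)" if data_size == "small" else "ViT or CNN"
    if data_type == "sequences":
        # Short sequences: recurrence or 1D convs; long ones: attention or SSMs
        return "RNN/LSTM or 1D CNN" if (seq_len or 0) <= 512 else "Transformer or state space model"
    if data_type == "text":
        return "Transformer (BERT for understanding, GPT for generation)"
    if data_type == "graphs":
        return "GNN (GCN, GAT)"
    if data_type == "tabular":
        return "Tree-based models first; MLP/TabNet if deep learning is required"
    return "Start with the simplest baseline that fits the data"

print(suggest_architecture("images", data_size="small"))
print(suggest_architecture("sequences", seq_len=4096))
```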
📎 Cross-References
- 📎 Training Foundations - How to train these architectures
- 📎 Debugging Training - Architecture-specific debugging
- 📎 Inference Optimization - Optimizing for production
- 📎 LLM Architecture - LLM-specific patterns
- 📎 ML Model Selection - When to use deep learning vs classical ML
- 📎 RAG Engineering - Transformer applications in RAG