🏗️ Architectures Overview
Level: Core · Solves: Understand the common neural network architectures and when to use each of them
Architecture Selection Framework
💡 Professor Tom
Don't pick an architecture because it's "hot"; pick it because it fits your data and your problem. CNNs still beat Transformers on many vision tasks when data is limited. Understanding the trade-offs is key.
┌─────────────────────────────────────────────────────────────────┐
│ ARCHITECTURE SELECTION GUIDE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Data Type Recommended Architecture │
│ ───────── ──────────────────────── │
│ │
│ Images (fixed size) CNN (ResNet, EfficientNet) │
│ Images (variable) ViT, CNN + adaptive pooling │
│ Sequences (short) RNN/LSTM, 1D CNN │
│ Sequences (long) Transformer, State Space Models │
│ Text Transformer (BERT, GPT) │
│ Audio CNN + RNN, Transformer │
│ Graphs GNN (GCN, GAT) │
│ Tabular MLP, TabNet, Tree-based (non-DL) │
│ │
└─────────────────────────────────────────────────────────────────┘
Convolutional Neural Networks (CNNs)
Core Concepts
┌─────────────────────────────────────────────────────────────────┐
│ CNN BUILDING BLOCKS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CONVOLUTION LAYER │
│ ───────────────── │
│ • Learns local patterns (edges, textures, shapes) │
│ • Parameter sharing → translation equivariance │
│ • Key params: kernel_size, stride, padding, channels │
│ │
│ ┌─────────┐ ┌───┐ ┌─────────┐ │
│ │ Input │ * │ K │ = │ Feature │ │
│ │ H×W×C │ │3×3│ │ Map │ │
│ └─────────┘ └───┘ └─────────┘ │
│ │
│ POOLING LAYER │
│ ───────────── │
│ • Reduces spatial dimensions │
│ • Max pooling: takes max value (most common) │
│ • Average pooling: takes mean value │
│ • Global pooling: reduces to 1×1 (before classifier) │
│ │
│ NORMALIZATION │
│ ───────────── │
│ • BatchNorm: normalize across batch (standard for CNNs) │
│ • GroupNorm: normalize across groups (small batch sizes) │
│ │
└─────────────────────────────────────────────────────────────────┘
CNN Architecture Evolution
| Architecture | Year | Key Innovation | Use Case |
|---|---|---|---|
| LeNet | 1998 | First practical CNN | Digit recognition |
| AlexNet | 2012 | ReLU, Dropout, GPU | ImageNet breakthrough |
| VGG | 2014 | Deep, 3×3 convs only | Feature extraction |
| ResNet | 2015 | Skip connections | Very deep networks |
| DenseNet | 2017 | Dense connections | Feature reuse |
| EfficientNet | 2019 | Compound scaling | Efficiency |
| ConvNeXt | 2022 | Modernized CNN | Compete with ViT |
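The conv layer's key params listed above (kernel_size, stride, padding) determine the feature-map size via out = floor((H + 2p - k) / s) + 1. A quick sanity check in plain Python (`conv_out_size` is an illustrative helper, not a library function):

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 conv with padding=1, stride=1 preserves spatial size (the VGG/ResNet default):
print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224
# A stride-2 conv, or a 2x2 max pool with stride 2, halves it:
print(conv_out_size(224, kernel=3, stride=2, padding=1))  # 112
print(conv_out_size(224, kernel=2, stride=2, padding=0))  # 112
```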
ResNet: The Most Important CNN
┌─────────────────────────────────────────────────────────────────┐
│ RESIDUAL BLOCK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ x ─────────────────────────────────┐ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Conv 3×3│ │ Skip Connection │
│ │ BN, ReLU│ │ (Identity) │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌─────────┐ │ │
│ │ Conv 3×3│ │ │
│ │ BN │ │ │
│ └────┬────┘ │ │
│ │ │ │
│ ▼ │ │
│ (+) ←───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ReLU │
│ │ │
│ ▼ │
│ F(x) + x │
│ │
│ Why it works: │
│ • Easier to learn the residual F(x) = 0 than a mapping H(x) = x │
│ • Gradients flow directly through skip connections │
│ • Enables training of 100+ layer networks │
│ │
└─────────────────────────────────────────────────────────────────┘
Recurrent Neural Networks (RNNs)
Core Concepts
┌─────────────────────────────────────────────────────────────────┐
│ RNN FUNDAMENTALS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VANILLA RNN │
│ ─────────── │
│ h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b) │
│ │
│ Problem: Vanishing gradients for long sequences │
│ │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ h │ → │ h │ → │ h │ → │ h │ │
│ └─▲─┘ └─▲─┘ └─▲─┘ └─▲─┘ │
│ │ │ │ │ │
│ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ │
│ │x_1│ │x_2│ │x_3│ │x_4│ │
│ └───┘ └───┘ └───┘ └───┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
LSTM (Long Short-Term Memory)
┌─────────────────────────────────────────────────────────────────┐
│ LSTM CELL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Gates: │
│ • Forget gate (f): What to forget from cell state │
│ • Input gate (i): What new info to store │
│ • Output gate (o): What to output │
│ │
│ Equations: │
│ f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget gate │
│ i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input gate │
│ C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate │
│ C_t = f_t * C_{t-1} + i_t * C̃_t # Cell state │
│ o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output gate │
│ h_t = o_t * tanh(C_t) # Hidden state │
│ │
│ Key insight: Cell state C_t acts as "memory highway" │
│ allowing gradients to flow unchanged │
│ │
└─────────────────────────────────────────────────────────────────┘
GRU (Gated Recurrent Unit)
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More | ~25% fewer |
| Performance | Slightly better on long sequences | Comparable, faster training |
| Use case | Default choice | When efficiency matters |
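The LSTM gate equations above can be run numerically. This is a minimal NumPy sketch (toy sizes, random weights, one weight matrix per gate; `lstm_step` is an illustrative helper, not a library API); a GRU is the same idea with two gates (reset, update) and no separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations above.
    W and b hold a separate weight matrix / bias per gate: f, i, C, o."""
    hx = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])          # input gate
    C_tilde = np.tanh(W["C"] @ hx + b["C"])      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # cell state ("memory highway")
    o_t = sigmoid(W["o"] @ hx + b["o"])          # output gate
    h_t = o_t * np.tanh(C_t)                     # hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
d_x, d_h = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):                               # unroll over a short sequence
    h, C = lstm_step(rng.normal(size=d_x), h, C, W, b)
print(h.shape, C.shape)  # (16,) (16,)
```

Note how the cell state update C_t = f_t * C_{t-1} + i_t * C̃_t is additive: gradients can pass through it without repeated squashing, which is the fix for the vanilla RNN's vanishing gradients.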
Attention Mechanism
Self-Attention Explained
┌─────────────────────────────────────────────────────────────────┐
│ SELF-ATTENTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: Sequence of embeddings X = [x_1, x_2, ..., x_n] │
│ │
│ Step 1: Create Q, K, V │
│ ───────────────────────── │
│ Q = X · W_Q (Query: "What am I looking for?") │
│ K = X · W_K (Key: "What do I contain?") │
│ V = X · W_V (Value: "What do I provide?") │
│ │
│ Step 2: Compute attention scores │
│ ──────────────────────────── │
│ Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V │
│ │
│ ┌─────┐ ┌─────┐ │
│ │ Q │ × │ K^T │ → Attention Scores → softmax → × V │
│ └─────┘ └─────┘ │
│ │
│ Why √d_k? │
│ • Prevents dot products from getting too large │
│ • Keeps softmax in good gradient region │
│ │
└─────────────────────────────────────────────────────────────────┘
Multi-Head Attention
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-HEAD ATTENTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Idea: Run multiple attention operations in parallel │
│ Each "head" can focus on different aspects │
│ │
│ Input X │
│ │ │
│ ┌─────┼─────┬─────┬─────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ H1│ │ H2│ │ H3│ │ H4│ │...│ (h heads) │
│ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ │
│ │ │ │ │ │ │
│ └─────┴─────┴─────┴─────┘ │
│ │ │
│ Concatenate │
│ │ │
│ Linear W_O │
│ │ │
│ Output │
│ │
│ Typical: 8-16 heads, d_model/h dimensions per head │
│ │
└─────────────────────────────────────────────────────────────────┘
Transformers
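The self-attention and multi-head mechanisms from the previous section are the core operations inside every Transformer block. A minimal NumPy sketch, with illustrative helper names and toy sizes (not a library API):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend per head, concatenate, project with W_O."""
    n, d_model = X.shape
    d_head = d_model // h                                     # d_model/h dims per head
    Q = (X @ W_Q).reshape(n, h, d_head).transpose(1, 0, 2)    # (h, n, d_head)
    K = (X @ W_K).reshape(n, h, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_head).transpose(1, 0, 2)
    heads = attention(Q, K, V)                                # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 10, 64, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out = multi_head(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (10, 64)
```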
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ TRANSFORMER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ENCODER (BERT-style) DECODER (GPT-style) │
│ ──────────────────── ─────────────────── │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Multi-Head │ │ Masked Multi- │ │
│ │ Attention │ │ Head Attention │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ Add & Norm Add & Norm │
│ │ │ │
│ ┌────────┴────────┐ ┌────────┴────────┐ │
│ │ Feed Forward │ │ Cross-Attention│ (if enc-dec) │
│ │ Network │ │ (to encoder) │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ Add & Norm Add & Norm │
│ │ │ │
│ Output ┌────────┴────────┐ │
│ │ Feed Forward │ │
│ │ Network │ │
│ └────────┬────────┘ │
│ │ │
│ Add & Norm │
│ │ │
│ Output │
│ │
└─────────────────────────────────────────────────────────────────┘
Positional Encoding
| Method | Description | Use Case |
|---|---|---|
| Sinusoidal | Fixed sin/cos functions | Original Transformer |
| Learned | Trainable embeddings | BERT, GPT |
| Rotary (RoPE) | Rotation-based | LLaMA, modern LLMs |
| ALiBi | Attention bias | Efficient long context |
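The sinusoidal scheme in the table is simple enough to write out. A minimal NumPy sketch following the original Transformer's sin/cos formulation (`sinusoidal_pe` is an illustrative helper):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Original Transformer positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_pos)[:, None]                  # (n_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)     # (n_pos, d_model/2)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dims: sin
    pe[:, 1::2] = np.cos(angle)                      # odd dims: cos
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)            # (128, 64)
print(pe[0, 0], pe[0, 1])  # 0.0 (sin 0) and 1.0 (cos 0)
```

Because each dimension oscillates at a different frequency, every position gets a unique pattern, and relative offsets correspond to fixed linear transformations of the encoding.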
Transformer Variants
| Model | Type | Key Feature | Use Case |
|---|---|---|---|
| BERT | Encoder | Bidirectional | Classification, NER |
| GPT | Decoder | Autoregressive | Text generation |
| T5 | Encoder-Decoder | Text-to-text | Translation, summarization |
| ViT | Encoder | Patches as tokens | Image classification |
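The decoder's autoregressive behavior in the table above comes from causal masking: each position may attend only to itself and earlier positions. A minimal NumPy sketch (illustrative helpers, not a library API):

```python
import numpy as np

def causal_mask(n):
    """True above the diagonal: position i may not attend to positions > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)         # masked entries get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                            # uniform scores for illustration
w = masked_softmax(scores, causal_mask(4))
print(np.round(w, 2))
# Row i spreads its weight uniformly over the first i+1 positions:
# rows are [1,0,0,0], [.5,.5,0,0], [.33,.33,.33,0], [.25,.25,.25,.25]
```

Encoder-style models (BERT, ViT) skip this mask and attend bidirectionally, which is why they suit understanding tasks rather than generation.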
Vision Transformers (ViT)
┌─────────────────────────────────────────────────────────────────┐
│ VISION TRANSFORMER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Patch Embedding │
│ ─────────────────────── │
│ Image (224×224) → 16×16 patches → 196 patches │
│ Each patch → Linear projection → Embedding │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ ┌───┬───┬───┬───┐ │ │
│ │ │ P1│ P2│ P3│...│ → Flatten → │ │
│ │ ├───┼───┼───┼───┤ Linear │ │
│ │ │ P5│ P6│ P7│...│ │ │
│ │ └───┴───┴───┴───┘ │ │
│ └─────────────────────────────────────┘ │
│ │
│ Step 2: Add [CLS] token + Position embeddings │
│ ───────────────────────────────────────────── │
│ [CLS, P1, P2, ..., P196] + [pos_0, pos_1, ..., pos_196] │
│ │
│ Step 3: Transformer Encoder │
│ ─────────────────────────── │
│ Standard transformer encoder blocks │
│ │
│ Step 4: Classification │
│ ───────────────────── │
│ [CLS] token → MLP head → Class prediction │
│ │
└─────────────────────────────────────────────────────────────────┘
CNN vs ViT Trade-offs
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation) | Weak (learns from data) |
| Data efficiency | Better with small data | Needs large datasets |
| Compute | Efficient | Quadratic attention |
| Long-range | Limited receptive field | Global attention |
| Interpretability | Feature maps | Attention maps |
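Step 1 of the ViT pipeline above (image to patch tokens) is just a reshape plus a linear projection. A minimal NumPy sketch with a toy projection width (`patchify` and `W_E` are illustrative names, not a library API):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    n_h, n_w = H // patch, W // patch
    x = image[:n_h * patch, :n_w * patch].reshape(n_h, patch, n_w, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
patches = patchify(img)               # 14 x 14 = 196 patches of 16*16*3 = 768 values
W_E = rng.normal(scale=0.02, size=(768, 192))   # learned linear projection (toy width)
tokens = patches @ W_E                # one embedding per patch
print(patches.shape, tokens.shape)    # (196, 768) (196, 192)
```

From here a [CLS] token and position embeddings are added, and the rest is a standard Transformer encoder.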
Architecture Selection Checklist
✅ Choosing the Right Architecture
- Data size: Small → CNN/pretrained; Large → Transformer
- Sequence length: Short → RNN; Long → Transformer
- Compute budget: Limited → Efficient architectures
- Interpretability: Important → Attention visualization
- Latency: Critical → Smaller models, distillation
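The checklist above can be condensed into a rough decision helper. `suggest_architecture` is purely illustrative heuristics and no substitute for benchmarking on your own data:

```python
def suggest_architecture(data_type, data_size="small", seq_len=None):
    """Rough heuristic mirroring the selection guide above; illustrative only."""
    if data_type == "images":
        return "CNN (pretrained ResNet/EfficientNet)" if data_size == "small" else "ViT or CNN"
    if data_type == "sequences":
        # Short sequences: recurrence or 1D convs; long ones: attention or SSMs
        return "RNN/LSTM or 1D CNN" if (seq_len or 0) <= 512 else "Transformer or state space model"
    if data_type == "text":
        return "Transformer (BERT for understanding, GPT for generation)"
    if data_type == "graphs":
        return "GNN (GCN, GAT)"
    if data_type == "tabular":
        return "Tree-based models first; MLP/TabNet if deep learning is required"
    return "Start with the simplest baseline that fits the data"

print(suggest_architecture("images", data_size="small"))
print(suggest_architecture("sequences", seq_len=4096))
```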
📎 Cross-References
- 📎 Training Foundations - How to train these architectures
- 📎 Debugging Training - Architecture-specific debugging
- 📎 Inference Optimization - Optimizing for production
- 📎 LLM Architecture - LLM-specific patterns
- 📎 ML Model Selection - When to use deep learning vs classical ML
- 📎 RAG Engineering - Transformer applications in RAG