Vectors, Matrices, Tensors: Math Groundwork for ML Foundation

"If you don't understand the shape of your data, you don't understand your model."

🎯 Goals

After this lesson, you will:

Understand what vectors, matrices, and tensors are in a Machine Learning context, not as abstract math
Create and manipulate them confidently with NumPy
Have a firm grip on shape and dtype, the two attributes behind most everyday ML bugs
Know how real-world data (user profiles, images, time series) is represented as tensors

Why do software engineers need to know this?

You don't need to prove theorems. You need to know what shape the data entering your model has, and where to fix it when that shape is wrong.

You think ML is...	In practice ML is...
Complex algorithms	90% data processing, 10% model
Advanced math	Matrix multiplication + basic derivatives
Being good at math	Being good at debugging shape mismatches

1. Vector = Feature Vector: the "numeric profile" of each data sample

Intuition: every user is a sequence of numbers

Imagine you are building a product recommendation system for an e-commerce marketplace. Each user is represented by numbers:

┌─────────────────────────────────────────────────┐
│  user_42 = [ 28,  15000000,  47,  180 ]         │
│             ───   ────────   ──   ───           │
│             age   income    purch  days         │
│                                                 │
│  → 1 vector, 4 dimensions (4 features)          │
└─────────────────────────────────────────────────┘

NumPy: creating and manipulating vectors

python

import numpy as np

# ✅ Create a feature vector for one user
user_42 = np.array([28, 15_000_000, 47, 180], dtype=np.float32)

print(user_42.shape)   # (4,): 1 dimension, 4 elements
print(user_42.dtype)   # float32
print(user_42.ndim)    # 1

# Access individual features
age = user_42[0]           # 28.0
last_two = user_42[2:]     # array([47., 180.])

Business scenario: a feature vector for recommendation

python

feature_names = [
    "age", "monthly_income", "total_purchases",
    "days_since_signup", "avg_rating_given", "cart_abandonment_rate"
]
user_profile = np.array([28, 15e6, 47, 180, 4.2, 0.15], dtype=np.float32)

# Always check the shape before feeding data to the model
expected_shape = (len(feature_names),)
assert user_profile.shape == expected_shape, f"Expected {expected_shape}, got {user_profile.shape}"

💡 Industry convention

In ML, each row is a sample and each column is a feature. A vector is the special case of a single row.

📌 Why float32?

float32 is the "common currency" of ML. It is precise enough for most training, uses half the memory of float64, and GPUs are typically much faster with float32 (or float16/bfloat16) than with float64.
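
To make the memory and precision trade-off concrete, here is a minimal sketch. The 1,000-user array is a made-up example, not data from this lesson; the precision limit follows from float32's 24-bit significand.

```python
import numpy as np

# Hypothetical batch: 1,000 users with 4 features each.
features64 = np.ones((1000, 4), dtype=np.float64)
features32 = features64.astype(np.float32)

print(features64.nbytes)  # 32000
print(features32.nbytes)  # 16000 (half the memory)

# float32 has a 24-bit significand, so not every integer above
# 2**24 = 16,777,216 can be stored exactly.
exact = np.float32(15_000_000)     # below 2**24: stored exactly
rounded = np.float32(16_777_217)   # 2**24 + 1: rounds to 16,777,216

print(int(exact))    # 15000000
print(int(rounded))  # 16777216
```

For features like income this rounding is usually harmless, but it is one reason large IDs or timestamps should not be stored as float32.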

2. Matrix = Batch of Vectors: when you have many users

Intuition: stack the vectors on top of each other

┌───────────────────────────────────────────────────┐
│  Matrix: 1000 users × 4 features                  │
│                                                     │
│  user_0   →  [ 28,  15000000,  47,  180 ]          │
│  user_1   →  [ 35,  22000000,  12,   90 ]          │
│  user_2   →  [ 22,   8000000,  85,  365 ]          │
│  ...                                                │
│  user_999 →  [ 41,  30000000,   5,   14 ]          │
│                                                     │
│  shape = (1000, 4)  →  1000 samples, 4 features    │
└───────────────────────────────────────────────────┘

NumPy: creating and manipulating matrices

python

users = np.array([
    [28, 15e6, 47, 180],
    [35, 22e6, 12,  90],
    [22,  8e6, 85, 365],
    [41, 30e6,  5,  14],
    [19,  5e6,  3,   7],
], dtype=np.float32)

print(users.shape)    # (5, 4) — 5 samples, 4 features

# Access the second user (index 1)
user_1 = users[1]           # array([35., 22000000., 12., 90.])

# The "income" column for all users
all_incomes = users[:, 1]   # array([15e6, 22e6, 8e6, 30e6, 5e6])
avg_income = users[:, 1].mean()  # 16_000_000.0

User-Product Interaction Matrix

python

# 4 users × 5 products, value = rating (0 = not yet rated)
interaction = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [0, 0, 5, 4, 5],
    [0, 3, 4, 0, 4],
], dtype=np.float32)

rated = np.count_nonzero(interaction)
sparsity = 1 - rated / interaction.size
print(f"Sparsity: {sparsity:.0%}")  # 45%

⚠️ Shape convention: learn it once, use it for life

(samples, features)     ← Tabular data
(batch_size, features)  ← Khi training

Rows = samples. Columns = features. Mix them up and the model can learn from the wrong axis without raising any error.
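
A small sketch of how an axis mix-up stays silent. The three users reuse values from the example above; the variable names are illustrative only.

```python
import numpy as np

users = np.array([
    [28, 15e6, 47, 180],
    [35, 22e6, 12,  90],
    [22,  8e6, 85, 365],
], dtype=np.float32)  # (3 samples, 4 features)

# axis=0 collapses the sample axis: one statistic per feature (usually what you want).
per_feature_mean = users.mean(axis=0)
# axis=1 collapses the feature axis: it blends age, income, purchases and days together.
per_sample_mean = users.mean(axis=1)

print(per_feature_mean.shape)  # (4,)
print(per_sample_mean.shape)   # (3,)
```

Both calls succeed. Only the shapes reveal which axis was reduced, which is why checking `.shape` is cheaper than debugging a model trained on the wrong statistic.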

3. Tensor = Higher Dimensions: when a matrix is not enough

Overview of shapes by data type

Data	Shape	Ndim
One user	`(4,)`	1D (vector)
1000 users	`(1000, 4)`	2D (matrix)
Grayscale image	`(28, 28)`	2D
Color image	`(224, 224, 3)`	3D (H × W × RGB)
Batch of 32 color images	`(32, 224, 224, 3)`	4D
Time series batch	`(64, 100, 5)`	3D (batch × steps × features)

Image = 3D Tensor

python

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(image.shape)    # (224, 224, 3)

red_channel = image[:, :, 0]       # shape: (224, 224)
pixel = image[100, 150]            # array([r, g, b])

┌─────────────────────────────────────────────────┐
│  Image Tensor: (224, 224, 3)                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│  │  Red     │ │  Green   │ │  Blue    │         │
│  │  224×224 │ │  224×224 │ │  224×224 │         │
│  └──────────┘ └──────────┘ └──────────┘        │
│  channel 0    channel 1    channel 2             │
└─────────────────────────────────────────────────┘

Batch of Images = 4D Tensor

python

batch = np.random.rand(32, 224, 224, 3).astype(np.float32)
print(batch.shape)    # (32, 224, 224, 3)
print(f"Memory: {batch.nbytes / 1e6:.1f} MB")  # ~19.3 MB (32 × 224 × 224 × 3 × 4 bytes)

💡 PyTorch vs TensorFlow: the channel order differs!

TensorFlow (default) / typical NumPy image arrays:  (batch, height, width, channels), "channels last"
PyTorch (default):                                  (batch, channels, height, width), "channels first"

This is the source of countless bugs when moving models between the two frameworks. Always check .shape!
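
A minimal sketch of converting between the two layouts in NumPy. The tiny (2, 4, 4, 3) batch is an assumption chosen so the values are easy to inspect. The key point is that the axes must be moved with `transpose`, not relabelled with `reshape`.

```python
import numpy as np

# Tiny stand-in for an image batch: (N, H, W, C), channels last.
nhwc = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)

# Move the channel axis to position 1: (N, H, W, C) -> (N, C, H, W).
nchw = np.transpose(nhwc, (0, 3, 1, 2))
print(nchw.shape)  # (2, 3, 4, 4)

# The same pixel is now addressed as [n, c, h, w] instead of [n, h, w, c].
same_pixel = nchw[0, 1, 2, 3] == nhwc[0, 2, 3, 1]

# reshape gives the same shape but scrambles which value belongs to which channel.
scrambled = nhwc.reshape(2, 3, 4, 4)
print(same_pixel, np.array_equal(nchw, scrambled))  # True False
```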

Time Series = 3D Tensor

python

# 64 trading sessions, 100 time steps, 5 indicators
time_series = np.random.rand(64, 100, 5).astype(np.float32)
#                            ──  ───  ─
#                           batch steps features

Reshape, Squeeze, Expand_dims

python

matrix = np.random.rand(28, 28).astype(np.float32)
flat = matrix.reshape(-1)             # (784,): flatten before a Dense layer

single_user = np.array([28, 15e6, 47, 180], dtype=np.float32)
batched = np.expand_dims(single_user, axis=0)  # (1, 4): add a batch dim

prediction = np.array([[[0.87]]])     # (1, 1, 1)
clean = prediction.squeeze()          # shape (): a 0-d array holding 0.87; call .item() for a Python float

data = np.arange(24).reshape(2, 3, 4) # (2,3,4) — 2×3×4=24 ✅
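
Two details of these operations are worth a quick sketch: how `-1` is inferred in `reshape`, and how to squeeze only one specific axis. This is an illustrative example, not part of the original listing.

```python
import numpy as np

data = np.arange(24)

auto = data.reshape(4, -1)   # -1 is inferred as 6 because 4 * 6 = 24
print(auto.shape)            # (4, 6)

try:
    data.reshape(5, -1)      # 24 is not divisible by 5
except ValueError as err:
    print("reshape failed:", err)

single = np.array([28, 15e6, 47, 180], dtype=np.float32)
batched = single[np.newaxis, :]         # equivalent to np.expand_dims(single, axis=0)
restored = np.squeeze(batched, axis=0)  # remove only the batch axis
print(batched.shape, restored.shape)    # (1, 4) (4,)
```

Passing `axis` to `squeeze` is the safer habit: a bare `squeeze()` also drops any other size-1 axis, for example a batch of one sample.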

4. Shape, dtype, and Memory: check them BEFORE doing anything else

python

data = np.random.rand(1000, 50).astype(np.float32)
print(f"Shape: {data.shape}  Dtype: {data.dtype}  Bytes: {data.nbytes:,}")

`float32` vs `float64`

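Dtype	Bytes per value	Use when
`float16`	2	GPU inference, mixed precision
`float32`	4	Default for ML training
`float64`	8	Rarely needed in ML

🔴 Memory Killer: large dataset + wrong dtype

With 1M samples × 500 features: float64 = 4 GB, float32 = 2 GB, float16 = 1 GB. On a GPU with 8-16 GB of VRAM, the wrong dtype can mean the data (plus weights and activations) no longer fits, which forces smaller batches or a fallback to CPU. That fallback is often many times slower; 10-50x is a rough rule of thumb that depends on the model and hardware.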
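
The figures above can be reproduced with a few lines. `estimate_gb` is a hypothetical helper written for this sketch, and it counts only the raw array, not framework overhead.

```python
import numpy as np

def estimate_gb(n_samples: int, n_features: int, dtype) -> float:
    """Size in GB (10**9 bytes) of a dense (n_samples, n_features) array."""
    return n_samples * n_features * np.dtype(dtype).itemsize / 1e9

for dt in (np.float64, np.float32, np.float16):
    print(f"{np.dtype(dt).name}: {estimate_gb(1_000_000, 500, dt):.1f} GB")
# float64: 4.0 GB
# float32: 2.0 GB
# float16: 1.0 GB
```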

Why understanding shape prevents most ML bugs

python

A = np.random.rand(100, 50)
W_wrong = np.random.rand(10, 50)   # Wrong orientation: A @ W_wrong raises ValueError ❌ (needs (50, 10))

new_sample = np.array([1.0, 2.0, 3.0])  # (3,), but a batched model expects (1, 3)

# Mixed-up channel order: can yield garbage output with NO error (e.g. once the image is flattened)
image_tf = np.random.rand(224, 224, 3)  # TF: channels last
image_pt = np.random.rand(3, 224, 224)  # PyTorch: channels first
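
One way to catch these mistakes early is a small guard called before each operation. `check_shape` below is a hypothetical helper for illustration, not a NumPy function; `None` stands for "any size", such as the batch dimension.

```python
import numpy as np

def check_shape(name, array, expected):
    """Raise early if array.shape does not match expected; None matches any size."""
    actual = array.shape
    ok = len(actual) == len(expected) and all(
        e is None or e == a for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")

A = np.random.rand(100, 50)
W = np.random.rand(50, 10)
check_shape("A", A, (None, 50))
check_shape("W", W, (50, 10))
out = A @ W
print(out.shape)  # (100, 10)

new_sample = np.array([1.0, 2.0, 3.0])
try:
    check_shape("new_sample", new_sample, (None, 3))
except ValueError as err:
    message = str(err)
    print(message)  # new_sample: expected shape (None, 3), got (3,)
```

A guard like this cannot detect a channels-first array passed where channels-last is expected if the sizes happen to match, so it complements rather than replaces knowing the convention of your framework.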

5. 🔥 GPU and Tensor Shape: why shape affects speed

GPUs process data in parallel, tile by tile. Shape determines:

Memory layout: row-major vs column-major affects the cache hit rate.
Parallelism: GPU kernels split tensors into fixed-size tiles (for example 32×32). A shape that does not divide evenly gets padded, which wastes compute.
Coalesced access: a GPU fetches memory in wide contiguous chunks (on the order of 128 bytes per transaction). Scattered access can cut throughput dramatically, 10-100x in bad cases.

Consequence: batch sizes are usually multiples of 32 because an NVIDIA GPU warp is 32 threads. In the simplified picture of one sample per thread, batch = 33 needs two warps (64 slots) and leaves 31 of them idle.

python

good_batch_sizes = [32, 64, 128, 256, 512]   # ✅ Multiples of 32
bad_batch_sizes = [33, 100, 127, 300]         # ⚠️ Less efficient
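
The 31/64 figure comes from simple ceiling arithmetic. The sketch below assumes the simplified model of one sample per thread and work issued in groups of 32; real kernels tile work in more complicated ways, so treat the numbers as intuition rather than a measurement. `padded_slots` is a hypothetical helper.

```python
def padded_slots(batch_size: int, group: int = 32) -> tuple[int, int]:
    """Return (slots allocated, slots idle) when work is issued in groups of `group`."""
    groups_needed = -(-batch_size // group)  # ceiling division
    allocated = groups_needed * group
    return allocated, allocated - batch_size

for bs in (32, 33, 100, 128):
    allocated, idle = padded_slots(bs)
    print(f"batch={bs}  allocated={allocated}  idle={idle}")
# batch=32  allocated=32  idle=0
# batch=33  allocated=64  idle=31
# batch=100  allocated=128  idle=28
# batch=128  allocated=128  idle=0
```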

6. 🧠 Common Beginner Misconception

⚠️ "A tensor is just a fancy multi-dimensional array"

Not quite. NumPy arrays are plain arrays. Tensors in PyTorch/TensorFlow carry extra machinery:

Gradient tracking: requires_grad=True → derivatives are computed automatically
Device placement: tensor.to('cuda') → computation runs on the GPU
Computational graph: each operation is recorded for backpropagation

python

import torch
np_arr = np.array([1.0, 2.0, 3.0])                              # Just numbers
pt_tensor = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # Numbers + gradient + graph
pt_tensor.sum().backward()
print(pt_tensor.grad)   # tensor([1., 1., 1.]), computed automatically!

NumPy = training wheels. Learn the concepts here, but don't assume NumPy arrays behave the same as PyTorch tensors.

7. ⚡ Fast Exercise: test yourself

Exercise 1: Build a feature matrix

python

import numpy as np

# TODO: Create a feature matrix with shape (3, 5), dtype float32
# Customer 0: age 25, 12M income, 30 purchases, 90 days, 4.5 rating
# Customer 1: age 35, 25M income, 15 purchases, 365 days, 3.8 rating
# Customer 2: age 42, 40M income, 8 purchases, 730 days, 4.9 rating
customers = ...  # TODO: replace with your np.array(...)

assert customers.shape == (3, 5)
assert customers.dtype == np.float32
print("✅ Correct!")

Answer

python

customers = np.array([
    [25, 12e6, 30, 90, 4.5],
    [35, 25e6, 15, 365, 3.8],
    [42, 40e6,  8, 730, 4.9],
], dtype=np.float32)

customers[:, 0] selects the age column across all rows, which is how features are typically accessed in ML pipelines.

Exercise 2: Spot the bug

python

features = np.random.rand(100, 10).astype(np.float32)
weights = np.random.rand(10, 5).astype(np.float32)
new_sample = np.array([0.5, 0.3, 0.8, 0.1, 0.9, 0.4, 0.7, 0.2, 0.6, 0.35])
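prediction = new_sample @ weights  # Where is the bug?

Answer

(1) dtype=np.float32 is missing, so new_sample is float64 and the product is upcast to float64. (2) Shape (10,) has no batch dim. NumPy still computes a (5,) result without error, but a model that expects batched input needs reshape(1, -1) to get (1, 10).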
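
A corrected version of the snippet, as one possible fix (other equivalent forms, such as `np.expand_dims`, work just as well):

```python
import numpy as np

weights = np.random.rand(10, 5).astype(np.float32)
new_sample = np.array(
    [0.5, 0.3, 0.8, 0.1, 0.9, 0.4, 0.7, 0.2, 0.6, 0.35],
    dtype=np.float32,   # fix 1: match the dtype of weights
).reshape(1, -1)        # fix 2: add the batch dimension -> (1, 10)

prediction = new_sample @ weights
print(prediction.shape, prediction.dtype)  # (1, 5) float32
```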

8. 🪤 Gotcha: "silent" broadcasting

🔴 Forgetting to check .shape → broadcasting produces a wrong result

python

A = np.random.rand(100, 50)
B = np.random.rand(50)       # 1D: OK, broadcast as (1, 50)
C = A - B                    # (100, 50) ✅ feature normalization

B_wrong = np.random.rand(100)  # One value per sample by mistake
# With A of shape (100, 50), A - B_wrong raises ValueError.
# But if A were (100, 100): NO error, and the output is completely wrong!

Rule: print .shape for every operand before the operation. It costs 2 seconds and saves 2 hours of debugging.
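
The square-matrix case is easy to reproduce on a tiny example. The 3×3 matrix and offsets below are made up for illustration.

```python
import numpy as np

A = np.arange(9, dtype=np.float32).reshape(3, 3)        # 3 samples x 3 features
per_sample = np.array([10, 20, 30], dtype=np.float32)   # intended: one offset per row

wrong = A - per_sample            # (3,) lines up with columns: no error, wrong axis
right = A - per_sample[:, None]   # (3, 1) lines up with rows, as intended

print(wrong[0].tolist())  # [-10.0, -19.0, -28.0]
print(right[0].tolist())  # [-10.0, -9.0, -8.0]
```

Both results have shape (3, 3), so even a shape check on the output would not catch this. Checking the operand shapes, and making the broadcast axis explicit with `[:, None]`, does.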

9. 📊 Performance Note — Pre-allocate vs Append

python

n_users, n_features = 100_000, 50

# ❌ SLOW: append to a list, then convert (~2.5s; timings here are rough and machine-dependent)
rows = []
for i in range(n_users):
    rows.append(np.random.rand(n_features).astype(np.float32))
matrix_slow = np.array(rows)

# ✅ Pre-allocate (~1.8s)
matrix_fast = np.zeros((n_users, n_features), dtype=np.float32)
for i in range(n_users):
    matrix_fast[i] = np.random.rand(n_features)

# ✅✅ Vectorized (~0.15s, roughly 15x faster!)
matrix_best = np.random.rand(n_users, n_features).astype(np.float32)

💡 Vectorized code is often 10-100x faster than a Python loop

NumPy runs the whole operation in compiled C code, so the Python interpreter overhead is paid once per call instead of once per element.

10. 🚫 Production Anti-pattern — Row-by-row loop

🔴 Anti-pattern: building the feature matrix with iterrows

python

import pandas as pd
df = pd.read_csv("users.csv")  # 1M rows

# ❌ Row-by-row loop: ~45 seconds for 1M rows (rough figure)
# Note: using idx as a row position assumes the default RangeIndex.
feature_matrix = np.zeros((len(df), 4), dtype=np.float32)
for idx, row in df.iterrows():
    feature_matrix[idx, 0] = row["age"]
    feature_matrix[idx, 1] = row["income"]
    feature_matrix[idx, 2] = row["purchases"]
    feature_matrix[idx, 3] = row["days_active"]

# ✅ Vectorized: ~0.3 seconds, about 150x faster
feature_matrix = df[["age", "income", "purchases", "days_active"]].to_numpy(dtype=np.float32)

Rule: if you write for idx, row in df.iterrows() in ML code, you are almost certainly doing it wrong. Always vectorize.

11. 🧪 Playground: complete code

Copy and run (pip install numpy):

python

"""Penalgo AI Phase 1 — Lesson 01: Vectors, Matrices, Tensors"""
import numpy as np

# 1. VECTOR
user = np.array([28, 15e6, 47, 180], dtype=np.float32)
print(f"Vector:  {user}  shape={user.shape}  dtype={user.dtype}")

# 2. MATRIX
users = np.array([
    [28, 15e6, 47, 180], [35, 22e6, 12, 90],
    [22,  8e6, 85, 365], [41, 30e6,  5, 14],
], dtype=np.float32)
print(f"Matrix:  shape={users.shape}  avg_age={users[:, 0].mean():.1f}")

# 3. TENSOR — Batch of images
images = np.random.rand(32, 224, 224, 3).astype(np.float32)
print(f"Tensor:  shape={images.shape}  size={images.nbytes / 1e6:.1f} MB")

# 4. RESHAPE
flat = np.arange(12)
print(f"Reshape: {flat.shape} → (3,4)={flat.reshape(3,4).shape} → (2,2,3)={flat.reshape(2,2,3).shape}")

# 5. DTYPE memory impact
for dt in [np.float16, np.float32, np.float64]:
    mb = np.zeros((10000, 100), dtype=dt).nbytes / 1e6
    print(f"  {dt.__name__:10s} → {mb:.1f} MB")

print("\n✅ Done! You now understand vectors, matrices, and tensors in ML.")

12. Quiz: check your understanding

🧠 Quiz

Question 1: 5000 color images of 64×64. What is the shape of the batch tensor?

A) (64, 64, 3, 5000)
B) (5000, 64, 64, 3)
C) (5000, 3, 64, 64)
D) B or C, depending on the framework

Answer

D. TensorFlow: channels-last (5000, 64, 64, 3). PyTorch: channels-first (5000, 3, 64, 64). The batch dimension always comes first.

🧠 Quiz

Question 2: What is the default dtype of np.array([1.0, 2.0, 3.0])?

A) float32: no problem
B) float64: uses twice the memory, not what GPUs are optimized for
C) int64: cannot be used for gradients
D) float16: loses precision

Answer

B. NumPy defaults to float64. In ML, float32 is the standard: it saves 50% of the memory, and GPUs are optimized for float32 and lower-precision formats such as float16/bfloat16.

🧠 Quiz

Question 3: features of shape (200, 30) @ weights of shape (30, 10) → what shape?

A) (200, 10)
B) (30, 30)
C) (200, 30, 10)
D) ValueError

Answer

A. (m, k) @ (k, n) → (m, n). The inner dimensions match (30 = 30), so the output is (200, 10).

Checklist to remember

✅ Implementation checklist

[ ] Vector = 1D array = feature representation of one sample
[ ] Matrix = 2D array = batch of vectors, shape (samples, features)
[ ] Tensor = n-D array = images, video, time series
[ ] Always use dtype=np.float32 for ML data
[ ] Always check .shape before every operation
[ ] TF: (batch, H, W, C) · PyTorch: (batch, C, H, W)
[ ] Pre-allocate or vectorize; never append inside a loop
[ ] NumPy array ≠ PyTorch tensor (gradient, device, graph)
[ ] Batch size should be a multiple of 32 for GPU efficiency
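
The rule is quick to verify in NumPy. This is a throwaway check using zero-filled arrays, since only the shapes matter here.

```python
import numpy as np

features = np.zeros((200, 30), dtype=np.float32)
weights = np.zeros((30, 10), dtype=np.float32)

result = features @ weights
print(result.shape)  # (200, 10)

# Swapping the operands breaks the rule: inner dimensions 10 and 200 do not match.
try:
    weights @ features
    swapped_ok = True
except ValueError:
    swapped_ok = False
print(swapped_ok)  # False
```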
