NumPy Internals — Hiểu để viết code nhanh hơn 100 lần

Một pipeline xử lý 50 triệu dòng sensor data mỗi ngày, chạy trong 4 giờ với pandas + Python loop. Sau khi team chuyển sang NumPy vectorized operations — cùng logic, cùng kết quả — thời gian giảm xuống 12 phút. Speedup 20x chỉ bằng cách loại bỏ Python loop và hiểu cách NumPy layout memory.

Điều khiến NumPy nhanh không phải phép thuật — nó nhanh vì bypass CPython overhead hoàn toàn. Khi bạn viết a + b với hai ndarray, NumPy không chạy Python interpreter cho mỗi phần tử. Nó gọi thẳng compiled C code, xử lý data contiguous trong memory, tận dụng CPU cache và SIMD instructions. Hiểu cơ chế này là chìa khóa để viết NumPy code thực sự hiệu quả.

Bức tranh tư duy

Hãy tưởng tượng Python list như chợ truyền thống: mỗi gian hàng (phần tử) ở vị trí khác nhau, bạn phải đi bộ đến từng gian để mua. ndarray của NumPy như dây chuyền sản xuất công nghiệp: tất cả nguyên liệu xếp thẳng hàng trên băng chuyền, máy xử lý tuần tự không cần di chuyển.

Python list (pointer array):
┌────┐  ┌────┐  ┌────┐  ┌────┐
│ *  │→ │1.0 │  │ *  │→ │2.0 │  ...  Mỗi phần tử là
│ptr │  │    │  │ptr │  │    │       PyObject riêng biệt,
└────┘  └────┘  └────┘  └────┘       nằm rải rác trong RAM

NumPy ndarray (contiguous buffer):
┌──────┬──────┬──────┬──────┬──────┐
│ 1.0  │ 2.0  │ 3.0  │ 4.0  │ 5.0  │  Raw bytes liền kề,
│8bytes│8bytes│8bytes│8bytes│8bytes│  CPU đọc 1 cache line
└──────┴──────┴──────┴──────┴──────┘  = xử lý nhiều phần tử

Python list: mỗi phần tử tốn ~28 bytes (PyObject float) + 8 bytes pointer = ~36 bytes/phần tử. ndarray float64: 8 bytes/phần tử. Tiết kiệm 4.5x bộ nhớ, và quan trọng hơn — data locality tốt hơn → CPU cache hit rate cao hơn → xử lý nhanh hơn.

Analogy này breakdown khi array quá lớn không fit CPU cache (lúc đó memory bandwidth là bottleneck, không phải cache miss), hoặc khi bạn cần xử lý từng phần tử theo logic phức tạp mà NumPy không hỗ trợ vectorized.

Cốt lõi kỹ thuật

ndarray memory layout: shape, strides, dtype

Mỗi ndarray gồm 3 thành phần cốt lõi: data buffer (vùng nhớ liên tục), shape (kích thước mỗi chiều), và strides (số bytes nhảy để đến phần tử tiếp theo trên mỗi chiều).

python

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(f"Shape:    {arr.shape}")      # (2, 3) — 2 hàng, 3 cột
print(f"Strides:  {arr.strides}")    # (24, 8) — nhảy 24 bytes qua hàng, 8 bytes qua cột
print(f"Dtype:    {arr.dtype}")      # float64 — 8 bytes/phần tử
print(f"Itemsize: {arr.itemsize}")   # 8 bytes
print(f"Nbytes:   {arr.nbytes}")     # 48 bytes tổng
print(f"Flags C:  {arr.flags['C_CONTIGUOUS']}")  # True — row-major

Strides giải thích cách truy cập arr[i, j]:

offset = i * strides[0] + j * strides[1]
       = i * 24 + j * 8  (bytes từ đầu buffer)

arr[1, 2] → offset = 1*24 + 2*8 = 40 bytes → giá trị 6.0

C-order vs Fortran-order

python

import numpy as np

# C-order (row-major) — mặc định
arr_c = np.array([[1, 2, 3], [4, 5, 6]], order="C")
# Memory: [1, 2, 3, 4, 5, 6]  (hàng liên tiếp)
print(f"C strides: {arr_c.strides}")  # (24, 8)

# Fortran-order (column-major) — dùng trong LAPACK, Fortran
arr_f = np.array([[1, 2, 3], [4, 5, 6]], order="F")
# Memory: [1, 4, 2, 5, 3, 6]  (cột liên tiếp)
print(f"F strides: {arr_f.strides}")  # (8, 16)

Ý nghĩa thực tế: Lặp qua hàng → dùng C-order. Lặp qua cột → dùng Fortran-order. Sai thứ tự = cache miss liên tục = chậm gấp nhiều lần.

View vs Copy — Nguyên tắc sống còn

python

import numpy as np

original = np.array([10, 20, 30, 40, 50])

# Slicing tạo VIEW — chia sẻ memory
view = original[1:4]
view[0] = 999
print(original)  # [10, 999, 30, 40, 50] — bị thay đổi!

# .copy() tạo COPY — memory riêng
copy = original[1:4].copy()
copy[0] = 0
print(original)  # [10, 999, 30, 40, 50] — không ảnh hưởng

# Kiểm tra view hay copy
print(view.base is original)  # True → view
print(copy.base is None)      # True → copy (owns data)

Fancy indexing (arr[[0, 2, 4]]) luôn tạo copy. Slicing cơ bản (arr[1:4], arr[::2]) tạo view.

Vectorization — Thay Python loop bằng C loop

python

import numpy as np
import timeit

n = 1_000_000
data = np.random.random(n)

# ❌ Python loop — chậm vì mỗi iteration qua CPython interpreter
def python_loop(data: np.ndarray) -> np.ndarray:
    result = np.empty_like(data)
    for i in range(len(data)):
        result[i] = data[i] ** 2 + 2 * data[i] + 1
    return result

# ✅ Vectorized — 1 lần gọi C, xử lý toàn bộ array
def vectorized(data: np.ndarray) -> np.ndarray:
    return data ** 2 + 2 * data + 1

t_loop = timeit.timeit(lambda: python_loop(data), number=5)
t_vec = timeit.timeit(lambda: vectorized(data), number=5)
print(f"Python loop: {t_loop:.3f}s")
print(f"Vectorized:  {t_vec:.3f}s")
print(f"Speedup:     {t_loop / t_vec:.0f}x")
# Ước lượng: 50-100x speedup

Broadcasting rules — Ba quy tắc vàng

Broadcasting cho phép NumPy thực hiện phép tính trên array có shape khác nhau mà không cần copy data.

Quy tắc 1: Nếu số chiều khác nhau, thêm 1 vào đầu shape nhỏ hơn
Quy tắc 2: Chiều có kích thước 1 được "kéo giãn" bằng chiều lớn hơn
Quy tắc 3: Hai chiều tương thích nếu bằng nhau hoặc một trong hai = 1

python

import numpy as np

# Scalar + Array: () broadcast thành (3,)
a = np.array([1, 2, 3])
result = a + 10  # [11, 12, 13]

# (2,3) + (3,) → (3,) thành (1,3) → broadcast thành (2,3)
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # (2, 3)
row = np.array([10, 20, 30])                 # (3,)
result = matrix + row
# [[11, 22, 33],
#  [14, 25, 36]]

# (3,1) + (1,4) → broadcast thành (3,4)
col = np.array([[1], [2], [3]])       # (3, 1)
row = np.array([[10, 20, 30, 40]])    # (1, 4)
outer = col + row
# [[11, 21, 31, 41],
#  [12, 22, 32, 42],
#  [13, 23, 33, 43]]

Ứng dụng thực tế phổ biến:

python

import numpy as np

# Chuẩn hóa cột (trừ trung bình mỗi cột)
data = np.random.random((1000, 50))
col_means = data.mean(axis=0)         # shape (50,)
normalized = data - col_means         # (1000,50) - (50,) → broadcast

# Ma trận khoảng cách Euclidean (không cần loop)
points = np.random.random((100, 3))   # 100 điểm 3D
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]  # (100,100,3)
distances = np.sqrt(np.sum(diff ** 2, axis=2))               # (100,100)

SIMD và NumPy

NumPy sử dụng SIMD (Single Instruction Multiple Data) qua các thư viện BLAS/LAPACK (OpenBLAS, MKL). Một instruction CPU xử lý 4 float64 hoặc 8 float32 cùng lúc.

python

import numpy as np

# Matrix multiply tận dụng SIMD qua BLAS
A = np.random.random((1000, 1000))
B = np.random.random((1000, 1000))
C = A @ B  # Gọi dgemm (BLAS level 3) — tận dụng AVX2/AVX-512

# Kiểm tra NumPy build config
np.show_config()
# Hiển thị BLAS/LAPACK library đang dùng (openblas, mkl, accelerate...)

Tránh tạo intermediate array

python

import numpy as np

a = np.random.random(1_000_000)
b = np.random.random(1_000_000)
c = np.random.random(1_000_000)

# ❌ Tạo 2 intermediate array tạm
result = a + b + c  # temp1 = a+b (1M float64 = 8MB), result = temp1+c (8MB)

# ✅ Dùng out parameter — 0 intermediate
result = np.empty_like(a)
np.add(a, b, out=result)
np.add(result, c, out=result)

# ✅ Hoặc dùng numexpr cho biểu thức phức tạp
# import numexpr as ne
# result = ne.evaluate("a + b + c")  # Không tạo intermediate, tận dụng multi-thread

Thực chiến

Tình huống: Tối ưu data pipeline sensor IoT

Bối cảnh: Hệ thống IoT nhận 50 triệu dòng sensor data/ngày (timestamp, device_id, temperature, humidity, pressure). Pipeline hiện tại dùng pandas + Python loop xử lý 4 giờ. Yêu cầu: giảm xuống dưới 30 phút.

Mục tiêu: Chuyển sang NumPy vectorized operations, giữ nguyên logic xử lý.

python

import numpy as np
from typing import NamedTuple
import time

class PipelineResult(NamedTuple):
    anomaly_count: int
    mean_temperature: float
    processing_time_ms: float

def process_sensor_data(
    temperature: np.ndarray,     # shape (N,), float32
    humidity: np.ndarray,        # shape (N,), float32
    pressure: np.ndarray,        # shape (N,), float32
    device_ids: np.ndarray,      # shape (N,), int32
) -> PipelineResult:
    """Pipeline xử lý sensor data hoàn toàn vectorized."""
    start = time.perf_counter()

    # Bước 1: Chuẩn hóa (vectorized, không loop)
    temp_mean = np.mean(temperature)
    temp_std = np.std(temperature)
    temp_z = (temperature - temp_mean) / (temp_std + 1e-8)

    # Bước 2: Phát hiện anomaly (boolean indexing, không loop)
    anomaly_mask = (
        (np.abs(temp_z) > 3.0)
        | (humidity < 0) | (humidity > 100)
        | (pressure < 800) | (pressure > 1200)
    )
    anomaly_count = int(np.count_nonzero(anomaly_mask))

    # Bước 3: Heat index (vectorized formula)
    # Simplified heat index: HI = -42.379 + 2.049*T + 10.143*RH - ...
    valid_mask = ~anomaly_mask
    valid_temp = temperature[valid_mask]
    valid_humidity = humidity[valid_mask]

    heat_index = (
        -42.379
        + 2.04901523 * valid_temp
        + 10.14333127 * valid_humidity
        - 0.22475541 * valid_temp * valid_humidity
    )

    # Bước 4: Thống kê per-device (dùng np.unique + bincount)
    unique_devices = np.unique(device_ids[valid_mask])

    elapsed_ms = (time.perf_counter() - start) * 1000
    return PipelineResult(
        anomaly_count=anomaly_count,
        mean_temperature=float(np.mean(valid_temp)),
        processing_time_ms=elapsed_ms,
    )

# Benchmark
N = 50_000_000
rng = np.random.default_rng(42)
temp = rng.normal(25.0, 5.0, N).astype(np.float32)
hum = rng.uniform(20.0, 80.0, N).astype(np.float32)
pres = rng.normal(1013.0, 20.0, N).astype(np.float32)
dev = rng.integers(0, 10_000, N, dtype=np.int32)

result = process_sensor_data(temp, hum, pres, dev)
print(f"Anomalies:   {result.anomaly_count:,}")
print(f"Mean temp:   {result.mean_temperature:.2f}°C")
print(f"Time:        {result.processing_time_ms:.0f}ms")
# Ước lượng: ~2-5 giây cho 50M rows (so với 4 giờ bằng loop)

Phân tích:

Không có Python loop — mọi phép tính đều vectorized
float32 thay vì float64 — tiết kiệm 50% RAM, đủ chính xác cho sensor data
Boolean mask thay vì if/else từng phần tử — NumPy xử lý mask bằng SIMD
Memory: 50M × 4 bytes × 4 trường = ~800MB — vừa RAM container 2GB
Trade-off: code khó debug hơn loop (khó set breakpoint trên từng phần tử)

Sai lầm điển hình

❌ Sai lầm 1: np.append trong loop

Vấn đề: np.append tạo array mới mỗi lần gọi — O(n²) tổng.

python

import numpy as np

# SAI: O(n²) — mỗi append copy toàn bộ array cũ
result = np.array([])
for i in range(100_000):
    result = np.append(result, i ** 2)
# 100K iterations × copy trung bình 50K phần tử = 5 tỷ operations

Tại sao sai: Khác với Python list (amortized O(1) append), np.append không có pre-allocated buffer — nó tạo array mới mỗi lần, copy toàn bộ data cũ rồi thêm phần tử mới.

python

import numpy as np

# ĐÚNG: Pre-allocate hoặc vectorize
# Cách 1: Pre-allocate
result = np.empty(100_000)
for i in range(100_000):
    result[i] = i ** 2

# Cách 2: Vectorize (tốt nhất)
result = np.arange(100_000) ** 2

# Cách 3: Nếu không biết trước kích thước
chunks = []
for batch in data_stream():
    chunks.append(process(batch))  # Python list append O(1)
result = np.concatenate(chunks)    # 1 lần concatenate

❌ Sai lầm 2: Dùng object dtype

Vấn đề: Array với dtype=object chạy chậm như Python list — mất toàn bộ lợi thế NumPy.

python

import numpy as np

# SAI: Object dtype — mỗi phần tử là PyObject, không SIMD
arr = np.array([1, "two", 3.0])  # dtype=object (tự suy ra)
result = arr * 2  # Gọi Python __mul__ cho TỪNG phần tử!

Tại sao sai: Object dtype lưu pointer đến PyObject — giống hệt Python list. NumPy không thể dùng SIMD, không thể batch xử lý, không có lợi thế hiệu năng nào.

python

import numpy as np

# ĐÚNG: Dùng native dtype
arr_int = np.array([1, 2, 3], dtype=np.int64)
arr_float = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # float32 đủ nếu không cần precision cao

# Nếu cần mixed types → dùng structured array
dt = np.dtype([("id", "i4"), ("value", "f8")])
records = np.array([(1, 3.14), (2, 2.72)], dtype=dt)
print(records["value"].sum())  # Vectorized trên native float64

❌ Sai lầm 3: Broadcasting sai shape gây kết quả sai im lặng

Vấn đề: Broadcasting tự động mở rộng dimension — nếu shape sai, NumPy không báo lỗi mà cho kết quả sai.

python

import numpy as np

# SAI: Muốn trừ trung bình mỗi HÀNG, nhưng shape sai
data = np.array([[1, 2, 3], [4, 5, 6]])  # (2, 3)
row_means = data.mean(axis=1)             # (2,) — [2.0, 5.0]

# data - row_means → lỗi! (2,3) - (2,) → trailing dimension 3 vs 2
# result = data - row_means  # ValueError

Tại sao sai: row_means shape (2,) không broadcast được với (2,3) vì trailing dimension không khớp (3 ≠ 2).

python

import numpy as np

# ĐÚNG: Reshape để broadcast đúng
data = np.array([[1, 2, 3], [4, 5, 6]])  # (2, 3)
row_means = data.mean(axis=1, keepdims=True)  # (2, 1) ← keepdims!
result = data - row_means  # (2,3) - (2,1) → broadcast OK
# [[-1, 0, 1], [-1, 0, 1]]

# Hoặc dùng np.newaxis
row_means = data.mean(axis=1)[:, np.newaxis]  # (2,) → (2,1)
result = data - row_means

❌ Sai lầm 4: NaN lan truyền im lặng

Vấn đề: Một NaN trong array làm hỏng toàn bộ aggregation mà không cảnh báo.

python

import numpy as np

# SAI: NaN lan truyền
data = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print(np.mean(data))  # nan — không phải 3.0!
print(np.sum(data))   # nan
print(np.max(data))   # nan

Tại sao sai: Theo IEEE 754, mọi phép tính với NaN đều cho NaN. NumPy tuân thủ chuẩn này — không tự ý bỏ qua NaN.

python

import numpy as np

# ĐÚNG: Dùng nan-safe functions
data = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print(np.nanmean(data))  # 3.0
print(np.nansum(data))   # 12.0
print(np.nanmax(data))   # 5.0

# Hoặc filter NaN trước
clean = data[~np.isnan(data)]
print(np.mean(clean))    # 3.0

Under the Hood

Tại sao vectorization nhanh hơn?

Ba yếu tố tạo nên tốc độ NumPy:

1. Loại bỏ interpreter overhead: Mỗi iteration trong Python loop phải qua: LOAD_FAST → BINARY_SUBSCR → BINARY_POWER → STORE_SUBSCR → JUMP_ABSOLUTE. Vectorized operation gọi 1 lần C function, loop trong compiled C.

2. Data locality: ndarray contiguous → CPU prefetch hiệu quả. Python list có pointer chasing → cache miss liên tục.

3. SIMD: CPU xử lý 4 double (AVX2) hoặc 8 double (AVX-512) trong 1 instruction cycle.

Chi phí đo thực tế

Phép tính (1M float64)	Python loop	NumPy	Speedup
`a + b`	~150ms	~1.5ms	100x
`a * b + c`	~300ms	~3ms	100x
`np.sum(a)`	~80ms	~0.8ms	100x
`a @ b` (1000×1000)	~30s	~30ms	1000x
Boolean filter	~200ms	~5ms	40x

Ước lượng trên CPU hiện đại, single thread. Matrix multiply (@) nhanh 1000x vì dùng BLAS optimized (cache blocking + SIMD + multi-level tiling).

Contiguous vs Non-contiguous performance

python

import numpy as np
import timeit

arr = np.random.random((10_000, 10_000))

# Row access (C-contiguous) — nhanh
def sum_rows():
    return np.sum(arr, axis=1)

# Column access (non-contiguous) — chậm hơn
def sum_cols():
    return np.sum(arr, axis=0)

t_rows = timeit.timeit(sum_rows, number=10)
t_cols = timeit.timeit(sum_cols, number=10)
print(f"Sum rows: {t_rows:.3f}s")
print(f"Sum cols: {t_cols:.3f}s")
# Ước lượng: sum rows nhanh hơn 1.5-3x vì đọc tuần tự trong memory

Trade-offs khi dùng NumPy

Ưu điểm	Hạn chế
50-1000x nhanh hơn Python loop	Đòi hỏi tư duy vectorized
Memory-efficient với native dtype	Cần load toàn bộ array vào RAM
SIMD/BLAS tự động	Không hiệu quả cho conditional logic phức tạp
Broadcasting giảm code boilerplate	Broadcasting sai shape → lỗi im lặng
View mechanism tiết kiệm copy	View mutation ảnh hưởng original

Checklist ghi nhớ

✅ Checklist triển khai

Thiết kế array hiệu quả

[ ] Chọn dtype nhỏ nhất đủ dùng (float32 cho sensor, int16 cho count nhỏ)
[ ] Tạo array với shape chính xác từ đầu — không dùng np.append trong loop
[ ] Kiểm tra arr.flags['C_CONTIGUOUS'] trước khi truyền vào C extension
[ ] Dùng np.ascontiguousarray() khi cần contiguous memory

Vectorization

[ ] Thay Python for-loop bằng NumPy operation khi xử lý array
[ ] Dùng np.where() thay vì if/else trong loop
[ ] Dùng boolean indexing cho filtering: arr[arr > 0]
[ ] Dùng out= parameter để tránh intermediate array

Broadcasting an toàn

[ ] Luôn in shape trước khi broadcast: print(f"a: {a.shape}, b: {b.shape}")
[ ] Dùng keepdims=True trong aggregation để giữ dimension cho broadcast
[ ] Dùng np.newaxis rõ ràng thay vì phụ thuộc implicit broadcasting

Xử lý NaN và edge case

[ ] Dùng np.nanmean/nansum/nanmax khi data có thể chứa NaN
[ ] Dùng np.isclose/allclose thay vì == cho so sánh float
[ ] Kiểm tra overflow khi dùng integer dtype nhỏ

Bài tập luyện tập

Bài 1: Strides và Memory Layout — Foundation

Đề bài: Cho array arr = np.arange(24).reshape(2, 3, 4) với dtype int64 (8 bytes). Tính strides cho C-order và truy cập phần tử arr[1, 2, 3] bằng offset thủ công.

🧠 Quiz

Câu hỏi: Strides của array shape (2, 3, 4) dtype int64 ở C-order là gì?

[ ] A. (8, 16, 32)
[ ] B. (32, 16, 8)
[x] C. (96, 32, 8)
[ ] D. (8, 24, 96) Giải thích: Stride cuối = 8 bytes (1 int64). Stride giữa = 4 × 8 = 32 bytes (nhảy qua 4 phần tử). Stride đầu = 3 × 4 × 8 = 96 bytes (nhảy qua 1 "plane" 3×4). Offset cho [1,2,3] = 1×96 + 2×32 + 3×8 = 96 + 64 + 24 = 184 bytes → phần tử thứ 23 (giá trị 23).

Bài 2: Vectorize conditional logic — Intermediate

Đề bài: Cho array nhiệt độ temps (shape N), viết hàm hoàn toàn vectorized (không dùng loop) phân loại: dưới 0 → "freezing", 0-25 → "normal", trên 25 → "hot". Trả về array int: 0, 1, 2 tương ứng.

💡 Gợi ý

np.where lồng nhau hoặc np.select cho nhiều điều kiện
np.digitize cũng hoạt động cho binning

✅ Lời giải

python

import numpy as np

def classify_temperature(temps: np.ndarray) -> np.ndarray:
    """Phân loại nhiệt độ, hoàn toàn vectorized."""
    conditions = [temps < 0, (temps >= 0) & (temps <= 25), temps > 25]
    choices = [0, 1, 2]
    return np.select(conditions, choices, default=1)

# Cách 2: np.digitize
def classify_temperature_v2(temps: np.ndarray) -> np.ndarray:
    return np.digitize(temps, bins=[0, 25])
    # < 0 → 0, [0,25) → 1, >= 25 → 2

# Test
temps = np.array([-5.0, 10.0, 0.0, 30.0, 25.0, -1.0])
print(classify_temperature(temps))     # [0, 1, 1, 2, 1, 0]
print(classify_temperature_v2(temps))  # [0, 1, 1, 2, 2, 0]
# Lưu ý: boundary handling khác nhau giữa 2 cách

Phân tích: np.select cho phép nhiều condition/choice pairs — linh hoạt nhất. np.digitize ngắn gọn hơn cho binning đơn giản. Cả hai đều O(N) time, O(N) space, không loop Python.

Bài 3: Tối ưu pairwise distance computation — Advanced

Đề bài: Tính ma trận khoảng cách Euclidean giữa N điểm (shape (N, D)) mà KHÔNG dùng loop. So sánh hiệu năng với loop naive cho N=5000, D=3.

💡 Gợi ý

Dùng broadcasting: points[:, np.newaxis, :] - points[np.newaxis, :, :]
Hoặc dùng công thức: ||a-b||² = ||a||² + ||b||² - 2·a·b
scipy.spatial.distance.cdist là baseline so sánh

✅ Lời giải

python

import numpy as np
import timeit

def pairwise_naive(points: np.ndarray) -> np.ndarray:
    """Loop naive — O(N²D) với Python overhead."""
    n = points.shape[0]
    dist = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = np.sqrt(np.sum((points[i] - points[j]) ** 2))
    return dist

def pairwise_broadcast(points: np.ndarray) -> np.ndarray:
    """Broadcasting — O(N²D) nhưng vectorized."""
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def pairwise_algebraic(points: np.ndarray) -> np.ndarray:
    """Algebraic trick — tận dụng BLAS matmul."""
    sq_norms = np.sum(points ** 2, axis=1)
    dot_product = points @ points.T
    sq_dist = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * dot_product
    np.maximum(sq_dist, 0, out=sq_dist)  # Tránh sqrt(negative) do floating point
    return np.sqrt(sq_dist)

# Benchmark
N, D = 5000, 3
points = np.random.random((N, D))

t_bc = timeit.timeit(lambda: pairwise_broadcast(points), number=1)
t_alg = timeit.timeit(lambda: pairwise_algebraic(points), number=1)
print(f"Broadcast: {t_bc:.3f}s")
print(f"Algebraic: {t_alg:.3f}s")
# Ước lượng: broadcast ~2s, algebraic ~0.3s (dùng BLAS matmul)
# Loop naive: ~60s (không nên chạy!)

Phân tích: Phương pháp algebraic nhanh nhất vì biến đổi bài toán thành matrix multiplication → tận dụng BLAS (O(N²D) nhưng với SIMD + cache blocking). Broadcast version tạo intermediate array (N, N, D) — tốn N²×D×8 bytes RAM. Với N=5000, D=3: broadcast cần ~600MB intermediate, algebraic cần ~200MB.

Liên kết học tiếp

Từ khóa glossary: ndarray, vectorization, broadcasting, strides, C-contiguous, SIMD, BLAS, view, copy, dtype, memory layout

Tìm kiếm liên quan: numpy tối ưu hiệu năng, numpy broadcasting quy tắc, numpy vectorization, ndarray memory layout, numpy thay thế python loop

NumPy Internals — Hiểu để viết code nhanh hơn 100 lần ​

Bức tranh tư duy ​

Cốt lõi kỹ thuật ​

ndarray memory layout: shape, strides, dtype ​

C-order vs Fortran-order ​

View vs Copy — Nguyên tắc sống còn ​

Vectorization — Thay Python loop bằng C loop ​

Broadcasting rules — Ba quy tắc vàng ​

SIMD và NumPy ​

Tránh tạo intermediate array ​

Thực chiến ​

Tình huống: Tối ưu data pipeline sensor IoT ​

Sai lầm điển hình ​

❌ Sai lầm 1: np.append trong loop ​

❌ Sai lầm 2: Dùng object dtype ​

❌ Sai lầm 3: Broadcasting sai shape gây kết quả sai im lặng ​

❌ Sai lầm 4: NaN lan truyền im lặng ​

Under the Hood ​

Tại sao vectorization nhanh hơn? ​

Chi phí đo thực tế ​

Contiguous vs Non-contiguous performance ​

Trade-offs khi dùng NumPy ​

Checklist ghi nhớ ​

Bài tập luyện tập ​

Bài 1: Strides và Memory Layout — Foundation ​

Bài 2: Vectorize conditional logic — Intermediate ​

Bài 3: Tối ưu pairwise distance computation — Advanced ​

Liên kết học tiếp ​

NumPy Internals — Hiểu để viết code nhanh hơn 100 lần

Bức tranh tư duy

Cốt lõi kỹ thuật

ndarray memory layout: shape, strides, dtype

C-order vs Fortran-order

View vs Copy — Nguyên tắc sống còn

Vectorization — Thay Python loop bằng C loop

Broadcasting rules — Ba quy tắc vàng

SIMD và NumPy

Tránh tạo intermediate array

Thực chiến

Tình huống: Tối ưu data pipeline sensor IoT

Sai lầm điển hình

❌ Sai lầm 1: np.append trong loop

❌ Sai lầm 2: Dùng object dtype

❌ Sai lầm 3: Broadcasting sai shape gây kết quả sai im lặng

❌ Sai lầm 4: NaN lan truyền im lặng

Under the Hood

Tại sao vectorization nhanh hơn?

Chi phí đo thực tế

Contiguous vs Non-contiguous performance

Trade-offs khi dùng NumPy

Checklist ghi nhớ

Bài tập luyện tập

Bài 1: Strides và Memory Layout — Foundation

Bài 2: Vectorize conditional logic — Intermediate

Bài 3: Tối ưu pairwise distance computation — Advanced

Liên kết học tiếp