Profiling Python — Tìm đúng chỗ đau trước khi chữa

Một hệ thống e-commerce xử lý 2.000 request/giây bỗng tăng latency p99 từ 120ms lên 800ms sau lần deploy thứ Sáu. Đội ngũ dành cả cuối tuần refactor database queries — nhưng latency vẫn không giảm. Khi cuối cùng chạy py-spy trên production, nguyên nhân thực sự là một hàm serialize JSON lồng 6 cấp, chiếm 73% CPU time. Hai dòng code thay đổi, latency về lại 110ms.

Bài học nhức nhối: không đo thì không biết. Profiling không phải kỹ thuật "nice-to-have" — nó là kỷ luật bắt buộc của kỹ sư hiệu năng. Mọi phút bạn optimize mà không có dữ liệu profiling là phút bạn đang đoán mò.

Trong bài này, bạn sẽ nắm trọn bộ công cụ profiling từ deterministic (cProfile) đến sampling (py-spy), từ CPU đến memory, từ text output đến flame graph — và quan trọng nhất, biết chọn đúng công cụ cho đúng tình huống.

Bức tranh tư duy

Hãy tưởng tượng bạn là bác sĩ cấp cứu. Bệnh nhân (ứng dụng) vào phòng cấp cứu nói "tôi đau". Bạn không thể mổ ngay — phải chụp X-quang trước. Profiling chính là bước chụp X-quang đó.

Triệu chứng: "App chậm"
         │
         ▼
┌─────────────────┐     ┌─────────────────┐
│  cProfile       │────▶│  Xác định hàm   │
│  (X-quang toàn  │     │  nào ngốn CPU   │
│   thân)         │     │  nhiều nhất      │
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
┌─────────────────┐     ┌─────────────────┐
│  line_profiler   │────▶│  Xác định dòng  │
│  (CT scan chi   │     │  nào trong hàm   │
│   tiết)         │     │  là thủ phạm     │
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
┌─────────────────┐     ┌─────────────────┐
│  py-spy / flame  │────▶│  Trực quan hóa  │
│  graph (MRI cho  │     │  toàn bộ call   │
│  production)     │     │  stack theo thời │
└─────────────────┘     │  gian thực       │
                        └─────────────────┘

Nguyên tắc Pareto áp dụng rõ rệt: 80% thời gian chạy thường tập trung ở 20% code. Profiler giúp bạn tìm đúng 20% đó — thay vì dàn trải effort optimize lung tung. Analogy này hơi đơn giản ở chỗ: trong thực tế, bottleneck có thể phân tán hoặc thay đổi theo tải, nên bạn cần profile ở nhiều điều kiện khác nhau.

Cốt lõi kỹ thuật

cProfile — Profiler deterministic có sẵn

cProfile theo dõi mọi lời gọi hàm, ghi lại số lần gọi, thời gian tự thân (tottime), và thời gian tích lũy (cumtime). Nó là công cụ đầu tiên bạn nên dùng khi chưa biết bottleneck ở đâu.

python

import cProfile
import pstats
from io import StringIO

def compute_squares(n: int) -> list[int]:
    return [i ** 2 for i in range(n)]

def aggregate(data: list[int]) -> int:
    total = 0
    for val in data:
        total += val
    return total

def main() -> None:
    for _ in range(50):
        data = compute_squares(100_000)
        aggregate(data)

# Cách 1: dòng lệnh
# python -m cProfile -s cumtime script.py

# Cách 2: trong code — kiểm soát phạm vi profile
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stream = StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(15)
print(stream.getvalue())

Kết quả đọc như sau:

Cột	Ý nghĩa
`ncalls`	Số lần hàm được gọi
`tottime`	Thời gian tự thân trong hàm (không tính sub-call)
`cumtime`	Thời gian tích lũy (bao gồm mọi sub-call)
`percall`	Thời gian trung bình mỗi lần gọi

Mẹo thực tế: sort theo cumtime để tìm call chain tốn nhất, sau đó sort theo tottime để tìm hàm tự nó chậm nhất.

python

import cProfile
import pstats
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def profile_block(label: str = "block") -> Iterator[cProfile.Profile]:
    """Context manager profile một đoạn code cụ thể."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.sort_stats("cumulative")
        print(f"\n{'='*20} Profile: {label} {'='*20}")
        stats.print_stats(10)

# Sử dụng
# with profile_block("xử lý đơn hàng"):
#     process_orders(pending_orders)

line_profiler — Profile từng dòng code

Khi cProfile chỉ ra hàm nào chậm, line_profiler cho biết dòng nào trong hàm đó là thủ phạm.

python

# order_processor.py
from line_profiler import profile

@profile
def process_order(order: dict) -> dict:
    validated = validate_fields(order)           # Dòng 1
    enriched = enrich_with_user_data(validated)  # Dòng 2
    priced = calculate_pricing(enriched)         # Dòng 3
    serialized = serialize_to_json(priced)       # Dòng 4
    return serialized                            # Dòng 5

# Chạy: kernprof -l -v order_processor.py

Output mẫu:

Line #   Hits     Time  Per Hit  % Time  Line Contents
=======================================================
     4      1      0.1     0.1     0.0   validated = validate_fields(order)
     5      1     12.3    12.3     5.2   enriched = enrich_with_user_data(validated)
     6      1      3.4     3.4     1.4   priced = calculate_pricing(enriched)
     7      1    220.5   220.5    93.3   serialized = serialize_to_json(priced)
     8      1      0.1     0.1     0.0   return serialized

Nhìn cột % Time: serialize chiếm 93.3% — đây là nơi cần tập trung optimize.

python

from line_profiler import LineProfiler

def profile_functions(*target_funcs):
    """Profile nhiều hàm cùng lúc bằng code."""
    lp = LineProfiler()
    for func in target_funcs:
        lp.add_function(func)
    # Wrap hàm đầu tiên làm entry point
    wrapped = lp(target_funcs[0])
    return wrapped, lp

# wrapped_main, profiler = profile_functions(main, helper_a, helper_b)
# wrapped_main()
# profiler.print_stats()

memory_profiler — Theo dõi bộ nhớ từng dòng

Không chỉ CPU, memory cũng cần profile — đặc biệt với service chạy lâu dài.

python

from memory_profiler import profile as mem_profile

@mem_profile
def build_report(records: list[dict]) -> str:
    transformed = [transform(r) for r in records]   # Cấp phát lớn
    grouped = group_by_category(transformed)         # Cấp phát thêm
    del transformed                                  # Giải phóng sớm
    output = render_html(grouped)
    return output

# Chạy: python -m memory_profiler report.py

Output hiển thị Mem usage (tổng bộ nhớ) và Increment (thay đổi so với dòng trước) — giúp bạn thấy chính xác dòng nào cấp phát bao nhiêu MB.

py-spy — Sampling profiler cho production

Khác với cProfile (deterministic, overhead 10-20%), py-spy dùng sampling với overhead dưới 1%. Nó attach vào process đang chạy mà không cần restart hay sửa code.

bash

# Profile process đang chạy, xuất flame graph SVG
py-spy record -o profile.svg --pid 12345

# Chế độ top-like xem realtime
py-spy top --pid 12345

# Profile từ đầu
py-spy record -o profile.svg -- python app.py

Flame graph — Đọc hiểu trực quan

┌────────────────────────────────────────────────────────┐
│                      main()                            │
├──────────────────────────────┬─────────────────────────┤
│       process_data()         │     save_results()      │
├──────────┬───────────────────┤                         │
│ parse()  │  serialize_json() │                         │
├──────────┴───────────────────┴─────────────────────────┤
│  Chiều rộng = thời gian chiếm  │  Chiều cao = độ sâu  │
│  (càng rộng = càng tốn CPU)    │  call stack           │
└────────────────────────────────────────────────────────┘

Quy tắc đọc: khối rộng nhất ở đáy là bottleneck. Nếu khối rộng nhưng ở tầng cao (gần root), nghĩa là tổng các sub-call tốn — cần đào sâu xuống con của nó.

Thực chiến

Tình huống: Tìm bottleneck API endpoint chậm trong production

Bối cảnh: API endpoint /api/v2/reports/monthly của hệ thống analytics trả về sau 4.2 giây trung bình. SLA yêu cầu dưới 1 giây. Server chạy Python 3.12, gunicorn + uvicorn, 8 worker.

Mục tiêu: Xác định chính xác bottleneck và giảm latency xuống dưới 800ms.

python

import cProfile
import pstats
from io import StringIO
from functools import wraps
from typing import Callable, Any
import time
import logging

logger = logging.getLogger(__name__)

def profile_endpoint(threshold_ms: float = 1000.0):
    """Decorator tự động profile endpoint khi vượt ngưỡng."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000

            if elapsed_ms > threshold_ms:
                # Chạy lại với profiler để lấy chi tiết
                profiler = cProfile.Profile()
                profiler.enable()
                _ = func(*args, **kwargs)
                profiler.disable()

                stream = StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                stats.strip_dirs().sort_stats("cumulative")
                stats.print_stats(20)

                logger.warning(
                    "SLOW ENDPOINT %s: %.0fms\n%s",
                    func.__name__,
                    elapsed_ms,
                    stream.getvalue(),
                )
            return result
        return wrapper
    return decorator

@profile_endpoint(threshold_ms=800.0)
def generate_monthly_report(month: int, year: int) -> dict:
    raw_data = fetch_raw_metrics(month, year)       # DB query
    aggregated = aggregate_by_dimension(raw_data)    # CPU
    formatted = format_report_payload(aggregated)    # Serialize
    return formatted

Phân tích kết quả profile thực tế:

Hàm	cumtime	% tổng	Nhận định
`format_report_payload`	3.1s	73%	Bottleneck chính — serialize lồng sâu
`fetch_raw_metrics`	0.8s	19%	DB query hợp lý với dataset lớn
`aggregate_by_dimension`	0.3s	8%	Chấp nhận được

Fix: thay json.dumps tùy chỉnh bằng orjson.dumps, giảm format_report_payload từ 3.1s xuống 0.15s. Tổng latency: 0.6s — đạt SLA.

python

# Dùng py-spy cho production (không cần sửa code)
# Bước 1: Tìm PID gunicorn worker
# ps aux | grep gunicorn

# Bước 2: Attach và record 30 giây
# py-spy record --pid 54321 -d 30 -o monthly_report.svg

# Bước 3: Mở SVG trong browser, tìm khối rộng nhất

Sai lầm điển hình

❌ Sai lầm 1: Optimize mà không profile

Vấn đề: Đoán bottleneck rồi optimize lung tung.

python

# SAI: Tối ưu hàm transform vì "cảm giác" nó chậm
def process_files(paths: list[str]) -> list[dict]:
    results = []
    for path in paths:
        data = read_file(path)       # 95% thời gian (I/O)
        result = transform(data)      # 5% thời gian (CPU)
        results.append(result)
    return results

# Dành 2 ngày tối ưu transform() bằng Cython → tiết kiệm 3% tổng

Tại sao sai: I/O chiếm 95% nhưng developer tối ưu phần CPU 5%. Trong production, điều này xảy ra thường xuyên vì người ta hay optimize phần "nhìn thấy" (code logic) thay vì phần "vô hình" (I/O wait).

python

# ĐÚNG: Profile trước, optimize đúng chỗ
import cProfile

# Bước 1: chạy cProfile
# python -m cProfile -s cumtime process.py
# → Phát hiện read_file chiếm 95%

# Bước 2: optimize I/O
import asyncio
import aiofiles

async def process_files_optimized(paths: list[str]) -> list[dict]:
    async def process_one(path: str) -> dict:
        async with aiofiles.open(path) as f:
            data = await f.read()
        return transform(data)

    return await asyncio.gather(
        *(process_one(p) for p in paths)
    )

❌ Sai lầm 2: Benchmark với dữ liệu không thực tế

Vấn đề: Benchmark trên mảng 10 phần tử, kết luận cho production triệu phần tử.

python

import timeit

# SAI: Dữ liệu quá nhỏ, kết quả sai lệch
data_tiny = [1, 2, 3, 4, 5]
time_a = timeit.timeit(lambda: sorted(data_tiny), number=100_000)
time_b = timeit.timeit(lambda: list(sorted(data_tiny)), number=100_000)
# Kết luận: "sorted nhanh hơn 2%" → vô nghĩa ở scale lớn

Tại sao sai: Với n=5, overhead hàm gọi lấn át chi phí thuật toán. Kết luận không áp dụng được cho n=100.000.

python

import timeit
import random

# ĐÚNG: Benchmark với kích thước thực tế
data_real = [random.random() for _ in range(100_000)]

def bench(label: str, stmt, setup_data):
    t = timeit.timeit(lambda: stmt(setup_data.copy()), number=50)
    print(f"{label}: {t/50*1000:.1f}ms/lần")

bench("sorted()", sorted, data_real)
bench("list.sort()", lambda d: d.sort(), data_real)

❌ Sai lầm 3: Dùng deterministic profiler trên production

Vấn đề: Bật cProfile cho production server, overhead 15-20% làm tăng latency cho tất cả user.

python

# SAI: cProfile trên production
import cProfile

profiler = cProfile.Profile()
profiler.enable()
app.run()  # Tất cả request bị slow down 15-20%
profiler.disable()

Tại sao sai: cProfile hook vào mọi function call, tạo overhead đáng kể. Với service xử lý hàng nghìn request/giây, 15% overhead có nghĩa là hàng trăm request bị timeout.

python

# ĐÚNG: Dùng sampling profiler py-spy (overhead < 1%)
# py-spy record --pid $(pgrep -f "gunicorn") -d 60 -o prod.svg

# Hoặc profile có điều kiện (chỉ request chậm)
import cProfile
import time

def conditional_profile(func, threshold_s: float = 2.0):
    """Chỉ profile khi hàm chạy quá chậm."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start

    if elapsed > threshold_s:
        profiler = cProfile.Profile()
        profiler.enable()
        _ = func()  # Chạy lại với profiler
        profiler.disable()
        profiler.dump_stats(f"slow_{time.time():.0f}.prof")

    return result

❌ Sai lầm 4: Quên warmup khi benchmark

Vấn đề: Lần chạy đầu bao gồm import, cache miss, JIT compile — kết quả bị inflate.

python

import timeit

# SAI: Không warmup, lần đầu chậm vì cache miss
result = timeit.timeit("func()", globals=globals(), number=1)

Tại sao sai: Python cache bytecode, CPU cache data — lần chạy đầu luôn chậm hơn steady-state.

python

import timeit

# ĐÚNG: Warmup trước benchmark
def benchmark_fair(func, n_warmup: int = 10, n_measure: int = 100):
    for _ in range(n_warmup):
        func()
    elapsed = timeit.timeit(func, number=n_measure)
    return elapsed / n_measure

Under the Hood

Cơ chế hoạt động của cProfile

cProfile sử dụng C-level hook qua PyEval_SetProfile() — mỗi khi CPython gọi hoặc trả về từ một function, hook này được kích hoạt. Vì viết bằng C (khác profile module viết bằng Python), overhead tương đối thấp.

python

import sys

# cProfile tương đương việc đặt trace function:
def trace_calls(frame, event, arg):
    if event == "call":
        print(f"→ Gọi {frame.f_code.co_name}")
    elif event == "return":
        print(f"← Trả về từ {frame.f_code.co_name}")
    return trace_calls

# sys.setprofile(trace_calls)  # Tương tự cProfile nhưng chậm hơn nhiều

So sánh công cụ profiling

Công cụ	Loại	Overhead	Độ chi tiết	Dùng khi
`cProfile`	Deterministic	10-20%	Mức hàm	Dev, staging — tìm hàm chậm
`line_profiler`	Deterministic	100-500%	Mức dòng	Dev — tìm dòng chậm trong hàm
`memory_profiler`	Deterministic	200-1000%	Mức dòng (RAM)	Dev — phát hiện memory hog
`py-spy`	Sampling	<1%	Mức hàm + stack	Production — an toàn
`Scalene`	Sampling+Alloc	5-15%	CPU + Memory + GPU	Dev — profile toàn diện
`timeit`	Micro-benchmark	N/A	Biểu thức đơn	So sánh hai cách viết

Deterministic vs Sampling

Deterministic (cProfile): Hook vào mọi call/return → chính xác 100% số lần gọi, nhưng overhead cao vì hook chạy trên mọi function call. Với call stack sâu 20 cấp và 10.000 call/giây, hook chạy 200.000 lần/giây.

Sampling (py-spy): Cứ mỗi N milliseconds, "chụp ảnh" call stack hiện tại → overhead gần bằng 0, nhưng có thể bỏ lỡ function chạy rất nhanh. Với sample rate 100Hz (mặc định py-spy), hàm chạy dưới 10ms có thể không xuất hiện.

Chi phí thực đo

python

import cProfile
import time

def baseline(n: int = 1_000_000) -> int:
    total = 0
    for i in range(n):
        total += i
    return total

# Không profile
start = time.perf_counter()
baseline()
no_profile = time.perf_counter() - start

# Có cProfile
profiler = cProfile.Profile()
profiler.enable()
start = time.perf_counter()
baseline()
with_profile = time.perf_counter() - start
profiler.disable()

overhead_pct = (with_profile - no_profile) / no_profile * 100
print(f"Không profile: {no_profile*1000:.1f}ms")
print(f"Có cProfile:   {with_profile*1000:.1f}ms")
print(f"Overhead:      {overhead_pct:.1f}%")
# Ước lượng: overhead khoảng 10-25% tùy độ sâu call stack

Checklist ghi nhớ

✅ Checklist triển khai

Quy trình profiling

[ ] Luôn profile trước khi optimize — không bao giờ đoán bottleneck
[ ] Chạy cProfile (-s cumtime) để xác định hàm tốn nhất
[ ] Dùng line_profiler cho hàm đã xác định là bottleneck
[ ] Benchmark với dữ liệu kích thước production, không phải toy data

Chọn đúng công cụ

[ ] Dev/staging: cProfile hoặc Scalene cho tổng quan
[ ] Tìm dòng cụ thể: line_profiler (@profile + kernprof)
[ ] Production: py-spy (sampling, overhead < 1%)
[ ] Memory: memory_profiler hoặc tracemalloc

Kỷ luật benchmark

[ ] Warmup ít nhất 10 lần trước khi đo
[ ] Chạy đủ số lần lặp để kết quả ổn định
[ ] Giữ nguyên array gốc — dùng .copy() nếu hàm mutate input
[ ] So sánh trên cùng hardware và điều kiện tải

Bài tập luyện tập

Bài 1: Phân tích output cProfile — Foundation

Đề bài: Cho output cProfile sau, xác định hàm nào cần optimize trước.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.002    0.000   12.540    0.125 app.py:10(handle_request)
      100    0.850    0.009    0.850    0.009 app.py:20(validate_input)
      100    8.200    0.082   11.500    0.115 app.py:35(render_template)
     5000    3.100    0.001    3.100    0.001 app.py:55(escape_html)
      100    1.030    0.010    1.030    0.010 app.py:70(query_database)

🧠 Quiz

Câu hỏi: Hàm nào là bottleneck chính cần optimize đầu tiên?

[ ] A. handle_request vì cumtime cao nhất (12.54s)
[ ] B. query_database vì truy vấn DB thường là bottleneck
[x] C. render_template vì tottime cao nhất (8.2s) và chiếm phần lớn cumtime
[ ] D. escape_html vì được gọi nhiều nhất (5000 lần) Giải thích: handle_request có cumtime cao vì nó gọi các hàm con — bản thân nó chỉ tốn 0.002s. render_template có tottime 8.2s — đây là thời gian thực sự tiêu tốn trong hàm này. escape_html được gọi từ trong render_template (3.1s), nên optimize render_template hoặc escape_html đều giúp giảm bottleneck.

💡 Gợi ý

tottime = thời gian tự thân (không tính sub-call)
cumtime = thời gian tích lũy (bao gồm sub-call)
Hàm wrapper thường có cumtime cao nhưng tottime thấp — không phải bottleneck thực

Bài 2: Viết profiling decorator — Intermediate

Đề bài: Viết decorator @auto_profile tự động log profiling data khi hàm chạy quá threshold_ms. Yêu cầu: lưu .prof file, log top 5 hàm tốn nhất.

💡 Gợi ý

Dùng time.perf_counter() đo thời gian thực
Dùng cProfile.Profile() cho lần chạy thứ hai (nếu vượt ngưỡng)
Dùng pstats.Stats để sort và lấy top N

✅ Lời giải

python

import cProfile
import pstats
import time
import logging
from functools import wraps
from io import StringIO
from typing import Callable, Any

logger = logging.getLogger(__name__)

def auto_profile(threshold_ms: float = 500.0):
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000

            if elapsed_ms > threshold_ms:
                prof = cProfile.Profile()
                prof.enable()
                _ = func(*args, **kwargs)
                prof.disable()

                filename = f"slow_{func.__name__}_{time.time():.0f}.prof"
                prof.dump_stats(filename)

                stream = StringIO()
                stats = pstats.Stats(prof, stream=stream)
                stats.strip_dirs().sort_stats("cumulative")
                stats.print_stats(5)
                logger.warning(
                    "SLOW %s (%.0fms > %.0fms):\n%s",
                    func.__name__, elapsed_ms, threshold_ms,
                    stream.getvalue(),
                )
            return result
        return wrapper
    return decorator

# Sử dụng:
# @auto_profile(threshold_ms=200.0)
# def generate_report(month: int) -> dict: ...

Phân tích: Decorator chạy hàm 2 lần khi vượt ngưỡng — lần 1 lấy kết quả, lần 2 profile. Trade-off: tốn thêm 1 lần gọi, nhưng đảm bảo kết quả trả về không bị ảnh hưởng bởi profiler overhead. Trong production thực tế, bạn có thể dùng sampling thay vì deterministic cho lần profile.

Bài 3: Tìm bottleneck trong data pipeline — Advanced

Đề bài: Cho pipeline dưới đây, dùng cProfile và line_profiler để xác định bottleneck, sau đó optimize để đạt speedup ít nhất 5x.

python

import json
import hashlib

def data_pipeline(records: list[dict]) -> list[str]:
    results = []
    for record in records:
        cleaned = {k: str(v).strip().lower() for k, v in record.items()}
        serialized = json.dumps(cleaned, sort_keys=True)
        checksum = hashlib.md5(serialized.encode()).hexdigest()
        results.append(f"{checksum}:{serialized}")
    return results

💡 Gợi ý

Profile với 100.000 records để thấy rõ bottleneck
json.dumps trên từng record riêng lẻ rất tốn — cân nhắc batch hoặc dùng orjson
sort_keys=True thêm overhead sorting mỗi lần gọi

✅ Lời giải

python

import hashlib
try:
    import orjson
    def fast_dumps(obj: dict) -> bytes:
        return orjson.dumps(obj, option=orjson.OPT_SORT_KEYS)
except ImportError:
    import json
    def fast_dumps(obj: dict) -> bytes:
        return json.dumps(obj, sort_keys=True).encode()

def data_pipeline_optimized(records: list[dict]) -> list[str]:
    results: list[str] = []
    md5 = hashlib.md5
    append = results.append

    for record in records:
        cleaned = {k: str(v).strip().lower() for k, v in record.items()}
        serialized = fast_dumps(cleaned)
        checksum = md5(serialized).hexdigest()
        append(f"{checksum}:{serialized.decode()}")

    return results

# Benchmark: orjson nhanh hơn json.dumps 5-10x cho dict nhỏ
# Kết hợp với local variable lookup (md5, append), tổng speedup ~5-8x

Phân tích: Bottleneck chính là json.dumps gọi N lần. orjson được viết bằng Rust, nhanh hơn đáng kể cho serialization. Local variable binding (md5 = hashlib.md5) giảm overhead LOAD_GLOBAL cho mỗi iteration.

Liên kết học tiếp

Từ khóa glossary: profiling, cProfile, line_profiler, py-spy, flame graph, sampling profiler, deterministic profiler, bottleneck, benchmarking

Tìm kiếm liên quan: python tối ưu hiệu năng, đo hiệu năng python, tìm bottleneck python, flame graph python

Profiling Python — Tìm đúng chỗ đau trước khi chữa ​

Bức tranh tư duy ​

Cốt lõi kỹ thuật ​

cProfile — Profiler deterministic có sẵn ​

line_profiler — Profile từng dòng code ​

memory_profiler — Theo dõi bộ nhớ từng dòng ​

py-spy — Sampling profiler cho production ​

Flame graph — Đọc hiểu trực quan ​

Thực chiến ​

Tình huống: Tìm bottleneck API endpoint chậm trong production ​

Sai lầm điển hình ​

❌ Sai lầm 1: Optimize mà không profile ​

❌ Sai lầm 2: Benchmark với dữ liệu không thực tế ​

❌ Sai lầm 3: Dùng deterministic profiler trên production ​

❌ Sai lầm 4: Quên warmup khi benchmark ​

Under the Hood ​

Cơ chế hoạt động của cProfile ​

So sánh công cụ profiling ​

Deterministic vs Sampling ​

Chi phí thực đo ​

Checklist ghi nhớ ​

Bài tập luyện tập ​

Bài 1: Phân tích output cProfile — Foundation ​

Bài 2: Viết profiling decorator — Intermediate ​

Bài 3: Tìm bottleneck trong data pipeline — Advanced ​

Liên kết học tiếp ​

Profiling Python — Tìm đúng chỗ đau trước khi chữa

Bức tranh tư duy

Cốt lõi kỹ thuật

cProfile — Profiler deterministic có sẵn

line_profiler — Profile từng dòng code

memory_profiler — Theo dõi bộ nhớ từng dòng

py-spy — Sampling profiler cho production

Flame graph — Đọc hiểu trực quan

Thực chiến

Tình huống: Tìm bottleneck API endpoint chậm trong production

Sai lầm điển hình

❌ Sai lầm 1: Optimize mà không profile

❌ Sai lầm 2: Benchmark với dữ liệu không thực tế

❌ Sai lầm 3: Dùng deterministic profiler trên production

❌ Sai lầm 4: Quên warmup khi benchmark

Under the Hood

Cơ chế hoạt động của cProfile

So sánh công cụ profiling

Deterministic vs Sampling

Chi phí thực đo

Checklist ghi nhớ

Bài tập luyện tập

Bài 1: Phân tích output cProfile — Foundation

Bài 2: Viết profiling decorator — Intermediate

Bài 3: Tìm bottleneck trong data pipeline — Advanced

Liên kết học tiếp