C Extensions: An escape hatch when Python is not fast enough

A medical imaging system needs to apply a convolution filter to 10,000 DICOM images per hour. The pure Python version takes 45 minutes. Vectorizing with NumPy brings that down to 8 minutes, which is still not enough. The team writes the convolution kernel in Cython with typed memoryviews and releases the GIL, and the run time drops to 90 seconds. With multithreading on top: 22 seconds. Same algorithm, same results, about 120 times faster than pure Python.

A C extension is not the first tool to reach for. It is the last resort, after you have profiled, vectorized with NumPy, and optimized the algorithm. But when the bottleneck really is a tight computational loop, moving that loop into native code is the most direct way past CPython's performance ceiling. This lesson covers five approaches, from the simplest (ctypes) to the most powerful (the Python C API), so that you can pick the right tool for each situation.

The mental model

Picture the Python interpreter as a translator sitting between you (the code) and the machine (the CPU). For every addition a + b, the translator has to check types, look up the __add__ method, create a result object, and manage reference counts. That costs tens of nanoseconds for an addition the CPU itself performs in about 0.3 nanoseconds.

Python loop (1 million iterations):
┌──────┐  ┌─────────────┐  ┌──────┐
│ code │→ │ interpreter │→ │ CPU  │  × 1,000,000
│ a+b  │  │ type check  │  │ add  │  = ~200ms
│      │  │ ref count   │  │      │
└──────┘  │ obj create  │  └──────┘
          └─────────────┘

C extension (1 million iterations):
┌──────────────────────┐  ┌──────┐
│ compiled C function  │→ │ CPU  │  × 1,000,000
│ (bypass interpreter) │  │ add  │  = ~0.5ms
└──────────────────────┘  └──────┘
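To get a feel for the gap sketched in the diagram, one could time an explicit Python loop against the built-in sum(), which runs the same loop in C. This is only an illustrative measurement: the helper name python_loop_sum is made up for this sketch, and the absolute numbers depend on the machine and the Python version.

```python
import timeit


def python_loop_sum(values):
    """Add the numbers one by one in interpreted bytecode."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(1_000_000))

# Both variants compute the same result; only the place where the loop runs differs.
t_loop = timeit.timeit(lambda: python_loop_sum(values), number=5)
t_builtin = timeit.timeit(lambda: sum(values), number=5)

print(f"Python loop : {t_loop / 5 * 1e3:.1f} ms per pass")
print(f"built-in sum: {t_builtin / 5 * 1e3:.1f} ms per pass")
```

The built-in still works on Python int objects, so it is slower than a raw C loop over machine integers, but it already avoids the per-iteration bytecode dispatch.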

Five ways to call native code from Python, ordered by decreasing ease of use and, roughly, increasing performance and control:

Easiest ──────────────────────────── Most powerful
ctypes → cffi → Cython → pybind11 → Python C API
(call a (call a (write   (C++ ↔    (full
 C lib   C lib,  hybrid   Python    control)
 as-is)  nicer   Python/  bindings)
         API)    C)

This ordering breaks down when the project has to support PyPy (cffi is a better fit than ctypes) or when the C++ codebase is complex (pybind11 is a stronger choice than Cython).

Core techniques

ctypes: Calling an existing C library without compiling

ctypes is a built-in module that loads a shared library (.so/.dll/.dylib) and calls its C functions directly. No compiler and no header files are needed.

python

import ctypes
import sys

# Load the C standard library
if sys.platform == "win32":
    libc = ctypes.CDLL("msvcrt")
elif sys.platform == "darwin":
    libc = ctypes.cdll.LoadLibrary("libSystem.B.dylib")
else:
    libc = ctypes.CDLL("libc.so.6")

# Declare the signature: REQUIRED to avoid undefined behavior.
# For a variadic function such as printf, list only the fixed parameters.
libc.printf.argtypes = [ctypes.c_char_p]
libc.printf.restype = ctypes.c_int

libc.printf(b"Hello from C! pid=%d\n", ctypes.c_int(12345))
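The per-platform branches above could alternatively be replaced by ctypes.util.find_library(), which looks up the library name for the current system. The following is a minimal sketch under that assumption; the helper name load_libc is hypothetical, and strlen is used only because it has a simple, non-variadic signature.

```python
import ctypes
import ctypes.util
import sys


def load_libc() -> ctypes.CDLL:
    """Load the C runtime without hard-coding a per-platform file name."""
    if sys.platform == "win32":
        return ctypes.CDLL("msvcrt")
    # find_library() may return None in minimal environments; on POSIX,
    # CDLL(None) then exposes the symbols already loaded into the process.
    return ctypes.CDLL(ctypes.util.find_library("c"))


libc = load_libc()

# C prototype: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"Hello from C!")
print(length)  # 13
```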

Passing a NumPy array to a C function:

python

import ctypes
import numpy as np

# Assume mylib.so has been compiled with this function:
# void sum_array(double* data, int n, double* result)
lib = ctypes.CDLL("./mylib.so")
lib.sum_array.argtypes = [
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_double),
]
lib.sum_array.restype = None

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float64)
result = ctypes.c_double(0.0)

lib.sum_array(
    arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
    len(arr),
    ctypes.byref(result),
)
print(f"Sum = {result.value}")  # 15.0

Struct mapping:

python

import ctypes

class Point(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_double),
        ("y", ctypes.c_double),
    ]

p = Point(3.0, 4.0)
# lib.process_point(ctypes.byref(p))
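Without a C library at hand, the struct mapping can still be inspected from Python. The sketch below redefines the same Point structure and checks its size and memory layout with the standard struct module; the variable names are illustrative.

```python
import ctypes
import struct


class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]


p = Point(3.0, 4.0)

size = ctypes.sizeof(Point)          # two 8-byte doubles
raw = bytes(p)                       # copy of the struct's memory
unpacked = struct.unpack("dd", raw)  # native layout, as a C compiler would see it

# pointer() creates a real pointer object; byref() is a cheaper variant that
# can only be used as a function argument.
ptr = ctypes.pointer(p)
ptr.contents.x = 10.0                # writes through to p

print(size, unpacked, p.x)  # 16 (3.0, 4.0) 10.0
```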

cffi: A modern Foreign Function Interface

cffi's syntax is closer to C than ctypes, its error messages are better, and it has excellent PyPy support.

python

from cffi import FFI

ffi = FFI()

# Declarations look like a C header
ffi.cdef("""
    typedef struct {
        double x;
        double y;
    } Point;

    double distance(Point* p1, Point* p2);
""")

# ABI mode: load an existing library
# lib = ffi.dlopen("./mylib.so")

# API mode: compile inline C code
ffi.set_source("_geometry", """
    #include <math.h>

    typedef struct { double x; double y; } Point;

    double distance(Point* p1, Point* p2) {
        double dx = p2->x - p1->x;
        double dy = p2->y - p1->y;
        return sqrt(dx*dx + dy*dy);
    }
""")

if __name__ == "__main__":
    ffi.compile(verbose=True)

python

# Usage after compiling
from _geometry import ffi, lib

p1 = ffi.new("Point*", {"x": 0.0, "y": 0.0})
p2 = ffi.new("Point*", {"x": 3.0, "y": 4.0})

dist = lib.distance(p1, p2)
print(f"Distance = {dist}")  # 5.0

Cython: Write Python, run at C speed

Cython compiles Python code (with type annotations) to C and then to a shared library. It is the most popular choice when the goal is to speed up Python loops.

python

# fast_compute.pyx
import cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)  # Disable bounds checking → faster
@cython.wraparound(False)   # Disable negative indexing → faster
def compute_sum_squares(double[:] data) -> double:
    """Sum of squares using a typed memoryview."""
    cdef Py_ssize_t i
    cdef Py_ssize_t n = data.shape[0]
    cdef double total = 0.0

    for i in range(n):
        total += data[i] * data[i]

    return total

python

# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("fast_compute.pyx"),
    include_dirs=[np.get_include()],
)
# Build: python setup.py build_ext --inplace

Releasing the GIL for multithreading:

python

# parallel_compute.pyx
from cython.parallel import prange
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum_squares(double[:] data) -> double:
    cdef Py_ssize_t i
    cdef Py_ssize_t n = data.shape[0]
    cdef double total = 0.0

    with nogil:  # Release the GIL
        for i in prange(n, schedule="static"):
            total += data[i] * data[i]

    return total
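# prange needs OpenMP at build time (e.g. -fopenmp with GCC); without it the loop runs serially.
# Rough estimate with 8 cores: ~6-7x faster than single-threaded Cython for
# compute-bound loops; memory-bound loops like this one may scale less.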

cdef vs cpdef vs def:

python

# functions.pyx

cdef double c_only(double x):
    """Callable only from Cython/C. Fastest."""
    return x * x

cpdef double hybrid(double x):
    """Callable from Python AND C. Nearly as fast as cdef."""
    return x * x

def python_callable(x):
    """An ordinary Python function. Slowest."""
    return x * x

pybind11: C++ ↔ Python bindings

pybind11 is the best choice when you already have a C++ codebase or want to take advantage of C++ features (templates, RAII, STL).

cpp

// compute.cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cmath>

namespace py = pybind11;

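// c_style guarantees a contiguous buffer, so walking the raw pointer is valid;
// forcecast converts other dtypes to double.
double sum_squares(
    py::array_t<double, py::array::c_style | py::array::forcecast> input) {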
    auto buf = input.request();
    auto* ptr = static_cast<double*>(buf.ptr);
    size_t n = buf.size;

    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += ptr[i] * ptr[i];
    }
    return total;
}

PYBIND11_MODULE(compute, m) {
    m.doc() = "Compute module";
    m.def("sum_squares", &sum_squares,
          "Compute the sum of squares",
          py::arg("input"));
}

python

# Usage
import numpy as np
import compute  # The C++ module that was just built

data = np.random.random(1_000_000)
result = compute.sum_squares(data)

Python C API: Full control

The Python C API gives you complete control: creating types, managing reference counts, and interacting directly with the interpreter. It is the most complex option, but also the most powerful.

c

// fastmod.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject* fastmod_sum_squares(PyObject* self, PyObject* args) {
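    PyObject* list_obj;
    /* "O!" makes PyArg_ParseTuple raise TypeError unless the argument is a list,
       so PyList_Size and PyList_GetItem below are safe to call. */
    if (!PyArg_ParseTuple(args, "O!", &PyList_Type, &list_obj))
        return NULL;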

    Py_ssize_t n = PyList_Size(list_obj);
    double total = 0.0;

    for (Py_ssize_t i = 0; i < n; ++i) {
        PyObject* item = PyList_GetItem(list_obj, i);
        double val = PyFloat_AsDouble(item);
        if (val == -1.0 && PyErr_Occurred())
            return NULL;
        total += val * val;
    }
    return PyFloat_FromDouble(total);
}

static PyMethodDef methods[] = {
    {"sum_squares", fastmod_sum_squares, METH_VARARGS,
     "Compute the sum of squares of the elements"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "fastmod", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_fastmod(void) {
    return PyModule_Create(&module);
}

In practice

Scenario: Optimizing the hot path in a scoring engine

Context: A scoring engine computes risk scores for 500,000 transactions per second. The function calculate_risk_score is called 500K times per second, and each call computes a Euclidean distance plus a sigmoid over a 20-dimensional feature vector. The profiler shows that this function accounts for 78% of CPU time.

Goal: Cut the time needed to compute a score by at least 10x without changing the logic.
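The numbers in the scenario translate into a concrete time budget. The sketch below works through that arithmetic under two assumptions that are not in the scenario itself: that a single core has to sustain the load, and that the rest of the program is unaffected by the optimization (Amdahl's law). The constant names are illustrative.

```python
CALLS_PER_SECOND = 500_000  # from the scenario
CPU_SHARE = 0.78            # share of CPU time the profiler attributes to the function
TARGET_SPEEDUP = 10         # stated goal for the hot function

# Time budget per call if one core has to keep up with the load.
budget_us = 1e6 / CALLS_PER_SECOND

# Amdahl's law: only the 78% spent in the hot function gets faster.
overall_speedup = 1 / ((1 - CPU_SHARE) + CPU_SHARE / TARGET_SPEEDUP)

print(f"Budget per call: {budget_us:.1f} us")
print(f"Overall speedup with a 10x faster hot path: {overall_speedup:.2f}x")
```

Under these assumptions a 10x faster hot path makes the whole engine about 3.4x faster, because the remaining 22% of the work is untouched.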

python

# Pure Python version (baseline)
import math
from typing import Sequence

def calculate_risk_score_python(
    features: Sequence[float],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> float:
    """Pure Python: too slow at 500K calls/s."""
    distance_sq = 0.0
    for f, w in zip(features, weights):
        distance_sq += (f * w) ** 2
    distance = math.sqrt(distance_sq)
    score = 1.0 / (1.0 + math.exp(-distance + threshold))
    return score

python

# Cython version (production)
# risk_engine.pyx
import cython
from libc.math cimport sqrt, exp

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef double calculate_risk_score(
    double[:] features,
    double[:] weights,
    double threshold = 0.5,
):
    """Cython: typed memoryviews, so the inner loop runs without interpreter overhead."""
    cdef Py_ssize_t i
    cdef Py_ssize_t n = features.shape[0]
    cdef double distance_sq = 0.0
    cdef double fw

    for i in range(n):
        fw = features[i] * weights[i]
        distance_sq += fw * fw

    cdef double distance = sqrt(distance_sq)
    cdef double score = 1.0 / (1.0 + exp(-distance + threshold))
    return score

python

# Benchmark comparison
import numpy as np
import timeit

features = np.random.random(20)
weights = np.random.random(20)

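# Pure Python (convert to lists once, so the conversion is not part of the timing)
features_list = features.tolist()
weights_list = weights.tolist()
t_py = timeit.timeit(
    lambda: calculate_risk_score_python(features_list, weights_list),
    number=500_000,
)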

# Cython (after building)
# from risk_engine import calculate_risk_score
# t_cy = timeit.timeit(
#     lambda: calculate_risk_score(features, weights),
#     number=500_000,
# )

# NumPy vectorized (one batch of 500K)
features_batch = np.random.random((500_000, 20))
weights_batch = np.tile(weights, (500_000, 1))

def numpy_batch(features_batch, weights_batch, threshold=0.5):
    fw = features_batch * weights_batch
    dist = np.sqrt(np.sum(fw ** 2, axis=1))
    return 1.0 / (1.0 + np.exp(-dist + threshold))

t_np = timeit.timeit(lambda: numpy_batch(features_batch, weights_batch), number=1)

print(f"Python (500K calls):  {t_py:.2f}s")
# print(f"Cython (500K calls):  {t_cy:.2f}s")
print(f"NumPy  (500K batch):  {t_np:.3f}s")
# Rough estimates: Python ~8s, Cython ~0.3s (26x), NumPy batch ~0.05s (160x).
# Measure on your own hardware before relying on these numbers.

Analysis:

Cython single call: about 20-30x faster than Python because the interpreter overhead is removed
NumPy batch: fastest for batch processing, but requires loading the whole dataset into RAM
Choose Cython when: processing a stream (one transaction at a time) or integrating with an existing Python codebase
Choose NumPy when: processing batches and the data is already in array form
Trade-off: Cython needs a compiler and a build step, and is harder to debug than pure Python
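Since the goal is a speedup without changing the logic, it may help to check any optimized build against the baseline on random inputs before benchmarking it. The harness below is a sketch only: candidate_score is a hypothetical stand-in written in plain Python, and in practice the compiled function would be passed to check_equivalence instead.

```python
import math
import random
from typing import Callable, Sequence

ScoreFn = Callable[[Sequence[float], Sequence[float], float], float]


def baseline_score(features, weights, threshold=0.5):
    """Same formula as the pure Python baseline in this section."""
    distance_sq = 0.0
    for f, w in zip(features, weights):
        distance_sq += (f * w) ** 2
    distance = math.sqrt(distance_sq)
    return 1.0 / (1.0 + math.exp(-distance + threshold))


def candidate_score(features, weights, threshold=0.5):
    """Hypothetical stand-in for the optimized build (e.g. the Cython module)."""
    distance = math.sqrt(math.fsum((f * w) ** 2 for f, w in zip(features, weights)))
    return 1.0 / (1.0 + math.exp(threshold - distance))


def check_equivalence(candidate: ScoreFn, trials: int = 1_000, dim: int = 20) -> int:
    """Compare candidate against the baseline on seeded random vectors."""
    rng = random.Random(42)
    for _ in range(trials):
        features = [rng.random() for _ in range(dim)]
        weights = [rng.random() for _ in range(dim)]
        expected = baseline_score(features, weights, 0.5)
        actual = candidate(features, weights, 0.5)
        assert math.isclose(actual, expected, rel_tol=1e-9), (actual, expected)
    return trials


print(check_equivalence(candidate_score), "random cases agree")
```

A relative tolerance is used rather than exact equality because a compiled version may sum the terms in a different order.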

Common mistakes

❌ Mistake 1: Not declaring argtypes for ctypes

Problem: Calling a C function through ctypes without declaring its types leads to silently wrong values, undefined behavior, or a segfault.

python

import ctypes

lib = ctypes.CDLL("./mylib.so")

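# WRONG: neither argtypes nor restype is declared
result = lib.compute(3)  # Passed as a C int, but compute() expects a double
# Result: garbage value or a crash; the return value is also read as a C int

Why it is wrong: Without argtypes, ctypes passes a Python int as a C int and assumes the function returns an int. If the C function expects or returns a double, the bits are reinterpreted incorrectly, and no compiler checks the call. A Python float such as 3.14 is rejected with ctypes.ArgumentError when argtypes is missing, so integer arguments are the ones that fail silently.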

python

import ctypes

lib = ctypes.CDLL("./mylib.so")

# CORRECT: always declare argtypes AND restype
lib.compute.argtypes = [ctypes.c_double]
lib.compute.restype = ctypes.c_double

result = lib.compute(3.14)  # Correct type, safe
# lib.compute("hello")  # ctypes.ArgumentError: the mistake is caught early!
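The same behavior can be observed with functions from the C runtime, without building a custom library. This is a sketch under the assumption that the platform's libc can be located as shown; abs and atof are standard C functions, and the variable names are illustrative.

```python
import ctypes
import ctypes.util
import sys

libc = ctypes.CDLL("msvcrt" if sys.platform == "win32"
                   else ctypes.util.find_library("c"))

# 1) Without argtypes, ctypes only knows how to pass int, bytes, str and None.
#    A Python float is rejected before the C function is ever called.
try:
    libc.abs(3.14)
    float_rejected = False
except ctypes.ArgumentError:
    float_rejected = True

# 2) With a full signature, arguments are checked and the double return
#    value is decoded correctly. C prototype: double atof(const char *s)
libc.atof.argtypes = [ctypes.c_char_p]
libc.atof.restype = ctypes.c_double
parsed = libc.atof(b"2.5")

# 3) A wrong argument type now fails loudly instead of corrupting memory.
try:
    libc.atof(42)
    int_rejected = False
except ctypes.ArgumentError:
    int_rejected = True

print(float_rejected, parsed, int_rejected)  # True 2.5 True
```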

❌ Mistake 2: NumPy array freed while C is still using it

Problem: Taking a pointer from a NumPy array and then letting the array be freed leaves the pointer referring to released memory.

python

import ctypes
import numpy as np

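# WRONG: the array is freed as soon as the function returns
def get_address():
    arr = np.array([1.0, 2.0, 3.0])  # Local variable
    return arr.ctypes.data  # Raw integer address, holds no reference to arr

addr = get_address()  # arr has already been freed!
# addr now refers to released memory → use-after-free

Why it is wrong: arr.ctypes.data is a plain integer address and does not hold a reference to the array. When get_address() returns, arr goes out of scope, its reference count drops to zero, and the buffer is released, so the address dangles. Recent NumPy versions document that the object returned by arr.ctypes.data_as() keeps a reference to the array, but an address stored on the C side, or one taken from a temporary such as (a + b).ctypes.data, gets no such protection.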

python

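import ctypes
import numpy as np

# Hypothetical library; assumed C signatures:
#   void process(double* data, int n);  double get_result(void);
lib = ctypes.CDLL("./mylib.so")
lib.process.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.process.restype = None
lib.get_result.restype = ctypes.c_double

# CORRECT: keep the array alive for as long as the pointer is in use
def process_with_c(arr: np.ndarray) -> float:
    """arr is kept alive by the caller."""
    ptr = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    lib.process(ptr, len(arr))
    return lib.get_result()

data = np.array([1.0, 2.0, 3.0])  # Lives in the outer scope
result = process_with_c(data)       # data is still alive while C uses it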
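When a pointer does have to outlive the function that created it, one option is to bundle it with the object that owns the memory. The sketch below shows that idea with the standard library only, using array.array in place of a NumPy array; the class name OwnedDoublePointer is hypothetical.

```python
import array
import ctypes


class OwnedDoublePointer:
    """Bundle a raw pointer with the object that owns the memory."""

    def __init__(self, owner: array.array) -> None:
        if owner.typecode != "d":
            raise TypeError("expected an array of C doubles (typecode 'd')")
        self._owner = owner  # this reference keeps the buffer alive
        address, self.length = owner.buffer_info()
        self.ptr = ctypes.cast(address, ctypes.POINTER(ctypes.c_double))


def make_pointer() -> OwnedDoublePointer:
    data = array.array("d", [1.0, 2.0, 3.0])  # local variable
    return OwnedDoublePointer(data)           # the wrapper holds a reference


handle = make_pointer()
values = [handle.ptr[i] for i in range(handle.length)]
print(values)  # [1.0, 2.0, 3.0]
```

The owner must not be resized while the pointer is in use, since growing an array can move its buffer.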

❌ Mistake 3: Cython not releasing the GIL for a tight loop

Problem: The Cython code still holds the GIL, so other threads are blocked and multiple cores go unused.

python

# WRONG: the GIL is still held, so performance is single-threaded
# compute.pyx
def slow_cython(double[:] data):
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(data.shape[0]):
        total += data[i] * data[i]
    # GIL held → other Python threads have to wait
    return total

Why it is wrong: Although the Cython code runs faster than Python, the GIL is still held. In a multithreaded web server, one request running a tight loop blocks every other request.

python

# CORRECT: release the GIL for pure C operations
# compute.pyx
from cython.parallel import prange
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def fast_cython(double[:] data) -> double:
    cdef Py_ssize_t i
    cdef Py_ssize_t n = data.shape[0]
    cdef double total = 0.0

    with nogil:
        for i in range(n):  # Or prange(n) for parallel execution
            total += data[i] * data[i]

    return total
# Note: inside a nogil block you MUST NOT call the Python API

❌ Mistake 4: Memory leak from C allocations

Problem: A C function allocates memory that Python cannot free automatically.

python

import ctypes

lib = ctypes.CDLL("./mylib.so")
lib.create_buffer.restype = ctypes.c_void_p

# WRONG: allocate and then forget to free
ptr = lib.create_buffer(1024)
# ... use ptr ...
# ptr is reassigned or goes out of scope → memory leak!

Why it is wrong: Python's garbage collector manages Python objects, not the C heap. Memory allocated with malloc() in C must be released explicitly with free().

python

import ctypes
from contextlib import contextmanager
from typing import Iterator

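# Hypothetical library; the signatures below are assumed for illustration
lib = ctypes.CDLL("./mylib.so")
lib.create_buffer.argtypes = [ctypes.c_size_t]
lib.create_buffer.restype = ctypes.c_void_p
lib.free_buffer.argtypes = [ctypes.c_void_p]
lib.free_buffer.restype = None
# Without argtypes, the 64-bit address would be passed as a truncated C int
lib.process_data.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

# CORRECT: a context manager guarantees cleanup
@contextmanager
def c_buffer(size: int) -> Iterator[int]:
    ptr = lib.create_buffer(size)  # A c_void_p restype arrives as an int (or None)
    if not ptr:
        raise MemoryError(f"Failed to allocate {size} bytes")
    try:
        yield ptr
    finally:
        lib.free_buffer(ptr)

with c_buffer(1024) as buf:
    lib.process_data(buf, 1024)
# buf is freed automatically on leaving the with block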
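The same pattern can be tried against the C runtime's own malloc and free, which avoids the need for a custom library. This is a minimal sketch, assuming libc can be located as shown; c_malloc and freed_addresses are hypothetical names, and the list exists only to make the cleanup observable.

```python
import ctypes
import ctypes.util
import sys
from contextlib import contextmanager
from typing import Iterator

libc = ctypes.CDLL("msvcrt" if sys.platform == "win32"
                   else ctypes.util.find_library("c"))

libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p

freed_addresses = []  # bookkeeping for this demo only


@contextmanager
def c_malloc(size: int) -> Iterator[int]:
    """Yield the address of a malloc'ed block and always free it."""
    address = libc.malloc(size)  # a c_void_p restype arrives as int or None
    if not address:
        raise MemoryError(f"malloc({size}) failed")
    try:
        yield address
    finally:
        libc.free(address)
        freed_addresses.append(address)


with c_malloc(16) as buf:
    libc.memset(buf, ord("A"), 16)
    snapshot = ctypes.string_at(buf, 16)  # copy out before the block is freed

print(snapshot)
```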

Under the Hood

Why are C extensions faster?

1. Bypassing the interpreter loop: CPython's bytecode execution loop (ceval.c) processes roughly one opcode per 15 ns. A tight loop of 1M iterations therefore pays at least 15 ms for dispatch alone, before any actual computation.

2. Native type operations: A Python int takes about 28 bytes, a C int takes 4 bytes. Adding two Python ints means checking types, handling overflow, and creating a new object, which costs on the order of tens to 100 ns. Adding two C ints is a single CPU instruction, about 0.3 ns.

3. GIL release: C code can release the GIL and run in parallel with other Python code, which makes use of multiple cores.
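Point 3 can be observed with a C extension that ships with Python: hashlib is documented to release the GIL while hashing larger inputs, so several threads can hash at the same time. The sketch below compares a serial run with a threaded run; the block sizes and names are arbitrary, and any speedup depends on the number of available cores.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def digest(block: bytes) -> str:
    # hashlib's C implementation releases the GIL while hashing large inputs.
    return hashlib.sha256(block).hexdigest()


blocks = [bytes([i]) * 8_000_000 for i in range(4)]

start = time.perf_counter()
serial = [digest(b) for b in blocks]
t_serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, blocks))
t_threaded = time.perf_counter() - start

print(f"serial:   {t_serial:.3f}s")
print(f"threaded: {t_threaded:.3f}s")
```

A pure Python loop in place of digest() would show no such gain from threads, because each thread would hold the GIL while it runs.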

Performance comparison of the five approaches

Approach	Setup complexity	Build step	Rough speedup vs. pure Python	GIL release	Best for
ctypes	Low	No (loads an existing .so)	10-50x	Automatic	Calling an existing C library
cffi	Low to medium	Yes (API mode)	10-50x	Automatic	Calling C libraries, PyPy support
Cython	Medium	Yes	20-100x	`with nogil`	Custom loops, NumPy integration
pybind11	Medium to high	Yes (CMake or setuptools)	50-200x	Manual	C++ codebases
Python C API	High	Yes	50-200x	Manual	Maximum control, custom types

Function call overhead

python

import timeit

# A pure Python function call
def py_noop():
    pass

# Time statements directly, so that no extra lambda call is included
# in the measurement of the built-in (C implementation)
N = 10_000_000
t_py = timeit.timeit("py_noop()", globals=globals(), number=N)
t_builtin = timeit.timeit("len(x)", setup="x = []", number=N)

print(f"Python func call:  {t_py / N * 1e9:.0f}ns/call")
print(f"Built-in call:     {t_builtin / N * 1e9:.0f}ns/call")
# Rough estimates: Python ~80ns, built-in ~40ns
# C extension func call: ~20-30ns

When NOT to write a C extension

Situation	Why C is not needed
I/O-bound code	The bottleneck is the network or disk; C does not help
Already vectorized with NumPy	NumPy already calls C internally
Complex logic, few iterations	The cost of writing and maintaining C exceeds the benefit
Rarely executed code	10 ms vs 1 ms once per startup is not worth it
Team does not know C	A bug in a C extension means segfaults and memory corruption

Checklist

✅ Implementation checklist

Deciding to use a C extension

[ ] Profiled and confirmed that the bottleneck is a CPU-bound tight loop
[ ] Tried NumPy vectorization; still not fast enough
[ ] Tried optimizing the algorithm; complexity is already optimal
[ ] The team is able to maintain C/C++ code
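For the first item in this list, the standard library's cProfile is usually enough to confirm where the time goes. The sketch below profiles a made-up workload; hot_loop and handle_request are hypothetical names standing in for real application code.

```python
import cProfile
import io
import pstats


def hot_loop(n: int) -> float:
    """CPU-bound tight loop, the kind of code worth moving to C."""
    total = 0.0
    for i in range(n):
        total += (i * 0.5) ** 2
    return total


def handle_request() -> float:
    """Hypothetical entry point that hides one expensive call."""
    return hot_loop(200_000)


profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
report = stream.getvalue()
print(report)
```

If the report shows the time spread across many functions or dominated by I/O waits, a C extension is unlikely to help.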

ctypes / cffi

[ ] Always declare argtypes and restype for every function
[ ] Keep NumPy arrays alive for as long as C uses the pointer
[ ] Use ctypes.util.find_library() to locate libraries portably
[ ] Clean up all memory allocated by C (use a context manager)

Cython

[ ] Use @cython.boundscheck(False) and @cython.wraparound(False) for tight loops
[ ] Use typed memoryviews (double[:]) instead of np.ndarray[...]
[ ] Release the GIL with with nogil: for pure C operations
[ ] Run cython -a file.pyx to check: yellow lines mark Python interaction (slow)

pybind11 / C API

[ ] Manage reference counts carefully (Py_INCREF/Py_DECREF)
[ ] Check for NULL return values (they signal a Python exception)
[ ] Build with -Wall -Werror to catch C errors early

Practice exercises

Exercise 1: Choosing the right tool (Foundation)

Task: For each situation, choose the most suitable approach.

🧠 Quiz

Question: The team needs to call functions from the OpenSSL library (.so), which is already installed, and does not want to compile anything extra. Which option fits?

[x] A. ctypes: loads the .so directly, no build needed
[ ] B. Cython: requires compiling .pyx → .c → .so
[ ] C. pybind11: requires writing a C++ wrapper plus CMake
[ ] D. Python C API: requires writing a C module from scratch

Explanation: ctypes loads an existing shared library without a compiler or a build step. cffi would also work (ABI mode), but ctypes is built in, so no extra package has to be installed.

Exercise 2: Writing a Cython function (Intermediate)

Task: Write a Cython function that computes the dot product of two float64 vectors, with boundscheck(False), wraparound(False), and the GIL released.

💡 Hints

Use typed memoryviews: double[:] a
cdef double total = 0.0 for the C variable
with nogil: around the computation loop

✅ Solution

python

# dot_product.pyx
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double dot_product(double[:] a, double[:] b):
    """Dot product with typed memoryviews, GIL released."""
    if a.shape[0] != b.shape[0]:
        raise ValueError("Vectors must have same length")

    cdef Py_ssize_t i
    cdef Py_ssize_t n = a.shape[0]
    cdef double total = 0.0

    with nogil:
        for i in range(n):
            total += a[i] * b[i]

    return total

# Test (after building):
# import numpy as np
# a = np.random.random(1_000_000)
# b = np.random.random(1_000_000)
# result = dot_product(a, b)
# assert abs(result - np.dot(a, b)) < 1e-6

Analysis: cpdef makes the function callable from both Python and C. with nogil releases the GIL, but inside that block only C operations are allowed, not Python API calls. boundscheck(False) removes the index checks, so the logic has to be correct: an out-of-bounds access can segfault.

Exercise 3: A context manager for C memory (Advanced)

Task: Write a CBuffer class using ctypes that allocates a buffer from a C library and frees it automatically when it goes out of scope. Support the context manager protocol AND a __del__ fallback.

💡 Hints

__enter__ returns self or the pointer
__exit__ calls the C free function
__del__ acts as a safety net if with is forgotten

✅ Solution

python

import ctypes
from typing import Optional

class CBuffer:
    """RAII-style wrapper for C-allocated memory."""

    __slots__ = ("_ptr", "_size", "_lib", "_freed")

    def __init__(self, lib: ctypes.CDLL, size: int) -> None:
        # Set every slot first so that __del__ is safe even if allocation fails
        self._ptr: Optional[int] = None
        self._lib = lib
        self._size = size
        self._freed = False

        lib.allocate_buffer.argtypes = [ctypes.c_size_t]
        lib.allocate_buffer.restype = ctypes.c_void_p
        lib.free_buffer.argtypes = [ctypes.c_void_p]
        lib.free_buffer.restype = None

        self._ptr = lib.allocate_buffer(size)
        if not self._ptr:
            raise MemoryError(f"Cannot allocate {size} bytes")

    @property
    def ptr(self) -> int:
        """Address of the buffer (a c_void_p restype is returned as an int)."""
        if self._freed:
            raise RuntimeError("Buffer already freed")
        return self._ptr

    @property
    def size(self) -> int:
        return self._size

    def __enter__(self) -> "CBuffer":
        return self

    def __exit__(self, *exc) -> None:
        self._free()

    def _free(self) -> None:
        if not self._freed and self._ptr:
            self._lib.free_buffer(self._ptr)
            self._freed = True

    def __del__(self) -> None:
        self._free()  # Safety net

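# Usage:
# lib = ctypes.CDLL("./mylib.so")
# lib.process.argtypes = [ctypes.c_void_p, ctypes.c_size_t]  # avoid pointer truncation
# with CBuffer(lib, 4096) as buf:
#     lib.process(buf.ptr, buf.size)
# # Freed automatically on leaving the with block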

Analysis: This is the RAII (Resource Acquisition Is Initialization) pattern from C++, adapted to Python. __del__ is a fallback, but its timing is not guaranteed (for example when the object is part of a reference cycle handled by CPython's cycle collector), so always prefer with. __slots__ saves memory for the wrapper object. The _freed flag prevents a double free.
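Because the timing of __del__ is not guaranteed, one alternative worth knowing is weakref.finalize from the standard library, which registers a cleanup callback that runs at most once. The sketch below is a variation on the exercise, not the reference solution: it uses the C runtime's malloc and free instead of a custom library, and the class name FinalizedBuffer is hypothetical.

```python
import ctypes
import ctypes.util
import sys
import weakref

libc = ctypes.CDLL("msvcrt" if sys.platform == "win32"
                   else ctypes.util.find_library("c"))
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


class FinalizedBuffer:
    """Buffer wrapper whose cleanup is registered with weakref.finalize."""

    def __init__(self, size: int) -> None:
        address = libc.malloc(size)
        if not address:
            raise MemoryError(f"Cannot allocate {size} bytes")
        self.address = address
        self.size = size
        # The callback must not reference self, or the object could never die.
        self._finalizer = weakref.finalize(self, libc.free, address)

    def close(self) -> None:
        self._finalizer()  # runs libc.free at most once

    @property
    def closed(self) -> bool:
        return not self._finalizer.alive

    def __enter__(self) -> "FinalizedBuffer":
        return self

    def __exit__(self, *exc) -> None:
        self.close()


with FinalizedBuffer(64) as buf:
    open_inside = not buf.closed
buf.close()  # a second call is a harmless no-op

print(open_inside, buf.closed)  # True True
```

The finalizer object tracks whether it has already run, so no separate flag is needed to prevent a double free.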

Further reading

Glossary keywords: C extension, ctypes, cffi, Cython, pybind11, Python C API, GIL release, shared library, typed memoryview, nogil

