
Generators & Iterators: Performance

Generators = processing gigantic data with a tiny memory footprint

Learning Outcomes

After completing this page, you will:

  • ✅ Understand in depth how generators and iterators work
  • ✅ Use yield from to delegate to sub-generators
  • ✅ Write async generators and consume them with async for
  • ✅ Understand generator-based coroutines (a legacy pattern)
  • ✅ Build efficient data pipelines

The Problem: a 10GB File, 100MB of RAM

You have a 10GB log file but only 100MB of RAM. How do you process it?

python
# ❌ WRONG: load the whole file into RAM
with open("file_10gb.log") as f:
    lines = f.readlines()  # 💀 MemoryError!
    for line in lines:
        xu_ly(line)

# ✅ RIGHT: use a generator (read line by line)
with open("file_10gb.log") as f:
    for line in f:  # the file object is an iterator!
        xu_ly(line)  # only one line in RAM at a time

What Is a Generator?

A generator is a function that returns an iterator: instead of returning all values at once, it yields them one at a time, on demand.

Comparison: List vs Generator

python
# List: creates ALL values immediately
def lay_binh_phuong_list(n: int) -> list[int]:
    ket_qua = []
    for i in range(n):
        ket_qua.append(i ** 2)
    return ket_qua

# Generator: creates values ONE AT A TIME, on demand
def lay_binh_phuong_gen(n: int):
    for i in range(n):
        yield i ** 2  # "yield" instead of "return"

# Compare memory usage
import sys

list_1m = lay_binh_phuong_list(1_000_000)
gen_1m = lay_binh_phuong_gen(1_000_000)

print(sys.getsizeof(list_1m))  # ~8,000,000 bytes (8MB)
print(sys.getsizeof(gen_1m))   # ~112 bytes (!!)

Keyword yield

yield is the keyword that turns a function into a generator.

python
def dem_den_3():
    print("Start")
    yield 1
    print("Continuing")
    yield 2
    print("Almost done")
    yield 3
    print("Finished")

gen = dem_den_3()

print(next(gen))  # "Start" → 1
print(next(gen))  # "Continuing" → 2
print(next(gen))  # "Almost done" → 3
print(next(gen))  # "Finished" → StopIteration!

How It Works

  1. Calling a generator function does not run its body; it just returns a generator object
  2. Each call to next() runs the body until the next yield, then pauses
  3. State (local variables) is preserved between calls
  4. When there are no more yields, StopIteration is raised
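The four steps above can be observed directly with the stdlib inspect module; a minimal sketch (the counter function here is illustrative):

```python
import inspect

def counter():
    n = 0                 # local state, preserved between next() calls
    while n < 2:
        yield n
        n += 1

gen = counter()                           # step 1: body has not run yet
print(inspect.getgeneratorstate(gen))     # GEN_CREATED
print(next(gen))                          # step 2: runs to the first yield → 0
print(inspect.getgeneratorstate(gen))     # GEN_SUSPENDED (step 3: n is kept alive)
print(next(gen))                          # 1
try:
    next(gen)                             # step 4: no yields left
except StopIteration:
    print("exhausted")
```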

Generator Expression

Like a list comprehension, but with () instead of [].

python
# List comprehension: builds the list immediately
list_comp = [x**2 for x in range(1000000)]
# Occupies ~8MB of RAM right away

# Generator expression: creates a generator
gen_exp = (x**2 for x in range(1000000))
# ~100 bytes; values are computed on demand

# As a sole function argument
sum(x**2 for x in range(1000000))  # no extra parentheses needed

Iterator Protocol

Iterable vs Iterator

Iterable

Any object whose __iter__() method returns an iterator.

python
# Common iterables:
my_list = [1, 2, 3]        # list is iterable
my_tuple = (1, 2, 3)       # tuple is iterable
my_str = "abc"             # str is iterable
my_dict = {"a": 1}         # dict is iterable

# Check
from collections.abc import Iterable
print(isinstance(my_list, Iterable))  # True

Iterator

An object with a __next__() method that returns the next value.

python
my_list = [1, 2, 3]

# Get an iterator from the iterable
iterator = iter(my_list)  # calls __iter__()

# Fetch values one at a time
print(next(iterator))  # 1  (calls __next__())
print(next(iterator))  # 2
print(next(iterator))  # 3
print(next(iterator))  # StopIteration!

Custom Iterator Class

python
from typing import Iterator

class CountDown:
    """Custom iterator đếm ngược."""
    
    def __init__(self, start: int):
        self.current = start
    
    def __iter__(self) -> Iterator[int]:
        return self  # an iterator returns itself
    
    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

# Usage
for num in CountDown(5):
    print(num)  # 5, 4, 3, 2, 1

Generators Are Simpler than Classes

python
# Instead of writing a verbose iterator class:
def countdown(start: int):
    while start > 0:
        yield start
        start -= 1

# Same result, much less code!
for num in countdown(5):
    print(num)  # 5, 4, 3, 2, 1

yield from - Delegation Pattern 🔗

Basics: Delegating to Sub-Generators

python
def gen_a():
    yield 1
    yield 2

def gen_b():
    yield 3
    yield 4

# ❌ Verbose: Manual iteration
def combined_manual():
    for item in gen_a():
        yield item
    for item in gen_b():
        yield item

# ✅ Clean: yield from
def combined():
    yield from gen_a()
    yield from gen_b()

list(combined())  # [1, 2, 3, 4]

Flatten Nested Structures

python
from typing import Any, Iterator

def flatten(nested: list) -> Iterator[Any]:
    """Flatten nested lists recursively."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)  # Recursive delegation
        else:
            yield item

nested = [1, [2, 3, [4, 5]], 6, [7, [8, 9]]]
print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

yield from với Return Value

yield from captures the return value of the sub-generator:

python
def sub_generator():
    yield 1
    yield 2
    return "done"  # Return value

def main_generator():
    result = yield from sub_generator()
    print(f"Sub-generator returned: {result}")
    yield 3

gen = main_generator()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # "Sub-generator returned: done" → 3

Bidirectional Communication

yield from automatically forwards .send() and .throw() to the sub-generator:

python
def accumulator():
    """Sub-generator nhận values qua send()."""
    total = 0
    while True:
        value = yield total
        if value is None:
            break
        total += value
    return total

def delegator():
    """Delegate to accumulator."""
    result = yield from accumulator()
    yield f"Final total: {result}"

gen = delegator()
print(next(gen))       # 0 (initial total)
print(gen.send(10))    # 10
print(gen.send(20))    # 30
print(gen.send(5))     # 35
print(gen.send(None))  # "Final total: 35"

Tree Traversal with yield from

python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class TreeNode:
    value: int
    left: "TreeNode | None" = None
    right: "TreeNode | None" = None
    
    def inorder(self) -> Iterator[int]:
        """In-order traversal using yield from."""
        if self.left:
            yield from self.left.inorder()
        yield self.value
        if self.right:
            yield from self.right.inorder()

# Build tree:     4
#               /   \
#              2     6
#             / \   / \
#            1   3 5   7

root = TreeNode(4,
    TreeNode(2, TreeNode(1), TreeNode(3)),
    TreeNode(6, TreeNode(5), TreeNode(7))
)

print(list(root.inorder()))  # [1, 2, 3, 4, 5, 6, 7]

Async Generators (Python 3.6+)

Basics: async def + yield

python
import asyncio

async def async_countdown(n: int):
    """Async generator - yield trong async function."""
    while n > 0:
        yield n
        await asyncio.sleep(0.5)  # Non-blocking delay
        n -= 1

async def main():
    async for num in async_countdown(5):
        print(num)

asyncio.run(main())
# 5 (wait 0.5s) 4 (wait 0.5s) 3 (wait 0.5s) 2 (wait 0.5s) 1

Async Generators for API Pagination

python
import asyncio
import aiohttp
from typing import AsyncIterator

async def fetch_all_users(base_url: str) -> AsyncIterator[dict]:
    """Async generator để paginate qua API."""
    async with aiohttp.ClientSession() as session:
        page = 1
        while True:
            url = f"{base_url}/users?page={page}&limit=100"
            async with session.get(url) as response:
                data = await response.json()
                users = data.get("users", [])
                
                if not users:
                    break
                
                for user in users:
                    yield user
                
                page += 1

async def main():
    async for user in fetch_all_users("https://api.example.com"):
        print(f"Processing: {user['name']}")

asyncio.run(main())

Async Generator Expression

python
import asyncio

async def get_data(n: int) -> int:
    await asyncio.sleep(0.1)
    return n * 2

async def main():
    # Async generator expression
    async_gen = (await get_data(i) for i in range(5))
    
    async for value in async_gen:
        print(value)  # 0, 2, 4, 6, 8

asyncio.run(main())

Async Comprehensions (Python 3.6+)

python
import asyncio

async def fetch_item(item_id: int) -> dict:
    await asyncio.sleep(0.1)
    return {"id": item_id, "data": f"Item {item_id}"}

async def main():
    # Async list comprehension
    items = [await fetch_item(i) for i in range(5)]
    print(items)
    
    # An async generator to consume in a comprehension
    async def gen_ids():
        for i in range(5):
            yield i
            await asyncio.sleep(0.05)
    
    # Collect from the async generator
    results = [item async for item in gen_ids()]
    print(results)  # [0, 1, 2, 3, 4]

asyncio.run(main())

Async Context Manager + Generator

python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator

@asynccontextmanager
async def managed_resource():
    """Async context manager using generator."""
    print("Acquiring resource...")
    await asyncio.sleep(0.1)
    resource = {"connection": "active"}
    try:
        yield resource
    finally:
        print("Releasing resource...")
        await asyncio.sleep(0.1)

async def main():
    async with managed_resource() as res:
        print(f"Using: {res}")

asyncio.run(main())
# Acquiring resource...
# Using: {'connection': 'active'}
# Releasing resource...

Generator-Based Coroutines (Legacy) 📜

⚠️ LEGACY PATTERN

Generator-based coroutines are the old pattern used before Python 3.5. Today you should use async/await. Understanding this pattern still helps you read legacy code and deepens your grasp of how coroutines work.

Coroutines with @types.coroutine

python
import types

@types.coroutine
def legacy_sleep(seconds: float):
    """Legacy coroutine - yield để suspend."""
    yield ("sleep", seconds)

async def modern_task():
    """Modern async function có thể await legacy coroutine."""
    print("Starting task...")
    await legacy_sleep(1.0)
    print("Task completed!")

Generators as Coroutines (Pre-3.5 Pattern)

python
def coroutine_example():
    """
    Generator-based coroutine pattern.
    Dùng send() để gửi data vào generator.
    """
    print("Coroutine started")
    total = 0
    while True:
        value = yield total  # receives the value from send()
        if value is None:
            break
        total += value
        print(f"Received: {value}, Total: {total}")
    return total

# Usage
coro = coroutine_example()
next(coro)           # Prime the coroutine (run to the first yield)
coro.send(10)        # Received: 10, Total: 10
coro.send(20)        # Received: 20, Total: 30
coro.send(5)         # Received: 5, Total: 35
try:
    coro.send(None)  # Terminate
except StopIteration as e:
    print(f"Final: {e.value}")  # Final: 35

A Decorator to Auto-Prime Coroutines

python
from functools import wraps
from typing import Generator, Callable

def coroutine(func: Callable[..., Generator]) -> Callable[..., Generator]:
    """Decorator để auto-prime generator coroutine."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # Prime the generator
        return gen
    return wrapper

@coroutine
def averager():
    """Coroutine tính running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = averager()  # already primed automatically
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0

Comparison: Legacy vs Modern

python
# ❌ Legacy (pre-Python 3.5 generator-based style)
@types.coroutine
def legacy_fetch(url):
    response = yield from aiohttp_request(url)  # aiohttp_request: placeholder awaitable
    return response

# ✅ Modern (Python 3.5+)
async def modern_fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.json()

Real-World Applications

1. Data Pipelines with Generators

python
from typing import Iterator, Callable
import re

def read_lines(filepath: str) -> Iterator[str]:
    """Stage 1: Read file lazily."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

def filter_lines(lines: Iterator[str], pattern: str) -> Iterator[str]:
    """Stage 2: Filter by regex."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

def parse_log(lines: Iterator[str]) -> Iterator[dict]:
    """Stage 3: Parse log format."""
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) >= 4:
            yield {
                "timestamp": parts[0],
                "level": parts[1],
                "source": parts[2],
                "message": parts[3]
            }

def pipeline(filepath: str, pattern: str) -> Iterator[dict]:
    """Compose pipeline stages."""
    lines = read_lines(filepath)
    filtered = filter_lines(lines, pattern)
    parsed = parse_log(filtered)
    return parsed

# Process a 10GB file with a small memory footprint
for log_entry in pipeline("huge.log", r"ERROR|CRITICAL"):
    print(log_entry)

2. Batching with Generators

python
from typing import Iterator, TypeVar
from itertools import islice

T = TypeVar('T')

def batch(iterable: Iterator[T], size: int) -> Iterator[list[T]]:
    """Yield batches of specified size."""
    iterator = iter(iterable)
    while True:
        batch_items = list(islice(iterator, size))
        if not batch_items:
            break
        yield batch_items

# Process 1 million records in batches of 1000
def process_records():
    for i in range(1_000_000):
        yield {"id": i, "data": f"record_{i}"}

for batch_records in batch(process_records(), 1000):
    # Bulk insert into the database (db is a placeholder handle)
    db.bulk_insert(batch_records)
    print(f"Inserted batch of {len(batch_records)}")

3. Infinite Sequences

python
from typing import Iterator

def fibonacci() -> Iterator[int]:
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes() -> Iterator[int]:
    """Infinite prime number generator."""
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        for i in range(2, int(n ** 0.5) + 1):
            if n % i == 0:
                return False
        return True
    
    n = 2
    while True:
        if is_prime(n):
            yield n
        n += 1

# First 10 Fibonacci numbers
from itertools import islice
print(list(islice(fibonacci(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# First 10 prime numbers
print(list(islice(primes(), 10)))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

4. Sliding Window

python
from typing import Iterator, TypeVar
from collections import deque
from itertools import islice

T = TypeVar('T')

def sliding_window(iterable: Iterator[T], size: int) -> Iterator[tuple[T, ...]]:
    """Generate sliding windows of specified size."""
    iterator = iter(iterable)
    window = deque(islice(iterator, size), maxlen=size)
    
    if len(window) == size:
        yield tuple(window)
    
    for item in iterator:
        window.append(item)
        yield tuple(window)

# Moving average
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for window in sliding_window(iter(data), 3):
    avg = sum(window) / len(window)
    print(f"Window: {window}, Avg: {avg:.2f}")

Production Pitfalls ⚠️

1. Generator Exhaustion

python
# ❌ BAD: a generator can be consumed only once!
gen = (x**2 for x in range(5))
print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] - already exhausted!

# ✅ GOOD: create a fresh generator each time (or materialize a list)
def get_squares():
    return (x**2 for x in range(5))

print(list(get_squares()))  # [0, 1, 4, 9, 16]
print(list(get_squares()))  # [0, 1, 4, 9, 16]
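When re-creating the generator is not an option (for example, it wraps a one-shot network stream), itertools.tee can split a single iterator into independent copies. It buffers items that one copy has consumed but another has not, so memory grows if the copies advance at very different rates:

```python
from itertools import tee

gen = (x**2 for x in range(5))
a, b = tee(gen, 2)     # two independent iterators over one source

print(list(a))         # [0, 1, 4, 9, 16]
print(list(b))         # [0, 1, 4, 9, 16]  (b still sees everything)
```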

2. Generators Cannot Be Indexed

python
# ❌ BAD: generators do not support indexing
gen = (x for x in range(10))
# gen[5]  # TypeError: 'generator' object is not subscriptable

# ✅ GOOD: convert to a list or use itertools
from itertools import islice

gen = (x for x in range(10))
fifth = next(islice(gen, 5, 6))  # element at index 5
print(fifth)  # 5

3. Generators in Exception Handling

python
# ❌ BAD: cleanup after the loop is skipped if the consumer stops early
def risky_generator():
    resource = acquire_resource()
    for item in process(resource):
        yield item
    release_resource(resource)  # never runs if the consumer breaks or raises!

# ✅ GOOD: use contextlib or a proper try-finally
from contextlib import contextmanager

@contextmanager
def managed_generator():
    resource = acquire_resource()
    try:
        yield process(resource)
    finally:
        release_resource(resource)  # always cleaned up

4. Memory Leaks with Large Generators

python
# ❌ BAD: holding references to generators (and whatever they capture) you no longer need
generators = []
for i in range(1000):
    gen = (x**2 for x in range(1000000))
    generators.append(gen)  # Memory leak!

# ✅ GOOD: process and discard
for i in range(1000):
    gen = (x**2 for x in range(1000000))
    result = sum(gen)  # Process immediately
    # gen can then be garbage collected

5. Async Generator Cleanup

python
import asyncio

# ❌ BAD: the async generator is never cleaned up
async def bad_usage():
    async def gen():
        try:
            while True:
                yield await fetch_data()
        finally:
            print("Cleanup!")  # Có thể không chạy!
    
    g = gen()
    async for item in g:
        if should_stop(item):
            break  # the generator is not closed properly!

# ✅ GOOD: Explicit cleanup
async def good_usage():
    async def gen():
        try:
            while True:
                yield await fetch_data()
        finally:
            print("Cleanup!")
    
    g = gen()
    try:
        async for item in g:
            if should_stop(item):
                break
    finally:
        await g.aclose()  # Explicit cleanup

Summary

python
# === GENERATOR FUNCTION ===
def my_generator():
    yield 1
    yield 2
    yield 3

# === GENERATOR EXPRESSION ===
gen = (x**2 for x in range(10))

# === YIELD FROM (Delegation) ===
def combined():
    yield from gen1()
    yield from gen2()

# === ASYNC GENERATOR ===
async def async_gen():
    for i in range(10):
        yield i
        await asyncio.sleep(0.1)

async for item in async_gen():
    print(item)

# === GENERATOR METHODS ===
gen = my_generator()
next(gen)           # get the next value
gen.send(value)     # send a value into the generator
gen.throw(exc)      # raise an exception inside the generator
gen.close()         # close the generator

# === BUILT-IN LAZY ITERATORS ===
range(10)           # Lazy range
map(func, iterable) # Lazy map
filter(func, iter)  # Lazy filter
zip(iter1, iter2)   # Lazy zip
enumerate(iter)     # Lazy enumerate

# === ITERTOOLS MODULE ===
from itertools import (
    islice,      # Slice iterator
    chain,       # chain iterables together
    cycle,       # repeat forever
    count,       # count forever
    groupby,     # group consecutive items
    takewhile,   # take while a condition holds
    dropwhile,   # drop while a condition holds
)
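A short demo of how a few of these compose lazily (all values computed on demand):

```python
from itertools import chain, count, islice, takewhile, groupby

# chain: concatenate iterables without copying
print(list(chain([1, 2], [3, 4])))                 # [1, 2, 3, 4]

# count + takewhile: a bounded view of an infinite counter
print(list(takewhile(lambda x: x < 5, count())))   # [0, 1, 2, 3, 4]

# islice: take the first n items, never materializing the rest
print(list(islice(count(10), 3)))                  # [10, 11, 12]

# groupby: groups *consecutive* equal keys (sort first for full grouping)
pairs = [("a", 1), ("a", 2), ("b", 3)]
for key, group in groupby(pairs, key=lambda t: t[0]):
    print(key, [v for _, v in group])              # a [1, 2], then b [3]
```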