Generators & Iterators: Performance
Generators = processing huge data with a tiny memory footprint
Learning Outcomes
After completing this page, you will:
- ✅ Understand in depth how generators and iterators work
- ✅ Use `yield from` to delegate to sub-generators
- ✅ Write async generators and consume them with `async for`
- ✅ Understand generator-based coroutines (a legacy pattern)
- ✅ Build efficient data pipelines
The Problem: a 10GB File, 100MB of RAM
You have a 10GB log file and only 100MB of RAM. How do you process it?
```python
# ❌ WRONG: load everything into RAM
with open("file_10gb.log") as f:
    lines = f.readlines()  # 💀 MemoryError!
    for line in lines:
        xu_ly(line)

# ✅ RIGHT: use a generator (read line by line)
with open("file_10gb.log") as f:
    for line in f:  # the file object is an iterator!
        xu_ly(line)  # only one line in RAM at a time
```

What is a Generator?
A generator is a function that returns an iterator: instead of returning all of its values at once, it yields them one at a time, as they are requested.
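To make this concrete: calling a generator function does not execute its body at all; it only creates a generator object, which is itself an iterator (a minimal sketch):

```python
from collections.abc import Iterator

def counter(n: int):
    # This body does not run until the generator is iterated
    for i in range(n):
        yield i

gen = counter(3)                   # nothing executes yet
print(type(gen).__name__)          # generator
print(isinstance(gen, Iterator))   # True: every generator is an iterator
print(iter(gen) is gen)            # True: iter() returns the generator itself
print(list(gen))                   # [0, 1, 2] -- only now does the body run
```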
Comparison: List vs Generator
```python
# List: builds ALL values immediately
def lay_binh_phuong_list(n: int) -> list[int]:
    ket_qua = []
    for i in range(n):
        ket_qua.append(i ** 2)
    return ket_qua

# Generator: produces EACH value on demand
def lay_binh_phuong_gen(n: int):
    for i in range(n):
        yield i ** 2  # "yield" instead of "return"

# Memory comparison
import sys

list_1m = lay_binh_phuong_list(1_000_000)
gen_1m = lay_binh_phuong_gen(1_000_000)
print(sys.getsizeof(list_1m))  # ~8,000,000 bytes (8MB)
print(sys.getsizeof(gen_1m))   # ~112 bytes (!!)
```

The yield Keyword
`yield` is the keyword that turns a function into a generator.
```python
def dem_den_3():
    print("Start")
    yield 1
    print("Continuing")
    yield 2
    print("Almost done")
    yield 3
    print("Done")

gen = dem_den_3()
print(next(gen))  # "Start" → 1
print(next(gen))  # "Continuing" → 2
print(next(gen))  # "Almost done" → 3
print(next(gen))  # "Done" → StopIteration!
```

How It Works
- Calling a generator function does not run it immediately; it just returns a generator object
- Each call to `next()` runs the function up to the next `yield`, then pauses
- State (local variables) is preserved between calls
- When there are no more `yield`s, `StopIteration` is raised
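This pause/resume lifecycle can be observed directly with `inspect.getgeneratorstate` (a small sketch):

```python
import inspect

def demo():
    yield 1
    yield 2

gen = demo()
print(inspect.getgeneratorstate(gen))  # GEN_CREATED: body has not started
next(gen)
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED: paused at the first yield
list(gen)  # consume the rest; StopIteration is raised internally
print(inspect.getgeneratorstate(gen))  # GEN_CLOSED: nothing left to yield
```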
Generator Expression
Like a list comprehension, but with `()` instead of `[]`.
```python
# List comprehension: builds the list immediately
list_comp = [x**2 for x in range(1_000_000)]
# Takes 8MB of RAM right away

# Generator expression: builds a generator
gen_exp = (x**2 for x in range(1_000_000))
# Takes ~100 bytes; values are computed on demand

# Inside a function call
sum(x**2 for x in range(1_000_000))  # no extra parentheses needed
```

Iterator Protocol
Iterable vs Iterator
Iterable
Any object with an `__iter__()` method that returns an iterator.
```python
# Common iterables:
my_list = [1, 2, 3]   # list is iterable
my_tuple = (1, 2, 3)  # tuple is iterable
my_str = "abc"        # str is iterable
my_dict = {"a": 1}    # dict is iterable

# Check
from collections.abc import Iterable
print(isinstance(my_list, Iterable))  # True
```

Iterator
An object with a `__next__()` method that returns the next value.
```python
my_list = [1, 2, 3]

# Get an iterator from the iterable
iterator = iter(my_list)  # calls __iter__()

# Pull values one by one
print(next(iterator))  # 1 (calls __next__())
print(next(iterator))  # 2
print(next(iterator))  # 3
print(next(iterator))  # StopIteration!
```

Custom Iterator Class
```python
from collections.abc import Iterator

class CountDown:
    """Custom iterator that counts down."""

    def __init__(self, start: int):
        self.current = start

    def __iter__(self) -> Iterator[int]:
        return self  # an iterator returns itself

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

# Usage
for num in CountDown(5):
    print(num)  # 5, 4, 3, 2, 1
```

A Generator Is Simpler than a Class
```python
# Instead of writing a full iterator class:
def countdown(start: int):
    while start > 0:
        yield start
        start -= 1

# Same result, much shorter code!
for num in countdown(5):
    print(num)  # 5, 4, 3, 2, 1
```

yield from - Delegation Pattern 🔗
Basics: Chaining Generators
```python
def gen_a():
    yield 1
    yield 2

def gen_b():
    yield 3
    yield 4

# ❌ Verbose: manual iteration
def combined_manual():
    for item in gen_a():
        yield item
    for item in gen_b():
        yield item

# ✅ Clean: yield from
def combined():
    yield from gen_a()
    yield from gen_b()

list(combined())  # [1, 2, 3, 4]
```

Flatten Nested Structures
```python
from collections.abc import Iterator
from typing import Any

def flatten(nested: list) -> Iterator[Any]:
    """Flatten nested lists recursively."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)  # recursive delegation
        else:
            yield item

nested = [1, [2, 3, [4, 5]], 6, [7, [8, 9]]]
print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

yield from with a Return Value
`yield from` can capture the return value of a sub-generator:
```python
def sub_generator():
    yield 1
    yield 2
    return "done"  # return value

def main_generator():
    result = yield from sub_generator()
    print(f"Sub-generator returned: {result}")
    yield 3

gen = main_generator()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # "Sub-generator returned: done" → 3
```

Bidirectional Communication
`yield from` automatically forwards `.send()` and `.throw()` to the sub-generator:
```python
def accumulator():
    """Sub-generator that receives values via send()."""
    total = 0
    while True:
        value = yield total
        if value is None:
            break
        total += value
    return total

def delegator():
    """Delegate to accumulator."""
    result = yield from accumulator()
    yield f"Final total: {result}"

gen = delegator()
print(next(gen))       # 0 (initial total)
print(gen.send(10))    # 10
print(gen.send(20))    # 30
print(gen.send(5))     # 35
print(gen.send(None))  # "Final total: 35"
```

Tree Traversal with yield from
```python
from collections.abc import Iterator
from dataclasses import dataclass

@dataclass
class TreeNode:
    value: int
    left: "TreeNode | None" = None
    right: "TreeNode | None" = None

    def inorder(self) -> Iterator[int]:
        """In-order traversal using yield from."""
        if self.left:
            yield from self.left.inorder()
        yield self.value
        if self.right:
            yield from self.right.inorder()

# Build tree:      4
#                /   \
#               2     6
#              / \   / \
#             1   3 5   7
root = TreeNode(4,
    TreeNode(2, TreeNode(1), TreeNode(3)),
    TreeNode(6, TreeNode(5), TreeNode(7))
)
print(list(root.inorder()))  # [1, 2, 3, 4, 5, 6, 7]
```

Async Generators (Python 3.6+) ⚡
Basics: async def + yield
```python
import asyncio

async def async_countdown(n: int):
    """Async generator: yield inside an async function."""
    while n > 0:
        yield n
        await asyncio.sleep(0.5)  # non-blocking delay
        n -= 1

async def main():
    async for num in async_countdown(5):
        print(num)

asyncio.run(main())
# 5 (wait 0.5s) 4 (wait 0.5s) 3 (wait 0.5s) 2 (wait 0.5s) 1
```

Async Generators for API Pagination
```python
import asyncio
import aiohttp
from collections.abc import AsyncIterator

async def fetch_all_users(base_url: str) -> AsyncIterator[dict]:
    """Async generator that paginates through an API."""
    async with aiohttp.ClientSession() as session:
        page = 1
        while True:
            url = f"{base_url}/users?page={page}&limit=100"
            async with session.get(url) as response:
                data = await response.json()
            users = data.get("users", [])
            if not users:
                break
            for user in users:
                yield user
            page += 1

async def main():
    async for user in fetch_all_users("https://api.example.com"):
        print(f"Processing: {user['name']}")

asyncio.run(main())
```

Async Generator Expression
```python
import asyncio

async def get_data(n: int) -> int:
    await asyncio.sleep(0.1)
    return n * 2

async def main():
    # Async generator expression (the await makes it async)
    async_gen = (await get_data(i) for i in range(5))
    async for value in async_gen:
        print(value)  # 0, 2, 4, 6, 8

asyncio.run(main())
```

Async Comprehensions (Python 3.6+)
```python
import asyncio

async def fetch_item(item_id: int) -> dict:
    await asyncio.sleep(0.1)
    return {"id": item_id, "data": f"Item {item_id}"}

async def main():
    # List comprehension with await (sequential fetches)
    items = [await fetch_item(i) for i in range(5)]
    print(items)

    # An async generator to iterate over
    async def gen_ids():
        for i in range(5):
            yield i
            await asyncio.sleep(0.05)

    # Collect from the async generator
    results = [item async for item in gen_ids()]
    print(results)  # [0, 1, 2, 3, 4]

asyncio.run(main())
```

Async Context Manager + Generator
```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_resource():
    """Async context manager built from an async generator."""
    print("Acquiring resource...")
    await asyncio.sleep(0.1)
    resource = {"connection": "active"}
    try:
        yield resource
    finally:
        print("Releasing resource...")
        await asyncio.sleep(0.1)

async def main():
    async with managed_resource() as res:
        print(f"Using: {res}")

asyncio.run(main())
# Acquiring resource...
# Using: {'connection': 'active'}
# Releasing resource...
```

Generator-Based Coroutines (Legacy) 📜
⚠️ LEGACY PATTERN
Generator-based coroutines are the old pattern from before Python 3.5; today you should use async/await. Understanding this pattern still helps you read legacy code and deepens your understanding of how coroutines work.
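To see why plain generators could serve as coroutines at all, here is a toy round-robin scheduler that drives several generator "tasks" cooperatively. This is only an illustration of the idea; real event loops such as asyncio's are far more involved:

```python
from collections import deque

def task(name: str, steps: int):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # suspend, handing control back to the scheduler

def run_all(tasks):
    """Drive generator 'coroutines' round-robin until all finish."""
    queue = deque(tasks)
    while queue:
        t = queue.popleft()
        try:
            next(t)          # resume the task until its next yield
            queue.append(t)  # not finished: reschedule at the back
        except StopIteration:
            pass             # finished: drop it

run_all([task("A", 2), task("B", 3)])
# A: step 0, B: step 0, A: step 1, B: step 1, B: step 2
```

Each `yield` is a suspension point, exactly the role `await` plays in modern code.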
Coroutines with @types.coroutine
```python
import types

@types.coroutine
def legacy_sleep(seconds: float):
    """Legacy coroutine: yield to suspend."""
    yield ("sleep", seconds)

async def modern_task():
    """A modern async function can await a legacy coroutine."""
    print("Starting task...")
    await legacy_sleep(1.0)
    print("Task completed!")
```

Generators as Coroutines (Pre-3.5 Pattern)
```python
def coroutine_example():
    """
    Generator-based coroutine pattern.
    Use send() to push data into the generator.
    """
    print("Coroutine started")
    total = 0
    while True:
        value = yield total  # receives the value passed to send()
        if value is None:
            break
        total += value
        print(f"Received: {value}, Total: {total}")
    return total

# Usage
coro = coroutine_example()
next(coro)     # prime the coroutine (run to the first yield)
coro.send(10)  # Received: 10, Total: 10
coro.send(20)  # Received: 20, Total: 30
coro.send(5)   # Received: 5, Total: 35
try:
    coro.send(None)  # terminate
except StopIteration as e:
    print(f"Final: {e.value}")  # Final: 35
```

A Decorator to Auto-Prime Coroutines
```python
from collections.abc import Callable, Generator
from functools import wraps

def coroutine(func: Callable[..., Generator]) -> Callable[..., Generator]:
    """Decorator that auto-primes a generator coroutine."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # prime the generator
        return gen
    return wrapper

@coroutine
def averager():
    """Coroutine computing a running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = averager()  # already primed automatically
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
```

Comparison: Legacy vs Modern
```python
# ❌ Legacy (generator-based, pre-Python 3.5 style)
@types.coroutine
def legacy_fetch(url):
    response = yield from aiohttp_request(url)  # aiohttp_request: hypothetical helper
    return response

# ✅ Modern (Python 3.5+)
async def modern_fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.json()
```

Real-World Applications
1. Data Pipelines with Generators
```python
import re
from collections.abc import Iterator

def read_lines(filepath: str) -> Iterator[str]:
    """Stage 1: read the file lazily."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

def filter_lines(lines: Iterator[str], pattern: str) -> Iterator[str]:
    """Stage 2: filter by regex."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

def parse_log(lines: Iterator[str]) -> Iterator[dict]:
    """Stage 3: parse the log format."""
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) >= 4:
            yield {
                "timestamp": parts[0],
                "level": parts[1],
                "source": parts[2],
                "message": parts[3],
            }

def pipeline(filepath: str, pattern: str) -> Iterator[dict]:
    """Compose the pipeline stages."""
    lines = read_lines(filepath)
    filtered = filter_lines(lines, pattern)
    parsed = parse_log(filtered)
    return parsed

# Process a 10GB file with a tiny memory footprint
for log_entry in pipeline("huge.log", r"ERROR|CRITICAL"):
    print(log_entry)
```

2. Batching with Generators
```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")

def batch(iterable: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield batches of the given size."""
    iterator = iter(iterable)
    while True:
        batch_items = list(islice(iterator, size))
        if not batch_items:
            break
        yield batch_items

# Process 1 million records in batches of 1000
def process_records():
    for i in range(1_000_000):
        yield {"id": i, "data": f"record_{i}"}

for batch_records in batch(process_records(), 1000):
    db.bulk_insert(batch_records)  # bulk insert (db is a placeholder client)
    print(f"Inserted batch of {len(batch_records)}")
```

3. Infinite Sequences
```python
from collections.abc import Iterator
from itertools import islice

def fibonacci() -> Iterator[int]:
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes() -> Iterator[int]:
    """Infinite prime number generator."""
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        for i in range(2, int(n ** 0.5) + 1):
            if n % i == 0:
                return False
        return True

    n = 2
    while True:
        if is_prime(n):
            yield n
        n += 1

# First 10 Fibonacci numbers
print(list(islice(fibonacci(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# First 10 primes
print(list(islice(primes(), 10)))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

4. Sliding Window
```python
from collections import deque
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")

def sliding_window(iterable: Iterable[T], size: int) -> Iterator[tuple[T, ...]]:
    """Generate sliding windows of the given size."""
    iterator = iter(iterable)
    window = deque(islice(iterator, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in iterator:
        window.append(item)
        yield tuple(window)

# Moving average
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for window in sliding_window(data, 3):
    avg = sum(window) / len(window)
    print(f"Window: {window}, Avg: {avg:.2f}")
```

Production Pitfalls ⚠️
1. Generator Exhaustion
```python
# ❌ BAD: a generator can only be consumed once!
gen = (x**2 for x in range(5))
print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] - already exhausted!

# ✅ GOOD: create a fresh generator each time (or materialize a list)
def get_squares():
    return (x**2 for x in range(5))

print(list(get_squares()))  # [0, 1, 4, 9, 16]
print(list(get_squares()))  # [0, 1, 4, 9, 16]
```

2. Generators Cannot Be Indexed
```python
# ❌ BAD: generators do not support indexing
gen = (x for x in range(10))
# gen[5]  # TypeError: 'generator' object is not subscriptable

# ✅ GOOD: convert to a list, or use itertools
from itertools import islice
gen = (x for x in range(10))
fifth = next(islice(gen, 5, 6))  # grab the element at index 5
print(fifth)  # 5
```

3. Generators and Exception Handling
```python
# ❌ RISKY: cleanup timing depends on the consumer
def risky_generator():
    resource = acquire_resource()
    try:
        for item in process(resource):
            yield item
    finally:
        # If the consumer abandons the generator without closing it,
        # this only runs when (if) the garbage collector finalizes it!
        release_resource(resource)

# ✅ GOOD: tie cleanup to a with-block via contextlib
from contextlib import contextmanager

@contextmanager
def managed_generator():
    resource = acquire_resource()
    try:
        yield process(resource)
    finally:
        release_resource(resource)  # always cleaned up, deterministically
```

4. Memory Leaks with Large Generators
```python
# ❌ BAD: holding references to generators you no longer need
generators = []
for i in range(1000):
    gen = (x**2 for x in range(1_000_000))
    generators.append(gen)  # keeps every generator (and what it holds) alive!

# ✅ GOOD: process and discard
for i in range(1000):
    gen = (x**2 for x in range(1_000_000))
    result = sum(gen)  # process immediately
    # gen is garbage collected
```

5. Async Generator Cleanup
python
import asyncio
# ❌ BAD: Không cleanup async generator
async def bad_usage():
async def gen():
try:
while True:
yield await fetch_data()
finally:
print("Cleanup!") # Có thể không chạy!
g = gen()
async for item in g:
if should_stop(item):
break # Generator không được close properly!
# ✅ GOOD: Explicit cleanup
async def good_usage():
async def gen():
try:
while True:
yield await fetch_data()
finally:
print("Cleanup!")
g = gen()
try:
async for item in g:
if should_stop(item):
break
finally:
await g.aclose() # Explicit cleanupBảng Tóm tắt
```python
# === GENERATOR FUNCTION ===
def my_generator():
    yield 1
    yield 2
    yield 3

# === GENERATOR EXPRESSION ===
gen = (x**2 for x in range(10))

# === YIELD FROM (delegation) ===
def combined():
    yield from gen1()
    yield from gen2()

# === ASYNC GENERATOR ===
async def async_gen():
    for i in range(10):
        yield i
        await asyncio.sleep(0.1)

async for item in async_gen():
    print(item)

# === GENERATOR METHODS ===
gen = my_generator()
next(gen)        # get the next value
gen.send(value)  # send a value into the generator
gen.throw(exc)   # raise an exception inside the generator
gen.close()      # close the generator

# === BUILT-IN LAZY ITERATORS ===
range(10)            # lazy range
map(func, iterable)  # lazy map
filter(func, iter)   # lazy filter
zip(iter1, iter2)    # lazy zip
enumerate(iter)      # lazy enumerate

# === ITERTOOLS MODULE ===
from itertools import (
    islice,     # slice an iterator
    chain,      # concatenate iterables
    cycle,      # repeat forever
    count,      # count forever
    groupby,    # group consecutive items
    takewhile,  # take while a condition holds
    dropwhile,  # drop while a condition holds
)
```

Cross-links
- Prerequisites: Functions & Closures
- Related: Asyncio Fundamentals - Async programming
- Related: itertools Module (Phase 2) - Iterator utilities
- Related: Memory Optimization (Phase 3) - Memory-efficient patterns
- Next: Context Managers