The pathlib Module: Modern Filesystem Operations

os.path.join("data", "users", "config.json"): does this line look familiar? It works, but it is fragile. On Windows you get data\users\config.json, on Linux data/users/config.json, and when you need the file extension you have to call os.path.splitext(), a function that returns a tuple whose element order is easy to forget.

Python 3.4 introduced pathlib to address these problems. Instead of manipulating strings by hand, you work with Path objects: objects that understand filesystem structure, handle the separator for the current OS, and expose one consistent API for path operations:

python

from pathlib import Path

config = Path("data") / "users" / "config.json"
print(config.suffix)     # .json
print(config.stem)       # config
print(config.parent)     # data/users
config.read_text(encoding="utf-8")  # read the whole file (the file must exist)

The / operator is overloaded to join paths. It is a signature Pythonic design choice: the code reads like the path itself rather than a fragile string concatenation.

Mental model

Picture two ways of finding your way to an address:

os.path = a paper map: you assemble the street names yourself, choose the separator (/ or \) yourself, and split the file name from its directory yourself. Everything is a plain string, so one wrong character gets you lost.
pathlib = a smart GPS: you just say "go from A to B"; the system handles the separator, knows which OS it is running on, and exposes details (file name, extension, parent directory) through clearly named attributes.

The split is even clearer in the class hierarchy:

PurePath = the map (computes routes only, never touches the filesystem). Use it to parse, join, and compare paths when the file does not need to exist.
Path = GPS + car (computation plus real operations): read files, create directories, check existence.

                   PurePath
                 /     |     \
   PurePosixPath      Path      PureWindowsPath
              \      /    \      /
             PosixPath    WindowsPath
            (Linux/Mac)    (Windows)

Path("report.pdf")
├── .name     → "report.pdf"      # full file name
├── .stem     → "report"          # name without the last extension
├── .suffix   → ".pdf"            # last extension
├── .parent   → Path(".")         # parent directory
├── .parts    → ("report.pdf",)   # tuple of components
└── .anchor   → ""                # drive + root ("/" or "C:\"); empty for a relative path

Golden rule: use PurePath when you only need to compute with paths (parsing URL paths, cross-platform config). Use Path when you need to interact with the real filesystem.
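
The relationship between the pure and concrete classes can be checked directly from Python. The following is a minimal sketch; the variable names are illustrative only, and it relies on nothing beyond the classes that pathlib exports.

```python
from pathlib import (
    Path,
    PosixPath,
    PurePath,
    PurePosixPath,
    PureWindowsPath,
    WindowsPath,
)

# Path builds filesystem I/O on top of the pure (string-only) layer.
path_extends_pure = issubclass(Path, PurePath)

# Each concrete class combines Path with the matching pure flavour.
concrete_is_both = (
    issubclass(PosixPath, Path)
    and issubclass(PosixPath, PurePosixPath)
    and issubclass(WindowsPath, Path)
    and issubclass(WindowsPath, PureWindowsPath)
)

# Only the concrete side offers I/O methods such as read_text().
pure_has_no_io = not hasattr(PurePath, "read_text")
path_has_io = hasattr(Path, "read_text")

print(path_extends_pure, concrete_is_both, pure_has_no_io, path_has_io)
```

Importing WindowsPath on Linux (or PosixPath on Windows) is fine; only instantiating the foreign class fails.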

Technical core

Creating a Path

python

from pathlib import Path

# From a static string
config = Path("/etc/app/config.yaml")

# Current working directory
cwd = Path.cwd()

# The user's home directory
home = Path.home()

# Path relative to the script's location, a very common pattern in real projects
script_dir = Path(__file__).parent
data_dir = script_dir / "data"

Path() automatically instantiates PosixPath on Linux/Mac or WindowsPath on Windows, so your code does not need to branch on the OS.

The `/` operator: joining paths

python

from pathlib import Path

base = Path("project")
readme = base / "docs" / "README.md"
# PosixPath('project/docs/README.md')

# Equivalent to calling __truediv__ explicitly, but far more natural to read
readme = base.__truediv__("docs").__truediv__("README.md")

# Compare with the old approach
import os
readme_old = os.path.join("project", "docs", "README.md")

The / operator accepts either a string or a Path object on the right-hand side. If the right-hand side is an absolute path, it replaces the left-hand side entirely:

python

Path("relative") / "/absolute"
# PosixPath('/absolute'): the left-hand side is discarded
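
The joining rules can be seen side by side in a short sketch. It uses PurePosixPath so the results are the same on every OS; the variable names are illustrative.

```python
from pathlib import PurePosixPath

base = PurePosixPath("project")

# joinpath() is the method form of the / operator.
via_operator = base / "docs" / "README.md"
via_method = base.joinpath("docs", "README.md")

# A plain string may also sit on the left, as long as the other operand is a path object.
reversed_operand = "project" / PurePosixPath("docs")

# An absolute right-hand operand discards everything to its left.
overridden = base / "/etc/passwd"

print(via_operator, via_method, reversed_operand, overridden)
```

The last case is worth remembering whenever the right-hand operand comes from user input.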

Path attributes

python

from pathlib import Path

p = Path("/home/dev/projects/app/data/backup.tar.gz")

p.name       # 'backup.tar.gz'     full file name
p.stem       # 'backup.tar'        name without the last suffix
p.suffix     # '.gz'               last extension only
p.suffixes   # ['.tar', '.gz']     all extensions
p.parent     # Path('/home/dev/projects/app/data')
p.parents[0] # Path('/home/dev/projects/app/data')   same as parent
p.parents[2] # Path('/home/dev/projects')             three levels up
p.parts      # ('/', 'home', 'dev', 'projects', 'app', 'data', 'backup.tar.gz')
p.anchor     # '/'                  on Windows this would be e.g. 'C:\\'

p.is_absolute()              # True
p.is_relative_to("/home")   # True (Python 3.9+)
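
Because .stem and .suffix only deal with the last extension, multi-part extensions such as .tar.gz need a little extra work. Below is a small sketch built on .suffixes; the helper names full_extension and strip_all_suffixes are hypothetical, not part of pathlib.

```python
from pathlib import PurePosixPath


def full_extension(path: PurePosixPath) -> str:
    """Return every suffix joined together, e.g. '.tar.gz'."""
    return "".join(path.suffixes)


def strip_all_suffixes(path: PurePosixPath) -> PurePosixPath:
    """Remove every suffix from the final component, keeping the directory."""
    ext = full_extension(path)
    if not ext:
        return path
    name = path.name
    return path.with_name(name[: len(name) - len(ext)])


archive = PurePosixPath("/home/dev/data/backup.tar.gz")
print(full_extension(archive))      # .tar.gz
print(strip_all_suffixes(archive))  # /home/dev/data/backup
```

One caveat: .suffixes treats every dot as a separator, so a name like report.v1.2.csv would be reduced to report. Apply this only where file names follow a known convention.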

Transforming a Path

Transformation methods return a new Path and leave the original unchanged (Path objects are immutable):

python

from pathlib import Path

p = Path("/data/reports/quarterly.xlsx")

# Change the extension
p.with_suffix(".csv")          # Path('/data/reports/quarterly.csv')

# Change the file name (keep the directory)
p.with_name("annual.xlsx")     # Path('/data/reports/annual.xlsx')

# Change the stem (keep the extension), Python 3.9+
p.with_stem("monthly")         # Path('/data/reports/monthly.xlsx')

# resolve(): make the path absolute, collapse "..", and follow symlinks
Path("./src/../src/main.py").resolve()
# Path('/full/absolute/path/src/main.py')

# Relative path between two paths
Path("/home/dev/project/src").relative_to("/home/dev")
# PosixPath('project/src')
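
Two details of these methods are easy to miss: the original object never changes, and relative_to() raises ValueError when the path is not under the given base. A minimal sketch follows; try_relative is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from typing import Optional

original = PurePosixPath("/data/reports/quarterly.xlsx")
converted = original.with_suffix(".csv")  # a new object; original is untouched


def try_relative(path: PurePosixPath, base: str) -> Optional[PurePosixPath]:
    """Return path relative to base, or None when path is not under base."""
    try:
        return path.relative_to(base)
    except ValueError:
        return None


inside = try_relative(original, "/data")
outside = try_relative(original, "/var")

print(original, converted, inside, outside)
```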

File I/O: reading and writing

python

from pathlib import Path

file = Path("config.json")

# === READ ===
# Read the whole text file; always specify the encoding
content = file.read_text(encoding="utf-8")

# Read binary data (images, PDF, protobuf...)
raw = file.read_bytes()

# Read line by line to save memory on large files
with file.open("r", encoding="utf-8") as f:
    for line in f:
        print(line.strip())  # replace print with your own per-line handling

# === WRITE ===
# Write text (overwrites everything)
file.write_text('{"key": "value"}', encoding="utf-8")

# Write binary
file.write_bytes(b"\x89PNG\r\n\x1a\n")

# Append: requires open() because write_text always overwrites
with file.open("a", encoding="utf-8") as f:
    f.write("\n// appended line")

⚠️ write_text ALWAYS OVERWRITES

write_text() and write_bytes() replace the entire file content. They have no append mode. Use open("a") when you need to append.
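
The difference is easy to confirm in a throwaway directory. This sketch uses a hypothetical file name, notes.txt, inside a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"

    notes.write_text("first\n", encoding="utf-8")
    notes.write_text("second\n", encoding="utf-8")  # replaces "first"
    after_overwrite = notes.read_text(encoding="utf-8")

    with notes.open("a", encoding="utf-8") as f:    # keeps the existing content
        f.write("third\n")
    after_append = notes.read_text(encoding="utf-8")

print(repr(after_overwrite))
print(repr(after_append))
```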

Glob patterns: finding files

python

from pathlib import Path

src = Path("./src")

# Find .py files directly inside src/
py_files = list(src.glob("*.py"))

# Recursive search (all subdirectories)
all_py = list(src.glob("**/*.py"))

# rglob(pattern): shortcut for glob("**/" + pattern)
all_py = list(src.rglob("*.py"))

# More advanced patterns
src.glob("test_*.py")          # prefix matching
src.glob("**/[!_]*.py")        # exclude files whose name starts with _
src.glob("module_?.py")        # ? = exactly one character

# pathlib does NOT support brace expansion such as {json,yaml}
# Workaround: combine several globs
from itertools import chain

configs = chain(
    src.rglob("*.json"),
    src.rglob("*.yaml"),
    src.rglob("*.toml"),
)
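
The chained-glob workaround can be wrapped in a small helper. The sketch below assumes a hypothetical function name, rglob_any, and removes duplicates in case two patterns match the same file.

```python
import tempfile
from itertools import chain
from pathlib import Path
from typing import Iterable, List


def rglob_any(base: Path, patterns: Iterable[str]) -> List[Path]:
    """Recursively match any of several patterns; sorted, without duplicates."""
    matches = chain.from_iterable(base.rglob(pattern) for pattern in patterns)
    return sorted(set(matches))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.json", "b.yaml", "nested/c.toml", "nested/d.txt"):
        (root / name).write_text("", encoding="utf-8")

    # "a.*" overlaps with "*.json" on purpose to show the de-duplication.
    hits = rglob_any(root, ["*.json", "*.yaml", "*.toml", "a.*"])
    found = [p.relative_to(root).as_posix() for p in hits]

print(found)
```

Each pattern triggers its own directory traversal, so for very large trees a single pass with manual filtering may be the better trade-off.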

iterdir: walking a directory

python

from pathlib import Path

project = Path(".")

# List every entry in the directory (not recursive)
for entry in project.iterdir():
    kind = "📁" if entry.is_dir() else "📄"
    print(f"{kind} {entry.name}")

# Keep files only
files = [p for p in project.iterdir() if p.is_file()]

# Keep subdirectories only
subdirs = [p for p in project.iterdir() if p.is_dir()]

# Sort by name
sorted_entries = sorted(project.iterdir(), key=lambda p: p.name.lower())

Checking and creating: exists, mkdir, touch, unlink

python

from pathlib import Path

path = Path("data/output/2024")

# Check existence
path.exists()       # True/False
path.is_file()      # True for a regular file (or a symlink pointing to one)
path.is_dir()       # True for a directory
path.is_symlink()   # True for a symbolic link

# Create a directory: parents=True creates missing parents, exist_ok=True tolerates an existing directory
path.mkdir(parents=True, exist_ok=True)

# Create an empty file (or update the timestamp if it already exists)
Path("marker.lock").touch(exist_ok=True)

# Delete a file
Path("temp.txt").unlink(missing_ok=True)  # missing_ok: Python 3.8+

# Delete an empty directory
Path("empty_dir").rmdir()

# Delete a directory with contents: needs shutil
import shutil
shutil.rmtree(Path("build_output"))
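
The flags above exist because the default behaviour is strict. A short sketch of what happens without them, run inside a temporary directory with illustrative names:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out"
    target.mkdir()

    # Without exist_ok=True, a second mkdir() raises FileExistsError.
    try:
        target.mkdir()
        second_mkdir = "ok"
    except FileExistsError:
        second_mkdir = "FileExistsError"

    # rmdir() refuses to remove a directory that still has entries.
    (target / "data.txt").touch()
    try:
        target.rmdir()
        rmdir_non_empty = "ok"
    except OSError:
        rmdir_non_empty = "OSError"

    # unlink(missing_ok=True) stays silent for a path that does not exist.
    (target / "ghost.txt").unlink(missing_ok=True)
    still_there = target.exists()

print(second_mkdir, rmdir_non_empty, still_there)
```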

PurePath vs Path — cross-platform

python

from pathlib import PurePosixPath, PureWindowsPath, Path

# PurePath: path computation WITHOUT touching the filesystem
# Useful for parsing paths from config, URLs, or a system running a different OS

# Parse a Windows path on Linux
win_path = PureWindowsPath("C:\\Users\\dev\\project\\main.py")
print(win_path.name)       # main.py
print(win_path.parts)      # ('C:\\', 'Users', 'dev', 'project', 'main.py')

# Parse a Unix path on Windows
unix_path = PurePosixPath("/var/log/app/error.log")
print(unix_path.parent)    # /var/log/app

# Path: OS-dependent; instantiates PosixPath on Linux/Mac and WindowsPath on Windows
p = Path("data/file.txt")
# Linux   → PosixPath('data/file.txt')
# Windows → WindowsPath('data/file.txt')  (the repr uses forward slashes; str(p) gives data\file.txt)

When should you use PurePath? When you need to parse path strings from another system (e.g. reading Windows paths from a config file on a Linux server) and the files do not need to exist.
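
As a concrete illustration of that scenario, the sketch below takes a Windows path string and rewrites it for a POSIX host. The /mnt/c mount point is purely an assumption for the example (a WSL-style layout), not something pathlib knows about.

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\Users\dev\project\main.py")

# as_posix() renders the same path with forward slashes.
forward_slashes = win.as_posix()

# Split off the drive and root, then re-root the remainder on a POSIX prefix.
tail = win.relative_to(win.anchor)
rebuilt = PurePosixPath("/mnt/c").joinpath(*tail.parts)  # "/mnt/c" is a hypothetical mount point

# Windows paths compare case-insensitively; POSIX paths do not.
same_on_windows = PureWindowsPath("C:/Users/Dev") == PureWindowsPath(r"c:\users\dev")
same_on_posix = PurePosixPath("/Users/Dev") == PurePosixPath("/users/dev")

print(forward_slashes, rebuilt, same_on_windows, same_on_posix)
```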

In practice

Task: a cross-platform project scaffolding tool

Build a tool that creates the directory structure for a Python project, finds and processes files by pattern, and writes files safely, all of which must behave correctly on every OS.

Step 1: Create the project directory structure

python

from pathlib import Path


def scaffold_project(root: str | Path, name: str) -> Path:
    """Create a standard directory layout for a Python project."""
    project = Path(root) / name
    
    directories = [
        project / "src" / name,
        project / "tests",
        project / "docs",
        project / "scripts",
        project / "data" / "raw",
        project / "data" / "processed",
    ]
    
    for d in directories:
        d.mkdir(parents=True, exist_ok=True)
    
    # Create __init__.py for the packages
    (project / "src" / name / "__init__.py").touch()
    (project / "tests" / "__init__.py").touch()
    
    # Create the basic project files
    (project / "README.md").write_text(
        f"# {name}\n\nProject description.\n",
        encoding="utf-8",
    )
    (project / ".gitignore").write_text(
        "__pycache__/\n*.pyc\n.venv/\ndist/\n*.egg-info/\n",
        encoding="utf-8",
    )
    
    return project


result = scaffold_project(".", "my_analyzer")
print(f"Project created at: {result.resolve()}")

Step 2: Find and process files by pattern

python

from pathlib import Path


def analyze_project_files(project: Path) -> dict:
    """Collect statistics about the files in a project."""
    stats = {
        "total_files": 0,
        "by_extension": {},
        "largest_files": [],
        "empty_files": [],
    }
    
    for f in project.rglob("*"):
        if not f.is_file():
            continue
        if f.name.startswith("."):
            continue
        
        stats["total_files"] += 1
        ext = f.suffix or "(no ext)"
        stats["by_extension"][ext] = stats["by_extension"].get(ext, 0) + 1
        
        size = f.stat().st_size
        if size == 0:
            stats["empty_files"].append(str(f.relative_to(project)))
        
        stats["largest_files"].append((str(f.relative_to(project)), size))
    
    stats["largest_files"].sort(key=lambda x: x[1], reverse=True)
    stats["largest_files"] = stats["largest_files"][:10]
    
    return stats

Step 3: Safe atomic file write

python

from pathlib import Path
import json
import os
import tempfile


def safe_write_json(path: Path, data: dict, indent: int = 2) -> None:
    """Write JSON safely: if the process crashes midway, the original file stays intact."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    
    content = json.dumps(data, indent=indent, ensure_ascii=False)
    
    # Write to a temp file in the same directory so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent,
        prefix=f".{path.stem}_",
        suffix=".tmp",
    )
    tmp = Path(tmp_name)
    
    try:
        # Wrap the descriptor returned by mkstemp so it is closed instead of leaked
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        tmp.replace(path)  # atomic rename on the same filesystem
    except Exception:
        tmp.unlink(missing_ok=True)
        raise


# Usage
config = {"database": {"host": "localhost", "port": 5432}}
safe_write_json(Path("config/db.json"), config)

Step 4: Migrating from os.path to pathlib

python

# ❌ OLD: codebase built on os.path
import os

def find_configs_old(base_dir: str) -> list[str]:
    configs = []
    for root, dirs, files in os.walk(base_dir):
        for f in files:
            if f.endswith((".json", ".yaml", ".yml")):
                full = os.path.join(root, f)
                if os.path.getsize(full) > 0:
                    configs.append(os.path.abspath(full))
    return configs

def read_config_old(filepath: str) -> str:
    if not os.path.exists(filepath):
        return ""
    with open(filepath, "r") as f:
        return f.read()


# ✅ NEW: pathlib is shorter, safer, and clearer
from pathlib import Path

def find_configs(base_dir: str | Path) -> list[Path]:
    return [
        p.resolve()
        for p in Path(base_dir).rglob("*")
        if p.is_file()
        and p.suffix in {".json", ".yaml", ".yml"}
        and p.stat().st_size > 0
    ]

def read_config(filepath: str | Path) -> str:
    try:
        return Path(filepath).read_text(encoding="utf-8")
    except FileNotFoundError:
        return ""

Common mistakes

❌ WRONG: Not specifying encoding in read_text

python

# ❌ WRONG: uses the system default encoding (Windows: often cp1252, Linux: usually utf-8)
content = Path("data.csv").read_text()
# Works on Linux; on Windows, Vietnamese characters come back garbled or raise UnicodeDecodeError

python

# ✅ RIGHT: always specify the encoding explicitly
content = Path("data.csv").read_text(encoding="utf-8")

# If the file may use a different encoding, handle decoding errors
content = Path("legacy.txt").read_text(encoding="utf-8", errors="replace")

Why: read_text() has no fixed default encoding; without the encoding argument it falls back to the locale's preferred encoding (locale.getpreferredencoding()). The same file can therefore decode differently on Windows and Linux.
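
The effect can be reproduced on any OS by naming the "wrong" encoding explicitly. In this sketch, cp1252 stands in for a typical Windows default; that choice is an assumption for the demonstration.

```python
import tempfile
from pathlib import Path

text = "Tiếng Việt"

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.txt"
    sample.write_text(text, encoding="utf-8")

    right = sample.read_text(encoding="utf-8")

    # Simulates reading with a cp1252 default. errors="replace" keeps the demo
    # from raising on byte values that cp1252 leaves undefined.
    wrong = sample.read_text(encoding="cp1252", errors="replace")

print(right)
print(wrong)  # mojibake: the bytes are valid, the interpretation is not
```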

❌ WRONG: TOCTOU race condition

python

# ❌ WRONG: there is a gap between the check and the read
path = Path("data.json")
if path.exists():           # ← t1: the file exists
    data = path.read_text() # ← t2: another process may have deleted it by now!

python

# ✅ RIGHT: EAFP, just do it and catch the error if it happens
try:
    data = Path("data.json").read_text(encoding="utf-8")
except FileNotFoundError:
    data = "{}"
except PermissionError:
    raise RuntimeError("No permission to read data.json")

Why: in production, where many processes run concurrently, the file can disappear between exists() and read_text(). EAFP removes that window because the check and the read become a single operation.
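
The same idea packages neatly into a reusable loader. The following is a sketch under the assumption that a missing file should fall back to a caller-supplied default; load_json_or_default is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def load_json_or_default(path: Path, default: dict) -> dict:
    """Read and parse a JSON file; return default if the file does not exist."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return default


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "settings.json"
    present.write_text('{"debug": true}', encoding="utf-8")

    loaded = load_json_or_default(present, {})
    fallback = load_json_or_default(Path(tmp) / "missing.json", {"debug": False})

print(loaded, fallback)
```

Only FileNotFoundError is swallowed here; permission problems and malformed JSON still surface as exceptions, which is usually what you want.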

❌ WRONG: Path injection from user input

python

# ❌ WRONG: user input goes straight into the path
def download_file(filename: str) -> bytes:
    return (Path("uploads") / filename).read_bytes()

# Attacker sends: filename = "../../../etc/shadow"
# Result: Path("uploads/../../../etc/shadow") → reads a system file!

python

# ✅ RIGHT: verify that the path does not escape the base directory
def download_file_safe(filename: str) -> bytes:
    base = Path("uploads").resolve()
    target = (base / filename).resolve()
    
    if not target.is_relative_to(base):
        raise ValueError(f"Path traversal blocked: {filename}")
    
    return target.read_bytes()

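Why: resolve() turns ../ segments into a real absolute path, and is_relative_to() (Python 3.9+) checks that the result still lies inside the allowed directory. Path traversal is a well-known vulnerability class; the OWASP Top 10 covers it under Broken Access Control.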
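
To see the check in action, the sketch below runs a few hostile inputs against a temporary uploads directory. resolve_inside is a hypothetical helper that mirrors the resolve() + is_relative_to() logic; the attack strings are illustrative.

```python
import tempfile
from pathlib import Path


def resolve_inside(base: Path, user_supplied: str) -> Path:
    """Resolve user_supplied under base; raise ValueError if it escapes base."""
    base = base.resolve()
    target = (base / user_supplied).resolve()
    if not target.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"path escapes {base}: {user_supplied!r}")
    return target


with tempfile.TemporaryDirectory() as tmp:
    uploads = Path(tmp) / "uploads"
    uploads.mkdir()
    (uploads / "report.txt").write_text("ok", encoding="utf-8")

    allowed = resolve_inside(uploads, "report.txt").name

    outcomes = {}
    for attempt in ("../secret.txt", "/etc/passwd", "sub/../../secret.txt"):
        try:
            resolve_inside(uploads, attempt)
            outcomes[attempt] = "allowed"
        except ValueError:
            outcomes[attempt] = "blocked"

print(allowed, outcomes)
```

Note that the absolute input is caught by the same check: joining it replaces the base entirely, so the resolved target no longer lies under uploads.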

❌ WRONG: read_text for large files

python

# ❌ WRONG: loads a 5GB log file into RAM
log = Path("production.log").read_text(encoding="utf-8")  # 💥 may raise MemoryError
lines_with_error = [l for l in log.splitlines() if "ERROR" in l]

python

# ✅ RIGHT: stream processing, only one line is held in RAM at a time
def count_errors(log_path: Path) -> int:
    count = 0
    with log_path.open("r", encoding="utf-8") as f:
        for line in f:  # the file object is an iterator that reads line by line
            if "ERROR" in line:
                count += 1
    return count

Why: read_text() loads the whole content into RAM. A 5GB file needs at least 5GB of RAM (often more, because of Python's string overhead). Use open() and iterate over the file to process it as a stream.
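
The same streaming principle applies to binary files, where there are no lines to iterate over. A common approach is to read fixed-size chunks; the sketch below hashes a file that way. The function name sha256_of and the chunk size are illustrative choices.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter(callable, sentinel) keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"x" * 3_000_000)

    streamed = sha256_of(sample, chunk_size=64 * 1024)
    in_memory = hashlib.sha256(sample.read_bytes()).hexdigest()

print(streamed == in_memory)
```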

❌ WRONG: Glob everything, then filter

python

# ❌ WRONG: rglob("*") fetches everything, then filters in Python
all_files = list(Path(".").rglob("*"))  # 100,000 files in a list
py_files = [f for f in all_files if f.suffix == ".py"]  # filtered down to 500

python

# ✅ RIGHT: a specific glob pattern yields only what you need
py_files = list(Path(".").rglob("*.py"))  # only 500 files

# Better still: return a generator when the caller does not need the whole list
from typing import Iterator

def find_large_scripts(base: Path, min_kb: int = 10) -> Iterator[Path]:
    return (
        p for p in base.rglob("*.py")
        if p.stat().st_size > min_kb * 1024
    )

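Why: rglob("*") yields a Path object for EVERY file and directory. In a large project (node_modules, .git) that means hundreds of thousands of objects you immediately throw away. A specific pattern lets pathlib do the name matching internally and yield only the matches. Note that rglob still descends into every subdirectory, so the traversal cost itself remains.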
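
When whole subtrees such as node_modules should be skipped, one option is to fall back to os.walk, which allows pruning directories during traversal. This is a sketch of that alternative, not a pathlib feature; SKIP_DIRS and iter_files_pruned are hypothetical names.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator

SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # illustrative ignore list


def iter_files_pruned(base: Path, suffix: str) -> Iterator[Path]:
    """Yield files ending in suffix without descending into SKIP_DIRS."""
    for root, dirs, files in os.walk(base):
        # Editing dirs in place tells os.walk not to enter those directories.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for name in files:
            if name.endswith(suffix):
                yield Path(root) / name


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "node_modules" / "pkg").mkdir(parents=True)
    (root / "src" / "app.py").touch()
    (root / "node_modules" / "pkg" / "setup.py").touch()

    pruned = sorted(p.relative_to(root).as_posix() for p in iter_files_pruned(root, ".py"))
    unpruned = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))

print(pruned)
print(unpruned)
```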

Under the Hood

Path dispatch: PosixPath vs WindowsPath

When you write Path("file.txt"), Python does not instantiate the Path class directly. Instead, Path.__new__() checks os.name:

python

# Simplified internal logic
class Path(PurePath):
    def __new__(cls, *args, **kwargs):
        if cls is Path:
            cls = WindowsPath if os.name == "nt" else PosixPath
        return super().__new__(cls, *args, **kwargs)

Consequence: you cannot instantiate WindowsPath on Linux or PosixPath on Windows; doing so raises NotImplementedError (from Python 3.13, its subclass pathlib.UnsupportedOperation). This is why the PurePath variants exist: they do not depend on the OS because they never touch the filesystem.
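
The dispatch and its restriction can be observed without knowing which OS the code runs on. In this sketch, native_cls and foreign_cls are illustrative names selected with the same os.name test.

```python
import os
from pathlib import Path, PosixPath, WindowsPath

native_cls = WindowsPath if os.name == "nt" else PosixPath
foreign_cls = PosixPath if os.name == "nt" else WindowsPath

# Path(...) hands back an instance of the class that matches the current OS.
created = Path("file.txt")
is_native = type(created) is native_cls

# Instantiating the other concrete class is rejected.
try:
    foreign_cls("file.txt")
    foreign_result = "created"
except NotImplementedError:  # also catches pathlib.UnsupportedOperation (3.13+)
    foreign_result = "NotImplementedError"

print(type(created).__name__, foreign_result)
```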

Migration table: os.path → pathlib

os / os.path	pathlib	Notes
`os.path.join(a, b)`	`Path(a) / b`	The `/` operator
`os.path.dirname(p)`	`Path(p).parent`	Returns a Path, not a str
`os.path.basename(p)`	`Path(p).name`	Includes the extension
`os.path.splitext(p)`	`.stem` + `.suffix`	Split into two attributes; `.stem` drops the directory part
`os.path.exists(p)`	`Path(p).exists()`	Method on the object
`os.path.isfile(p)`	`Path(p).is_file()`
`os.path.isdir(p)`	`Path(p).is_dir()`
`os.path.abspath(p)`	`Path(p).resolve()`	resolve() also follows symlinks; `Path(p).absolute()` does not
`os.path.expanduser(p)`	`Path(p).expanduser()`	Expands `~`
`os.getcwd()`	`Path.cwd()`	Class method
`os.listdir(p)`	`Path(p).iterdir()`	Returns an iterator of Path objects
`os.walk(p)`	`Path(p).rglob("*")`	No (root, dirs, files) split; `Path.walk()` is available from Python 3.12
`glob.glob(pat)`	`Path(".").glob(pat)`	Built in
`open(p, "r")`	`Path(p).open("r")`	Or `read_text()`
`os.makedirs(p)`	`Path(p).mkdir(parents=True)`
`os.remove(p)`	`Path(p).unlink()`
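
Several rows of the table can be verified mechanically by comparing the string results of both APIs. The sketch below uses a relative example path; it uses Path.absolute() rather than resolve() for the abspath comparison, since neither of those two follows symlinks.

```python
import os.path
from pathlib import Path

old = os.path.join("data", "users", "config.json")
new = Path("data") / "users" / "config.json"

same_join = str(new) == old
same_dirname = str(new.parent) == os.path.dirname(old)
same_basename = new.name == os.path.basename(old)

# splitext keeps the directory in the first element; with_suffix("") does the same.
root, ext = os.path.splitext(old)
same_ext = new.suffix == ext
same_root = str(new.with_suffix("")) == root

same_abs = str(new.absolute()) == os.path.abspath(old)

print(same_join, same_dirname, same_basename, same_ext, same_root, same_abs)
```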

Performance: pathlib vs os.path

pathlib is slower than os.path in micro-benchmarks because each operation creates a new Path object (object-creation overhead). However:

python

# Illustrative figures: finding 10,000 .py files in a large project
# (actual timings depend on hardware, filesystem, and Python version)
# os.path + os.walk:  ~1.2s
# pathlib.rglob:      ~1.4s  (about 15% slower)
# Takeaway: the gap is negligible for most applications

When performance really matters: processing millions of paths in a tight loop (e.g. a build system or file indexer). There, os.scandir() + os.path is still faster. For most other use cases, prefer the readability of pathlib.
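
If you want numbers for your own machine, a micro-benchmark with timeit is a reasonable starting point. The sketch below only measures path joining, with an arbitrary iteration count; results will vary by hardware and Python version, so treat them as indicative.

```python
import os.path
import timeit
from pathlib import PurePath

N = 20_000  # arbitrary iteration count for the sketch


def join_with_os_path() -> str:
    return os.path.join("project", "src", "pkg", "module.py")


def join_with_pathlib() -> str:
    return str(PurePath("project", "src", "pkg", "module.py"))


t_os = timeit.timeit(join_with_os_path, number=N)
t_pathlib = timeit.timeit(join_with_pathlib, number=N)

print(f"os.path: {t_os:.4f}s  pathlib: {t_pathlib:.4f}s")
```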

Path.resolve() vs os.path.abspath()

The two functions are not equivalent:

python

import os
from pathlib import Path

# os.path.abspath: pure string processing, does NOT consult the filesystem
os.path.abspath("./link_to_dir/../file.txt")
# → '/cwd/file.txt' (".." is collapsed lexically, so the symlink is never followed)

# Path.resolve: follows symlinks and returns the real path on the filesystem
Path("./link_to_dir/../file.txt").resolve()
# → '/actual/target/dir/../file.txt' → '/actual/target/file.txt'
# (resolves the symlink FIRST, then applies "..")

resolve() is the safer choice because it returns the real on-disk path, which matters when checking for path traversal.
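
The difference shows up as soon as a real symlink is involved. The sketch below builds one in a temporary directory; the directory names are illustrative, and it assumes a platform where the current user may create symlinks (on Windows this can require extra privileges).

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()  # normalise the temp dir itself first
    real_dir = root / "actual" / "target" / "dir"
    real_dir.mkdir(parents=True)

    link = root / "link_to_dir"
    link.symlink_to(real_dir, target_is_directory=True)

    tricky = link / ".." / "file.txt"

    lexical = Path(os.path.abspath(tricky))  # collapses ".." as plain text
    physical = tricky.resolve()              # follows the link, then applies ".."

    lexical_rel = lexical.relative_to(root).as_posix()
    physical_rel = physical.relative_to(root).as_posix()

print(lexical_rel)
print(physical_rel)
```

The two calls end up in different directories for the same input, which is exactly why a traversal check should be built on resolve().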

Checklist to remember

✅ Implementation checklist

Creating and joining Paths

[ ] Use Path() instead of string concatenation for every path
[ ] Use the / operator to join paths: base / "sub" / "file.txt"
[ ] Use Path(__file__).parent to get the directory containing the current script
[ ] Use resolve() when you need an absolute path with symlinks resolved

Reading / writing files

[ ] ALWAYS pass encoding="utf-8" to read_text() and write_text()
[ ] Use open() + iteration for large files; do NOT call read_text() on files > 100MB
[ ] Use the atomic write pattern (temp file + replace()) for important data
[ ] Create parent directories before writing: path.parent.mkdir(parents=True, exist_ok=True)

Finding files

[ ] Use a specific glob pattern ("*.py") instead of globbing everything and filtering
[ ] Use rglob(pattern) for recursive search; it is equivalent to glob("**/" + pattern)
[ ] Use iterdir() when you only need one directory level

Safety and security

[ ] Validate user-supplied paths with resolve() + is_relative_to(base) to prevent path traversal
[ ] Prefer EAFP (try/except) over LBYL (if exists → read) to avoid TOCTOU race conditions
[ ] Check is_symlink() before deleting files in user-controlled directories
[ ] Use unlink(missing_ok=True) instead of checking exists() first

Practice exercises

🧠 Quiz

Question 1: What does Path("data.tar.gz").suffix return?

[ ] A. ".tar.gz"
[ ] B. ".tar"
[x] C. ".gz"
[ ] D. "tar.gz"

Explanation: .suffix returns only the last extension. Use .suffixes to get all of them: ['.tar', '.gz']. Use "".join(p.suffixes) if you need ".tar.gz".

🧠 Quiz

Question 2: What happens when you evaluate Path("a") / "/b/c"?

[ ] A. Path("a/b/c")
[x] B. Path("/b/c")
[ ] C. ValueError: absolute paths cannot be joined
[ ] D. Path("a//b/c")

Explanation: When the right-hand side of / is an absolute path, it replaces the left-hand side entirely. This matches os.path.join("a", "/b/c") → "/b/c".

Exercise 1: Clean up build artifacts. Write a function that deletes every *.pyc file and every __pycache__ directory in a project

python

from pathlib import Path
import shutil


def clean_pycache(project_root: str | Path) -> dict[str, int]:
    """Delete every __pycache__ directory and .pyc file in the project."""
    root = Path(project_root).resolve()
    removed = {"files": 0, "dirs": 0}
    
    # Delete .pyc files first; list() takes a snapshot so deleting does not disturb the traversal
    for pyc in list(root.rglob("*.pyc")):
        if pyc.resolve().is_relative_to(root):
            pyc.unlink()
            removed["files"] += 1
    
    # Delete __pycache__ directories, deepest paths first
    for cache_dir in sorted(root.rglob("__pycache__"), reverse=True):
        if cache_dir.is_dir() and cache_dir.resolve().is_relative_to(root):
            shutil.rmtree(cache_dir)
            removed["dirs"] += 1
    
    return removed


# Test
result = clean_pycache(".")
print(f"Removed {result['files']} .pyc files and {result['dirs']} __pycache__ directories")

Key points: sort in reverse so nested directories are deleted before their parents. Check is_relative_to to avoid following symlinks out of the project.

Exercise 2: Project structure report. Build a text tree view of a directory, skipping hidden directories and node_modules

python

from pathlib import Path


IGNORE_DIRS = {".git", ".venv", "node_modules", "__pycache__", ".mypy_cache"}


def tree(directory: Path, prefix: str = "", max_depth: int = 4) -> str:
    """Build a text tree view."""
    if max_depth <= 0:
        return prefix + "..."
    
    entries = sorted(
        directory.iterdir(),
        key=lambda p: (not p.is_dir(), p.name.lower()),
    )
    
    # Drop hidden directories and anything on the ignore list
    entries = [
        e for e in entries
        if not (e.is_dir() and (e.name in IGNORE_DIRS or e.name.startswith(".")))
    ]
    
    lines = []
    for i, entry in enumerate(entries):
        is_last = i == len(entries) - 1
        connector = "└── " if is_last else "├── "
        
        if entry.is_dir():
            lines.append(f"{prefix}{connector}📁 {entry.name}/")
            extension = "    " if is_last else "│   "
            subtree = tree(entry, prefix + extension, max_depth - 1)
            if subtree:  # an empty directory yields "", which would add a blank line
                lines.append(subtree)
        else:
            size_kb = entry.stat().st_size / 1024
            lines.append(f"{prefix}{connector}{entry.name} ({size_kb:.1f}KB)")
    
    return "\n".join(lines)


# Usage
print(tree(Path(".")))

Key points: sort so directories come first, and use iterdir() rather than rglob because only one level is needed at a time. max_depth keeps the recursion from going too deep.

Exercise 3: Batch rename files. Rename every image from IMG_XXXX.jpg to YYYY-MM-DD_NNN.jpg based on its modification date

python

from pathlib import Path
from datetime import datetime
from collections import defaultdict


def batch_rename_photos(photo_dir: str | Path, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename photos by modification date. dry_run=True only returns a preview."""
    base = Path(photo_dir).resolve()
    renames = []
    date_counters: dict[str, int] = defaultdict(int)
    
    photos = sorted(
        base.glob("IMG_*.jpg"),
        key=lambda p: p.stat().st_mtime,
    )
    
    for photo in photos:
        mtime = datetime.fromtimestamp(photo.stat().st_mtime)
        date_str = mtime.strftime("%Y-%m-%d")
        date_counters[date_str] += 1
        counter = date_counters[date_str]
        
        new_name = f"{date_str}_{counter:03d}.jpg"
        new_path = photo.parent / new_name
        
        renames.append((photo.name, new_name))
        
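        if not dry_run:
            if new_path.exists():  # rename() can silently overwrite an existing file on POSIX
                raise FileExistsError(f"Refusing to overwrite {new_path}")
            photo.rename(new_path)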
    
    return renames


# Preview first
for old, new in batch_rename_photos("./photos", dry_run=True):
    print(f"  {old} → {new}")

# Run for real once confirmed
# batch_rename_photos("./photos", dry_run=False)

Key points: always offer a dry_run mode. Sort by st_mtime before numbering. Use defaultdict to count photos per date.

Further learning

Glossary keywords: pathlib, Path, PurePath, glob, rglob, iterdir, read_text, write_text, cross-platform, os.path migration, atomic write, TOCTOU, path traversal

Related searches: Python file operations, cross-platform paths, reading and writing files in Python, finding files with glob, migrating from os.path to pathlib
