B-Tree & B+ Tree — Kiến trúc sư tầng lưu trữ

Một câu truy vấn SELECT * FROM users WHERE id = 42 trên bảng 1 tỷ bản ghi. Nếu dùng AVL Tree làm index, chiều cao cây ≈ 30 tầng → 30 lần random disk seek × 10ms/lần = 300ms chỉ để tìm một record. Một database phải xử lý hàng nghìn query/giây — con số đó hoàn toàn không chấp nhận được.

B-Tree giải quyết bài toán này bằng cách tăng số nhánh con tại mỗi node lên hàng trăm thay vì chỉ 2. Chiều cao cây giảm xuống 3-4 tầng ngay cả với hàng tỷ bản ghi. B+ Tree cải tiến thêm: toàn bộ dữ liệu nằm ở lá, các lá liên kết thành danh sách — cho phép range scan tuần tự cực nhanh. Đây là cấu trúc đứng sau mọi database index từ PostgreSQL, MySQL InnoDB đến filesystem index của NTFS và ext4.

Bức tranh tư duy

Hình dung một thư viện khổng lồ với 1 tỷ cuốn sách. Nếu tổ chức theo kiểu cây nhị phân — mỗi ngã rẽ chỉ có hai hướng "trái" hoặc "phải" — bạn phải đi qua 30 ngã rẽ để tìm một cuốn sách. Mỗi ngã rẽ là một lần mở cửa phòng mới (disk seek).

B-Tree tổ chức theo kiểu khác: mỗi phòng chứa hàng trăm bảng chỉ dẫn, mỗi bảng dẫn đến một phòng con. Chỉ cần 3-4 phòng là đến được bất kỳ cuốn sách nào. B+ Tree đi xa hơn: sách thật chỉ nằm ở phòng cuối cùng (lá), và các phòng lá nối nhau bằng hành lang (linked list) — muốn lấy 100 cuốn liên tiếp chỉ cần đi thẳng qua hành lang.

B+ Tree bậc 4 (mỗi node chứa tối đa 3 keys):

            [30 | 60]                   ← root (internal)
           /    |    \
     [10|20] [40|50] [70|80|90]         ← internal nodes
      / | \   / | \   / | | \
    lá  lá lá lá lá lá lá lá lá         ← leaf nodes (chứa data)
    ←→ ←→ ←→ ←→ ←→ ←→ ←→ ←→           ← linked list giữa lá

Cốt lõi kỹ thuật

Tại sao cây nhị phân thất bại trên disk?

Tiêu chí	AVL/Red-Black Tree	B-Tree (bậc 1000)
Chiều cao (10⁹ records)	≈ 30	≈ 3
Disk seeks cho 1 lookup	30	3
Thời gian (HDD 10ms/seek)	300ms	30ms
Thời gian (SSD 0.1ms/seek)	3ms	0.3ms

RAM truy cập ngẫu nhiên trong ~100ns. HDD cần ~10ms — chậm hơn 100.000 lần. Mỗi lần nhảy đến node mới trên disk là một random seek. Mục tiêu: giảm tối đa số lần seek bằng cách tăng fanout tại mỗi node.

B-Tree: Cấu trúc và tính chất

B-Tree bậc m (order m) tuân theo các quy tắc:

Mỗi node chứa tối đa m-1 keys và m con.
Mỗi node (trừ root) chứa tối thiểu ⌈m/2⌉ - 1 keys.
Tất cả lá nằm cùng một tầng (perfectly balanced).
Keys trong mỗi node sắp xếp tăng dần.
Dữ liệu (hoặc pointer đến dữ liệu) có thể nằm ở cả internal nodes lẫn leaf nodes.

B-Tree Search — O(log_m n)

Tại mỗi node, tìm vị trí key bằng binary search trong mảng keys (nằm trọn trong 1 disk page), rồi đi xuống con tương ứng.

C++PythonJavaGo

cpp

struct BTreeNode {
    std::vector<int> keys;
    std::vector<BTreeNode*> children;
    bool isLeaf;
};

BTreeNode* search(BTreeNode* node, int key) {
    while (node != nullptr) {
        int i = 0;
        while (i < (int)node->keys.size() && key > node->keys[i])
            i++;

        if (i < (int)node->keys.size() && key == node->keys[i])
            return node;  // found

        if (node->isLeaf)
            return nullptr;  // not found

        node = node->children[i];
    }
    return nullptr;
}

python

class BTreeNode:
    def __init__(self, is_leaf: bool = False) -> None:
        self.keys: list[int] = []
        self.children: list[BTreeNode] = []
        self.is_leaf = is_leaf

def search(node: BTreeNode | None, key: int) -> BTreeNode | None:
    while node is not None:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1

        if i < len(node.keys) and key == node.keys[i]:
            return node

        if node.is_leaf:
            return None

        node = node.children[i]
    return None

java

class BTreeNode {
    int[] keys;
    BTreeNode[] children;
    int numKeys;
    boolean isLeaf;
}

BTreeNode search(BTreeNode node, int key) {
    while (node != null) {
        int i = 0;
        while (i < node.numKeys && key > node.keys[i]) i++;

        if (i < node.numKeys && key == node.keys[i])
            return node;

        if (node.isLeaf) return null;

        node = node.children[i];
    }
    return null;
}

type BTreeNode struct {
    keys     []int
    children []*BTreeNode
    isLeaf   bool
}

func search(node *BTreeNode, key int) *BTreeNode {
    for node != nil {
        i := 0
        for i < len(node.keys) && key > node.keys[i] {
            i++
        }
        if i < len(node.keys) && key == node.keys[i] {
            return node
        }
        if node.isLeaf {
            return nil
        }
        node = node.children[i]
    }
    return nil
}

B-Tree Insert — Tách node (Split)

Insert tuân theo quy trình: tìm lá phù hợp, chèn key. Nếu node đầy (m-1 keys) → tách thành hai node, đẩy key giữa lên cha. Nếu cha cũng đầy → tách tiếp (có thể lan lên đến root, làm cây cao thêm 1 tầng).

C++PythonJavaGo

cpp

// Tách node con children[i] đã đầy
void splitChild(BTreeNode* parent, int i, int order) {
    BTreeNode* fullChild = parent->children[i];
    int mid = order / 2;

    BTreeNode* newNode = new BTreeNode();
    newNode->isLeaf = fullChild->isLeaf;

    // Chuyển nửa sau keys sang node mới
    newNode->keys.assign(fullChild->keys.begin() + mid + 1,
                         fullChild->keys.end());
    if (!fullChild->isLeaf) {
        newNode->children.assign(fullChild->children.begin() + mid + 1,
                                 fullChild->children.end());
        fullChild->children.resize(mid + 1);
    }

    int midKey = fullChild->keys[mid];
    fullChild->keys.resize(mid);

    // Chèn midKey và newNode vào parent
    parent->keys.insert(parent->keys.begin() + i, midKey);
    parent->children.insert(parent->children.begin() + i + 1, newNode);
}

python

def split_child(parent: BTreeNode, i: int, order: int) -> None:
    full_child = parent.children[i]
    mid = order // 2

    new_node = BTreeNode(is_leaf=full_child.is_leaf)
    new_node.keys = full_child.keys[mid + 1:]

    if not full_child.is_leaf:
        new_node.children = full_child.children[mid + 1:]
        full_child.children = full_child.children[:mid + 1]

    mid_key = full_child.keys[mid]
    full_child.keys = full_child.keys[:mid]

    parent.keys.insert(i, mid_key)
    parent.children.insert(i + 1, new_node)

java

void splitChild(BTreeNode parent, int i, int order) {
    BTreeNode fullChild = parent.children[i];
    int mid = order / 2;

    BTreeNode newNode = new BTreeNode();
    newNode.isLeaf = fullChild.isLeaf;

    // Copy nửa sau sang newNode
    newNode.numKeys = fullChild.numKeys - mid - 1;
    System.arraycopy(fullChild.keys, mid + 1,
                     newNode.keys, 0, newNode.numKeys);
    if (!fullChild.isLeaf) {
        System.arraycopy(fullChild.children, mid + 1,
                         newNode.children, 0, newNode.numKeys + 1);
    }

    int midKey = fullChild.keys[mid];
    fullChild.numKeys = mid;

    // Shift parent keys/children và chèn
    for (int j = parent.numKeys; j > i; j--) {
        parent.keys[j] = parent.keys[j - 1];
        parent.children[j + 1] = parent.children[j];
    }
    parent.keys[i] = midKey;
    parent.children[i + 1] = newNode;
    parent.numKeys++;
}

func splitChild(parent *BTreeNode, i, order int) {
    fullChild := parent.children[i]
    mid := order / 2

    newNode := &BTreeNode{isLeaf: fullChild.isLeaf}
    newNode.keys = append([]int{}, fullChild.keys[mid+1:]...)

    if !fullChild.isLeaf {
        newNode.children = append([]*BTreeNode{},
            fullChild.children[mid+1:]...)
        fullChild.children = fullChild.children[:mid+1]
    }

    midKey := fullChild.keys[mid]
    fullChild.keys = fullChild.keys[:mid]

    // Chèn midKey và newNode vào parent
    parent.keys = append(parent.keys[:i],
        append([]int{midKey}, parent.keys[i:]...)...)
    parent.children = append(parent.children[:i+1],
        append([]*BTreeNode{newNode}, parent.children[i+1:]...)...)
}

B+ Tree — Khác biệt so với B-Tree

Tiêu chí	B-Tree	B+ Tree
Dữ liệu ở đâu	Cả internal lẫn leaf	Chỉ ở leaf
Internal nodes chứa	Keys + data pointers	Chỉ keys (làm "bảng chỉ dẫn")
Leaf nodes liên kết	Không	Có (linked list)
Range query	Phải duyệt cây (chậm)	Scan tuần tự qua lá (nhanh)
Fanout	Thấp hơn (node to hơn do chứa data)	Cao hơn (node nhẹ, chứa nhiều key)

B+ Tree là lựa chọn gần như tuyệt đối cho database index vì:

Fanout cao hơn → cây thấp hơn → ít disk seek hơn.
Range scan qua linked list các lá — sequential I/O nhanh gấp 100 lần random I/O trên HDD.
Full scan chỉ cần đọc leaf level — không phải đọc toàn bộ cây.

Thực chiến

Tình huống: Phân tích index PostgreSQL

Bối cảnh: Bảng orders với 100 triệu bản ghi. Column created_at có B+ Tree index. Cần hiểu tại sao SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31' nhanh.

Phân tích:

PostgreSQL B+ Tree index cho bảng orders (100M rows):

Page size: 8KB
Key size: 8 bytes (timestamp) + 6 bytes (pointer) ≈ 14 bytes
Keys per page: 8192 / 14 ≈ 585
Fanout ≈ 585

Chiều cao = ⌈log_585(100,000,000)⌉ = ⌈8 / 2.77⌉ ≈ 3 tầng

→ Tìm 1 record: 3 disk reads
→ Range scan 1 tháng (~8M rows): 3 reads để đến leaf đầu tiên,
  sau đó sequential scan qua ~14,000 leaf pages

PythonC++JavaGo

python

import math

def analyze_btree_index(
    total_rows: int,
    page_size: int = 8192,
    key_size: int = 8,
    pointer_size: int = 6,
) -> dict:
    entry_size = key_size + pointer_size
    fanout = page_size // entry_size
    height = math.ceil(math.log(total_rows) / math.log(fanout))
    leaf_pages = math.ceil(total_rows / fanout)

    return {
        "fanout": fanout,
        "height": height,
        "leaf_pages": leaf_pages,
        "point_query_reads": height,
        "full_scan_pages": leaf_pages,
    }

# PostgreSQL: 100M rows, 8KB page, timestamp key
stats = analyze_btree_index(100_000_000)
# {'fanout': 585, 'height': 3, 'leaf_pages': 170941,
#  'point_query_reads': 3, 'full_scan_pages': 170941}

cpp

#include <cmath>
#include <map>
#include <string>

std::map<std::string, int> analyzeBTreeIndex(
    long long totalRows,
    int pageSize = 8192,
    int keySize = 8,
    int pointerSize = 6
) {
    int entrySize = keySize + pointerSize;
    int fanout = pageSize / entrySize;
    int height = (int)std::ceil(std::log(totalRows) / std::log(fanout));
    long long leafPages = (totalRows + fanout - 1) / fanout;

    return {
        {"fanout", fanout},
        {"height", height},
        {"leaf_pages", (int)leafPages},
        {"point_query_reads", height}
    };
}

java

public class BTreeAnalyzer {
    public static void analyze(long totalRows, int pageSize,
                                int keySize, int pointerSize) {
        int entrySize = keySize + pointerSize;
        int fanout = pageSize / entrySize;
        int height = (int) Math.ceil(
            Math.log(totalRows) / Math.log(fanout));
        long leafPages = (totalRows + fanout - 1) / fanout;

        System.out.printf("Fanout: %d%n", fanout);
        System.out.printf("Height: %d%n", height);
        System.out.printf("Leaf pages: %d%n", leafPages);
        System.out.printf("Point query reads: %d%n", height);
    }
}

import "math"

type BTreeStats struct {
    Fanout, Height, LeafPages, PointQueryReads int
}

func AnalyzeBTreeIndex(totalRows int64, pageSize, keySize, pointerSize int) BTreeStats {
    entrySize := keySize + pointerSize
    fanout := pageSize / entrySize
    height := int(math.Ceil(
        math.Log(float64(totalRows)) / math.Log(float64(fanout))))
    leafPages := int((totalRows + int64(fanout) - 1) / int64(fanout))

    return BTreeStats{
        Fanout:          fanout,
        Height:          height,
        LeafPages:       leafPages,
        PointQueryReads: height,
    }
}

Phân tích:

Point query: 3 disk reads — thực tế root và level 1 thường nằm trong buffer cache → chỉ 1 physical read.
Range scan 1 tháng: sequential I/O qua ~14K pages ≈ 112MB. Trên SSD NVMe (~3GB/s) chỉ mất ~37ms. Trên HDD sequential (~200MB/s) mất ~560ms — vẫn chấp nhận được vì là sequential, không phải random seek.

Sai lầm điển hình

❌ Sai lầm 1: Nhầm B-Tree với Binary Search Tree

Vấn đề: B-Tree là multi-way balanced tree với fanout hàng trăm, không phải cây nhị phân (2 nhánh). Chữ "B" trong B-Tree KHÔNG viết tắt cho "Binary".

Tại sao sai: Thiết kế index dựa trên giả định binary tree → đánh giá sai chiều cao cây và số disk seeks.

❌ Sai lầm 2: Dùng B-Tree khi dữ liệu nằm hoàn toàn trong RAM

Vấn đề: B-Tree tối ưu cho disk I/O. Nếu dữ liệu đủ nhỏ để nằm trọn trong RAM, Red-Black Tree hoặc Skip List thường nhanh hơn vì cache-line utilization tốt hơn.

Tại sao sai: B-Tree node chứa nhiều keys → duyệt tuần tự trong node chậm hơn binary search trên RAM-resident BST. Ưu thế B-Tree chỉ phát huy khi cost của disk seek >> cost của so sánh trong node.

❌ Sai lầm 3: Bỏ qua fill factor khi estimate kích thước index

Vấn đề: B-Tree không luôn đầy 100%. PostgreSQL mặc định fill factor 90% cho B-Tree index. Estimate kích thước phải tính cả overhead này.

Tại sao sai: Đánh giá thấp dung lượng disk cho index → disk đầy bất ngờ trong production.

❌ Sai lầm 4: Tạo quá nhiều index

Vấn đề: Mỗi index là một B+ Tree riêng. Mỗi INSERT/UPDATE/DELETE phải cập nhật TẤT CẢ index trên bảng.

Tại sao sai: Đọc nhanh hơn nhưng ghi chậm hơn. Bảng với 10 index: mỗi INSERT phải cập nhật 10 B+ Tree. Write-heavy workload có thể chậm gấp 10 lần.

Under the Hood

Hiệu năng

Thao tác	B-Tree	B+ Tree	AVL Tree
Search	O(log_m n)	O(log_m n)	O(log₂ n)
Insert	O(log_m n)	O(log_m n)	O(log₂ n)
Delete	O(log_m n)	O(log_m n)	O(log₂ n)
Range query	O(log_m n + k)	O(log_m n + k) sequential	O(log₂ n + k)
Disk seeks	≈ log_m n	≈ log_m n	≈ log₂ n

Với m = 500, n = 10⁹: log₅₀₀(10⁹) ≈ 3.3 vs log₂(10⁹) ≈ 30. B-Tree giảm disk seeks 10 lần.

Nội bộ triển khai

Fanout và chiều cao: Chiều cao h = ⌈log_m(n)⌉. Với page size 8KB, key 8 bytes, pointer 6 bytes: fanout ≈ 585. Cho 10⁹ records: h ≈ 3. Đây là con số kỳ diệu — chỉ 3 disk reads cho bất kỳ lookup nào.

Disk page alignment: Mỗi B-Tree node chiếm đúng 1 disk page (thường 4KB hoặc 8KB). Một lần disk read đọc cả node. So sánh keys trong node là in-memory operation — nhanh hơn disk seek 5 bậc độ lớn.

SSD thay đổi phương trình: SSD random read ≈ 0.1ms (vs HDD 10ms). B-Tree vẫn chiến thắng BST trên SSD vì giảm amplification factor, nhưng khoảng cách hẹp hơn. LSM-Tree (dùng trong LevelDB, RocksDB) có thể phù hợp hơn cho write-heavy workload trên SSD.

B+ Tree range scan advantage: Sau khi tìm leaf đầu tiên bằng O(log_m n) seeks, range scan theo linked list là sequential I/O. Trên HDD, sequential throughput ≈ 200MB/s vs random ≈ 1MB/s — chênh lệch 200x. Đây là lý do B+ Tree thống trị database indexing.

Trade-offs

Tiêu chí	B+ Tree	LSM-Tree	Hash Index
Point lookup	✅ O(log_m n)	⚠️ O(log n) × levels	✅ O(1) amortized
Range query	✅ Sequential scan	⚠️ Merge nhiều levels	❌ Không hỗ trợ
Write throughput	⚠️ Random I/O	✅ Sequential writes	✅ O(1) amortized
Space amplification	Thấp	Cao (nhiều levels)	Thấp
Dùng trong	PostgreSQL, MySQL InnoDB	RocksDB, Cassandra, LevelDB	Memcached, Redis

Checklist ghi nhớ

✅ Checklist triển khai

Lý thuyết

[ ] B-Tree ≠ Binary Tree — chữ B không viết tắt cho Binary
[ ] Chiều cao = ⌈log_m(n)⌉ — m là fanout, thường hàng trăm
[ ] B+ Tree: dữ liệu chỉ ở lá, lá liên kết linked list
[ ] B+ Tree range scan = sequential I/O — nhanh gấp 100x random trên HDD

Database practice

[ ] Mỗi index = 1 B+ Tree riêng — cân nhắc write overhead
[ ] Fill factor không phải 100% — tính thêm 10-20% overhead
[ ] Root và internal levels thường trong buffer cache → physical reads ít hơn chiều cao cây
[ ] Composite index: thứ tự columns ảnh hưởng hiệu quả range query

Architecture

[ ] RAM-only workload: cân nhắc Red-Black Tree hoặc Skip List thay vì B-Tree
[ ] Write-heavy trên SSD: cân nhắc LSM-Tree (RocksDB) thay vì B+ Tree
[ ] Không index column có selectivity thấp (gender, boolean)

Bài tập luyện tập

Bài 1: Tính chiều cao B+ Tree — Foundation

Đề bài: Cho bảng 500 triệu bản ghi, page size 16KB, mỗi key 16 bytes, mỗi pointer 8 bytes. Tính fanout và chiều cao B+ Tree index.

🧠 Quiz

Câu hỏi: Một B+ Tree index trên bảng 10⁹ rows với fanout 500 có chiều cao xấp xỉ bao nhiêu?

[ ] A. 2
[x] B. 4
[ ] C. 10
[ ] D. 30 Giải thích: log₅₀₀(10⁹) = 9/2.7 ≈ 3.3, làm tròn lên = 4. Đáp án D (30) là chiều cao binary tree — sai concept. Đáp án A (2) chỉ đúng nếu fanout > 31622.

💡 Gợi ý

Fanout = page_size / (key_size + pointer_size).
Chiều cao = ⌈log_fanout(total_rows)⌉.

✅ Lời giải

python

import math

page_size = 16 * 1024  # 16KB
key_size = 16
pointer_size = 8
total_rows = 500_000_000

fanout = page_size // (key_size + pointer_size)  # 682
height = math.ceil(math.log(total_rows) / math.log(fanout))  # 4

print(f"Fanout: {fanout}")       # 682
print(f"Height: {height}")       # 4
print(f"Disk reads per lookup: {height}")  # 4

Phân tích: Fanout 682, chiều cao 4. Trên thực tế, root thường cache → 3 physical reads.

Bài 2: So sánh B+ Tree vs LSM-Tree — Advanced

Đề bài: Hệ thống logging nhận 100K writes/giây, cần query theo time range. Phân tích nên chọn B+ Tree (PostgreSQL) hay LSM-Tree (ClickHouse/RocksDB), giải thích trade-offs.

💡 Gợi ý

Write amplification: B+ Tree random writes vs LSM-Tree sequential writes.
Read amplification: B+ Tree 1 lookup vs LSM-Tree merge nhiều levels.
Space amplification: B+ Tree compact vs LSM-Tree multiple copies.

✅ Lời giải

Phân tích:

100K writes/s là write-heavy → LSM-Tree có lợi thế vì sequential writes.
Query theo time range → B+ Tree range scan tốt, nhưng LSM-Tree cũng hỗ trợ nếu dữ liệu sorted by time.
Đề xuất: LSM-Tree (ClickHouse cho analytics, RocksDB cho key-value) phù hợp hơn vì write throughput cao hơn, trade-off chấp nhận được là read chậm hơn một chút do phải merge levels.

Liên kết học tiếp

Từ khóa glossary: B-Tree, B+ Tree, database index, fanout, disk I/O, page, leaf node, internal node, fill factor, LSM-Tree

Tìm kiếm liên quan: cây B, chỉ mục cơ sở dữ liệu, PostgreSQL index, MySQL InnoDB, storage engine

B-Tree & B+ Tree — Kiến trúc sư tầng lưu trữ ​

Bức tranh tư duy ​

Cốt lõi kỹ thuật ​

Tại sao cây nhị phân thất bại trên disk? ​

B-Tree: Cấu trúc và tính chất ​

B-Tree Search — O(log_m n) ​

B-Tree Insert — Tách node (Split) ​

B+ Tree — Khác biệt so với B-Tree ​

Thực chiến ​

Tình huống: Phân tích index PostgreSQL ​

Sai lầm điển hình ​

❌ Sai lầm 1: Nhầm B-Tree với Binary Search Tree ​

❌ Sai lầm 2: Dùng B-Tree khi dữ liệu nằm hoàn toàn trong RAM ​

❌ Sai lầm 3: Bỏ qua fill factor khi estimate kích thước index ​

❌ Sai lầm 4: Tạo quá nhiều index ​

Under the Hood ​

Hiệu năng ​

Nội bộ triển khai ​

Trade-offs ​

Checklist ghi nhớ ​

Bài tập luyện tập ​

Bài 1: Tính chiều cao B+ Tree — Foundation ​

Bài 2: So sánh B+ Tree vs LSM-Tree — Advanced ​

Liên kết học tiếp ​

B-Tree & B+ Tree — Kiến trúc sư tầng lưu trữ

Bức tranh tư duy

Cốt lõi kỹ thuật

Tại sao cây nhị phân thất bại trên disk?

B-Tree: Cấu trúc và tính chất

B-Tree Search — O(log_m n)

B-Tree Insert — Tách node (Split)

B+ Tree — Khác biệt so với B-Tree

Thực chiến

Tình huống: Phân tích index PostgreSQL

Sai lầm điển hình

❌ Sai lầm 1: Nhầm B-Tree với Binary Search Tree

❌ Sai lầm 2: Dùng B-Tree khi dữ liệu nằm hoàn toàn trong RAM

❌ Sai lầm 3: Bỏ qua fill factor khi estimate kích thước index

❌ Sai lầm 4: Tạo quá nhiều index

Under the Hood

Hiệu năng

Nội bộ triển khai

Trade-offs

Checklist ghi nhớ

Bài tập luyện tập

Bài 1: Tính chiều cao B+ Tree — Foundation

Bài 2: So sánh B+ Tree vs LSM-Tree — Advanced

Liên kết học tiếp