Trie — Cây tiền tố

Khi bạn gõ ba ký tự đầu vào ô tìm kiếm Google và nhận được 10 gợi ý trong dưới 50ms, đằng sau đó không phải một phép duyệt qua hàng tỷ cụm từ. Google sử dụng biến thể của Trie (Finite State Transducer) kết hợp ranking model để trả kết quả autocomplete gần như tức thì. Tương tự, mỗi khi router Cisco xử lý packet, nó tra bảng routing — một Patricia Trie — để tìm longest prefix match trong vài nanosecond.

Trie (đọc là "try", viết tắt từ reTRIEval) là cấu trúc dữ liệu cây mà ở đó mỗi cạnh biểu diễn một ký tự, mỗi đường đi từ root biểu diễn một prefix. Các từ có chung prefix chia sẻ đường đi — và chính property này làm Trie vượt trội HashMap ở bài toán prefix search: HashMap phải quét toàn bộ key để tìm prefix, Trie chỉ cần đi theo đường dẫn O(L).

Nắm vững Trie là tiên quyết để hiểu Aho-Corasick — thuật toán multi-pattern matching đang chạy trong mọi hệ thống antivirus và IDS hiện đại. Aho-Corasick chính là Trie được bổ sung failure links.

Bức tranh tư duy

Hình dung một danh bạ điện thoại cũ được tổ chức theo bảng chữ cái. Muốn tìm tất cả người họ "Nguyễn Văn", bạn lật đến mục "N", rồi "Ng", rồi "Ngu"... — mỗi bước thu hẹp không gian tìm kiếm. Bạn không bao giờ phải duyệt qua tất cả tên trong danh bạ.

Trie hoạt động chính xác như vậy. Mỗi node là một "mục lục", mỗi cạnh là một ký tự. Đi theo đường dẫn p → y → t đưa bạn đến prefix "pyt", từ đó DFS xuống thu được "python", "pytorch", "pythonic".

text

INSERT: "python", "pytorch", "pip", "pandas"

            (root)
           /   |   \
          p    ...
         / \
        y    i    a
        |    |    |
        t    p    n
       / \        |
      h   o       d
      |   |       |
      o   r       a
      |   |       |
      n*  c       s*
          |
          h*

  * = end_of_word

Autocomplete("py") → đi đến node 'y' → DFS
  → "python", "pytorch"    (chỉ duyệt subtree, không duyệt toàn bộ)

📌 Khi nào analogy bị phá vỡ

Danh bạ điện thoại sắp theo thứ tự. Trie thì không — thứ tự duyệt phụ thuộc cách lưu children (map vs array). Nếu cần kết quả có thứ tự, phải sort sau khi DFS hoặc dùng sorted map cho children.

Cốt lõi kỹ thuật

Cấu trúc TrieNode

Mỗi node lưu:

children: map từ ký tự sang node con
is_end_of_word: đánh dấu từ hoàn chỉnh
metadata (tùy chọn): frequency, priority cho ranking

Các thao tác cơ bản

Insert O(L): Duyệt từng ký tự, tạo node nếu chưa tồn tại, đánh dấu end.

Search O(L): Duyệt theo ký tự. Nếu đến cuối pattern và is_end_of_word = true → tìm thấy.

Prefix Search O(L + K): Duyệt đến cuối prefix, DFS thu thập K từ trong subtree.

Delete O(L): Unmark is_end_of_word. Có thể prune node nếu không có children.

C++PythonJavaGo

cpp

#include <unordered_map>
#include <string>
#include <vector>
#include <algorithm>

struct TrieNode {
    std::unordered_map<char, TrieNode*> children;
    bool isEnd = false;
    int frequency = 0;

    ~TrieNode() {
        for (auto& [c, child] : children)
            delete child;
    }
};

class Trie {
    TrieNode* root;

public:
    Trie() : root(new TrieNode()) {}
    ~Trie() { delete root; }

    void insert(const std::string& word, int freq = 1) {
        TrieNode* cur = root;
        for (char c : word) {
            if (!cur->children.count(c))
                cur->children[c] = new TrieNode();
            cur = cur->children[c];
        }
        cur->isEnd = true;
        cur->frequency += freq;
    }

    bool search(const std::string& word) const {
        TrieNode* node = findNode(word);
        return node && node->isEnd;
    }

    bool startsWith(const std::string& prefix) const {
        return findNode(prefix) != nullptr;
    }

    std::vector<std::string> autocomplete(
        const std::string& prefix, int topN = 10) const {
        TrieNode* node = findNode(prefix);
        if (!node) return {};

        std::vector<std::pair<int, std::string>> results;
        dfsCollect(node, prefix, results);

        std::sort(results.begin(), results.end(),
            [](auto& a, auto& b) { return a.first > b.first; });

        std::vector<std::string> out;
        for (int i = 0; i < std::min(topN, (int)results.size()); i++)
            out.push_back(results[i].second);
        return out;
    }

    bool remove(const std::string& word) {
        TrieNode* node = findNode(word);
        if (!node || !node->isEnd) return false;
        node->isEnd = false;
        node->frequency = 0;
        return true;
    }

private:
    TrieNode* findNode(const std::string& prefix) const {
        TrieNode* cur = root;
        for (char c : prefix) {
            auto it = cur->children.find(c);
            if (it == cur->children.end()) return nullptr;
            cur = it->second;
        }
        return cur;
    }

    void dfsCollect(TrieNode* node, const std::string& word,
        std::vector<std::pair<int, std::string>>& results) const {
        if (node->isEnd)
            results.emplace_back(node->frequency, word);
        for (auto& [c, child] : node->children)
            dfsCollect(child, word + c, results);
    }
};

python

from typing import List, Optional, Dict

class TrieNode:
    __slots__ = ('children', 'is_end', 'frequency')

    def __init__(self) -> None:
        self.children: Dict[str, 'TrieNode'] = {}
        self.is_end: bool = False
        self.frequency: int = 0


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, freq: int = 1) -> None:
        if not word:
            return
        cur = self.root
        for ch in word:
            if ch not in cur.children:
                cur.children[ch] = TrieNode()
            cur = cur.children[ch]
        cur.is_end = True
        cur.frequency += freq

    def search(self, word: str) -> bool:
        node = self._find_node(word)
        return node is not None and node.is_end

    def starts_with(self, prefix: str) -> bool:
        return self._find_node(prefix) is not None

    def autocomplete(self, prefix: str, top_n: int = 10) -> List[str]:
        node = self._find_node(prefix)
        if node is None:
            return []
        results: List[tuple] = []
        self._dfs_collect(node, prefix, results)
        results.sort(key=lambda x: -x[0])
        return [word for _, word in results[:top_n]]

    def delete(self, word: str) -> bool:
        node = self._find_node(word)
        if node is None or not node.is_end:
            return False
        node.is_end = False
        node.frequency = 0
        return True

    def _find_node(self, prefix: str) -> Optional[TrieNode]:
        cur = self.root
        for ch in prefix:
            if ch not in cur.children:
                return None
            cur = cur.children[ch]
        return cur

    def _dfs_collect(
        self, node: TrieNode, word: str, results: List[tuple]
    ) -> None:
        if node.is_end:
            results.append((node.frequency, word))
        for ch, child in node.children.items():
            self._dfs_collect(child, word + ch, results)

java

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isEnd = false;
    int frequency = 0;
}

public class Trie {
    private final TrieNode root = new TrieNode();

    public void insert(String word, int freq) {
        TrieNode cur = root;
        for (char c : word.toCharArray()) {
            cur.children.putIfAbsent(c, new TrieNode());
            cur = cur.children.get(c);
        }
        cur.isEnd = true;
        cur.frequency += freq;
    }

    public boolean search(String word) {
        TrieNode node = findNode(word);
        return node != null && node.isEnd;
    }

    public boolean startsWith(String prefix) {
        return findNode(prefix) != null;
    }

    public List<String> autocomplete(String prefix, int topN) {
        TrieNode node = findNode(prefix);
        if (node == null) return Collections.emptyList();

        List<int[]> freqs = new ArrayList<>();
        List<String> words = new ArrayList<>();
        dfsCollect(node, new StringBuilder(prefix), freqs, words);

        Integer[] indices = new Integer[words.size()];
        for (int i = 0; i < indices.length; i++) indices[i] = i;
        Arrays.sort(indices, (a, b) -> freqs.get(b)[0] - freqs.get(a)[0]);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, indices.length); i++)
            result.add(words.get(indices[i]));
        return result;
    }

    private TrieNode findNode(String prefix) {
        TrieNode cur = root;
        for (char c : prefix.toCharArray()) {
            if (!cur.children.containsKey(c)) return null;
            cur = cur.children.get(c);
        }
        return cur;
    }

    private void dfsCollect(TrieNode node, StringBuilder sb,
            List<int[]> freqs, List<String> words) {
        if (node.isEnd) {
            freqs.add(new int[]{node.frequency});
            words.add(sb.toString());
        }
        for (var entry : node.children.entrySet()) {
            sb.append(entry.getKey());
            dfsCollect(entry.getValue(), sb, freqs, words);
            sb.deleteCharAt(sb.length() - 1);
        }
    }
}

type TrieNode struct {
    Children  map[byte]*TrieNode
    IsEnd     bool
    Frequency int
}

func NewTrieNode() *TrieNode {
    return &TrieNode{Children: make(map[byte]*TrieNode)}
}

type Trie struct {
    Root *TrieNode
}

func NewTrie() *Trie {
    return &Trie{Root: NewTrieNode()}
}

func (t *Trie) Insert(word string, freq int) {
    cur := t.Root
    for i := 0; i < len(word); i++ {
        c := word[i]
        if _, ok := cur.Children[c]; !ok {
            cur.Children[c] = NewTrieNode()
        }
        cur = cur.Children[c]
    }
    cur.IsEnd = true
    cur.Frequency += freq
}

func (t *Trie) Search(word string) bool {
    node := t.findNode(word)
    return node != nil && node.IsEnd
}

func (t *Trie) StartsWith(prefix string) bool {
    return t.findNode(prefix) != nil
}

func (t *Trie) Autocomplete(prefix string, topN int) []string {
    node := t.findNode(prefix)
    if node == nil {
        return nil
    }
    type entry struct {
        freq int
        word string
    }
    var results []entry
    var dfs func(*TrieNode, string)
    dfs = func(n *TrieNode, w string) {
        if n.IsEnd {
            results = append(results, entry{n.Frequency, w})
        }
        for c, child := range n.Children {
            dfs(child, w+string(c))
        }
    }
    dfs(node, prefix)

    sort.Slice(results, func(i, j int) bool {
        return results[i].freq > results[j].freq
    })
    var out []string
    for i := 0; i < len(results) && i < topN; i++ {
        out = append(out, results[i].word)
    }
    return out
}

func (t *Trie) findNode(prefix string) *TrieNode {
    cur := t.Root
    for i := 0; i < len(prefix); i++ {
        child, ok := cur.Children[prefix[i]]
        if !ok {
            return nil
        }
        cur = child
    }
    return cur
}

Compressed Trie (Radix Tree)

Trie chuẩn lãng phí memory khi chuỗi node chỉ có 1 child. Compressed Trie gộp các chuỗi đơn lẻ thành 1 edge:

text

Standard Trie:               Compressed Trie (Radix):
     (root)                       (root)
      |                            |
      r                          "romane"──*
      |                           /
      o                      "ulus"──*
      |
      m
     / \
    a    u
    |    |
    n    l
    |    |
    e*   u
         |
         s*

Nodes giảm từ 8 → 3. Memory cải thiện đáng kể cho data thưa.

Ứng dụng: Linux kernel radix tree cho page cache, IP routing tables.

Thực chiến

Tình huống: Autocomplete Service cho Search Bar

Bối cảnh: E-commerce platform với 2 triệu sản phẩm. Search bar cần autocomplete trong < 20ms, trả về top 5 gợi ý theo popularity.

Mục tiêu: Build autocomplete service, handle 10.000 QPS, memory < 500MB.

PythonC++

python

from typing import List, Tuple
import heapq

class AutocompleteService:
    """Production autocomplete với frequency-based ranking."""

    def __init__(self) -> None:
        self.trie = Trie()
        self._total_words = 0

    def index_products(self, products: List[Tuple[str, int]]) -> None:
        """Đánh index danh sách sản phẩm (tên, lượt tìm)."""
        for name, search_count in products:
            normalized = name.lower().strip()
            if normalized:
                self.trie.insert(normalized, search_count)
                self._total_words += 1

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        """Trả về top suggestions cho prefix."""
        if not prefix or len(prefix) < 2:
            return []  # Yêu cầu >= 2 ký tự
        normalized = prefix.lower().strip()
        return self.trie.autocomplete(normalized, limit)

    def update_frequency(self, word: str, delta: int = 1) -> None:
        """Cập nhật frequency khi user chọn gợi ý."""
        node = self.trie._find_node(word.lower())
        if node and node.is_end:
            node.frequency += delta

cpp

#include <string>
#include <vector>
#include <algorithm>

class AutocompleteService {
    Trie trie;

public:
    void indexProducts(
        const std::vector<std::pair<std::string, int>>& products) {
        for (auto& [name, count] : products) {
            std::string norm = name;
            std::transform(norm.begin(), norm.end(), norm.begin(), ::tolower);
            if (!norm.empty())
                trie.insert(norm, count);
        }
    }

    std::vector<std::string> suggest(
        const std::string& prefix, int limit = 5) {
        if (prefix.size() < 2) return {};
        std::string norm = prefix;
        std::transform(norm.begin(), norm.end(), norm.begin(), ::tolower);
        return trie.autocomplete(norm, limit);
    }
};

Phân tích:

Tại sao Trie, không HashMap: HashMap không hỗ trợ prefix search native. Phải quét toàn bộ 2M key → O(n × L) per query vs O(L + K) với Trie
Memory: 2M sản phẩm, trung bình 20 ký tự, ~50% prefix sharing → ~20M nodes × 100 bytes ≈ 2GB raw. Cần compressed trie hoặc array-based children
Scaling: Trie nằm hoàn toàn trong memory → single server latency < 1ms. Horizontal scale bằng sharding theo prefix range

Sai lầm điển hình

❌ Sai lầm 1: Dùng Trie cho dataset nhỏ

Vấn đề: Overhead Trie không đáng cho vài trăm entries.

python

# SAI: 100 words, Trie là overkill
trie = Trie()
for w in small_words:  # len = 100
    trie.insert(w)
results = trie.autocomplete("py")

Tại sao sai: Trie node overhead ~100-200 bytes. Với 100 từ × 10 ký tự = 1000 nodes ≈ 100-200KB. Trong khi list + startswith() chỉ tốn vài KB và nhanh hơn do cache locality.

python

# ĐÚNG: linear scan cho dataset nhỏ
results = sorted(
    [w for w in small_words if w.startswith("py")],
    key=lambda w: -frequency[w]
)[:10]

❌ Sai lầm 2: Không normalize input

Vấn đề: Case sensitivity tạo duplicate entries.

python

# SAI: "Python" và "python" là 2 đường đi khác nhau
trie.insert("Python")
trie.insert("python")  # duplicate!
trie.search("PYTHON")  # False — user confused

Tại sao sai: Trong production, user gõ mọi case. Không normalize = autocomplete trả thiếu kết quả hoặc duplicate.

python

# ĐÚNG: normalize trước khi insert và search
def insert(self, word: str) -> None:
    word = word.lower().strip()  # normalize
    # ... insert logic

❌ Sai lầm 3: Insert random strings không shared prefix

Vấn đề: Trie memory explodes với data không có prefix locality.

python

# SAI: UUID không share prefix → mỗi ký tự tạo node mới
import uuid
for _ in range(1_000_000):
    trie.insert(str(uuid.uuid4()))
# 36 chars × 1M = 36M nodes × 200 bytes = 7.2 GB !!!

Tại sao sai: Trie chỉ tiết kiệm memory khi data có prefix overlap. Random strings → zero sharing → memory tệ hơn HashMap.

python

# ĐÚNG: dùng hash set cho random unique strings
id_set = set()
for uid in uuids:
    id_set.add(uid)  # O(1) lookup, compact memory

Under the Hood

Complexity analysis

Thao tác	Time	Space (auxiliary)	Ghi chú
Insert	O(L)	O(L) worst	L = độ dài từ
Search	O(L)	O(1)	Không allocate
Delete	O(L)	O(1)	Chỉ unmark flag
Autocomplete	O(L + K)	O(K)	K = số kết quả
Build Trie	O(n × L)	O(n × L × Σ)	n = số từ, Σ = alphabet size

Memory layout thực tế

text

Mỗi TrieNode (Python dict-based):
  ┌─────────────────────────────────┐
  │ children: dict (56 bytes empty) │
  │   + 72 bytes per entry          │
  │ is_end: bool (28 bytes)         │
  │ frequency: int (28 bytes)       │
  │ __dict__: (104 bytes)           │  ← loại bỏ bằng __slots__
  └─────────────────────────────────┘
  Tổng: ~216 bytes/node (dict) → ~112 bytes/node (__slots__)

Optimization options:
  1. __slots__        → giảm ~40% memory
  2. Array children   → fixed 26×8 = 208 bytes thay vì dict overhead
  3. Compressed trie  → gộp chuỗi 1-child, giảm node count
  4. Double-array     → 2 arrays, compact, nhưng phức tạp insert

Trie vs HashMap vs Sorted Array

Tiêu chí	Trie	HashMap	Sorted Array
Exact search	O(L)	O(L) avg	O(L × log n)
Prefix search	O(L + K) ✅	O(n × L) ❌	O(L × log n + K)
Insert	O(L)	O(L) amortized	O(n) shift
Memory	High (pointer overhead)	Medium	Low (compact)
Cache performance	Poor (pointer chasing)	Fair	Excellent
Ordered iteration	✅ DFS	❌	✅ Native

Trade-offs

Quyết định	Trie chọn	Đánh đổi
Per-character branching	O(L) exact lookup	Memory overhead per node
Shared prefix	Tiết kiệm memory cho data có locality	Vô dụng cho random strings
Tree structure	Native prefix search	Cache-unfriendly (pointer chasing)
Unbounded alphabet	Flexible (Unicode, binary)	Memory per node tỷ lệ alphabet

Checklist ghi nhớ

✅ Checklist triển khai

Cấu trúc dữ liệu

[ ] Mỗi node lưu: children map, is_end_of_word, metadata tùy chọn
[ ] Dùng __slots__ (Python) hoặc struct (C/Go) để giảm memory overhead
[ ] Root luôn là node rỗng — không biểu diễn ký tự nào

Thao tác cơ bản

[ ] Insert O(L): tạo node khi chưa tồn tại, đánh dấu end
[ ] Search O(L): duyệt theo ký tự, check is_end ở cuối
[ ] Prefix search O(L + K): duyệt đến prefix node, DFS thu thập
[ ] Delete: unmark is_end, tùy chọn prune nodes lá không cần

Khi nào dùng / không dùng

[ ] ✅ Dùng khi: prefix search, autocomplete, dictionary, IP routing
[ ] ❌ Không dùng khi: dataset < 1000 items, random strings, memory-constrained
[ ] Prefix locality = Trie hiệu quả. Không locality = HashMap tốt hơn

Production

[ ] Normalize input (lowercase, trim) trước mọi thao tác
[ ] Compressed trie cho data thưa (giảm memory 3-10×)
[ ] Frequency tracking cho ranking autocomplete results
[ ] Benchmark memory usage trên data thực — Trie có thể ngốn RAM bất ngờ

Bài tập luyện tập

Bài 1: Implement Trie cơ bản — Foundation

Đề bài: Implement Trie hỗ trợ 3 thao tác: insert(word), search(word), startsWith(prefix). Test với: insert "apple", "app", "application".

🧠 Quiz

Câu hỏi kiểm tra: Sau khi insert "apple" và "app", search("app") trả về gì?

[x] A. True — vì "app" đã được insert riêng
[ ] B. False — vì "app" chỉ là prefix của "apple"
[ ] C. Error — vì "app" là substring
[ ] D. Undefined — tùy implementation Giải thích: Cả "apple" và "app" đều được insert. Node tại 'p' thứ hai có is_end = True (từ insert "app"). search kiểm tra is_end, không chỉ sự tồn tại của path.

💡 Gợi ý

Insert "apple": tạo path a → p → p → l → e, đánh dấu 'e' là end
Insert "app": duyệt a → p → p (đã tồn tại), đánh dấu 'p' thứ hai là end
search("app"): duyệt đến 'p' thứ hai, check is_end → True

✅ Lời giải

python

trie = Trie()
trie.insert("apple")
trie.insert("app")
trie.insert("application")

assert trie.search("apple") == True
assert trie.search("app") == True
assert trie.search("ap") == False         # chưa insert
assert trie.starts_with("ap") == True     # prefix tồn tại
assert trie.starts_with("xyz") == False

Phân tích: Insert O(L) cho mỗi từ. "apple" và "app" share 3 nodes đầu → tiết kiệm memory.

Bài 2: Autocomplete với Ranking — Intermediate

Đề bài: Cho danh sách sản phẩm với số lượt search: [("iphone 15", 50000), ("iphone 14", 30000), ("ipad pro", 20000), ("ipad air", 15000), ("imac", 10000)]. Implement autocomplete trả top 3 theo frequency.

🧠 Quiz

Câu hỏi kiểm tra: autocomplete("ip", top_n=3) trả về thứ tự nào?

[x] A. ["iphone 15", "iphone 14", "ipad pro"]
[ ] B. ["ipad air", "ipad pro", "iphone 14"]
[ ] C. ["imac", "ipad air", "ipad pro"]
[ ] D. ["iphone 14", "iphone 15", "ipad pro"] Giải thích: Prefix "ip" match tất cả 4 sản phẩm bắt đầu bằng "ip". Sort theo frequency: 50000, 30000, 20000, 15000. Top 3: "iphone 15", "iphone 14", "ipad pro".

✅ Lời giải

python

trie = Trie()
products = [
    ("iphone 15", 50000), ("iphone 14", 30000),
    ("ipad pro", 20000), ("ipad air", 15000), ("imac", 10000)
]
for name, freq in products:
    trie.insert(name, freq)

assert trie.autocomplete("ip", 3) == ["iphone 15", "iphone 14", "ipad pro"]
assert trie.autocomplete("iph", 2) == ["iphone 15", "iphone 14"]
assert trie.autocomplete("im", 1) == ["imac"]

Phân tích: DFS thu thập O(K) kết quả, sort O(K log K), trả top N. Bottleneck: K lớn (nhiều sản phẩm cùng prefix) → dùng min-heap thay sort toàn bộ.

Liên kết học tiếp

Từ khóa glossary: Trie, prefix tree, cây tiền tố, autocomplete, radix tree, compressed trie, Patricia trie, longest prefix match

Tìm kiếm liên quan: cây tiền tố, gợi ý tìm kiếm, IP routing table, dictionary data structure, prefix matching

Trie — Cây tiền tố ​

Bức tranh tư duy ​

Cốt lõi kỹ thuật ​

Cấu trúc TrieNode ​

Các thao tác cơ bản ​

Compressed Trie (Radix Tree) ​

Thực chiến ​

Tình huống: Autocomplete Service cho Search Bar ​

Sai lầm điển hình ​

❌ Sai lầm 1: Dùng Trie cho dataset nhỏ ​

❌ Sai lầm 2: Không normalize input ​

❌ Sai lầm 3: Insert random strings không shared prefix ​

Under the Hood ​

Complexity analysis ​

Memory layout thực tế ​

Trie vs HashMap vs Sorted Array ​

Trade-offs ​

Checklist ghi nhớ ​

Bài tập luyện tập ​

Bài 1: Implement Trie cơ bản — Foundation ​

Bài 2: Autocomplete với Ranking — Intermediate ​

Liên kết học tiếp ​

Trie — Cây tiền tố

Bức tranh tư duy

Cốt lõi kỹ thuật

Cấu trúc TrieNode

Các thao tác cơ bản

Compressed Trie (Radix Tree)

Thực chiến

Tình huống: Autocomplete Service cho Search Bar

Sai lầm điển hình

❌ Sai lầm 1: Dùng Trie cho dataset nhỏ

❌ Sai lầm 2: Không normalize input

❌ Sai lầm 3: Insert random strings không shared prefix

Under the Hood

Complexity analysis

Memory layout thực tế

Trie vs HashMap vs Sorted Array

Trade-offs

Checklist ghi nhớ

Bài tập luyện tập

Bài 1: Implement Trie cơ bản — Foundation

Bài 2: Autocomplete với Ranking — Intermediate

Liên kết học tiếp