Giao diện
Aho-Corasick - Multi-Pattern Search Engine
"Khi bạn cần tìm 10,000 patterns cùng lúc, Aho-Corasick là câu trả lời duy nhất." - HPN
Định nghĩa
Aho-Corasick là thuật toán multi-pattern string matching, kết hợp:
- Trie - Để lưu trữ tất cả patterns
- Failure Links - Để tránh backtracking khi mismatch
📘 Automaton Theory
Aho-Corasick là một DFA (Deterministic Finite Automaton) hoàn chỉnh:
- States = Nodes trong Trie
- Transitions = Character edges + Failure links
- Accept states = End-of-word nodes
Tại sao cần Aho-Corasick?
Bài toán: Bad Word Filter
Filter 5,000 từ cấm trong chat:
| Approach | Complexity | 1 triệu messages |
|---|---|---|
| Brute-force | ~8 giờ | |
| Aho-Corasick | ~5 giây |
Mô hình hóa: Failure Links
Patterns: ["he", "she", "his", "hers"]
Failure Link Logic:
she → he: "he" là suffix của "she"sh → h: "h" là suffix của "sh"
Production Implementation
python
# HPN Engineering Standard
# Implementation: Aho-Corasick Multi-Pattern Matcher
from typing import List, Dict, Optional, Set, Tuple
from collections import deque
class AhoCorasickNode:
__slots__ = ('children', 'failure', 'output', 'depth')
def __init__(self, depth: int = 0) -> None:
self.children: Dict[str, 'AhoCorasickNode'] = {}
self.failure: Optional['AhoCorasickNode'] = None
self.output: Set[str] = set()
self.depth = depth
class AhoCorasick:
"""
Multi-pattern string matching.
Time: Build O(M), Search O(N + Z)
Space: O(M × SIGMA)
Use Cases: IDS (Snort), bad word filter, antivirus
"""
def __init__(self, patterns: Optional[List[str]] = None) -> None:
self.root = AhoCorasickNode(depth=0)
self.root.failure = self.root
self._pattern_count = 0
self._is_built = False
if patterns:
for pattern in patterns:
self.add_pattern(pattern)
self.build()
def add_pattern(self, pattern: str) -> None:
if not pattern:
return
self._is_built = False
current = self.root
for char in pattern:
if char not in current.children:
current.children[char] = AhoCorasickNode(current.depth + 1)
current = current.children[char]
current.output.add(pattern)
self._pattern_count += 1
def build(self) -> None:
"""Build failure links using BFS."""
if self._is_built:
return
queue = deque()
for child in self.root.children.values():
child.failure = self.root
queue.append(child)
while queue:
current = queue.popleft()
for char, child in current.children.items():
queue.append(child)
failure_node = current.failure
while failure_node is not self.root and char not in failure_node.children:
failure_node = failure_node.failure
if char in failure_node.children and failure_node.children[char] is not child:
child.failure = failure_node.children[char]
else:
child.failure = self.root
child.output = child.output.union(child.failure.output)
self._is_built = True
def search(self, text: str) -> List[Tuple[int, str]]:
"""Search all pattern occurrences."""
if not self._is_built:
self.build()
if not text:
return []
results: List[Tuple[int, str]] = []
current = self.root
for i, char in enumerate(text):
while current is not self.root and char not in current.children:
current = current.failure
if char in current.children:
current = current.children[char]
for pattern in current.output:
results.append((i, pattern))
return results
def contains_any(self, text: str) -> bool:
"""Quick check if text contains any pattern."""
if not self._is_built:
self.build()
current = self.root
for char in text:
while current is not self.root and char not in current.children:
current = current.failure
if char in current.children:
current = current.children[char]
if current.output:
return True
return FalseĐộ phức tạp
| Phase | Complexity |
|---|---|
| Build Trie | |
| Build Failure Links | |
| Search |
Ứng dụng Thực tế
| System | Use Case |
|---|---|
| Snort IDS | Match attack signatures in packets |
| fgrep | grep -F -f patterns.txt |
| ClamAV | Virus signature scanning |
Anti-patterns
❌ TRÁNH
- Quên
build()sau khi add patterns - Dùng cho 1 pattern → Dùng KMP
- Case mismatch → Normalize cả patterns và text
💡 HPN's Rule
"10+ patterns? Aho-Corasick. Không bàn cãi."