Skip to content

Aho-Corasick - Multi-Pattern Search Engine

"Khi bạn cần tìm 10,000 patterns cùng lúc, Aho-Corasick là câu trả lời duy nhất." - HPN

Định nghĩa

Aho-Corasick là thuật toán multi-pattern string matching, kết hợp:

  1. Trie - Để lưu trữ tất cả patterns
  2. Failure Links - Để tránh backtracking khi mismatch

📘 Automaton Theory

Aho-Corasick là một DFA (Deterministic Finite Automaton) hoàn chỉnh:

  • States = Nodes trong Trie
  • Transitions = Character edges + Failure links
  • Accept states = End-of-word nodes

Tại sao cần Aho-Corasick?

Bài toán: Bad Word Filter

Filter 5,000 từ cấm trong chat:

ApproachComplexity1 triệu messages
Brute-forceO(N×M×K)~8 giờ
Aho-CorasickO(N+M+Z)~5 giây

Patterns: ["he", "she", "his", "hers"]

Failure Link Logic:

  • she → he: "he" là suffix của "she"
  • sh → h: "h" là suffix của "sh"

Production Implementation

python
# HPN Engineering Standard
# Implementation: Aho-Corasick Multi-Pattern Matcher

from typing import List, Dict, Optional, Set, Tuple
from collections import deque


class AhoCorasickNode:
    __slots__ = ('children', 'failure', 'output', 'depth')
    
    def __init__(self, depth: int = 0) -> None:
        self.children: Dict[str, 'AhoCorasickNode'] = {}
        self.failure: Optional['AhoCorasickNode'] = None
        self.output: Set[str] = set()
        self.depth = depth


class AhoCorasick:
    """
    Multi-pattern string matching.
    
    Time: Build O(M), Search O(N + Z)
    Space: O(M × SIGMA)
    
    Use Cases: IDS (Snort), bad word filter, antivirus
    """
    
    def __init__(self, patterns: Optional[List[str]] = None) -> None:
        self.root = AhoCorasickNode(depth=0)
        self.root.failure = self.root
        self._pattern_count = 0
        self._is_built = False
        
        if patterns:
            for pattern in patterns:
                self.add_pattern(pattern)
            self.build()
    
    def add_pattern(self, pattern: str) -> None:
        if not pattern:
            return
        
        self._is_built = False
        current = self.root
        
        for char in pattern:
            if char not in current.children:
                current.children[char] = AhoCorasickNode(current.depth + 1)
            current = current.children[char]
        
        current.output.add(pattern)
        self._pattern_count += 1
    
    def build(self) -> None:
        """Build failure links using BFS."""
        if self._is_built:
            return
        
        queue = deque()
        
        for child in self.root.children.values():
            child.failure = self.root
            queue.append(child)
        
        while queue:
            current = queue.popleft()
            
            for char, child in current.children.items():
                queue.append(child)
                failure_node = current.failure
                
                while failure_node is not self.root and char not in failure_node.children:
                    failure_node = failure_node.failure
                
                if char in failure_node.children and failure_node.children[char] is not child:
                    child.failure = failure_node.children[char]
                else:
                    child.failure = self.root
                
                child.output = child.output.union(child.failure.output)
        
        self._is_built = True
    
    def search(self, text: str) -> List[Tuple[int, str]]:
        """Search all pattern occurrences."""
        if not self._is_built:
            self.build()
        
        if not text:
            return []
        
        results: List[Tuple[int, str]] = []
        current = self.root
        
        for i, char in enumerate(text):
            while current is not self.root and char not in current.children:
                current = current.failure
            
            if char in current.children:
                current = current.children[char]
            
            for pattern in current.output:
                results.append((i, pattern))
        
        return results
    
    def contains_any(self, text: str) -> bool:
        """Quick check if text contains any pattern."""
        if not self._is_built:
            self.build()
        
        current = self.root
        for char in text:
            while current is not self.root and char not in current.children:
                current = current.failure
            if char in current.children:
                current = current.children[char]
            if current.output:
                return True
        return False

Độ phức tạp

PhaseComplexity
Build TrieO(M)
Build Failure LinksO(M)
SearchO(N+Z)

Ứng dụng Thực tế

SystemUse Case
Snort IDSMatch attack signatures in packets
fgrepgrep -F -f patterns.txt
ClamAVVirus signature scanning

Anti-patterns

❌ TRÁNH

  1. Quên build() sau khi add patterns
  2. Dùng cho 1 pattern → Dùng KMP
  3. Case mismatch → Normalize cả patterns và text

💡 HPN's Rule

"10+ patterns? Aho-Corasick. Không bàn cãi."