Skip to content

Huffman Coding - The Compression Engine

"Mỗi lần bạn nén file ZIP, bạn đang chạy Huffman Coding." - HPN

Problem Statement

Nén dữ liệu lossless (không mất thông tin) bằng cách gán mã ngắn hơn cho ký tự xuất hiện nhiều hơn.

Input: "AAAAABBBCCCD"

Fixed-length encoding (3 bits each):
A=000, B=001, C=010, D=011
Total: 12 chars × 3 bits = 36 bits

Huffman encoding (variable-length):
A=1, B=01, C=001, D=000
Total: 5×1 + 3×2 + 3×3 + 1×3 = 5+6+9+3 = 23 bits

Compression ratio: 36% saved!

📘 Tại sao Huffman là Greedy?

Tại mỗi bước, chọn 2 node có frequency nhỏ nhất để merge. Đây là lựa chọn greedy → đẩy ký tự ít xuất hiện xuống sâu trong tree.

Real-World Applications

DomainUse CaseChi tiết
File CompressionZIP, GZIP, DEFLATELossless compression
Image FormatsJPEG, PNGPart of the compression pipeline
NetworkingHTTP/2, HPACKHeader compression
DatabasesColumn compressionStorage optimization
BioinformaticsDNA sequence compressionHigh-frequency base pairs

The Algorithm

Step 1: Build Frequency Table

"AAAAABBBCCCD"
→ A:5, B:3, C:3, D:1

Step 2: Build Min-Heap (Priority Queue)

Initial heap (by frequency):
  [D:1, B:3, C:3, A:5]

Step 3: Build Huffman Tree

Iteration 1: Pop D(1) and B(3), create internal node DB(4)
Heap: [C:3, DB:4, A:5]

Iteration 2: Pop C(3) and DB(4), create CDB(7)
Heap: [A:5, CDB:7]

Iteration 3: Pop A(5) and CDB(7), create root ACDB(12)
Heap: [ACDB:12] → Done!

Step 4: Assign Codes

        ACDB(12)
       /        \
      A(5)    CDB(7)
     [1]     /      \
           C(3)   DB(4)
          [01]   /    \
               D(1)  B(3)
              [001] [000]

Wait, let me recalculate...
Actually the order matters for which gets 0 vs 1
Final codes depend on implementation (left=0, right=1 convention)

Visualization

Implementation

python
import heapq
from typing import Dict, Tuple, Optional
from collections import Counter
from dataclasses import dataclass, field


@dataclass(order=True)
class HuffmanNode:
    """Node in Huffman Tree."""
    freq: int
    char: Optional[str] = field(compare=False)
    left: Optional['HuffmanNode'] = field(default=None, compare=False)
    right: Optional['HuffmanNode'] = field(default=None, compare=False)
    
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


class HuffmanCoding:
    """
    Huffman Coding implementation for lossless compression.
    
    Time Complexity:
    - Build tree: O(N log N) where N = unique characters
    - Encode: O(M) where M = input length
    - Decode: O(M)
    
    Space: O(N) for the tree and codes
    """
    
    def __init__(self):
        self.root: Optional[HuffmanNode] = None
        self.codes: Dict[str, str] = {}
        self.reverse_codes: Dict[str, str] = {}
    
    def build_tree(self, text: str) -> None:
        """
        Build Huffman tree from input text.
        
        Greedy approach:
        1. Create leaf node for each character
        2. Always merge two nodes with smallest frequency
        3. Repeat until one node remains (root)
        """
        # Count frequencies
        freq = Counter(text)
        
        # Create min-heap of leaf nodes
        heap = [HuffmanNode(f, c) for c, f in freq.items()]
        heapq.heapify(heap)
        
        # Edge case: single character
        if len(heap) == 1:
            node = heapq.heappop(heap)
            self.root = HuffmanNode(node.freq, None, node, None)
            self._generate_codes(self.root, "")
            return
        
        # Build tree by merging smallest nodes
        while len(heap) > 1:
            # Pop two smallest
            left = heapq.heappop(heap)
            right = heapq.heappop(heap)
            
            # Create internal node
            merged = HuffmanNode(
                freq=left.freq + right.freq,
                char=None,
                left=left,
                right=right
            )
            
            heapq.heappush(heap, merged)
        
        self.root = heap[0]
        self._generate_codes(self.root, "")
    
    def _generate_codes(self, node: HuffmanNode, code: str) -> None:
        """Generate codes by traversing tree (left=0, right=1)."""
        if node is None:
            return
        
        if node.is_leaf():
            # Assign code (handle edge case of single char)
            self.codes[node.char] = code if code else "0"
            self.reverse_codes[code if code else "0"] = node.char
            return
        
        self._generate_codes(node.left, code + "0")
        self._generate_codes(node.right, code + "1")
    
    def encode(self, text: str) -> str:
        """Encode text to binary string."""
        if not self.codes:
            self.build_tree(text)
        
        return ''.join(self.codes[c] for c in text)
    
    def decode(self, encoded: str) -> str:
        """Decode binary string back to text."""
        if not self.root:
            raise ValueError("Tree not built. Call build_tree first.")
        
        result = []
        current = self.root
        
        for bit in encoded:
            if bit == '0':
                current = current.left
            else:
                current = current.right
            
            if current.is_leaf():
                result.append(current.char)
                current = self.root
        
        return ''.join(result)
    
    def get_compression_ratio(self, text: str) -> Tuple[int, int, float]:
        """
        Calculate compression ratio.
        
        Returns: (original_bits, compressed_bits, ratio)
        """
        if not self.codes:
            self.build_tree(text)
        
        # Original: 8 bits per char (ASCII)
        original_bits = len(text) * 8
        
        # Compressed: variable-length codes
        compressed_bits = sum(len(self.codes[c]) for c in text)
        
        ratio = (1 - compressed_bits / original_bits) * 100
        
        return original_bits, compressed_bits, ratio
    
    def print_codes(self) -> None:
        """Print character codes sorted by code length."""
        sorted_codes = sorted(self.codes.items(), key=lambda x: (len(x[1]), x[0]))
        print("Character Codes:")
        for char, code in sorted_codes:
            display_char = repr(char) if char in ' \n\t' else char
            print(f"  {display_char}: {code}")


# ============================================
# PRODUCTION: FILE COMPRESSION
# ============================================

def compress_file(input_path: str, output_path: str) -> dict:
    """
    Compress a file using Huffman coding.
    
    Returns compression statistics.
    """
    with open(input_path, 'r', encoding='utf-8') as f:
        text = f.read()
    
    huffman = HuffmanCoding()
    encoded = huffman.encode(text)
    
    # Convert binary string to bytes for storage
    # Pad to make length multiple of 8
    padding = 8 - len(encoded) % 8
    encoded = encoded + '0' * padding
    
    # Convert to bytes
    byte_array = bytearray()
    for i in range(0, len(encoded), 8):
        byte_array.append(int(encoded[i:i+8], 2))
    
    # Save padding info and codes for decompression
    # (In real implementation, you'd serialize the tree or codes)
    
    with open(output_path, 'wb') as f:
        # First byte: padding amount
        f.write(bytes([padding]))
        f.write(byte_array)
    
    original, compressed, ratio = huffman.get_compression_ratio(text)
    
    return {
        'original_size': len(text),
        'original_bits': original,
        'compressed_bits': compressed,
        'compression_ratio': f"{ratio:.2f}%",
        'codes': huffman.codes
    }


# ============================================
# ANALYSIS: PREFIX-FREE PROPERTY
# ============================================

def verify_prefix_free(codes: Dict[str, str]) -> bool:
    """
    Verify that codes are prefix-free (no code is prefix of another).
    
    This is essential for unambiguous decoding!
    """
    code_list = sorted(codes.values(), key=len)
    
    for i, code in enumerate(code_list):
        for other in code_list[i+1:]:
            if other.startswith(code):
                return False
    
    return True


# ============================================
# USAGE EXAMPLE
# ============================================

if __name__ == "__main__":
    # Example 1: Basic compression
    print("=== Huffman Coding Demo ===\n")
    
    text = "AAAAABBBCCCD"
    print(f"Original text: '{text}'")
    print(f"Length: {len(text)} characters\n")
    
    huffman = HuffmanCoding()
    huffman.build_tree(text)
    huffman.print_codes()
    
    encoded = huffman.encode(text)
    print(f"\nEncoded: {encoded}")
    print(f"Encoded length: {len(encoded)} bits")
    
    decoded = huffman.decode(encoded)
    print(f"Decoded: '{decoded}'")
    print(f"Match: {decoded == text}")
    
    original, compressed, ratio = huffman.get_compression_ratio(text)
    print(f"\nCompression: {original} bits → {compressed} bits ({ratio:.1f}% saved)")
    
    # Verify prefix-free
    print(f"Prefix-free: {verify_prefix_free(huffman.codes)}")
    
    # Example 2: Real text
    print("\n=== Real Text Example ===\n")
    
    sample_text = """
    The quick brown fox jumps over the lazy dog.
    Pack my box with five dozen liquor jugs.
    How vexingly quick daft zebras jump!
    """
    
    huffman2 = HuffmanCoding()
    huffman2.build_tree(sample_text)
    
    print(f"Text length: {len(sample_text)} characters")
    encoded2 = huffman2.encode(sample_text)
    
    original2, compressed2, ratio2 = huffman2.get_compression_ratio(sample_text)
    print(f"Compression: {original2} bits → {compressed2} bits ({ratio2:.1f}% saved)")
    
    print("\nTop 5 most frequent characters:")
    freq = Counter(sample_text)
    for char, count in freq.most_common(5):
        display = repr(char) if char in ' \n\t' else char
        print(f"  {display}: {count} occurrences, code: {huffman2.codes[char]}")

Complexity Analysis

OperationTimeSpace
Build Frequency TableO(N)O(K)
Build HeapO(K log K)O(K)
Build TreeO(K log K)O(K)
EncodeO(N)O(N)
DecodeO(N)O(N)

N = input length, K = unique characters

The Greedy Proof

💡 Tại sao Greedy hoạt động?

Claim: Trong optimal tree, 2 ký tự có frequency nhỏ nhất phải là anh em (siblings) ở độ sâu lớn nhất.

Proof by exchange: Nếu không, swap chúng với siblings ở sâu nhất → cost giảm → mâu thuẫn với optimal.

Do đó, merge 2 node nhỏ nhất trước là an toàn!

Key Properties

PropertyExplanation
Prefix-freeKhông code nào là prefix của code khác
OptimalMinimizes expected code length
LosslessDecode chính xác input gốc
AdaptiveCodes phụ thuộc vào frequency thực tế

Huffman vs Other Compression

AlgorithmTypeUse Case
HuffmanLossless, statisticalPart of DEFLATE (ZIP)
LZWLossless, dictionaryGIF, old Unix compress
LZ77/LZ78Lossless, sliding windowGZIP, PNG
ArithmeticLossless, more optimalJPEG 2000, H.264
JPEGLossy (DCT + Huffman)Photos

💡 HPN's Rule

"Huffman tối ưu cho symbol-by-symbol encoding. Thực tế thường kết hợp với LZ77 (dictionary) để đạt hiệu quả cao hơn."