Giao diện
Huffman Coding - The Compression Engine
"Mỗi lần bạn nén file ZIP, bạn đang chạy Huffman Coding." - HPN
Problem Statement
Nén dữ liệu lossless (không mất thông tin) bằng cách gán mã ngắn hơn cho ký tự xuất hiện nhiều hơn.
Input: "AAAAABBBCCCD"
Fixed-length encoding (3 bits each):
A=000, B=001, C=010, D=011
Total: 12 chars × 3 bits = 36 bits
Huffman encoding (variable-length):
A=1, B=01, C=001, D=000
Total: 5×1 + 3×2 + 3×3 + 1×3 = 5+6+9+3 = 23 bits
Compression ratio: 36% saved!📘 Tại sao Huffman là Greedy?
Tại mỗi bước, chọn 2 node có frequency nhỏ nhất để merge. Đây là lựa chọn greedy → đẩy ký tự ít xuất hiện xuống sâu trong tree.
Real-World Applications
| Domain | Use Case | Chi tiết |
|---|---|---|
| File Compression | ZIP, GZIP, DEFLATE | Lossless compression |
| Image Formats | JPEG, PNG | Part of the compression pipeline |
| Networking | HTTP/2, HPACK | Header compression |
| Databases | Column compression | Storage optimization |
| Bioinformatics | DNA sequence compression | High-frequency base pairs |
The Algorithm
Step 1: Build Frequency Table
"AAAAABBBCCCD"
→ A:5, B:3, C:3, D:1Step 2: Build Min-Heap (Priority Queue)
Initial heap (by frequency):
[D:1, B:3, C:3, A:5]Step 3: Build Huffman Tree
Iteration 1: Pop D(1) and B(3), create internal node DB(4)
Heap: [C:3, DB:4, A:5]
Iteration 2: Pop C(3) and DB(4), create CDB(7)
Heap: [A:5, CDB:7]
Iteration 3: Pop A(5) and CDB(7), create root ACDB(12)
Heap: [ACDB:12] → Done!Step 4: Assign Codes
ACDB(12)
/ \
A(5) CDB(7)
[1] / \
C(3) DB(4)
[01] / \
D(1) B(3)
[001] [000]
Wait, let me recalculate...
Actually the order matters for which gets 0 vs 1
Final codes depend on implementation (left=0, right=1 convention)Visualization
Implementation
python
import heapq
from typing import Dict, Tuple, Optional
from collections import Counter
from dataclasses import dataclass, field
@dataclass(order=True)
class HuffmanNode:
"""Node in Huffman Tree."""
freq: int
char: Optional[str] = field(compare=False)
left: Optional['HuffmanNode'] = field(default=None, compare=False)
right: Optional['HuffmanNode'] = field(default=None, compare=False)
def is_leaf(self) -> bool:
return self.left is None and self.right is None
class HuffmanCoding:
"""
Huffman Coding implementation for lossless compression.
Time Complexity:
- Build tree: O(N log N) where N = unique characters
- Encode: O(M) where M = input length
- Decode: O(M)
Space: O(N) for the tree and codes
"""
def __init__(self):
self.root: Optional[HuffmanNode] = None
self.codes: Dict[str, str] = {}
self.reverse_codes: Dict[str, str] = {}
def build_tree(self, text: str) -> None:
"""
Build Huffman tree from input text.
Greedy approach:
1. Create leaf node for each character
2. Always merge two nodes with smallest frequency
3. Repeat until one node remains (root)
"""
# Count frequencies
freq = Counter(text)
# Create min-heap of leaf nodes
heap = [HuffmanNode(f, c) for c, f in freq.items()]
heapq.heapify(heap)
# Edge case: single character
if len(heap) == 1:
node = heapq.heappop(heap)
self.root = HuffmanNode(node.freq, None, node, None)
self._generate_codes(self.root, "")
return
# Build tree by merging smallest nodes
while len(heap) > 1:
# Pop two smallest
left = heapq.heappop(heap)
right = heapq.heappop(heap)
# Create internal node
merged = HuffmanNode(
freq=left.freq + right.freq,
char=None,
left=left,
right=right
)
heapq.heappush(heap, merged)
self.root = heap[0]
self._generate_codes(self.root, "")
def _generate_codes(self, node: HuffmanNode, code: str) -> None:
"""Generate codes by traversing tree (left=0, right=1)."""
if node is None:
return
if node.is_leaf():
# Assign code (handle edge case of single char)
self.codes[node.char] = code if code else "0"
self.reverse_codes[code if code else "0"] = node.char
return
self._generate_codes(node.left, code + "0")
self._generate_codes(node.right, code + "1")
def encode(self, text: str) -> str:
"""Encode text to binary string."""
if not self.codes:
self.build_tree(text)
return ''.join(self.codes[c] for c in text)
def decode(self, encoded: str) -> str:
"""Decode binary string back to text."""
if not self.root:
raise ValueError("Tree not built. Call build_tree first.")
result = []
current = self.root
for bit in encoded:
if bit == '0':
current = current.left
else:
current = current.right
if current.is_leaf():
result.append(current.char)
current = self.root
return ''.join(result)
def get_compression_ratio(self, text: str) -> Tuple[int, int, float]:
"""
Calculate compression ratio.
Returns: (original_bits, compressed_bits, ratio)
"""
if not self.codes:
self.build_tree(text)
# Original: 8 bits per char (ASCII)
original_bits = len(text) * 8
# Compressed: variable-length codes
compressed_bits = sum(len(self.codes[c]) for c in text)
ratio = (1 - compressed_bits / original_bits) * 100
return original_bits, compressed_bits, ratio
def print_codes(self) -> None:
"""Print character codes sorted by code length."""
sorted_codes = sorted(self.codes.items(), key=lambda x: (len(x[1]), x[0]))
print("Character Codes:")
for char, code in sorted_codes:
display_char = repr(char) if char in ' \n\t' else char
print(f" {display_char}: {code}")
# ============================================
# PRODUCTION: FILE COMPRESSION
# ============================================
def compress_file(input_path: str, output_path: str) -> dict:
"""
Compress a file using Huffman coding.
Returns compression statistics.
"""
with open(input_path, 'r', encoding='utf-8') as f:
text = f.read()
huffman = HuffmanCoding()
encoded = huffman.encode(text)
# Convert binary string to bytes for storage
# Pad to make length multiple of 8
padding = 8 - len(encoded) % 8
encoded = encoded + '0' * padding
# Convert to bytes
byte_array = bytearray()
for i in range(0, len(encoded), 8):
byte_array.append(int(encoded[i:i+8], 2))
# Save padding info and codes for decompression
# (In real implementation, you'd serialize the tree or codes)
with open(output_path, 'wb') as f:
# First byte: padding amount
f.write(bytes([padding]))
f.write(byte_array)
original, compressed, ratio = huffman.get_compression_ratio(text)
return {
'original_size': len(text),
'original_bits': original,
'compressed_bits': compressed,
'compression_ratio': f"{ratio:.2f}%",
'codes': huffman.codes
}
# ============================================
# ANALYSIS: PREFIX-FREE PROPERTY
# ============================================
def verify_prefix_free(codes: Dict[str, str]) -> bool:
"""
Verify that codes are prefix-free (no code is prefix of another).
This is essential for unambiguous decoding!
"""
code_list = sorted(codes.values(), key=len)
for i, code in enumerate(code_list):
for other in code_list[i+1:]:
if other.startswith(code):
return False
return True
# ============================================
# USAGE EXAMPLE
# ============================================
if __name__ == "__main__":
# Example 1: Basic compression
print("=== Huffman Coding Demo ===\n")
text = "AAAAABBBCCCD"
print(f"Original text: '{text}'")
print(f"Length: {len(text)} characters\n")
huffman = HuffmanCoding()
huffman.build_tree(text)
huffman.print_codes()
encoded = huffman.encode(text)
print(f"\nEncoded: {encoded}")
print(f"Encoded length: {len(encoded)} bits")
decoded = huffman.decode(encoded)
print(f"Decoded: '{decoded}'")
print(f"Match: {decoded == text}")
original, compressed, ratio = huffman.get_compression_ratio(text)
print(f"\nCompression: {original} bits → {compressed} bits ({ratio:.1f}% saved)")
# Verify prefix-free
print(f"Prefix-free: {verify_prefix_free(huffman.codes)}")
# Example 2: Real text
print("\n=== Real Text Example ===\n")
sample_text = """
The quick brown fox jumps over the lazy dog.
Pack my box with five dozen liquor jugs.
How vexingly quick daft zebras jump!
"""
huffman2 = HuffmanCoding()
huffman2.build_tree(sample_text)
print(f"Text length: {len(sample_text)} characters")
encoded2 = huffman2.encode(sample_text)
original2, compressed2, ratio2 = huffman2.get_compression_ratio(sample_text)
print(f"Compression: {original2} bits → {compressed2} bits ({ratio2:.1f}% saved)")
print("\nTop 5 most frequent characters:")
freq = Counter(sample_text)
for char, count in freq.most_common(5):
display = repr(char) if char in ' \n\t' else char
print(f" {display}: {count} occurrences, code: {huffman2.codes[char]}")Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| Build Frequency Table | O(N) | O(K) |
| Build Heap | O(K log K) | O(K) |
| Build Tree | O(K log K) | O(K) |
| Encode | O(N) | O(N) |
| Decode | O(N) | O(N) |
N = input length, K = unique characters
The Greedy Proof
💡 Tại sao Greedy hoạt động?
Claim: Trong optimal tree, 2 ký tự có frequency nhỏ nhất phải là anh em (siblings) ở độ sâu lớn nhất.
Proof by exchange: Nếu không, swap chúng với siblings ở sâu nhất → cost giảm → mâu thuẫn với optimal.
Do đó, merge 2 node nhỏ nhất trước là an toàn!
Key Properties
| Property | Explanation |
|---|---|
| Prefix-free | Không code nào là prefix của code khác |
| Optimal | Minimizes expected code length |
| Lossless | Decode chính xác input gốc |
| Adaptive | Codes phụ thuộc vào frequency thực tế |
Huffman vs Other Compression
| Algorithm | Type | Use Case |
|---|---|---|
| Huffman | Lossless, statistical | Part of DEFLATE (ZIP) |
| LZW | Lossless, dictionary | GIF, old Unix compress |
| LZ77/LZ78 | Lossless, sliding window | GZIP, PNG |
| Arithmetic | Lossless, more optimal | JPEG 2000, H.264 |
| JPEG | Lossy (DCT + Huffman) | Photos |
💡 HPN's Rule
"Huffman tối ưu cho symbol-by-symbol encoding. Thực tế thường kết hợp với LZ77 (dictionary) để đạt hiệu quả cao hơn."