HyperLogLog - Counting to a Billion
"Count 1 billion unique users with just 12 KB of RAM." - HPN
Definition
HyperLogLog (HLL) is an algorithm that estimates the cardinality (number of unique elements) of a set with:
- Space: O(1) - constant memory (~12 KB even for 1 billion items)
- Accuracy: ~98% - standard error 1.04/√m, i.e. ~0.81% with the default 16384 buckets
📘 Cardinality Problem
"How many unique visitors hit the website today?"
With 1 billion visits, a HashSet needs 8 GB of RAM. HyperLogLog needs 12 KB of RAM with ~98% accuracy.
Why HyperLogLog?
The Problem: Counting Unique Views
Facebook needs to count unique viewers for every post:
| Approach | Memory (1B users) | Accuracy |
|---|---|---|
| HashSet | 8 GB | 100% |
| Bitmap | 125 MB | 100% |
| HyperLogLog | 12 KB | ~98% |
🎯 SCALE
Facebook has millions of posts. HashSet: 1 million × 8 GB = 8 PB. HyperLogLog: 1 million × 12 KB = 12 GB.
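The arithmetic behind these figures can be checked directly. A back-of-envelope sketch (the HashSet figure counts only an 8-byte slot per entry, so it is a lower bound):

```python
# Back-of-envelope memory for 1 billion unique user IDs
n = 1_000_000_000
hashset_bytes = n * 8          # 8-byte hash/pointer per entry (lower bound)
bitmap_bytes = n // 8          # one bit per possible user ID
hll_bytes = 16_384 * 6 // 8    # 16384 buckets × 6 bits each ≈ 12 KB

print(f"HashSet ≥ {hashset_bytes / 1e9:.0f} GB")
print(f"Bitmap  = {bitmap_bytes / 1e6:.0f} MB")
print(f"HLL     = {hll_bytes / 1e3:.0f} KB")
```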
The Coin Flip Analogy
Core Intuition
Think of it like this:
- Flip a coin: the probability of 1 head is 50%
- Flip again: the probability of 2 consecutive heads is 25%
- The probability of k consecutive heads is (1/2)^k
Insight: If you observe 5 consecutive heads, you have probably flipped around 2^5 = 32 times.
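This intuition can be checked with a quick simulation: treat each random 64-bit value as a hashed element, whose leading zero bits play the role of a run of identical coin flips. A rough sketch (the single-max estimate is intentionally noisy):

```python
import random

def leading_zeros(x: int, bits: int = 64) -> int:
    """Leading zeros of x viewed as a fixed-width integer."""
    return bits - x.bit_length()

random.seed(42)
n = 100_000
# Max leading zeros over n random "hashes" should land near log2(n) ≈ 16.6
max_lz = max(leading_zeros(random.getrandbits(64)) for _ in range(n))
print(f"n = {n:,}  max leading zeros = {max_lz}  single-max estimate ≈ {2**max_lz:,}")
```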
Applying It to Counting
- Hash each element into a binary string
- Count the leading zeros in the hash
- If the max number of leading zeros is k, estimate the cardinality ≈ 2^k
hash("user_123") = "00001101..." → 4 leading zeros
hash("user_456") = "10110001..." → 0 leading zeros
hash("user_789") = "00000011..." → 6 leading zeros ← MAX
Estimated cardinality ≈ 2^6 = 64 unique users

Improvement: Bucketing (Stochastic Averaging)
A single max value is far too noisy. The fix: split the hash space into many buckets.
With m = 1024 buckets:
1. The first log2(1024) = 10 bits of the hash → bucket index
2. The remaining bits → count leading zeros
3. Each bucket stores the MAX leading zeros seen
4. Estimate = bias-corrected harmonic mean of 2^(bucket value) over all buckets

The Math
Raw Estimate

E = α_m · m² / ( Σ_{j=1}^{m} 2^(−M_j) )

where:
- m: number of buckets (typically 16384)
- M_j: max leading zeros recorded in bucket j
- α_m: bias correction constant ≈ 0.7213 for large m

Standard Error

SE = 1.04 / √m

With m = 16384 → SE ≈ 0.81%.
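A quick check of the error formula across a few precision values (this reproduces the 0.81% figure used throughout):

```python
import math

# Standard error 1.04 / sqrt(m) for m = 2^p buckets
for p in (4, 10, 14, 16):
    m = 1 << p
    print(f"precision={p:2d}  buckets={m:6d}  std error = {1.04 / math.sqrt(m):.2%}")
```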
Production Implementation
```python
# HPN Engineering Standard
# Implementation: HyperLogLog Cardinality Estimator
from typing import Any
import math
import hashlib


class HyperLogLog:
    """
    Probabilistic cardinality estimator using constant memory.

    Properties:
    - Space: O(1) - ~16 KB for 16384 one-byte buckets (12 KB with 6-bit packing)
    - Accuracy: standard error ~0.81% with 16384 buckets
    - Supports merging multiple HLLs

    Use Cases:
    - Reddit/Facebook: Unique post viewers
    - Redis PFCOUNT: Native HLL implementation
    - Data pipelines: Distinct count in streaming
    """

    def __init__(self, precision: int = 14) -> None:
        """
        Initialize HyperLogLog.

        Args:
            precision: Number of bits for bucket index (4-16).
                Higher = more buckets = more accuracy = more memory.
                Default 14 = 16384 buckets = ~0.81% error.
        """
        if not 4 <= precision <= 16:
            raise ValueError("Precision must be between 4 and 16")
        self.precision = precision
        self.num_buckets = 1 << precision  # 2^precision
        self.buckets = [0] * self.num_buckets
        # Bias correction constant
        if self.num_buckets >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_buckets)
        elif self.num_buckets == 64:
            self.alpha = 0.709
        elif self.num_buckets == 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673

    def add(self, item: Any) -> None:
        """
        Add an item to the estimator.

        Time: O(1)
        """
        # Hash item to 64-bit integer
        hash_value = self._hash(item)
        # First 'precision' bits = bucket index
        bucket_index = hash_value >> (64 - self.precision)
        # Remaining bits = count leading zeros
        # (mask keeps the shifted value within 64 bits)
        remaining = (hash_value << self.precision) & ((1 << 64) - 1)
        leading_zeros = self._count_leading_zeros(remaining) + 1
        # Update bucket with max value
        self.buckets[bucket_index] = max(
            self.buckets[bucket_index],
            leading_zeros
        )

    def count(self) -> int:
        """
        Estimate cardinality (unique count).

        Returns:
            Estimated number of unique items
        """
        # Raw harmonic mean estimate
        indicator = sum(2.0 ** (-bucket) for bucket in self.buckets)
        raw_estimate = self.alpha * (self.num_buckets ** 2) / indicator
        # Small range correction: with many empty buckets,
        # linear counting beats the raw estimate
        if raw_estimate <= 2.5 * self.num_buckets:
            zero_buckets = self.buckets.count(0)
            if zero_buckets > 0:
                raw_estimate = (
                    self.num_buckets *
                    math.log(self.num_buckets / zero_buckets)
                )
        # Note: the original paper's large-range correction applies to
        # 32-bit hashes only; with a 64-bit hash it is unnecessary (and
        # would hit a math domain error past 2^32).
        return int(raw_estimate)

    def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
        """
        Merge two HLLs (for distributed counting).

        Returns:
            New HLL representing union of both
        """
        if self.precision != other.precision:
            raise ValueError("Cannot merge HLLs with different precision")
        merged = HyperLogLog(self.precision)
        for i in range(self.num_buckets):
            merged.buckets[i] = max(self.buckets[i], other.buckets[i])
        return merged

    def _hash(self, item: Any) -> int:
        """Hash item to 64-bit integer."""
        item_bytes = str(item).encode('utf-8')
        # Use first 8 bytes of SHA-256
        hash_bytes = hashlib.sha256(item_bytes).digest()[:8]
        return int.from_bytes(hash_bytes, byteorder='big')

    def _count_leading_zeros(self, value: int) -> int:
        """Count leading zeros in a 64-bit value."""
        if value == 0:
            return 64 - self.precision
        zeros = 0
        mask = 1 << 63
        while (value & mask) == 0 and zeros < (64 - self.precision):
            zeros += 1
            mask >>= 1
        return zeros

    def memory_usage_bytes(self) -> int:
        """Return (notional) memory usage in bytes."""
        # Each bucket needs only 6 bits (max value <= 64),
        # but we count one byte per bucket for simplicity
        return self.num_buckets

    def standard_error(self) -> float:
        """Return theoretical standard error."""
        return 1.04 / math.sqrt(self.num_buckets)

    def __repr__(self) -> str:
        return (
            f"HyperLogLog(precision={self.precision}, "
            f"buckets={self.num_buckets}, "
            f"memory={self.memory_usage_bytes()} bytes, "
            f"error={self.standard_error()*100:.2f}%)"
        )


# ============================================
# USAGE EXAMPLE: Counting Unique Visitors
# ============================================
if __name__ == "__main__":
    import random

    # Create HLL with default precision (14 = 16384 buckets)
    hll = HyperLogLog(precision=14)
    print(f"Created: {hll}")
    print(f"Memory: {hll.memory_usage_bytes() / 1024:.2f} KB")
    print(f"Expected error: {hll.standard_error()*100:.2f}%")

    # Simulate unique visitors; some visitors come multiple times
    actual_unique = 100_000
    for _ in range(500_000):
        visitor_id = f"user_{random.randint(1, actual_unique)}"
        hll.add(visitor_id)

    estimated = hll.count()
    error = abs(estimated - actual_unique) / actual_unique * 100
    print("\n=== Results ===")
    print(f"Actual unique visitors: {actual_unique:,}")
    print(f"Estimated by HLL: {estimated:,}")
    print(f"Error: {error:.2f}%")

    # Demonstrate merging (distributed counting)
    print("\n=== Distributed Counting ===")
    hll1 = HyperLogLog(precision=14)
    hll2 = HyperLogLog(precision=14)
    # Server 1 sees users 1-70,000
    for i in range(1, 70_001):
        hll1.add(f"user_{i}")
    # Server 2 sees users 30,001-100,000 (overlap!)
    for i in range(30_001, 100_001):
        hll2.add(f"user_{i}")
    # Merge both
    merged = hll1.merge(hll2)
    print(f"Server 1 estimate: {hll1.count():,}")
    print(f"Server 2 estimate: {hll2.count():,}")
    print(f"Merged estimate: {merged.count():,}")
    print("Actual unique: 100,000")

    # Memory comparison
    print("\n=== Memory Comparison (1 billion users) ===")
    hashset_size = 1_000_000_000 * 8  # 8 bytes per pointer
    hll_size = 16384  # 16 KB at one byte per bucket
    print(f"HashSet: {hashset_size / (1024**3):.2f} GB")
    print(f"HyperLogLog: {hll_size / 1024:.2f} KB")
    print(f"Savings: {hashset_size / hll_size:,.0f}x less memory")
```

Redis PFADD/PFCOUNT
Redis has a native HyperLogLog implementation:

```redis
# Add items
PFADD visitors user1 user2 user3

# Count unique
PFCOUNT visitors
# → (integer) 3

# Merge multiple HLLs
PFMERGE total_visitors visitors_day1 visitors_day2 visitors_day3
```

💡 Redis HLL
Redis HLL uses at most 12 KB per key, regardless of how many elements are added.
Real-World Applications
| System | Use Case |
|---|---|
| Redis | PFADD/PFCOUNT commands |
| Apache Spark | approx_count_distinct() |
| Google BigQuery | APPROX_COUNT_DISTINCT() |
| Facebook | Unique views on billions of posts |
| Presto | Distinct count in data lake queries |
Anti-patterns
❌ AVOID
1. Using it for exact counting

```python
# ❌ WRONG: Banking - exact numbers are required
unique_transactions = hll.count()
if unique_transactions > 1:
    flag_fraud()  # A ~2% error rate can miss fraud!

# ✅ RIGHT: Use it for analytics and dashboards
# "Approximately 1.2M unique users today"
```

2. Precision set too low

```python
# ❌ WRONG: precision=4 → only 16 buckets → ~26% error!
hll = HyperLogLog(precision=4)

# ✅ RIGHT: precision=14 is the sweet spot
hll = HyperLogLog(precision=14)  # 16384 buckets, ~0.81% error
```

3. Forgetting that merge order doesn't matter

```python
# ✅ GOOD: HLL merge is commutative & associative
# Merging in any order gives the same result
total = hll1.merge(hll2).merge(hll3)
# = hll3.merge(hll1).merge(hll2)
```

Algorithm Comparison
| Algorithm | Memory | Accuracy | Delete? | Merge? |
|---|---|---|---|---|
| HashSet | O(N) | 100% | ✅ | ❌ |
| Bitmap | O(max_value) | 100% | ✅ | ✅ |
| Bloom Filter | O(N/8) | ~99% | ❌ | ✅ |
| HyperLogLog | O(1) | ~98% | ❌ | ✅ |
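As a rough, interpreter-dependent illustration of the O(N) vs O(1) memory rows above (a sketch, not a benchmark):

```python
import sys

# CPython-specific comparison: exact set vs a fixed HLL bucket array
n = 100_000
exact = {f"user_{i}" for i in range(n)}
hll_buckets = bytearray(16384)  # one byte per bucket, as in the implementation above

# Set storage grows with n (container + the string objects it holds)
set_bytes = sys.getsizeof(exact) + sum(sys.getsizeof(s) for s in exact)
print(f"Exact set:   ~{set_bytes / 1024:,.0f} KB for {n:,} items")
print(f"HLL buckets:  {sys.getsizeof(hll_buckets) / 1024:.0f} KB regardless of n")
```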
💡 HPN's Rule
"HyperLogLog is magic: constant memory at infinite scale. But only use it when ~2% error is acceptable."