
HyperLogLog - Counting to a Billion

"Đếm 1 tỷ unique users với chỉ 12 KB RAM." - HPN

Definition

HyperLogLog (HLL) is an algorithm for estimating the cardinality (number of unique elements) of a set with:

  • Space: O(1) - constant memory (~12 KB for 1 billion items)
  • Accuracy: standard error 1.04/√m, ≈ 0.81% with the default 16384 buckets

📘 Cardinality Problem

"How many unique visitors hit the website today?"

With 1 billion visits, a HashSet needs about 8 GB of RAM. HyperLogLog needs about 12 KB of RAM at ~98% accuracy.

Why HyperLogLog?

The Problem: Counting Unique Views

Facebook needs to count unique viewers for every post:

Approach       Memory (1B users)   Accuracy
HashSet        8 GB                100%
Bitmap         125 MB              100%
HyperLogLog    12 KB               ~98%

🎯 SCALE

Facebook has millions of posts. HashSet: 1 million × 8 GB = 8 Petabytes. HyperLogLog: 1 million × 12 KB = 12 GB.

The Coin Flip Analogy

Core Intuition

Think of it this way:

  • Flip a coin: the probability of 1 head is 50%
  • Flip again: the probability of 2 consecutive heads is 25%
  • The probability of k consecutive heads is 1/2^k

Insight: If you observe 5 consecutive heads, you have probably flipped around 32 times (2^5).

Applying It to Counting

  1. Hash each element into a binary string
  2. Count the leading zeros in the hash
  3. If the max leading-zero count is ρ, estimate cardinality ≈ 2^ρ
hash("user_123") = "00001101..." → 4 leading zeros
hash("user_456") = "10110001..." → 0 leading zeros
hash("user_789") = "00000011..." → 6 leading zeros ← MAX

Estimated cardinality ≈ 2^6 = 64 unique users
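The crude single-register estimator above can be sketched in a few lines. This is illustrative only; `crude_estimate` and `leading_zeros_64` are made-up helper names, not part of any library:

```python
import hashlib

def leading_zeros_64(x: int) -> int:
    # Leading zeros of x viewed as a 64-bit integer
    return 64 - x.bit_length() if x else 64

def crude_estimate(items) -> int:
    # Single-register estimate: 2^(max leading zeros observed)
    max_rho = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, byteorder="big")
        max_rho = max(max_rho, leading_zeros_64(h))
    return 2 ** max_rho
```

A single register can only ever output a power of two and is extremely noisy, which is exactly what motivates the bucketing refinement.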

Improvement: Bucketing (Stochastic Averaging)

A single max value is far too noisy. Solution: split the work across many buckets.

With m = 1024 buckets:
1. Take the first log2(1024) = 10 bits of the hash → bucket index
2. Count leading zeros in the remaining bits
3. Each bucket stores the MAX leading-zero count it has seen
4. Estimate = bias-corrected harmonic mean of 2^(bucket value) across all buckets
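The first two steps above can be sketched as follows (a minimal illustration; `bucket_and_rho` is a hypothetical helper name, and p = 10 matches the m = 1024 example):

```python
def bucket_and_rho(hash64: int, p: int = 10) -> tuple[int, int]:
    # Step 1: top p bits of the 64-bit hash -> bucket index
    index = hash64 >> (64 - p)
    # Step 2: shift the used bits out, keeping a 64-bit window
    rest = (hash64 << p) & ((1 << 64) - 1)
    # rho = leading zeros + 1 (position of the first 1-bit)
    rho = (64 - rest.bit_length() + 1) if rest else (64 - p + 1)
    return index, rho
```

Each bucket then keeps the max rho it has seen (step 3); step 4 combines all buckets into one estimate.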

The Math

Raw Estimate

E = α_m · m² · ( Σ_{j=1}^{m} 2^(−M_j) )^(−1)

Where:

  • m: number of buckets (typically 2^14 = 16384)
  • M_j: max leading-zero count in bucket j
  • α_m: bias-correction constant ≈ 0.7213

Standard Error

σ = 1.04 / √m

With m = 16384 → σ ≈ 0.81%, so estimates typically land within about 1% of the true count.
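A quick sketch of how precision trades memory for error, assuming 1 byte per bucket as in the implementation below:

```python
import math

# stderr = 1.04 / sqrt(m); memory ~ m bytes at 1 byte per bucket
for p in (4, 10, 14, 16):
    m = 1 << p
    stderr = 1.04 / math.sqrt(m)
    print(f"precision={p:2d}  buckets={m:5d}  "
          f"memory≈{m / 1024:6.2f} KB  stderr≈{stderr * 100:.2f}%")
```

Note how quickly the error collapses: each +2 in precision quadruples the buckets and halves the standard error.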

Production Implementation

python
# HPN Engineering Standard
# Implementation: HyperLogLog Cardinality Estimator

from typing import Any
import math
import hashlib


class HyperLogLog:
    """
    Probabilistic cardinality estimator using constant memory.
    
    Properties:
        - Space: O(1) - ~12 KB for 16384 buckets
        - Accuracy: standard error ~0.81% (typically within ~1%)
        - Supports merging multiple HLLs
    
    Use Cases:
        - Reddit/Facebook: Unique post viewers
        - Redis PFCOUNT: Native HLL implementation
        - Data pipelines: Distinct count in streaming
    """
    
    def __init__(self, precision: int = 14) -> None:
        """
        Initialize HyperLogLog.
        
        Args:
            precision: Number of bits for bucket index (4-16)
                       Higher = more buckets = more accuracy = more memory
                       Default 14 = 16384 buckets = ~12KB = ~0.81% error
        """
        if not 4 <= precision <= 16:
            raise ValueError("Precision must be between 4 and 16")
        
        self.precision = precision
        self.num_buckets = 1 << precision  # 2^precision
        self.buckets = [0] * self.num_buckets
        
        # Bias correction constant
        if self.num_buckets >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_buckets)
        elif self.num_buckets == 64:
            self.alpha = 0.709
        elif self.num_buckets == 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673
    
    def add(self, item: Any) -> None:
        """
        Add an item to the estimator.
        
        Time: O(1)
        """
        # Hash item to 64-bit integer
        hash_value = self._hash(item)
        
        # First 'precision' bits = bucket index
        bucket_index = hash_value >> (64 - self.precision)
        
        # Remaining bits = count leading zeros
        # Mask to 64 bits: Python ints are unbounded, so an unmasked
        # shift would push the used bits above bit 63 and break the count
        remaining = (hash_value << self.precision) & ((1 << 64) - 1)
        leading_zeros = self._count_leading_zeros(remaining) + 1
        
        # Update bucket with max value
        self.buckets[bucket_index] = max(
            self.buckets[bucket_index], 
            leading_zeros
        )
    
    def count(self) -> int:
        """
        Estimate cardinality (unique count).
        
        Returns:
            Estimated number of unique items
        """
        # Raw harmonic mean estimate
        indicator = sum(2.0 ** (-bucket) for bucket in self.buckets)
        raw_estimate = self.alpha * (self.num_buckets ** 2) / indicator
        
        # Apply corrections for extreme cases
        if raw_estimate <= 2.5 * self.num_buckets:
            # Small range correction
            zero_buckets = self.buckets.count(0)
            if zero_buckets > 0:
                # Linear counting
                raw_estimate = (
                    self.num_buckets * 
                    math.log(self.num_buckets / zero_buckets)
                )
        # No large-range correction: it only applies to 32-bit hashes,
        # and we hash to 64 bits, where collisions are negligible
        
        return int(raw_estimate)
    
    def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
        """
        Merge two HLLs (for distributed counting).
        
        Returns:
            New HLL representing union of both
        """
        if self.precision != other.precision:
            raise ValueError("Cannot merge HLLs with different precision")
        
        merged = HyperLogLog(self.precision)
        for i in range(self.num_buckets):
            merged.buckets[i] = max(self.buckets[i], other.buckets[i])
        
        return merged
    
    def _hash(self, item: Any) -> int:
        """Hash item to 64-bit integer."""
        item_bytes = str(item).encode('utf-8')
        # Use first 8 bytes of SHA-256
        hash_bytes = hashlib.sha256(item_bytes).digest()[:8]
        return int.from_bytes(hash_bytes, byteorder='big')
    
    def _count_leading_zeros(self, value: int) -> int:
        """Count leading zeros in 64-bit value."""
        if value == 0:
            return 64 - self.precision
        
        zeros = 0
        mask = 1 << 63
        
        while (value & mask) == 0 and zeros < (64 - self.precision):
            zeros += 1
            mask >>= 1
        
        return zeros
    
    def memory_usage_bytes(self) -> int:
        """Return idealized memory usage in bytes (1 byte per bucket)."""
        # Packed implementations need only 6 bits per bucket (max ~64);
        # a Python list of ints uses considerably more in practice
        return self.num_buckets
    
    def standard_error(self) -> float:
        """Return theoretical standard error."""
        return 1.04 / math.sqrt(self.num_buckets)
    
    def __repr__(self) -> str:
        return (
            f"HyperLogLog(precision={self.precision}, "
            f"buckets={self.num_buckets}, "
            f"memory={self.memory_usage_bytes()} bytes, "
            f"error={self.standard_error()*100:.2f}%)"
        )


# ============================================
# USAGE EXAMPLE: Counting Unique Visitors
# ============================================
if __name__ == "__main__":
    import random
    
    # Create HLL with default precision (14 = 16384 buckets)
    hll = HyperLogLog(precision=14)
    
    print(f"Created: {hll}")
    print(f"Memory: {hll.memory_usage_bytes() / 1024:.2f} KB")
    print(f"Expected error: {hll.standard_error()*100:.2f}%")
    
    # Simulate unique visitors
    actual_unique = 100_000
    
    # Some visitors come multiple times
    for _ in range(500_000):
        visitor_id = f"user_{random.randint(1, actual_unique)}"
        hll.add(visitor_id)
    
    estimated = hll.count()
    error = abs(estimated - actual_unique) / actual_unique * 100
    
    print(f"\n=== Results ===")
    print(f"Actual unique visitors: {actual_unique:,}")
    print(f"Estimated by HLL: {estimated:,}")
    print(f"Error: {error:.2f}%")
    
    # Demonstrate merging (distributed counting)
    print(f"\n=== Distributed Counting ===")
    
    hll1 = HyperLogLog(precision=14)
    hll2 = HyperLogLog(precision=14)
    
    # Server 1 sees users 1-70,000
    for i in range(1, 70_001):
        hll1.add(f"user_{i}")
    
    # Server 2 sees users 30,001-100,000 (overlap!)
    for i in range(30_001, 100_001):
        hll2.add(f"user_{i}")
    
    # Merge both
    merged = hll1.merge(hll2)
    
    print(f"Server 1 estimate: {hll1.count():,}")
    print(f"Server 2 estimate: {hll2.count():,}")
    print(f"Merged estimate: {merged.count():,}")
    print(f"Actual unique: 100,000")
    
    # Memory comparison
    print(f"\n=== Memory Comparison (1 billion users) ===")
    hashset_size = 1_000_000_000 * 8  # 8 bytes per pointer
    hll_size = 16384  # 16KB
    
    print(f"HashSet: {hashset_size / (1024**3):.2f} GB")
    print(f"HyperLogLog: {hll_size / 1024:.2f} KB")
    print(f"Savings: {hashset_size / hll_size:,.0f}x less memory")

Redis PFADD/PFCOUNT

Redis native HyperLogLog:

redis
# Add items
PFADD visitors user1 user2 user3

# Count unique
PFCOUNT visitors
# → (integer) 3

# Merge multiple HLLs
PFMERGE total_visitors visitors_day1 visitors_day2 visitors_day3

💡 Redis HLL

A Redis HLL uses only 12 KB per key, regardless of the number of elements added.

Real-world Applications

System            Use Case
Redis             PFADD/PFCOUNT commands
Apache Spark      approx_count_distinct()
Google BigQuery   APPROX_COUNT_DISTINCT()
Facebook          Unique views across billions of posts
Presto            Distinct counts in data lake queries

Anti-patterns

❌ AVOID

1. Using it for exact counting

python
# ❌ WRONG: Banking - exact numbers are required
unique_transactions = hll.count()
if unique_transactions > 1:
    flag_fraud()  # A 1-2% error rate can miss fraud!

# ✅ RIGHT: Use it for analytics and dashboards
# "Approximately 1.2M unique users today"

2. Precision set too low

python
# ❌ WRONG: precision=4 → only 16 buckets → ~26% error!
hll = HyperLogLog(precision=4)

# ✅ RIGHT: precision=14 is the sweet spot
hll = HyperLogLog(precision=14)  # 16384 buckets, 0.81% error

3. Forgetting that merge order doesn't matter

python
# ✅ GOOD: HLL merge is commutative & associative
# Merging in any order yields the same result
total = hll1.merge(hll2).merge(hll3)
# = hll3.merge(hll1).merge(hll2)
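The commutativity claim is easy to check on raw registers, since an HLL merge is just an elementwise max. A standalone sketch, independent of the class above:

```python
from functools import reduce
from itertools import permutations

def merge_registers(a: list[int], b: list[int]) -> list[int]:
    # HLL merge = elementwise max of the two register arrays
    return [max(x, y) for x, y in zip(a, b)]

r1, r2, r3 = [3, 0, 5, 1], [2, 4, 1, 1], [0, 6, 2, 7]
results = {
    tuple(reduce(merge_registers, perm))
    for perm in permutations([r1, r2, r3])
}
print(results)  # → {(3, 6, 5, 7)}: every merge order gives the same registers
```

This is why HLLs distribute so well: each server counts locally, and a coordinator can fold the sketches together in any order.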

Algorithm Comparison

Algorithm       Memory          Accuracy   Delete?   Merge?
HashSet         O(N)            100%       ✓         ✓
Bitmap          O(max_value)    100%       ✓         ✓
Bloom Filter    O(N/8)          ~99%       ✗         ✓
HyperLogLog     O(1)            ~98%       ✗         ✓

💡 HPN's Rule

"HyperLogLog is magic: constant memory at infinite scale. But only use it when ~2% error is acceptable."