
HyperLogLog - Counting to a Billion

"Đếm 1 tỷ unique users với chỉ 12 KB RAM." - HPN

Definition

HyperLogLog (HLL) is an algorithm for estimating the cardinality (number of unique elements) of a set with:

  • Space: O(1) - constant memory (~12 KB for 1 billion items)
  • Accuracy: standard error 1.04/√m, ≈ 0.81% with the default 16384 buckets

📘 Cardinality Problem

"How many unique visitors hit the website today?"

With 1 billion visits, a HashSet needs about 8 GB of RAM. HyperLogLog needs about 12 KB of RAM at ~98% accuracy.

Why HyperLogLog?

The Problem: Counting Unique Views

Facebook needs to count unique viewers for every post:

Approach       Memory (1B users)   Accuracy
HashSet        8 GB                100%
Bitmap         125 MB              100%
HyperLogLog    12 KB               ~98%

🎯 SCALE

Facebook has millions of posts. HashSet: 1 million × 8 GB = 8 Petabytes. HyperLogLog: 1 million × 12 KB = 12 GB.

The Coin Flip Analogy

Core Intuition

Think of it this way:

  • Flip a coin: the probability of 1 head is 50%
  • Flip again: the probability of 2 consecutive heads is 25%
  • The probability of k consecutive heads is 1/2^k

Insight: If you observe 5 consecutive heads, you have probably flipped around 32 times (2^5).

Applying It to Counting

  1. Hash each element into a binary string
  2. Count the leading zeros in the hash
  3. If the max leading-zero count is ρ, estimate cardinality ≈ 2^ρ
hash("user_123") = "00001101..." → 4 leading zeros
hash("user_456") = "10110001..." → 0 leading zeros
hash("user_789") = "00000011..." → 6 leading zeros ← MAX

Estimated cardinality ≈ 2^6 = 64 unique users
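The crude single-register estimator above can be sketched in a few lines. This is illustrative only; `crude_estimate` and `leading_zeros_64` are made-up helper names, not part of any library:

```python
import hashlib

def leading_zeros_64(x: int) -> int:
    # Leading zeros of x viewed as a 64-bit integer
    return 64 - x.bit_length() if x else 64

def crude_estimate(items) -> int:
    # Single-register estimate: 2^(max leading zeros observed)
    max_rho = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, byteorder="big")
        max_rho = max(max_rho, leading_zeros_64(h))
    return 2 ** max_rho
```

A single register can only ever output a power of two and is extremely noisy, which is exactly what motivates the bucketing refinement.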

Improvement: Bucketing (Stochastic Averaging)

A single max value is far too noisy. Solution: split the work across many buckets.

With m = 1024 buckets:
1. Take the first log2(1024) = 10 bits of the hash → bucket index
2. Count leading zeros in the remaining bits
3. Each bucket stores the MAX leading-zero count it has seen
4. Estimate = bias-corrected harmonic mean of 2^(bucket value) across all buckets
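The first two steps above can be sketched as follows (a minimal illustration; `bucket_and_rho` is a hypothetical helper name, and p = 10 matches the m = 1024 example):

```python
def bucket_and_rho(hash64: int, p: int = 10) -> tuple[int, int]:
    # Step 1: top p bits of the 64-bit hash -> bucket index
    index = hash64 >> (64 - p)
    # Step 2: shift the used bits out, keeping a 64-bit window
    rest = (hash64 << p) & ((1 << 64) - 1)
    # rho = leading zeros + 1 (position of the first 1-bit)
    rho = (64 - rest.bit_length() + 1) if rest else (64 - p + 1)
    return index, rho
```

Each bucket then keeps the max rho it has seen (step 3); step 4 combines all buckets into one estimate.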

The Math

Raw Estimate

E = α_m · m² · ( Σ_{j=1}^{m} 2^(−M_j) )^(−1)

Where:

  • m: number of buckets (typically 2^14 = 16384)
  • M_j: max leading-zero count in bucket j
  • α_m: bias-correction constant ≈ 0.7213

Standard Error

σ = 1.04 / √m

With m = 16384 → σ ≈ 0.81%, so estimates typically land within about 1% of the true count.
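A quick sketch of how precision trades memory for error, assuming 1 byte per bucket as in the implementation below:

```python
import math

# stderr = 1.04 / sqrt(m); memory ~ m bytes at 1 byte per bucket
for p in (4, 10, 14, 16):
    m = 1 << p
    stderr = 1.04 / math.sqrt(m)
    print(f"precision={p:2d}  buckets={m:5d}  "
          f"memory≈{m / 1024:6.2f} KB  stderr≈{stderr * 100:.2f}%")
```

Note how quickly the error collapses: each +2 in precision quadruples the buckets and halves the standard error.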

Production Implementation

python
# HPN Engineering Standard
# Implementation: HyperLogLog Cardinality Estimator

from typing import Any
import math
import hashlib


class HyperLogLog:
    """
    Probabilistic cardinality estimator using constant memory.
    
    Properties:
        - Space: O(1) - ~12 KB for 16384 buckets
        - Accuracy: standard error ~0.81% (typically within ~1%)
        - Supports merging multiple HLLs
    
    Use Cases:
        - Reddit/Facebook: Unique post viewers
        - Redis PFCOUNT: Native HLL implementation
        - Data pipelines: Distinct count in streaming
    """
    
    def __init__(self, precision: int = 14) -> None:
        """
        Initialize HyperLogLog.
        
        Args:
            precision: Number of bits for bucket index (4-16)
                       Higher = more buckets = more accuracy = more memory
                       Default 14 = 16384 buckets = ~12KB = ~0.81% error
        """
        if not 4 <= precision <= 16:
            raise ValueError("Precision must be between 4 and 16")
        
        self.precision = precision
        self.num_buckets = 1 << precision  # 2^precision
        self.buckets = [0] * self.num_buckets
        
        # Bias correction constant
        if self.num_buckets >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.num_buckets)
        elif self.num_buckets == 64:
            self.alpha = 0.709
        elif self.num_buckets == 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673
    
    def add(self, item: Any) -> None:
        """
        Add an item to the estimator.
        
        Time: O(1)
        """
        # Hash item to 64-bit integer
        hash_value = self._hash(item)
        
        # First 'precision' bits = bucket index
        bucket_index = hash_value >> (64 - self.precision)
        
        # Remaining bits = count leading zeros
        # Mask to 64 bits: Python ints are unbounded, so an unmasked
        # shift would push the used bits above bit 63 and break the count
        remaining = (hash_value << self.precision) & ((1 << 64) - 1)
        leading_zeros = self._count_leading_zeros(remaining) + 1
        
        # Update bucket with max value
        self.buckets[bucket_index] = max(
            self.buckets[bucket_index], 
            leading_zeros
        )
    
    def count(self) -> int:
        """
        Estimate cardinality (unique count).
        
        Returns:
            Estimated number of unique items
        """
        # Raw harmonic mean estimate
        indicator = sum(2.0 ** (-bucket) for bucket in self.buckets)
        raw_estimate = self.alpha * (self.num_buckets ** 2) / indicator
        
        # Apply corrections for extreme cases
        if raw_estimate <= 2.5 * self.num_buckets:
            # Small range correction
            zero_buckets = self.buckets.count(0)
            if zero_buckets > 0:
                # Linear counting
                raw_estimate = (
                    self.num_buckets * 
                    math.log(self.num_buckets / zero_buckets)
                )
        # No large-range correction: it only applies to 32-bit hashes,
        # and we hash to 64 bits, where collisions are negligible
        
        return int(raw_estimate)
    
    def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
        """
        Merge two HLLs (for distributed counting).
        
        Returns:
            New HLL representing union of both
        """
        if self.precision != other.precision:
            raise ValueError("Cannot merge HLLs with different precision")
        
        merged = HyperLogLog(self.precision)
        for i in range(self.num_buckets):
            merged.buckets[i] = max(self.buckets[i], other.buckets[i])
        
        return merged
    
    def _hash(self, item: Any) -> int:
        """Hash item to 64-bit integer."""
        item_bytes = str(item).encode('utf-8')
        # Use first 8 bytes of SHA-256
        hash_bytes = hashlib.sha256(item_bytes).digest()[:8]
        return int.from_bytes(hash_bytes, byteorder='big')
    
    def _count_leading_zeros(self, value: int) -> int:
        """Count leading zeros in 64-bit value."""
        if value == 0:
            return 64 - self.precision
        
        zeros = 0
        mask = 1 << 63
        
        while (value & mask) == 0 and zeros < (64 - self.precision):
            zeros += 1
            mask >>= 1
        
        return zeros
    
    def memory_usage_bytes(self) -> int:
        """Return idealized memory usage in bytes (1 byte per bucket)."""
        # Packed implementations need only 6 bits per bucket (max ~64);
        # a Python list of ints uses considerably more in practice
        return self.num_buckets
    
    def standard_error(self) -> float:
        """Return theoretical standard error."""
        return 1.04 / math.sqrt(self.num_buckets)
    
    def __repr__(self) -> str:
        return (
            f"HyperLogLog(precision={self.precision}, "
            f"buckets={self.num_buckets}, "
            f"memory={self.memory_usage_bytes()} bytes, "
            f"error={self.standard_error()*100:.2f}%)"
        )


# ============================================
# USAGE EXAMPLE: Counting Unique Visitors
# ============================================
if __name__ == "__main__":
    import random
    
    # Create HLL with default precision (14 = 16384 buckets)
    hll = HyperLogLog(precision=14)
    
    print(f"Created: {hll}")
    print(f"Memory: {hll.memory_usage_bytes() / 1024:.2f} KB")
    print(f"Expected error: {hll.standard_error()*100:.2f}%")
    
    # Simulate unique visitors
    actual_unique = 100_000
    
    # Some visitors come multiple times
    for _ in range(500_000):
        visitor_id = f"user_{random.randint(1, actual_unique)}"
        hll.add(visitor_id)
    
    estimated = hll.count()
    error = abs(estimated - actual_unique) / actual_unique * 100
    
    print(f"\n=== Results ===")
    print(f"Actual unique visitors: {actual_unique:,}")
    print(f"Estimated by HLL: {estimated:,}")
    print(f"Error: {error:.2f}%")
    
    # Demonstrate merging (distributed counting)
    print(f"\n=== Distributed Counting ===")
    
    hll1 = HyperLogLog(precision=14)
    hll2 = HyperLogLog(precision=14)
    
    # Server 1 sees users 1-70,000
    for i in range(1, 70_001):
        hll1.add(f"user_{i}")
    
    # Server 2 sees users 30,001-100,000 (overlap!)
    for i in range(30_001, 100_001):
        hll2.add(f"user_{i}")
    
    # Merge both
    merged = hll1.merge(hll2)
    
    print(f"Server 1 estimate: {hll1.count():,}")
    print(f"Server 2 estimate: {hll2.count():,}")
    print(f"Merged estimate: {merged.count():,}")
    print(f"Actual unique: 100,000")
    
    # Memory comparison
    print(f"\n=== Memory Comparison (1 billion users) ===")
    hashset_size = 1_000_000_000 * 8  # 8 bytes per pointer
    hll_size = 16384  # 16KB
    
    print(f"HashSet: {hashset_size / (1024**3):.2f} GB")
    print(f"HyperLogLog: {hll_size / 1024:.2f} KB")
    print(f"Savings: {hashset_size / hll_size:,.0f}x less memory")

Redis PFADD/PFCOUNT

Redis native HyperLogLog:

redis
# Add items
PFADD visitors user1 user2 user3

# Count unique
PFCOUNT visitors
# → (integer) 3

# Merge multiple HLLs
PFMERGE total_visitors visitors_day1 visitors_day2 visitors_day3

💡 Redis HLL

A Redis HLL uses only 12 KB per key, regardless of the number of elements added.

Real-world Applications

System            Use Case
Redis             PFADD/PFCOUNT commands
Apache Spark      approx_count_distinct()
Google BigQuery   APPROX_COUNT_DISTINCT()
Facebook          Unique views across billions of posts
Presto            Distinct counts in data lake queries

Anti-patterns

❌ AVOID

1. Using it for exact counting

python
# ❌ WRONG: Banking - exact numbers are required
unique_transactions = hll.count()
if unique_transactions > 1:
    flag_fraud()  # A 1-2% error rate can miss fraud!

# ✅ RIGHT: Use it for analytics and dashboards
# "Approximately 1.2M unique users today"

2. Precision set too low

python
# ❌ WRONG: precision=4 → only 16 buckets → ~26% error!
hll = HyperLogLog(precision=4)

# ✅ RIGHT: precision=14 is the sweet spot
hll = HyperLogLog(precision=14)  # 16384 buckets, 0.81% error

3. Forgetting that merge order doesn't matter

python
# ✅ GOOD: HLL merge is commutative & associative
# Merging in any order yields the same result
total = hll1.merge(hll2).merge(hll3)
# = hll3.merge(hll1).merge(hll2)
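The commutativity claim is easy to check on raw registers, since an HLL merge is just an elementwise max. A standalone sketch, independent of the class above:

```python
from functools import reduce
from itertools import permutations

def merge_registers(a: list[int], b: list[int]) -> list[int]:
    # HLL merge = elementwise max of the two register arrays
    return [max(x, y) for x, y in zip(a, b)]

r1, r2, r3 = [3, 0, 5, 1], [2, 4, 1, 1], [0, 6, 2, 7]
results = {
    tuple(reduce(merge_registers, perm))
    for perm in permutations([r1, r2, r3])
}
print(results)  # → {(3, 6, 5, 7)}: every merge order gives the same registers
```

This is why HLLs distribute so well: each server counts locally, and a coordinator can fold the sketches together in any order.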

Algorithm Comparison

Algorithm       Memory          Accuracy   Delete?   Merge?
HashSet         O(N)            100%       ✓         ✓
Bitmap          O(max_value)    100%       ✓         ✓
Bloom Filter    O(N/8)          ~99%       ✗         ✓
HyperLogLog     O(1)            ~98%       ✗         ✓

💡 HPN's Rule

"HyperLogLog is magic: constant memory at infinite scale. But only use it when ~2% error is acceptable."