Giao diện
Bloom Filter - The Efficient Gatekeeper
"Tôi có thể nói 'CHẮC CHẮN KHÔNG' hoặc 'CÓ THỂ CÓ', nhưng không bao giờ sai khi nói KHÔNG." - HPN
Định nghĩa
Bloom Filter là cấu trúc dữ liệu xác suất (probabilistic) để test set membership:
- Nếu trả về "No" → Element CHẮC CHẮN không có trong set
- Nếu trả về "Maybe" → Element CÓ THỂ có trong set (False Positive possible)
📘 Key Properties
- False Negatives = 0% (Không bao giờ miss)
- False Positives > 0% (Có thể báo nhầm "có")
- Space: O(N/8) bits thay vì O(N × data_size)
Tại sao cần Bloom Filter?
Bài toán: Database Disk Lookup
Cassandra cần kiểm tra xem row có tồn tại trước khi đọc từ disk:
| Approach | Cost | Latency |
|---|---|---|
| Disk lookup mỗi lần | High I/O | ~10ms |
| Cache toàn bộ keys | 100GB RAM | ~0.1ms |
| Bloom Filter | 100MB RAM | ~0.001ms |
🎯 MATH
10 triệu rows × 100 bytes/key = 1 GB cho HashSet
Bloom Filter: 10 triệu bits ÷ 8 = 1.25 MB (800x savings!)
Cơ chế Hoạt động
Component
- Bit Array: Mảng m bits, khởi tạo = 0
- k Hash Functions: Map element → k vị trí trong bit array
Insert Operation
Insert "apple":
1. hash1("apple") mod m = 3 → bits[3] = 1
2. hash2("apple") mod m = 7 → bits[7] = 1
3. hash3("apple") mod m = 11 → bits[11] = 1
Bit Array: [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]
↑ ↑ ↑Query Operation
Check "apple":
1. bits[3] == 1? ✓
2. bits[7] == 1? ✓
3. bits[11] == 1? ✓
→ "MAYBE" (all bits are 1)
Check "banana":
1. bits[5] == 0? ✗
→ "DEFINITELY NOT" (at least one bit is 0)False Positive Example
Đã insert: "apple", "orange"
bits[3], bits[7], bits[11] = 1 (apple)
bits[2], bits[7], bits[14] = 1 (orange)
Check "grape":
hash1("grape") = 3 → bits[3] = 1 ✓ (from apple)
hash2("grape") = 7 → bits[7] = 1 ✓ (from both)
hash3("grape") = 14 → bits[14] = 1 ✓ (from orange)
→ "MAYBE" (FALSE POSITIVE! grape was NEVER inserted)Visualization
The Math: Optimal Parameters
False Positive Probability
Với:
: Số bits trong array : Số elements đã insert : Số hash functions
Optimal k (số hash functions)
Optimal m (size of bit array)
Ví dụ: Với n = 1 triệu elements, muốn
Production Implementation
python
# HPN Engineering Standard
# Implementation: Bloom Filter with Optimal Parameters
from typing import List, Any
import math
import hashlib
class BloomFilter:
"""
Space-efficient probabilistic set membership.
Properties:
- False Negatives: 0% (Never misses)
- False Positives: Tunable (default ~1%)
- Space: O(n) bits instead of O(n × data_size)
Use Cases:
- Cassandra: Avoid disk lookups for non-existent rows
- Chrome: Check URLs against malware database
- Medium: Recommend articles user hasn't read
"""
def __init__(
self,
expected_items: int,
false_positive_rate: float = 0.01
) -> None:
"""
Initialize Bloom Filter with optimal parameters.
Args:
expected_items: Expected number of items to insert
false_positive_rate: Target FP rate (default 1%)
"""
# Calculate optimal bit array size
self.size = self._optimal_size(expected_items, false_positive_rate)
# Calculate optimal number of hash functions
self.num_hashes = self._optimal_hashes(self.size, expected_items)
# Initialize bit array (using bytearray for efficiency)
self.bit_array = bytearray((self.size + 7) // 8)
self.count = 0
self.expected_items = expected_items
self.fp_rate = false_positive_rate
def add(self, item: Any) -> None:
"""
Add an item to the filter.
Time: O(k) where k = num_hashes
"""
for seed in range(self.num_hashes):
index = self._hash(item, seed)
self._set_bit(index)
self.count += 1
def might_contain(self, item: Any) -> bool:
"""
Check if item MIGHT be in the set.
Returns:
True = "Maybe" (could be false positive)
False = "Definitely not" (100% certain)
"""
for seed in range(self.num_hashes):
index = self._hash(item, seed)
if not self._get_bit(index):
return False # Definitely not in set
return True # Maybe in set
def __contains__(self, item: Any) -> bool:
"""Support 'in' operator."""
return self.might_contain(item)
def _hash(self, item: Any, seed: int) -> int:
"""
Generate hash value for item with given seed.
Uses double hashing technique for multiple hash functions:
h(item, i) = (h1(item) + i × h2(item)) mod m
"""
# Convert item to bytes
item_bytes = str(item).encode('utf-8')
# Use MD5 for h1 and SHA1 for h2 (fast, good distribution)
h1 = int(hashlib.md5(item_bytes).hexdigest(), 16)
h2 = int(hashlib.sha1(item_bytes).hexdigest(), 16)
# Double hashing
return (h1 + seed * h2) % self.size
def _set_bit(self, index: int) -> None:
"""Set bit at index to 1."""
byte_index = index // 8
bit_offset = index % 8
self.bit_array[byte_index] |= (1 << bit_offset)
def _get_bit(self, index: int) -> bool:
"""Get bit value at index."""
byte_index = index // 8
bit_offset = index % 8
return bool(self.bit_array[byte_index] & (1 << bit_offset))
@staticmethod
def _optimal_size(n: int, p: float) -> int:
"""Calculate optimal bit array size."""
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(m)
@staticmethod
def _optimal_hashes(m: int, n: int) -> int:
"""Calculate optimal number of hash functions."""
k = (m / n) * math.log(2)
return max(1, int(k))
def estimated_fp_rate(self) -> float:
"""Calculate current estimated false positive rate."""
# Estimate based on fill ratio
fill_ratio = sum(bin(byte).count('1') for byte in self.bit_array) / self.size
return fill_ratio ** self.num_hashes
def memory_usage_bytes(self) -> int:
"""Return memory usage in bytes."""
return len(self.bit_array)
def __repr__(self) -> str:
return (
f"BloomFilter(items={self.count}, "
f"size={self.size} bits, "
f"hashes={self.num_hashes}, "
f"memory={self.memory_usage_bytes()} bytes)"
)
# ============================================
# USAGE EXAMPLE: URL Blacklist
# ============================================
if __name__ == "__main__":
# Create filter for 1 million URLs with 0.1% FP rate
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.001)
print(f"Created: {bf}")
print(f"Memory: {bf.memory_usage_bytes() / 1024:.2f} KB")
print(f"Hash functions: {bf.num_hashes}")
# Add malicious URLs
malicious_urls = [
"evil.com/malware",
"phishing.net/login",
"scam.org/bitcoin"
]
for url in malicious_urls:
bf.add(url)
# Check URLs
test_urls = [
"evil.com/malware", # Should be "Maybe" (true positive)
"google.com", # Should be "Definitely not"
"phishing.net/login", # Should be "Maybe" (true positive)
"safe-website.com", # Should be "Definitely not"
]
print("\n=== URL Check Results ===")
for url in test_urls:
if url in bf:
print(f"⚠️ BLOCKED: {url} (possibly malicious)")
else:
print(f"✅ ALLOWED: {url} (definitely safe)")
# Measure actual false positive rate
import random
import string
false_positives = 0
test_count = 10000
for _ in range(test_count):
random_url = ''.join(random.choices(string.ascii_lowercase, k=20))
if bf.might_contain(random_url):
false_positives += 1
print(f"\n=== False Positive Rate ===")
print(f"Expected: {bf.fp_rate * 100:.3f}%")
print(f"Actual: {false_positives / test_count * 100:.3f}%")
)Ứng dụng Thực tế
| System | Use Case |
|---|---|
| Cassandra | Skip disk read for non-existent rows |
| Chrome/Firefox | Safe Browsing malware URL check |
| Medium | Filter already-read articles |
| Akamai CDN | "One-hit wonders" cache optimization |
| Bitcoin | SPV clients filter relevant transactions |
Counting Bloom Filter
Standard Bloom Filter không hỗ trợ delete. Giải pháp: Counting Bloom Filter.
python
# Thay vì bit array (0/1), dùng counter array
counters = [0] * m
def add(item):
for i in hash_indices(item):
counters[i] += 1
def remove(item): # Now possible!
for i in hash_indices(item):
counters[i] -= 1
def contains(item):
return all(counters[i] > 0 for i in hash_indices(item))⚠️ Trade-off
Counting BF cần 4 bits/counter thay vì 1 bit → 4x memory.
Anti-patterns
❌ TRÁNH
1. Dùng cho exact membership
python
# ❌ WRONG: Banking - cần 100% chính xác
if user_id in bloom_filter:
approve_transaction() # FALSE POSITIVE có thể mất tiền!
# ✅ RIGHT: Dùng như pre-filter
if user_id not in bloom_filter:
reject_fast() # Definitely not a customer
else:
verify_in_database() # Double-check2. Xóa khỏi standard Bloom Filter
python
# ❌ WRONG: Không thể xóa
bf.remove("item") # Sẽ corrupt filter!
# ✅ RIGHT: Dùng Counting Bloom Filter hoặc rebuild3. Undersize bit array
python
# ❌ WRONG: Quá nhỏ → FP rate cao
bf = BloomFilter(expected_items=1_000_000)
bf.size = 1000 # Sẽ có >90% false positives
# ✅ RIGHT: Luôn dùng optimal parameters💡 HPN's Rule
"Bloom Filter là gatekeeper: Nói KHÔNG thì chắc, nói CÓ thì cần verify."