Giao diện
🔷 NoSQL Modeling
Level: Advanced Solves: Thiết kế data models cho NoSQL databases với query-driven approach
NoSQL Modeling Philosophy
💡 Giáo sư Tom
Quên hết những gì bạn biết về normalization khi làm việc với NoSQL. Ở đây, bạn design cho queries, không phải cho data integrity. Mỗi access pattern là một table design decision.
Relational vs NoSQL Mindset
┌─────────────────────────────────────────────────────────────────┐
│ RELATIONAL vs NoSQL MINDSET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RELATIONAL (Data-Driven) NoSQL (Query-Driven) │
│ ──────────────────────── ──────────────────── │
│ │
│ 1. Identify entities 1. Identify access patterns │
│ 2. Define relationships 2. Design for each pattern │
│ 3. Normalize 3. Denormalize aggressively │
│ 4. Add indexes for queries 4. Duplicate data as needed │
│ 5. Optimize later 5. Optimize upfront │
│ │
│ "What data do we have?" "How will we query data?" │
│ │
│ TRADE-OFFS: │
│ Relational: Flexible queries, data integrity │
│ NoSQL: Predictable performance, horizontal scale │
│ │
└─────────────────────────────────────────────────────────────────┘Access Pattern Analysis
Step 1: Document Access Patterns
┌─────────────────────────────────────────────────────────────────┐
│ ACCESS PATTERN WORKSHEET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ E-COMMERCE EXAMPLE: │
│ │
│ Pattern │ Description │ Frequency │ Latency Req │
│ ────────┼──────────────────────────┼───────────┼───────────── │
│ AP1 │ Get customer by ID │ Very High │ < 10ms │
│ AP2 │ Get orders by customer │ High │ < 50ms │
│ AP3 │ Get order details │ High │ < 20ms │
│ AP4 │ Get products by category │ High │ < 100ms │
│ AP5 │ Search products by name │ Medium │ < 200ms │
│ AP6 │ Get customer's addresses │ Medium │ < 20ms │
│ AP7 │ Get order history (30d) │ Low │ < 500ms │
│ │
│ QUESTIONS TO ASK: │
│ • What are the most frequent queries? │
│ • What latency is acceptable? │
│ • What's the read/write ratio? │
│ • How will data grow over time? │
│ │
└─────────────────────────────────────────────────────────────────┘DynamoDB Modeling
Single Table Design
┌─────────────────────────────────────────────────────────────────┐
│ SINGLE TABLE DESIGN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PK │ SK │ Data │
│ ────────────────┼─────────────────────┼────────────────────── │
│ CUSTOMER#123 │ PROFILE │ {name, email, ...} │
│ CUSTOMER#123 │ ADDRESS#1 │ {street, city, ...} │
│ CUSTOMER#123 │ ADDRESS#2 │ {street, city, ...} │
│ CUSTOMER#123 │ ORDER#2024-001 │ {total, status, ...} │
│ CUSTOMER#123 │ ORDER#2024-002 │ {total, status, ...} │
│ ORDER#2024-001 │ ITEM#1 │ {product, qty, ...} │
│ ORDER#2024-001 │ ITEM#2 │ {product, qty, ...} │
│ PRODUCT#P001 │ METADATA │ {name, price, ...} │
│ PRODUCT#P001 │ CATEGORY#Electronics│ {category_name} │
│ │
│ ACCESS PATTERNS SUPPORTED: │
│ • Get customer: PK = CUSTOMER#123, SK = PROFILE │
│ • Get customer addresses: PK = CUSTOMER#123, SK begins_with ADDRESS │
│ • Get customer orders: PK = CUSTOMER#123, SK begins_with ORDER │
│ • Get order items: PK = ORDER#2024-001, SK begins_with ITEM │
│ │
└─────────────────────────────────────────────────────────────────┘Partition Key Design
⚠️ Hot Partitions Kill Performance
Partition key quyết định data distribution. Chọn sai = hot partition = throttling = unhappy users.
┌─────────────────────────────────────────────────────────────────┐
│ PARTITION KEY STRATEGIES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ BAD: Low cardinality │
│ PK = status (only 5 values) │
│ → All "active" items in one partition │
│ │
│ ❌ BAD: Time-based without sharding │
│ PK = date (2024-01-15) │
│ → Today's partition is hot, old partitions cold │
│ │
│ ✅ GOOD: High cardinality, even distribution │
│ PK = customer_id │
│ → Each customer's data in separate partition │
│ │
│ ✅ GOOD: Composite with sharding │
│ PK = order_id#shard_number │
│ → Distribute hot items across partitions │
│ │
│ WRITE SHARDING PATTERN: │
│ PK = "ORDERS#" + (order_id % 10) │
│ → Spreads writes across 10 partitions │
│ → Requires scatter-gather for reads │
│ │
└─────────────────────────────────────────────────────────────────┘Global Secondary Index (GSI)
┌─────────────────────────────────────────────────────────────────┐
│ GSI DESIGN PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BASE TABLE: │
│ PK: customer_id SK: order_id │
│ │
│ GSI1 (Orders by Status): │
│ PK: status SK: created_at │
│ → Query: Get all pending orders, sorted by date │
│ │
│ GSI2 (Orders by Product): │
│ PK: product_id SK: order_id │
│ → Query: Get all orders containing product X │
│ │
│ SPARSE INDEX PATTERN: │
│ GSI3 (Flagged Orders): │
│ PK: flagged SK: created_at │
│ → Only items with 'flagged' attribute appear in index │
│ → Efficient for filtering rare conditions │
│ │
└─────────────────────────────────────────────────────────────────┘MongoDB Modeling
Embedding vs Referencing
┌─────────────────────────────────────────────────────────────────┐
│ EMBEDDING vs REFERENCING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EMBEDDING (Denormalized) REFERENCING (Normalized) │
│ ──────────────────────── ──────────────────────── │
│ │
│ { // customers collection │
│ _id: "customer123", { │
│ name: "Alice", _id: "customer123", │
│ addresses: [ name: "Alice" │
│ {street: "123 Main"}, } │
│ {street: "456 Oak"} │
│ ], // addresses collection │
│ orders: [ { │
│ {id: "order1", total: 100}, _id: "addr1", │
│ {id: "order2", total: 200} customer_id: "customer123"│
│ ] street: "123 Main" │
│ } } │
│ │
│ ✅ Single query for all data ✅ No data duplication │
│ ✅ Atomic updates ✅ Smaller documents │
│ ❌ Document size limit (16MB) ❌ Multiple queries needed │
│ ❌ Data duplication ❌ No atomic cross-doc ops │
│ │
└─────────────────────────────────────────────────────────────────┘
]}}When to Embed vs Reference
| Criteria | Embed | Reference |
|---|---|---|
| Relationship | 1:1, 1:few | 1:many, many:many |
| Access pattern | Always together | Sometimes separate |
| Update frequency | Rarely changes | Frequently changes |
| Data size | Small, bounded | Large, unbounded |
| Consistency | Strong needed | Eventual OK |
MongoDB Schema Example
javascript
// E-commerce: Hybrid approach
// customers collection
{
_id: ObjectId("..."),
email: "alice@example.com",
name: "Alice Smith",
// Embed: Always accessed together, bounded
addresses: [
{ type: "shipping", street: "123 Main", city: "NYC" },
{ type: "billing", street: "456 Oak", city: "LA" }
],
// Embed: Summary for quick access
order_summary: {
total_orders: 15,
total_spent: 2500.00,
last_order_date: ISODate("2024-01-15")
}
}
// orders collection (separate - unbounded, queried independently)
{
_id: ObjectId("..."),
customer_id: ObjectId("..."), // Reference
order_date: ISODate("2024-01-15"),
status: "shipped",
// Embed: Order items always accessed with order
items: [
{
product_id: ObjectId("..."),
product_name: "iPhone 15", // Denormalized for display
quantity: 1,
price: 999.00
}
],
// Embed: Snapshot at order time
shipping_address: {
street: "123 Main",
city: "NYC"
}
}
// products collection
{
_id: ObjectId("..."),
name: "iPhone 15",
price: 999.00,
category: "Electronics",
// Embed: Bounded, always needed
specifications: {
storage: "256GB",
color: "Blue"
}
}Cassandra Modeling
Primary Key Design
┌─────────────────────────────────────────────────────────────────┐
│ CASSANDRA PRIMARY KEY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY KEY = (partition_key, clustering_columns) │
│ │
│ EXAMPLE: Time-series sensor data │
│ │
│ CREATE TABLE sensor_readings ( │
│ sensor_id UUID, │
│ reading_date DATE, │
│ reading_time TIMESTAMP, │
│ temperature DECIMAL, │
│ humidity DECIMAL, │
│ PRIMARY KEY ((sensor_id, reading_date), reading_time) │
│ ) WITH CLUSTERING ORDER BY (reading_time DESC); │
│ │
│ PARTITION KEY: (sensor_id, reading_date) │
│ → Data for one sensor, one day = one partition │
│ → Prevents unbounded partition growth │
│ │
│ CLUSTERING COLUMN: reading_time DESC │
│ → Data sorted within partition │
│ → Latest readings first │
│ │
│ QUERY PATTERNS SUPPORTED: │
│ ✅ Get readings for sensor X on date Y │
│ ✅ Get latest N readings for sensor X on date Y │
│ ❌ Get all readings for sensor X (requires multiple queries) │
│ │
└─────────────────────────────────────────────────────────────────┘Query-First Table Design
sql
-- Access Pattern 1: Get user's recent posts
CREATE TABLE posts_by_user (
user_id UUID,
post_id TIMEUUID,
content TEXT,
created_at TIMESTAMP,
PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
-- Access Pattern 2: Get posts by hashtag
CREATE TABLE posts_by_hashtag (
hashtag TEXT,
post_id TIMEUUID,
user_id UUID,
content TEXT,
PRIMARY KEY (hashtag, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
-- Access Pattern 3: Get post details
CREATE TABLE posts_by_id (
post_id TIMEUUID PRIMARY KEY,
user_id UUID,
content TEXT,
hashtags SET<TEXT>,
created_at TIMESTAMP
);
-- Same data, three tables, three access patterns
-- Write to all three on post creationAnti-Patterns
┌─────────────────────────────────────────────────────────────────┐
│ NoSQL ANTI-PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ❌ RELATIONAL THINKING │
│ Normalizing data, expecting JOINs │
│ → Denormalize, duplicate data │
│ │
│ ❌ UNBOUNDED ARRAYS │
│ Embedding unlimited items in document │
│ → Use bucket pattern or reference │
│ │
│ ❌ LARGE DOCUMENTS │
│ Storing everything in one document │
│ → Split into multiple documents │
│ │
│ ❌ HOT PARTITIONS │
│ All traffic to one partition key │
│ → Add sharding suffix, time bucketing │
│ │
│ ❌ SCAN OPERATIONS │
│ Full table scans for queries │
│ → Design tables for specific access patterns │
│ │
│ ❌ IGNORING CONSISTENCY │
│ Assuming strong consistency everywhere │
│ → Understand eventual consistency implications │
│ │
└─────────────────────────────────────────────────────────────────┘🔍 Common Review Comments
⚠️ NoSQL Design Review Feedback
NoSQL designs are unforgiving — schema changes are expensive and hot partitions can take down production.
| Review Comment | Ý nghĩa | Cách khắc phục |
|---|---|---|
| "Partition key này sẽ gây hot partition during peak" | High-cardinality events funneled to single partition | Add randomized suffix hoặc time-bucket |
| "GSI không có sort key đúng cho query này" | GSI exists nhưng không support access pattern | Redesign GSI với correct sort key |
| "Document size sẽ vượt 400KB limit" | Unbounded arrays trong documents | Split to separate collection/table |
| "Scan operation as primary access path" | Full table scan cho common queries | Redesign với appropriate indexes |
| "RCU/WCU provisioning không match pattern" | Under/over provisioned capacity | Profile actual usage |
| "No TTL strategy for time-series" | Data grows unbounded | Enable TTL, partition by time |
🔥 Hot Partition Deep-Dive
❌ Hot Partitions = Production Outages
Một partition absorbs disproportionate traffic → throttling → cascading failures.
┌─────────────────────────────────────────────────────────────────┐
│ HOT PARTITION SCENARIOS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO 1: Celebrity Effect (DynamoDB) │
│ PK: user_id — Celebrity post → single partition overloaded │
│ │
│ SCENARIO 2: Time-based Hot Key (Cassandra) │
│ PK: date — All writes go to TODAY's partition │
│ Solution: Composite key (date, bucket) │
│ │
│ SCENARIO 3: Sequential IDs │
│ PK: auto-increment → All writes to last partition │
│ Solution: UUID/ULID for random distribution │
│ │
└─────────────────────────────────────────────────────────────────┘| Mitigation Strategy | Trade-off |
|---|---|
| Write sharding (random suffix) | Scatter-gather reads |
| Time bucketing | Query complexity |
| Read replicas (DAX, Global Tables) | Cost, eventual consistency |
⚠️ Anti-patterns (Expanded)
Anti-pattern 5: Unbounded Embedded Arrays
❌ PROBLEM (MongoDB):
{ comments: [ /* 100,000 items */ ] }
CONSEQUENCES:
• Document size limit hit (16MB MongoDB, 400KB DynamoDB)
• Every comment update locks entire document
• Memory pressure from large documents
✅ SOLUTION: Separate collection với referenceAnti-pattern 6: Scan as Primary Access
❌ PROBLEM (DynamoDB):
dynamodb.scan({ FilterExpression: 'status = pending' })
CONSEQUENCES:
• Scan reads ALL items, filter is post-read
• 10M items × 1KB = 1,250,000 RCU (vs 250 RCU with index)
✅ SOLUTION: GSI on status-created_at🔄 Migration Notes
DynamoDB Schema Evolution
| Change | Difficulty | Strategy |
|---|---|---|
| Add attribute | Easy | Backfill in batches, handle missing in app |
| Modify GSI | Medium | Delete + recreate (hours for large tables) |
| Change PK | Hard | New table + DynamoDB Streams migration |
MongoDB Schema Evolution
javascript
// Versioned documents pattern
{
_id: "...",
schemaVersion: 2,
firstName: "John", lastName: "Doe" // V2
}
// Application handles both versions during migration📎 Cross-References
- 📎 Distributed Systems - Sharding và consistency
- 📎 OLTP Modeling - So sánh với relational approach
- 📎 Schema Evolution - Migration strategies chi tiết
- 📎 System Design - Database selection criteria