Skip to content

🔷 NoSQL Modeling

Level: Advanced Solves: Thiết kế data models cho NoSQL databases với query-driven approach

NoSQL Modeling Philosophy

💡 Giáo sư Tom

Quên hết những gì bạn biết về normalization khi làm việc với NoSQL. Ở đây, bạn design cho queries, không phải cho data integrity. Mỗi access pattern là một table design decision.

Relational vs NoSQL Mindset

┌─────────────────────────────────────────────────────────────────┐
│              RELATIONAL vs NoSQL MINDSET                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  RELATIONAL (Data-Driven)         NoSQL (Query-Driven)          │
│  ────────────────────────         ────────────────────          │
│                                                                 │
│  1. Identify entities             1. Identify access patterns   │
│  2. Define relationships          2. Design for each pattern    │
│  3. Normalize                     3. Denormalize aggressively   │
│  4. Add indexes for queries       4. Duplicate data as needed   │
│  5. Optimize later                5. Optimize upfront           │
│                                                                 │
│  "What data do we have?"          "How will we query data?"     │
│                                                                 │
│  TRADE-OFFS:                                                    │
│  Relational: Flexible queries, data integrity                   │
│  NoSQL: Predictable performance, horizontal scale               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Access Pattern Analysis

Step 1: Document Access Patterns

┌─────────────────────────────────────────────────────────────────┐
│              ACCESS PATTERN WORKSHEET                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  E-COMMERCE EXAMPLE:                                            │
│                                                                 │
│  Pattern │ Description              │ Frequency │ Latency Req   │
│  ────────┼──────────────────────────┼───────────┼─────────────  │
│  AP1     │ Get customer by ID       │ Very High │ < 10ms        │
│  AP2     │ Get orders by customer   │ High      │ < 50ms        │
│  AP3     │ Get order details        │ High      │ < 20ms        │
│  AP4     │ Get products by category │ High      │ < 100ms       │
│  AP5     │ Search products by name  │ Medium    │ < 200ms       │
│  AP6     │ Get customer's addresses │ Medium    │ < 20ms        │
│  AP7     │ Get order history (30d)  │ Low       │ < 500ms       │
│                                                                 │
│  QUESTIONS TO ASK:                                              │
│  • What are the most frequent queries?                          │
│  • What latency is acceptable?                                  │
│  • What's the read/write ratio?                                 │
│  • How will data grow over time?                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

DynamoDB Modeling

Single Table Design

┌─────────────────────────────────────────────────────────────────┐
│              SINGLE TABLE DESIGN                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PK              │ SK                  │ Data                   │
│  ────────────────┼─────────────────────┼──────────────────────  │
│  CUSTOMER#123    │ PROFILE             │ {name, email, ...}     │
│  CUSTOMER#123    │ ADDRESS#1           │ {street, city, ...}    │
│  CUSTOMER#123    │ ADDRESS#2           │ {street, city, ...}    │
│  CUSTOMER#123    │ ORDER#2024-001      │ {total, status, ...}   │
│  CUSTOMER#123    │ ORDER#2024-002      │ {total, status, ...}   │
│  ORDER#2024-001  │ ITEM#1              │ {product, qty, ...}    │
│  ORDER#2024-001  │ ITEM#2              │ {product, qty, ...}    │
│  PRODUCT#P001    │ METADATA            │ {name, price, ...}     │
│  PRODUCT#P001    │ CATEGORY#Electronics│ {category_name}        │
│                                                                 │
│  ACCESS PATTERNS SUPPORTED:                                     │
│  • Get customer: PK = CUSTOMER#123, SK = PROFILE                │
│  • Get customer addresses: PK = CUSTOMER#123, SK begins_with ADDRESS │
│  • Get customer orders: PK = CUSTOMER#123, SK begins_with ORDER │
│  • Get order items: PK = ORDER#2024-001, SK begins_with ITEM    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Partition Key Design

⚠️ Hot Partitions Kill Performance

Partition key quyết định data distribution. Chọn sai = hot partition = throttling = unhappy users.

┌─────────────────────────────────────────────────────────────────┐
│              PARTITION KEY STRATEGIES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ BAD: Low cardinality                                        │
│     PK = status (only 5 values)                                 │
│     → All "active" items in one partition                       │
│                                                                 │
│  ❌ BAD: Time-based without sharding                            │
│     PK = date (2024-01-15)                                      │
│     → Today's partition is hot, old partitions cold             │
│                                                                 │
│  ✅ GOOD: High cardinality, even distribution                   │
│     PK = customer_id                                            │
│     → Each customer's data in separate partition                │
│                                                                 │
│  ✅ GOOD: Composite with sharding                               │
│     PK = order_id#shard_number                                  │
│     → Distribute hot items across partitions                    │
│                                                                 │
│  WRITE SHARDING PATTERN:                                        │
│     PK = "ORDERS#" + (order_id % 10)                            │
│     → Spreads writes across 10 partitions                       │
│     → Requires scatter-gather for reads                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Global Secondary Index (GSI)

┌─────────────────────────────────────────────────────────────────┐
│              GSI DESIGN PATTERNS                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BASE TABLE:                                                    │
│  PK: customer_id    SK: order_id                                │
│                                                                 │
│  GSI1 (Orders by Status):                                       │
│  PK: status         SK: created_at                              │
│  → Query: Get all pending orders, sorted by date                │
│                                                                 │
│  GSI2 (Orders by Product):                                      │
│  PK: product_id     SK: order_id                                │
│  → Query: Get all orders containing product X                   │
│                                                                 │
│  SPARSE INDEX PATTERN:                                          │
│  GSI3 (Flagged Orders):                                         │
│  PK: flagged        SK: created_at                              │
│  → Only items with 'flagged' attribute appear in index          │
│  → Efficient for filtering rare conditions                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

MongoDB Modeling

Embedding vs Referencing

┌─────────────────────────────────────────────────────────────────┐
│              EMBEDDING vs REFERENCING                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  EMBEDDING (Denormalized)           REFERENCING (Normalized)    │
│  ────────────────────────           ────────────────────────    │
│                                                                 │
│  {                                  // customers collection     │
│    _id: "customer123",              {                           │
│    name: "Alice",                     _id: "customer123",       │
│    addresses: [                       name: "Alice"             │
│      {street: "123 Main"},          }                           │
│      {street: "456 Oak"}                                        │
│    ],                               // addresses collection     │
│    orders: [                        {                           │
│      {id: "order1", total: 100},      _id: "addr1",             │
│      {id: "order2", total: 200}       customer_id: "customer123"│
│    ]                                  street: "123 Main"        │
│  }                                  }                           │
│                                                                 │
│  ✅ Single query for all data       ✅ No data duplication      │
│  ✅ Atomic updates                  ✅ Smaller documents        │
│  ❌ Document size limit (16MB)      ❌ Multiple queries needed  │
│  ❌ Data duplication                ❌ No atomic cross-doc ops  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
]}}

When to Embed vs Reference

CriteriaEmbedReference
Relationship1:1, 1:few1:many, many:many
Access patternAlways togetherSometimes separate
Update frequencyRarely changesFrequently changes
Data sizeSmall, boundedLarge, unbounded
ConsistencyStrong neededEventual OK

MongoDB Schema Example

javascript
// E-commerce: Hybrid approach
// customers collection
{
  _id: ObjectId("..."),
  email: "alice@example.com",
  name: "Alice Smith",
  
  // Embed: Always accessed together, bounded
  addresses: [
    { type: "shipping", street: "123 Main", city: "NYC" },
    { type: "billing", street: "456 Oak", city: "LA" }
  ],
  
  // Embed: Summary for quick access
  order_summary: {
    total_orders: 15,
    total_spent: 2500.00,
    last_order_date: ISODate("2024-01-15")
  }
}

// orders collection (separate - unbounded, queried independently)
{
  _id: ObjectId("..."),
  customer_id: ObjectId("..."),  // Reference
  order_date: ISODate("2024-01-15"),
  status: "shipped",
  
  // Embed: Order items always accessed with order
  items: [
    { 
      product_id: ObjectId("..."),
      product_name: "iPhone 15",  // Denormalized for display
      quantity: 1,
      price: 999.00
    }
  ],
  
  // Embed: Snapshot at order time
  shipping_address: {
    street: "123 Main",
    city: "NYC"
  }
}

// products collection
{
  _id: ObjectId("..."),
  name: "iPhone 15",
  price: 999.00,
  category: "Electronics",
  
  // Embed: Bounded, always needed
  specifications: {
    storage: "256GB",
    color: "Blue"
  }
}

Cassandra Modeling

Primary Key Design

┌─────────────────────────────────────────────────────────────────┐
│              CASSANDRA PRIMARY KEY                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PRIMARY KEY = (partition_key, clustering_columns)              │
│                                                                 │
│  EXAMPLE: Time-series sensor data                               │
│                                                                 │
│  CREATE TABLE sensor_readings (                                 │
│      sensor_id UUID,                                            │
│      reading_date DATE,                                         │
│      reading_time TIMESTAMP,                                    │
│      temperature DECIMAL,                                       │
│      humidity DECIMAL,                                          │
│      PRIMARY KEY ((sensor_id, reading_date), reading_time)      │
│  ) WITH CLUSTERING ORDER BY (reading_time DESC);                │
│                                                                 │
│  PARTITION KEY: (sensor_id, reading_date)                       │
│  → Data for one sensor, one day = one partition                 │
│  → Prevents unbounded partition growth                          │
│                                                                 │
│  CLUSTERING COLUMN: reading_time DESC                           │
│  → Data sorted within partition                                 │
│  → Latest readings first                                        │
│                                                                 │
│  QUERY PATTERNS SUPPORTED:                                      │
│  ✅ Get readings for sensor X on date Y                         │
│  ✅ Get latest N readings for sensor X on date Y                │
│  ❌ Get all readings for sensor X (requires multiple queries)   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Query-First Table Design

sql
-- Access Pattern 1: Get user's recent posts
CREATE TABLE posts_by_user (
    user_id UUID,
    post_id TIMEUUID,
    content TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

-- Access Pattern 2: Get posts by hashtag
CREATE TABLE posts_by_hashtag (
    hashtag TEXT,
    post_id TIMEUUID,
    user_id UUID,
    content TEXT,
    PRIMARY KEY (hashtag, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

-- Access Pattern 3: Get post details
CREATE TABLE posts_by_id (
    post_id TIMEUUID PRIMARY KEY,
    user_id UUID,
    content TEXT,
    hashtags SET<TEXT>,
    created_at TIMESTAMP
);

-- Same data, three tables, three access patterns
-- Write to all three on post creation

Anti-Patterns

┌─────────────────────────────────────────────────────────────────┐
│              NoSQL ANTI-PATTERNS                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌ RELATIONAL THINKING                                         │
│     Normalizing data, expecting JOINs                           │
│     → Denormalize, duplicate data                               │
│                                                                 │
│  ❌ UNBOUNDED ARRAYS                                            │
│     Embedding unlimited items in document                       │
│     → Use bucket pattern or reference                           │
│                                                                 │
│  ❌ LARGE DOCUMENTS                                             │
│     Storing everything in one document                          │
│     → Split into multiple documents                             │
│                                                                 │
│  ❌ HOT PARTITIONS                                              │
│     All traffic to one partition key                            │
│     → Add sharding suffix, time bucketing                       │
│                                                                 │
│  ❌ SCAN OPERATIONS                                             │
│     Full table scans for queries                                │
│     → Design tables for specific access patterns                │
│                                                                 │
│  ❌ IGNORING CONSISTENCY                                        │
│     Assuming strong consistency everywhere                      │
│     → Understand eventual consistency implications              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔍 Common Review Comments

⚠️ NoSQL Design Review Feedback

NoSQL designs are unforgiving — schema changes are expensive and hot partitions can take down production.

Review CommentÝ nghĩaCách khắc phục
"Partition key này sẽ gây hot partition during peak"High-cardinality events funneled to single partitionAdd randomized suffix hoặc time-bucket
"GSI không có sort key đúng cho query này"GSI exists nhưng không support access patternRedesign GSI với correct sort key
"Document size sẽ vượt 400KB limit"Unbounded arrays trong documentsSplit to separate collection/table
"Scan operation as primary access path"Full table scan cho common queriesRedesign với appropriate indexes
"RCU/WCU provisioning không match pattern"Under/over provisioned capacityProfile actual usage
"No TTL strategy for time-series"Data grows unboundedEnable TTL, partition by time

🔥 Hot Partition Deep-Dive

❌ Hot Partitions = Production Outages

Một partition absorbs disproportionate traffic → throttling → cascading failures.

┌─────────────────────────────────────────────────────────────────┐
│               HOT PARTITION SCENARIOS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCENARIO 1: Celebrity Effect (DynamoDB)                        │
│  PK: user_id — Celebrity post → single partition overloaded     │
│                                                                 │
│  SCENARIO 2: Time-based Hot Key (Cassandra)                     │
│  PK: date — All writes go to TODAY's partition                  │
│  Solution: Composite key (date, bucket)                         │
│                                                                 │
│  SCENARIO 3: Sequential IDs                                     │
│  PK: auto-increment → All writes to last partition              │
│  Solution: UUID/ULID for random distribution                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Mitigation StrategyTrade-off
Write sharding (random suffix)Scatter-gather reads
Time bucketingQuery complexity
Read replicas (DAX, Global Tables)Cost, eventual consistency

⚠️ Anti-patterns (Expanded)

Anti-pattern 5: Unbounded Embedded Arrays

❌ PROBLEM (MongoDB):
{ comments: [ /* 100,000 items */ ] }

CONSEQUENCES:
• Document size limit hit (16MB MongoDB, 400KB DynamoDB)
• Every comment update locks entire document
• Memory pressure from large documents

✅ SOLUTION: Separate collection với reference

Anti-pattern 6: Scan as Primary Access

❌ PROBLEM (DynamoDB):
dynamodb.scan({ FilterExpression: 'status = pending' })

CONSEQUENCES:
• Scan reads ALL items, filter is post-read
• 10M items × 1KB = 1,250,000 RCU (vs 250 RCU with index)

✅ SOLUTION: GSI on status-created_at

🔄 Migration Notes

DynamoDB Schema Evolution

ChangeDifficultyStrategy
Add attributeEasyBackfill in batches, handle missing in app
Modify GSIMediumDelete + recreate (hours for large tables)
Change PKHardNew table + DynamoDB Streams migration

MongoDB Schema Evolution

javascript
// Versioned documents pattern
{
  _id: "...",
  schemaVersion: 2,
  firstName: "John", lastName: "Doe"  // V2
}

// Application handles both versions during migration

📎 Cross-References