🏗️ LLM App Architecture

🎓 Page Overview

This page provides foundational knowledge of LLM application architecture, from basic components to orchestration patterns for production-ready systems.

Level: Foundation
Solves: Designing architectures for LLM-powered applications with proper separation of concerns and scalability

🎯 Core Concepts

LLM Application Architecture Layers

Component Responsibilities

| Component | Responsibility | Key Considerations |
|---|---|---|
| LLM Orchestrator | Coordinate request flow, manage context | Retry logic, timeout handling, fallback chains |
| Model Router | Select optimal model per request | Cost, latency, capability matching |
| Semantic Cache | Cache similar queries | Similarity threshold, cache invalidation |
| Embedding Service | Convert text to vectors | Batch processing, model versioning |
| Vector DB | Store and retrieve embeddings | Index strategy, hybrid search support |
| Reranker | Re-score retrieved documents | Latency budget, accuracy trade-offs |
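
As a concrete illustration of the Model Router's job (matching cost, latency, and capability), here is a minimal routing sketch. The model names, prices, and latencies are illustrative placeholders, not real provider data, and the catalog is assumed to be ordered from least to most capable.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative values only
    avg_latency_ms: int
    max_context: int

# Hypothetical catalog, ordered least -> most capable.
CATALOG = [
    ModelProfile("small-fast", 0.0002, 300, 16_000),
    ModelProfile("medium", 0.001, 800, 128_000),
    ModelProfile("large-capable", 0.01, 2000, 200_000),
]

def route(prompt_tokens: int, needs_reasoning: bool) -> ModelProfile:
    """Pick the cheapest model that fits the context and capability needs."""
    candidates = [m for m in CATALOG if m.max_context >= prompt_tokens]
    if needs_reasoning:
        candidates = candidates[-1:]  # restrict to the most capable tier
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)
```

A real router would also factor in the latency budget and observed error rates per provider.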

🔄 Data Flow Patterns

Request-Response Flow

Streaming vs Batch Patterns

| Pattern | Use Case | Implementation |
|---|---|---|
| Streaming | Interactive chat, real-time | SSE/WebSocket, token-by-token |
| Batch | Document processing, offline | Queue-based, async workers |
| Hybrid | Initial answer + enhancement | Stream first response, batch enrichment |
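
To make the streaming row concrete, here is a sketch of wrapping token-by-token output as Server-Sent Events frames. `fake_llm_stream` is a stand-in for a provider's streaming API, not a real client.

```python
from typing import Iterator

def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a provider's streaming API; yields tokens one at a time."""
    for token in ["Hello", " ", "world", "!"]:
        yield token

def sse_events(prompt: str) -> Iterator[str]:
    """Wrap each token as an SSE 'data:' frame, terminated by a sentinel."""
    for token in fake_llm_stream(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"
```

In production the generator would be plugged into a streaming HTTP response (e.g. a framework's SSE helper), with heartbeat frames to keep idle connections alive.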

🧠 Orchestration Patterns

1. Simple Chain Pattern

User Query → Retrieval → LLM → Response
  • Use when: Single-turn Q&A, simple RAG
  • Pros: Low latency, easy to debug
  • Cons: Limited reasoning capability
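
The chain above can be sketched in a few lines. The retrieval and model calls are toy placeholders (keyword overlap instead of a vector DB, a stub instead of an LLM); only the flow is the point.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would query a vector DB."""
    query_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(query_words & set(d.lower().split())))
    return scored[:k]

def llm(prompt: str) -> str:
    """Placeholder for a model call."""
    return f"Answer based on: {prompt[:60]}..."

def simple_chain(query: str, corpus: list[str]) -> str:
    # Retrieval -> prompt assembly -> single LLM call, no loops or branching.
    context = "\n".join(retrieve(query, corpus))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```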

2. Agent Pattern

User Query → Planning → [Tool Use Loop] → Response
  • Use when: Complex tasks, multi-step reasoning
  • Pros: Flexible, can handle diverse tasks
  • Cons: Higher latency, harder to control
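
A minimal sketch of the tool-use loop, assuming the planner is an LLM that either requests a tool call or emits a final answer. Here `fake_planner` is a hard-coded stand-in so the loop is runnable.

```python
def calculator(expr: str) -> str:
    # Demo-only; never eval untrusted input in production.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_planner(query: str, history: list) -> dict:
    """Stand-in for the LLM planning step: tool call first, then final answer."""
    if not history:
        return {"tool": "calculator", "input": "2 + 3"}
    return {"final": f"The result is {history[-1][1]}"}

def agent_loop(query: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        step = fake_planner(query, history)
        if "final" in step:
            return step["final"]
        # Execute the requested tool and feed the result back to the planner.
        result = TOOLS[step["tool"]](step["input"])
        history.append((step["tool"], result))
    return "Step budget exhausted"
```

The `max_steps` cap is the usual guard against the "harder to control" failure mode: without it a confused planner can loop indefinitely.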

3. Multi-Agent Pattern

User Query → Router → [Specialist Agent 1..N] → Aggregator → Response
  • Use when: Domain-specific expertise required
  • Pros: Better quality per domain
  • Cons: Complex orchestration, coordination overhead
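
A toy version of router → specialists → aggregator. The keyword router and lambda specialists are placeholders; in practice the router is often itself an LLM call and each specialist a full agent.

```python
SPECIALISTS = {
    "billing": lambda q: f"[billing] handled: {q}",
    "tech": lambda q: f"[tech] handled: {q}",
}

DOMAIN_KEYWORDS = {"billing": ["invoice", "refund"], "tech": ["error", "crash"]}

def route_query(query: str) -> list[str]:
    """Keyword routing; falls back to 'tech' when no domain matches."""
    matched = [d for d, kws in DOMAIN_KEYWORDS.items()
               if any(k in query.lower() for k in kws)]
    return matched or ["tech"]

def aggregate(answers: list[str]) -> str:
    """Trivial aggregator; a real one might ask an LLM to merge answers."""
    return " | ".join(answers)

def multi_agent(query: str) -> str:
    return aggregate([SPECIALISTS[d](query) for d in route_query(query)])
```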

4. Workflow Pattern

User Query → [Predefined Steps 1..N] → Response
  • Use when: Structured processes, compliance requirements
  • Pros: Predictable, auditable
  • Cons: Less flexible
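
The workflow pattern boils down to a fixed pipeline of steps over a shared state, with each step recorded for auditability. A minimal sketch with three hypothetical steps:

```python
from typing import Callable

def validate(state: dict) -> dict:
    state["valid"] = bool(state.get("query"))
    return state

def classify(state: dict) -> dict:
    state["category"] = "question" if state["query"].endswith("?") else "statement"
    return state

def respond(state: dict) -> dict:
    state["response"] = f"Handled {state['category']}: {state['query']}"
    return state

# The step order is fixed up front, which is what makes the flow predictable.
PIPELINE: list[Callable[[dict], dict]] = [validate, classify, respond]

def run_workflow(query: str) -> dict:
    state = {"query": query, "audit": []}
    for step in PIPELINE:
        state = step(state)
        state["audit"].append(step.__name__)  # audit trail per step
    return state
```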

⚙️ Context Management

Context Window Strategy

| Strategy | Description | Trade-offs |
|---|---|---|
| Sliding Window | Keep last N tokens | Simple, may lose important context |
| Summarization | Compress old context | Preserves key info, adds latency |
| Hierarchical | Multiple summary levels | Best retention, highest complexity |
| Selective | Keep relevant sections only | Efficient, requires relevance scoring |
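
The simplest strategy, the sliding window, can be sketched as follows. The whitespace token counter is a stand-in for a real tokenizer.

```python
def sliding_window(messages: list[str], max_tokens: int,
                   count=lambda m: len(m.split())) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget.

    `count` approximates tokens by whitespace splitting; swap in a real
    tokenizer in practice.
    """
    kept, total = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count(msg)
        if total + cost > max_tokens:
            break  # oldest messages are dropped first
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```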

Memory Architecture

🔧 Production Considerations

Reliability Patterns

| Pattern | Implementation | Purpose |
|---|---|---|
| Circuit Breaker | Track failure rate, open circuit on threshold | Prevent cascade failures |
| Retry with Backoff | Exponential backoff with jitter | Handle transient failures |
| Fallback Chain | Primary → Secondary → Tertiary model | Ensure availability |
| Request Hedging | Send to multiple providers, use first response | Reduce tail latency |
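
As one example from the table, retry with exponential backoff and full jitter can be sketched like this. The `sleep` parameter is injectable so the delay can be skipped in tests.

```python
import random
import time

def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Retry `call` on exception, doubling the delay ceiling each attempt.

    Uses 'full jitter': the actual wait is uniform in [0, ceiling], which
    spreads retries out and avoids thundering-herd spikes.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A production version would retry only on transient errors (timeouts, 429/5xx) and pass everything else straight through.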

Latency Optimization

  1. Parallel Retrieval: Fetch from multiple sources concurrently
  2. Speculative Execution: Pre-generate likely follow-ups
  3. Edge Caching: Cache embeddings and common responses at edge
  4. Model Quantization: Use quantized models for low-latency paths
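
Item 1 (parallel retrieval) maps directly onto `asyncio.gather`. The two fetchers below are simulated sources; the point is that total latency is bounded by the slowest source, not the sum.

```python
import asyncio

async def fetch_vector_db(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulated network I/O
    return [f"vec:{query}"]

async def fetch_keyword_index(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulated network I/O
    return [f"kw:{query}"]

async def parallel_retrieve(query: str) -> list[str]:
    """Fan out to both sources concurrently, then flatten the results."""
    results = await asyncio.gather(fetch_vector_db(query),
                                   fetch_keyword_index(query))
    return [doc for batch in results for doc in batch]
```

The merged list would typically go to the Reranker component from the table above rather than straight to the LLM.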

Security Architecture

📋 Architecture Checklist

Design Phase

  • [ ] Define latency SLO (p50, p95, p99)
  • [ ] Identify fallback model chain
  • [ ] Design context management strategy
  • [ ] Plan caching layer (semantic + exact match)
  • [ ] Define security boundaries

Implementation Phase

  • [ ] Implement circuit breaker for LLM calls
  • [ ] Set up observability (traces, metrics, logs)
  • [ ] Configure rate limiting per user/tenant
  • [ ] Implement request/response logging (PII-aware)
  • [ ] Set up async processing for batch operations

Production Readiness

  • [ ] Load test with realistic traffic patterns
  • [ ] Test fallback chain end-to-end
  • [ ] Verify cache hit rates meet targets
  • [ ] Validate security controls (injection, DLP)
  • [ ] Document runbook for common failures

🔗 Cross-References

📚 Further Reading

  • "Building LLM Applications for Production" - Chip Huyen
  • "Patterns for LLM-based Applications" - Anthropic Cookbook
  • "Production-Ready LLM Applications" - LangChain Documentation