📚 RAG Engineering

🎓 Page Overview

This page provides in-depth coverage of Retrieval-Augmented Generation (RAG), from chunking strategies to retrieval optimization and citation management.

Level: Core | Solves: building production-ready RAG systems with high retrieval quality and proper grounding

🎯 RAG Architecture Overview

Standard RAG Pipeline

RAG Components

| Component | Purpose | Key Decisions |
| --- | --- | --- |
| Chunking | Split documents into retrievable units | Size, overlap, semantic boundaries |
| Embedding | Convert text to dense vectors | Model selection, dimension trade-offs |
| Vector Store | Index and search embeddings | Index type, hybrid search capability |
| Retrieval | Find relevant chunks | Top-K, similarity threshold |
| Reranking | Re-score retrieved results | Model choice, latency budget |
| Generation | Produce grounded answers | Context window management, citation |

📦 Chunking Strategies

Chunking Methods Comparison

| Method | Description | Best For |
| --- | --- | --- |
| Fixed Size | Split by character/token count | Uniform documents, simple setup |
| Sentence-based | Split on sentence boundaries | Well-structured prose |
| Paragraph-based | Split on paragraph breaks | Documents with clear structure |
| Semantic | Split on topic changes | Mixed content, technical docs |
| Recursive | Hierarchical splitting | Complex documents, code |

Chunking Parameters

```python
# Example configuration
CHUNK_CONFIG = {
    "chunk_size": 512,           # tokens
    "chunk_overlap": 50,         # tokens
    "separators": ["\n\n", "\n", ". ", " "],
    "length_function": "tiktoken",
    "keep_separator": True
}
```
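The configuration above can be exercised with a minimal fixed-size chunker. As a simplifying assumption, whitespace-split words stand in for tokens here; a production version would count tokens with the configured length function (e.g. `tiktoken`):

```python
def chunk_fixed(text, chunk_size=512, chunk_overlap=50):
    """Split text into overlapping fixed-size chunks.

    Words approximate tokens for illustration; swap in a real
    tokenizer (e.g. tiktoken) for production use.
    """
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end
    return chunks
```

Each chunk shares its trailing `chunk_overlap` tokens with the head of the next chunk, so sentences that straddle a boundary remain retrievable from at least one chunk.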

Chunk Size Trade-offs

| Chunk Size (tokens) | Retrieval Precision | Context Richness | Embedding Quality |
| --- | --- | --- | --- |
| Small (128-256) | High | Low | High |
| Medium (512-1024) | Medium | Medium | Medium |
| Large (1024-2048) | Low | High | Lower |

💡 Best Practice

Start with 512-token chunks and roughly 10% overlap, then adjust based on retrieval quality metrics.

Advanced Chunking: Hierarchical

Benefits:

  • Parent chunks provide broader context
  • Child chunks enable precise retrieval
  • Can dynamically expand context when needed
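A minimal sketch of the parent/child pattern (the sizes and dict layout are illustrative, not a fixed schema): children are what gets embedded and retrieved, and each child points back to its parent for context expansion.

```python
def hierarchical_chunks(text, parent_size=1024, child_size=256):
    """Build large parent chunks and small child chunks.

    Retrieval matches against the precise children; generation can
    expand a hit to its parent for broader context.
    """
    words = text.split()  # words stand in for tokens, as above
    children = []
    for p_idx, p_start in enumerate(range(0, len(words), parent_size)):
        parent = words[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            children.append({
                "text": " ".join(parent[c_start:c_start + child_size]),
                "parent_id": p_idx,  # key for looking up the parent chunk
            })
    return children
```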

🔍 Retrieval Optimization

Retrieval Strategies

| Strategy | Description | Use Case |
| --- | --- | --- |
| Dense Retrieval | Semantic similarity via embeddings | Conceptual matching |
| Sparse Retrieval | BM25/TF-IDF keyword matching | Exact term matching |
| Hybrid | Combine dense + sparse | Best of both worlds |
| Multi-Vector | Multiple embeddings per chunk | Diverse query types |

Hybrid Search Implementation

Fusion Formula (Reciprocal Rank Fusion, RRF):

score(d) = Σ_i 1 / (k + rank_i(d))

where rank_i(d) is document d's rank in retriever i's result list and k is a smoothing constant (commonly 60).
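The formula translates directly to code. Here rankings are plain lists of document ids ordered best-first, and k = 60 is the conventional default:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i(d)).

    `rankings` is a list of ranked doc-id lists (one per retriever,
    rank starting at 1). Returns doc ids sorted by fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it needs no score normalization between the dense and sparse retrievers, which is why it is a popular default for hybrid search.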

Query Transformation

| Technique | Description | When to Use |
| --- | --- | --- |
| Query Expansion | Add synonyms and related terms | Broad topic coverage |
| HyDE | Generate hypothetical answer, embed that | Improved semantic matching |
| Multi-Query | Generate query variations | Diverse result coverage |
| Step-back | Generate a more abstract query | Conceptual understanding |
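As one small example, dictionary-based query expansion might look like the sketch below; the synonym map is an assumed domain-specific input (HyDE, multi-query, and step-back would instead prompt an LLM):

```python
def expand_query(query, synonyms):
    """Query expansion: append known synonyms for each query term.

    `synonyms` maps a term to related terms, e.g. built from a
    thesaurus or domain glossary (an assumed input here).
    """
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        # append each synonym once, preserving discovery order
        expanded.extend(s for s in synonyms.get(term, []) if s not in expanded)
    return " ".join(expanded)
```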

🎯 Reranking

Reranker Types

| Type | Latency | Quality | Cost |
| --- | --- | --- | --- |
| Cross-Encoder | High | Highest | High |
| ColBERT | Medium | High | Medium |
| LLM-based | Very High | Variable | Highest |
| Hybrid Score | Low | Medium | Low |

Reranking Pipeline

Implementation Considerations

```python
# Reranking configuration
RERANK_CONFIG = {
    "initial_k": 50,        # Retrieve more for reranking
    "final_k": 5,           # Return top after rerank
    "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "batch_size": 32,
    "score_threshold": 0.5  # Optional filtering
}
```
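A reranking step wired to this configuration could be sketched as follows. `score_fn` is a placeholder for any (query, text) scorer, e.g. a call into the cross-encoder model named above:

```python
def rerank(query, candidates, score_fn, final_k=5, score_threshold=None):
    """Re-score retrieved candidates and keep the best few.

    `score_fn(query, text)` returns a relevance score; in production
    it would batch candidates through a cross-encoder.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if score_threshold is not None:
        scored = [(s, c) for s, c in scored if s >= score_threshold]
    return [c for _, c in scored[:final_k]]
```

The pattern is deliberately over-retrieve then cut: fetch `initial_k` cheaply, spend the expensive scorer only on those, and return `final_k`.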

📝 Context Assembly

Context Window Management

Context Ordering Strategies

| Strategy | Description | Impact |
| --- | --- | --- |
| Relevance First | Most relevant at top | Best for short contexts |
| Lost in Middle | Relevant at start and end | Mitigates attention bias |
| Chronological | Time-ordered | Good for temporal reasoning |
| Hierarchical | Structured by importance | Good for complex topics |
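The "lost in the middle" strategy can be sketched as a simple reordering of chunks that are already sorted by relevance:

```python
def lost_in_middle_order(chunks_by_relevance):
    """Interleave chunks so the most relevant land at the start and
    end of the context, pushing the least relevant into the middle
    (mitigates the 'lost in the middle' attention bias)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For example, chunks ranked 1-5 come out as 1, 3, 5, 4, 2: rank 1 opens the context and rank 2 closes it.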

Context Compression

```markdown
## Original Context (500 tokens)
[Long detailed passage about machine learning...]

## Compressed Context (150 tokens)
Key points: ML involves training models on data.
Main approaches: supervised, unsupervised, reinforcement learning.
Critical consideration: data quality and quantity.
```
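One cheap, extractive way to compress context is to keep only the sentences that overlap the query; this is a sketch (LLM-based compressors, which rewrite rather than select, are more common in practice):

```python
def compress_context(context, query, budget=3):
    """Extractive compression: keep the `budget` sentences sharing
    the most terms with the query, preserving original order."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    q_terms = set(query.lower().split())
    # indices ranked by term overlap with the query, best first
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(q_terms & set(sentences[i].lower().split())),
                    reverse=True)
    keep = sorted(ranked[:budget])  # restore document order
    return ". ".join(sentences[i] for i in keep) + "."
```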

📎 Citation & Grounding

Citation Patterns

| Pattern | Example | Use Case |
| --- | --- | --- |
| Inline | "The answer is X [1]" | Academic style |
| Section-based | "According to Section 3.2..." | Document reference |
| Verbatim Quote | "As stated: 'exact quote'" | Legal/compliance |
| Confidence-tagged | "X (confidence: high)" | Uncertainty awareness |

Grounding Verification
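
Assuming the inline `[Source: document_id, page X]` format from the citation policy, verification can be sketched as a check that every cited id resolves to a document that was actually retrieved:

```python
import re

def verify_citations(answer, known_doc_ids):
    """Extract every [Source: doc_id, ...] tag from an answer and
    return the cited ids that do not match a retrieved document."""
    cited = re.findall(r"\[Source:\s*([^,\]]+)", answer)
    return [doc_id.strip() for doc_id in cited
            if doc_id.strip() not in known_doc_ids]
```

An empty return list means every citation is grounded; non-empty means the answer cites sources the retriever never produced, a common hallucination signal.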

Citation Policy Template

```markdown
## Citation Requirements

1. **Every factual claim** must reference a source document
2. **Format**: Use [Source: document_id, page X] inline
3. **Unsupported claims**: Prefix with "Based on general knowledge:"
4. **Conflicting sources**: Present both views with sources
5. **No source found**: Explicitly state "No relevant source found"
```

📊 Evaluation Metrics

Retrieval Metrics

| Metric | Formula | Target |
| --- | --- | --- |
| Recall@K | Relevant retrieved / Total relevant | > 0.8 |
| Precision@K | Relevant retrieved / K | > 0.6 |
| MRR | Mean reciprocal rank of first relevant result | > 0.7 |
| NDCG | Normalized discounted cumulative gain | > 0.75 |
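The first three metrics are straightforward to compute; a sketch, with relevance judgments given as collections of document ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top K."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top K results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first
    relevant result, over all queries (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```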

End-to-End Metrics

| Metric | Description | Measurement |
| --- | --- | --- |
| Faithfulness | Answer grounded in context | LLM-as-judge |
| Answer Relevance | Answer addresses the query | LLM-as-judge |
| Context Relevance | Retrieved context quality | Embedding similarity |
| Groundedness | Claims supported by sources | Citation verification |

📋 RAG Engineering Checklist

Design Phase

  • [ ] Define retrieval quality targets (Recall@K, MRR)
  • [ ] Choose chunking strategy based on document types
  • [ ] Select embedding model (dimension, domain fit)
  • [ ] Design hybrid search weights

Implementation Phase

  • [ ] Implement chunking with overlap handling
  • [ ] Set up vector store with proper indexing
  • [ ] Add reranking layer
  • [ ] Implement citation extraction

Production Phase

  • [ ] Monitor retrieval metrics continuously
  • [ ] Set up A/B testing for prompt variations
  • [ ] Implement feedback loop for relevance
  • [ ] Regular re-indexing schedule

📚 Further Reading

  • "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" - Lewis et al.
  • "LlamaIndex Documentation" - RAG implementation patterns
  • "LangChain RAG Guide" - Production RAG patterns