Giao diện
Longest Common Subsequence (LCS) - The Diff Engine
"Mỗi lần bạn dùng
git diff, bạn đang chạy LCS." - HPN
Problem Statement
Cho 2 strings, tìm subsequence chung dài nhất.
Subsequence = Một phần của string, giữ nguyên thứ tự, không cần liên tiếp.
String 1: "ABCDGH"
String 2: "AEDFHR"
LCS = "ADH" (length 3)
A B C D G H
↓ ↓ ↓
A E D F H R📘 Subsequence vs Substring
- Substring: Liên tiếp → "BCD" trong "ABCDE"
- Subsequence: Không cần liên tiếp → "ACE" trong "ABCDE" ✅
Real-World Applications
| Domain | Ứng dụng | Chi tiết |
|---|---|---|
| Version Control | git diff | So sánh file changes, hiển thị +/- lines |
| Bioinformatics | DNA Sequencing | So sánh genome sequences, tìm evolutionary relationship |
| Spell Check | Auto-correct | Gợi ý từ gần đúng nhất |
| Plagiarism Detection | Turnitin | Tìm đoạn văn giống nhau giữa 2 documents |
| File Comparison | diff command | Merge conflicts, patch files |
Phân Tích DP
Nhận diện DP patterns
| Checklist | Có? | Giải thích |
|---|---|---|
| Overlapping Subproblems | ✅ | lcs(i,j) được gọi nhiều lần |
| Optimal Substructure | ✅ | LCS của prefixes → LCS của cả string |
| "Longest/Shortest" signal | ✅ | "Longest common..." |
State Definition
dp[i][j] = Length of LCS between:
- s1[0..i-1] (i ký tự đầu của s1)
- s2[0..j-1] (j ký tự đầu của s2)Transition (Công thức chuyển trạng thái)
python
if s1[i-1] == s2[j-1]:
# Ký tự cuối giống nhau → Thêm vào LCS
dp[i][j] = dp[i-1][j-1] + 1
else:
# Khác nhau → Bỏ 1 trong 2 ký tự, lấy max
dp[i][j] = max(dp[i-1][j], dp[i][j-1])2D DP Table Visualization
s1 = "AGGTAB"
s2 = "GXTXAYB"
s2 →
"" G X T X A Y B
┌─────────────────────────────────┐
"" │ 0 0 0 0 0 0 0 0 │
A │ 0 0 0 0 0 1 1 1 │
s1 G │ 0 1 1 1 1 1 1 1 │
↓ G │ 0 1 1 1 1 1 1 1 │
T │ 0 1 1 2 2 2 2 2 │
A │ 0 1 1 2 2 3 3 3 │
B │ 0 1 1 2 2 3 3 4 │
└─────────────────────────────────┘
↑
LCS length = 4
Traceback: GTAB💡 Đọc bảng như thế nào?
dp[i][j]= LCS length củas1[:i]vàs2[:j]- Nếu
s1[i-1] == s2[j-1]: đi chéo ↖ (thêm vào LCS) - Nếu khác: đi theo hướng có giá trị lớn hơn (↑ hoặc ←)
Step-by-Step Implementation
Step 1: Recursive (Intuition) - O(2^N) ❌
python
def lcs_naive(s1: str, s2: str, i: int, j: int) -> int:
"""
❌ BRUTE FORCE - Chỉ để hiểu logic.
Time: O(2^(M+N))
Space: O(M+N) call stack
"""
# Base case: Một trong hai string rỗng
if i == 0 or j == 0:
return 0
# Nếu ký tự cuối giống nhau → Đã tìm thấy 1 common char
if s1[i-1] == s2[j-1]:
return 1 + lcs_naive(s1, s2, i-1, j-1)
# Khác nhau → Thử bỏ từng ký tự, lấy max
return max(
lcs_naive(s1, s2, i-1, j), # Bỏ ký tự cuối của s1
lcs_naive(s1, s2, i, j-1) # Bỏ ký tự cuối của s2
)Step 2: Top-Down Memoization - O(M×N) ✅
python
from functools import lru_cache
def lcs_memo(s1: str, s2: str) -> int:
"""
✅ MEMOIZATION - Cache subproblems.
Time: O(M × N)
Space: O(M × N)
"""
@lru_cache(maxsize=None)
def dp(i: int, j: int) -> int:
if i == 0 or j == 0:
return 0
if s1[i-1] == s2[j-1]:
return 1 + dp(i-1, j-1)
return max(dp(i-1, j), dp(i, j-1))
return dp(len(s1), len(s2))Step 3: Bottom-Up Tabulation - O(M×N) ✅
python
def lcs_tabulation(s1: str, s2: str) -> int:
"""
✅ CLASSIC DP - Build table iteratively.
Time: O(M × N)
Space: O(M × N)
"""
m, n = len(s1), len(s2)
# dp[i][j] = LCS length của s1[:i] và s2[:j]
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
return dp[m][n]Step 4: Traceback - Get Actual LCS String ✅
python
def lcs_with_string(s1: str, s2: str) -> tuple[int, str]:
"""
✅ PRODUCTION - Returns both length AND actual LCS string.
Time: O(M × N)
Space: O(M × N)
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
# Build DP table
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
# Traceback to reconstruct LCS
lcs = []
i, j = m, n
while i > 0 and j > 0:
if s1[i-1] == s2[j-1]:
lcs.append(s1[i-1])
i -= 1
j -= 1
elif dp[i-1][j] > dp[i][j-1]:
i -= 1
else:
j -= 1
lcs.reverse()
return dp[m][n], ''.join(lcs)Step 5: Space-Optimized - O(min(M,N)) ✅✅
python
def lcs_optimized(s1: str, s2: str) -> int:
"""
✅✅ PRODUCTION-READY - Space optimization.
Time: O(M × N)
Space: O(min(M, N))
⚠️ Không thể traceback với approach này!
"""
# Đảm bảo s2 ngắn hơn để tối ưu space
if len(s1) < len(s2):
s1, s2 = s2, s1
m, n = len(s1), len(s2)
prev = [0] * (n + 1)
curr = [0] * (n + 1)
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
curr[j] = prev[j-1] + 1
else:
curr[j] = max(prev[j], curr[j-1])
prev, curr = curr, prev
return prev[n]Production Code (Full Version)
python
# HPN Engineering Standard
# Implementation: LCS - All Approaches + Real-World Applications
from functools import lru_cache
from typing import List, Tuple
from enum import Enum
# ============================================
# CORE IMPLEMENTATIONS
# ============================================
def lcs_length(s1: str, s2: str) -> int:
"""
✅ O(M×N) time, O(min(M,N)) space.
Returns: LCS length only (space-optimized).
"""
if len(s1) < len(s2):
s1, s2 = s2, s1
n = len(s2)
prev = [0] * (n + 1)
curr = [0] * (n + 1)
for i in range(len(s1)):
for j in range(n):
if s1[i] == s2[j]:
curr[j+1] = prev[j] + 1
else:
curr[j+1] = max(prev[j+1], curr[j])
prev, curr = curr, prev
return prev[n]
def lcs_string(s1: str, s2: str) -> Tuple[int, str]:
"""
✅ Returns both LCS length and the actual LCS string.
Requires full DP table for traceback.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
# Traceback
lcs = []
i, j = m, n
while i > 0 and j > 0:
if s1[i-1] == s2[j-1]:
lcs.append(s1[i-1])
i -= 1
j -= 1
elif dp[i-1][j] > dp[i][j-1]:
i -= 1
else:
j -= 1
return dp[m][n], ''.join(reversed(lcs))
# ============================================
# REAL-WORLD: GIT DIFF IMPLEMENTATION
# ============================================
class DiffOp(Enum):
KEEP = " "
ADD = "+"
DELETE = "-"
def compute_diff(old_lines: List[str], new_lines: List[str]) -> List[Tuple[DiffOp, str]]:
"""
Simplified git diff using LCS.
Returns list of (operation, line) tuples.
"""
m, n = len(old_lines), len(new_lines)
# Build DP table
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if old_lines[i-1] == new_lines[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
# Traceback để tạo diff
result = []
i, j = m, n
while i > 0 or j > 0:
if i > 0 and j > 0 and old_lines[i-1] == new_lines[j-1]:
result.append((DiffOp.KEEP, old_lines[i-1]))
i -= 1
j -= 1
elif j > 0 and (i == 0 or dp[i][j-1] >= dp[i-1][j]):
result.append((DiffOp.ADD, new_lines[j-1]))
j -= 1
else:
result.append((DiffOp.DELETE, old_lines[i-1]))
i -= 1
result.reverse()
return result
def format_diff(diff: List[Tuple[DiffOp, str]], context: int = 3) -> str:
"""
Format diff output like git diff.
"""
lines = []
for op, line in diff:
prefix = op.value
lines.append(f"{prefix} {line}")
return '\n'.join(lines)
# ============================================
# VARIANTS
# ============================================
def longest_common_substring(s1: str, s2: str) -> Tuple[int, str]:
"""
Variant: Longest Common SUBSTRING (contiguous).
Key difference: Reset to 0 when chars don't match.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
max_len = 0
end_pos = 0
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
if dp[i][j] > max_len:
max_len = dp[i][j]
end_pos = i
# else: dp[i][j] = 0 (already initialized)
substring = s1[end_pos - max_len : end_pos]
return max_len, substring
def edit_distance(s1: str, s2: str) -> int:
"""
Related: Levenshtein Distance (Minimum Edit Distance).
Số operations tối thiểu (insert, delete, replace) để biến s1 → s2.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
# Base cases
for i in range(m + 1):
dp[i][0] = i # Delete all chars from s1
for j in range(n + 1):
dp[0][j] = j # Insert all chars to get s2
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] # No operation needed
else:
dp[i][j] = 1 + min(
dp[i-1][j], # Delete from s1
dp[i][j-1], # Insert to s1
dp[i-1][j-1] # Replace
)
return dp[m][n]
def shortest_common_supersequence(s1: str, s2: str) -> str:
"""
Related: Shortest Common Supersequence.
String ngắn nhất chứa cả s1 và s2 như subsequence.
Length = len(s1) + len(s2) - LCS_length
"""
m, n = len(s1), len(s2)
# Build LCS DP table
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
# Build SCS by traceback
scs = []
i, j = m, n
while i > 0 and j > 0:
if s1[i-1] == s2[j-1]:
scs.append(s1[i-1])
i -= 1
j -= 1
elif dp[i-1][j] > dp[i][j-1]:
scs.append(s1[i-1])
i -= 1
else:
scs.append(s2[j-1])
j -= 1
# Add remaining
while i > 0:
scs.append(s1[i-1])
i -= 1
while j > 0:
scs.append(s2[j-1])
j -= 1
return ''.join(reversed(scs))
# ============================================
# USAGE EXAMPLE
# ============================================
if __name__ == "__main__":
# Basic LCS
s1 = "AGGTAB"
s2 = "GXTXAYB"
print(f"=== Longest Common Subsequence ===")
print(f"s1 = '{s1}'")
print(f"s2 = '{s2}'")
length, lcs = lcs_string(s1, s2)
print(f"LCS = '{lcs}' (length {length})")
# Git Diff simulation
print(f"\n=== Git Diff Simulation ===")
old_file = [
"def hello():",
" print('Hello')",
" return True",
]
new_file = [
"def hello():",
" print('Hello, World!')",
" log('called')",
" return True",
]
diff = compute_diff(old_file, new_file)
print(format_diff(diff))
# Edit distance
print(f"\n=== Edit Distance ===")
word1, word2 = "kitten", "sitting"
dist = edit_distance(word1, word2)
print(f"'{word1}' → '{word2}': {dist} operations")
# Longest Common Substring
print(f"\n=== Longest Common Substring ===")
s1, s2 = "OldSite:GeeksforGeeks.org", "NewSite:GeeksQuiz.com"
length, substr = longest_common_substring(s1, s2)
print(f"Substring: '{substr}' (length {length})")Complexity Analysis
| Approach | Time | Space | Use Case |
|---|---|---|---|
| Naive Recursion | Chỉ để học | ||
| Memoization | General | ||
| Tabulation | Cần traceback | ||
| Space-Optimized | Chỉ cần length |
LCS Family Problems
Interview Cheat Sheet
| Problem | Clue | Approach |
|---|---|---|
| LCS | "Common", "Both sequences" | 2D DP, traceback |
| Substring | "Contiguous", "Substring" | 2D DP, reset on mismatch |
| Edit Distance | "Minimum operations", "Transform" | 3 choices: insert, delete, replace |
| SCS | "Supersequence", "Contains both" | LCS + include non-matching |
💡 HPN's Rule
"Thấy 2 strings + 'longest/shortest common' → LCS với 2D DP. Traceback để lấy actual string."