Skip to content

Longest Common Subsequence (LCS) - The Diff Engine

"Mỗi lần bạn dùng git diff, bạn đang chạy LCS." - HPN

Problem Statement

Cho 2 strings, tìm subsequence chung dài nhất.

Subsequence = Một phần của string, giữ nguyên thứ tự, không cần liên tiếp.

String 1: "ABCDGH"
String 2: "AEDFHR"

LCS = "ADH" (length 3)

    A B C D G H
    ↓     ↓   ↓
    A E D F H R

📘 Subsequence vs Substring

  • Substring: Liên tiếp → "BCD" trong "ABCDE"
  • Subsequence: Không cần liên tiếp → "ACE" trong "ABCDE" ✅

Real-World Applications

DomainỨng dụngChi tiết
Version Controlgit diffSo sánh file changes, hiển thị +/- lines
BioinformaticsDNA SequencingSo sánh genome sequences, tìm evolutionary relationship
Spell CheckAuto-correctGợi ý từ gần đúng nhất
Plagiarism DetectionTurnitinTìm đoạn văn giống nhau giữa 2 documents
File Comparisondiff commandMerge conflicts, patch files

Phân Tích DP

Nhận diện DP patterns

ChecklistCó?Giải thích
Overlapping Subproblemslcs(i,j) được gọi nhiều lần
Optimal SubstructureLCS của prefixes → LCS của cả string
"Longest/Shortest" signal"Longest common..."

State Definition

dp[i][j] = Length of LCS between:
           - s1[0..i-1] (i ký tự đầu của s1)
           - s2[0..j-1] (j ký tự đầu của s2)

Transition (Công thức chuyển trạng thái)

python
if s1[i-1] == s2[j-1]:
    # Ký tự cuối giống nhau → Thêm vào LCS
    dp[i][j] = dp[i-1][j-1] + 1
else:
    # Khác nhau → Bỏ 1 trong 2 ký tự, lấy max
    dp[i][j] = max(dp[i-1][j], dp[i][j-1])

2D DP Table Visualization

s1 = "AGGTAB"
s2 = "GXTXAYB"

         s2 →
         ""  G   X   T   X   A   Y   B
       ┌─────────────────────────────────┐
    "" │ 0   0   0   0   0   0   0   0   │
     A │ 0   0   0   0   0   1   1   1   │
  s1 G │ 0   1   1   1   1   1   1   1   │
   ↓ G │ 0   1   1   1   1   1   1   1   │
     T │ 0   1   1   2   2   2   2   2   │
     A │ 0   1   1   2   2   3   3   3   │
     B │ 0   1   1   2   2   3   3   4   │
       └─────────────────────────────────┘

                                LCS length = 4
                                   
Traceback: GTAB

💡 Đọc bảng như thế nào?

  • dp[i][j] = LCS length của s1[:i]s2[:j]
  • Nếu s1[i-1] == s2[j-1]: đi chéo ↖ (thêm vào LCS)
  • Nếu khác: đi theo hướng có giá trị lớn hơn (↑ hoặc ←)

Step-by-Step Implementation

Step 1: Recursive (Intuition) - O(2^N)

python
def lcs_naive(s1: str, s2: str, i: int, j: int) -> int:
    """
    ❌ BRUTE FORCE - Chỉ để hiểu logic.
    Time: O(2^(M+N))
    Space: O(M+N) call stack
    """
    # Base case: Một trong hai string rỗng
    if i == 0 or j == 0:
        return 0
    
    # Nếu ký tự cuối giống nhau → Đã tìm thấy 1 common char
    if s1[i-1] == s2[j-1]:
        return 1 + lcs_naive(s1, s2, i-1, j-1)
    
    # Khác nhau → Thử bỏ từng ký tự, lấy max
    return max(
        lcs_naive(s1, s2, i-1, j),   # Bỏ ký tự cuối của s1
        lcs_naive(s1, s2, i, j-1)    # Bỏ ký tự cuối của s2
    )

Step 2: Top-Down Memoization - O(M×N)

python
from functools import lru_cache

def lcs_memo(s1: str, s2: str) -> int:
    """
    ✅ MEMOIZATION - Cache subproblems.
    Time: O(M × N)
    Space: O(M × N)
    """
    @lru_cache(maxsize=None)
    def dp(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0
        
        if s1[i-1] == s2[j-1]:
            return 1 + dp(i-1, j-1)
        
        return max(dp(i-1, j), dp(i, j-1))
    
    return dp(len(s1), len(s2))

Step 3: Bottom-Up Tabulation - O(M×N)

python
def lcs_tabulation(s1: str, s2: str) -> int:
    """
    ✅ CLASSIC DP - Build table iteratively.
    Time: O(M × N)
    Space: O(M × N)
    """
    m, n = len(s1), len(s2)
    
    # dp[i][j] = LCS length của s1[:i] và s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]

Step 4: Traceback - Get Actual LCS String

python
def lcs_with_string(s1: str, s2: str) -> tuple[int, str]:
    """
    ✅ PRODUCTION - Returns both length AND actual LCS string.
    Time: O(M × N)
    Space: O(M × N)
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Build DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    # Traceback to reconstruct LCS
    lcs = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if s1[i-1] == s2[j-1]:
            lcs.append(s1[i-1])
            i -= 1
            j -= 1
        elif dp[i-1][j] > dp[i][j-1]:
            i -= 1
        else:
            j -= 1
    
    lcs.reverse()
    return dp[m][n], ''.join(lcs)

Step 5: Space-Optimized - O(min(M,N))

python
def lcs_optimized(s1: str, s2: str) -> int:
    """
    ✅✅ PRODUCTION-READY - Space optimization.
    Time: O(M × N)
    Space: O(min(M, N))
    
    ⚠️ Không thể traceback với approach này!
    """
    # Đảm bảo s2 ngắn hơn để tối ưu space
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    
    m, n = len(s1), len(s2)
    
    prev = [0] * (n + 1)
    curr = [0] * (n + 1)
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                curr[j] = prev[j-1] + 1
            else:
                curr[j] = max(prev[j], curr[j-1])
        
        prev, curr = curr, prev
    
    return prev[n]

Production Code (Full Version)

python
# HPN Engineering Standard
# Implementation: LCS - All Approaches + Real-World Applications

from functools import lru_cache
from typing import List, Tuple
from enum import Enum


# ============================================
# CORE IMPLEMENTATIONS
# ============================================

def lcs_length(s1: str, s2: str) -> int:
    """
    ✅ O(M×N) time, O(min(M,N)) space.
    Returns: LCS length only (space-optimized).
    """
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    
    n = len(s2)
    prev = [0] * (n + 1)
    curr = [0] * (n + 1)
    
    for i in range(len(s1)):
        for j in range(n):
            if s1[i] == s2[j]:
                curr[j+1] = prev[j] + 1
            else:
                curr[j+1] = max(prev[j+1], curr[j])
        prev, curr = curr, prev
    
    return prev[n]


def lcs_string(s1: str, s2: str) -> Tuple[int, str]:
    """
    ✅ Returns both LCS length and the actual LCS string.
    Requires full DP table for traceback.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    # Traceback
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i-1] == s2[j-1]:
            lcs.append(s1[i-1])
            i -= 1
            j -= 1
        elif dp[i-1][j] > dp[i][j-1]:
            i -= 1
        else:
            j -= 1
    
    return dp[m][n], ''.join(reversed(lcs))


# ============================================
# REAL-WORLD: GIT DIFF IMPLEMENTATION
# ============================================

class DiffOp(Enum):
    KEEP = " "
    ADD = "+"
    DELETE = "-"


def compute_diff(old_lines: List[str], new_lines: List[str]) -> List[Tuple[DiffOp, str]]:
    """
    Simplified git diff using LCS.
    
    Returns list of (operation, line) tuples.
    """
    m, n = len(old_lines), len(new_lines)
    
    # Build DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old_lines[i-1] == new_lines[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    # Traceback để tạo diff
    result = []
    i, j = m, n
    
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old_lines[i-1] == new_lines[j-1]:
            result.append((DiffOp.KEEP, old_lines[i-1]))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j-1] >= dp[i-1][j]):
            result.append((DiffOp.ADD, new_lines[j-1]))
            j -= 1
        else:
            result.append((DiffOp.DELETE, old_lines[i-1]))
            i -= 1
    
    result.reverse()
    return result


def format_diff(diff: List[Tuple[DiffOp, str]], context: int = 3) -> str:
    """
    Format diff output like git diff.
    """
    lines = []
    for op, line in diff:
        prefix = op.value
        lines.append(f"{prefix} {line}")
    return '\n'.join(lines)


# ============================================
# VARIANTS
# ============================================

def longest_common_substring(s1: str, s2: str) -> Tuple[int, str]:
    """
    Variant: Longest Common SUBSTRING (contiguous).
    
    Key difference: Reset to 0 when chars don't match.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    max_len = 0
    end_pos = 0
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
                if dp[i][j] > max_len:
                    max_len = dp[i][j]
                    end_pos = i
            # else: dp[i][j] = 0 (already initialized)
    
    substring = s1[end_pos - max_len : end_pos]
    return max_len, substring


def edit_distance(s1: str, s2: str) -> int:
    """
    Related: Levenshtein Distance (Minimum Edit Distance).
    
    Số operations tối thiểu (insert, delete, replace) để biến s1 → s2.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Base cases
    for i in range(m + 1):
        dp[i][0] = i  # Delete all chars from s1
    for j in range(n + 1):
        dp[0][j] = j  # Insert all chars to get s2
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]  # No operation needed
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],      # Delete from s1
                    dp[i][j-1],      # Insert to s1
                    dp[i-1][j-1]     # Replace
                )
    
    return dp[m][n]


def shortest_common_supersequence(s1: str, s2: str) -> str:
    """
    Related: Shortest Common Supersequence.
    
    String ngắn nhất chứa cả s1 và s2 như subsequence.
    Length = len(s1) + len(s2) - LCS_length
    """
    m, n = len(s1), len(s2)
    
    # Build LCS DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    # Build SCS by traceback
    scs = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if s1[i-1] == s2[j-1]:
            scs.append(s1[i-1])
            i -= 1
            j -= 1
        elif dp[i-1][j] > dp[i][j-1]:
            scs.append(s1[i-1])
            i -= 1
        else:
            scs.append(s2[j-1])
            j -= 1
    
    # Add remaining
    while i > 0:
        scs.append(s1[i-1])
        i -= 1
    while j > 0:
        scs.append(s2[j-1])
        j -= 1
    
    return ''.join(reversed(scs))


# ============================================
# USAGE EXAMPLE
# ============================================

if __name__ == "__main__":
    # Basic LCS
    s1 = "AGGTAB"
    s2 = "GXTXAYB"
    
    print(f"=== Longest Common Subsequence ===")
    print(f"s1 = '{s1}'")
    print(f"s2 = '{s2}'")
    
    length, lcs = lcs_string(s1, s2)
    print(f"LCS = '{lcs}' (length {length})")
    
    # Git Diff simulation
    print(f"\n=== Git Diff Simulation ===")
    old_file = [
        "def hello():",
        "    print('Hello')",
        "    return True",
    ]
    new_file = [
        "def hello():",
        "    print('Hello, World!')",
        "    log('called')",
        "    return True",
    ]
    
    diff = compute_diff(old_file, new_file)
    print(format_diff(diff))
    
    # Edit distance
    print(f"\n=== Edit Distance ===")
    word1, word2 = "kitten", "sitting"
    dist = edit_distance(word1, word2)
    print(f"'{word1}' → '{word2}': {dist} operations")
    
    # Longest Common Substring
    print(f"\n=== Longest Common Substring ===")
    s1, s2 = "OldSite:GeeksforGeeks.org", "NewSite:GeeksQuiz.com"
    length, substr = longest_common_substring(s1, s2)
    print(f"Substring: '{substr}' (length {length})")

Complexity Analysis

ApproachTimeSpaceUse Case
Naive RecursionO(2M+N)O(M+N)Chỉ để học
MemoizationO(M×N)O(M×N)General
TabulationO(M×N)O(M×N)Cần traceback
Space-OptimizedO(M×N)O(min(M,N))Chỉ cần length

LCS Family Problems

Interview Cheat Sheet

ProblemClueApproach
LCS"Common", "Both sequences"2D DP, traceback
Substring"Contiguous", "Substring"2D DP, reset on mismatch
Edit Distance"Minimum operations", "Transform"3 choices: insert, delete, replace
SCS"Supersequence", "Contains both"LCS + include non-matching

💡 HPN's Rule

"Thấy 2 strings + 'longest/shortest common' → LCS với 2D DP. Traceback để lấy actual string."