Performance Engineering

"Measure, don't guess. Profile, don't assume."

1. Benchmarking with Criterion

Why Criterion?

          Criterion vs Built-in Benchmarks
┌────────────────────────────────────────────────────────┐
│                                                        │
│   Feature              #[bench]       Criterion        │
│   ───────              ────────       ─────────        │
│                                                        │
│   Statistical rigor    Basic          Advanced         │
│   Warm-up runs         No             Yes              │
│   Outlier detection    No             Yes              │
│   History comparison   No             Yes              │
│   HTML reports         No             Yes              │
│   Stable Rust          No (nightly)   Yes              │
│                                                        │
└────────────────────────────────────────────────────────┘

Setup

toml

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "my_benchmark"
harness = false

Basic benchmark

rust

// benches/my_benchmark.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| {
        b.iter(|| fibonacci(black_box(20)))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Understanding black_box

rust

use criterion::black_box;

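// black_box is a best-effort optimization barrier. It discourages the compiler from:
// 1. Optimizing away results that are never used
// 2. Constant-folding the inputs
// 3. Hoisting code out of the timing loop

// ❌ Without black_box: the call may be folded to a constant or removed
b.iter(|| fibonacci(20));

// ✅ With black_box: the input is opaque, so the work is actually performed
b.iter(|| fibonacci(black_box(20)));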
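
The same barrier is available outside Criterion as `std::hint::black_box`. The following is a minimal sketch using only the standard library; `time_it` and `time_fib_20` are hypothetical helper names chosen for illustration. It has none of Criterion's warm-up or statistics, so treat it as a way to see where `black_box` goes, not as a replacement.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Hypothetical helper: runs `f` exactly `iters` times and returns the total
/// elapsed time. Every result goes through black_box so that the call cannot
/// be discarded as unused.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}

/// Times fibonacci(20) with an opaque input, so the argument is not
/// constant-folded at compile time.
fn time_fib_20(iters: u32) -> Duration {
    time_it(iters, || fibonacci(black_box(20)))
}
```

Calling `time_fib_20(1_000)` in a release build gives a rough total; dividing by the iteration count gives a crude per-call figure without any outlier handling.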

Comparing implementations

rust

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

fn fibonacci_recursive(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
    }
}

fn fibonacci_iterative(n: u64) -> u64 {
    let mut a = 0;
    let mut b = 1;
    for _ in 0..n {
        let temp = a;
        a = b;
        b = temp + b;
    }
    a
}

fn comparison_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Fibonacci");
    
    for n in [10, 20, 30].iter() {
        group.bench_with_input(
            BenchmarkId::new("Recursive", n),
            n,
            |b, &n| b.iter(|| fibonacci_recursive(black_box(n))),
        );
        
        group.bench_with_input(
            BenchmarkId::new("Iterative", n),
            n,
            |b, &n| b.iter(|| fibonacci_iterative(black_box(n))),
        );
    }
    
    group.finish();
}
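
Before comparing timings, it is worth confirming that both implementations return the same values; otherwise the benchmark compares different work. A minimal sketch, assuming the two functions shown in this section; `implementations_agree` is a hypothetical helper, and it is meant for small `n` only because the recursive version is exponential.

```rust
fn fibonacci_recursive(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
    }
}

fn fibonacci_iterative(n: u64) -> u64 {
    let (mut a, mut b) = (0u64, 1u64);
    for _ in 0..n {
        // Advance the pair (F(k), F(k+1)) to (F(k+1), F(k+2)).
        (a, b) = (b, a + b);
    }
    a
}

/// Hypothetical helper: true if both versions agree for every n in 0..=max_n.
fn implementations_agree(max_n: u64) -> bool {
    (0..=max_n).all(|n| fibonacci_recursive(n) == fibonacci_iterative(n))
}
```

Running such a check once in a unit test keeps it out of the timed benchmark loop.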

Throughput Benchmarks

rust

use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput};

fn process_data(data: &[u8]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn throughput_benchmark(c: &mut Criterion) {
    let data: Vec<u8> = (0..1024).map(|i| i as u8).collect();
    
    let mut group = c.benchmark_group("Processing");
    group.throughput(Throughput::Bytes(data.len() as u64));
    
    group.bench_function("sum", |b| {
        b.iter(|| process_data(black_box(&data)))
    });
    
    group.finish();
}
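
The throughput figure is the declared byte count divided by the measured time per iteration. The arithmetic can be sketched as follows; `bytes_per_second` and `mib_per_second` are hypothetical helpers for illustration, not part of Criterion's API.

```rust
use std::time::Duration;

/// Hypothetical helper: bytes processed per second, given that one iteration
/// handled `bytes` bytes in `elapsed`.
fn bytes_per_second(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64()
}

/// Same figure expressed in MiB/s (1 MiB = 1024 * 1024 bytes).
fn mib_per_second(bytes: u64, elapsed: Duration) -> f64 {
    bytes_per_second(bytes, elapsed) / (1024.0 * 1024.0)
}
```

For the 1024-byte buffer above, an iteration time of one microsecond would correspond to roughly 1.024 GB/s.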

Running benchmarks

bash

# Run all benchmarks
cargo bench

# Run only the benchmarks whose name matches a filter
cargo bench -- "fib"

# Generate the HTML report (needs the html_reports feature),
# then open target/criterion/report/index.html in a browser
cargo bench

# Compare against a saved baseline
cargo bench -- --save-baseline main
# ... make your changes ...
cargo bench -- --baseline main

2. Profile-Guided Optimization (PGO)

What is PGO?

                   PGO workflow
┌────────────────────────────────────────────────────────┐
│                                                        │
│   Step 1: Instrumented build                           │
│   ──────────────────────────                           │
│   • Compile with profiling instrumentation             │
│   • The binary records execution data as it runs       │
│                                                        │
│   Step 2: Collect profiles                             │
│   ────────────────────────                             │
│   • Run a representative workload                      │
│   • Produces .profraw files (merged to .profdata)      │
│                                                        │
│   Step 3: Optimized build                              │
│   ───────────────────────                              │
│   • Recompile using the profile data                   │
│   • The compiler makes better-informed decisions on:   │
│     - Function inlining                                │
│     - Branch prediction hints                          │
│     - Code layout optimization                         │
│                                                        │
└────────────────────────────────────────────────────────┘

PGO steps

bash

# Step 1: Build with instrumentation
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" \
    cargo build --release

# Step 2: Run a representative workload (writes .profraw files to /tmp/pgo-data)
./target/release/my_app < typical_input.txt

# Step 3: Merge the raw profiles
# (llvm-profdata is available through the llvm-tools-preview rustup component
#  or a matching LLVM installation)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Step 4: Rebuild using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" \
    cargo build --release

Impact of PGO

       Example performance improvements (indicative)
┌────────────────────────────────────────────────────────┐
│                                                        │
│   Workload                    Improvement              │
│   ────────                    ───────────              │
│                                                        │
│   Compiler (rustc)            5-15%                    │
│   Web servers                 10-20%                   │
│   Database queries            15-25%                   │
│   Parsers                     10-30%                   │
│   Crypto operations           5-10%                    │
│                                                        │
│   Best for: code with predictable hot paths            │
│   Less effective for: highly polymorphic code          │
│                                                        │
└────────────────────────────────────────────────────────┘
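
To make "predictable hot path" concrete, consider a byte classifier whose first branch is taken for almost every byte of typical input. This is a hypothetical sketch (the `classify` function and the mostly-ASCII workload are assumptions made for illustration): a profile collected on such input tells the compiler that the first arm dominates, which is the kind of information PGO uses for branch weighting and code layout.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Class {
    Ascii,        // 0x00..=0x7F: the hot path for mostly-ASCII text
    Continuation, // 0x80..=0xBF: UTF-8 continuation bytes
    Lead,         // 0xC0..=0xFF: multi-byte lead (or invalid) bytes
}

fn classify(byte: u8) -> Class {
    if byte < 0x80 {
        Class::Ascii
    } else if byte < 0xC0 {
        Class::Continuation
    } else {
        Class::Lead
    }
}

/// Counts how often each class occurs; index = the enum's discriminant.
fn count_classes(input: &[u8]) -> [usize; 3] {
    let mut counts = [0usize; 3];
    for &b in input {
        counts[classify(b) as usize] += 1;
    }
    counts
}
```

If production input were mostly non-ASCII instead, a profile gathered on ASCII text would mislead the optimizer, which is why the training workload needs to be representative.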

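3. Low-level optimization

SIMD with std::simd (Nightly)

rust

#![feature(portable_simd)]
use std::simd::f32x4;
// reduce_sum comes from the SimdFloat trait. On recent nightlies it lives in
// std::simd::num; older nightlies exported it as std::simd::SimdFloat.
use std::simd::num::SimdFloat;

fn dot_product_simd(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let va = f32x4::from_array(*a);
    let vb = f32x4::from_array(*b);
    (va * vb).reduce_sum()
}

// On stable Rust, use the std::arch intrinsics or a stable SIMD crate such as `wide`.
// (packed_simd2 itself requires nightly.)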
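
On stable Rust, a dot product over longer slices can also be written so that the optimizer has a good chance of auto-vectorizing it. The sketch below keeps four independent accumulators, mirroring the four lanes of `f32x4`. The name `dot_product_chunked` is hypothetical, and whether SIMD instructions are actually emitted depends on the target and optimization level, so check the generated assembly or benchmark before relying on it.

```rust
/// Dot product over equal-length slices, processed four elements at a time.
fn dot_product_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");

    let a_chunks = a.chunks_exact(4);
    let b_chunks = b.chunks_exact(4);

    // Elements left over when the length is not a multiple of four.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();

    // Four independent partial sums, one per "lane".
    let mut acc = [0.0f32; 4];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for lane in 0..4 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }

    acc.iter().sum::<f32>() + tail
}
```

Because the partial sums are added in a different order than a plain left-to-right loop, results can differ in the last bits for general floating-point input.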

Cache-friendly data layout

rust

// ❌ Array of Structs (AoS): poor cache locality when a pass touches only one field
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
    mass: f32,
}
let particles: Vec<Particle> = vec![...];

// ✅ Struct of Arrays (SoA): each field is contiguous, so per-field passes are cache-friendly
struct ParticleSystem {
    positions_x: Vec<f32>,
    positions_y: Vec<f32>,
    positions_z: Vec<f32>,
    velocities_x: Vec<f32>,
    velocities_y: Vec<f32>,
    velocities_z: Vec<f32>,
    masses: Vec<f32>,
}

// Update every X position (sequential memory access).
// Zipping the two vectors avoids per-index bounds checks.
fn update_x(sys: &mut ParticleSystem, dt: f32) {
    for (x, vx) in sys.positions_x.iter_mut().zip(&sys.velocities_x) {
        *x += vx * dt;
    }
}
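
Existing AoS data can be converted to the SoA layout in a single pass. A minimal sketch, reduced to the X axis and mass for brevity; `from_particles` is a hypothetical helper name, and a full version would fill the Y and Z vectors the same way.

```rust
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
    mass: f32,
}

#[derive(Default)]
struct ParticleSystem {
    positions_x: Vec<f32>,
    velocities_x: Vec<f32>,
    masses: Vec<f32>,
}

impl ParticleSystem {
    /// Hypothetical helper: splits each particle's fields into per-field vectors.
    fn from_particles(particles: &[Particle]) -> Self {
        let mut sys = ParticleSystem::default();
        for p in particles {
            sys.positions_x.push(p.position[0]);
            sys.velocities_x.push(p.velocity[0]);
            sys.masses.push(p.mass);
        }
        sys
    }

    /// Advances every X position by velocity * dt, reading two contiguous arrays.
    fn update_x(&mut self, dt: f32) {
        for (x, vx) in self.positions_x.iter_mut().zip(&self.velocities_x) {
            *x += vx * dt;
        }
    }
}
```

Whether SoA wins in practice depends on the access pattern: code that reads every field of one particle at a time may be better served by AoS.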

Avoiding allocation

rust

// ❌ Allocates on every call
fn process(data: &[u8]) -> Vec<u8> {
    // wrapping_mul: a plain `x * 2` panics on u8 overflow in debug builds
    data.iter().map(|&x| x.wrapping_mul(2)).collect()
}

// ✅ Reuse a caller-owned buffer
fn process_into(data: &[u8], buffer: &mut Vec<u8>) {
    buffer.clear(); // keeps the existing capacity
    buffer.extend(data.iter().map(|&x| x.wrapping_mul(2)));
}

// ✅ Or use a stack array for small, fixed sizes
fn process_small(data: &[u8; 32]) -> [u8; 32] {
    let mut result = [0u8; 32];
    for i in 0..32 {
        result[i] = data[i].wrapping_mul(2);
    }
    result
}
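
The benefit of `process_into` shows up when it is called in a loop: once the buffer has grown to the largest input, later calls fit in the existing capacity and need not allocate. A sketch under that assumption; `process_all` is a hypothetical driver, and `wrapping_mul` is used here so that the doubling cannot panic on overflow in debug builds.

```rust
fn process_into(data: &[u8], buffer: &mut Vec<u8>) {
    buffer.clear(); // length becomes 0, capacity is kept
    buffer.extend(data.iter().map(|&x| x.wrapping_mul(2)));
}

/// Hypothetical driver: processes every chunk through one shared buffer and
/// returns the sum of all output bytes.
fn process_all(chunks: &[&[u8]]) -> u64 {
    // Size the buffer once for the largest chunk.
    let max_len = chunks.iter().map(|c| c.len()).max().unwrap_or(0);
    let mut buffer = Vec::with_capacity(max_len);

    let mut total = 0u64;
    for chunk in chunks {
        process_into(chunk, &mut buffer);
        total += buffer.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    total
}
```

A heap profiler or allocation counter is the way to confirm that the loop really stopped allocating.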

4. Profiling Tools

CPU Profiling with perf

bash

# Linux: CPU profiling with call graphs (-g)
perf record -g ./target/release/my_app
perf report

# Generate a flamegraph
cargo install flamegraph
cargo flamegraph

Memory Profiling with DHAT

rust

// Heap profiling with the dhat crate (declare a `dhat-heap` feature in Cargo.toml)
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();
    
    // Your code here; the profile is written when _profiler is dropped
}
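
The `#[global_allocator]` hook used above is a standard-library mechanism, and the basic idea of intercepting allocations can be sketched with nothing but `std`. The counter below is a hypothetical illustration, not how DHAT works internally: it only counts calls and bytes, with none of DHAT's call-site attribution or lifetime tracking. `CountingAlloc` and `allocation_count` are made-up names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator wrapper: forwards to the system allocator and
/// counts how many allocations were requested and how many bytes.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);
static ALLOCATED_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        ALLOCATED_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        // SAFETY: the caller upholds GlobalAlloc's contract for `layout`.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` was returned by System.alloc with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations requested so far in this process.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

/// Total bytes requested so far in this process.
fn allocated_bytes() -> usize {
    ALLOCATED_BYTES.load(Ordering::Relaxed)
}
```

Reading the counters before and after a code region gives a quick allocation count for that region; other threads allocating at the same time will inflate the numbers.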

Cache analysis with Cachegrind

bash

valgrind --tool=cachegrind ./target/release/my_app
cg_annotate cachegrind.out.<pid>

5. Compiler Optimization Flags

Tuning the release profile

toml

# Cargo.toml

[profile.release]
opt-level = 3           # Maximum optimization
lto = "fat"             # Full LTO (slower builds, faster binary)
codegen-units = 1       # Single codegen unit
panic = "abort"         # No unwinding
# target-cpu is not a Cargo profile key; pass it through RUSTFLAGS
# (-C target-cpu=native) or rustflags in .cargo/config.toml.

[profile.release-fast]  # Custom profile
inherits = "release"
debug = true            # Include debug info for profiling

[profile.release-small]
inherits = "release"
opt-level = "z"
lto = true
strip = true

Target-specific flags

bash

# Enable features of the build machine's CPU (the binary may not run on older CPUs)
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Enable a specific SIMD extension
RUSTFLAGS="-C target-feature=+avx2" cargo build --release

# List the available features
rustc --print target-features
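
A binary built with `target-cpu=native` or `+avx2` can fail with an illegal-instruction error on a CPU that lacks those features. When one binary has to run on varied hardware, a common alternative is to detect features at run time and dispatch accordingly. A minimal sketch using the standard library's `is_x86_feature_detected!` macro; the function names are hypothetical, and on non-x86 targets the sketch simply reports "baseline".

```rust
/// Returns the best SIMD level detected on the CPU running this code.
fn detected_simd_level() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "baseline"
}

/// True if AVX2 was enabled at compile time, for example through
/// -C target-feature=+avx2 or -C target-cpu=native on an AVX2 machine.
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}
```

The run-time check costs a little on each call, so it is usually done once and the chosen code path is cached, for instance in a function pointer.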

🎯 Best Practices

Optimization checklist

Step	Tool	When
1. Profile first	`cargo flamegraph`	Always
2. Algorithm	Review complexity	First thing to optimize
3. Data layout	Cache analysis	Memory-bound code
4. Release mode	`--release`	Always for perf testing
5. LTO	`lto = true`	Final builds
6. PGO	Profile-guided	Production deployments

Common mistakes

rust

// ❌ Premature optimization
fn process(v: Vec<i32>) -> i32 {
    // Don't reach for unsafe to gain 1%
    unsafe { ... }
}

// ❌ Micro-benchmarking without context
// A function that is 10x faster but called only once does not matter

// ❌ Ignoring algorithmic complexity
// An O(n²) algorithm optimized with SIMD is still O(n²)

// ✅ Profile → Identify the hotspot → Optimize → Measure
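
The point about algorithmic complexity can be illustrated with duplicate detection. This is a sketch with hypothetical function names: the first version compares every pair, so its cost grows quadratically, while the second makes one pass with a hash set and runs in expected linear time. No amount of low-level tuning of the first closes that gap for large inputs.

```rust
use std::collections::HashSet;

/// O(n²): compares every pair of elements.
fn has_duplicate_quadratic(items: &[u32]) -> bool {
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if items[i] == items[j] {
                return true;
            }
        }
    }
    false
}

/// Expected O(n): a single pass that remembers what it has seen.
fn has_duplicate_linear(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // HashSet::insert returns false when the value was already present.
    items.iter().any(|x| !seen.insert(x))
}
```

For very small inputs the quadratic version can still be faster because it allocates nothing, which is another reason to measure rather than assume.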
