
🔬 Profiling Performance

"Premature optimization is the root of all evil." — Donald Knuth

Measure first with Valgrind, Linux Perf, and FlameGraphs. Optimize only after you have data.

Memory vs CPU Profiling

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROFILING TYPES                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   MEMORY PROFILING                                                      │
│   ─────────────────                                                     │
│   Question: Does the code leak memory? Over-allocate? Where?            │
│   Tools:                                                                │
│   • Valgrind Memcheck  → Memory errors, leaks                           │
│   • Valgrind Massif    → Memory usage over time                         │
│   • ASan leak detector → Fast leak detection                            │
│                                                                         │
│   CPU PROFILING                                                         │
│   ──────────────                                                        │
│   Question: Where is the code slow? Which function burns the most CPU?  │
│   Tools:                                                                │
│   • Linux Perf         → Kernel-level sampling (fastest)                │
│   • Valgrind Callgrind → Call graph (slow but accurate)                 │
│   • gprof              → Legacy; avoid it                               │
│                                                                         │
│   HPN POLICY:                                                           │
│   • Perf for production profiling (no slowdown)                         │
│   • Valgrind for development (20x slowdown acceptable)                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Valgrind Memcheck — Memory Leak Detection

Basic Usage

bash
# Compile with debug symbols
g++ -g -o myapp myapp.cpp

# Run with Valgrind
valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         ./myapp

Example: Finding Memory Leak

cpp
// leak.cpp

void LeakyFunction() {
    int* data = new int[100];  // Allocated
    // ... do work ...
    // Forgot to delete[]!
}

int main() {
    for (int i = 0; i < 10; ++i) {
        LeakyFunction();
    }
    return 0;
}

Valgrind Output

==12345== HEAP SUMMARY:
==12345==     in use at exit: 4,000 bytes in 10 blocks
==12345==   total heap usage: 10 allocs, 0 frees, 4,000 bytes allocated
==12345==
==12345== 4,000 bytes in 10 blocks are definitely lost in loss record 1 of 1
==12345==    at 0x4C2E80F: operator new[](unsigned long)
==12345==    by 0x401156: LeakyFunction() (leak.cpp:5)
==12345==    by 0x401180: main (leak.cpp:11)
==12345==
==12345== LEAK SUMMARY:
==12345==    definitely lost: 4,000 bytes in 10 blocks
==12345==    indirectly lost: 0 bytes in 0 blocks
==12345==      possibly lost: 0 bytes in 0 blocks
==12345==    still reachable: 0 bytes in 0 blocks
==12345==         suppressed: 0 bytes in 0 blocks

Valgrind Massif — Memory Usage Over Time

bash
# Profile memory usage
valgrind --tool=massif ./myapp

# View results
ms_print massif.out.12345

Output (ASCII Graph)

    MB
12.50^                                                    #
     |                                                   @#
     |                                                  @@#
     |                                                 @@@#
     |                                                @@@@#
     |                                               @@@@@#
     |                                              @@@@@@#
     |                                             @@@@@@@#:
     |                                            @@@@@@@@#:
     |                                           @@@@@@@@@#::
     |                                          @@@@@@@@@@#:::
     |                                         @@@@@@@@@@@#:::@
     |                                        @@@@@@@@@@@@#:::@:
     |                                       @@@@@@@@@@@@@#:::@::
     |                 ::::::::::::::::::::::@@@@@@@@@@@@@@#:::@::
   0 +--------------------------------------------------------------> time

Linux Perf — The Ultimate CPU Profiler

Installation

bash
# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-generic

# Verify
perf --version

Basic Profiling

bash
# Record performance data with call graphs (add -F 99 to sample at 99 Hz)
perf record -g ./myapp

# View report
perf report

Perf Report Interface

Samples: 5K of event 'cycles', Event count (approx.): 8234567890
Overhead  Command  Shared Object      Symbol
  45.32%  myapp    myapp              [.] ProcessPackets
  23.10%  myapp    libc.so.6          [.] malloc
  12.45%  myapp    libstdc++.so.6     [.] std::string::operator+
   8.76%  myapp    myapp              [.] MatchRules
   5.21%  myapp    libc.so.6          [.] memcpy
   ...
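The report above blames string concatenation, `malloc`, and `memcpy`. Code like the following (an illustrative pattern, not actual HPN source) produces exactly that kind of profile:

```cpp
// A per-packet formatting pattern that shows up as string::operator+,
// malloc, and memcpy in perf report.
#include <string>
#include <vector>

std::string FormatPacket(std::vector<char> payload) {  // by value -> memcpy
    std::string line = "pkt:";                         // malloc per call
    line = line + std::to_string(payload.size());      // operator+ -> more mallocs
    return line;
}
```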

Finding Hotspots

bash
# Top functions by CPU
perf top -p $(pidof myapp)

# Record with call graph
perf record -g --call-graph dwarf ./myapp

# Flamegraph preparation
perf script > perf.script

FlameGraphs — Visual CPU Profiling

What is a FlameGraph?

┌─────────────────────────────────────────────────────────────────────────┐
│                    FLAMEGRAPH ANATOMY                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  main()   │
│   │●●●●●●●●●●●●●●●●●│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│○○○○○○○○○○○○○○○○│            │
│   │  ProcessData    │    NetworkIO       │   Logging     │            │
│   │●●●●●●│●●●●●●●●●│▓▓▓▓▓▓▓│▓▓▓▓▓▓▓▓▓▓▓│○○○○○│○○○○○○○○○│            │
│   │Parse │ Transform│Connect│   Read    │Format│  Write  │            │
│                                                                         │
│   HOW TO READ:                                                          │
│   • Y-axis: Call stack depth (bottom = entry point)                     │
│   • X-axis: Time spent in function (wider = more CPU)                   │
│   • Color: Random (no meaning, just visibility)                         │
│                                                                         │
│   WHAT TO LOOK FOR:                                                     │
│   • Wide plateaus = CPU hotspots                                        │
│   • Deep stacks = Many function calls                                   │
│   • Unexpected functions = Suspicious code                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Generating FlameGraphs

bash
# Clone FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git

# Record with perf
perf record -g --call-graph dwarf ./myapp

# Generate FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | \
              ./FlameGraph/flamegraph.pl > flamegraph.svg

# Open in browser
firefox flamegraph.svg

Interactive Example

bash
# Profile HPN Tunnel for 10 seconds
sudo perf record -g -p $(pidof hpn-tunnel) sleep 10

# Generate flamegraph
sudo perf script | \
    ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl \
    --title "HPN Tunnel CPU Profile" \
    --subtitle "10 second sample" \
    > hpn-tunnel-flame.svg

Case Study: HPN Tunnel Optimization

Before Optimization (FlameGraph shows)

ProcessPackets ────────────────────────────────────────────── 100%
├── std::string::operator+ ─────────────────────── 35% 🔥
├── Packet::GetPayload (memcpy) ────────────────── 25% 🔥
├── MatchRules (strcmp loop) ───────────────────── 30% 🔥
└── ApplyRule ──────────────────────────────────── 10%

Optimizations Applied

Hotspot              Problem                Fix
─────────────────    ───────────────────    ────────────────────
string::operator+    malloc() per packet    Pre-allocated buffer
GetPayload()         Returns by value       Return by const ref
MatchRules()         O(n) linear search     O(1) hash lookup
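The three fixes can be sketched in isolation as follows. This is a hedged sketch, not the HPN source; the names `Packet`, `Rule`, `RuleCache`, and `FormatInto` are illustrative:

```cpp
// Sketches of the three optimizations from the table above.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Packet {
    std::vector<uint8_t> payload_;
    // Fix 2: return by const reference -- no per-call copy (no memcpy)
    const std::vector<uint8_t>& GetPayload() const { return payload_; }
};

struct Rule { int action; };

class RuleCache {
    std::unordered_map<std::string, Rule> rules_;  // Fix 3: O(1) hash lookup
public:
    void Add(const std::string& key, Rule r) { rules_[key] = r; }
    const Rule* Find(const std::string& key) const {
        auto it = rules_.find(key);
        return it == rules_.end() ? nullptr : &it->second;
    }
};

// Fix 1: reuse one caller-owned buffer instead of string::operator+ per packet
void FormatInto(std::string& buf, const Packet& p) {
    buf.clear();  // keeps the existing capacity -- no malloc in steady state
    buf.append("pkt:");
    buf.append(std::to_string(p.GetPayload().size()));
}
```

The caller allocates `buf` once (e.g. `buf.reserve(64)`) and reuses it across the packet loop, so the hot path performs zero heap allocations after warm-up.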

After Optimization (FlameGraph shows)

ProcessPackets ─────────────────────────────────────── 100%
├── ApplyRule ────────────────────────────────── 60%
├── RuleCache::Find ──────────────────────────── 25%
└── Logger::LogAsync ─────────────────────────── 15%

🚀 RESULT

  • Before: 10K packets/s
  • After: 80K packets/s
  • Improvement: 8x throughput

Best Practices

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROFILING BEST PRACTICES                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ✅ DO                                                                  │
│   ─────                                                                 │
│   • Profile with realistic workload (not toy examples)                  │
│   • Compare before/after with same workload                             │
│   • Profile release build (-O2 or -O3)                                  │
│   • Use FlameGraphs for visualization                                   │
│   • Focus on the WIDEST bars first                                      │
│                                                                         │
│   ❌ DON'T                                                               │
│   ───────                                                               │
│   • Don't profile debug builds (misleading data)                        │
│   • Don't optimize without profiling first                              │
│   • Don't ignore small samples (may be statistical noise)               │
│   • Don't over-optimize cold paths                                      │
│                                                                         │
│   📊 TARGETS                                                             │
│   ──────────                                                            │
│   • API latency: p99 < 200ms                                            │
│   • Hot path: < 50µs per operation                                      │
│   • Memory: Zero malloc in tight loops                                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Next Steps

🔍 GDB — Core Dumps, Interactive Debugging, TUI Mode