🔬 Profiling Performance
"Premature optimization is the root of all evil." — Donald Knuth
Measure first with Valgrind, Linux Perf, and FlameGraphs. Only optimize after you have data.
Memory vs CPU Profiling
┌────────────────────────────────────────────────────────────────┐
│                        PROFILING TYPES                         │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  MEMORY PROFILING                                              │
│  ────────────────                                              │
│  Question: Does the code leak memory? Allocate too much? Where?│
│  Tools:                                                        │
│    • Valgrind Memcheck  → Memory errors, leaks                 │
│    • Valgrind Massif    → Memory usage over time               │
│    • ASan leak detector → Fast leak detection                  │
│                                                                │
│  CPU PROFILING                                                 │
│  ─────────────                                                 │
│  Question: Where is the code slow? Which function is hottest?  │
│  Tools:                                                        │
│    • Linux Perf         → Kernel-level sampling (fastest)      │
│    • Valgrind Callgrind → Call graph (slow but accurate)       │
│    • gprof              → Legacy, avoid it                     │
│                                                                │
│  HPN POLICY:                                                   │
│    • Perf for production profiling (no slowdown)               │
│    • Valgrind for development (20x slowdown acceptable)        │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Valgrind Memcheck — Memory Leak Detection
Basic Usage
bash
# Compile with debug symbols
g++ -g -o myapp myapp.cpp
# Run with Valgrind
valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         ./myapp

Example: Finding Memory Leak
cpp
// leak.cpp
#include <cstring>
void LeakyFunction() {
int* data = new int[100]; // Allocated
// ... do work ...
// Forgot to delete[]!
}
int main() {
for (int i = 0; i < 10; ++i) {
LeakyFunction();
}
return 0;
}

Valgrind Output
==12345== HEAP SUMMARY:
==12345== in use at exit: 4,000 bytes in 10 blocks
==12345== total heap usage: 10 allocs, 0 frees, 4,000 bytes allocated
==12345==
==12345== 4,000 bytes in 10 blocks are definitely lost in loss record 1 of 1
==12345== at 0x4C2E80F: operator new[](unsigned long)
==12345== by 0x401156: LeakyFunction() (leak.cpp:5)
==12345== by 0x401180: main (leak.cpp:11)
==12345==
==12345== LEAK SUMMARY:
==12345== definitely lost: 4,000 bytes in 10 blocks
==12345== indirectly lost: 0 bytes in 0 blocks
==12345== possibly lost: 0 bytes in 0 blocks
==12345== still reachable: 0 bytes in 0 blocks
==12345== suppressed: 0 bytes in 0 blocks

Valgrind Massif — Memory Usage Over Time
bash
# Profile memory usage
valgrind --tool=massif ./myapp
# View results
ms_print massif.out.12345

Output (ASCII Graph)
    MB
12.50^                                    #
     |                                   @#
     |                                  @@#
     |                                 @@@#
     |                                @@@@#
     |                               @@@@@#
     |                              @@@@@@#
     |                             @@@@@@@#:
     |                            @@@@@@@@#:
     |                           @@@@@@@@@#::
     |                          @@@@@@@@@@#:::
     |                         @@@@@@@@@@@#:::@
     |                        @@@@@@@@@@@@#:::@:
     |                       @@@@@@@@@@@@@#:::@::
     |::::::::::::::::::::::@@@@@@@@@@@@@@#:::@::
   0 +--------------------------------------------> time

Linux Perf — The Ultimate CPU Profiler
Installation
bash
# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-generic
# Verify
perf --version

Basic Profiling
bash
# Record call-graph samples (-F 99 sets a 99 Hz sample rate)
perf record -F 99 -g ./myapp
# View report
perf report

Perf Report Interface
Samples: 5K of event 'cycles', Event count: 8234567890
Overhead Command Shared Object Symbol
45.32% myapp myapp [.] ProcessPackets
23.10% myapp libc.so.6 [.] malloc
12.45% myapp libstdc++.so.6 [.] std::string::operator+
8.76% myapp myapp [.] MatchRules
5.21% myapp libc.so.6 [.] memcpy
...

Finding Hotspots
bash
# Top functions by CPU
perf top -p $(pidof myapp)
# Record with call graph
perf record -g --call-graph dwarf ./myapp
# Flamegraph preparation
perf script > perf.script

FlameGraphs — Visual CPU Profiling
What is a FlameGraph?
┌────────────────────────────────────────────────────────────────┐
│                       FLAMEGRAPH ANATOMY                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ main() │
│   │●●●●●●●●●●●●●●●●●│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│○○○○○○○○○○○○○│     │
│   │   ProcessData   │      NetworkIO      │   Logging   │     │
│   │●●●●●●●│●●●●●●●●●│▓▓▓▓▓▓▓│▓▓▓▓▓▓▓▓▓▓▓▓│○○○○○○│○○○○○○│     │
│   │ Parse │Transform│Connect │    Read    │Format│Write │     │
│                                                                │
│  HOW TO READ:                                                  │
│    • Y-axis: Call stack depth (bottom = entry point)           │
│    • X-axis: Share of samples (wider = more CPU time);         │
│      NOT time order; frames are sorted alphabetically          │
│    • Color: Random (no meaning, just visibility)               │
│                                                                │
│  WHAT TO LOOK FOR:                                             │
│    • Wide plateaus = CPU hotspots                              │
│    • Deep stacks = Many function calls                         │
│    • Unexpected functions = Suspicious code                    │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Generating FlameGraphs
bash
# Clone FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
# Record with perf
perf record -g --call-graph dwarf ./myapp
# Generate FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl > flamegraph.svg
# Open in browser
firefox flamegraph.svg

Interactive Example
bash
# Profile HPN Tunnel for 10 seconds
sudo perf record -g -p $(pidof hpn-tunnel) sleep 10
# Generate flamegraph
sudo perf script | \
    ./FlameGraph/stackcollapse-perf.pl | \
    ./FlameGraph/flamegraph.pl \
        --title "HPN Tunnel CPU Profile" \
        --subtitle "10 second sample" \
        > hpn-tunnel-flame.svg

Case Study: HPN Tunnel Optimization
Before Optimization (FlameGraph shows)
ProcessPackets ────────────────────────────────────────────── 100%
├── std::string::operator+ ─────────────────────── 35% 🔥
├── Packet::GetPayload (memcpy) ────────────────── 25% 🔥
├── MatchRules (strcmp loop) ───────────────────── 30% 🔥
└── ApplyRule ──────────────────────────────────── 10%

Optimizations Applied
| Hotspot | Problem | Fix |
|---|---|---|
| string::operator+ | malloc() per packet | Pre-allocated buffer |
| GetPayload() | Returns by value | Return by const ref |
| MatchRules() | O(n) linear search | O(1) hash lookup |
After Optimization (FlameGraph shows)
ProcessPackets ─────────────────────────────────────── 100%
├── ApplyRule ────────────────────────────────── 60%
├── RuleCache::Find ──────────────────────────── 25%
└── Logger::LogAsync ─────────────────────────── 15%

🚀 RESULT
- Before: 10K packets/s
- After: 80K packets/s
- Improvement: 8x throughput
Best Practices
┌────────────────────────────────────────────────────────────────┐
│                    PROFILING BEST PRACTICES                    │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ✅ DO                                                         │
│  ─────                                                         │
│    • Profile with realistic workload (not toy examples)        │
│    • Compare before/after with the same workload               │
│    • Profile release builds (-O2 or -O3)                       │
│    • Use FlameGraphs for visualization                         │
│    • Focus on the WIDEST bars first                            │
│                                                                │
│  ❌ DON'T                                                      │
│  ───────                                                       │
│    • Don't profile debug builds (misleading data)              │
│    • Don't optimize without profiling first                    │
│    • Don't chase small samples (may be statistical noise)      │
│    • Don't over-optimize cold paths                            │
│                                                                │
│  📊 TARGETS                                                    │
│  ──────────                                                    │
│    • API latency: p99 < 200ms                                  │
│    • Hot path: < 50µs per operation                            │
│    • Memory: Zero malloc in tight loops                        │
│                                                                │
└────────────────────────────────────────────────────────────────┘