Optimize Backend Performance
1. Objective
Making the system faster by doing less work.
The primary objective of this workflow is to methodically reduce the resource consumption (CPU, Memory, I/O) or execution time of a specific backend component. "Performance" is often vague; this workflow demands specificity—are we optimizing for Latency (response time) or Throughput (requests per second)? The goal is to move from "feeling slow" to "measured improvement" using data-driven profiling rather than guessing. Success is defined by a statistically significant improvement in the target metric without compromising correctness or maintainability.
2. Context & Scope
In Scope
This workflow covers application-level and database-level optimization for backend services. This includes:
- Code Profiling: Using tools like `pprof` (Go) or `cProfile`/`py-spy` (Python) to find CPU hot spots.
- Query Tuning: Analyzing SQL execution plans and adding indexes.
- Concurrency: Tuning thread pools, worker counts, or converting sync I/O to async.
- Caching: Implementing application-side caching (Redis/Memcached).
Assumption: You have a baseline metric. You cannot optimize what you cannot measure. Assumption: The code is functional and has passing tests.
Out of Scope
This workflow does not cover:
- Infrastructure Scaling: Simply adding more servers (Vertical/Horizontal scaling) is a capacity decision, not code optimization (see Capacity Planning).
- Frontend Performance: Optimizing React bundles or CSS (see Frontend Performance).
- Network Architecture: Reducing geographical latency via CDNs.
3. When to Use / When Not to Use
✅ Use This Workflow When
- An endpoint violates its Service Level Objective (SLO) (e.g., p99 > 500ms).
- A background job takes longer than its scheduling interval (the "overlapping job" problem).
- Cloud costs are exploding due to excessive CPU/Memory usage.
- You are preparing for a known high-traffic event (Black Friday).
❌ Do NOT Use This Workflow When
- "Premature Optimization": You are guessing where the bottleneck is before writing the code.
- The system is currently unstable or crashing (fix stability first).
- The gain is negligible (e.g., saving 1ms on a 5-minute job).
4. Inputs (Required/Optional)
Required Inputs
| Input | Description | Format | Example |
|---|---|---|---|
| TARGET_METRIC | The specific number to improve. | String | "POST /search p95 latency" |
| LOAD_SCRIPT | Script to generate valid traffic. | File Path | k6/load.js, locustfile.py |
Optional Inputs
| Input | Description | Format | Condition |
|---|---|---|---|
| APM_DASHBOARD | Link to Datadog/NewRelic/Jaeger. | URL | If observability is configured. |
| PROFILER_OUTPUT | Existing profiles. | .pprof | If a capture already exists. |
5. Outputs (Artifacts)
| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Baseline Report | Text/Markdown | docs/perf/baseline.md | Clear "Before" numbers (RPS/Latency). |
| Root Cause Analysis | Text | Incident Ticket | Identifies precisely why it was slow. |
| Optimization PR | Git Diff | Repo | Includes code changes and benchmark tests. |
| Verification Report | Text/Markdown | docs/perf/result.md | Clear "After" numbers showing improvement. |
6. Operating Modes
⚡ Fast Mode
Timebox: 1 hour. Scope: Low-hanging fruit. Details: Check for N+1 query issues and missing database indexes; enable basic caching.
🎯 Standard Mode (Default)
Timebox: 4 hours. Scope: Code/algorithm optimization. Details: Capture CPU/memory profiles, identify hot functions, refactor algorithms (O(n^2) -> O(n)), and optimize allocations.
🔬 Deep Mode
Timebox: 2-3 days. Scope: Architecture/kernel. Details: Rewrite critical paths in a faster language (Rust/Go extensions), tune garbage collector settings (GOGC), or offload processing to async queues.
7. Constraints & Guardrails
Technical Constraints
- Correctness First: An optimization that produces wrong results is a bug, not a feature. The test suite MUST pass.
- Readability: Do not write obfuscated bit-twiddling code unless absolutely necessary. Comment "Why" complex logic exists.
- Reproducibility: Benchmarks must be run on isolated hardware (or at least on an environment with consistent background load) to be valid.
Security & Privacy
CAUTION
Mandatory Security Considerations
- Cache Poisoning: When implementing caching, ensure the cache key includes all security contexts (e.g., UserID). Do not serve User A's data to User B because it was cached globally.
- Side Channels: Be aware that timing differences can sometimes leak information (Timing Attacks), though this is rare for general backend logic.
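The cache-key rule above can be sketched in a few lines. This is a minimal illustration with made-up names (`cache_key`, the `cache:` prefix, and the parameters are all hypothetical), not a prescribed API:

```python
import hashlib

def cache_key(user_id: str, endpoint: str, params: dict) -> str:
    """Build a cache key that embeds the security context (user_id).

    Hashing the serialized params keeps key length bounded.
    """
    serialized = "&".join(f"{k}={params[k]}" for k in sorted(params))
    digest = hashlib.sha256(serialized.encode()).hexdigest()[:16]
    return f"cache:{endpoint}:{user_id}:{digest}"

# Two users issuing the identical query get distinct keys,
# so User A's cached data can never be served to User B.
k_a = cache_key("user-a", "/search", {"q": "shoes"})
k_b = cache_key("user-b", "/search", {"q": "shoes"})
assert k_a != k_b
```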
8. Procedure
Phase 1: Establish Baseline (Preparation)
Objective: Measure the current performance to ensure we can prove improvement later.
Do not touch the code yet. First, run the LOAD_SCRIPT against a staging environment that mirrors production (or an isolated local environment). Record the throughput (RPS) and latency (p50, p95, p99). This is your Baseline.
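If your load tool does not report percentiles directly, they are easy to compute from raw samples. A minimal sketch using the nearest-rank method (the latency numbers are hypothetical):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p in (0, 100]."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# Hypothetical latency samples (ms) from one load-test run:
latencies_ms = [120, 95, 140, 480, 110, 130, 101, 99, 350, 125]
baseline = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(baseline)  # {'p50': 120, 'p95': 480, 'p99': 480}
```

Note how p95/p99 are dominated by the two slow outliers; this is why averages hide tail latency.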
Next, capture a profile while the load script is running.
- Go: `curl -o cpu.pprof http://localhost:6060/debug/pprof/profile?seconds=30`
- Python: `py-spy record -o profile.svg --pid [[PID]]`
- Database: Run `EXPLAIN ANALYZE` on the suspect queries.
Verify: You have a hard number (e.g., "140ms") and a visualization (Flamegraph) showing where the time is spent.
Phase 2: Diagnosis & Tuning (Execution)
Objective: Identify the bottleneck and apply the fix.
Analyze the Flamegraph. Look for the widest bars (CPU time) or the deepest stacks. Identify if the application is CPU Bound (spending time in calculation/serialization) or I/O Bound (waiting for DB/Network).
For I/O Bound:
- Check for "N+1 Queries". If you see many small calls to the DB, batch them.
- Check for missing indexes. If the database does a "Seq Scan", add an index.
- Introduce Caching. If the data is read often and written rarely, wrap the call in a Redis `Get`/`Set` block.
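The N+1 fix can be sketched with an in-memory SQLite database standing in for the real one (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edsger")])

post_author_ids = [1, 3, 1, 2]

# N+1 pattern: one query per post (N round trips to the DB).
names_slow = [conn.execute("SELECT name FROM authors WHERE id = ?",
                           (i,)).fetchone()[0]
              for i in post_author_ids]

# Batched: a single IN (...) query, then join in memory.
unique_ids = sorted(set(post_author_ids))
placeholders = ",".join("?" * len(unique_ids))
rows = conn.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
    unique_ids).fetchall()
by_id = dict(rows)
names_fast = [by_id[i] for i in post_author_ids]

assert names_slow == names_fast  # same results, 1 round trip instead of N
```

Over a network, collapsing N round trips into one is usually the single largest I/O win an ORM-heavy endpoint can get.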
For CPU Bound:
- Inspect the hot function. Are you allocating too much memory in a loop?
- Are you using an inefficient algorithm? (Nested loops).
- Can you parallelize the work? (Worker pools).
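The worker-pool idea can be sketched with the standard library (the `enrich` function is a hypothetical stand-in for per-record work):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich(record: int) -> int:
    # Placeholder for per-record work (e.g., an outbound API call).
    return record * 2

records = list(range(100))

# Sequential baseline.
sequential = [enrich(r) for r in records]

# Worker pool: a good fit when enrich() is I/O bound. For CPU-bound
# pure-Python work, ProcessPoolExecutor sidesteps the GIL instead.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(enrich, records))

assert parallel == sequential  # same results, order preserved
```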
Apply one change at a time. Do not apply 5 fixes at once; you won't know which one worked.
Verify: Run a quick micro-benchmark (Go `testing` benchmarks / Python `timeit`) to confirm the specific function is faster.
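A minimal `timeit` micro-benchmark might look like this (the string-concatenation example is illustrative; substitute your own hot function):

```python
import timeit

def concat_naive(parts):
    s = ""
    for p in parts:
        s += p             # may reallocate repeatedly
    return s

def concat_join(parts):
    return "".join(parts)  # single pass, single allocation

parts = ["x"] * 10_000
# Correctness first: both variants must agree before timing them.
assert concat_naive(parts) == concat_join(parts)

t_naive = timeit.timeit(lambda: concat_naive(parts), number=50)
t_join = timeit.timeit(lambda: concat_join(parts), number=50)
print(f"naive: {t_naive:.4f}s  join: {t_join:.4f}s")
```

Always benchmark both the old and new implementation in the same process run; comparing numbers from different sessions invites noise.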
Phase 3: Verification (Finalization)
Objective: Prove system-level improvement and regression check.
Re-run the LOAD_SCRIPT under the exact same conditions as Phase 1. Compare the new RPS and Latency numbers against the Baseline. Calculate the % improvement.
Run the full regression test suite. Specialized optimizations often break edge cases (e.g., caching might return stale data). Ensure data consistency is maintained. If you added a cache, verify the invalidation strategy works (e.g., updating the resource clears the cache).
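The invalidation check can be exercised with a minimal in-process sketch; here a dict stands in for Redis, and `CachedStore` is a hypothetical name, not a real client API:

```python
class CachedStore:
    """Minimal stand-in for a DB fronted by a cache (dict instead of Redis)."""

    def __init__(self):
        self.db = {}
        self.cache = {}

    def get(self, key):
        if key in self.cache:          # cache hit
            return self.cache[key]
        value = self.db.get(key)       # cache miss: read DB, populate cache
        self.cache[key] = value
        return value

    def update(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)      # invalidate: next get() re-reads the DB

store = CachedStore()
store.update("user:1", "Ada")
assert store.get("user:1") == "Ada"            # populates the cache
store.update("user:1", "Ada Lovelace")
assert store.get("user:1") == "Ada Lovelace"   # no stale read after update
```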
Verify: "After" metrics > "Before" metrics. Tests pass. Code is committed with a clear description of the perf gain.
9. Technical Considerations
The Law of Diminishing Returns: optimization follows a power law. 80% of the gain comes from 20% of the code. Once you fix the main bottleneck, the next bottleneck will be much smaller. Know when to stop. Readable code is often better than slightly faster code.
Allocations (Memory/GC): In garbage-collected languages (Go/Python/Java), high memory allocation creates CPU pressure because the GC has to clean it up. Profiling "Alloc Objects" is often more fruitful than profiling "CPU Time" directly. Reusing buffers (sync.Pool) or effectively sizing slices can dramatically reduce latency jitter caused by GC pauses.
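A Python analog of the buffer-reuse idea (the `Framer` class is illustrative, loosely mirroring what `sync.Pool` achieves in Go):

```python
# Allocating a fresh bytes object per message creates GC pressure:
def frame_naive(payload: bytes) -> bytes:
    return len(payload).to_bytes(4, "big") + payload

# Reusing one preallocated buffer across calls avoids those allocations:
class Framer:
    def __init__(self, capacity: int = 65536):
        self.buf = bytearray(capacity)   # allocated once, reused forever

    def frame(self, payload: bytes) -> memoryview:
        n = len(payload)
        self.buf[0:4] = n.to_bytes(4, "big")
        self.buf[4:4 + n] = payload
        return memoryview(self.buf)[:4 + n]  # zero-copy view, no new bytes

framer = Framer()
assert bytes(framer.frame(b"hello")) == frame_naive(b"hello")
```

The caveat: a reused buffer is only valid until the next `frame()` call, so callers must consume (or copy) the view before reusing the framer.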
Distributed Systems & Fan-out: If your service calls 5 downstream services, your latency is defined by the slowest dependency (tail latency). Optimization here often involves parallel requests, timeouts, and "hedged requests" (sending 2, taking the first), rather than optimizing your own CPU loop.
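A hedged request can be sketched with `asyncio` (replica names, payloads, and the simulated latencies are all hypothetical):

```python
import asyncio
import random

async def fetch_replica(name: str, payload: str) -> str:
    # Simulated downstream call with variable latency.
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"{name}:{payload}"

async def hedged_fetch(payload: str) -> str:
    """Send the same request to two replicas; take whichever answers first."""
    tasks = [asyncio.create_task(fetch_replica(n, payload))
             for n in ("replica-a", "replica-b")]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # abandon the slower request
    return done.pop().result()

result = asyncio.run(hedged_fetch("q=shoes"))
assert result.endswith(":q=shoes")
```

In production, hedging is usually sent after a short delay (e.g., the p95 of the first request) rather than immediately, to avoid doubling downstream load.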
10. Quality Gates (Definition of Done)
Checklist
- [ ] Performance measurement shows >10% improvement (or target met).
- [ ] No regression in functional tests.
- [ ] Optimization is documented (why did we do this weird thing?).
- [ ] Build does not fail.
Validation
| Criterion | Method | Threshold |
|---|---|---|
| Latency | Load Test (k6) | < Target (e.g., 200ms) |
| Throughput | Load Test | > Target (e.g., 1000 RPS) |
| Correctness | Unit Tests | 100% Pass |
11. Failure Modes & Recovery
| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Optimization broke logic | Faster results, but wrong data. | Revert immediately. Add a regression test case covering the specific data scenario. |
| Memory Leak | Optimization involved caching, now RAM is full. | Limit cache size (LRU/TTL). Ensure cache keys are bounded. |
| Thundering Herd | Cache expires, DB falls over. | Implement "Request Coalescing" (singleflight) or "Probabilistic Early Expiration". |
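Request coalescing (the singleflight pattern from the table above) can be sketched in pure Python; `SingleFlight` is an illustrative name modeled on Go's `golang.org/x/sync/singleflight`, not a standard library API:

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent loads of the same key: one caller ("leader")
    runs the loader; concurrent callers wait for and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader()
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        return holder["value"]

# Demo: four concurrent cache misses trigger only one "DB" load.
sf = SingleFlight()
calls, results = [], []

def slow_db_load():
    calls.append(1)          # count real loads
    time.sleep(0.05)         # simulate a slow query
    return "db-value"

threads = [threading.Thread(
               target=lambda: results.append(sf.do("hot-key", slow_db_load)))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert results == ["db-value"] * 4 and len(calls) == 1
```

This caps the load on the database at one query per key per expiry, no matter how many requests arrive during the miss window.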
12. Copy-Paste Prompt
TIP
One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.
```text
## Task: Optimize Backend Performance
### Context
I need to improve the performance of a specific component.
- **Metric**: [Target: TARGET_METRIC]
- **Role**: Backend Engineer
### Inputs
- **Base Code**: [Repo Path]
- **Load Script**: [Provide: LOAD_SCRIPT]
### Constraints
- Must include "Before" and "After" benchmarks.
- No functional regressions.
- Use profiling data, don't guess.
### Instructions
Execute the following procedure:
**Phase 1: Baseline**
Run the load script. Capture p50, p95, p99. Capture a CPU profile (flamegraph).
**Phase 2: Diagnosis & Tuning**
Analyze profile. Identify the bottleneck (CPU vs IO). Apply targeted fix (Index, Cache, or Refactor).
**Phase 3: Verification**
Re-run load script. Validate improvement. Run regression tests.
### Quality Gates
- [ ] Improvement Verified.
- [ ] Tests Pass.
### Output Format
- Benchmark Report (Before/After).
- Optimization Patch.
```