
Optimize Backend Performance


1. Objective

Making the system faster by doing less work.

The primary objective of this workflow is to methodically reduce the resource consumption (CPU, Memory, I/O) or execution time of a specific backend component. "Performance" is often vague; this workflow demands specificity—are we optimizing for Latency (response time) or Throughput (requests per second)? The goal is to move from "feeling slow" to "measured improvement" using data-driven profiling rather than guessing. Success is defined by a statistically significant improvement in the target metric without compromising correctness or maintainability.


2. Context & Scope

In Scope

This workflow covers application-level and database-level optimization for backend services. This includes:

  • Code Profiling: Using tools like pprof (Go) or cProfile/py-spy (Python) to find CPU hot spots.
  • Query Tuning: Analyzing SQL execution plans and adding indexes.
  • Concurrency: Tuning thread pools, worker counts, or converting sync I/O to async.
  • Caching: Implementing application-side caching (Redis/Memcached).

  • Assumption: You have a baseline metric. You cannot optimize what you cannot measure.
  • Assumption: The code is functional and has passing tests.

Out of Scope

This workflow does not cover:

  • Infrastructure Scaling: Simply adding more servers (Vertical/Horizontal scaling) is a capacity decision, not code optimization (see Capacity Planning).
  • Frontend Performance: Optimizing React bundles or CSS (see Frontend Performance).
  • Network Architecture: Reducing geographical latency via CDNs.

3. When to Use / When Not to Use

Use This Workflow When

  • An endpoint violates its Service Level Objective (SLO) (e.g., p99 > 500ms).
  • A background job takes longer than its scheduling interval (the "overlapping job" problem).
  • Cloud costs are exploding due to excessive CPU/Memory usage.
  • You are preparing for a known high-traffic event (Black Friday).

Do NOT Use This Workflow When

  • "Premature Optimization": You are guessing where the bottleneck is before writing the code.
  • The system is currently unstable/crashing (fix the Stability first).
  • The gain is negligible (e.g., saving 1ms on a 5-minute job).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
|---|---|---|---|
| TARGET_METRIC | The specific number to improve. | String | "POST /search p95 latency" |
| LOAD_SCRIPT | Script to generate valid traffic. | File Path | k6/load.js, locustfile.py |

Optional Inputs

| Input | Description | Format | Condition |
|---|---|---|---|
| APM_DASHBOARD | Link to Datadog/NewRelic/Jaeger. | URL | If observability is configured. |
| PROFILER_OUTPUT | Existing profiles. | .pprof | If a capture already exists. |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Baseline Report | Text/Markdown | docs/perf/baseline.md | Clear "Before" numbers (RPS/Latency). |
| Root Cause Analysis | Text | Incident Ticket | Identifies precisely why it was slow. |
| Optimization PR | Git Diff | Repo | Includes code changes and benchmark tests. |
| Verification Report | Text/Markdown | docs/perf/result.md | Clear "After" numbers showing improvement. |

6. Operating Modes

Fast Mode

Timebox: 1 Hour
Scope: Low-Hanging Fruit
Details: Check for N+1 query issues. Check for missing database indexes. Enable basic caching.

🎯 Standard Mode (Default)

Timebox: 4 Hours
Scope: Code/Algorithm Optimization
Details: Capture CPU/Memory profiles. Identify hot functions. Refactor algorithms (O(n^2) -> O(n)). Optimize allocations.

🔬 Deep Mode

Timebox: 2-3 Days
Scope: Architecture/Kernel
Details: Rewrite critical paths in a faster language (Rust/Go extensions). Tune Garbage Collector settings (GOGC). Offload processing to async queues.


7. Constraints & Guardrails

Technical Constraints

  • Correctness First: An optimization that produces wrong results is a bug, not a feature. The test suite MUST pass.
  • Readability: Do not write obfuscated bit-twiddling code unless absolutely necessary. Comment "Why" complex logic exists.
  • Reproducibility: Benchmarks must be run on isolated hardware (or at least under consistent background load) to be valid.

Security & Privacy

CAUTION

Mandatory Security Considerations

  • Cache Poisoning: When implementing caching, ensure the cache key includes all security contexts (e.g., UserID). Do not serve User A's data to User B because it was cached globally.
  • Side Channels: Be aware that timing differences can sometimes leak information (Timing Attacks), though this is rare for general backend logic.
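A user-scoped cache key can be sketched as follows. This is a minimal illustration of the "include the security context" rule; the `cache_key` helper and its parameters are illustrative assumptions, not part of this workflow's tooling:

```python
import hashlib

def cache_key(user_id: str, endpoint: str, params: dict) -> str:
    """Build a cache key that includes the security context (user_id),
    so one user's cached response is never served to another."""
    # Sort params so logically identical requests map to the same key.
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    raw = f"{user_id}:{endpoint}:{canonical}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Note that the same endpoint and parameters yield different keys for different users, which is exactly what prevents the cross-user leak described above.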

8. Procedure

Phase 1: Establish Baseline (Preparation)

Objective: Measure the current performance to ensure we can prove improvement later.

Do not touch the code yet. First, run the LOAD_SCRIPT against a staging environment that mirrors production (or use a local isolation chamber). Record the throughput (RPS) and Latency (p50, p95, p99). This is your Baseline.

Next, capture a profile while the load script is running.

  • Go: curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
  • Python: py-spy record -o profile.svg --pid [[PID]]
  • Database: Run EXPLAIN ANALYZE on the suspect queries.

Verify: You have a hard number (e.g., "140ms") and a visualization (Flamegraph) showing where the time is spent.
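Once the load script has produced raw latency samples, the baseline summary is straightforward to compute; a minimal Python sketch (the `latency_report` function is an illustrative helper, not part of any tool above):

```python
import statistics

def latency_report(samples_ms):
    """Summarize raw latency samples (milliseconds) into the
    p50/p95/p99 numbers recorded in the Baseline Report."""
    # quantiles(n=100) returns 99 cut points; index i is the (i+1)-th percentile.
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Running this on the "Before" and "After" sample sets gives directly comparable numbers for the Verification Report.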

Phase 2: Diagnosis & Tuning (Execution)

Objective: Identify the bottleneck and apply the fix.

Analyze the Flamegraph. Look for the widest bars (CPU time) or the deepest stacks. Identify if the application is CPU Bound (spending time in calculation/serialization) or I/O Bound (waiting for DB/Network).

  • For I/O Bound:

    • Check for "N+1 Queries". If you see many small calls to the DB, batch them.
    • Check for missing indexes. If the database does a "Seq Scan", add an index.
    • Introduce Caching. If the data is read often/written rarely, wrap the call in a Redis Get/Set block.
  • For CPU Bound:

    • Inspect the hot function. Are you allocating too much memory in a loop?
    • Are you using an inefficient algorithm? (Nested loops).
    • Can you parallelize the work? (Worker pools).
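The N+1 batching fix can be sketched with SQLite standing in for the production database (the `posts` table and helper functions are illustrative, not from this workflow):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "ann"), (2, "bob"), (3, "cho")])

def load_authors_naive(post_ids):
    # N+1 pattern: one round trip to the database per id.
    return {pid: conn.execute(
        "SELECT author FROM posts WHERE id = ?", (pid,)).fetchone()[0]
        for pid in post_ids}

def load_authors_batched(post_ids):
    # Batched: a single IN (...) query replaces N round trips.
    placeholders = ",".join("?" * len(post_ids))
    rows = conn.execute(
        f"SELECT id, author FROM posts WHERE id IN ({placeholders})",
        post_ids).fetchall()
    return dict(rows)
```

The two functions return identical results; the difference only shows up in round-trip count, which is where the latency hides under real network conditions.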

Apply one change at a time. Do not apply 5 fixes at once; you won't know which one worked.

Verify: Run a quick micro-benchmark (Go benchmark / Python timeit) to confirm the specific function is faster.
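Such a micro-benchmark can be run with Python's stdlib timeit; the serializer functions below are hypothetical stand-ins for whatever hot function the profile identified:

```python
import timeit

def serialize_naive(rows):
    out = ""
    for r in rows:
        out += f"{r['id']},{r['name']}\n"  # repeated string concatenation
    return out

def serialize_fast(rows):
    # Single join allocates the result once.
    return "".join(f"{r['id']},{r['name']}\n" for r in rows)

rows = [{"id": i, "name": f"user{i}"} for i in range(5_000)]
naive_s = timeit.timeit(lambda: serialize_naive(rows), number=20)
fast_s = timeit.timeit(lambda: serialize_fast(rows), number=20)
print(f"naive: {naive_s:.3f}s  join: {fast_s:.3f}s")
```

Crucially, the benchmark also asserts (via tests) that both variants produce identical output, keeping "Correctness First" intact.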

Phase 3: Verification (Finalization)

Objective: Prove system-level improvement and regression check.

Re-run the LOAD_SCRIPT under the exact same conditions as Phase 1. Compare the new RPS and Latency numbers against the Baseline. Calculate the % improvement.

Run the full regression test suite. Specialized optimizations often break edge cases (e.g., caching might return stale data). Ensure data consistency is maintained. If you added a cache, verify the invalidation strategy works (e.g., updating the resource clears the cache).
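The invalidation behavior can be reasoned about with a minimal read-through cache. This `TTLCache` class is an illustrative in-process stand-in for Redis, not a production implementation:

```python
import time

class TTLCache:
    """Minimal read-through cache with explicit invalidation on the
    write path. A real service would use Redis with the same
    get/set/delete shape."""
    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # fresh hit
        value = self.loader(key)
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        # Called by the write path: updating the resource clears the cache.
        self.store.pop(key, None)
```

The regression check is exactly the sequence described above: read, update the underlying resource, confirm the stale value persists until `invalidate` is called.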

Verify: "After" metrics beat the Baseline (lower latency and/or higher throughput). Tests pass. Code is committed with a clear description of the perf gain.


9. Technical Considerations

The Law of Diminishing Returns: Optimization follows a power law. 80% of the gain comes from 20% of the code. Once you fix the main bottleneck, the next bottleneck will be much smaller. Know when to stop. Readable code is often better than slightly faster code.

Allocations (Memory/GC): In garbage-collected languages (Go/Python/Java), high memory allocation creates CPU pressure because the GC has to clean it up. Profiling "Alloc Objects" is often more fruitful than profiling "CPU Time" directly. Reusing buffers (sync.Pool) or effectively sizing slices can dramatically reduce latency jitter caused by GC pauses.
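In Python, allocation pressure can be measured with the stdlib tracemalloc module (the Go analogue is pprof's allocation profile). The `build_payload` function here is a hypothetical hot path used only to show the measurement:

```python
import tracemalloc

def build_payload(n):
    # Allocates a throwaway dict per row; in a hot path this churns the GC.
    return [{"i": i, "sq": i * i} for i in range(n)]

tracemalloc.start()
build_payload(100_000)
current_b, peak_b = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak allocations: {peak_b / 1e6:.1f} MB")
```

Comparing the peak before and after a refactor (e.g., generators instead of materialized lists) quantifies the allocation reduction directly.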

Distributed Systems & Fan-out: If your service calls 5 downstream services, your latency is defined by the slowest dependency (tail latency). Optimization here often involves parallel requests, timeouts, and "hedged requests" (sending 2, taking the first), rather than optimizing your own CPU loop.
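A hedged request can be sketched with asyncio; the timings and the `flaky_backend` simulation below are illustrative assumptions, not measured behavior:

```python
import asyncio

async def hedged(make_request, hedge_after=0.05):
    """Send one request; if no reply within hedge_after seconds,
    send a second and take whichever finishes first."""
    first = asyncio.ensure_future(make_request())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()
    second = asyncio.ensure_future(make_request())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

# Demo: the first attempt is slow (500ms); the hedge answers in 10ms.
calls = {"n": 0}
async def flaky_backend():
    calls["n"] += 1
    await asyncio.sleep(0.5 if calls["n"] == 1 else 0.01)
    return f"reply-{calls['n']}"

result = asyncio.run(hedged(flaky_backend))
```

The tail-latency win comes from the second request racing the slow first one; the cost is extra downstream load, so hedging is usually gated behind a delay near the dependency's p95.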


10. Quality Gates (Definition of Done)

Checklist

  • [ ] Performance measurement shows >10% improvement (or target met).
  • [ ] No regression in functional tests.
  • [ ] Optimization is documented (why did we do this weird thing?).
  • [ ] Build does not fail.

Validation

| Criterion | Method | Threshold |
|---|---|---|
| Latency | Load Test (k6) | < Target (e.g., 200ms) |
| Throughput | Load Test | > Target (e.g., 1000 RPS) |
| Correctness | Unit Tests | 100% Pass |

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Optimization broke logic | Faster results, but wrong data. | Revert immediately. Add a regression test case covering the specific data scenario. |
| Memory Leak | Optimization involved caching; now RAM is full. | Limit cache size (LRU/TTL). Ensure cache keys are bounded. |
| Thundering Herd | Cache expires; DB falls over. | Implement "Request Coalescing" (singleflight) or "Probabilistic Early Expiration". |
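Request coalescing can be sketched in Python with one `threading.Event` per key, a simplified analogue of Go's singleflight (loader errors are not propagated in this sketch; the `slow_loader` demo is illustrative):

```python
import threading
import time

class SingleFlight:
    """Collapse concurrent cache-miss loads for the same key into a
    single backend call; followers wait for the leader's result."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}

    def do(self, key, loader):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                event = threading.Event()
                self._inflight[key] = event
                is_leader = True
            else:
                is_leader = False
        if is_leader:
            try:
                event.result = loader(key)
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
            return event.result
        event.wait()
        return event.result

backend_calls = []
def slow_loader(key):
    backend_calls.append(key)
    time.sleep(0.05)  # simulate an expensive DB query
    return key.upper()

sf = SingleFlight()
results = []
workers = [threading.Thread(target=lambda: results.append(sf.do("k", slow_loader)))
           for _ in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Five concurrent misses for the same key trigger exactly one backend call, which is precisely the property that prevents the herd when a hot cache entry expires.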

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation

Copy the prompt below, replace placeholders, and paste into your agent.

```text
## Task: Optimize Backend Performance

### Context
I need to improve the performance of a specific component.
- **Metric**: [Target: TARGET_METRIC]
- **Role**: Backend Engineer

### Inputs
- **Base Code**: [Repo Path]
- **Load Script**: [Provide: LOAD_SCRIPT]

### Constraints
- Must include "Before" and "After" benchmarks.
- No functional regressions.
- Use profiling data, don't guess.

### Instructions
Execute the following procedure:

**Phase 1: Baseline**
Run the load script. Capture p50, p95, p99. Capture a CPU profile (flamegraph).

**Phase 2: Diagnosis & Tuning**
Analyze profile. Identify the bottleneck (CPU vs IO). Apply targeted fix (Index, Cache, or Refactor).

**Phase 3: Verification**
Re-run load script. Validate improvement. Run regression tests.

### Quality Gates
- [ ] Improvement Verified.
- [ ] Tests Pass.

### Output Format
- Benchmark Report (Before/After).
- Optimization Patch.
```
