Optimize Backend Performance
1. Objective
Making the system faster by doing less work.
The primary objective of this workflow is to methodically reduce the resource consumption (CPU, Memory, I/O) or execution time of a specific backend component. "Performance" is often vague; this workflow demands specificity—are we optimizing for Latency (response time) or Throughput (requests per second)? The goal is to move from "feeling slow" to "measured improvement" using data-driven profiling rather than guessing. Success is defined by a statistically significant improvement in the target metric without compromising correctness or maintainability.
2. Context & Scope
In Scope
This workflow covers application-level and database-level optimization for backend services. This includes:
- Code Profiling: Using tools like `pprof` (Go) or `cProfile`/`py-spy` (Python) to find CPU hot spots.
- Query Tuning: Analyzing SQL execution plans and adding indexes.
- Concurrency: Tuning thread pools, worker counts, or converting sync I/O to async.
- Caching: Implementing application-side caching (Redis/Memcached).
Assumption: You have a baseline metric. You cannot optimize what you cannot measure. Assumption: The code is functional and has passing tests.
Out of Scope
This workflow does not cover:
- Infrastructure Scaling: Simply adding more servers (Vertical/Horizontal scaling) is a capacity decision, not code optimization (see Capacity Planning).
- Frontend Performance: Optimizing React bundles or CSS (see Frontend Performance).
- Network Architecture: Reducing geographical latency via CDNs.
3. When to Use / When Not to Use
✅ Use This Workflow When
- An endpoint violates its Service Level Objective (SLO) (e.g., p99 > 500ms).
- A background job takes longer than its scheduling interval (the "overlapping job" problem).
- Cloud costs are exploding due to excessive CPU/Memory usage.
- You are preparing for a known high-traffic event (Black Friday).
❌ Do NOT Use This Workflow When
- "Premature Optimization": You are guessing where the bottleneck is before writing the code.
- The system is currently unstable or crashing (fix stability first).
- The gain is negligible (e.g., saving 1ms on a 5-minute job).
4. Inputs (Required/Optional)
Required Inputs
| Input | Description | Format | Example |
|---|---|---|---|
| TARGET_METRIC | The specific number to improve. | String | "POST /search p95 latency" |
| LOAD_SCRIPT | Script to generate valid traffic. | File Path | k6/load.js, locustfile.py |
Optional Inputs
| Input | Description | Format | Condition |
|---|---|---|---|
| APM_DASHBOARD | Link to Datadog/NewRelic/Jaeger. | URL | If observability is configured. |
| PROFILER_OUTPUT | Existing profiles. | .pprof | If a capture already exists. |
5. Outputs (Artifacts)
| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Baseline Report | Text/Markdown | docs/perf/baseline.md | Clear "Before" numbers (RPS/Latency). |
| Root Cause Analysis | Text | Incident Ticket | Identifies precisely why it was slow. |
| Optimization PR | Git Diff | Repo | Includes code changes and benchmark tests. |
| Verification Report | Text/Markdown | docs/perf/result.md | Clear "After" numbers showing improvement. |
6. Operating Modes
⚡ Fast Mode
Timebox: 1 hour. Scope: Low-hanging fruit. Details: Check for N+1 query issues and missing database indexes; enable basic caching.
🎯 Standard Mode (Default)
Timebox: 4 hours. Scope: Code/algorithm optimization. Details: Capture CPU/memory profiles, identify hot functions, refactor algorithms (O(n^2) -> O(n)), and optimize allocations.
🔬 Deep Mode
Timebox: 2-3 days. Scope: Architecture/kernel. Details: Rewrite critical paths in a faster language (Rust/Go extensions), tune garbage collector settings (GOGC), or offload processing to async queues.
7. Constraints & Guardrails
Technical Constraints
- Correctness First: An optimization that produces wrong results is a bug, not a feature. The test suite MUST pass.
- Readability: Do not write obfuscated bit-twiddling code unless absolutely necessary. Comment "Why" complex logic exists.
- Reproducibility: Benchmarks must be run on isolated hardware (or at least on an environment with consistent background load) to be valid.
Security & Privacy
CAUTION
Mandatory Security Considerations
- Cache Poisoning: When implementing caching, ensure the cache key includes all security contexts (e.g., UserID). Do not serve User A's data to User B because it was cached globally.
- Side Channels: Be aware that timing differences can sometimes leak information (Timing Attacks), though this is rare for general backend logic.
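The cache-key rule above can be sketched in a few lines. This is a minimal illustration with made-up names (`cache_key`, the `cache:` prefix, and the parameters are all hypothetical), not a prescribed API:

```python
import hashlib

def cache_key(user_id: str, endpoint: str, params: dict) -> str:
    """Build a cache key that embeds the security context (user_id).

    Hashing the serialized params keeps key length bounded.
    """
    serialized = "&".join(f"{k}={params[k]}" for k in sorted(params))
    digest = hashlib.sha256(serialized.encode()).hexdigest()[:16]
    return f"cache:{endpoint}:{user_id}:{digest}"

# Two users issuing the identical query get distinct keys,
# so User A's cached data can never be served to User B.
k_a = cache_key("user-a", "/search", {"q": "shoes"})
k_b = cache_key("user-b", "/search", {"q": "shoes"})
assert k_a != k_b
```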
8. Procedure
Phase 1: Establish Baseline (Preparation)
Objective: Measure the current performance to ensure we can prove improvement later.
Do not touch the code yet. First, run the LOAD_SCRIPT against a staging environment that mirrors production (or an isolated local environment). Record the throughput (RPS) and latency (p50, p95, p99). This is your Baseline.
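If your load tool does not report percentiles directly, they are easy to compute from raw samples. A minimal sketch using the nearest-rank method (the latency numbers are hypothetical):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p in (0, 100]."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# Hypothetical latency samples (ms) from one load-test run:
latencies_ms = [120, 95, 140, 480, 110, 130, 101, 99, 350, 125]
baseline = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(baseline)  # {'p50': 120, 'p95': 480, 'p99': 480}
```

Note how p95/p99 are dominated by the two slow outliers; this is why averages hide tail latency.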
Next, capture a profile while the load script is running.
- Go: `curl -o cpu.pprof http://localhost:6060/debug/pprof/profile?seconds=30`
- Python: `py-spy record -o profile.svg --pid [[PID]]`
- Database: Run `EXPLAIN ANALYZE` on the suspect queries.
Verify: You have a hard number (e.g., "140ms") and a visualization (Flamegraph) showing where the time is spent.
Phase 2: Diagnosis & Tuning (Execution)
Objective: Identify the bottleneck and apply the fix.
Analyze the Flamegraph. Look for the widest bars (CPU time) or the deepest stacks. Identify if the application is CPU Bound (spending time in calculation/serialization) or I/O Bound (waiting for DB/Network).
For I/O Bound:
- Check for "N+1 Queries". If you see many small calls to the DB, batch them.
- Check for missing indexes. If the database does a "Seq Scan", add an index.
- Introduce Caching. If the data is read often and written rarely, wrap the call in a Redis `Get`/`Set` block.
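The N+1 fix can be sketched with an in-memory SQLite database standing in for the real one (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edsger")])

post_author_ids = [1, 3, 1, 2]

# N+1 pattern: one query per post (N round trips to the DB).
names_slow = [conn.execute("SELECT name FROM authors WHERE id = ?",
                           (i,)).fetchone()[0]
              for i in post_author_ids]

# Batched: a single IN (...) query, then join in memory.
unique_ids = sorted(set(post_author_ids))
placeholders = ",".join("?" * len(unique_ids))
rows = conn.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
    unique_ids).fetchall()
by_id = dict(rows)
names_fast = [by_id[i] for i in post_author_ids]

assert names_slow == names_fast  # same results, 1 round trip instead of N
```

Over a network, collapsing N round trips into one is usually the single largest I/O win an ORM-heavy endpoint can get.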
For CPU Bound:
- Inspect the hot function. Are you allocating too much memory in a loop?
- Are you using an inefficient algorithm? (Nested loops).
- Can you parallelize the work? (Worker pools).
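The worker-pool idea can be sketched with the standard library (the `enrich` function is a hypothetical stand-in for per-record work):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich(record: int) -> int:
    # Placeholder for per-record work (e.g., an outbound API call).
    return record * 2

records = list(range(100))

# Sequential baseline.
sequential = [enrich(r) for r in records]

# Worker pool: a good fit when enrich() is I/O bound. For CPU-bound
# pure-Python work, ProcessPoolExecutor sidesteps the GIL instead.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(enrich, records))

assert parallel == sequential  # same results, order preserved
```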
Apply one change at a time. Do not apply 5 fixes at once; you won't know which one worked.
Verify: Run a quick micro-benchmark (Go `testing` benchmarks / Python `timeit`) to confirm the specific function is faster.
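A minimal `timeit` micro-benchmark might look like this (the string-concatenation example is illustrative; substitute your own hot function):

```python
import timeit

def concat_naive(parts):
    s = ""
    for p in parts:
        s += p             # may reallocate repeatedly
    return s

def concat_join(parts):
    return "".join(parts)  # single pass, single allocation

parts = ["x"] * 10_000
# Correctness first: both variants must agree before timing them.
assert concat_naive(parts) == concat_join(parts)

t_naive = timeit.timeit(lambda: concat_naive(parts), number=50)
t_join = timeit.timeit(lambda: concat_join(parts), number=50)
print(f"naive: {t_naive:.4f}s  join: {t_join:.4f}s")
```

Always benchmark both the old and new implementation in the same process run; comparing numbers from different sessions invites noise.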
Phase 3: Verification (Finalization)
Objective: Prove system-level improvement and regression check.
Re-run the LOAD_SCRIPT under the exact same conditions as Phase 1. Compare the new RPS and Latency numbers against the Baseline. Calculate the % improvement.
Run the full regression test suite. Specialized optimizations often break edge cases (e.g., caching might return stale data). Ensure data consistency is maintained. If you added a cache, verify the invalidation strategy works (e.g., updating the resource clears the cache).
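The invalidation check can be exercised with a minimal in-process sketch; here a dict stands in for Redis, and `CachedStore` is a hypothetical name, not a real client API:

```python
class CachedStore:
    """Minimal stand-in for a DB fronted by a cache (dict instead of Redis)."""

    def __init__(self):
        self.db = {}
        self.cache = {}

    def get(self, key):
        if key in self.cache:          # cache hit
            return self.cache[key]
        value = self.db.get(key)       # cache miss: read DB, populate cache
        self.cache[key] = value
        return value

    def update(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)      # invalidate: next get() re-reads the DB

store = CachedStore()
store.update("user:1", "Ada")
assert store.get("user:1") == "Ada"            # populates the cache
store.update("user:1", "Ada Lovelace")
assert store.get("user:1") == "Ada Lovelace"   # no stale read after update
```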
Verify: "After" metrics > "Before" metrics. Tests pass. Code is committed with a clear description of the perf gain.
9. Technical Considerations
The Law of Diminishing Returns: optimization follows a power law. 80% of the gain comes from 20% of the code. Once you fix the main bottleneck, the next bottleneck will be much smaller. Know when to stop. Readable code is often better than slightly faster code.
Allocations (Memory/GC): In garbage-collected languages (Go/Python/Java), high memory allocation creates CPU pressure because the GC has to clean it up. Profiling "Alloc Objects" is often more fruitful than profiling "CPU Time" directly. Reusing buffers (sync.Pool) or effectively sizing slices can dramatically reduce latency jitter caused by GC pauses.
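A Python analog of the buffer-reuse idea (the `Framer` class is illustrative, loosely mirroring what `sync.Pool` achieves in Go):

```python
# Allocating a fresh bytes object per message creates GC pressure:
def frame_naive(payload: bytes) -> bytes:
    return len(payload).to_bytes(4, "big") + payload

# Reusing one preallocated buffer across calls avoids those allocations:
class Framer:
    def __init__(self, capacity: int = 65536):
        self.buf = bytearray(capacity)   # allocated once, reused forever

    def frame(self, payload: bytes) -> memoryview:
        n = len(payload)
        self.buf[0:4] = n.to_bytes(4, "big")
        self.buf[4:4 + n] = payload
        return memoryview(self.buf)[:4 + n]  # zero-copy view, no new bytes

framer = Framer()
assert bytes(framer.frame(b"hello")) == frame_naive(b"hello")
```

The caveat: a reused buffer is only valid until the next `frame()` call, so callers must consume (or copy) the view before reusing the framer.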
Distributed Systems & Fan-out: If your service calls 5 downstream services, your latency is defined by the slowest dependency (tail latency). Optimization here often involves parallel requests, timeouts, and "hedged requests" (sending 2, taking the first), rather than optimizing your own CPU loop.
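A hedged request can be sketched with `asyncio` (replica names, payloads, and the simulated latencies are all hypothetical):

```python
import asyncio
import random

async def fetch_replica(name: str, payload: str) -> str:
    # Simulated downstream call with variable latency.
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"{name}:{payload}"

async def hedged_fetch(payload: str) -> str:
    """Send the same request to two replicas; take whichever answers first."""
    tasks = [asyncio.create_task(fetch_replica(n, payload))
             for n in ("replica-a", "replica-b")]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # abandon the slower request
    return done.pop().result()

result = asyncio.run(hedged_fetch("q=shoes"))
assert result.endswith(":q=shoes")
```

In production, hedging is usually sent after a short delay (e.g., the p95 of the first request) rather than immediately, to avoid doubling downstream load.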
10. Quality Gates (Definition of Done)
Checklist
- [ ] Performance measurement shows >10% improvement (or target met).
- [ ] No regression in functional tests.
- [ ] Optimization is documented (why did we do this weird thing?).
- [ ] Build does not fail.
Validation
| Criterion | Method | Threshold |
|---|---|---|
| Latency | Load Test (k6) | < Target (e.g., 200ms) |
| Throughput | Load Test | > Target (e.g., 1000 RPS) |
| Correctness | Unit Tests | 100% Pass |
11. Failure Modes & Recovery
| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Optimization broke logic | Faster results, but wrong data. | Revert immediately. Add a regression test case covering the specific data scenario. |
| Memory Leak | Optimization involved caching, now RAM is full. | Limit cache size (LRU/TTL). Ensure cache keys are bounded. |
| Thundering Herd | Cache expires, DB falls over. | Implement "Request Coalescing" (singleflight) or "Probabilistic Early Expiration". |
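Request coalescing (the singleflight pattern from the table above) can be sketched in pure Python; `SingleFlight` is an illustrative name modeled on Go's `golang.org/x/sync/singleflight`, not a standard library API:

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent loads of the same key: one caller ("leader")
    runs the loader; concurrent callers wait for and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader()
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        return holder["value"]

# Demo: four concurrent cache misses trigger only one "DB" load.
sf = SingleFlight()
calls, results = [], []

def slow_db_load():
    calls.append(1)          # count real loads
    time.sleep(0.05)         # simulate a slow query
    return "db-value"

threads = [threading.Thread(
               target=lambda: results.append(sf.do("hot-key", slow_db_load)))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert results == ["db-value"] * 4 and len(calls) == 1
```

This caps the load on the database at one query per key per expiry, no matter how many requests arrive during the miss window.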
12. Copy-Paste Prompt
TIP
One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.
```text
## Task: Optimize Backend Performance
### Context
I need to improve the performance of a specific component.
- **Metric**: [Target: TARGET_METRIC]
- **Role**: Backend Engineer
### Inputs
- **Base Code**: [Repo Path]
- **Load Script**: [Provide: LOAD_SCRIPT]
### Constraints
- Must include "Before" and "After" benchmarks.
- No functional regressions.
- Use profiling data, don't guess.
### Instructions
Execute the following procedure:
**Phase 1: Baseline**
Run the load script. Capture p50, p95, p99. Capture a CPU profile (flamegraph).
**Phase 2: Diagnosis & Tuning**
Analyze profile. Identify the bottleneck (CPU vs IO). Apply targeted fix (Index, Cache, or Refactor).
**Phase 3: Verification**
Re-run load script. Validate improvement. Run regression tests.
### Quality Gates
- [ ] Improvement Verified.
- [ ] Tests Pass.
### Output Format
- Benchmark Report (Before/After).
- Optimization Patch.
```