
Code Revamp (Optimization)


1. Objective

The objective of this workflow is to elevate "working" code to "high-performance" code. In ML, a slow loop can cost days of training time or thousands of dollars in inference. This workflow focuses on performance optimization (vectorization, caching) and strict enforcement of standards (Mypy strict mode, Ruff).


2. Context & Scope

In Scope

This workflow covers Static Analysis (Ruff/Mypy), Vectorization (replacing for loops with numpy/torch), Memory Profiling, and Dependency tightening.

Assumption: The code is already modular (functions/classes exist). If it is a messy notebook, use "ML Code Revamp (Refactor)" first.

Out of Scope

  • Algorithm Redesign: We are not changing the math, just the implementation efficiency.
  • Hardware Upgrade: We optimize for the current hardware.

3. When to Use / When Not to Use

Use This Workflow When

  • Your data loader is the bottleneck (GPU utilization < 50%).
  • Inference latency is too high for production SLA.
  • You are releasing an open-source library and need professional quality.

Do NOT Use This Workflow When

  • The code runs in 5 seconds and runs once a month. (Premature Optimization).
  • You are debugging logic errors. Fix bugs before optimizing speed.

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
| --- | --- | --- | --- |
| REPO_PATH | Code location. | Path | ./src/my_model |

Optional Inputs

| Input | Description | Format | Default |
| --- | --- | --- | --- |
| TARGET_LATENCY | Latency goal. | Float | 100ms |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
| --- | --- | --- | --- |
| Optimized Code | Source | Repo | Passes strict checking. |
| Profile Report | HTML/Text | Artifacts | Shows 'Before' vs 'After'. |

6. Operating Modes

Fast Mode

Timebox: 1 hour. Scope: Linter sweep. Details: Run `ruff check --select ALL --fix` and `black`; clean unused imports, fix formatting, and apply simple rewrites (e.g., f-strings).

🎯 Standard Mode (Default)

Timebox: 4 hours. Scope: Vectorization & typing. Details: Identify explicit Python loops and replace them with pandas/NumPy vector operations. Add type hints to 100% of signatures.

🔬 Deep Mode

Timebox: 2 days. Scope: Cython/JIT. Details: Use `numba.jit` or write Cython extensions for the absolute critical path; analyze memory leaks with memray.


7. Constraints & Guardrails

Technical Constraints

  • Maintainability vs Speed: Don't write obscure "bit-hacking" NumPy code unless absolutely necessary; when you do, comment heavily.
  • Compatibility: Ensure optimized code works on all supported platforms (Windows/Linux/macOS). Numba/Cython can be tricky with wheels.

Security & Privacy

CAUTION

Safe Unpickling: If modernizing data loading, switch from pickle to safetensors or JSON where possible. Unpickling untrusted data can execute arbitrary code.
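As a minimal sketch of the swap, using the stdlib `json` module (safetensors needs a third-party install; `save_config`/`load_config` are illustrative names, not part of this workflow's tooling):

```python
import json


def save_config(config: dict[str, float], path: str) -> None:
    """Serialize with JSON instead of pickle: loading never executes code."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f)


def load_config(path: str) -> dict[str, float]:
    """Read the config back; only plain data types can come out."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Unlike `pickle.load`, `json.load` can only produce dicts, lists, strings, and numbers, so a tampered file cannot run code on load.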

Compliance

  • Dependencies: Do not introduce heavy dependencies (like pandas) into a lightweight inference library if numpy suffices.

8. Procedure

Phase 1: Profiling (Baseline)

Objective: Find the hotspot.

Instrument the code.

  • CPU: Use cProfile or py-spy, e.g. py-spy record -o profile.svg --pid 12345.
  • Memory: Use memray.

Identify the top 3 functions consuming time/RAM. Note: often the bottleneck is I/O (loading files) or data copying, not the math.

Verify: A Flame Graph showing where time is spent.
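For an in-process baseline, the stdlib `cProfile` is enough to surface hotspots. A minimal sketch (`slow_feature_sum` and `profile_hotspot` are hypothetical names for illustration):

```python
import cProfile
import io
import pstats


def slow_feature_sum(rows: list[list[float]]) -> list[float]:
    """Toy hotspot: sums each row with an explicit Python loop."""
    totals = []
    for row in rows:
        total = 0.0
        for value in row:
            total += value
        totals.append(total)
    return totals


def profile_hotspot() -> str:
    """Profile the toy function and return the top entries as text."""
    rows = [[float(i + j) for j in range(100)] for i in range(1000)]
    profiler = cProfile.Profile()
    profiler.enable()
    slow_feature_sum(rows)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
    return buffer.getvalue()


report = profile_hotspot()
```

The printed stats name the top functions by cumulative time; py-spy's flame graph gives the same information visually, and without modifying the code.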

Phase 2: Vectorization & Optimization

Objective: Hardware acceleration.

Refactor the Hotspots:

  • Loop Removal: Replace [x * 2 for x in values] with arr * 2 on a NumPy array.
  • Pre-allocation: Don't append to lists in a loop; allocate np.zeros(N) and fill it.
  • Parallelism: Use joblib.Parallel or ProcessPoolExecutor for CPU-bound tasks.
  • Caching: Use @functools.lru_cache for repeated calls with identical arguments.

Verify: Micro-benchmark the function. Speedup > 2x.
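The first, second, and fourth rewrites above can be sketched as follows (a minimal illustration assuming NumPy is available; `rolling_sum_preallocated` and `expensive_lookup` are hypothetical stand-ins for real hotspots):

```python
import functools

import numpy as np


def doubled_loop(values: list[float]) -> list[float]:
    """Baseline: explicit Python loop (slow for large inputs)."""
    out = []
    for x in values:
        out.append(x * 2)
    return out


def doubled_vectorized(arr: np.ndarray) -> np.ndarray:
    """Vectorized equivalent: one NumPy operation over the whole array."""
    return arr * 2


def rolling_sum_preallocated(arr: np.ndarray, window: int) -> np.ndarray:
    """Pre-allocate the output buffer instead of appending in a loop."""
    out = np.zeros(len(arr) - window + 1)
    for i in range(len(out)):
        out[i] = arr[i : i + window].sum()
    return out


@functools.lru_cache(maxsize=None)
def expensive_lookup(key: int) -> int:
    """Cache repeated calls with identical arguments."""
    return sum(range(key))  # stand-in for a costly computation


arr = np.arange(5, dtype=float)
assert doubled_vectorized(arr).tolist() == doubled_loop(arr.tolist())
```

The vectorized and loop versions must agree exactly on the same input; micro-benchmark both (e.g. with `timeit`) to confirm the speedup.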

Phase 3: Modernization (Strictness)

Objective: Bulletproof the code with strict static analysis.

Configure mypy.ini with strict = True, run mypy ., and fix all errors (no implicit Any; annotate generics, e.g. List[str]). Run ruff with the SIM (Simplify) and PERF (Performance) rule sets enabled.

  • PERF401: Use a list comprehension to create a transformed list.
  • SIM105: Use contextlib.suppress instead of try-except-pass.

Verify: Codebase is "Green" on strict settings.
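The two rule fixes above look like this in practice (a minimal sketch; the `_old`/`_new` function names are illustrative, not Ruff terminology):

```python
import contextlib
import os


def remove_if_exists_old(path: str) -> None:
    # Flagged by SIM105: try-except-pass.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def remove_if_exists_new(path: str) -> None:
    # SIM105 fix: contextlib.suppress replaces try-except-pass.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


def lengths_old(words: list[str]) -> list[int]:
    # Flagged by PERF401: manual loop building a transformed list.
    out: list[int] = []
    for w in words:
        out.append(len(w))
    return out


def lengths_new(words: list[str]) -> list[int]:
    # PERF401 fix: list comprehension.
    return [len(w) for w in words]
```

Both rewrites are behavior-preserving, which is exactly what makes them safe for an automated `ruff check --fix` sweep.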


9. Technical Considerations

Generators: Use generators (yield) for data processing. Do not load a 10GB CSV into RAM; stream it line by line or chunk by chunk.
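A minimal chunked-reading sketch (`iter_chunks` is an illustrative name; real pipelines would add CSV parsing on top):

```python
from collections.abc import Iterator


def iter_chunks(path: str, chunk_size: int = 1024) -> Iterator[list[str]]:
    """Yield lists of lines; only one chunk is ever held in RAM."""
    chunk: list[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # file iteration is itself lazy
            chunk.append(line.rstrip("\n"))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:  # flush the final partial chunk
            yield chunk
```

Because the function yields, a consumer like `for chunk in iter_chunks(path):` processes a 10GB file with memory bounded by `chunk_size`, not file size.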

Fusing Operations: In PyTorch, use torch.compile (2.0+) to fuse kernels automatically. Manual fusion is hard; let the compiler do it.

Dependency Trimming: Check pipdeptree. Remove unused transitive dependencies to speed up Docker builds and reduce attack surface.


10. Quality Gates (Definition of Done)

Checklist

  • [ ] Hotspots identified via Profiler.
  • [ ] Vectorization applied to critical path.
  • [ ] Mypy Strict passing (or annotated ignores).
  • [ ] requirements.txt minimized.

Validation

| Criterion | Method | Threshold |
| --- | --- | --- |
| Latency | Benchmark | Reduced by > 20% |
| Lint | Ruff | 0 errors |

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
| --- | --- | --- |
| OOM | Process killed (signal 9). | You optimized for speed (caching) but blew up memory. Reduce concurrency or cache size. |
| Regressions | Fast but wrong. | Run the unit test suite. If tests fail, your optimized logic is flawed. |
| Build Fail | Structurally sound but fails CI. | Check Mypy version differences. Pin dev dependencies. |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

text
Role: Act as a Senior ML Engineer specializing in Performance Optimization.
Task: Execute the Code Revamp (Optimization) workflow.

## Objective & Scope
- **Goal**: Elevate "working" code to "high-performance" production standards (Speed + Types).
- **Scope**: Profiling, Vectorization, Strict Typing (Mypy), and Linting (Ruff).

## Inputs
- [ ] REPO_PATH: Path to source code.
- [ ] TARGET_LATENCY: Latency goal (optional).

## Output Artifacts
- [ ] Optimized Codebase (Source)
- [ ] Benchmark Report (Text/HTML)

## Execution Steps
1. **Diagnosis**
   - Instrument code with cProfile/py-spy/memray to identify top 3 hotspots.
2. **Optimize**
   - Refactor hotspots using Vectorization (Numpy/Torch), Pre-allocation, or Caching.
3. **Standardize**
   - Run strict linters (Ruff/Mypy) and fix all errors. Annotate generic types.

## Quality Gates
- [ ] Hotspots identified and optimized.
- [ ] Vectorization applied.
- [ ] Mypy Strict passing.
- [ ] Speedup > 20% verified.

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Technical**: Prioritize Vectorization over Complexity. Maintain readability.
- **Security**: Use safe serialization (safetensors) where possible.

## Command
Now execute this workflow step-by-step.

Appendix: Change Log

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
