Profile GPU Usage

1. Objective

To analyze and optimize GPU utilization so that training time and cost are reduced for high-performing AI systems.

2. When to use / When not to use

When to use:

When the GPU utilization of a training script needs to be analyzed or optimized to reduce training time and cost.

When not to use:

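When the data needed for profiling is insufficient, for example when the training script or environment details (GPU type, CUDA version) are unavailable.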

3. Inputs (Required/Optional)

Required:

Training Data
Model Config

Optional:

Relevant documentation

4. Outputs (Artifacts)

Model Artifact: Description of the Model Artifact artifact.
Metrics Report: Description of the Metrics Report artifact.

5. Operating Modes

Mode	Description	Verification Level
Fast	Focus on speed, minimal validation.	Basic syntax/lint checks only.
Standard	Balanced approach.	Unit tests and standard linting.
Deep	Comprehensive analysis and optimization.	Full test suite, performance profiling, security scan.

6. Constraints & Guardrails

No Broken Builds: Ensure all changes pass the build process.
Code Style: Strictly adhere to the project's linting and formatting rules.
Security: Do not introduce new vulnerabilities; sanitize all inputs.
Performance: Avoid O(n^2) or worse complexity unless strictly necessary and documented.
Testing: Maintain or improve code coverage; do not degrade it.

7. Procedure

Phase 1: Data Prep

Clean data.
Split train/test.
Normalize features.

Phase 2: Training

Configure model.
Run training loop.
Log metrics.
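
For this workflow, the metrics logged during training can include GPU utilization and memory use. The following is a minimal sketch of one way to sample them, assuming `nvidia-smi` is on the PATH. The function names (`parse_gpu_samples`, `read_gpu_samples`, `mean_utilization`) are illustrative and not part of any library. The parser expects the two-column output produced by the query shown in `read_gpu_samples`.

```python
import csv
import io
import subprocess

# Standard nvidia-smi --query-gpu properties: utilization in percent, memory in MiB.
QUERY_FIELDS = "utilization.gpu,memory.used"


def parse_gpu_samples(csv_text):
    """Parse 'util, mem' CSV rows (no header, no units) into one dict per GPU."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        try:
            util, mem = (float(field.strip()) for field in row)
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        samples.append({"util_pct": util, "mem_used_mib": mem})
    return samples


def read_gpu_samples():
    """Query nvidia-smi once; raises if the tool is missing or exits with an error."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_samples(result.stdout)


def mean_utilization(samples):
    """Average GPU utilization in percent over the samples (0.0 if there are none)."""
    if not samples:
        return 0.0
    return sum(s["util_pct"] for s in samples) / len(samples)
```

Calling `read_gpu_samples()` at a fixed interval from the training loop and logging `mean_utilization` next to loss and throughput gives a utilization curve over time. Polling too often adds overhead of its own, so the interval is a trade-off to tune.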

Phase 3: Evaluation

Calculate accuracy/F1.
Check bias.
Save artifact.

8. Quality Gates (Definition of Done)

[ ] Code compiles and runs without errors.
[ ] Test coverage for all new components is maintained or improved.
[ ] No new linting errors or warnings introduced.
[ ] Documentation updated (inline and external).
[ ] Security scan passes (no high/critical severities).

9. Failure Modes & Recovery

Failure Mode	Recovery Action
Build Failure	Check error logs, revert recent changes, verify dependencies.
Test Failure	Isolate failing test, debug logic, or update test if requirements changed.
Linting Error	Run auto-formatter and manually fix remaining issues.
Merge Conflict	Rebase on main, resolve conflicts manually, run tests again.

10. Copy-Paste Prompt

Role: Act as a Senior ML Ops Engineer.
Task: Execute the Profile GPU Usage workflow.

## Objective & Scope
- **Goal**: Analyze and optimize GPU utilization to reduce training time and costs.
- **Scope**: GPU Memory tracking, Compute utilization, Data Loader bottlenecks, and Kernel profiling.

## Inputs
- [ ] SCRIPT: The training script to profile.
- [ ] ENV: Environment details (GPU type, CUDA version).

## Output Artifacts
- [ ] GPU Profile Report (Memory/Util)
- [ ] Optimization Recommendations

## Execution Steps
1. **Monitor**
   - Run the script with `nvidia-smi` or `dcgm-exporter` monitoring. Check for low or strongly fluctuating GPU utilization (a sign of starvation).
2. **Profile**
   - Use PyTorch Profiler or Nsight Systems to trace execution. Identify idle gaps in GPU activity.
3. **Diagnose**
   - Determine cause: DataLoader bottlenecks? CPU overhead? Memory fragmentation?

## Quality Gates
- [ ] Utilization plotted over time.
- [ ] Bottleneck source identified (Compute vs IO).
- [ ] Actionable recommendations provided.

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Overhead**: Keep profiling overhead low (for example, by limiting the sampling rate) so the tools don't skew results significantly.

## Command
Now execute this workflow step-by-step.
