
Fine-Tune LLM (LoRA)


1. Purpose

Specialize a general-purpose model (GPT/Llama) into a domain expert (Legal/Medical). LoRA freezes the base model's weights and trains tiny "adapter" layers, cutting VRAM usage by roughly 90% and greatly reducing the risk of "Catastrophic Forgetting".


2. When to Use / When Not to Use

Use This Workflow When

  • Base model fails to follow a specific format (JSON, SQL).
  • Domain language is unique (Medical jargon).
  • You have high-quality example pairs.

Do NOT Use This Workflow When

  • You just need to add knowledge (Use RAG). Fine-tuning is for Behavior, RAG is for Facts.
  • You have < 100 examples (Prompt engineering is better).
  • You want to teach the model "Reasoning" (Very hard to fine-tune; typically requires RLHF or preference tuning).

3. Inputs

Required Inputs

  • [[BASE_MODEL]]: e.g., meta-llama/Llama-2-7b-hf.
  • [[DATASET_PATH]]: JSONL file with {"instruction": "...", "output": "..."}.
  • [[OUTPUT_ADAPTER]]: Directory to save weights.
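
For reference, one [[DATASET_PATH]] line can be generated and sanity-checked with the standard library (the example texts below are illustrative, not from a real dataset):

```python
import json

# One training record in the expected JSONL schema (example content is made up).
record = {
    "instruction": "Summarize the clause in plain English.",
    "output": "The tenant must give 30 days' notice before moving out.",
}

line = json.dumps(record)      # one JSON object per line in the .jsonl file
parsed = json.loads(line)

# Minimal schema check before training.
assert set(parsed) == {"instruction", "output"}
```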

4. Outputs

  • Adapter: adapter_model.safetensors (or legacy adapter_model.bin); size depends on r and target modules, typically tens of MB for a 7B model.
  • Merged Model: (Optional) Base + Adapter fused for faster inference.
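
The optional merge step can be sketched with PEFT's `merge_and_unload` (paths below are placeholders; requires `peft`, `transformers`, and enough memory to hold the full model):

```python
# Sketch: fuse the LoRA weights into the base model for adapter-free inference.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/adapter_dir")
merged = model.merge_and_unload()          # folds the low-rank deltas into the frozen weights
merged.save_pretrained("path/to/merged_model")
```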

5. Preconditions

  • GPU with adequate VRAM (e.g., A10G or local RTX 3090/4090).
  • HuggingFace Token (if using Gated models).
  • peft, transformers, bitsandbytes installed.

6. Procedure

Phase 1: Setup

  1. Action: Quantize Base Model.

    • Expected Output: Model loaded in 4-bit (QLoRA) via BitsAndBytesConfig.
    • Notes: Crucial for memory. 7B model fits on 16GB VRAM.
  2. Action: Prepare LoRA Config.

    • Expected Output: LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"]).
    • Notes: r is rank. Higher = more parameters to train, but diminishing returns.
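
Phase 1 can be sketched as follows; the model name is a placeholder, and the script assumes a CUDA GPU with transformers, peft, and bitsandbytes installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the base model in 4-bit (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # QLoRA's NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Step 2: attach the LoRA adapters.
lora = LoraConfig(
    r=8, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # typically well under 1% of total
```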

Phase 2: Training

  1. Action: Tokenize Data.

    • Expected Output: Inputs padded to consistent length (e.g., 512 or 2048).
  2. Action: Run Trainer.

    • Expected Output: Training loop runs; loss decreases steadily.
    • Notes: Watch for overfitting (training loss approaches 0 while validation loss rises).
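
Phase 2 can be sketched with trl's SFTTrainer, assuming `model` from Phase 1; exact SFTTrainer arguments vary across trl versions, so treat this as a template rather than a drop-in script:

```python
# Sketch of the training step (hyperparameters are starting points, not tuned values).
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

data = load_dataset("json", data_files="train.jsonl", split="train")

args = TrainingArguments(
    output_dir="adapter_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch of 8 on one GPU
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=data)
trainer.train()
trainer.save_model("adapter_out")    # saves only the adapter weights
```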

Phase 3: Validation

  1. Action: Inference Test.
    • Expected Output: Compare Base Model vs Base + Adapter on a test prompt.
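
The inference test can be sketched as a side-by-side generation, assuming the adapter was saved to `adapter_out` (model name and prompt are placeholders):

```python
# Sketch: compare base vs base + adapter on the same test prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = "Extract the parties from the following clause: ..."
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto")
ids = tok(prompt, return_tensors="pt").to(base.device)

print(tok.decode(base.generate(**ids, max_new_tokens=64)[0]))   # base behavior

tuned = PeftModel.from_pretrained(base, "adapter_out")          # attach adapter
print(tok.decode(tuned.generate(**ids, max_new_tokens=64)[0]))  # tuned behavior
```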

7. Quality Gates

  • [ ] Loss Convergence: Training loss steadily declined.
  • [ ] Format Compliance: Output strictly follows desired format (if training for format).
  • [ ] Safety: Model didn't unlearn safety guardrails (Run basic red-teaming).

8. Failure Handling

OOM (Out of Memory)

  • Symptoms: CUDA OOM error.
  • Recovery: Reduce Batch Size (try 1 with Gradient Accumulation). Reduce Context Length. Use 4-bit loading.
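
The batch-size recovery works because gradient accumulation keeps the optimizer's effective batch constant; a quick sanity check of the arithmetic:

```python
# Effective batch = per-device batch x accumulation steps (x number of GPUs).
# Dropping per-device batch from 8 to 1 while raising accumulation 8x keeps
# the effective batch, and hence training dynamics, roughly the same.
def effective_batch(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    return per_device * accum_steps * n_gpus

before = effective_batch(per_device=8, accum_steps=1)
after = effective_batch(per_device=1, accum_steps=8)
assert before == after == 8
```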

Collapse (Gibberish Output)

  • Symptoms: Model outputs repeating loops or NaN.
  • Recovery: Learning Rate too high. Reduce by 10x. Check dataset quality (Garbage In, Garbage Out).

9. Paste Prompt

TIP

One-Click Agent Invocation: Copy the prompt below, replace the placeholders, and paste it into your agent.

text
Role: Act as a LLM Engineer.
Task: Execute the LoRA Fine-Tuning workflow.

## Objective
Fine-tune [[BASE_MODEL]] on [[DATASET_PATH]].

## Inputs
- **Output**: [[OUTPUT_ADAPTER]]

## Procedure
Execute the following phases:

1. **Load**:
   - `AutoModelForCausalLM` with a 4-bit `BitsAndBytesConfig` (via `quantization_config`).
   - Apply `prepare_model_for_kbit_training`.

2. **Config**:
   - Define `LoraConfig`. Target `q_proj`, `v_proj`.
   - Define `TrainingArguments` (lr=2e-4, batch=1).

3. **Train**:
   - Use `SFTTrainer` (Supervised Fine-tuning).
   - Save adapter.

## Quality Gates
- [ ] Gradient Checkpointing enabled (Save VRAM).
- [ ] Validation Set included.
- [ ] Loss logged to Tensorboard/W&B.

## Constraints
- Output: Python Training Script.
- Library: HF PEFT + Transformers.

## Command
Write the training script.
