
Fine-Tune LLM (LoRA)


1. Purpose

Specialize a general-purpose model (GPT/Llama) into a domain expert (Legal/Medical). LoRA freezes the base model's weights and trains tiny "adapter" layers, cutting VRAM usage by roughly 90% and greatly reducing the risk of "Catastrophic Forgetting".


2. When to Use / When Not to Use

Use This Workflow When

  • Base model fails to follow a specific format (JSON, SQL).
  • Domain language is unique (Medical jargon).
  • You have high-quality example pairs.

Do NOT Use This Workflow When

  • You just need to add knowledge (Use RAG). Fine-tuning is for Behavior, RAG is for Facts.
  • You have < 100 examples (Prompt engineering is better).
  • You want to teach the model "Reasoning" (Very hard to fine-tune; typically requires RLHF or preference tuning).

3. Inputs

Required Inputs

  • [[BASE_MODEL]]: e.g., meta-llama/Llama-2-7b-hf.
  • [[DATASET_PATH]]: JSONL file with {"instruction": "...", "output": "..."}.
  • [[OUTPUT_ADAPTER]]: Directory to save weights.
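
For reference, one [[DATASET_PATH]] line can be generated and sanity-checked with the standard library (the example texts below are illustrative, not from a real dataset):

```python
import json

# One training record in the expected JSONL schema (example content is made up).
record = {
    "instruction": "Summarize the clause in plain English.",
    "output": "The tenant must give 30 days' notice before moving out.",
}

line = json.dumps(record)      # one JSON object per line in the .jsonl file
parsed = json.loads(line)

# Minimal schema check before training.
assert set(parsed) == {"instruction", "output"}
```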

4. Outputs

  • Adapter: adapter_model.safetensors (or legacy adapter_model.bin); size depends on r and target modules, typically tens of MB for a 7B model.
  • Merged Model: (Optional) Base + Adapter fused for faster inference.
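
The optional merge step can be sketched with PEFT's `merge_and_unload` (paths below are placeholders; requires `peft`, `transformers`, and enough memory to hold the full model):

```python
# Sketch: fuse the LoRA weights into the base model for adapter-free inference.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/adapter_dir")
merged = model.merge_and_unload()          # folds the low-rank deltas into the frozen weights
merged.save_pretrained("path/to/merged_model")
```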

5. Preconditions

  • GPU with adequate VRAM (e.g., A10G or local RTX 3090/4090).
  • HuggingFace Token (if using Gated models).
  • peft, transformers, bitsandbytes installed.

6. Procedure

Phase 1: Setup

  1. Action: Quantize Base Model.

    • Expected Output: Model loaded in 4-bit (QLoRA) via BitsAndBytesConfig.
    • Notes: Crucial for memory. 7B model fits on 16GB VRAM.
  2. Action: Prepare LoRA Config.

    • Expected Output: LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"]).
    • Notes: r is rank. Higher = more parameters to train, but diminishing returns.
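
Phase 1 can be sketched as follows; the model name is a placeholder, and the script assumes a CUDA GPU with transformers, peft, and bitsandbytes installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the base model in 4-bit (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # QLoRA's NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Step 2: attach the LoRA adapters.
lora = LoraConfig(
    r=8, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # typically well under 1% of total
```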

Phase 2: Training

  1. Action: Tokenize Data.

    • Expected Output: Inputs padded to consistent length (e.g., 512 or 2048).
  2. Action: Run Trainer.

    • Expected Output: Training loop runs; loss decreases steadily.
    • Notes: Watch for overfitting (training loss approaches 0 while validation loss rises).
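
Phase 2 can be sketched with trl's SFTTrainer, assuming `model` from Phase 1; exact SFTTrainer arguments vary across trl versions, so treat this as a template rather than a drop-in script:

```python
# Sketch of the training step (hyperparameters are starting points, not tuned values).
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

data = load_dataset("json", data_files="train.jsonl", split="train")

args = TrainingArguments(
    output_dir="adapter_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch of 8 on one GPU
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=data)
trainer.train()
trainer.save_model("adapter_out")    # saves only the adapter weights
```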

Phase 3: Validation

  1. Action: Inference Test.
    • Expected Output: Compare Base Model vs Base + Adapter on a test prompt.
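
The inference test can be sketched as a side-by-side generation, assuming the adapter was saved to `adapter_out` (model name and prompt are placeholders):

```python
# Sketch: compare base vs base + adapter on the same test prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt = "Extract the parties from the following clause: ..."
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto")
ids = tok(prompt, return_tensors="pt").to(base.device)

print(tok.decode(base.generate(**ids, max_new_tokens=64)[0]))   # base behavior

tuned = PeftModel.from_pretrained(base, "adapter_out")          # attach adapter
print(tok.decode(tuned.generate(**ids, max_new_tokens=64)[0]))  # tuned behavior
```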

7. Quality Gates

  • [ ] Loss Convergence: Training loss steadily declined.
  • [ ] Format Compliance: Output strictly follows desired format (if training for format).
  • [ ] Safety: Model didn't unlearn safety guardrails (Run basic red-teaming).

8. Failure Handling

OOM (Out of Memory)

  • Symptoms: CUDA OOM error.
  • Recovery: Reduce Batch Size (try 1 with Gradient Accumulation). Reduce Context Length. Use 4-bit loading.
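
The batch-size recovery works because gradient accumulation keeps the optimizer's effective batch constant; a quick sanity check of the arithmetic:

```python
# Effective batch = per-device batch x accumulation steps (x number of GPUs).
# Dropping per-device batch from 8 to 1 while raising accumulation 8x keeps
# the effective batch, and hence training dynamics, roughly the same.
def effective_batch(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    return per_device * accum_steps * n_gpus

before = effective_batch(per_device=8, accum_steps=1)
after = effective_batch(per_device=1, accum_steps=8)
assert before == after == 8
```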

Collapse (Gibberish Output)

  • Symptoms: Model outputs repeating loops or NaN.
  • Recovery: Learning Rate too high. Reduce by 10x. Check dataset quality (Garbage In, Garbage Out).

9. Paste Prompt

TIP

One-Click Agent Invocation: Copy the prompt below, replace the placeholders, and paste it into your agent.

text
Role: Act as a LLM Engineer.
Task: Execute the LoRA Fine-Tuning workflow.

## Objective
Fine-tune [[BASE_MODEL]] on [[DATASET_PATH]].

## Inputs
- **Output**: [[OUTPUT_ADAPTER]]

## Procedure
Execute the following phases:

1. **Load**:
   - `AutoModelForCausalLM` with a 4-bit `BitsAndBytesConfig` (via `quantization_config`).
   - Apply `prepare_model_for_kbit_training`.

2. **Config**:
   - Define `LoraConfig`. Target `q_proj`, `v_proj`.
   - Define `TrainingArguments` (lr=2e-4, batch=1).

3. **Train**:
   - Use `SFTTrainer` (Supervised Fine-tuning).
   - Save adapter.

## Quality Gates
- [ ] Gradient Checkpointing enabled (Save VRAM).
- [ ] Validation Set included.
- [ ] Loss logged to Tensorboard/W&B.

## Constraints
- Output: Python Training Script.
- Library: HF PEFT + Transformers.

## Command
Write the training script.
