TL;DR

Fine-tuning trains a pre-trained model on your own data to improve performance on a specific task. Consider it when prompting and RAG aren't sufficient, but budget for training data, compute cost, and ongoing maintenance.

What is fine-tuning?

Definition:
Additional training on a pre-trained model using your own dataset.

Goal:

  • Adapt to your domain (medical, legal, etc.)
  • Learn your style or format
  • Improve specific task performance

Not:

  • Teaching completely new knowledge (use RAG)
  • Fixing all model limitations

When to fine-tune

Good candidates:

  • Specific style/format needed
  • Domain-specific language
  • Consistent task structure
  • Labeled data on hand (hundreds to thousands of examples)

Examples:

  • Generate emails in your company's tone
  • Classify support tickets into custom categories
  • Extract entities specific to your industry

When NOT to fine-tune

Use RAG instead if:

  • Need to add knowledge
  • Knowledge changes frequently
  • Don't have training data

Use better prompting if:

  • Task is general
  • Few-shot examples work well
  • Data collection is hard

The fine-tuning process

1. Prepare data:

  • Collect 100-10,000 examples
  • Format as input-output pairs
  • Clean and deduplicate

2. Choose base model:

  • GPT-3.5, GPT-4 (OpenAI)
  • Llama, Mistral (open source)

3. Train:

  • Upload data to the platform or run training locally (see the sketch after this list)
  • Set hyperparameters (learning rate, epochs)
  • Monitor training metrics

4. Evaluate:

  • Test on held-out data
  • Compare to base model

5. Deploy:

  • Use fine-tuned model via API or hosting
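
A minimal sketch of steps 1 and 3 using the OpenAI Python SDK (v1.x); the file name, base model, and epoch count are placeholder choices, not recommendations:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: write input-output pairs as JSONL, one example per line.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a customer support agent."},
        {"role": "user", "content": "My order is late"},
        {"role": "assistant", "content": "I apologize. Let me check your order status..."},
    ]},
]
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Step 3: upload the file and start a fine-tuning job.
uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",            # base model (placeholder choice)
    training_file=uploaded.id,
    hyperparameters={"n_epochs": 3},  # placeholder; the platform can auto-select
)
print(job.id, job.status)             # poll the job until it finishes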

Data requirements

Quantity:

  • Minimum: 50-100 examples
  • Recommended: 500-1000+
  • More is better (diminishing returns)

Quality:

  • Accurate labels
  • Representative of production
  • Diverse examples

Format (example for OpenAI; the uploaded file is JSONL, one example per line, shown pretty-printed here):

{"messages": [
  {"role": "system", "content": "You are a customer support agent."},
  {"role": "user", "content": "My order is late"},
  {"role": "assistant", "content": "I apologize. Let me check your order status..."}
]}
...
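
Before uploading, it's worth validating and deduplicating the file. A small sketch, assuming the JSONL layout above:

import json

seen, clean, total = set(), [], 0
with open("train.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        total += 1
        record = json.loads(line)  # raises if the line is not valid JSON
        roles = [m["role"] for m in record["messages"]]
        assert set(roles) <= {"system", "user", "assistant"}, f"bad role on line {line_number}"
        key = json.dumps(record, sort_keys=True)
        if key not in seen:        # drop exact duplicates
            seen.add(key)
            clean.append(record)

with open("train.clean.jsonl", "w") as f:
    for record in clean:
        f.write(json.dumps(record) + "\n")

print(f"kept {len(clean)} of {total} examples")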

Fine-tuning platforms

OpenAI:

  • GPT-3.5, GPT-4 fine-tuning
  • Easy API
  • Billed for training tokens plus per-token usage

Hugging Face:

  • Open source models
  • Training scripts provided (see the training sketch after this section)
  • Self-host or use Endpoints

Google Vertex AI:

  • Fine-tune PaLM models
  • Managed service

Self-hosted (advanced):

  • Full control
  • Requires ML expertise
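
For the Hugging Face or self-hosted route, here is a minimal supervised fine-tuning sketch with the transformers Trainer; the model name, data file, field layout, and hyperparameters are placeholder assumptions (a real run would typically add a GPU and often LoRA/PEFT):

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"   # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="train.jsonl")["train"]

def tokenize(example):
    # Assumption: each line carries one "text" field holding prompt + response.
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="ft-out",
    num_train_epochs=3,            # placeholder hyperparameters
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()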

Costs

OpenAI fine-tuning:

  • Training: $0.008 per 1K tokens
  • Usage: higher per-token price than the base model (check current pricing)
  • Example: 1M training tokens for one epoch = $8 (worked estimate below)

Self-hosted:

  • GPU costs ($500-5000/month)
  • Engineering time
  • Cheaper at scale
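
To sanity-check a budget before training, a back-of-the-envelope estimate using the OpenAI rate above (self-hosted costs don't follow this formula):

def training_cost_usd(tokens: int, epochs: int = 1, price_per_1k: float = 0.008) -> float:
    # Billed tokens = tokens in the training file x number of epochs.
    return tokens / 1000 * epochs * price_per_1k

print(training_cost_usd(1_000_000))            # 1M tokens, 1 epoch  -> 8.0
print(training_cost_usd(1_000_000, epochs=3))  # 1M tokens, 3 epochs -> 24.0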

Common pitfalls

Overfitting:

  • Model memorizes training data
  • Fails on new examples
  • Solution: More diverse data, early stopping
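
Continuing the Trainer sketch from the platforms section above, early stopping can be wired in like this; the eval cadence and patience are placeholder choices, and tokenized_eval assumes a held-out split was set aside and tokenized the same way:

from transformers import (DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

args = TrainingArguments(
    output_dir="ft-out",
    evaluation_strategy="steps",   # evaluate on held-out data during training
    eval_steps=200,
    save_strategy="steps",         # must match the evaluation cadence
    save_steps=200,
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,                   # model and tokenized data from the earlier sketch
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized_eval,   # assumption: held-out split of the same data
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 flat evals
)
trainer.train()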

Insufficient data:

  • Too few examples for the model to learn the pattern reliably
  • Solution: Collect more data, or start with few-shot prompting

Wrong base model:

  • Too small (can't learn)
  • Too large (expensive, slow)

Ignoring alternatives:

  • Sometimes a better prompt gives the same results
  • Try RAG first

Evaluation

Compare:

  • Fine-tuned model vs. the base model with your best prompt
  • Same held-out test set and metrics for both

Metrics:

  • Accuracy, F1, BLEU (task-dependent)
  • Human evaluation
  • A/B test in production
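
A minimal sketch of the side-by-side comparison for a classification-style task; predict_base and predict_finetuned are hypothetical wrappers around the two models, not library functions:

import json

def accuracy(predict, examples):
    # Fraction of held-out examples where the predicted label matches the gold label.
    hits = sum(predict(example["input"]) == example["label"] for example in examples)
    return hits / len(examples)

with open("test.jsonl") as f:   # held-out data, never used in training
    test_set = [json.loads(line) for line in f]

# predict_base / predict_finetuned would each call one model and return a label:
# print(f"base:       {accuracy(predict_base, test_set):.1%}")
# print(f"fine-tuned: {accuracy(predict_finetuned, test_set):.1%}")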

Maintaining fine-tuned models

  • Retrain periodically with new data
  • Monitor for drift
  • Update when base model improves

Decision framework

Need to add knowledge? → RAG
Specific style/format? → Fine-tuning
Complex reasoning? → Better prompting
All of the above? → Combine techniques

What's next

  • Fine-Tuning vs RAG (deeper comparison)
  • Training Data Preparation
  • Model Selection