TL;DR

Use LLMs to generate synthetic training data: create classification examples, question-answer pairs, or diverse variations of existing data. Useful when real data is limited, expensive to collect, or too sensitive to use directly.

Use cases

Data augmentation: Expand small datasets
Privacy: Generate realistic data without PII
Bootstrapping: Create initial training set
Long-tail coverage: Generate rare examples

Generation strategies

Prompted generation:

  • "Generate 100 customer support queries about password resets"
  • Review and curate outputs
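
The prompted-generation step can be sketched as a prompt builder plus a parsing step. The `llm_client.generate` call below is a hypothetical stand-in for whatever LLM client you use; only `build_prompt` and the line-splitting logic are concrete.

```python
# Sketch of prompted generation. llm_client is a hypothetical stand-in
# for any real LLM API (OpenAI, Anthropic, etc.).

def build_prompt(topic: str, n: int) -> str:
    """Build a generation prompt asking for n distinct examples on a topic."""
    return (
        f"Generate {n} distinct customer support queries about {topic}. "
        "Vary tone, length, and phrasing. Output one query per line."
    )

prompt = build_prompt("password resets", 100)

# response = llm_client.generate(prompt)  # hypothetical API call
# queries = [line.strip() for line in response.splitlines() if line.strip()]
```

Asking for one query per line keeps the parsing trivial; the curation step then happens over `queries`.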

Few-shot expansion:

  • Provide 5 examples
  • Ask the LLM to generate 100 more in the same style
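
A minimal sketch of building the few-shot prompt: embed the seed examples verbatim, then ask for more in the same style. The wording of the instruction is an assumption, not a fixed recipe.

```python
def few_shot_prompt(seed_examples: list[str], n_new: int) -> str:
    """Embed seed examples in a prompt and ask the model for n_new more."""
    numbered = "\n".join(f"{i + 1}. {ex}" for i, ex in enumerate(seed_examples))
    return (
        "Here are example customer support queries:\n"
        f"{numbered}\n\n"
        f"Write {n_new} new queries in the same style, one per line. "
        "Do not repeat the examples."
    )

seeds = [
    "How do I reset my password?",
    "I forgot my login email.",
    "My reset link expired.",
]
prompt = few_shot_prompt(seeds, 100)
```

Numbering the seeds and explicitly forbidding repeats reduces (but does not eliminate) verbatim copying, so deduplication downstream is still worthwhile.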

Variation generation:

  • Take existing examples
  • Generate paraphrases, translations, perturbations
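
Paraphrasing and translation usually go through an LLM, but simple perturbations can be done locally. A sketch of two word-level perturbations (random word dropping and adjacent-word swaps); the drop probability is an arbitrary illustrative choice.

```python
import random

def perturb(text: str, rng: random.Random, p_drop: float = 0.1) -> str:
    """Randomly drop words with probability p_drop (keeps at least one word)."""
    words = text.split()
    kept = [w for w in words if rng.random() > p_drop]
    return " ".join(kept) if kept else words[0]

def swap_adjacent(text: str, rng: random.Random) -> str:
    """Swap one random adjacent word pair to vary word order."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

rng = random.Random(0)  # seeded so runs are reproducible
source = "please help me reset my account password"
variants = [perturb(source, rng) for _ in range(3)]
```

Perturbations like these are label-preserving for most classification tasks, which is what makes them safe augmentation; paraphrases from an LLM need the same quality checks as any other generation.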

Quality control

  • Filter low-quality generations
  • Have humans review a random sample of outputs
  • Validate diversity (avoid repetition)
  • Check for hallucinations
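
The filtering steps above can be sketched as a single pass that drops outputs outside a length window and exact (case-insensitive) duplicates. The word-count thresholds are assumptions to tune per task; semantic deduplication and hallucination checks need stronger tools.

```python
def quality_filter(generations: list[str],
                   min_words: int = 3,
                   max_words: int = 60) -> list[str]:
    """Drop too-short/too-long outputs and exact duplicates (case-insensitive)."""
    seen = set()
    kept = []
    for g in generations:
        g = g.strip()
        key = g.lower()
        if min_words <= len(g.split()) <= max_words and key not in seen:
            seen.add(key)
            kept.append(g)
    return kept

raw = [
    "How do I reset my password?",
    "How do I reset my password?",  # duplicate: dropped
    "hi",                           # too short: dropped
    "Why can't I log in after changing my password?",
]
clean = quality_filter(raw)  # → 2 examples survive
```

Cheap exact-match filtering like this is a sensible first pass before the more expensive embedding-based deduplication described later.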

Combining synthetic + real data

  • A 70% real / 30% synthetic mix is a common starting point
  • Use synthetic for rare classes
  • Validate that synthetic data improves performance
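
A sketch of mixing at a target synthetic fraction: sample just enough synthetic examples to hit the ratio (capped by what's available), then shuffle. The 30% default mirrors the ratio above but is not a prescription.

```python
import random

def mix_datasets(real: list, synthetic: list,
                 synth_frac: float = 0.3, seed: int = 0) -> list:
    """Combine real and synthetic examples so synthetic makes up roughly
    synth_frac of the result, capped by availability, then shuffle."""
    # Solve n_synth / (len(real) + n_synth) = synth_frac for n_synth.
    n_synth = min(len(synthetic),
                  int(len(real) * synth_frac / (1 - synth_frac)))
    rng = random.Random(seed)
    combined = real + rng.sample(synthetic, n_synth)
    rng.shuffle(combined)
    return combined

real = [f"real-{i}" for i in range(70)]
synth = [f"synth-{i}" for i in range(100)]
mixed = mix_datasets(real, synth)  # 70 real + 30 synthetic
```

Upweighting synthetic examples only for rare classes is a variant of the same idea: compute `n_synth` per class instead of globally.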

Tools and techniques

  • GPT-4 / Claude for generation
  • Filtering with smaller models
  • Embedding-based deduplication
  • Active learning to select valuable synthetic examples
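
Embedding-based deduplication can be sketched as a greedy pass over precomputed vectors: keep an example only if it is not too similar to anything already kept. This assumes you have embeddings from some model (e.g. a sentence-transformer); the toy vectors and the 0.9 threshold below are illustrative.

```python
import numpy as np

def dedupe_by_embedding(texts: list[str], embeddings: np.ndarray,
                        threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep a text only if its cosine
    similarity to every already-kept text is below the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_idx: list[int] = []
    for i in range(len(texts)):
        sims = normed[kept_idx] @ normed[i] if kept_idx else np.array([])
        if sims.size == 0 or sims.max() < threshold:
            kept_idx.append(i)
    return [texts[i] for i in kept_idx]

# Toy embeddings: the first two texts point in nearly the same direction.
texts = ["reset my password", "reset my password!", "cancel my subscription"]
embs = np.array([[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]])
unique = dedupe_by_embedding(texts, embs)  # near-duplicate dropped
```

The greedy pass is O(n²) in the worst case; for large synthetic sets, approximate nearest-neighbor indexes do the same job faster.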

Limitations

  • Synthetic data may not capture real distribution
  • Risk of model collapse (training on own outputs)
  • Still need real data for validation