TL;DR

Model compression reduces model size and inference latency through quantization (lower numeric precision), pruning (removing unimportant weights), and distillation (training a smaller model to mimic a larger one). It is essential for edge deployment and for cutting inference costs.

Compression techniques

Quantization: Reduce numeric precision (FP32 → INT8)
Pruning: Remove less important weights
Distillation: Train small model to mimic large model
Low-rank factorization: Decompose weight matrices into smaller factors (see the sketch below)
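
As a concrete illustration of the last item, here is a minimal sketch of low-rank factorization using truncated SVD, replacing one Linear layer with two thinner ones. The layer size (1024x1024) and rank (64) are illustrative, not recommendations.

    import torch

    def low_rank_factorize(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
        """Approximate one Linear layer as two thinner Linear layers via truncated SVD."""
        W = linear.weight.data                 # shape: (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        U_r = U[:, :rank] * S[:rank]           # (out_features, rank), singular values folded in
        V_r = Vh[:rank, :]                     # (rank, in_features)

        first = torch.nn.Linear(linear.in_features, rank, bias=False)
        second = torch.nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
        first.weight.data = V_r
        second.weight.data = U_r
        if linear.bias is not None:
            second.bias.data = linear.bias.data
        return torch.nn.Sequential(first, second)

    # Usage: a 1024x1024 layer (~1.05M weights) becomes two rank-64 factors (~131K weights).
    layer = torch.nn.Linear(1024, 1024)
    compressed = low_rank_factorize(layer, rank=64)

The approximation error depends on how quickly the singular values decay; in practice the factorized model is usually fine-tuned briefly to recover accuracy.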

Quantization

Types:

  • Post-training quantization (PTQ): Quantize an already-trained model, no retraining needed (sketched at the end of this section)
  • Quantization-aware training (QAT): Simulate quantization during training so the model learns to compensate

Precision levels:

  • FP32 (standard): 32-bit floating point
  • FP16/BF16: 16-bit (2x smaller, faster)
  • INT8: 8-bit integer (4x smaller)
  • INT4: 4-bit (8x smaller, quality loss)

Trade-offs:

  • 2-4x speedup typical
  • Minimal accuracy loss at FP16 and INT8 (static INT8 usually needs a small calibration set)
  • More noticeable degradation at INT4, especially for smaller models
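
A minimal post-training dynamic quantization sketch using PyTorch's built-in quantization: Linear weights are stored as INT8 and activations are quantized on the fly at inference time. The toy model and layer sizes are placeholders.

    import torch

    # A small FP32 model standing in for a real network (sizes are illustrative).
    model_fp32 = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    )
    model_fp32.eval()

    # Post-training dynamic quantization: no retraining and no calibration data needed.
    model_int8 = torch.quantization.quantize_dynamic(
        model_fp32, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    print(model_int8(x).shape)   # same interface as the FP32 model, smaller weights

Static PTQ and QAT follow a similar workflow but add a calibration or training step so that activation ranges are learned ahead of time.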

Pruning

Structured: Remove entire neurons, channels, or attention heads
Unstructured: Remove individual weights, leaving a sparse matrix
Magnitude-based: Drop the weights with the smallest absolute values
Iterative: Prune a little, retrain to recover accuracy, repeat

Results:

  • 50-90% of weights can often be removed with little accuracy loss
  • Unstructured sparsity needs sparse-aware kernels or hardware to turn into real speedups; structured pruning speeds up standard hardware directly
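
A sketch of one round of unstructured, magnitude-based pruning with torch.nn.utils.prune; the 30% amount and the toy model are illustrative, and the fine-tuning step is only indicated by a comment.

    import torch
    import torch.nn.utils.prune as prune

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    )

    # Zero out the 30% of weights with the smallest absolute value in each Linear layer.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)

    # ... fine-tune here to recover accuracy, then prune again for iterative pruning ...

    # Make the pruning permanent (folds the mask into the weight tensor).
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.remove(module, "weight")

    sparsity = (model[0].weight == 0).float().mean().item()
    print(f"Layer 0 sparsity: {sparsity:.0%}")

Structured variants (e.g., prune.ln_structured) remove whole rows or channels instead of individual entries, which is the kind of sparsity standard dense hardware can actually exploit.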

Knowledge distillation

Process:

  1. Train a large "teacher" model
  2. Run the teacher over the training data to produce soft targets (logits or probabilities)
  3. Train a small "student" to match those soft targets, usually alongside the original labels (see the loss sketch below)
  4. The student learns to approximate the teacher's behavior at a fraction of the size
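
A sketch of a standard distillation loss (Hinton-style soft targets): KL divergence between temperature-softened teacher and student distributions, mixed with the usual hard-label cross-entropy. The temperature T=2.0 and weight alpha=0.5 are illustrative hyperparameters.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Soft-target KL term (teacher vs. student at temperature T) plus hard-label CE."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                            # rescale to keep gradient magnitudes comparable
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Inside a training step the teacher is frozen and run without gradients:
    #   with torch.no_grad():
    #       teacher_logits = teacher(batch)
    #   loss = distillation_loss(student(batch), teacher_logits, labels)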

Benefits:

  • 10-100x smaller models
  • 90-95% of teacher performance
  • Much faster inference

Examples:

  • DistilBERT: 60% size of BERT, 97% performance
  • Llama 3.2 1B/3B: Pruned and distilled from larger Llama 3.1 models

Practical implementation

Tools:

  • ONNX Runtime (dynamic and static quantization; see the sketch below)
  • PyTorch (built-in quantization and pruning utilities)
  • TensorFlow Lite (quantization for mobile and edge)
  • llama.cpp (GGUF 4-bit/8-bit quantization for Llama-family and other local LLMs)
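
A sketch of dynamic INT8 quantization with ONNX Runtime's quantization utilities; the file names are placeholders for a model you have already exported to ONNX.

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Rewrite the FP32 ONNX graph with INT8 weights (paths are placeholders).
    quantize_dynamic(
        model_input="model_fp32.onnx",
        model_output="model_int8.onnx",
        weight_type=QuantType.QInt8,
    )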

Use cases

  • Mobile/edge deployment
  • Reducing inference costs
  • Real-time applications
  • Low-resource environments

What's next

  • Efficient Inference
  • Edge AI Deployment
  • Model Optimization