Model Compression: Smaller, Faster AI
By Marcin Piekarski builtweb.com.au · Last Updated: 11 February 2026
TL;DR
Model compression makes AI models smaller, faster, and cheaper to run. The three main techniques are quantization (reducing numeric precision), pruning (removing unnecessary weights), and knowledge distillation (training a smaller model to mimic a larger one). These methods can shrink models by 2-100x while retaining 90-99% of the original performance, making deployment on phones, edge devices, and cost-sensitive servers practical.
Why it matters
The best AI models are massive. GPT-4 is rumoured to have over a trillion parameters. Llama 2 70B requires around 140 GB of memory just for its weights at 16-bit precision, and double that in FP32. Running these models requires expensive GPU servers that cost thousands of dollars per month.
But not every application needs the full-sized model. A customer support chatbot, a text classifier, or a code completion tool can often work brilliantly with a model that is 10x or even 100x smaller than the original. Model compression is what makes this possible.
For anyone deploying AI in the real world, compression is not an academic exercise. It directly determines whether your application is financially viable, whether it can run on mobile devices, and whether it responds fast enough for users to actually enjoy using it. The ability to compress models effectively is becoming one of the most valuable skills in practical AI engineering.
Quantization: doing more with less precision
Every number in a neural network is stored as a floating-point value. By default, this is 32-bit floating point (FP32), meaning each number takes 32 bits of memory. Quantization reduces that precision.
FP16 and BF16 (16-bit) cut memory usage in half with almost no quality loss. Most modern GPUs are actually optimised for 16-bit computation, so this often speeds up inference as well. If you are not already running your models in FP16, this is the easiest win available.
INT8 (8-bit integer) reduces memory by 4x compared to FP32. This is where the trade-off gets interesting. Most models handle INT8 well, with accuracy dropping by only 1-2% on typical benchmarks. For many applications, that small drop is well worth the 4x memory savings and significant speed improvement.
INT4 (4-bit integer) offers 8x compression but comes with more noticeable quality loss. The model may generate less fluent text, miss nuanced questions, or make more factual errors. However, for some use cases — especially when combined with other techniques — INT4 is viable and makes it possible to run large language models on consumer hardware.
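The memory arithmetic behind these precision tiers is simple enough to sketch. The figures below count weights only, ignoring KV cache and activation overhead, and use 1 GB = 10⁹ bytes:

```python
# Rough memory footprint of model weights at different precisions.
# Weights only -- real deployments also need memory for the KV cache
# and activations, so treat these as lower bounds.

def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """bytes = params * bits / 8, reported in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7_000_000_000  # a 7B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(n, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

The same arithmetic explains the laptop numbers later in this guide: a 7B model at INT4 is about 3.5 GB of weights, which fits comfortably in 16 GB of RAM.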
There are two main approaches to quantization:
Post-training quantization (PTQ) converts an already-trained model to lower precision. It is fast and easy — you just run a conversion script. The downside is that the model has no chance to adapt to its new precision, so quality loss can be higher.
Quantization-aware training (QAT) simulates lower precision during the training process itself. The model learns to work with reduced precision, resulting in better quality at the same bit width. The trade-off is that it requires retraining, which is expensive and time-consuming.
For most practitioners, PTQ with INT8 is the sweet spot — easy to implement with minimal quality loss. If you need more compression, try INT4 with a method like GPTQ or AWQ, which are specifically designed to preserve quality at very low precision.
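To make the PTQ idea concrete, here is a minimal sketch of symmetric INT8 quantization on a handful of weights. Production libraries add calibration data, per-channel scales, and zero-points; this shows only the core round-trip and why the error stays bounded:

```python
# Minimal sketch of symmetric post-training quantization to INT8.
# One scale for the whole tensor; the largest weight maps to +/-127.

def quantize_int8(weights):
    """Map floats to signed 8-bit integers with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.51, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # [82, -127, 0, 51, -90]
print(max_err)  # rounding error, bounded by scale / 2
```

Note that the tiny weight 0.003 collapses to 0, a preview of why smarter methods like GPTQ and AWQ matter at even lower bit widths.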
Pruning: cutting the dead weight
Not all parameters in a neural network contribute equally to its outputs. Pruning identifies and removes the least important ones, making the model smaller and potentially faster.
Unstructured pruning removes individual weights (setting them to zero). You might remove 50-90% of all weights in a model. The catch is that the resulting "sparse" model has an irregular structure that standard hardware does not accelerate well. You need specialised software or hardware to see speed improvements, though you always get memory savings.
Structured pruning removes entire neurons, channels, or attention heads. This produces models with a regular structure that standard hardware can accelerate immediately. The compression ratio is usually less dramatic than unstructured pruning (10-50% of weights removed), but the speed improvements are real and immediate.
Magnitude-based pruning is the simplest approach: remove the weights with the smallest absolute values. The assumption is that small weights contribute less to the output. Despite its simplicity, this works surprisingly well for moderate compression levels.
Iterative pruning alternates between pruning and retraining. Remove 10% of weights, retrain briefly to recover accuracy, remove another 10%, retrain again, and so on. This gradual approach achieves better compression than removing everything at once because the model has opportunities to adapt.
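Magnitude-based pruning is simple enough to sketch in a few lines. This toy version zeroes the smallest-magnitude fraction of a weight list; real frameworks operate per layer and keep a mask so pruned weights stay at zero during retraining:

```python
# Minimal magnitude-based pruning sketch: zero out the fraction of
# weights with the smallest absolute values.

def magnitude_prune(weights, sparsity):
    """Return weights with the smallest-|w| fraction set to 0.0."""
    n_prune = int(len(weights) * sparsity)
    # indices of the n_prune smallest-magnitude weights
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002, 0.3, -0.08]
print(magnitude_prune(w, 0.5))  # half the weights zeroed:
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.3, 0.0]
```

An iterative schedule would call this repeatedly with a small sparsity step, retraining briefly between calls.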
In practice, pruning is most effective when combined with quantization. Prune the model to remove unnecessary weights, then quantize the remaining ones for additional compression.
Knowledge distillation: learning from a teacher
Distillation takes a completely different approach. Instead of compressing the original model directly, you train a new, smaller model to mimic the original's behaviour.
The process works like this:
Run the large "teacher" model on a dataset and record its outputs — not just the final answers, but the full probability distributions (called "soft labels"). These soft labels contain richer information than simple right/wrong labels. When the teacher says "this is 70% a cat and 25% a dog," that 25% for dog is useful information that hard labels lose.
Train a small "student" model to match the teacher's soft labels. The student learns not just what the right answer is, but what the teacher thinks the plausible alternatives are. This transfers knowledge that would take much more data and time to learn from scratch.
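The two steps above can be sketched numerically for a single example. The temperature value, logits, and class names here are illustrative assumptions; real distillation also mixes in a loss against the true labels:

```python
# Sketch of the distillation target on one example: the student is
# trained to match the teacher's softened probability distribution.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature = softer."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(teacher_probs, student_probs):
    """The loss the student minimizes to mimic the teacher's soft labels."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher_logits = [4.0, 2.9, 0.1]  # e.g. cat, dog, car
student_logits = [3.5, 3.0, 0.5]

T = 2.0  # temperature > 1 softens the distribution, exposing the
         # teacher's view of plausible alternatives
soft_labels = softmax(teacher_logits, T)
loss = cross_entropy(soft_labels, softmax(student_logits, T))
print([round(p, 2) for p in soft_labels])  # [0.58, 0.34, 0.08]
```

Notice that the soft labels keep meaningful mass on "dog" (about a third), exactly the information a hard "cat" label would throw away.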
The student ends up dramatically smaller but retains most of the teacher's capability. DistilBERT, for example, is 60% the size of BERT but retains 97% of its language understanding performance. DistilGPT-2 applies the same recipe to a generative model.
Distillation can produce models that are 10-100x smaller than the teacher while maintaining 90-97% of the performance. The student model runs faster, uses less memory, and costs less to serve — all while being much better than a small model trained from scratch.
Combining techniques for maximum compression
The most effective compression strategies combine multiple techniques:
Distillation + quantization is the most common combination. Distill a large model into a smaller architecture, then quantize the student to INT8 or INT4. This can easily achieve 50-200x total compression.
Pruning + quantization removes unnecessary weights first, then reduces the precision of the remaining ones. This is particularly effective because pruning removes the weights that would have introduced the most error during quantization.
Distillation + pruning + quantization is the full stack. Start with distillation to get a smaller architecture, prune to remove redundant weights, and quantize for final compression. This approach requires more effort but can produce models that are hundreds of times smaller than the original.
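A toy end-to-end sketch of the prune-then-quantize stage, combining the two techniques on a single weight list. The sparsity level is illustrative, and real pipelines work per layer and usually retrain between stages:

```python
# Toy prune-then-quantize pipeline on one weight list.

def prune_then_quantize(weights, sparsity=0.5):
    # 1) Magnitude pruning: zero the smallest-|w| fraction.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:int(len(weights) * sparsity)])
    pruned = [0.0 if i in drop else w for i, w in enumerate(weights)]
    # 2) Symmetric INT8 quantization of what survives. The outliers
    #    that would have stretched the scale are already gone.
    scale = max(abs(w) for w in pruned) / 127
    return [round(w / scale) for w in pruned], scale

q, scale = prune_then_quantize([0.9, -0.05, 0.4, 0.01, -0.7, 0.002])
print(q)  # zeros where pruned, small integers elsewhere
```

The ordering matters: pruning first means the quantizer only has to cover the range of the weights that actually survive.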
Tools and frameworks
Several mature tools make compression accessible:
llama.cpp is the go-to tool for running quantized language models. It supports GGUF format models at INT4, INT5, and INT8 precision and runs on CPUs, making it possible to run 7B and even 13B parameter models on a laptop.
ONNX Runtime provides cross-platform model optimisation including quantization. It is widely used in production for deploying models efficiently across different hardware.
PyTorch includes built-in quantization support through torch.quantization. For pruning, torch.nn.utils.prune provides common pruning methods. Hugging Face's Optimum library adds additional compression tools.
TensorFlow Lite is designed for mobile and edge deployment, with built-in quantization and optimisation passes.
GPTQ and AWQ are modern quantization methods specifically designed for large language models, producing high-quality INT4 quantizations that preserve generation quality better than naive approaches.
Choosing the right approach
The right compression technique depends on your constraints:
If you need the simplest solution: Start with FP16 quantization. It is practically free in terms of quality and requires no special tooling.
If you need to run on a phone or edge device: Use knowledge distillation to create a purpose-built small model, then quantize it to INT8.
If you need to run large language models on consumer hardware: Use GPTQ or AWQ quantization at INT4 with llama.cpp or a compatible runtime.
If you need production-grade efficiency at scale: Combine pruning and quantization on an already-efficient architecture, then deploy with ONNX Runtime or TensorRT.
Common mistakes
Compressing before establishing a quality baseline. Always measure the full-precision model's performance on your specific task first. Without a baseline, you cannot know what compression is costing you.
Using INT4 quantization without testing on your actual use case. Benchmark results do not always reflect real-world performance. A model that scores well on a standard benchmark might fail on your specific domain. Always test with your own data.
Ignoring hardware-software alignment. Unstructured pruning looks great on paper but delivers no speed improvement on standard GPUs. Match your compression technique to your deployment hardware.
Compressing the wrong model. If a smaller model (like GPT-3.5 instead of GPT-4) already meets your quality requirements, use it directly instead of compressing a larger model. The simplest solution often wins.
Skipping the distillation step when training data is available. If you can afford to generate labels from a teacher model, distillation almost always produces better results than directly quantizing a large model.
What's next?
Continue building your knowledge of efficient AI with these related guides:
- AI Latency Optimization for making AI responses faster
- AI Cost Management for strategies to reduce your AI spend
- AI Model Architectures for understanding the models you are compressing
- Efficient Inference Optimization for the full picture of production efficiency
Frequently Asked Questions
Can I run large language models on my laptop using compression?
Yes. With INT4 quantization via llama.cpp, you can run models up to about 13 billion parameters on a modern laptop with 16 GB of RAM. A 7B parameter model quantized to INT4 requires about 4 GB of memory and runs at a usable speed on most recent laptops. Models larger than 13B generally require a GPU or more RAM.
How much quality do I lose with INT8 quantization?
Typically 1-3% on standard benchmarks, which is imperceptible for most practical applications. The quality loss manifests as slightly less fluent text generation or very occasional factual errors. For classification, summarisation, and most generation tasks, INT8 models are effectively indistinguishable from their FP32 counterparts in real-world use.
What is the difference between GPTQ, AWQ, and standard quantization?
Standard quantization applies the same precision reduction uniformly across all weights. GPTQ and AWQ are 'smart' quantization methods that analyse which weights are most important and allocate precision accordingly. Critical weights keep higher precision while less important ones get lower precision. This produces much better quality at INT4 and INT3 levels, where standard quantization often fails.
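One way to see why finer-grained precision allocation helps is to quantize a weight list containing a single outlier. This is a simplified stand-in for what GPTQ and AWQ do (the real methods also use calibration data and weight-importance analysis), but it shows the core failure mode of one-scale-fits-all quantization at 4 bits:

```python
# Why finer-grained scaling helps at low bit widths: a single outlier
# stretches a shared scale until small weights collapse to zero.

def quantize(weights, bits):
    """Symmetric quantize-dequantize with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def max_error(orig, deq):
    return max(abs(a - b) for a, b in zip(orig, deq))

w = [0.010, 0.020, -0.015, 0.012, 5.0]  # one outlier dominates

# One scale for everything: every small weight rounds to zero.
coarse = quantize(w, bits=4)

# Separate scales for the small group and the outlier group.
fine = quantize(w[:4], bits=4) + quantize(w[4:], bits=4)

print(max_error(w, coarse))  # ~0.02: the small weights are wiped out
print(max_error(w, fine))    # roughly an order of magnitude smaller
```

Grouping weights so that outliers get their own scale is the same intuition behind the group-wise quantization formats used by llama.cpp and friends.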
Is it worth training a distilled model if I can just use a smaller pre-trained model?
It depends on whether an existing small model meets your needs. If a general-purpose small model works for your task, use it — it is simpler and cheaper. But if you need a small model that is specifically good at your domain or task, distillation from a larger teacher will produce significantly better results than using a general-purpose small model, because the student inherits task-specific knowledge from the teacher.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
Quantization
A compression technique that reduces AI model size and memory usage by using lower-precision numbers, making models faster and cheaper to run.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Related Guides
- Quantization and Distillation Deep Dive (Advanced, 8 min read): Master advanced model compression: quantization-aware training, mixed precision, and distillation strategies for production deployment.
- Advanced RAG Techniques (Advanced, 9 min read): Go beyond basic RAG: hybrid search, reranking, query expansion, HyDE, and multi-hop retrieval for better context quality.
- Distributed Training for Large Models (Advanced, 8 min read): Scale AI training across multiple GPUs and machines. Learn data parallelism, model parallelism, and pipeline parallelism strategies.