Quantization and Distillation Deep Dive
By Marcin Piekarski (builtweb.com.au) · Last Updated: 11 February 2026
TL;DR
Quantization and distillation are two techniques for making AI models smaller and faster without losing much quality. Quantization works like saving a photo as JPEG instead of RAW -- you trade a tiny bit of precision for a massive reduction in size. Distillation works like an expert teacher training a smaller student to perform nearly as well. Together, they make it possible to run powerful AI models on phones, laptops, and affordable servers instead of expensive data center GPUs.
Why it matters
The most capable AI models are enormous. GPT-4 is rumored to have over a trillion parameters. Running models this size requires multiple high-end GPUs that cost tens of thousands of dollars. For a company serving millions of users, the compute costs can reach millions of dollars per month.
This creates a practical problem: the best models are too expensive and too slow for many real-world applications. A customer service chatbot that takes 10 seconds to respond loses customers. A mobile app that requires a constant internet connection to a GPU cluster is unusable offline. A startup that cannot afford $50,000 per month in GPU costs simply cannot use these models.
Quantization and distillation solve this by making models dramatically smaller and faster. A properly quantized model can run 2-4 times faster while using a quarter of the memory. A well-distilled smaller model can achieve 90-95% of the larger model's quality at a fraction of the cost. These are not theoretical gains -- they are what makes it possible to run language models on your phone, in your browser, or on a $10-per-month server.
Quantization: the JPEG compression analogy
The easiest way to understand quantization is to think about image compression. A RAW photo from a high-end camera might be 50 megabytes. Save it as JPEG and it drops to 5 megabytes -- 10 times smaller. If you look carefully, you might notice very slight quality loss, but for most purposes the JPEG is indistinguishable from the original.
Quantization does the same thing for AI models. The "RAW photo" is a model that stores every number (every weight and activation) as a 32-bit or 16-bit floating-point number -- very precise but very large. Quantization converts those numbers to lower precision -- 8-bit integers, 4-bit integers, or even lower -- which dramatically reduces the model's size and speeds up calculations.
Why does lower precision work? Because most of the precision in those 32-bit numbers is unnecessary. The difference between a weight of 0.12345678 and 0.123 almost never matters for the model's output. By rounding to fewer decimal places (effectively), you save enormous amounts of memory and compute with minimal impact on quality.
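The rounding idea above can be sketched in a few lines. This is a toy symmetric INT8 round trip on a made-up weight matrix -- an illustration of the principle, not any particular library's implementation:

```python
import numpy as np

# Hypothetical weight matrix, standing in for one layer of a model.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.1, size=(4, 4)).astype(np.float32)

# Symmetric INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per value
dequantized = quantized.astype(np.float32) * scale      # what the model "sees"

max_error = np.abs(weights - dequantized).max()
print(f"max rounding error: {max_error:.6f} (step size: {scale:.6f})")
# The error is bounded by half a quantization step: scale / 2.
```

Each weight shrinks from 4 bytes to 1, and the worst-case error is half the step size -- tiny compared to the weights themselves, which is exactly why quality barely moves.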
Types of quantization
Different quantization approaches offer different tradeoffs between ease of implementation, quality retention, and compression ratio.
INT8 quantization converts 32-bit or 16-bit floating-point numbers to 8-bit integers. This is the most conservative and widely supported approach. It cuts model size roughly in half (compared to FP16) and speeds up inference significantly on hardware that supports INT8 operations, which includes most modern GPUs and CPUs. Quality loss is typically less than 1% on standard benchmarks. This is the safe, default choice for most production deployments.
INT4 quantization goes further, using only 4 bits per number. This cuts the model size to roughly a quarter of its FP16 size. The quality tradeoff is larger -- expect 1-3% degradation on benchmarks, with some tasks affected more than others. INT4 is what makes it possible to run 7-billion-parameter models on consumer GPUs and even smartphones. It is the sweet spot for local deployment of medium-sized models.
GPTQ is a popular post-training quantization technique designed specifically for large language models. It works by quantizing the model one layer at a time, using a small calibration dataset to minimize the quality loss at each step. GPTQ models can be quantized to 4-bit or even 3-bit precision and are widely available on platforms like Hugging Face. If you have downloaded a model file with "GPTQ" in the name, this is what it used.
AWQ (Activation-Aware Weight Quantization) takes a smarter approach by recognizing that some weights matter more than others. Instead of treating all weights equally, AWQ identifies the most important weights (based on activation patterns) and preserves them at higher precision while aggressively quantizing less important weights. This often achieves better quality than GPTQ at the same compression ratio.
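A toy sketch of the activation-aware idea as described above: rank input channels by how large their activations are on calibration data, then exempt the most important columns from aggressive rounding. (The real AWQ algorithm is more sophisticated -- it rescales salient channels rather than simply skipping them -- so treat this as an illustration of the intuition only; all names and sizes here are made up.)

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(0, 0.1, size=(8, 16)).astype(np.float32)        # [out, in]
activations = rng.normal(0, 1.0, size=(32, 16)).astype(np.float32)   # calibration batch

# Rank input channels by mean activation magnitude -- channels carrying
# large activations amplify any weight rounding error.
channel_importance = np.abs(activations).mean(axis=0)
keep = np.argsort(channel_importance)[-2:]   # keep the top 2 channels full precision

# Quantize everything to INT4 (levels -8..7), then restore the salient columns.
scale = np.abs(weights).max() / 7.0
w_int4 = np.clip(np.round(weights / scale), -8, 7) * scale
w_mixed = w_int4.copy()
w_mixed[:, keep] = weights[:, keep]          # salient channels stay full precision

full_err = np.abs(weights - w_int4)
mixed_err = np.abs(weights - w_mixed)
print(f"mean |error|: all-INT4 {full_err.mean():.5f}, mixed {mixed_err.mean():.5f}")
```

Protecting even a small fraction of weights cuts the total error, which is the core insight behind activation-aware methods.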
The GGUF format and llama.cpp quantization deserve special mention because they are what most people encounter when running models locally. The GGUF format (used by llama.cpp and Ollama) supports various quantization levels labeled Q2 through Q8. Q4_K_M is a popular default that balances quality and speed for local use. These quantized models are what make it possible to run Llama, Mistral, and other open models on a laptop.
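A back-of-envelope way to see what these levels mean in practice is to estimate file size from bits per weight. The figures below are rough assumptions (K-quants mix block sizes, so effective rates vary slightly by model), but they are close enough for capacity planning:

```python
# Approximate bits per weight for common GGUF quantization levels.
# These are rough ballpark figures, not exact format specifications.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8, "Q2_K": 2.6}

def estimate_gb(n_params: float, quant: str) -> float:
    """Approximate size in GB of the weights alone (excludes KV cache etc.)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B model at {quant}: ~{estimate_gb(7e9, quant):.1f} GB")
```

This is why a 7B model that needs ~14 GB at FP16 fits comfortably in the memory of an ordinary laptop at Q4_K_M.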
Knowledge distillation: teaching a small model to mimic a large one
Quantization makes the same model smaller. Distillation creates an entirely new, smaller model that behaves like the large one.
Think of it this way: a master chess player has decades of experience and deep understanding of the game. A talented student cannot gain those decades of experience overnight, but they can learn to play almost as well by studying the master's moves and reasoning. The student will not understand everything the master understands, but they can reproduce most of the master's decisions.
In AI, the "master" is a large, capable model (called the teacher), and the "student" is a smaller, faster model. The distillation process works by running inputs through the large teacher model and recording not just the final answers but the teacher's confidence distribution across all possible answers. The small student model is then trained to match these distributions rather than just the correct answers. This is key -- the teacher's distribution contains rich information about relationships between answers. If the teacher says a sentiment is "90% positive, 8% neutral, 2% negative," that tells the student much more than just "positive."
Why does distillation work so well? Because the student learns from the teacher's "soft" knowledge -- the nuanced probability distributions -- rather than just hard labels. A student model trained via distillation typically outperforms an identical model trained directly on the same data with hard labels, often by a significant margin.
Real-world examples of distillation are everywhere. DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language understanding. Many of the smaller models from AI providers are distilled from their larger siblings.
Practical implementation
If you want to use these techniques, here is how to approach them practically.
For quantization of an existing model, the simplest path is to use pre-quantized models. Platforms like Hugging Face host thousands of models already quantized in GPTQ, AWQ, and GGUF formats. Download one that matches your quality and speed requirements and you are done.
If you need to quantize a model yourself, tools like AutoGPTQ, AutoAWQ, and llama.cpp's conversion scripts handle the process. You provide the original model and a small calibration dataset (typically 128-512 examples representative of your use case), and the tool produces a quantized version. Always evaluate the quantized model on your specific tasks -- aggregate benchmarks can be misleading about performance on your particular use case.
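To make the calibrate-then-evaluate loop concrete, here is a toy stand-in for what those tools do. It is not the actual GPTQ algorithm (which also updates remaining weights to compensate for rounding error); it just shows the key measurement: judging each bit width by the error it induces on a layer's *outputs* for calibration data, rather than by raw weight error.

```python
import numpy as np

rng = np.random.default_rng(2)
weights = rng.normal(0, 0.1, size=(16, 64)).astype(np.float32)       # one "layer"
calibration = rng.normal(0, 1.0, size=(128, 64)).astype(np.float32)  # ~128 examples

def quantize(w, bits):
    """Symmetric round-to-nearest quantization at the given bit width."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.clip(np.round(w / scale), -levels - 1, levels) * scale

# Compare each bit width by the relative error on the layer's outputs
# for the calibration batch.
reference = calibration @ weights.T
errors = {}
for bits in (8, 4, 2):
    out = calibration @ quantize(weights, bits).T
    errors[bits] = np.linalg.norm(out - reference) / np.linalg.norm(reference)
    print(f"INT{bits}: relative output error {errors[bits]:.4f}")
```

Running this shows the pattern the article describes: INT8 error is negligible, INT4 is small, and very low bit widths degrade sharply -- which is exactly what you should verify on your own tasks before deploying.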
For distillation, the process is more involved. You need to run your target inputs through the teacher model to generate the training signal, then train the student model on those outputs. Frameworks like Hugging Face Transformers provide distillation utilities, and there are specific libraries like TextBrewer for text model distillation. The quality of your distillation dataset matters enormously -- it should represent the actual tasks and inputs the student model will encounter.
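The two-step process above -- generate the teacher's soft labels, then train the student to match them -- can be sketched end to end on a toy problem. Everything here is synthetic (the "teacher" is just a fixed softmax model, and the student has the same shape for simplicity), so this illustrates the workflow rather than any specific framework's API:

```python
import numpy as np

rng = np.random.default_rng(3)

# A fixed "teacher" model over 3 classes, standing in for the big model.
X = rng.normal(size=(200, 5)).astype(np.float32)
W_teacher = rng.normal(size=(3, 5)).astype(np.float32)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 1: run inputs through the teacher and record its distributions.
soft_labels = softmax(X @ W_teacher.T)

# Step 2: train the student by gradient descent on cross-entropy
# against those soft labels (not hard class labels).
W_student = np.zeros_like(W_teacher)
for step in range(500):
    probs = softmax(X @ W_student.T)
    grad = (probs - soft_labels).T @ X / len(X)   # cross-entropy gradient
    W_student -= 0.5 * grad

final_ce = -np.mean(np.sum(soft_labels * np.log(softmax(X @ W_student.T)), axis=1))
floor_ce = -np.mean(np.sum(soft_labels * np.log(soft_labels), axis=1))  # entropy floor
print(f"student cross-entropy {final_ce:.4f} vs. best possible {floor_ce:.4f}")
```

The student's loss approaches the entropy of the teacher's own distributions -- the theoretical floor -- showing it has learned to reproduce the teacher's behavior on this data.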
When to use each technique
Use quantization when you want a quick win with minimal effort. You have a model that works well but is too large or too slow. You want to deploy on consumer hardware or reduce serving costs. Pre-quantized models are often available so you do not need to do any work yourself.
Use distillation when you need a fundamentally smaller architecture (not just compressed weights). You have specific tasks where a specialized small model can match a generalist large model. You are willing to invest more effort for potentially larger gains in speed and cost.
Combine both when you need maximum compression. Distill a large model into a smaller one, then quantize the smaller model. This stacks the benefits: a distilled model that is already 5 times smaller can be quantized to become 20 times smaller than the original while retaining most of its capability on your target tasks.
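The stacking arithmetic is worth making explicit. Using the illustrative numbers above (a 5x distillation plus FP16-to-INT4 quantization, with a hypothetical 140 GB starting size):

```python
# Stacking the two techniques, using the article's illustrative ratios.
original_gb = 140.0                 # hypothetical large model in FP16
distilled_gb = original_gb / 5      # distilled student: 5x smaller
quantized_gb = distilled_gb / 4     # INT4: 4 bits instead of 16 = 4x smaller

print(f"{original_gb:.0f} GB -> {distilled_gb:.0f} GB -> {quantized_gb:.0f} GB "
      f"({original_gb / quantized_gb:.0f}x smaller overall)")
```

The ratios multiply: 5x from distillation times 4x from quantization gives the 20x figure.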
Real-world impact
The practical impact of these techniques is enormous.
Running LLMs on phones: Apple's on-device AI features and Google's on-device Gemini Nano use aggressively quantized and distilled models. A model that requires 32GB of GPU memory at full precision can run in 4GB of phone memory after INT4 quantization.
Affordable AI hosting: A company that would spend $10,000 per month serving a full-precision model can spend $2,000 per month (or less) serving a quantized version with negligible quality difference for their use case.
Offline and edge deployment: Quantized models power AI features in cars, IoT devices, and applications that need to work without internet access. This was impossible with full-precision models.
Faster response times: Smaller models generate tokens faster. A quantized model might respond in 500 milliseconds where the full-precision version took 2 seconds. For interactive applications, this difference is transformative.
Common mistakes
Assuming quantization always works. Some tasks are more sensitive to quantization than others. Math reasoning and coding tend to degrade more than general conversation. Always test on your specific use case.
Using the wrong quantization level. More compression is not always better. Q2 quantization saves more memory than Q4 but the quality drop may be unacceptable. Start with Q4 or INT8 and only go lower if you must.
Skipping evaluation after quantization. Benchmark numbers tell you the average impact, but your specific use case might be affected more or less than that average. Always evaluate on your own tasks before deploying.
Distilling without enough diverse training data. The student model can only learn what the teacher demonstrates. If your distillation dataset is too narrow, the student will fail on inputs outside that narrow range. Use a diverse, representative dataset.
Ignoring hardware compatibility. Not all hardware supports all quantization formats efficiently. INT8 is widely supported, but INT4 performance varies significantly across GPU and CPU architectures. Test on your actual deployment hardware.
What's next?
Explore more about making AI efficient and production-ready:
- Efficient Inference Optimization -- Broader techniques for making AI faster
- Cost and Latency -- Understanding the tradeoffs between quality, speed, and cost
- Deployment Patterns -- How to deploy optimized models in production
- AI Model Architectures -- Understanding the models you are compressing
Frequently Asked Questions
Can I quantize any AI model?
Most modern language models can be quantized, but results vary. Larger models (7B+ parameters) generally tolerate quantization better than smaller ones because they have more redundancy. Very small models may lose too much quality when quantized aggressively. Always test on your specific tasks before committing to a quantization level.
What is the difference between GPTQ and GGUF quantized models?
GPTQ is a quantization method designed for GPU inference -- it produces models optimized for running on NVIDIA GPUs. GGUF is a file format used by llama.cpp that supports CPU inference and mixed CPU/GPU inference. If you are running on a GPU server, GPTQ or AWQ models are typically faster. If you are running on a laptop or CPU-only setup, GGUF models are the standard choice.
How much quality do I actually lose with quantization?
With INT8 quantization, typically less than 1% on standard benchmarks -- most users cannot tell the difference. With INT4, expect 1-3% degradation on average, though some tasks like complex math may be affected more. With INT2 or INT3, quality drops become noticeable. The practical impact depends heavily on your specific use case, so always measure on your own tasks.
Is distillation the same as fine-tuning?
No, though they both involve training a model. Fine-tuning adjusts an existing model's weights using new data to improve performance on specific tasks. Distillation trains a smaller model to mimic a larger model's behavior. You can combine both: distill a large model into a smaller one, then fine-tune the smaller model for your specific use case.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Quantization
A compression technique that reduces AI model size and memory usage by using lower-precision numbers, making models faster and cheaper to run.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
Training
The process of feeding large amounts of data to an AI system so it learns patterns, relationships, and rules, enabling it to make predictions or generate output.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Related Guides
- Model Compression: Smaller, Faster AI (Advanced · 9 min read) -- Compress AI models with quantization, pruning, and distillation. Deploy faster, cheaper models without sacrificing much accuracy.
- Advanced RAG Techniques (Advanced · 9 min read) -- Go beyond basic RAG: hybrid search, reranking, query expansion, HyDE, and multi-hop retrieval for better context quality.
- Distributed Training for Large Models (Advanced · 8 min read) -- Scale AI training across multiple GPUs and machines. Learn data parallelism, model parallelism, and pipeline parallelism strategies.