TL;DR

Quantization and distillation are two techniques for making AI models smaller and faster without losing much quality. Quantization works like saving a photo as JPEG instead of RAW -- you trade a tiny bit of precision for a massive reduction in size. Distillation works like an expert teacher training a smaller student to perform nearly as well. Together, they make it possible to run powerful AI models on phones, laptops, and affordable servers instead of expensive data center GPUs.

Why it matters

The most capable AI models are enormous. GPT-4 is rumored to have over a trillion parameters. Running models this size requires multiple high-end GPUs that cost tens of thousands of dollars. For a company serving millions of users, the compute costs can reach millions of dollars per month.

This creates a practical problem: the best models are too expensive and too slow for many real-world applications. A customer service chatbot that takes 10 seconds to respond loses customers. A mobile app that requires a constant internet connection to a GPU cluster is unusable offline. A startup that cannot afford $50,000 per month in GPU costs simply cannot use these models.

Quantization and distillation solve this by making models dramatically smaller and faster. A properly quantized model can run 2-4 times faster while using a quarter of the memory. A well-distilled smaller model can achieve 90-95% of the larger model's quality at a fraction of the cost. These are not theoretical gains -- they are what makes it possible to run language models on your phone, in your browser, or on a $10-per-month server.

Quantization: the JPEG compression analogy

The easiest way to understand quantization is to think about image compression. A RAW photo from a high-end camera might be 50 megabytes. Save it as JPEG and it drops to 5 megabytes -- 10 times smaller. If you look carefully, you might notice very slight quality loss, but for most purposes the JPEG is indistinguishable from the original.

Quantization does the same thing for AI models. The "RAW photo" is a model that stores every number (every weight and activation) as a 32-bit or 16-bit floating-point number -- very precise but very large. Quantization converts those numbers to lower precision -- 8-bit integers, 4-bit integers, or even lower -- which dramatically reduces the model's size and speeds up calculations.

Why does lower precision work? Because most of the precision in those 32-bit numbers is unnecessary. The difference between a weight of 0.12345678 and 0.123 almost never matters for the model's output. By rounding to fewer decimal places (effectively), you save enormous amounts of memory and compute with minimal impact on quality.
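To make the rounding concrete, here is a toy sketch of symmetric 8-bit quantization. Real libraries quantize per-channel or per-group with calibrated scales; this just shows the round-trip and how small the error is. The weight values are made up.

```python
# Toy sketch: symmetric 8-bit quantization of a list of weights.
# One scale maps the whole range onto the integers [-127, 127].

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with a single scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

weights = [0.12345678, -0.98, 0.5, 0.003, -0.3071]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now fits in 1 byte instead of 4, and the worst-case
# rounding error is at most half the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each quantized weight occupies a quarter of the space of a 32-bit float, and the reconstruction error is bounded by half the scale step, which is exactly the "rounding to fewer decimal places" described above.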

Types of quantization

Different quantization approaches offer different tradeoffs between ease of implementation, quality retention, and compression ratio.

INT8 quantization converts 32-bit or 16-bit floating-point numbers to 8-bit integers. This is the most conservative and widely supported approach. It cuts model size roughly in half (compared to FP16) and speeds up inference significantly on hardware that supports INT8 operations, which includes most modern GPUs and CPUs. Quality loss is typically less than 1% on standard benchmarks. This is the safe, default choice for most production deployments.

INT4 quantization goes further, using only 4 bits per number. This cuts the model size to roughly a quarter of its FP16 size. The quality tradeoff is larger -- expect 1-3% degradation on benchmarks, with some tasks affected more than others. INT4 is what makes it possible to run 7-billion-parameter models on consumer GPUs and even smartphones. It is the sweet spot for local deployment of medium-sized models.
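The memory arithmetic behind those claims is simple enough to verify. This back-of-envelope calculation counts weights only and ignores the KV cache, activations, and runtime overhead:

```python
# Back-of-envelope weight memory for a 7-billion-parameter model.
# Weights only -- ignores the KV cache, activations, and overhead.
params = 7_000_000_000

def size_gb(bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

fp16 = size_gb(16)   # 14 GB: needs a data-center GPU
int8 = size_gb(8)    # 7 GB: fits a consumer GPU
int4 = size_gb(4)    # 3.5 GB: fits a laptop or a phone
```

This is why INT4 is the threshold at which 7B models become practical on consumer hardware: 3.5 GB of weights fits comfortably in an 8 GB GPU or a modern phone, while the FP16 version does not.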

GPTQ (GPT Quantization) is a popular technique specifically designed for large language models. It works by quantizing the model one layer at a time, using a small calibration dataset to minimize the quality loss at each step. GPTQ models can be quantized to 4-bit or even 3-bit precision and are widely available on platforms like Hugging Face. If you have downloaded a model file with "GPTQ" in the name, this is what it used.

AWQ (Activation-Aware Weight Quantization) takes a smarter approach by recognizing that some weights matter more than others. Instead of treating all weights equally, AWQ identifies the most important weights (based on activation patterns) and preserves them at higher precision while aggressively quantizing less important weights. This often achieves better quality than GPTQ at the same compression ratio.
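The intuition behind "some weights matter more than others" can be sketched in a few lines. This is a toy illustration, not the real AWQ algorithm (which rescales salient channels before quantizing rather than storing them separately); the importance scores here are a stand-in for the activation statistics a calibration run would produce.

```python
# Toy illustration of the idea behind activation-aware quantization:
# keep the weights with the largest importance scores in full
# precision and quantize the rest coarsely. Real AWQ instead scales
# salient channels before quantizing; this is just the intuition.

def mixed_precision(weights, importance, keep_fraction=0.1):
    k = max(1, int(len(weights) * keep_fraction))
    # Indices of the k most "important" weights (e.g. those that
    # see large activations on a calibration dataset).
    keep = set(sorted(range(len(weights)),
                      key=lambda i: importance[i], reverse=True)[:k])
    scale = max(abs(w) for w in weights) / 7  # 4-bit range: [-7, 7]
    return [w if i in keep else round(w / scale) * scale
            for i, w in enumerate(weights)]

weights = [0.9, -0.01, 0.4, 0.02, -0.7, 0.05]
importance = [5.0, 0.1, 0.3, 0.2, 4.0, 0.1]
out = mixed_precision(weights, importance, keep_fraction=0.34)
# The two highest-importance weights (0.9 and -0.7) survive exactly.
```

The design point is the same one AWQ exploits: protecting a small fraction of salient weights costs almost no memory but recovers most of the quality lost to aggressive quantization.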

The GGUF format and llama.cpp quantization deserve special mention because they are what most people encounter when running models locally. The GGUF format (used by llama.cpp and Ollama) supports various quantization levels labeled Q2 through Q8. Q4_K_M is a popular default that balances quality and speed for local use. These quantized models are what make it possible to run Llama, Mistral, and other open models on a laptop.

Knowledge distillation: teaching a small model to mimic a large one

Quantization makes the same model smaller. Distillation creates an entirely new, smaller model that behaves like the large one.

Think of it this way: a master chess player has decades of experience and deep understanding of the game. A talented student cannot gain those decades of experience overnight, but they can learn to play almost as well by studying the master's moves and reasoning. The student will not understand everything the master understands, but they can reproduce most of the master's decisions.

In AI, the "master" is a large, capable model (called the teacher), and the "student" is a smaller, faster model. The distillation process works by running inputs through the large teacher model and recording not just the final answers but the teacher's confidence distribution across all possible answers. The small student model is then trained to match these distributions rather than just the correct answers. This is key -- the teacher's distribution contains rich information about relationships between answers. If the teacher says a sentiment is "90% positive, 8% neutral, 2% negative," that tells the student much more than just "positive."

Why does distillation work so well? Because the student learns from the teacher's "soft" knowledge -- the nuanced probability distributions -- rather than just hard labels. A student model trained via distillation typically outperforms an identical model trained directly on the same data with hard labels, often by a significant margin.
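The soft-label objective can be written down in a few lines. This is a minimal sketch of the standard distillation loss: cross-entropy between the teacher's and student's temperature-softened distributions. The logits and the temperature value are made-up examples.

```python
import math

# Minimal sketch of the distillation objective: the student is
# trained to match the teacher's softened probability distribution,
# not just the hard label. The logits below are made up.

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)   # student's guess
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# "90% positive, 8% neutral, 2% negative" carries far more signal
# than the hard label "positive" alone.
teacher = [4.0, 1.5, 0.2]   # positive, neutral, negative
student = [3.0, 2.0, 0.5]
loss = distillation_loss(teacher, student)
```

The temperature is the knob that exposes the "soft" knowledge: at higher temperatures the teacher's distribution flattens, so the student sees which wrong answers the teacher considers nearly right, not just which answer wins.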

Real-world examples of distillation are everywhere. DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language understanding. Many of the smaller models from AI providers are distilled from their larger siblings.

Practical implementation

If you want to use these techniques, here is how to approach them practically.

For quantization of an existing model, the simplest path is to use pre-quantized models. Platforms like Hugging Face host thousands of models already quantized in GPTQ, AWQ, and GGUF formats. Download one that matches your quality and speed requirements and you are done.

If you need to quantize a model yourself, tools like AutoGPTQ, AutoAWQ, and llama.cpp's conversion scripts handle the process. You provide the original model and a small calibration dataset (typically 128-512 examples representative of your use case), and the tool produces a quantized version. Always evaluate the quantized model on your specific tasks -- aggregate benchmarks can be misleading about performance on your particular use case.
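To see why a calibration dataset matters, here is a toy sketch of the core idea (not any specific tool's algorithm): pick the quantization range from representative values rather than the absolute maximum, so a single outlier cannot waste most of the available precision. The activation values are made up.

```python
# Toy sketch of what a calibration dataset buys you: instead of
# scaling to the absolute max (which one outlier can dominate),
# pick the clipping range from representative activation values.

def calibrated_scale(calibration_values, percentile=99.9, levels=127):
    vals = sorted(abs(v) for v in calibration_values)
    idx = min(len(vals) - 1, int(len(vals) * percentile / 100))
    return vals[idx] / levels

# One large outlier (50.0) among otherwise small activations.
activations = [0.01 * i for i in range(1000)] + [50.0]

naive_scale = max(abs(v) for v in activations) / 127
calib_scale = calibrated_scale(activations)
# The calibrated scale is much finer, so typical values are
# represented more precisely; the rare outlier gets clipped instead.
```

This is the same reason the tools ask for 128-512 representative examples: the calibration data determines which value ranges the quantizer spends its limited precision on.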

For distillation, the process is more involved. You need to run your target inputs through the teacher model to generate the training signal, then train the student model on those outputs. Frameworks like Hugging Face Transformers provide distillation utilities, and there are specific libraries like TextBrewer for text model distillation. The quality of your distillation dataset matters enormously -- it should represent the actual tasks and inputs the student model will encounter.

When to use each technique

Use quantization when you want a quick win with minimal effort. You have a model that works well but is too large or too slow. You want to deploy on consumer hardware or reduce serving costs. Pre-quantized models are often available so you do not need to do any work yourself.

Use distillation when you need a fundamentally smaller architecture (not just compressed weights). You have specific tasks where a specialized small model can match a generalist large model. You are willing to invest more effort for potentially larger gains in speed and cost.

Combine both when you need maximum compression. Distill a large model into a smaller one, then quantize the smaller model. This stacks the benefits: a distilled model that is already 5 times smaller can be quantized to become 20 times smaller than the original while retaining most of its capability on your target tasks.
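The stacking arithmetic is worth spelling out. The sizes below are illustrative, not measurements from any particular model pair:

```python
# Illustrative numbers: distill a 70B teacher into a 14B student
# (5x fewer parameters), then quantize FP16 -> INT4 (4x smaller).
teacher_params = 70e9
student_params = teacher_params / 5

teacher_fp16_gb = teacher_params * 2 / 1e9      # 2 bytes per weight
student_int4_gb = student_params * 0.5 / 1e9    # 0.5 bytes per weight

compression = teacher_fp16_gb / student_int4_gb  # 20x overall
```

Because the two techniques compress along different axes (parameter count versus bits per parameter), their factors multiply rather than overlap.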

Real-world impact

The practical impact of these techniques is enormous.

Running LLMs on phones: Apple's on-device AI features and Google's Gemini Nano use aggressively quantized and distilled models. A model that requires 32GB of GPU memory at full precision can run in 4GB of phone memory after INT4 quantization.

Affordable AI hosting: A company that would spend $10,000 per month serving a full-precision model can spend $2,000 per month (or less) serving a quantized version with negligible quality difference for their use case.

Offline and edge deployment: Quantized models power AI features in cars, IoT devices, and applications that need to work without internet access. This was impossible with full-precision models.

Faster response times: Smaller models generate tokens faster. A quantized model might respond in 500 milliseconds where the full-precision version took 2 seconds. For interactive applications, this difference is transformative.

Common mistakes

Assuming quantization always works. Some tasks are more sensitive to quantization than others. Math reasoning and coding tend to degrade more than general conversation. Always test on your specific use case.

Using the wrong quantization level. More compression is not always better. Q2 quantization saves more memory than Q4 but the quality drop may be unacceptable. Start with Q4 or INT8 and only go lower if you must.

Skipping evaluation after quantization. Benchmark numbers tell you the average impact, but your specific use case might be hit harder or lighter than average. Always evaluate on your own tasks before deploying.

Distilling without enough diverse training data. The student model can only learn what the teacher demonstrates. If your distillation dataset is too narrow, the student will fail on inputs outside that narrow range. Use a diverse, representative dataset.

Ignoring hardware compatibility. Not all hardware supports all quantization formats efficiently. INT8 is widely supported, but INT4 performance varies significantly across GPU and CPU architectures. Test on your actual deployment hardware.

What's next?

Explore more about making AI efficient and production-ready: