Quantization
Also known as: Model Quantization, Weight Quantization
In one sentence
A compression technique that reduces AI model size and memory usage by using lower-precision numbers, making models faster and cheaper to run.
Explain like I'm 12
Like compressing a huge photo file to make it smaller — you lose a tiny bit of quality, but now it loads way faster and takes up less space on your phone.
In context
Converting a 70-billion-parameter model from 16-bit to 4-bit precision can shrink it from 140 GB to around 35 GB, letting it run on a single consumer GPU instead of an expensive server cluster. Methods and formats like GPTQ and GGUF make quantized versions of open-source models available for local use. Businesses use quantization to cut cloud inference costs by 50-75% while keeping output quality within a few percentage points of the original.
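The core idea behind the savings above can be shown in a few lines. Below is a minimal sketch of symmetric linear quantization to 8-bit integers using NumPy: store one float scale per tensor, round each weight to the nearest integer multiple of that scale, and reconstruct approximate floats at inference time. Production tools such as GPTQ go further (per-group scales, error-minimizing rounding), so treat this as an illustration, not a real implementation.

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric linear quantization: pick a scale so the largest
    # weight maps to 127, then round everything to int8.
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction of the original floats.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print("max abs error:", np.abs(w - w_hat).max())
print("memory:", w.nbytes, "->", q.nbytes, "bytes")  # 4x smaller: 64 -> 16
```

The rounding error per weight is at most half the scale, which is why quality drops only slightly; going from float32 to int8 cuts weight memory by 4x, and 4-bit schemes apply the same idea with even smaller integers.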
Related Guides
Learn more about Quantization in these guides:
Quantization and Distillation Deep Dive (Advanced, 8 min read)
Master advanced model compression: quantization-aware training, mixed precision, and distillation strategies for production deployment.

Model Compression: Smaller, Faster AI (Advanced, 9 min read)
Compress AI models with quantization, pruning, and distillation. Deploy faster, cheaper models without sacrificing much accuracy.

Efficient Inference Optimization (Advanced, 8 min read)
Optimize AI inference for speed and cost: batching, caching, model serving, KV cache, speculative decoding, and more.