
Quantization

Also known as: Model Quantization, Weight Quantization

In one sentence

A compression technique that reduces AI model size and memory usage by using lower-precision numbers, making models faster and cheaper to run.

Explain like I'm 12

Like compressing a huge photo file to make it smaller — you lose a tiny bit of quality, but now it loads way faster and takes up less space on your phone.

In context

Converting a 70-billion-parameter model from 16-bit to 4-bit precision shrinks it from roughly 140 GB to under 35 GB, letting it run on a single consumer GPU instead of an expensive server cluster. Methods like GPTQ and file formats like GGUF make quantized versions of open-source models available for local use. Businesses use quantization to cut cloud inference costs by 50-75% while keeping output quality within a few percentage points of the original.
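The core idea can be sketched in a few lines: pick a scale so the largest weight maps to the largest representable integer, then round every weight to that integer grid. This is a minimal illustration of symmetric per-tensor quantization in plain Python (function names are illustrative; real tools like GPTQ use more sophisticated per-group and error-correcting schemes):

```python
def quantize(weights, bits=4):
    """Symmetric per-tensor quantization: floats -> small signed ints + scale."""
    qmax = 2 ** (bits - 1) - 1               # largest positive int, e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]  # each value now fits in `bits` bits
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most half a quantization step."""
    return [v * scale for v in q]

# Storing the int list plus one scale uses ~4 bits per weight instead of 16 or 32.
weights = [0.5, -1.0, 0.25, 0.1]
q, scale = quantize(weights)
approx = dequantize(q, scale)
```

The values in `approx` differ from the originals by at most half a step (`scale / 2`), which is the "tiny bit of quality" traded for a 4-8x reduction in storage.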
