Efficient Inference Optimization
By Marcin Piekarski builtweb.com.au · Last Updated: 11 February 2026
TL;DR
Inference optimization is about making AI models respond faster and cheaper without sacrificing quality. Techniques like quantization, batching, caching, and speculative decoding can achieve 2-10x speedups. Whether you self-host models or use APIs, understanding these techniques helps you build better products and manage costs.
Why it matters
Training an AI model is a one-time cost (or at least infrequent). Inference -- the process of actually using the model to generate responses -- is the ongoing cost that runs every single time a user asks a question, every time your app processes a request, every minute of every day.
For context: training GPT-4 reportedly cost over $100 million. But OpenAI's inference costs for serving millions of users daily dwarf the training cost many times over. Inference is where the real money goes.
Speed matters too. Users expect responses in seconds, not minutes. A chatbot that takes 30 seconds to reply feels broken. A code completion tool that takes 5 seconds defeats its own purpose. Internal AI tools that take minutes per request will not get adopted by employees.
Whether you are self-hosting open-source models or building on top of API providers, understanding inference optimization helps you make better architecture decisions, reduce costs, and deliver a better user experience.
Quantization: JPEG compression for AI models
Quantization is the single most impactful optimization technique for most teams. The concept is simple: reduce the numerical precision of a model's parameters.
During training, model parameters are stored as high-precision numbers (32-bit floating point, or FP32). Each parameter takes up 4 bytes of memory. A 70-billion parameter model at FP32 needs about 280 GB of memory -- far more than any single GPU can hold.
Quantization converts these numbers to lower precision:
- FP16/BF16 (16-bit): Halves memory usage. Minimal quality loss. This is the baseline for most deployments.
- INT8 (8-bit): Quarters the original memory. Very small quality loss for most tasks. A 70B model fits on a single high-end GPU.
- INT4 (4-bit): One-eighth the original memory. Noticeable quality loss on difficult tasks, but acceptable for many use cases. A 70B model fits on consumer hardware.
Think of it like image compression. A RAW photo is huge and pixel-perfect. A high-quality JPEG is much smaller and looks nearly identical. A heavily compressed JPEG is tiny but visibly degraded. Quantization makes the same trade-off with AI models: smaller size and faster speed, with a gradual quality reduction.
Practical impact: Quantizing a model from FP16 to INT4 can make it run 2-3x faster while using 75% less memory. For many applications, the quality difference is imperceptible.
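To make the mapping concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The weight matrix is an invented stand-in for one layer; production tools (bitsandbytes, GPTQ, AWQ) use calibrated and often per-channel schemes, but the core map-floats-to-integers step looks like this:

```python
import numpy as np

# Hypothetical FP32 weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0, 0.02, size=(4, 4)).astype(np.float32)

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_int8(weights_fp32)
recovered = dequantize(q, scale)

# INT8 storage is 1 byte per parameter vs 4 bytes for FP32.
print("memory ratio:", q.nbytes / weights_fp32.nbytes)   # 0.25
print("max round-trip error:", np.max(np.abs(recovered - weights_fp32)))
```

The round-trip error is bounded by half the scale factor, which is why quality degrades gradually rather than collapsing as precision drops.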
Knowledge distillation: teaching a small model from a large one
Knowledge distillation is another powerful approach: train a small, fast model to mimic the behavior of a large, accurate model.
The process works like this:
- Run the large model (the "teacher") on your dataset and record its outputs, including the probability distributions over possible answers.
- Train a small model (the "student") to reproduce those same outputs. The student learns not just the right answers but the teacher's confidence and nuance.
- Deploy the student model in production, enjoying dramatically faster inference.
The student model often achieves 90-95% of the teacher's quality at 10-50x the speed. This is because the teacher's outputs contain richer information than raw training data -- they encode the teacher's "reasoning" about each example.
Real-world example: Many production AI features use distilled models. The AI that powers autocomplete in your email client is not a 175-billion parameter model -- it is a small, distilled model that runs in milliseconds, trained to mimic a much larger model's capabilities for that specific task.
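A sketch of the core training signal, assuming the common temperature-softened KL-divergence formulation of distillation (the logits below are invented for illustration):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T spreads probability mass."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()               # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard distillation recipe."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

# The teacher is confident but spreads some probability over near-misses;
# the student is trained to match that full distribution, not just the argmax.
teacher = [4.0, 2.5, 0.1]
student = [3.0, 1.0, 0.5]
print(distillation_loss(teacher, student))
```

The loss is zero when the student exactly reproduces the teacher's distribution, which is what makes the "soft" targets richer than hard labels.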
Batching: processing multiple requests together
When a GPU processes one request at a time, most of its compute capacity sits idle. Batching groups multiple requests together and processes them in parallel, dramatically improving throughput.
Static batching
Wait until you have a fixed number of requests (say, 8), then process them all at once. Simple but introduces latency -- early requests wait for the batch to fill up.
Dynamic batching
More flexible. The server batches whatever requests arrive within a short time window (or up to a maximum batch size), so no request waits longer than the window for a batch to fill. This keeps GPU utilization high without forcing requests to wait indefinitely. Most modern serving frameworks support this.
Continuous batching
The most advanced approach, used by vLLM and other modern frameworks. Since LLM generation produces tokens one at a time, different requests in a batch may finish at different times. Continuous batching immediately fills the freed slot with a new request, maximizing GPU utilization at every moment.
Practical impact: Dynamic and continuous batching can improve throughput by 2-5x compared to processing one request at a time, with little or no increase in latency for individual requests.
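The continuous-batching scheduling idea can be simulated in a few lines. This is a toy model with invented request lengths, where each step generates one token per occupied slot; real frameworks like vLLM manage GPU memory and KV cache blocks alongside this scheduling loop:

```python
from collections import deque

def continuous_batching(request_lengths, batch_size):
    """Simulate continuous batching: every step generates one token per
    active slot, and a finished slot is immediately refilled from the queue."""
    queue = deque(request_lengths)   # tokens still to generate per request
    slots = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots]          # one decode step per slot
        slots = [s for s in slots if s > 0]     # drop finished requests
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())       # refill freed slots at once
    return steps

# Six requests of mixed lengths on a 2-slot GPU:
print(continuous_batching([5, 1, 3, 2, 2, 3], batch_size=2))  # 9 steps
```

With 16 total tokens and 2 slots, the ideal is 8 steps; continuous batching reaches 9 because only the final step runs with a partly empty batch, whereas static batching would idle a slot every time a short request finished early.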
Caching strategies
KV cache (Key-Value cache)
This is specific to transformer models and is critical for understanding why LLM inference works the way it does.
When a transformer model generates text, it produces one token at a time. For each new token, it needs to "attend to" (look at) all previous tokens. Without caching, the model would recompute the attention calculations for all previous tokens at every step -- generating the 100th token would redo all the work from tokens 1-99.
The KV cache stores the intermediate calculations (key and value vectors) from previous tokens. When generating the next token, the model only computes the new token's contribution and looks up the cached values for everything before it.
Without KV cache: Generating 100 tokens requires roughly 100 + 99 + 98 + ... + 1 = 5,050 units of work.
With KV cache: Generating 100 tokens requires roughly 100 units of work.
The trade-off is memory. KV caches can consume significant GPU memory, especially for long conversations. This is why models have context window limits -- the KV cache for a 100,000 token conversation is enormous.
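The arithmetic above can be checked directly. A sketch, counting "attending to one previous position" as one unit of work:

```python
def work_without_cache(n):
    """Token t re-attends over all t positions from scratch: 1 + 2 + ... + n."""
    return sum(range(1, n + 1))

def work_with_cache(n):
    """Cached keys/values mean each step only computes the new token's
    contribution: roughly one unit of new work per generated token."""
    return n

print(work_without_cache(100))  # 5050
print(work_with_cache(100))     # 100
```

The uncached cost grows quadratically with sequence length while the cached cost grows linearly, which is exactly the trade that the KV cache's memory consumption buys.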
Prompt caching
If many requests share a common prefix (the same system prompt, for example), you can compute and cache the KV values for that prefix once and reuse them for all requests. API providers like Anthropic and OpenAI offer prompt caching that can reduce costs by 50-90% for requests with shared prefixes.
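The mechanism can be sketched as a lookup keyed by the shared prefix. Everything here is a stand-in -- real providers cache actual KV tensors server-side and expose the feature through their own APIs -- but the structure is the same: pay for the prefix once, reuse it for every request that shares it.

```python
import hashlib

_prefix_cache = {}
calls = {"expensive": 0}

def compute_prefix_state(prefix):
    """Stand-in for the heavy attention pass over the shared prefix."""
    calls["expensive"] += 1
    return f"kv-state-for-{len(prefix)}-chars"

def run(prefix, user_message):
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:
        _prefix_cache[key] = compute_prefix_state(prefix)
    state = _prefix_cache[key]
    # Generation continues from the cached prefix state.
    return f"{state} + generate({user_message!r})"

system = "You are a helpful assistant. " * 50   # long shared system prompt
run(system, "first question")
run(system, "second question")
print(calls["expensive"])   # prefix computed once, reused for both requests
```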
Semantic caching
Cache complete responses for semantically similar queries. If ten users ask "What is the capital of France?" in slightly different ways, generate the answer once and serve cached responses for the rest. This requires an embedding-based similarity check but can dramatically reduce costs for common queries.
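A minimal sketch of the idea. The `embed` function here is a deliberately crude bag-of-characters stand-in for a real sentence-embedding model, and the 0.95 similarity threshold is an arbitrary choice you would tune on real traffic:

```python
import numpy as np

def embed(text):
    """Toy embedding: normalized letter counts. A real system would call
    a sentence-embedding model here."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []            # (embedding, cached response)

    def get(self, query):
        q = embed(query)
        for e, response in self.entries:
            if float(q @ e) >= self.threshold:   # cosine similarity
                return response
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is the capital of france"))  # Paris (cache hit)
print(cache.get("How do transformers work?"))      # None (cache miss)
```

The linear scan works for small caches; at scale you would swap it for a vector index so lookups stay fast as entries accumulate.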
Hardware optimization basics
Different hardware excels at different aspects of inference:
- NVIDIA GPUs (A100, H100, H200): The standard for LLM inference. H100s offer significant speedups over A100s due to better support for FP8 operations and higher memory bandwidth.
- Apple Silicon (M-series): Surprisingly capable for local inference of smaller models, thanks to unified memory architecture.
- Intel/AMD CPUs: Suitable for smaller models or with heavy quantization. Much cheaper per hour but slower per token.
- Specialized AI chips (Google TPUs, AWS Inferentia, Groq LPUs): Designed specifically for inference workloads. Groq's LPU architecture, for example, achieves extremely high throughput by eliminating the memory bandwidth bottleneck.
For API users: You do not need to worry about hardware selection -- the provider handles it. But understanding hardware helps you evaluate provider pricing and make informed decisions about self-hosting versus API usage.
Measuring what matters
Track these metrics to understand your inference performance:
- Latency (time to first token): How long before the user sees the first word of the response. This is the most important metric for user experience. Aim for under 1 second.
- Latency (total generation time): How long until the complete response is ready. For streaming responses, time-to-first-token matters more.
- Throughput (tokens per second): How many tokens the system generates per second across all concurrent requests. This determines how many users you can serve simultaneously.
- p50, p95, p99 latency: The median, 95th percentile, and 99th percentile response times. p95 is often more informative than the average -- it is the latency that 95% of requests stay under, and the threshold the slowest 5% of users exceed.
- Cost per token: How much each generated token costs in compute. This is the metric that determines your unit economics.
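Computing the percentile metrics from raw latency samples is straightforward; the sample values below are invented, with one slow outlier to show why the mean can mislead:

```python
import numpy as np

# Hypothetical per-request latencies in milliseconds from production traffic.
latencies_ms = [120, 135, 140, 150, 160, 180, 200, 240, 400, 1200]

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

# The mean hides the tail that p95/p99 expose:
print(f"mean={np.mean(latencies_ms):.0f}ms")
```

Here the median is a comfortable 170 ms, but the tail percentiles reveal that a slice of users waits several times longer -- exactly the signal averages wash out.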
Practical tips for API users
Even if you are not self-hosting models, you can optimize inference performance:
- Use prompt caching. If your system prompt is long and shared across requests, enable prompt caching. Anthropic and OpenAI both offer this feature.
- Stream responses. Do not wait for the complete response before showing it to users. Stream tokens as they are generated to reduce perceived latency.
- Choose the right model size. Not every task needs the largest model. Using Claude Haiku or GPT-4o-mini for simple tasks can be 10-20x cheaper and 5x faster than using the flagship model.
- Batch non-urgent requests. If you have bulk processing jobs (analyzing 1,000 documents), use batch APIs that offer 50% cost discounts in exchange for longer processing times.
- Cache common responses. If you see repeated or similar queries, implement a semantic cache to avoid redundant API calls.
- Minimize input tokens. Be concise in system prompts and instructions. Every unnecessary token costs money and adds latency.
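Of these, streaming is the easiest win to demonstrate. The sketch below fakes a token stream with a generator -- real SDKs expose streaming through their own iterator interfaces -- but the consumption pattern is the same: render each chunk as it arrives instead of waiting for the whole response.

```python
import time

def fake_token_stream(text, delay=0.0):
    """Stand-in for a streaming API response: yields tokens as they are
    'generated' instead of returning the full completion at once."""
    for token in text.split():
        time.sleep(delay)
        yield token + " "

# Render tokens as they arrive so the user sees output immediately.
chunks = []
for chunk in fake_token_stream("Streaming makes apps feel fast"):
    chunks.append(chunk)          # in a real UI: append to the page
print("".join(chunks).strip())
```

Total generation time is unchanged, but time-to-first-token -- the metric users actually feel -- drops from the full response time to the time for one token.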
Common mistakes
- Optimizing before measuring. You cannot improve what you do not measure. Instrument your system with latency and throughput tracking before attempting any optimization.
- Choosing the largest model by default. Using a 70B parameter model for a task that a 7B model handles well is paying 10x more for no benefit. Benchmark smaller models on your specific task before committing to the largest available.
- Ignoring quantization. Many teams deploy models at FP16 by default and never try quantization. INT8 or even INT4 quantization often provides dramatic speedups with minimal quality impact for production use cases.
- Forgetting about cold starts. The first request after a model is loaded takes much longer than subsequent ones. In serverless deployments, this "cold start" penalty can be several seconds. Design your architecture to handle this -- keep models warm with periodic health checks.
- Not using streaming. Waiting for a complete response before displaying anything to the user is the simplest way to make your application feel slow. Streaming tokens as they are generated makes the application feel responsive even when total generation takes several seconds.
What's next?
- Cost and Latency -- understanding the fundamental cost-speed trade-offs in AI systems
- AI Cost Management -- comprehensive strategies for managing AI infrastructure expenses
- Deployment Patterns -- how to architect AI systems for production reliability
- AI Latency Optimization -- deep dive into reducing response times across the full stack
Frequently Asked Questions
What is the single most impactful inference optimization?
For self-hosted models, quantization -- converting from FP16 to INT8 or INT4 typically provides 2-3x speedup with minimal quality loss. For API users, choosing the right model size for each task is the biggest lever. Using a smaller, cheaper model for simple tasks while reserving the largest model for complex reasoning can cut costs by 5-10x.
Can I optimize inference speed if I am using an API like OpenAI or Anthropic?
Yes. Enable prompt caching for shared system prompts, stream responses instead of waiting for completion, use the smallest model that handles your task well, batch non-urgent requests, and implement semantic caching for repeated queries. These strategies can significantly reduce both cost and latency.
How much quality do you lose with quantization?
With INT8 quantization, most benchmarks show less than 1% performance degradation -- essentially undetectable in practice. INT4 quantization shows 2-5% degradation on challenging benchmarks, though for many production tasks the difference is not noticeable to end users. Always benchmark on your specific use case.
Is it worth self-hosting models instead of using APIs?
It depends on your volume and requirements. At low volumes (under 10,000 requests per day), APIs are almost always cheaper and simpler. At high volumes (millions of requests), self-hosting with proper optimization can be significantly cheaper. Self-hosting also gives you more control over latency, privacy, and customization.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication.
Key Terms Used in This Guide
Inference
The process where a trained AI model takes new input and produces an output—a prediction, answer, or generated text. This is the 'using' phase that happens after training is complete.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Latency
The time delay between sending a request to an AI model and receiving the first part of its response. Lower latency means faster replies.
Related Guides
- AI Latency Optimization: Making AI Faster (Intermediate, 10 min read) -- Learn to reduce AI response times. From model optimization to infrastructure tuning -- practical techniques for building faster AI applications.
- Cost & Latency: Making AI Fast and Affordable (Advanced, 13 min read) -- Optimize AI systems for speed and cost. Techniques for reducing latency, controlling API costs, and scaling efficiently.
- Benchmarking AI Models: Measuring What Matters (Intermediate, 9 min read) -- Learn to benchmark AI models effectively. From choosing metrics to running fair comparisons -- practical guidance for evaluating AI performance.