Cost & Latency: Making AI Fast and Affordable
Optimize AI systems for speed and cost. Techniques for reducing latency, controlling API costs, and scaling efficiently.
TL;DR
AI can drain your budget and frustrate users if you don't optimize for cost and speed. Use smaller models when possible, cache aggressively, stream responses, batch requests, and compress prompts. Choose the right model for each task—GPT-4 for complexity, GPT-3.5 for speed and savings. Monitor usage, set spending alerts, and test latency regularly. Small changes can cut costs by 50-90% and reduce response times from seconds to milliseconds.
Why it matters
Every AI API call costs money. Every second of latency loses users. Whether you're running a chatbot, content generator, or search system, you need to balance quality, speed, and cost. A poorly optimized AI system can rack up thousands in monthly bills and deliver slow, frustrating experiences. Optimization isn't optional—it's essential for sustainable AI products.
Understanding AI costs
AI costs come from API calls, compute, and storage. Most production systems use API-based models (OpenAI, Anthropic, Cohere), which charge per token—roughly 4 characters of text, or about 0.75 words.
Pricing examples (as of 2024)
| Model | Input cost | Output cost | Use case |
|---|---|---|---|
| GPT-4 Turbo | $10 / 1M tokens | $30 / 1M tokens | Complex reasoning, writing, analysis |
| GPT-3.5 Turbo | $0.50 / 1M tokens | $1.50 / 1M tokens | Chatbots, fast tasks, high-volume |
| Claude 3 Opus | $15 / 1M tokens | $75 / 1M tokens | Detailed analysis, long contexts |
| Claude 3 Haiku | $0.25 / 1M tokens | $1.25 / 1M tokens | Fast tasks, high-throughput |
| Embeddings (Ada) | $0.10 / 1M tokens | N/A | Search, RAG, similarity |
Real cost breakdown
Let's say your chatbot processes 10,000 conversations per day. Each conversation averages 500 input tokens (user + context) and 300 output tokens (AI response).
Daily cost with GPT-4 Turbo:
- Input: 10,000 × 500 × $10 / 1M = $50
- Output: 10,000 × 300 × $30 / 1M = $90
- Total: $140/day = $4,200/month
Daily cost with GPT-3.5 Turbo:
- Input: 10,000 × 500 × $0.50 / 1M = $2.50
- Output: 10,000 × 300 × $1.50 / 1M = $4.50
- Total: $7/day = $210/month
Savings: 95% by switching models for simpler tasks.
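To sanity-check figures like these for your own traffic, it helps to script the arithmetic. A minimal sketch in Python, using the 2024 prices from the table above (substitute current rates for your provider):

```python
# Rough per-day / per-month cost estimator. Prices are $ per 1M tokens
# (2024 figures from the table above); swap in current pricing.
PRICES = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

def daily_cost(model: str, requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    input_cost = requests_per_day * input_tokens * p["input"] / 1_000_000
    output_cost = requests_per_day * output_tokens * p["output"] / 1_000_000
    return input_cost + output_cost

for model in PRICES:
    cost = daily_cost(model, requests_per_day=10_000, input_tokens=500, output_tokens=300)
    print(f"{model}: ${cost:.2f}/day, ${cost * 30:.2f}/month")
```

Running it reproduces the $140/day vs. $7/day split above.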
Jargon: "Token"
A chunk of text used by AI models—roughly 4 characters or 0.75 words. Models charge per token processed.
Measuring latency
Latency is how long users wait for AI responses. Key metrics:
- Time-to-first-token (TTFT): How long until the first word appears (critical for streaming)
- Total response time: How long until the full response is complete
- Throughput: How many requests the system handles per second
Typical latencies (GPT-3.5 Turbo, 500 tokens)
- TTFT: 200-800ms
- Total response time: 2-5 seconds
- Streaming vs. non-streaming: Streaming feels 2-3x faster to users because they see output immediately
Users notice delays above 200ms. Sub-100ms feels instant. Above 1 second feels slow.
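You can measure TTFT and total response time directly by timing a streaming call. A minimal sketch, assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the model and prompt are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain caching in one paragraph."}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no text (role headers, final chunk), so guard first.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token

print(f"TTFT:  {(first_token_at - start) * 1000:.0f} ms")
print(f"Total: {(time.perf_counter() - start) * 1000:.0f} ms")
```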
Cost optimization techniques
1. Use the smallest model that works
Don't default to GPT-4 for everything. Most tasks—chat, summarization, Q&A—work fine with GPT-3.5 or Claude Haiku.
Decision tree (a minimal routing sketch follows the list):
- Complex reasoning, coding, research? → GPT-4, Claude Opus
- Simple chat, FAQs, summaries? → GPT-3.5, Claude Haiku
- Embeddings, search? → Ada, Cohere Embed
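To make the routing concrete, here's a minimal sketch. The task labels and the plain lookup are illustrative assumptions; in practice you might route on request type, a cheap classifier, or explicit product rules.

```python
# Hypothetical router: pick the cheapest model that can handle the task.
ROUTES = {
    "reasoning": "gpt-4-turbo",              # complex analysis, coding, research
    "chat": "gpt-3.5-turbo",                 # FAQs, summaries, high-volume chat
    "embedding": "text-embedding-ada-002",   # search, RAG, similarity
}

def pick_model(task_type: str) -> str:
    # Default to the cheap chat model when the task type is unknown.
    return ROUTES.get(task_type, "gpt-3.5-turbo")

print(pick_model("chat"))       # gpt-3.5-turbo
print(pick_model("reasoning"))  # gpt-4-turbo
```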
2. Cache responses aggressively
If users ask the same questions, cache the answers instead of re-calling the API.
Example: A customer support bot might answer "What's your refund policy?" 50 times a day. Cache the first response, serve it from memory (cost: $0).
Tools: Redis, Memcached, in-memory cache (LRU).
Savings: 30-70% cost reduction for repetitive queries.
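A minimal in-process sketch of the idea: key the cache on a hash of the normalized prompt and only call the API on a miss. Redis or Memcached follow the same get/set pattern with a TTL; call_model here is a placeholder for your actual API call.

```python
import hashlib

_cache: dict[str, str] = {}  # in production: Redis/Memcached with a TTL

def cache_key(prompt: str) -> str:
    # Normalize so trivially different spellings of the same FAQ collide.
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def answer(prompt: str, call_model) -> str:
    key = cache_key(prompt)
    if key in _cache:
        return _cache[key]           # cache hit: zero API cost, near-zero latency
    response = call_model(prompt)    # cache miss: pay for one API call
    _cache[key] = response
    return response

# Usage: answer("What's your refund policy?", call_model=my_llm_call)
```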
3. Compress prompts
Every word in your prompt costs money. Cut fluff.
Before (~30 tokens):
"I would like you to please summarize the following article for me in a way that is easy to understand and concise: [article]"
After (~7 tokens):
"Summarize this article concisely: [article]"
Savings: roughly 75% fewer instruction tokens.
4. Use shorter contexts
Don't dump entire documents into the prompt. Use RAG to retrieve only relevant chunks (see Embeddings & RAG Explained).
Example: Instead of sending 10,000-token docs to the model, retrieve the top 3 relevant paragraphs (500 tokens). Saves 95% on input costs.
5. Batch requests
If you're processing 1,000 documents, don't send 1,000 individual API calls. Batch them (if the API supports it) or process in parallel with rate limits.
OpenAI Batch API: 50% discount for non-urgent tasks (results returned within 24 hours).
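A rough sketch of submitting a batch job with the OpenAI Python SDK, reflecting the Batch API as of 2024 (check the current docs for exact fields): write one request per line to a JSONL file, upload it, then create the batch.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. One request per line, in the Batch API's JSONL format.
with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(["first document ...", "second document ..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-3.5-turbo",
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 100,
            },
        }) + "\n")

# 2. Upload the file and create the batch job (results within 24 hours).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```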
6. Set max tokens
Limit output length with max_tokens. If you only need a 50-word summary, don't let the model generate 500 words.
Savings: 90% on output costs for short tasks.
7. Use streaming for perceived speed
Streaming doesn't reduce cost, but it feels faster to users. They see the response building in real-time instead of staring at a loading spinner.
Implementation: Set stream: true in API calls (OpenAI, Anthropic support this).
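A minimal streaming sketch with the OpenAI Python SDK; the same call also caps output with max_tokens (technique 6). Anthropic's SDK offers an equivalent streaming option.

```python
from openai import OpenAI

client = OpenAI()

# stream=True yields tokens as they are generated; max_tokens caps billable output.
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize HTTP caching in about 50 words."}],
    max_tokens=80,   # hard cap on output tokens (and output cost)
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)  # render as it arrives
print()
```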
8. Monitor and alert
Set up spending alerts so you know if costs spike. Track:
- Daily/weekly costs
- Cost per request
- Token usage by endpoint
Tools: OpenAI usage dashboard, custom analytics (log API calls, track costs in your DB).
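A minimal sketch of per-request logging, assuming the OpenAI Python SDK and a local SQLite table: it records the token usage the API returns plus a computed cost and latency, which is enough to drive spend dashboards and alerts.

```python
import sqlite3
import time
from openai import OpenAI

PRICES = {"gpt-3.5-turbo": (0.50, 1.50)}  # $ per 1M input/output tokens (2024)

db = sqlite3.connect("usage.db")
db.execute("""CREATE TABLE IF NOT EXISTS api_calls
              (ts REAL, model TEXT, prompt_tokens INT, completion_tokens INT,
               cost REAL, latency_ms REAL)""")

client = OpenAI()

def tracked_call(model: str, messages: list[dict]) -> str:
    start = time.perf_counter()
    resp = client.chat.completions.create(model=model, messages=messages)
    latency_ms = (time.perf_counter() - start) * 1000
    in_price, out_price = PRICES[model]
    cost = (resp.usage.prompt_tokens * in_price
            + resp.usage.completion_tokens * out_price) / 1_000_000
    db.execute("INSERT INTO api_calls VALUES (?, ?, ?, ?, ?, ?)",
               (time.time(), model, resp.usage.prompt_tokens,
                resp.usage.completion_tokens, cost, latency_ms))
    db.commit()
    return resp.choices[0].message.content
```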
Latency optimization techniques
1. Stream responses
Non-streaming: the user waits 5 seconds, then sees the full response.
Streaming: the user sees the first words within ~500ms, which feels far faster even though total generation time is unchanged.
Always stream for user-facing chat.
2. Use smaller, faster models
GPT-3.5 typically responds in a couple of seconds for a few-hundred-token answer; GPT-4 often takes ~5 seconds or more; Claude Haiku is faster still (1-2 seconds).
For speed-critical tasks, use the smallest viable model.
3. Reduce prompt size
Larger prompts = slower processing. Keep prompts under 1,000 tokens when possible.
4. Edge deployment (advanced)
Run models close to users (edge servers, platforms like Cloudflare Workers). This reduces network latency.
Example: Hosting a small model on the edge can shave 200-500ms off response time vs. a US-based API from Europe.
Trade-off: More complex infrastructure, limited model sizes.
5. Precompute when possible
If you know users will ask certain questions, generate answers ahead of time and serve them instantly.
Example: FAQ bot pre-generates answers to common questions, stores them in a DB. Zero latency, zero cost per query.
6. Parallelize tasks
If you need to process multiple steps (embeddings, retrieval, LLM call), run them in parallel where possible.
Example: While retrieving docs from a vector DB, warm up the API connection in parallel. This can save 100-200ms.
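A sketch of the idea with asyncio; retrieve_chunks and warm_up_connection are placeholders for your vector-DB query and whatever warm-up step applies (opening a connection, a cheap no-op request).

```python
import asyncio

async def retrieve_chunks(query: str) -> list[str]:
    await asyncio.sleep(0.15)   # stand-in for a vector DB query
    return ["chunk A", "chunk B", "chunk C"]

async def warm_up_connection() -> None:
    await asyncio.sleep(0.10)   # stand-in for opening/warming the API connection

async def answer(query: str) -> list[str]:
    # Run retrieval and warm-up concurrently instead of back-to-back.
    chunks, _ = await asyncio.gather(retrieve_chunks(query), warm_up_connection())
    return chunks

print(asyncio.run(answer("What is our refund policy?")))
```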
Choosing the right model
When to use GPT-4 (or Claude Opus)
- Complex reasoning (legal analysis, research, code reviews)
- Long, nuanced writing
- Multi-step problem solving
- High-stakes accuracy
When to use GPT-3.5 (or Claude Haiku)
- Simple chat, FAQs
- Summarization
- High-volume tasks
- Cost-sensitive applications
When to use embeddings (Ada, Cohere)
- Search, RAG, similarity matching
- Clustering, recommendations
When to self-host (advanced)
- Very high volume (100k+ requests/day)
- Specialized models (fine-tuned, domain-specific)
- Data privacy requirements (on-prem)
Cost crossover: Cloud APIs are usually cheaper below roughly 50k-100k requests/day. Above that, self-hosting can save 50-80%.
Real-world cost-saving examples
Example 1: Customer support bot
Before:
- 5,000 queries/day
- GPT-4 for all queries
- $70/day = $2,100/month
After:
- Cache 40% of common FAQs (free)
- Use GPT-3.5 for 50% of queries
- Use GPT-4 for 10% of complex queries
- $8/day = $240/month
Savings: $1,860/month (89%)
Example 2: Content summarizer
Before:
- 1,000 articles/day
- Full article (5,000 tokens) sent to GPT-3.5
- ~$2.50/day ≈ $75/month (input costs, at $0.50 / 1M tokens)
After:
- Use RAG to extract top 500 tokens
- Compress prompts
- ~$0.25/day ≈ $7.50/month in input costs
Savings: ~$67.50/month (90%)
Example 3: Search system (RAG)
Before:
- 10,000 searches/day
- Retrieve 10 chunks (2,000 tokens) + GPT-3.5, ~300-token answers
- ~$14.50/day ≈ $435/month
After:
- Retrieve 3 chunks (600 tokens)
- Cache the top 20% of queries
- Use Claude Haiku (cheaper input and output)
- ~$4.20/day ≈ $126/month
Savings: ~$309/month (71%)
Infrastructure choices
Cloud APIs (OpenAI, Anthropic, Cohere)
- Pros: No setup, auto-scaling, latest models
- Cons: Per-request cost, no control over infrastructure
- Best for: Startups, MVPs, low-to-medium volume
Self-hosted (Hugging Face, vLLM, Ollama)
- Pros: Cost-effective at scale, full control, data privacy
- Cons: Setup complexity, GPU costs, maintenance
- Best for: High volume (100k+ requests/day), specialized models
Hybrid
- Use cloud APIs for prototyping
- Switch to self-hosted for high-volume tasks
- Keep edge cases on cloud APIs
Monitoring and alerts
Track these metrics:
- Cost per request (detect inefficiencies)
- Total monthly spend (budget control)
- 95th percentile latency (catch slow outliers)
- Cache hit rate (optimize caching)
- Error rate (failed API calls waste money)
Tools: OpenAI dashboard, Datadog, custom logging (store API call metadata in your DB).
Set alerts (a minimal check is sketched after this list):
- Daily spend > $X
- Latency > 5 seconds
- Error rate > 5%
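A minimal sketch of these checks, meant to run on a schedule against your logged call data; the thresholds and the alert hook are placeholders.

```python
import statistics

def check_alerts(daily_spend: float, latencies_ms: list[float], error_rate: float,
                 spend_limit: float = 100.0, alert=print) -> None:
    # 95th percentile latency from the day's logged request latencies.
    if len(latencies_ms) >= 20:
        p95 = statistics.quantiles(latencies_ms, n=20)[18]
    else:
        p95 = max(latencies_ms)  # too few samples for a stable percentile
    if daily_spend > spend_limit:
        alert(f"Daily spend ${daily_spend:.2f} exceeds ${spend_limit:.2f}")
    if p95 > 5_000:
        alert(f"p95 latency {p95:.0f} ms exceeds 5 s")
    if error_rate > 0.05:
        alert(f"Error rate {error_rate:.1%} exceeds 5%")

check_alerts(daily_spend=42.0, latencies_ms=[850, 1200, 4000, 900, 760],
             error_rate=0.01, spend_limit=40.0)
```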
Common pitfalls
1. Not caching
Repeated queries cost money. Cache everything you can.
2. Over-relying on GPT-4
Use it only when necessary. GPT-3.5 handles 80% of tasks at 1/20th the cost.
3. Ignoring prompt size
Every extra word costs money. Compress aggressively.
4. Not streaming
Users hate waiting. Stream to improve perceived speed.
5. No spending alerts
Costs can spiral. Set alerts early.
Use responsibly
- Don't sacrifice quality for cost (users notice bad answers)
- Monitor for bias and errors (cheaper models can be less accurate)
- Test changes rigorously (A/B test model swaps to verify quality)
- Set hard spending limits (prevent runaway costs in production)
What's next?
- Deployment Patterns: Learn about serverless, edge, and container options
- Embeddings & RAG: Optimize retrieval for cost and speed
- Evaluations 201: Measure quality vs. cost trade-offs
- Prompting 201: Advanced techniques for efficient prompts