Cost & Latency: Making AI Fast and Affordable
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
AI can drain your budget and frustrate users if you don't optimize for cost and speed. Use smaller models when possible, cache aggressively, stream responses, batch requests, and compress prompts. Choose the right model for each task—GPT-4 for complexity, GPT-3.5 for speed and savings. Monitor usage, set spending alerts, and test latency regularly. Small changes can cut costs by 50-90% and reduce response times from seconds to milliseconds.
Why it matters
Every AI API call costs money. Every second of latency loses users. Whether you're running a chatbot, content generator, or search system, you need to balance quality, speed, and cost. A poorly optimized AI system can rack up thousands in monthly bills and deliver slow, frustrating experiences. Optimization isn't optional—it's essential for sustainable AI products.
Understanding AI costs
AI costs come from API calls, compute, and storage. Most production systems use API-based models (OpenAI, Anthropic, Cohere), which charge per token—roughly 4 characters of text, or about 0.75 words.
Pricing examples (as of 2024)
| Model | Input cost | Output cost | Use case |
|---|---|---|---|
| GPT-4 Turbo | $10 / 1M tokens | $30 / 1M tokens | Complex reasoning, writing, analysis |
| GPT-3.5 Turbo | $0.50 / 1M tokens | $1.50 / 1M tokens | Chatbots, fast tasks, high-volume |
| Claude 3 Opus | $15 / 1M tokens | $75 / 1M tokens | Detailed analysis, long contexts |
| Claude 3 Haiku | $0.25 / 1M tokens | $1.25 / 1M tokens | Fast tasks, high-throughput |
| Embeddings (Ada) | $0.10 / 1M tokens | N/A | Search, RAG, similarity |
Real cost breakdown
Let's say your chatbot processes 10,000 conversations per day. Each conversation averages 500 input tokens (user + context) and 300 output tokens (AI response).
Daily cost with GPT-4 Turbo:
- Input: 10,000 × 500 × $10 / 1M = $50
- Output: 10,000 × 300 × $30 / 1M = $90
- Total: $140/day = $4,200/month
Daily cost with GPT-3.5 Turbo:
- Input: 10,000 × 500 × $0.50 / 1M = $2.50
- Output: 10,000 × 300 × $1.50 / 1M = $4.50
- Total: $7/day = $210/month
Savings: 95% by switching models for simpler tasks.
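The arithmetic above is easy to wrap in a small helper so you can compare models before committing. A minimal sketch; the prices are the 2024 list prices from the table, and the volumes match the worked example.

```python
def daily_cost(conversations, in_tokens, out_tokens,
               in_price_per_m, out_price_per_m):
    """Estimate daily API spend in dollars.

    Prices are dollars per 1M tokens, as in the pricing table above.
    """
    input_cost = conversations * in_tokens * in_price_per_m / 1_000_000
    output_cost = conversations * out_tokens * out_price_per_m / 1_000_000
    return input_cost + output_cost

# The worked example: 10,000 conversations/day, 500 input / 300 output tokens
gpt4 = daily_cost(10_000, 500, 300, 10.00, 30.00)   # GPT-4 Turbo
gpt35 = daily_cost(10_000, 500, 300, 0.50, 1.50)    # GPT-3.5 Turbo
print(f"GPT-4 Turbo: ${gpt4:.2f}/day, GPT-3.5 Turbo: ${gpt35:.2f}/day")
```

Re-run the comparison whenever your average token counts or the provider's pricing changes.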
Jargon: "Token"
A chunk of text used by AI models—roughly 4 characters or 0.75 words. Models charge per token processed.
Measuring latency
Latency is how long users wait for AI responses. Key metrics:
- Time-to-first-token (TTFT): How long until the first word appears (critical for streaming)
- Total response time: How long until the full response is complete
- Throughput: How many requests the system handles per second
Typical latencies (GPT-3.5 Turbo, 500 tokens)
- TTFT: 200-800ms
- Total response time: 2-5 seconds
- Streaming vs. non-streaming: Streaming feels 2-3x faster to users because they see output immediately
Users notice delays above 200ms. Sub-100ms feels instant. Above 1 second feels slow.
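TTFT and total response time are easy to measure yourself by timing a token stream. A sketch using a simulated stream (the `fake_stream` generator is a stand-in, not a real API client); point `measure_ttft` at the chunk iterator your streaming API call returns.

```python
import time

def measure_ttft(stream):
    """Consume a token stream; return (time_to_first_token, total_time, text).

    `stream` is any iterable of text chunks -- e.g. the chunk iterator
    returned by a streaming API call.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)

# Simulated stream standing in for a real streaming API response
def fake_stream():
    for word in ["Hello", " ", "world"]:
        time.sleep(0.01)  # pretend per-chunk network delay
        yield word

ttft, total, text = measure_ttft(fake_stream())
print(f"TTFT: {ttft*1000:.0f}ms, total: {total*1000:.0f}ms, text: {text!r}")
```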
Cost optimization techniques
1. Use the smallest model that works
Don't default to GPT-4 for everything. Most tasks—chat, summarization, Q&A—work fine with GPT-3.5 or Claude Haiku.
Decision tree:
- Complex reasoning, coding, research? → GPT-4, Claude Opus
- Simple chat, FAQs, summaries? → GPT-3.5, Claude Haiku
- Embeddings, search? → Ada, Cohere Embed
2. Cache responses aggressively
If users ask the same questions, cache the answers instead of re-calling the API.
Example: A customer support bot might answer "What's your refund policy?" 50 times a day. Cache the first response, serve it from memory (cost: $0).
Tools: Redis, Memcached, in-memory cache (LRU).
Savings: 30-70% cost reduction for repetitive queries.
3. Compress prompts
Every word in your prompt costs money. Cut fluff.
Before (~28 tokens):
"I would like you to please summarize the following article for me in a way that is easy to understand and concise: [article]"
After (~8 tokens):
"Summarize this article concisely: [article]"
Savings: roughly 70% fewer instruction tokens on every call.
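You can sanity-check prompt overhead with the rough 4-characters-per-token heuristic mentioned earlier. A sketch; for exact counts use a real tokenizer such as tiktoken.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token. Heuristic only;
    use a real tokenizer (e.g. tiktoken) for billing-accurate counts."""
    return max(1, len(text) // 4)

before = ("I would like you to please summarize the following article "
          "for me in a way that is easy to understand and concise:")
after = "Summarize this article concisely:"

saved = 1 - estimate_tokens(after) / estimate_tokens(before)
print(f"{estimate_tokens(before)} -> {estimate_tokens(after)} tokens "
      f"(~{saved:.0%} of the instruction overhead removed)")
```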
4. Use shorter contexts
Don't dump entire documents into the prompt. Use RAG to retrieve only relevant chunks (see Embeddings & RAG Explained).
Example: Instead of sending 10,000-token docs to the model, retrieve the top 3 relevant paragraphs (500 tokens). Saves 95% on input costs.
5. Batch requests
If you're processing 1,000 documents, don't send 1,000 individual API calls. Batch them (if the API supports it) or process in parallel with rate limits.
OpenAI Batch API: 50% discount for non-urgent tasks (results in 24 hours).
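When you process in parallel instead of batching, cap concurrency so you stay under the provider's rate limit. A sketch using `asyncio` with a semaphore; `fake_summarize` is a stand-in for a real async client call.

```python
import asyncio

async def process_all(items, worker, max_concurrent=5):
    """Run `worker` over all items in parallel, capped by a semaphore
    so at most `max_concurrent` requests are in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(run_one(i) for i in items))

# Stand-in for an API call -- replace with your real async client
async def fake_summarize(doc: str) -> str:
    await asyncio.sleep(0.01)  # simulated network latency
    return f"summary of {doc}"

docs = [f"doc-{i}" for i in range(20)]
results = asyncio.run(process_all(docs, fake_summarize, max_concurrent=5))
print(len(results), results[0])
```

Tune `max_concurrent` to your rate limit; too high triggers 429 errors, too low wastes wall-clock time.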
6. Set max tokens
Limit output length with max_tokens. If you only need a 50-word summary, don't let the model generate 500 words.
Savings: 90% on output costs for short tasks.
7. Use streaming for perceived speed
Streaming doesn't reduce cost, but it feels faster to users. They see the response building in real-time instead of staring at a loading spinner.
Implementation: Set stream: true in API calls (OpenAI, Anthropic support this).
8. Monitor and alert
Set up spending alerts so you know if costs spike. Track:
- Daily/weekly costs
- Cost per request
- Token usage by endpoint
Tools: OpenAI usage dashboard, custom analytics (log API calls, track costs in your DB).
Latency optimization techniques
1. Stream responses
Non-streaming: the user waits 5 seconds, then sees the full response at once.
Streaming: the user sees the first words after ~500ms, and the perceived wait drops sharply.
Always stream for user-facing chat.
2. Use smaller, faster models
GPT-3.5 typically responds in ~2 seconds, GPT-4 in ~5 seconds, and Claude Haiku is faster still (1-2 seconds).
For speed-critical tasks, use the smallest viable model.
3. Reduce prompt size
Larger prompts = slower processing. Keep prompts under 1,000 tokens when possible.
4. Edge deployment (advanced)
Run models close to users (edge servers, CDNs like Cloudflare Workers). Reduces network latency.
Example: Hosting a small model on the edge can shave 200-500ms off response time vs. a US-based API from Europe.
Trade-off: More complex infrastructure, limited model sizes.
5. Precompute when possible
If you know users will ask certain questions, generate answers ahead of time and serve them instantly.
Example: FAQ bot pre-generates answers to common questions, stores them in a DB. Zero latency, zero cost per query.
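The precompute pattern is simple: generate answers offline, serve them from a lookup at runtime. A sketch with a hypothetical FAQ list and a stand-in generator function; in production the answers would come from one real LLM call each and live in a DB or KV store.

```python
# Precompute answers at deploy time, serve them instantly at runtime.
FAQ_QUESTIONS = [
    "What's your refund policy?",
    "How do I reset my password?",
]

def generate_answer(question: str) -> str:
    """Stand-in for an LLM call made once, offline, per question."""
    return f"(pre-generated answer to: {question})"

# Built once at startup/deploy -- store in a DB or KV store in production
PRECOMPUTED = {q.lower(): generate_answer(q) for q in FAQ_QUESTIONS}

def answer(question: str):
    """Zero-latency, zero-cost path for known questions; returns None
    to signal a fallback to the live model."""
    return PRECOMPUTED.get(question.strip().lower())

print(answer("what's your refund policy?"))
```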
6. Parallelize tasks
If you need to process multiple steps (embeddings, retrieval, LLM call), run them in parallel where possible.
Example: While retrieving docs from a vector DB, start warming up the API connection. Saves 100-200ms.
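Overlapping independent steps is a one-liner with `asyncio.gather`. A sketch where `retrieve_chunks` and `warm_connection` are hypothetical stand-ins for a vector-DB lookup and an API connection warmup; run concurrently, the wait is the slower of the two rather than their sum.

```python
import asyncio

async def retrieve_chunks(query: str) -> list:
    """Stand-in for a vector-DB lookup."""
    await asyncio.sleep(0.05)  # simulated 50ms retrieval
    return [f"chunk about {query}"]

async def warm_connection() -> str:
    """Stand-in for warming up the LLM API connection."""
    await asyncio.sleep(0.04)  # simulated 40ms handshake
    return "connection-ready"

async def answer_query(query: str) -> str:
    # Run retrieval and warmup concurrently instead of sequentially:
    # total wait is max(50ms, 40ms), not 90ms.
    chunks, conn = await asyncio.gather(retrieve_chunks(query),
                                        warm_connection())
    return f"[{conn}] answer using {chunks[0]}"

print(asyncio.run(answer_query("refund policy")))
```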
Choosing the right model
When to use GPT-4 (or Claude Opus)
- Complex reasoning (legal analysis, research, code reviews)
- Long, nuanced writing
- Multi-step problem solving
- High-stakes accuracy
When to use GPT-3.5 (or Claude Haiku)
- Simple chat, FAQs
- Summarization
- High-volume tasks
- Cost-sensitive applications
When to use embeddings (Ada, Cohere)
- Search, RAG, similarity matching
- Clustering, recommendations
When to self-host (advanced)
- Very high volume (100k+ requests/day)
- Specialized models (fine-tuned, domain-specific)
- Data privacy requirements (on-prem)
Cost crossover: Cloud APIs are cheaper below ~50k requests/day. Above that, self-hosting can save 50-80%.
Real-world cost-saving examples
Example 1: Customer support bot
Before:
- 5,000 queries/day
- GPT-4 for all queries
- $70/day = $2,100/month
After:
- Cache 40% of common FAQs (free)
- Use GPT-3.5 for 50% of queries
- Use GPT-4 for 10% of complex queries
- $8/day = $240/month
Savings: $1,860/month (89%)
Example 2: Content summarizer
Before:
- 1,000 articles/day
- Full article (5,000 tokens) sent to GPT-3.5
- $7.50/day = $225/month
After:
- Use RAG to extract top 500 tokens
- Compress prompts
- $0.75/day = $22.50/month
Savings: $202.50/month (90%)
Example 3: Search system (RAG)
Before:
- 10,000 searches/day
- Retrieve 10 chunks (2,000 tokens) + GPT-3.5
- $25/day = $750/month
After:
- Retrieve 3 chunks (600 tokens)
- Cache top 20% of queries
- Use Claude Haiku (cheaper)
- $6/day = $180/month
Savings: $570/month (76%)
Infrastructure choices
Cloud APIs (OpenAI, Anthropic, Cohere)
- Pros: No setup, auto-scaling, latest models
- Cons: Per-request cost, no control over infrastructure
- Best for: Startups, MVPs, low-to-medium volume
Self-hosted (Hugging Face, vLLM, Ollama)
- Pros: Cost-effective at scale, full control, data privacy
- Cons: Setup complexity, GPU costs, maintenance
- Best for: High volume (100k+ requests/day), specialized models
Hybrid
- Use cloud APIs for prototyping
- Switch to self-hosted for high-volume tasks
- Keep edge cases on cloud APIs
Monitoring and alerts
Track these metrics:
- Cost per request (detect inefficiencies)
- Total monthly spend (budget control)
- 95th percentile latency (catch slow outliers)
- Cache hit rate (optimize caching)
- Error rate (failed API calls waste money)
Tools: OpenAI dashboard, Datadog, custom logging (store API call metadata in your DB).
Set alerts:
- Daily spend > $X
- Latency > 5 seconds
- Error rate > 5%
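The alert thresholds above can be checked with a few lines against whatever metrics store you log API calls to. A sketch; the `metrics` dict shape and the `daily_budget` default are assumptions, so adapt both to your own logging schema.

```python
def check_alerts(metrics, daily_budget=50.0):
    """Return a list of alert strings for thresholds like those above.

    `metrics` is assumed to come from your own logging, e.g.:
    {"daily_spend": 62.0, "p95_latency_s": 6.2, "error_rate": 0.02}
    """
    alerts = []
    if metrics.get("daily_spend", 0) > daily_budget:
        alerts.append(f"daily spend ${metrics['daily_spend']:.2f} over budget")
    if metrics.get("p95_latency_s", 0) > 5:
        alerts.append(f"p95 latency {metrics['p95_latency_s']:.1f}s > 5s")
    if metrics.get("error_rate", 0) > 0.05:
        alerts.append(f"error rate {metrics['error_rate']:.0%} > 5%")
    return alerts

print(check_alerts({"daily_spend": 62.0, "p95_latency_s": 6.2,
                    "error_rate": 0.02}))
```

Wire the returned list into whatever notifier you already use (email, Slack, PagerDuty).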
Common pitfalls
1. Not caching
Repeated queries cost money. Cache everything you can.
2. Over-relying on GPT-4
Use it only when necessary. GPT-3.5 handles 80% of tasks at 1/20th the cost.
3. Ignoring prompt size
Every extra word costs money. Compress aggressively.
4. Not streaming
Users hate waiting. Stream to improve perceived speed.
5. No spending alerts
Costs can spiral. Set alerts early.
Use responsibly
- Don't sacrifice quality for cost (users notice bad answers)
- Monitor for bias and errors (cheaper models can be less accurate)
- Test changes rigorously (A/B test model swaps to verify quality)
- Set hard spending limits (prevent runaway costs in production)
What's next?
- Deployment Patterns: Learn about serverless, edge, and container options
- Embeddings & RAG: Optimize retrieval for cost and speed
- Evaluations 201: Measure quality vs. cost trade-offs
- Prompting 201: Advanced techniques for efficient prompts
Frequently Asked Questions
How much does it cost to run an AI chatbot?
It depends on volume and model choice. Using the token counts from the worked example above, a small chatbot handling 1,000 conversations per day on GPT-3.5 Turbo costs roughly $20/month; the same volume on GPT-4 Turbo runs about $420/month. Caching common queries and using smaller models for simple questions can cut costs by 50-90%.
What is the fastest way to reduce AI API costs?
Cache responses for frequently asked questions and use the smallest model that delivers acceptable quality. These two changes alone typically reduce costs by 60-80%. Compressing prompts and limiting output token length provide additional savings.
Why is streaming important for AI user experience?
Streaming shows users the response as it generates, word by word. Even though total response time stays the same, users see the first token in 200-800ms instead of waiting 5+ seconds for the complete response. This makes the experience feel 2-3x faster.
When should I consider self-hosting AI models instead of using APIs?
Self-hosting becomes cost-effective above roughly 50,000-100,000 requests per day, when you have strict data privacy requirements, or when you need specialized fine-tuned models. Below that volume, cloud APIs are simpler and cheaper. Many teams use a hybrid approach.
How do I prevent unexpected AI cost spikes?
Set daily spending limits and alerts with your API provider. Monitor cost per request and total monthly spend. Use rate limiting to cap request volume. Test changes in staging before production, and always set max_tokens on API calls to prevent runaway output costs.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Latency
The time delay between sending a request to an AI model and receiving the first part of its response. Lower latency means faster replies.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Related Guides
Monitoring AI Systems in Production (Advanced · 20 min read)
Enterprise-grade monitoring, alerting, and observability for production AI systems. Learn to track performance, costs, quality, and security at scale.
Efficient Inference Optimization (Advanced · 8 min read)
Optimize AI inference for speed and cost: batching, caching, model serving, KV cache, speculative decoding, and more.
Context Management: Handling Long Conversations and Documents (Intermediate · 12 min read)
Master context window management for AI. Learn strategies for long conversations, document processing, memory systems, and context optimization.