TL;DR

AI can drain your budget and frustrate users if you don't optimize for cost and speed. Use smaller models when possible, cache aggressively, stream responses, batch requests, and compress prompts. Choose the right model for each task—GPT-4 for complexity, GPT-3.5 for speed and savings. Monitor usage, set spending alerts, and test latency regularly. Small changes can cut costs by 50-90% and reduce response times from seconds to milliseconds.

Why it matters

Every AI API call costs money. Every second of latency loses users. Whether you're running a chatbot, content generator, or search system, you need to balance quality, speed, and cost. A poorly optimized AI system can rack up thousands in monthly bills and deliver slow, frustrating experiences. Optimization isn't optional—it's essential for sustainable AI products.

Understanding AI costs

AI costs come from API calls, compute, and storage. Most production systems use API-based models (OpenAI, Anthropic, Cohere), which charge per token—roughly 4 characters of text, or about 0.75 words.

Pricing examples (as of 2024)

Model            | Input cost        | Output cost       | Use case
GPT-4 Turbo      | $10 / 1M tokens   | $30 / 1M tokens   | Complex reasoning, writing, analysis
GPT-3.5 Turbo    | $0.50 / 1M tokens | $1.50 / 1M tokens | Chatbots, fast tasks, high-volume
Claude 3 Opus    | $15 / 1M tokens   | $75 / 1M tokens   | Detailed analysis, long contexts
Claude 3 Haiku   | $0.25 / 1M tokens | $1.25 / 1M tokens | Fast tasks, high-throughput
Embeddings (Ada) | $0.10 / 1M tokens | N/A               | Search, RAG, similarity

Real cost breakdown

Let's say your chatbot processes 10,000 conversations per day. Each conversation averages 500 input tokens (user + context) and 300 output tokens (AI response).

Daily cost with GPT-4 Turbo:

  • Input: 10,000 × 500 × $10 / 1M = $50
  • Output: 10,000 × 300 × $30 / 1M = $90
  • Total: $140/day = $4,200/month

Daily cost with GPT-3.5 Turbo:

  • Input: 10,000 × 500 × $0.50 / 1M = $2.50
  • Output: 10,000 × 300 × $1.50 / 1M = $4.50
  • Total: $7/day = $210/month

Savings: 95% by switching models for simpler tasks.
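
These numbers are easy to sanity-check in a few lines. Here is a minimal sketch of the arithmetic above, with prices hard-coded from the pricing table (update them to current rates):

    # Rough daily/monthly cost estimate for the chatbot workload above.
    # Prices are $ per 1M tokens, taken from the 2024 pricing table.
    PRICES = {
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
        "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    }

    def daily_cost(model, requests_per_day, input_tokens, output_tokens):
        p = PRICES[model]
        cost_in = requests_per_day * input_tokens * p["input"] / 1_000_000
        cost_out = requests_per_day * output_tokens * p["output"] / 1_000_000
        return cost_in + cost_out

    for model in PRICES:
        per_day = daily_cost(model, 10_000, 500, 300)
        print(f"{model}: ${per_day:.2f}/day, about ${per_day * 30:,.0f}/month")
    # gpt-4-turbo: $140.00/day, about $4,200/month
    # gpt-3.5-turbo: $7.00/day, about $210/month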

Jargon: "Token"
A chunk of text used by AI models—roughly 4 characters or 0.75 words. Models charge per token processed.

Measuring latency

Latency is how long users wait for AI responses. Key metrics:

  • Time-to-first-token (TTFT): How long until the first word appears (critical for streaming)
  • Total response time: How long until the full response is complete
  • Throughput: How many requests the system handles per second

Typical latencies (GPT-3.5 Turbo, 500 tokens)

  • TTFT: 200-800ms
  • Total response time: 2-5 seconds
  • Streaming vs. non-streaming: Streaming feels 2-3x faster to users because they see output immediately

Users notice delays above 200ms. Sub-100ms feels instant. Above 1 second feels slow.
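
If you want hard numbers rather than impressions, measure TTFT and total time yourself. Here is a minimal sketch using the OpenAI Python SDK (v1-style client, streaming enabled); the model and prompt are placeholders, and in practice you would run this over many requests and look at the 95th percentile:

    # Measure time-to-first-token (TTFT) and total response time for one
    # streaming request. Assumes the OpenAI Python SDK v1+ with OPENAI_API_KEY set.
    import time
    from openai import OpenAI

    client = OpenAI()
    start = time.perf_counter()
    first_token_at = None

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain caching in one paragraph."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
            first_token_at = time.perf_counter()

    total = time.perf_counter() - start
    print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms, total: {total:.2f} s")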

Cost optimization techniques

1. Use the smallest model that works

Don't default to GPT-4 for everything. Most tasks—chat, summarization, Q&A—work fine with GPT-3.5 or Claude Haiku.

Decision tree:

  • Complex reasoning, coding, research? → GPT-4, Claude Opus
  • Simple chat, FAQs, summaries? → GPT-3.5, Claude Haiku
  • Embeddings, search? → Ada, Cohere Embed
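
In code, this decision tree is often just a small routing function that every request passes through. A sketch, under the assumption that you tag requests with a task type upstream (the labels and mapping here are illustrative, not a standard API):

    # Illustrative model router: pick the cheapest model that handles the task.
    MODEL_FOR_TASK = {
        "complex_reasoning": "gpt-4-turbo",
        "chat": "gpt-3.5-turbo",
        "summary": "gpt-3.5-turbo",
        "search_embedding": "text-embedding-ada-002",
    }

    def pick_model(task_type: str) -> str:
        # Unknown task types fall back to the cheap chat model.
        return MODEL_FOR_TASK.get(task_type, "gpt-3.5-turbo")

    print(pick_model("summary"))            # gpt-3.5-turbo
    print(pick_model("complex_reasoning"))  # gpt-4-turbo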

2. Cache responses aggressively

If users ask the same questions, cache the answers instead of re-calling the API.

Example: A customer support bot might answer "What's your refund policy?" 50 times a day. Cache the first response, serve it from memory (cost: $0).

Tools: Redis, Memcached, in-memory cache (LRU).

Savings: 30-70% cost reduction for repetitive queries.
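
A minimal sketch of an in-process cache keyed on the normalized question; call_llm is a placeholder for your real API call, and Redis or Memcached would slot in the same way for multi-server setups:

    # In-memory LRU cache: identical questions hit the API only once.
    from functools import lru_cache

    def call_llm(prompt: str) -> str:
        # Placeholder: replace with your real chat-completions call.
        return f"(LLM answer to: {prompt})"

    def _normalize(prompt: str) -> str:
        # Treat "What's your refund policy?" and "what's your refund policy "
        # as the same cache key.
        return " ".join(prompt.lower().split())

    @lru_cache(maxsize=10_000)
    def _cached(prompt_key: str) -> str:
        return call_llm(prompt_key)  # only reached on a cache miss

    def answer(prompt: str) -> str:
        return _cached(_normalize(prompt))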

3. Compress prompts

Every word in your prompt costs money. Cut fluff.

Before (~25 tokens):

"I would like you to please summarize the following article for me in a way that is easy to understand and concise: [article]"

After (~6 tokens):

"Summarize this article concisely: [article]"

Savings: roughly 75% fewer instruction tokens (the article itself costs the same either way).
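
You can check what compression actually saves by counting tokens with the tiktoken library; a small sketch (cl100k_base is the encoding used by GPT-3.5/GPT-4-era models):

    # Compare token counts for two prompt variants.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    before = ("I would like you to please summarize the following article for me "
              "in a way that is easy to understand and concise:")
    after = "Summarize this article concisely:"

    for label, text in [("before", before), ("after", after)]:
        print(label, len(enc.encode(text)), "tokens")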

4. Use shorter contexts

Don't dump entire documents into the prompt. Use RAG to retrieve only relevant chunks (see Embeddings & RAG Explained).

Example: Instead of sending 10,000-token docs to the model, retrieve the top 3 relevant paragraphs (500 tokens). Saves 95% on input costs.
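
A minimal sketch of "send only what's relevant", assuming you already have embedding vectors for your chunks and the query (for example, from an embeddings API); only the top-k chunks make it into the prompt:

    # Keep the prompt small: include only the k most similar chunks.
    import numpy as np

    def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
        q = np.asarray(query_vec, dtype=float)
        m = np.asarray(chunk_vecs, dtype=float)
        sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
        best = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
        return [chunks[i] for i in best]

    # prompt = "Answer using only this context:\n" + "\n".join(
    #     top_k_chunks(query_vec, chunk_vecs, chunks, k=3))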

5. Batch requests

If you're processing 1,000 documents, don't send 1,000 individual API calls. Batch them (if the API supports it) or process in parallel with rate limits.

OpenAI Batch API: 50% discount for non-urgent tasks (results returned within 24 hours).
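
For providers or workloads where a batch endpoint isn't an option, the usual pattern is bounded concurrency: fire requests in parallel but cap how many are in flight. A sketch with asyncio (process_one is a placeholder for your real async API call):

    # Process many documents concurrently, capped by a semaphore so you
    # stay under the provider's rate limits.
    import asyncio

    async def process_one(doc: str) -> str:
        await asyncio.sleep(0.1)  # placeholder: replace with your async API call
        return f"summary of: {doc[:20]}"

    async def process_all(docs, max_concurrency=10):
        sem = asyncio.Semaphore(max_concurrency)

        async def bounded(doc):
            async with sem:
                return await process_one(doc)

        return await asyncio.gather(*(bounded(d) for d in docs))

    # results = asyncio.run(process_all(["doc one ...", "doc two ..."]))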

6. Set max tokens

Limit output length with max_tokens. If you only need a 50-word summary, don't let the model generate 500 words.

Savings: up to 90% on output costs for short tasks.
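
Capping output is a one-parameter change. A sketch with the OpenAI Python SDK; the cap of 80 tokens is an arbitrary choice for a roughly 50-word summary:

    # Cap output length so a short summary can't balloon into 500 tokens.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize this article concisely: ..."}],
        max_tokens=80,  # ~50-60 words; tune to your task
    )
    print(resp.choices[0].message.content)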

7. Use streaming for perceived speed

Streaming doesn't reduce cost, but it feels faster to users. They see the response building in real-time instead of staring at a loading spinner.

Implementation: Set stream: true in API calls (OpenAI, Anthropic support this).
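
A minimal streaming sketch with the OpenAI Python SDK (Anthropic's SDK has an equivalent streaming mode); tokens are printed as they arrive instead of after the full response is done:

    # Stream the response and show tokens to the user as they arrive.
    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain token pricing in two sentences."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()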

8. Monitor and alert

Set up spending alerts so you know if costs spike. Track:

  • Daily/weekly costs
  • Cost per request
  • Token usage by endpoint

Tools: OpenAI usage dashboard, custom analytics (log API calls, track costs in your DB).
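
A sketch of per-call tracking that you can wire into whatever logging or analytics stack you already use; it assumes the OpenAI SDK's usage field and reuses prices from the table above:

    # Log token usage, estimated cost, and latency for every call.
    import time
    from openai import OpenAI

    PRICE = {"gpt-3.5-turbo": (0.50, 1.50)}  # $ per 1M input / output tokens

    client = OpenAI()

    def tracked_chat(model, messages):
        start = time.perf_counter()
        resp = client.chat.completions.create(model=model, messages=messages)
        latency = time.perf_counter() - start
        p_in, p_out = PRICE[model]
        cost = (resp.usage.prompt_tokens * p_in
                + resp.usage.completion_tokens * p_out) / 1_000_000
        # Replace print with a write to your DB or metrics system.
        print(f"{model} in={resp.usage.prompt_tokens} out={resp.usage.completion_tokens} "
              f"cost=${cost:.5f} latency={latency:.2f}s")
        return resp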

Latency optimization techniques

1. Stream responses

Non-streaming: User waits 5 seconds, then sees the full response.
Streaming: User sees the first words after ~500ms. Total generation time is the same, but the wait feels far shorter.

Always stream for user-facing chat.

2. Use smaller, faster models

GPT-3.5 responds in 2 seconds. GPT-4 takes ~5 seconds. Claude Haiku is even faster (1-2 seconds).

For speed-critical tasks, use the smallest viable model.

3. Reduce prompt size

Larger prompts = slower processing. Keep prompts under 1,000 tokens when possible.

4. Edge deployment (advanced)

Run models close to users (edge servers, CDNs like Cloudflare Workers). Reduces network latency.

Example: Hosting a small model on the edge can shave 200-500ms off response time vs. a US-based API from Europe.

Trade-off: More complex infrastructure, limited model sizes.

5. Precompute when possible

If you know users will ask certain questions, generate answers ahead of time and serve them instantly.

Example: FAQ bot pre-generates answers to common questions, stores them in a DB. Zero latency, zero cost per query.
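
A sketch of the precompute pattern: generate answers offline, store them, and serve lookups at request time. generate_answer is a placeholder for your real LLM call, and the dict stands in for a proper DB:

    # Offline: pre-generate answers for known questions.
    # Online: serve from the store; call the LLM only for unseen questions.
    FAQ_QUESTIONS = [
        "What's your refund policy?",
        "How do I reset my password?",
    ]

    def generate_answer(question: str) -> str:
        # Placeholder: replace with your real LLM call.
        return f"(generated answer for: {question})"

    answer_store = {q: generate_answer(q) for q in FAQ_QUESTIONS}  # run offline

    def handle(question: str) -> str:
        if question in answer_store:
            return answer_store[question]  # instant, no API cost
        return generate_answer(question)   # rare fallback to the live model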

6. Parallelize tasks

If you need to process multiple steps (embeddings, retrieval, LLM call), run them in parallel where possible.

Example: While retrieving docs from a vector DB, start warming up the API connection. Saves 100-200ms.
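
A sketch of overlapping independent steps with asyncio.gather; retrieve_docs and warm_up_connection are hypothetical placeholders for your retrieval call and connection warm-up:

    # Run independent steps concurrently instead of one after another.
    import asyncio

    async def retrieve_docs(query: str) -> list:
        await asyncio.sleep(0.15)  # placeholder: vector DB lookup
        return ["chunk 1", "chunk 2", "chunk 3"]

    async def warm_up_connection() -> None:
        await asyncio.sleep(0.10)  # placeholder: open or reuse the HTTPS connection

    async def prepare(query: str) -> list:
        docs, _ = await asyncio.gather(retrieve_docs(query), warm_up_connection())
        return docs  # next: build the prompt from docs and call the LLM

    # asyncio.run(prepare("refund policy"))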

Choosing the right model

When to use GPT-4 (or Claude Opus)

  • Complex reasoning (legal analysis, research, code reviews)
  • Long, nuanced writing
  • Multi-step problem solving
  • High-stakes accuracy

When to use GPT-3.5 (or Claude Haiku)

  • Simple chat, FAQs
  • Summarization
  • High-volume tasks
  • Cost-sensitive applications

When to use embeddings (Ada, Cohere)

  • Search, RAG, similarity matching
  • Clustering, recommendations

When to self-host (advanced)

  • Very high volume (100k+ requests/day)
  • Specialized models (fine-tuned, domain-specific)
  • Data privacy requirements (on-prem)

Cost crossover: Cloud APIs are cheaper below ~50k requests/day. Above that, self-hosting can save 50-80%.

Real-world cost-saving examples

Example 1: Customer support bot

Before:

  • 5,000 queries/day
  • GPT-4 for all queries
  • $70/day = $2,100/month

After:

  • Cache 40% of common FAQs (free)
  • Use GPT-3.5 for 50% of queries
  • Use GPT-4 for 10% of complex queries
  • ~$9/day = ~$265/month

Savings: ~$1,835/month (roughly 87%)

Example 2: Content summarizer

Before:

  • 1,000 articles/day
  • Full article (5,000 tokens) sent to GPT-3.5
  • Input cost: ~$2.50/day = $75/month at GPT-3.5 Turbo rates (output cost is the same before and after)

After:

  • Use RAG to extract top 500 tokens
  • Compress prompts
  • Input cost: ~$0.25/day = $7.50/month

Savings: ~$67.50/month (90% of input costs)

Example 3: Search system (RAG)

Before:

  • 10,000 searches/day
  • Retrieve 10 chunks (2,000 tokens) + GPT-3.5
  • $25/day = $750/month

After:

  • Retrieve 3 chunks (600 tokens)
  • Cache top 20% of queries
  • Use Claude Haiku (cheaper)
  • $6/day = $180/month

Savings: $570/month (76%)

Infrastructure choices

Cloud APIs (OpenAI, Anthropic, Cohere)

  • Pros: No setup, auto-scaling, latest models
  • Cons: Per-request cost, no control over infrastructure
  • Best for: Startups, MVPs, low-to-medium volume

Self-hosted (Hugging Face, vLLM, Ollama)

  • Pros: Cost-effective at scale, full control, data privacy
  • Cons: Setup complexity, GPU costs, maintenance
  • Best for: High volume (100k+ requests/day), specialized models

Hybrid

  • Use cloud APIs for prototyping
  • Switch to self-hosted for high-volume tasks
  • Keep edge cases on cloud APIs

Monitoring and alerts

Track these metrics:

  1. Cost per request (detect inefficiencies)
  2. Total monthly spend (budget control)
  3. 95th percentile latency (catch slow outliers)
  4. Cache hit rate (optimize caching)
  5. Error rate (failed API calls waste money)

Tools: OpenAI dashboard, Datadog, custom logging (store API call metadata in your DB).

Set alerts:

  • Daily spend > $X
  • Latency > 5 seconds
  • Error rate > 5%

Common pitfalls

1. Not caching

Repeated queries cost money. Cache everything you can.

2. Over-relying on GPT-4

Use it only when necessary. GPT-3.5 handles 80% of tasks at 1/20th the cost.

3. Ignoring prompt size

Every extra word costs money. Compress aggressively.

4. Not streaming

Users hate waiting. Stream to improve perceived speed.

5. No spending alerts

Costs can spiral. Set alerts early.

Use responsibly

  • Don't sacrifice quality for cost (users notice bad answers)
  • Monitor for bias and errors (cheaper models can be less accurate)
  • Test changes rigorously (A/B test model swaps to verify quality)
  • Set hard spending limits (prevent runaway costs in production)

What's next?

  • Deployment Patterns: Learn about serverless, edge, and container options
  • Embeddings & RAG: Optimize retrieval for cost and speed
  • Evaluations 201: Measure quality vs. cost trade-offs
  • Prompting 201: Advanced techniques for efficient prompts