TL;DR

AI can drain your budget and frustrate users if you don't optimize for cost and speed. Use smaller models when possible, cache aggressively, stream responses, batch requests, and compress prompts. Choose the right model for each task—GPT-4 for complexity, GPT-3.5 for speed and savings. Monitor usage, set spending alerts, and test latency regularly. Small changes can cut costs by 50-90% and reduce response times from seconds to milliseconds.

Why it matters

Every AI API call costs money. Every second of latency loses users. Whether you're running a chatbot, content generator, or search system, you need to balance quality, speed, and cost. A poorly optimized AI system can rack up thousands in monthly bills and deliver slow, frustrating experiences. Optimization isn't optional—it's essential for sustainable AI products.

Understanding AI costs

AI costs come from API calls, compute, and storage. Most production systems use API-based models (OpenAI, Anthropic, Cohere), which charge per token—roughly 4 characters of text, or about 0.75 words.

Pricing examples (as of 2024)

Model            | Input cost        | Output cost       | Use case
GPT-4 Turbo      | $10 / 1M tokens   | $30 / 1M tokens   | Complex reasoning, writing, analysis
GPT-3.5 Turbo    | $0.50 / 1M tokens | $1.50 / 1M tokens | Chatbots, fast tasks, high-volume
Claude 3 Opus    | $15 / 1M tokens   | $75 / 1M tokens   | Detailed analysis, long contexts
Claude 3 Haiku   | $0.25 / 1M tokens | $1.25 / 1M tokens | Fast tasks, high-throughput
Embeddings (Ada) | $0.10 / 1M tokens | N/A               | Search, RAG, similarity

Real cost breakdown

Let's say your chatbot processes 10,000 conversations per day. Each conversation averages 500 input tokens (user + context) and 300 output tokens (AI response).

Daily cost with GPT-4 Turbo:

  • Input: 10,000 × 500 × $10 / 1M = $50
  • Output: 10,000 × 300 × $30 / 1M = $90
  • Total: $140/day = $4,200/month

Daily cost with GPT-3.5 Turbo:

  • Input: 10,000 × 500 × $0.50 / 1M = $2.50
  • Output: 10,000 × 300 × $1.50 / 1M = $4.50
  • Total: $7/day = $210/month

Savings: 95% by switching models for simpler tasks.
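
These numbers are easy to sanity-check in a few lines. Here is a minimal sketch of the arithmetic above, with prices hard-coded from the pricing table (update them to current rates):

    # Rough daily/monthly cost estimate for the chatbot workload above.
    # Prices are $ per 1M tokens, taken from the 2024 pricing table.
    PRICES = {
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
        "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    }

    def daily_cost(model, requests_per_day, input_tokens, output_tokens):
        p = PRICES[model]
        cost_in = requests_per_day * input_tokens * p["input"] / 1_000_000
        cost_out = requests_per_day * output_tokens * p["output"] / 1_000_000
        return cost_in + cost_out

    for model in PRICES:
        per_day = daily_cost(model, 10_000, 500, 300)
        print(f"{model}: ${per_day:.2f}/day, about ${per_day * 30:,.0f}/month")
    # gpt-4-turbo: $140.00/day, about $4,200/month
    # gpt-3.5-turbo: $7.00/day, about $210/month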

Jargon: "Token"
A chunk of text used by AI models—roughly 4 characters or 0.75 words. Models charge per token processed.

Measuring latency

Latency is how long users wait for AI responses. Key metrics:

  • Time-to-first-token (TTFT): How long until the first word appears (critical for streaming)
  • Total response time: How long until the full response is complete
  • Throughput: How many requests the system handles per second

Typical latencies (GPT-3.5 Turbo, 500 tokens)

  • TTFT: 200-800ms
  • Total response time: 2-5 seconds
  • Streaming vs. non-streaming: Streaming feels 2-3x faster to users because they see output immediately

Users notice delays above 200ms. Sub-100ms feels instant. Above 1 second feels slow.
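
If you want hard numbers rather than impressions, measure TTFT and total time yourself. Here is a minimal sketch using the OpenAI Python SDK (v1-style client, streaming enabled); the model and prompt are placeholders, and in practice you would run this over many requests and look at the 95th percentile:

    # Measure time-to-first-token (TTFT) and total response time for one
    # streaming request. Assumes the OpenAI Python SDK v1+ with OPENAI_API_KEY set.
    import time
    from openai import OpenAI

    client = OpenAI()
    start = time.perf_counter()
    first_token_at = None

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain caching in one paragraph."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
            first_token_at = time.perf_counter()

    total = time.perf_counter() - start
    print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms, total: {total:.2f} s")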

Cost optimization techniques

1. Use the smallest model that works

Don't default to GPT-4 for everything. Most tasks—chat, summarization, Q&A—work fine with GPT-3.5 or Claude Haiku.

Decision tree:

  • Complex reasoning, coding, research? → GPT-4, Claude Opus
  • Simple chat, FAQs, summaries? → GPT-3.5, Claude Haiku
  • Embeddings, search? → Ada, Cohere Embed
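
In code, this decision tree is often just a small routing function that every request passes through. A sketch, under the assumption that you tag requests with a task type upstream (the labels and mapping here are illustrative, not a standard API):

    # Illustrative model router: pick the cheapest model that handles the task.
    MODEL_FOR_TASK = {
        "complex_reasoning": "gpt-4-turbo",
        "chat": "gpt-3.5-turbo",
        "summary": "gpt-3.5-turbo",
        "search_embedding": "text-embedding-ada-002",
    }

    def pick_model(task_type: str) -> str:
        # Unknown task types fall back to the cheap chat model.
        return MODEL_FOR_TASK.get(task_type, "gpt-3.5-turbo")

    print(pick_model("summary"))            # gpt-3.5-turbo
    print(pick_model("complex_reasoning"))  # gpt-4-turbo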

2. Cache responses aggressively

If users ask the same questions, cache the answers instead of re-calling the API.

Example: A customer support bot might answer "What's your refund policy?" 50 times a day. Cache the first response, serve it from memory (cost: $0).

Tools: Redis, Memcached, in-memory cache (LRU).

Savings: 30-70% cost reduction for repetitive queries.
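
A minimal sketch of an in-process cache keyed on the normalized question; call_llm is a placeholder for your real API call, and Redis or Memcached would slot in the same way for multi-server setups:

    # In-memory LRU cache: identical questions hit the API only once.
    from functools import lru_cache

    def call_llm(prompt: str) -> str:
        # Placeholder: replace with your real chat-completions call.
        return f"(LLM answer to: {prompt})"

    def _normalize(prompt: str) -> str:
        # Treat "What's your refund policy?" and "what's your refund policy "
        # as the same cache key.
        return " ".join(prompt.lower().split())

    @lru_cache(maxsize=10_000)
    def _cached(prompt_key: str) -> str:
        return call_llm(prompt_key)  # only reached on a cache miss

    def answer(prompt: str) -> str:
        return _cached(_normalize(prompt))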

3. Compress prompts

Every word in your prompt costs money. Cut fluff.

Before (~25 tokens):

"I would like you to please summarize the following article for me in a way that is easy to understand and concise: [article]"

After (~6 tokens):

"Summarize this article concisely: [article]"

Savings: roughly 75% fewer instruction tokens (the article itself costs the same either way).
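
You can check what compression actually saves by counting tokens with the tiktoken library; a small sketch (cl100k_base is the encoding used by GPT-3.5/GPT-4-era models):

    # Compare token counts for two prompt variants.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    before = ("I would like you to please summarize the following article for me "
              "in a way that is easy to understand and concise:")
    after = "Summarize this article concisely:"

    for label, text in [("before", before), ("after", after)]:
        print(label, len(enc.encode(text)), "tokens")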

4. Use shorter contexts

Don't dump entire documents into the prompt. Use RAG to retrieve only relevant chunks (see Embeddings & RAG Explained).

Example: Instead of sending 10,000-token docs to the model, retrieve the top 3 relevant paragraphs (500 tokens). Saves 95% on input costs.
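
A minimal sketch of "send only what's relevant", assuming you already have embedding vectors for your chunks and the query (for example, from an embeddings API); only the top-k chunks make it into the prompt:

    # Keep the prompt small: include only the k most similar chunks.
    import numpy as np

    def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
        q = np.asarray(query_vec, dtype=float)
        m = np.asarray(chunk_vecs, dtype=float)
        sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
        best = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
        return [chunks[i] for i in best]

    # prompt = "Answer using only this context:\n" + "\n".join(
    #     top_k_chunks(query_vec, chunk_vecs, chunks, k=3))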

5. Batch requests

If you're processing 1,000 documents, don't send 1,000 individual API calls. Batch them (if the API supports it) or process in parallel with rate limits.

OpenAI Batch API: 50% discount for non-urgent tasks (results returned within 24 hours).
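
For providers or workloads where a batch endpoint isn't an option, the usual pattern is bounded concurrency: fire requests in parallel but cap how many are in flight. A sketch with asyncio (process_one is a placeholder for your real async API call):

    # Process many documents concurrently, capped by a semaphore so you
    # stay under the provider's rate limits.
    import asyncio

    async def process_one(doc: str) -> str:
        await asyncio.sleep(0.1)  # placeholder: replace with your async API call
        return f"summary of: {doc[:20]}"

    async def process_all(docs, max_concurrency=10):
        sem = asyncio.Semaphore(max_concurrency)

        async def bounded(doc):
            async with sem:
                return await process_one(doc)

        return await asyncio.gather(*(bounded(d) for d in docs))

    # results = asyncio.run(process_all(["doc one ...", "doc two ..."]))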

6. Set max tokens

Limit output length with max_tokens. If you only need a 50-word summary, don't let the model generate 500 words.

Savings: up to 90% on output costs for short tasks.
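
Capping output is a one-parameter change. A sketch with the OpenAI Python SDK; the cap of 80 tokens is an arbitrary choice for a roughly 50-word summary:

    # Cap output length so a short summary can't balloon into 500 tokens.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize this article concisely: ..."}],
        max_tokens=80,  # ~50-60 words; tune to your task
    )
    print(resp.choices[0].message.content)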

7. Use streaming for perceived speed

Streaming doesn't reduce cost, but it feels faster to users. They see the response building in real-time instead of staring at a loading spinner.

Implementation: Set stream: true in API calls (OpenAI, Anthropic support this).
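
A minimal streaming sketch with the OpenAI Python SDK (Anthropic's SDK has an equivalent streaming mode); tokens are printed as they arrive instead of after the full response is done:

    # Stream the response and show tokens to the user as they arrive.
    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain token pricing in two sentences."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()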

8. Monitor and alert

Set up spending alerts so you know if costs spike. Track:

  • Daily/weekly costs
  • Cost per request
  • Token usage by endpoint

Tools: OpenAI usage dashboard, custom analytics (log API calls, track costs in your DB).
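
A sketch of per-call tracking that you can wire into whatever logging or analytics stack you already use; it assumes the OpenAI SDK's usage field and reuses prices from the table above:

    # Log token usage, estimated cost, and latency for every call.
    import time
    from openai import OpenAI

    PRICE = {"gpt-3.5-turbo": (0.50, 1.50)}  # $ per 1M input / output tokens

    client = OpenAI()

    def tracked_chat(model, messages):
        start = time.perf_counter()
        resp = client.chat.completions.create(model=model, messages=messages)
        latency = time.perf_counter() - start
        p_in, p_out = PRICE[model]
        cost = (resp.usage.prompt_tokens * p_in
                + resp.usage.completion_tokens * p_out) / 1_000_000
        # Replace print with a write to your DB or metrics system.
        print(f"{model} in={resp.usage.prompt_tokens} out={resp.usage.completion_tokens} "
              f"cost=${cost:.5f} latency={latency:.2f}s")
        return resp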

Latency optimization techniques

1. Stream responses

Non-streaming: User waits 5 seconds, then sees the full response.
Streaming: User sees the first words after ~500ms. Total generation time is the same, but the wait feels far shorter.

Always stream for user-facing chat.

2. Use smaller, faster models

GPT-3.5 responds in 2 seconds. GPT-4 takes ~5 seconds. Claude Haiku is even faster (1-2 seconds).

For speed-critical tasks, use the smallest viable model.

3. Reduce prompt size

Larger prompts = slower processing. Keep prompts under 1,000 tokens when possible.

4. Edge deployment (advanced)

Run models close to users (edge servers, CDNs like Cloudflare Workers). Reduces network latency.

Example: Hosting a small model on the edge can shave 200-500ms off response time vs. a US-based API from Europe.

Trade-off: More complex infrastructure, limited model sizes.

5. Precompute when possible

If you know users will ask certain questions, generate answers ahead of time and serve them instantly.

Example: FAQ bot pre-generates answers to common questions, stores them in a DB. Zero latency, zero cost per query.
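
A sketch of the precompute pattern: generate answers offline, store them, and serve lookups at request time. generate_answer is a placeholder for your real LLM call, and the dict stands in for a proper DB:

    # Offline: pre-generate answers for known questions.
    # Online: serve from the store; call the LLM only for unseen questions.
    FAQ_QUESTIONS = [
        "What's your refund policy?",
        "How do I reset my password?",
    ]

    def generate_answer(question: str) -> str:
        # Placeholder: replace with your real LLM call.
        return f"(generated answer for: {question})"

    answer_store = {q: generate_answer(q) for q in FAQ_QUESTIONS}  # run offline

    def handle(question: str) -> str:
        if question in answer_store:
            return answer_store[question]  # instant, no API cost
        return generate_answer(question)   # rare fallback to the live model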

6. Parallelize tasks

If you need to process multiple steps (embeddings, retrieval, LLM call), run them in parallel where possible.

Example: While retrieving docs from a vector DB, start warming up the API connection. Saves 100-200ms.
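
A sketch of overlapping independent steps with asyncio.gather; retrieve_docs and warm_up_connection are hypothetical placeholders for your retrieval call and connection warm-up:

    # Run independent steps concurrently instead of one after another.
    import asyncio

    async def retrieve_docs(query: str) -> list:
        await asyncio.sleep(0.15)  # placeholder: vector DB lookup
        return ["chunk 1", "chunk 2", "chunk 3"]

    async def warm_up_connection() -> None:
        await asyncio.sleep(0.10)  # placeholder: open or reuse the HTTPS connection

    async def prepare(query: str) -> list:
        docs, _ = await asyncio.gather(retrieve_docs(query), warm_up_connection())
        return docs  # next: build the prompt from docs and call the LLM

    # asyncio.run(prepare("refund policy"))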

Choosing the right model

When to use GPT-4 (or Claude Opus)

  • Complex reasoning (legal analysis, research, code reviews)
  • Long, nuanced writing
  • Multi-step problem solving
  • High-stakes accuracy

When to use GPT-3.5 (or Claude Haiku)

  • Simple chat, FAQs
  • Summarization
  • High-volume tasks
  • Cost-sensitive applications

When to use embeddings (Ada, Cohere)

  • Search, RAG, similarity matching
  • Clustering, recommendations

When to self-host (advanced)

  • Very high volume (100k+ requests/day)
  • Specialized models (fine-tuned, domain-specific)
  • Data privacy requirements (on-prem)

Cost crossover: Cloud APIs are cheaper below ~50k requests/day. Above that, self-hosting can save 50-80%.

Real-world cost-saving examples

Example 1: Customer support bot

Before:

  • 5,000 queries/day
  • GPT-4 for all queries
  • $70/day = $2,100/month

After:

  • Cache 40% of common FAQs (free)
  • Use GPT-3.5 for 50% of queries
  • Use GPT-4 for 10% of complex queries
  • ~$9/day = ~$265/month

Savings: ~$1,835/month (roughly 87%)

Example 2: Content summarizer

Before:

  • 1,000 articles/day
  • Full article (5,000 tokens) sent to GPT-3.5
  • Input cost: ~$2.50/day = $75/month at GPT-3.5 Turbo rates (output cost is the same before and after)

After:

  • Use RAG to extract top 500 tokens
  • Compress prompts
  • Input cost: ~$0.25/day = $7.50/month

Savings: ~$67.50/month (90% of input costs)

Example 3: Search system (RAG)

Before:

  • 10,000 searches/day
  • Retrieve 10 chunks (2,000 tokens) + GPT-3.5
  • $25/day = $750/month

After:

  • Retrieve 3 chunks (600 tokens)
  • Cache top 20% of queries
  • Use Claude Haiku (cheaper)
  • $6/day = $180/month

Savings: $570/month (76%)

Infrastructure choices

Cloud APIs (OpenAI, Anthropic, Cohere)

  • Pros: No setup, auto-scaling, latest models
  • Cons: Per-request cost, no control over infrastructure
  • Best for: Startups, MVPs, low-to-medium volume

Self-hosted (Hugging Face, vLLM, Ollama)

  • Pros: Cost-effective at scale, full control, data privacy
  • Cons: Setup complexity, GPU costs, maintenance
  • Best for: High volume (100k+ requests/day), specialized models

Hybrid

  • Use cloud APIs for prototyping
  • Switch to self-hosted for high-volume tasks
  • Keep edge cases on cloud APIs

Monitoring and alerts

Track these metrics:

  1. Cost per request (detect inefficiencies)
  2. Total monthly spend (budget control)
  3. 95th percentile latency (catch slow outliers)
  4. Cache hit rate (optimize caching)
  5. Error rate (failed API calls waste money)

Tools: OpenAI dashboard, Datadog, custom logging (store API call metadata in your DB).

Set alerts:

  • Daily spend > $X
  • Latency > 5 seconds
  • Error rate > 5%

Common pitfalls

1. Not caching

Repeated queries cost money. Cache everything you can.

2. Over-relying on GPT-4

Use it only when necessary. GPT-3.5 handles 80% of tasks at 1/20th the cost.

3. Ignoring prompt size

Every extra word costs money. Compress aggressively.

4. Not streaming

Users hate waiting. Stream to improve perceived speed.

5. No spending alerts

Costs can spiral. Set alerts early.

Use responsibly

  • Don't sacrifice quality for cost (users notice bad answers)
  • Monitor for bias and errors (cheaper models can be less accurate)
  • Test changes rigorously (A/B test model swaps to verify quality)
  • Set hard spending limits (prevent runaway costs in production)

What's next?

  • Deployment Patterns: Learn about serverless, edge, and container options
  • Embeddings & RAG: Optimize retrieval for cost and speed
  • Evaluations 201: Measure quality vs. cost trade-offs
  • Prompting 201: Advanced techniques for efficient prompts