TL;DR

AI latency comes from three sources: model computation, data transfer, and infrastructure overhead. Optimize by choosing smaller models, caching results, streaming responses, and tuning infrastructure. With proper optimization, most applications can achieve sub-second responses.

Why it matters

Users expect near-instant responses, and each additional 100ms of latency measurably reduces engagement. In real-time applications like chatbots or recommendations, latency determines usability. Understanding where the time goes lets you optimize effectively.

Understanding AI latency

Latency components

Model computation:

  • Time to process input through model
  • Depends on model size and complexity
  • Scales with input/output length

Data transfer:

  • Network latency to API services
  • Payload serialization
  • TLS handshakes

Infrastructure overhead:

  • Cold starts (model loading)
  • Queue wait times
  • Preprocessing and postprocessing

Typical latency breakdown

Component            Typical time   Optimization potential
Network round trip   50-200ms       Medium (geographic)
Model inference      100-2000ms     High (model choice)
Cold start           1-30s          High (warm instances)
Pre/post processing  10-100ms       Medium

Model-level optimization

Smaller models

The fastest optimization is usually to use a smaller model (a routing sketch follows the lists below):

Tradeoffs:

  • GPT-4 vs GPT-3.5: GPT-4 is 2-3x slower but higher quality
  • Large vs small embedding models: the small model is often similar in quality and faster
  • Specialized vs general models: a task-specific model is often faster AND better

When to use smaller:

  • Simple tasks (classification, extraction)
  • High-volume, low-stakes operations
  • Latency-critical applications
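
A minimal routing sketch under these guidelines, assuming you already have a client call wrapped as call_model(model_name, prompt); the task labels and model names are illustrative placeholders, not real identifiers:

```python
# Route simple, latency-critical tasks to a smaller model and reserve the
# large model for open-ended work. Task labels and model names are placeholders.

SIMPLE_TASKS = {"classification", "extraction", "routing"}

def pick_model(task_type: str) -> str:
    """Return a model name based on task complexity."""
    if task_type in SIMPLE_TASKS:
        return "small-fast-model"     # e.g. a distilled or "mini" variant
    return "large-quality-model"      # reserved for complex generation

def handle_request(task_type: str, prompt: str, call_model) -> str:
    """call_model(model_name, prompt) is whatever client function you already use."""
    model = pick_model(task_type)
    return call_model(model, prompt)
```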

Output length control

Shorter outputs mean faster responses, since generation time scales roughly linearly with output tokens (an example request follows the list below):

Techniques:

  • Set max_tokens limit
  • Request concise responses in prompt
  • Use structured outputs
  • Stop sequences to end early
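
A minimal sketch of a capped request, assuming the OpenAI Python SDK; the model name, prompt, token limit, and stop sequence are all illustrative:

```python
# A capped request: limit output length, ask for brevity, and stop early.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",                       # smaller model for latency
    messages=[
        {"role": "system", "content": "Answer in at most two sentences."},
        {"role": "user", "content": "Summarize the ticket below.\n..."},
    ],
    max_tokens=150,                            # hard cap on output length
    stop=["\n\n"],                             # end at the first blank line
)
print(response.choices[0].message.content)
```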

Streaming responses

Show output as it's generated:

Benefits:

  • Perceived latency much lower
  • First token in <500ms typically
  • Users can start reading immediately

Implementation:

  • Use streaming API endpoints
  • Handle incremental rendering
  • Manage partial response state
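
A minimal streaming sketch, again assuming the OpenAI Python SDK; rendering here is just printing, but the same loop can feed incremental UI updates:

```python
# Stream tokens as they arrive so the user sees output immediately.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain caching in three sentences."}],
    stream=True,
)

partial = []                                   # partial response state
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        partial.append(delta)
        print(delta, end="", flush=True)       # incremental rendering
full_text = "".join(partial)                   # complete text once the stream ends
```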

Caching strategies

What to cache

High value:

  • Identical queries
  • Embeddings for static content
  • Preprocessed data
  • Common responses

Cache key considerations:

  • Exact match vs semantic similarity
  • Include relevant context in key
  • Handle personalization appropriately

Cache architecture

Request → Check L1 (memory) → Miss
              ↓
        Check L2 (Redis) → Miss
              ↓
        Model inference → Cache result → Return
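
A sketch of that two-tier, exact-match lookup, assuming a local Redis instance for L2 and an in-process dict for L1; run_inference stands in for whatever model call you already make, and the one-hour TTL is illustrative:

```python
# Exact-match, two-tier cache: in-process dict (L1) in front of Redis (L2).
# The key includes model name and parameters so different configurations never collide.
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379)
l1: dict[str, str] = {}

def cache_key(model: str, prompt: str, params: dict) -> str:
    payload = json.dumps({"m": model, "p": prompt, "o": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, prompt: str, params: dict, run_inference) -> str:
    key = cache_key(model, prompt, params)
    if key in l1:                               # L1 hit: effectively free
        return l1[key]
    hit = r.get(key)                            # L2 hit: ~1ms on a local network
    if hit is not None:
        l1[key] = hit.decode()
        return l1[key]
    result = run_inference(model, prompt, params)   # miss: pay full model latency
    r.setex(key, 3600, result)                  # cache for an hour
    l1[key] = result
    return result
```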

Cache hit rates

Strategy        Typical hit rate   Best for
Exact match     10-30%             Repeated queries
Semantic cache  30-60%             Similar questions
Pre-computed    80-95%             Common scenarios
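
A semantic-cache sketch, assuming an embed() function from whatever embedding model you already use; the 0.92 cosine-similarity threshold is illustrative and needs tuning against your own queries:

```python
# Semantic cache: reuse a cached answer when a new query's embedding is close
# enough to a previously answered one.
import numpy as np

cache: list[tuple[np.ndarray, str]] = []        # (query embedding, cached answer)

def semantic_lookup(query_vec: np.ndarray, threshold: float = 0.92):
    for vec, answer in cache:
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= threshold:
            return answer                       # close enough: skip inference
    return None

def answer(query: str, embed, run_inference) -> str:
    q_vec = embed(query)
    hit = semantic_lookup(q_vec)
    if hit is not None:
        return hit
    result = run_inference(query)
    cache.append((q_vec, result))
    return result
```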

Infrastructure optimization

Keep models warm

Avoid cold start latency:

Strategies:

  • Minimum instance count >0
  • Health check pings
  • Predictive scaling
  • Warm pools

Cold start impact:

  • First request: 5-30 seconds
  • Warm request: <1 second
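
A keep-warm sketch: a cheap periodic request against a hypothetical health endpoint so the platform never scales the model to zero. The URL and interval are placeholders; in practice this usually lives in a scheduler (cron, a cloud scheduler job, etc.):

```python
# Ping the inference endpoint on a schedule to prevent scale-to-zero cold starts.
import time

import requests

WARMUP_URL = "https://inference.example.com/health"    # hypothetical endpoint

def keep_warm(interval_seconds: int = 240) -> None:
    while True:
        try:
            requests.get(WARMUP_URL, timeout=5)         # cheap request, no payload
        except requests.RequestException:
            pass                                        # a failed ping is not fatal
        time.sleep(interval_seconds)
```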

Geographic distribution

Reduce network latency:

  • Deploy close to users
  • Multiple regions for global apps
  • Edge caching for static content
  • CDN for common responses

Batching

Combine requests for efficiency:

When to batch:

  • Background processing
  • Non-real-time features
  • High-volume operations

Batch tradeoffs:

  • Higher throughput
  • Higher individual latency
  • More efficient resource use
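
A micro-batching sketch that reflects these tradeoffs: buffer items and flush either when the batch is full or when the oldest item has waited too long. run_batch() stands in for whatever batched inference call your stack provides; the size and wait limits are illustrative:

```python
# Micro-batching: trade a small amount of per-request latency for throughput.
import time

class MicroBatcher:
    def __init__(self, run_batch, max_size: int = 32, max_wait_s: float = 0.05):
        self.run_batch = run_batch
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.oldest = None

    def submit(self, item):
        """Add an item; flush immediately if the batch is full."""
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(item)
        if len(self.buffer) >= self.max_size:
            return self.flush()
        return None

    def maybe_flush(self):
        """Call from a timer/loop: flush if the oldest item has waited too long."""
        if self.buffer and time.monotonic() - self.oldest >= self.max_wait_s:
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        return self.run_batch(batch)             # one model call for many items
```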

Monitoring and profiling

Key metrics

Track:

  • P50, P95, P99 latency
  • Time to first token (streaming)
  • Cold start frequency
  • Cache hit rates

Alert on:

  • P95 > target threshold
  • P99 significantly higher than P95
  • Sudden latency increases
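
A small sketch that computes these percentiles from recorded request latencies and flags the alert conditions above; the thresholds are illustrative:

```python
# P50/P95/P99 from recorded latencies, plus simple alert checks.
import numpy as np

def latency_report(latencies_ms: list[float], p95_target_ms: float = 1000.0) -> dict:
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    alerts = []
    if p95 > p95_target_ms:
        alerts.append(f"P95 {p95:.0f}ms exceeds target {p95_target_ms:.0f}ms")
    if p99 > 2 * p95:                            # long tail much worse than P95
        alerts.append(f"P99 {p99:.0f}ms is more than 2x P95")
    return {"p50": p50, "p95": p95, "p99": p99, "alerts": alerts}
```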

Profiling approach

  1. Measure end-to-end latency
  2. Break down by component
  3. Identify bottlenecks
  4. Optimize largest contributors
  5. Repeat
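
A profiling sketch that times each stage of a single request so optimization effort goes to the largest contributor; the stage functions are placeholders for your own preprocessing, model call, and postprocessing:

```python
# Break end-to-end latency down by component and print the biggest contributors first.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000   # milliseconds

def handle(request, preprocess, call_model, postprocess):
    with timed("preprocess"):
        inputs = preprocess(request)
    with timed("inference"):
        raw = call_model(inputs)
    with timed("postprocess"):
        result = postprocess(raw)
    total = sum(timings.values())
    for stage, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:12s} {ms:7.1f} ms  ({ms / total:.0%})")
    return result
```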

Common optimizations by use case

Use case            Key optimizations
Chatbots            Streaming, warm instances, caching
Search              Embeddings cache, semantic cache
Content generation  Smaller models, output limits
Real-time analysis  Edge deployment, batching

Common mistakes

Mistake                     Impact                           Prevention
Optimizing wrong component  Wasted effort                    Profile first
No streaming                Poor perceived latency           Stream interactive responses
Cold instances              Spiky latency                    Keep instances warm
No caching                  Unnecessary computation          Cache aggressively
Over-optimizing             Complexity, diminishing returns  Stop at "good enough"

What's next

Continue optimizing AI systems: