Latency
Also known as: Response Time, Inference Time
In one sentence
The time delay between sending a request to an AI model and receiving the first part of its response; lower latency means faster replies.
Explain like I'm 12
The waiting time between asking a question and getting an answer—like the pause after you text a friend before those three typing dots finally turn into a message.
In context
Latency varies dramatically across AI models and use cases. GPT-4 might take 2-5 seconds for complex prompts, while smaller models like GPT-3.5 respond in under a second. Factors affecting latency include model size, prompt length, server load, and geographic distance to the API server. Streaming helps by showing words as they're generated rather than waiting for the full response, reducing perceived latency even when total generation time stays the same.
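The distinction between perceived and total latency can be made concrete with a small sketch. The snippet below uses a simulated token generator (a stand-in for a real streaming API; the function name and delays are illustrative assumptions, not any particular vendor's interface) to show that the time to the first token is much shorter than the time for the full response:

```python
import time

def generate_tokens(n_tokens=20, per_token_delay=0.01):
    """Simulated model: yields one token at a time, like a streaming API.

    per_token_delay is an illustrative stand-in for per-token generation time.
    """
    for i in range(n_tokens):
        time.sleep(per_token_delay)
        yield f"token{i} "

start = time.perf_counter()
first_token_latency = None
chunks = []
for chunk in generate_tokens():
    if first_token_latency is None:
        # Perceived latency: the user sees output from this moment on.
        first_token_latency = time.perf_counter() - start
    chunks.append(chunk)
# Total latency: how long the complete answer actually took.
total_latency = time.perf_counter() - start

print(f"time to first token: {first_token_latency:.3f}s")
print(f"total generation:    {total_latency:.3f}s")
```

With streaming, the user starts reading after `first_token_latency` rather than waiting for `total_latency`, which is why streaming improves perceived responsiveness even though the total generation time is unchanged.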
Related Guides
Learn more about Latency in these guides:
- AI Latency Optimization: Making AI Faster (Intermediate, 10 min read). Learn to reduce AI response times. From model optimization to infrastructure tuning—practical techniques for building faster AI applications.
- Cost & Latency: Making AI Fast and Affordable (Advanced, 13 min read). Optimize AI systems for speed and cost. Techniques for reducing latency, controlling API costs, and scaling efficiently.
- Efficient Inference Optimization (Advanced, 8 min read). Optimize AI inference for speed and cost: batching, caching, model serving, KV cache, speculative decoding, and more.