
Latency

Also known as: Response Time, Inference Time

In one sentence

The time delay between sending a request to an AI model and receiving the first part of its response. Lower latency means faster replies.

Explain like I'm 12

The waiting time between asking a question and getting an answer—like the pause after you text a friend before those three typing dots finally turn into a message.

In context

Latency varies dramatically across AI models and use cases. GPT-4 might take 2-5 seconds for complex prompts, while smaller models like GPT-3.5 respond in under a second. Factors affecting latency include model size, prompt length, server load, and geographic distance to the API server. Streaming helps by showing words as they're generated rather than waiting for the full response, reducing perceived latency even when total generation time stays the same.
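To make the distinction between first-token latency and total generation time concrete, here is a minimal sketch that times both using the OpenAI Python SDK with streaming enabled. The model name and prompt are illustrative only, and it assumes an API key is set in the OPENAI_API_KEY environment variable.

```python
import time

from openai import OpenAI  # requires: pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Explain latency in one sentence."}],
    stream=True,  # deliver tokens as they are generated
)

first_token = None
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta and first_token is None:
        # Time to first token: the delay the user actually perceives
        first_token = time.perf_counter() - start
total = time.perf_counter() - start  # wall-clock time for the full response

print(f"Time to first token: {first_token:.2f}s, total: {total:.2f}s")
```

In a streaming interface, the first number is the pause the user feels before words start appearing; the second is how long the full answer takes to generate, which stays the same whether or not streaming is used.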
