
Tokenizer

Also known as: Tokenisation, Token Encoding

In one sentence

A tool that breaks text into smaller pieces (tokens) that an AI model can process. Different models use different tokenizers, affecting how they count and understand text.

Explain like I'm 12

It's like cutting a sandwich into bite-sized pieces so you can eat it. The tokenizer cuts your words into little chunks the AI can 'digest' — and different AIs cut their sandwiches differently.

In context

Every LLM has its own tokenizer, trained alongside (or ahead of) the model itself. OpenAI's tiktoken library, used by GPT-4, might split 'unbelievable' into 'un', 'believ', 'able' (3 tokens), while another tokenizer might split it differently. Most tokenizers use algorithms like Byte Pair Encoding (BPE), which starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pairs found in training data, so common words and word fragments end up as single tokens. You can test tokenizers online — OpenAI's tokenizer tool lets you paste text and see exactly how it gets split and counted.
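The merge loop at the heart of BPE can be sketched in a few lines of plain Python. This is a toy character-level illustration, not OpenAI's actual implementation — real tokenizers work on bytes, pre-split text with regexes, and store merge rules in a fixed vocabulary; all function names here are made up for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_train(text, num_merges):
    """Learn `num_merges` merge rules from `text`, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges

tokens, merges = bpe_train("unbelievable unbelievable", 4)
print(tokens)   # pieces of the text after 4 greedy merges
print(merges)   # the learned merge rules, in order
```

Because merges are learned from frequencies in the training data, two models trained on different corpora will learn different merge rules — which is exactly why the same sentence can count as a different number of tokens under different tokenizers.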
