Tokenizer
Also known as: Tokenisation, Token Encoding
In one sentence
A tool that breaks text into smaller pieces (tokens) that an AI model can process. Different models use different tokenizers, affecting how they count and understand text.
Explain like I'm 12
It's like cutting a sandwich into bite-sized pieces so you can eat it. The tokenizer cuts your words into little chunks the AI can 'digest' — and different AIs cut their sandwiches differently.
In context
Every LLM has its own tokenizer trained alongside the model. OpenAI's tiktoken, used by GPT-4, might split 'unbelievable' into 'un', 'believ', 'able' (3 tokens), while another tokenizer might split it differently. Tokenizers use algorithms like Byte Pair Encoding (BPE) that learn the most efficient way to split text from training data. You can test tokenizers online — OpenAI's tokenizer tool lets you paste text and see exactly how it gets split and counted.
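The greedy merging idea behind BPE can be sketched in a few lines. This is a toy illustration, not tiktoken's actual implementation: the merge rules below are hypothetical hand-picked pairs chosen so the example word splits into the three tokens mentioned above, whereas a real tokenizer learns tens of thousands of merges from training data.

```python
def bpe_tokenize(word, merges):
    """Split a word into subword tokens by applying merge rules in order.

    `merges` is an ordered list of adjacent-token pairs, as learned by
    Byte Pair Encoding: start from single characters, then repeatedly
    fuse any adjacent pair that matches a rule.
    """
    tokens = list(word)  # begin with individual characters
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return tokens

# Hypothetical merge rules for illustration only.
merges = [("u", "n"), ("b", "e"), ("be", "l"), ("i", "e"),
          ("bel", "ie"), ("belie", "v"), ("a", "b"), ("ab", "l"),
          ("abl", "e")]

print(bpe_tokenize("unbelievable", merges))  # → ['un', 'believ', 'able']
```

A real tokenizer learns its merge table by repeatedly counting the most frequent adjacent pair in a large corpus, which is why common words often survive as a single token while rare words get split into several pieces.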
Related Guides
Learn more about Tokenizer in these guides:
Token Economics: Understanding AI Costs (Intermediate, 6 min read)
AI APIs charge per token. Learn how tokens work, how to estimate costs, and how to optimize spending.

Context Management: Handling Long Conversations and Documents (Intermediate, 12 min read)
Master context window management for AI. Learn strategies for long conversations, document processing, memory systems, and context optimization.

Training Multi-Modal Models (Advanced, 7 min read)
Train models that understand images and text together. Contrastive learning, vision-language pre-training, and alignment techniques.