TL;DR

AI APIs charge by the token, not by the word. A token is roughly four characters or three-quarters of a word. Both your input (the prompt) and the output (the response) count toward your bill. Understanding how tokens work helps you estimate costs accurately, stay within context limits, and optimize your spending without sacrificing quality.

Why it matters

If you are using AI through APIs to build products, automate workflows, or process large amounts of text, tokens directly translate to money. A single API call might cost a fraction of a cent, but thousands of calls per day add up quickly. Teams have been surprised by bills in the thousands of dollars because they did not understand how token counting works.

Beyond cost, tokens determine what your AI can even process. Every model has a context window, a maximum number of tokens it can handle in a single conversation. If your prompt plus the expected response exceeds that limit, you get an error or truncated output. Understanding tokens helps you design prompts that fit within these limits and use the available space efficiently.

For businesses building AI-powered features, token economics directly affects your profit margins. The difference between a well-optimized prompt and a wasteful one can be a 5x to 10x cost difference at scale.

What is a token?

A token is not a word. It is a sub-word unit that the model's tokenizer creates when breaking text into pieces it can process. Common words are usually a single token. Less common words get split into multiple tokens. Punctuation and spaces also count.

Here are some examples to build your intuition. The word "Hello" is 1 token. "ChatGPT" is 2 tokens: "Chat" and "GPT." A long word like "Internationalization" might be 5 tokens because it gets broken into common sub-word pieces.

The general rule of thumb is that 100 tokens equal roughly 75 English words, or that 1 token is approximately 4 characters. This varies by language. Languages that use longer words or non-Latin scripts, like German, Japanese, or Arabic, often use more tokens per word.

You can check exact token counts using tools like OpenAI's tokenizer (available online) or the tiktoken library in Python. These show you exactly how a specific piece of text gets split into tokens by a particular model.
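For quick back-of-the-envelope estimates without a tokenizer library, the rule of thumb above can be sketched as a small helper. This is a rough heuristic for English text, not a substitute for the model's real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: ~4 characters or ~0.75 words per token.

    For billing-accurate counts, use the model's actual tokenizer
    (e.g. the tiktoken library mentioned above).
    """
    if not text:
        return 0
    by_chars = len(text) / 4          # ~4 characters per token
    by_words = len(text.split()) / 0.75  # ~0.75 words per token
    return round((by_chars + by_words) / 2)
```

Averaging the two heuristics smooths out texts with unusually long or short words; expect it to be within roughly 20 percent of the real count for typical English prose.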

How tokenization works

When you send text to an AI model, the first thing that happens is tokenization. A tokenizer is an algorithm that splits your text into pieces drawn from a fixed vocabulary of sub-word units. The most common approach is called Byte Pair Encoding (BPE).

BPE starts with individual characters and iteratively merges the most frequently occurring pairs. After training on a large text corpus, the tokenizer ends up with a vocabulary of typically 50,000 to 100,000 tokens. Very common words like "the" or "is" become single tokens. Rare words get split into smaller pieces that do appear in the vocabulary.

For example, the sentence "I'm learning about AI tokens." gets tokenized into something like: ["I", "'m", " learning", " about", " AI", " tokens", "."], giving you 7 tokens. Notice that spaces are often attached to the following word and that punctuation gets its own token.

Different models use different tokenizers, which means the same text produces different token counts depending on which model you use. GPT-4 and Claude use different tokenizers, so a 1,000-word document might be 1,300 tokens in one model and 1,400 in another. Always count tokens using the specific model's tokenizer for accurate cost estimates.

How token pricing works

Most AI APIs charge separately for input tokens (your prompt) and output tokens (the model's response). Output tokens are typically more expensive because they require more computation to generate. Pricing is quoted per million tokens or per thousand tokens, depending on the provider.

As of early 2026, pricing varies dramatically between models and providers. The most capable models like GPT-4o and Claude Opus cost more per token than smaller models like GPT-4o-mini or Claude Haiku. The price difference can be 10x to 50x between the cheapest and most expensive options.

Here is a concrete example. Say you have a prompt that uses 500 input tokens and the model generates a 1,000-token response. At a rate of $3 per million input tokens and $15 per million output tokens, that single call costs: (500 / 1,000,000 * $3) + (1,000 / 1,000,000 * $15) = $0.0015 + $0.015 = $0.0165, or roughly 1.7 cents. That seems tiny, but if you make 100,000 such calls per month, you are spending $1,650.
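The per-call arithmetic above is worth wrapping in a helper so you can plug in your own rates (the $3 and $15 figures here are just the example rates from this section, not any provider's actual prices):

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float = 3.0,
              out_price_per_m: float = 15.0) -> float:
    """Cost in dollars for one API call, given per-million-token rates."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

print(call_cost(500, 1_000))            # 0.0165 -> about 1.7 cents
print(call_cost(500, 1_000) * 100_000)  # ~1650 dollars per month at 100k calls
```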

Pricing changes frequently as providers compete and release new models. Always check the current pricing page for your provider before budgeting.

Estimating and budgeting costs

To estimate costs for a project, you need three numbers: average input tokens per request, average output tokens per request, and expected request volume.

Start by running your typical prompts through a tokenizer to count input tokens. Then test with a few real requests to see how many output tokens the model generates on average. Multiply by your expected daily or monthly volume and apply the pricing formula.

Build in a buffer. Real-world usage almost always exceeds initial estimates. Retries after failures, longer-than-expected responses, and growing user adoption all push costs up. A 30 to 50 percent buffer is reasonable for initial budgeting.

For applications with variable-length inputs, like document summarization, test with your shortest and longest expected documents to understand the range. Your average cost will fall somewhere in between, but your peak cost matters for budgeting.
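The three numbers plus the buffer combine into a simple monthly estimate. All the inputs here are placeholders you would replace with your own measurements and your provider's current rates:

```python
def monthly_cost(avg_in_tokens: float, avg_out_tokens: float,
                 calls_per_month: int,
                 in_price_per_m: float, out_price_per_m: float,
                 buffer: float = 0.4) -> float:
    """Estimated monthly spend in dollars, with a safety buffer (default 40%)."""
    per_call = (avg_in_tokens / 1_000_000) * in_price_per_m \
             + (avg_out_tokens / 1_000_000) * out_price_per_m
    return per_call * calls_per_month * (1 + buffer)

# 500 in / 1,000 out tokens per call, 100k calls/month, example $3/$15 rates:
print(monthly_cost(500, 1_000, 100_000, 3, 15))  # base $1,650 plus 40% buffer
```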

How to reduce token usage

The most effective optimization is writing concise prompts. Remove unnecessary context, instructions the model already follows by default, and verbose phrasing. A prompt that says "Please analyze the following text and provide a detailed summary including the main points, key themes, and any notable details" can often be shortened to "Summarize this text" with the same results and half the tokens.

Use system messages efficiently. System messages persist across an entire conversation, so every word in them costs tokens on every single request. Keep system messages focused and concise.

Set the max_tokens parameter to limit output length. If you only need a one-sentence answer, do not let the model generate a five-paragraph essay. This saves both tokens and latency.

Choose the right model for each task. Do not use your most expensive model for simple classification or extraction tasks. A smaller, cheaper model handles routine work perfectly well. Reserve your premium model for tasks that genuinely require advanced reasoning.

Implement caching for repeated or similar queries. If ten users ask the same question within an hour, serve the cached response instead of making ten API calls. Even simple caching strategies can reduce costs by 30 to 60 percent for many applications.
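A caching layer for identical prompts can be as small as a dictionary with a time-to-live. This is a minimal in-memory sketch; a production system would likely use a shared store such as Redis and may normalize prompts before hashing:

```python
import hashlib
import time

class ResponseCache:
    """Minimal in-memory cache keyed on the exact prompt text."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, timestamp)

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str):
        """Return the cached response, or None if missing or expired."""
        entry = self._store.get(self._key(prompt))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (response, time.time())
```

Before making an API call, check `get(prompt)`; on a miss, call the API and `put` the result. Only exact-duplicate prompts hit the cache here; catching merely similar queries requires semantic (embedding-based) caching.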

For batch processing, combine multiple items into a single API call when possible. Instead of making 100 separate calls to classify 100 support tickets, send them in batches of 10 or 20 with instructions to process all of them.
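The batching pattern is a one-liner to chunk the work, with the items of each chunk sent together in a single prompt:

```python
def batches(items: list, size: int = 10) -> list:
    """Split a list into chunks of at most `size` items, one API call per chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 100 support tickets -> 5 calls of 20 instead of 100 single calls
tickets = [f"ticket {n}" for n in range(100)]
print(len(batches(tickets, 20)))  # 5
```

Batching amortizes the fixed prompt overhead (system message, instructions, formatting) across many items; the trade-off is that one failed call now affects a whole batch.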

Hidden costs to watch for

Several costs are easy to overlook when budgeting. Retries and failures can double your effective cost if your system retries aggressively. Testing and debugging during development burns tokens that do not produce user value. Prompt engineering iterations, where you try dozens of prompt variations, add up quickly.

The biggest hidden cost in conversational applications is context accumulation. In a multi-turn conversation, the entire conversation history is sent with every new message. By turn 20, your input tokens might be 10x what they were at turn 1. Implement conversation summarization or sliding window strategies to keep context manageable.
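A sliding-window trim is straightforward to sketch: walk the history backwards and keep the most recent messages that fit a token budget. The `count_tokens` callable is whatever tokenizer-based counter you use; the word-count stand-in below is only for illustration:

```python
def trim_history(messages: list, max_tokens: int, count_tokens) -> list:
    """Keep the most recent messages whose total tokens fit in max_tokens.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    count_tokens: callable returning the token count of a string.
    """
    kept, total = [], 0
    for msg in reversed(messages):          # newest first
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order
```

In practice you would pin the system message so it is never trimmed, and consider summarizing the dropped turns instead of discarding them outright.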

Image and audio inputs, for multimodal models, use significantly more tokens than text. A single high-resolution image can cost the equivalent of thousands of text tokens. Factor this into your pricing if your application handles visual content.

Monitoring and controlling costs

Set up monitoring from day one. Track API usage per feature, per user, and per day. Most providers offer usage dashboards, but build your own monitoring too so you can correlate costs with specific application behaviors.

Set hard spending limits in your API provider's dashboard. These prevent runaway costs from bugs, abuse, or unexpected traffic spikes. An infinite loop that calls the API can burn through hundreds of dollars in minutes.

Alert on anomalies. If your daily spend suddenly doubles, you want to know immediately, not at the end of the month. Set up alerts at 80 percent and 100 percent of your expected daily budget.
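The threshold check itself is trivial; the value is in wiring it to whatever alerting channel you already use. A sketch of the check, with the 80 and 100 percent levels from above as defaults:

```python
def crossed_thresholds(daily_spend: float, expected_budget: float,
                       thresholds: tuple = (0.8, 1.0)) -> list:
    """Return the budget fractions the current spend has reached or passed."""
    return [t for t in thresholds if daily_spend >= t * expected_budget]

# With a $100/day expected budget:
print(crossed_thresholds(90, 100))   # [0.8]        -> warn
print(crossed_thresholds(120, 100))  # [0.8, 1.0]   -> page someone
```

Run this against your usage totals on a schedule (e.g. every few minutes) and fire a notification the first time each threshold is crossed per day.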

Review your costs weekly and look for optimization opportunities. Often, a small change to a frequently used prompt or switching one feature to a cheaper model can save hundreds of dollars per month.

Common mistakes

The most common mistake is not counting tokens before building. People design prompts, build features, and then discover their costs are 5x what they expected. Always prototype and measure token usage before committing to an approach.

Another mistake is sending the entire document when only part of it is relevant. If a user asks about chapter 3 of a book, do not send the entire book as context. Extract the relevant section first using search or chunking.
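Even a naive relevance filter beats sending everything. The sketch below picks the chunk with the most word overlap with the query; real systems would use embeddings or a search index, but the cost-saving principle is the same:

```python
def best_chunk(chunks: list, query: str) -> str:
    """Return the chunk sharing the most words with the query (naive sketch)."""
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))

chunks = [
    "chapter one covers the basics of tokenization",
    "chapter three covers pricing and budgeting",
]
print(best_chunk(chunks, "what does chapter three say about pricing"))
```

Sending one relevant chunk instead of the whole document cuts input tokens roughly in proportion to the document-to-chunk size ratio.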

Teams frequently ignore output token costs, which are often 2x to 5x higher than input token costs. Letting the model ramble with no output limit is expensive. Be specific about the format and length you want.

Finally, many people use a single model for everything. Using GPT-4-class models for tasks that GPT-4o-mini handles perfectly is like taking a helicopter to the corner store. Match the model to the task complexity.

What's next?