Tokenization
Tokenization is the process of breaking text into chunks (tokens) — usually sub-word pieces — that an LLM actually reads and writes.
It's the unglamorous but essential first step in every LLM call. Before a model sees your text, the text is split into discrete tokens: sub-word units drawn from a vocabulary of typically 32k-200k entries.
The most common scheme today is BPE (Byte-Pair Encoding), which builds its vocabulary by repeatedly merging the most frequent pairs of adjacent symbols, so common character sequences get their own token while rare ones are split into smaller pieces. For most English text, one token ≈ ¾ of a word. Common words ("the", "and", "cat") are single tokens; rare or technical words ("antidisestablishmentarianism") become 4-6 tokens.
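You can see this splitting firsthand with OpenAI's open-source tiktoken library. A minimal sketch (the exact counts depend on which vocabulary you load; cl100k_base is the one used by GPT-4-era models):

```python
import tiktoken

# Load a specific BPE vocabulary; cl100k_base is used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "cat", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    # Decode each id individually to reveal the sub-word pieces.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r}: {len(token_ids)} token(s) -> {pieces}")
```

Common words come back as a single token; the rare word splits into several recognizable fragments.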
Why tokenization matters in practice:
- Pricing — APIs charge per token, not per word, so a 500-word answer bills as roughly 650 tokens. Estimate costs in tokens, not words.
- Context limits — when an API says "128k context window," it means tokens, not words, and the window covers both your prompt and the model's response. At roughly ¾ of a word per token, plan for about 95,000 words of effective capacity.
- Languages — tokenizers are usually trained on English-heavy corpora, so non-English text needs more tokens per character. A sentence in Chinese, Japanese, or Korean can use 2-3x more tokens than its English equivalent, which makes the same API effectively more expensive for non-English text (see the sketch after this list).
- Code — modern tokenizers handle code reasonably well, but indentation, brackets, and other structural characters still consume tokens. A 100-line file is typically 400-1200 tokens, depending on the language.
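A quick way to check the pricing and language points yourself, again using tiktoken as a stand-in for whatever tokenizer your provider actually uses (the sample sentences and the per-token rate below are illustrative assumptions, not real pricing):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The same sentence in English and (roughly) in Japanese.
samples = {
    "English": "The weather is nice today.",
    "Japanese": "今日はいい天気ですね。",
}

PRICE_PER_MILLION_TOKENS = 10.0  # hypothetical USD rate; check your provider

for lang, text in samples.items():
    n = len(enc.encode(text))
    cost = n / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{lang}: {len(text)} chars -> {n} tokens (${cost:.6f})")
```

The Japanese sentence is far shorter in characters but typically lands at a similar or higher token count, which is the language overhead in action.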
Practical tip: when sizing prompts and budgets, count with the tokenizer your actual API uses (tiktoken for OpenAI; Anthropic's token-counting endpoint for Claude). Generic estimators like "1 token = 4 characters" give a rough ballpark, but they can be off by 20% or more, especially on code and non-English text.
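For OpenAI models that looks like the sketch below. tiktoken's encoding_for_model maps a model name to its vocabulary, so your budget tracks what the API will actually bill (this assumes a tiktoken version recent enough to know the model name you pass):

```python
import tiktoken

# Map a model name to its tokenizer; needs a tiktoken release that knows it.
enc = tiktoken.encoding_for_model("gpt-4o")

prompt = "Summarize the following document in three bullet points."
budget = 128_000  # advertised context window, in tokens

used = len(enc.encode(prompt))
print(f"prompt uses {used} tokens; {budget - used} left for documents + reply")
```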
FAQ
Can I count tokens without calling the API?
For OpenAI models, yes: tiktoken is a fast open-source library (Python, with JS ports) you can run locally, as in the snippets above. For current Claude models the tokenizer isn't published, so the reliable option is Anthropic's token-counting API endpoint, which counts a request's tokens without running a generation.
Related terms
- LLM (Large Language Model) — A Large Language Model is a neural network trained on huge volumes of text to predict the next token, which produces emergent capabilities like reasoning, code generation, and translation.
- Context window — The context window is the maximum number of tokens (text chunks) a language model can consider at once — both the prompt you send and the response it generates.
- Transformer architecture — The transformer is the neural network architecture introduced in 2017 that powers every major LLM, built around the attention mechanism that lets each token weigh the relevance of every other token.
Want to actually build with this?
Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.
Build my stack →