Tokenizer: how text becomes tokens a model can read
A tokenizer is the component that converts text into tokens, the small units a language model reads and writes, and converts the model's output tokens back into text. Each model ships with its own tokenizer, and its rules decide exactly how a piece of text is split.
At a glance
What it is: The component that splits text into tokens and joins them back
Why it matters: A model counts tokens, so the split sets how much context a prompt uses and what it costs
It is model-specific: Each model ships its own tokenizer; swap it and the splits change
Where it bites: If the tokenizer file is missing, the server never starts
Flow
Where the tokenizer sits
Your text passes through the tokenizer on the way in and on the way out. The model only ever sees tokens, never the raw characters you typed.
1. Your text: the words you typed
2. Tokenizer: splits text into tokens, joins tokens back into text
3. The model: reads and writes only tokens
What does a tokenizer actually do?
A model cannot read letters. Before any text reaches it, a tokenizer breaks that
text into tokens, the small units the model works in, and hands the model a
sequence of numbers, one per token. When the model replies, it produces tokens,
and the tokenizer runs in reverse to turn them back into the text you read. So
the tokenizer sits at both ends of every request, and the model in the middle
never sees a single character you typed.
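The round trip can be sketched in a few lines of Python. This is a toy, not how any particular model's tokenizer works: the vocabulary, the `encode` and `decode` names, and the longest-match rule are all assumptions made for illustration.

```python
# Toy vocabulary, invented for this example; a real tokenizer ships a learned
# vocabulary with tens of thousands of entries.
VOCAB = ["the", " cat", " sat", " on", " mat", " "]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Turn text into token ids by taking the longest matching entry each time."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [t for t in VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers {text[pos]!r}")
        best = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[best])
        pos += len(best)
    return ids


def decode(ids: list[int]) -> str:
    """Turn token ids back into text by joining their pieces."""
    return "".join(VOCAB[i] for i in ids)


text = "the cat sat on the mat"
ids = encode(text)
print(ids)          # the model only ever sees these numbers
print(decode(ids))  # running the tokenizer in reverse gives the text back
```

The point of the sketch is the shape of the flow: text goes in, a list of numbers comes out, and the same table turns the numbers back into text.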
The split is not one token per word. Common words often map to a single token,
while a rare word, a long identifier, or an unusual character gets broken into
several smaller pieces. That is why the same sentence can cost a different number
of tokens than you would guess from counting words.
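A small sketch makes the gap between words and tokens visible. It assumes a made-up set of known pieces and a simple rule (take the longest known piece, else fall back to a single character); real tokenizers learn their pieces from data and use their own rules.

```python
# Hypothetical set of known pieces, chosen only to illustrate the idea.
PIECES = {"the", " model", " reads", " token", "ization"}


def split(text: str) -> list[str]:
    """Split text into the longest known pieces, one character at a time otherwise."""
    pieces = []
    pos = 0
    while pos < len(text):
        end = pos + 1  # fallback: a single character becomes its own piece
        for candidate_end in range(len(text), pos, -1):
            if text[pos:candidate_end] in PIECES:
                end = candidate_end
                break
        pieces.append(text[pos:end])
        pos = end
    return pieces


text = "the model reads tokenization_v2"
pieces = split(text)
print(len(text.split()), "words")
print(len(pieces), "tokens:", pieces)
```

Here the common words each stay whole, while the identifier at the end breaks into several pieces, so the token count ends up higher than the word count.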
Why does the tokenizer matter to you?
Because a model counts tokens, not words, the tokenizer quietly sets two things
you care about: how much of your context window a prompt uses, and how much a
request costs to run. The same text, fed through two different tokenizers, can
produce two different token counts.
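To see how the count feeds into both numbers, here is a rough sketch. The two counting functions are crude stand-ins for two different tokenizers, and the context window size and price are hypothetical values, not figures for any real model.

```python
# Two stand-in tokenizers, invented for illustration only.
def count_word_level(text: str) -> int:
    """One token per whitespace-separated word."""
    return len(text.split())


def count_char_level(text: str) -> int:
    """One token per character."""
    return len(text)


CONTEXT_WINDOW = 8192        # hypothetical limit, in tokens
PRICE_PER_1K_TOKENS = 0.002  # hypothetical price per 1,000 tokens


def budget(n_tokens: int) -> dict:
    """Share of the context window used and cost for a given token count."""
    return {
        "context_used": n_tokens / CONTEXT_WINDOW,
        "cost": n_tokens / 1000 * PRICE_PER_1K_TOKENS,
    }


prompt = "Tokenizers split text into tokens"
for name, count in [("word-level", count_word_level), ("char-level", count_char_level)]:
    n = count(prompt)
    print(name, n, budget(n))
```

The text is identical in both runs; only the split changes, and the context share and cost move with it.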
It is also model-specific. Each model ships with the tokenizer it was trained
with, and the two have to match. Serve a model without its tokenizer file
and the server does not improvise: it refuses to start, because guessing how to
split text would produce garbage the model was never trained to read.
Check it yourself
ls /model/tokenizer.json
A served model needs its tokenizer file present. If it is missing, the server fails to start instead of guessing how to split text.
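The same check can be done from a script before launching anything. This is a minimal sketch: `require_tokenizer` is a hypothetical helper, not part of any serving engine, and it assumes the file is named `tokenizer.json` inside the model directory, as in the command above.

```python
from pathlib import Path


def require_tokenizer(model_dir: str) -> Path:
    """Return the tokenizer file path, or raise if the file is not there."""
    path = Path(model_dir) / "tokenizer.json"  # assumed file name
    if not path.is_file():
        raise FileNotFoundError(f"tokenizer file not found: {path}")
    return path


try:
    print("found", require_tokenizer("/model"))
except FileNotFoundError as err:
    print("refusing to start:", err)
```

Failing early with a clear message mirrors what the server does: stop instead of guessing how to split text.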
A tokenizer decides
How your text is split into tokens, and therefore how many tokens it costs
How rare words and odd characters are broken into smaller pieces
How the model's output tokens are turned back into readable text
It does not decide
Whether the model's answer is correct; that is the model, not the split
How fast the model runs; that is hardware and the serving engine
The meaning of a token; it only assigns the pieces, and the model learns what they mean