Tokenizer: how text becomes tokens a model can read

A tokenizer is the component that converts text into tokens, the small units a language model reads and writes, and converts the model's output tokens back into text. Each model ships with its own tokenizer, and its rules decide exactly how a piece of text is split.

At a glance

What it is: The component that splits text into tokens and joins them back.

Why it matters: A model counts tokens, so the split sets how much context a prompt uses and what it costs.

It is model-specific: Each model ships its own tokenizer; swap it and the splits change.

Where it bites: If the tokenizer file is missing, the server never starts.

What does a tokenizer actually do?

A model cannot read letters. Before any text reaches it, a tokenizer breaks that text into tokens, the small units the model works in, and hands the model a sequence of numbers, one per token. When the model replies, it produces tokens, and the tokenizer runs in reverse to turn them back into the text you read. So the tokenizer sits at both ends of every request, and the model in the middle never sees a single character you typed.
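
The round trip can be sketched with a toy tokenizer. This is a minimal illustration, not how any real model's tokenizer works: the six-entry vocabulary, the ids, and the `encode` and `decode` names are all made up for the example, and a real vocabulary is learned and far larger.

```python
import re

# Hypothetical toy vocabulary: token text -> token id.
VOCAB = {"the": 0, " the": 1, " cat": 2, " sat": 3, " on": 4, " mat": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Turn text into the list of numbers a model would receive."""
    # Split into lowercase words, each keeping its leading space.
    pieces = re.findall(r" ?[a-z]+", text)
    unknown = [piece for piece in pieces if piece not in VOCAB]
    if unknown or "".join(pieces) != text:
        raise ValueError("text cannot be represented with this toy vocabulary")
    return [VOCAB[piece] for piece in pieces]


def decode(ids):
    """Run the tokenizer in reverse: numbers back to text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("the cat sat on the mat")
print(ids)          # [0, 2, 3, 4, 1, 5]
print(decode(ids))  # the cat sat on the mat
```

The model in the middle would only ever see the list of integers; the characters exist only on the tokenizer's side of the boundary.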

The split is not one token per word. Common words often map to a single token, while a rare word, a long identifier, or an unusual character gets broken into several smaller pieces. That is why the same sentence can cost a different number of tokens than you would guess from counting words.
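
To see why word counts and token counts drift apart, here is a small sketch that splits each word by greedy longest match against a vocabulary. Greedy matching is a simplified stand-in for the learned schemes real tokenizers use, the vocabulary is invented for the example, and spaces are ignored to keep it short.

```python
import string

# Hypothetical vocabulary: a few whole common words, a few fragments,
# and every lowercase letter as a last-resort fallback.
VOCAB = {"the", "model", "reads", "token", "iz", "ation"} | set(string.ascii_lowercase)


def split_word(word):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining slice first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[start]!r}")
    return pieces


sentence = "the model reads tokenization"
words = sentence.split()
tokens = [piece for word in words for piece in split_word(word)]

print(split_word("model"))         # ['model']
print(split_word("tokenization"))  # ['token', 'iz', 'ation']
print(split_word("xylem"))         # ['x', 'y', 'l', 'e', 'm']
print(len(words), "words ->", len(tokens), "tokens")  # 4 words -> 6 tokens
```

The common words each stay whole, while the rarer ones fall back to fragments or single letters, which is where the extra tokens come from.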

Why does the tokenizer matter to you?

Because a model counts tokens, not words, the tokenizer quietly sets two things you care about: how much of your context window a prompt uses, and how much a request costs to run. The same text, fed through two different tokenizers, can land on a different token count.
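
A rough sketch of that effect, using two deliberately crude stand-in tokenizers (one token per word, one token per character). Neither resembles a production tokenizer, and the context size and price below are assumed numbers chosen only to make the arithmetic visible.

```python
# Two stand-in tokenizers, both hypothetical.
def count_tokens_by_word(text):
    return len(text.split())


def count_tokens_by_char(text):
    return len(text)


CONTEXT_WINDOW = 4096       # assumed context size, in tokens
PRICE_PER_1K_TOKENS = 0.50  # assumed price, arbitrary currency units


def usage(token_count):
    """Translate a token count into context share and cost."""
    return {
        "tokens": token_count,
        "context_used": token_count / CONTEXT_WINDOW,
        "cost": token_count / 1000 * PRICE_PER_1K_TOKENS,
    }


prompt = "the cat sat on the mat"
for name, counter in [("by word", count_tokens_by_word), ("by char", count_tokens_by_char)]:
    print(name, usage(counter(prompt)))
```

The text is identical in both runs; only the tokenizer changes, and the context share and cost change with it.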

It is also model-specific. Each model ships with its own tokenizer, trained alongside it, and the two have to match. Serve a model without its tokenizer file and the server does not improvise: it refuses to start, because guessing how to split text would produce garbage the model was never trained to read.
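
The refuse-to-start behavior amounts to a check at startup, roughly like the sketch below. The function name `require_tokenizer` and the file name `tokenizer.json` are assumptions for illustration; an actual server may look for different files and report the failure differently.

```python
from pathlib import Path

# Assumed file name; real servers may expect other files.
TOKENIZER_FILENAME = "tokenizer.json"


def require_tokenizer(model_dir):
    """Return the tokenizer path, or fail loudly instead of guessing."""
    tokenizer_path = Path(model_dir) / TOKENIZER_FILENAME
    if not tokenizer_path.is_file():
        raise FileNotFoundError(
            f"no {TOKENIZER_FILENAME} found in {model_dir}; refusing to start"
        )
    return tokenizer_path
```

Running a check like this before loading the model means a missing file stops the launch immediately, rather than surfacing later as unreadable output.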
