The observation

Translate the sentence "The model failed to produce a coherent output on the third attempt" into Spanish: "El modelo no logró producir una salida coherente en el tercer intento." Feed both to GPT-4's tokenizer. The English version produces 16 tokens. The Spanish version produces 26 — 62% more, for semantically equivalent content. You pay for 26.

This is not a corner case. It holds across providers, languages, and query types. Multiple research groups have documented the pattern independently, and the mechanism is well understood. What is less discussed is why the architecture that produces this disparity was chosen, what it would take to change it, and whether the incentives to do so actually exist.

English · baseline token cost (1.0×)
Spanish · ~1.6× more tokens for the same content
Arabic · ~3.1× more tokens for the same content
Hindi · ~4.9× more tokens for the same content

Ratios derived from Ahia et al. (EMNLP 2023) and Petrov et al. (NeurIPS 2023). Measured across OpenAI, Cohere, and open-source tokenizers.

Where BPE came from

Byte Pair Encoding was not designed for language models. Philip Gage introduced it in 1994 as a lossless data compression algorithm. The idea is simple: find the most frequent pair of adjacent bytes in a corpus, replace every occurrence of that pair with a single new symbol, and repeat. Each iteration reduces total data size by eliminating one redundant sequence.
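
The merge loop described above can be sketched in a few lines. This is an illustrative Python rendering of the pair-merge procedure, not Gage's original C implementation; symbol ids above 255 stand in for the spare byte values his utility reused.

```python
from collections import Counter

def bpe_compress(data, num_merges):
    """Repeatedly replace the most frequent adjacent symbol pair
    with a new symbol, recording each merge rule."""
    merges = []
    next_symbol = 256  # new symbol ids start past the byte range
    for _ in range(num_merges):
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        merges.append((pair, next_symbol))
        out, i = [], 0
        while i < len(data):
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(next_symbol)  # replace the pair with one symbol
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
        next_symbol += 1
    return data, merges

# Three merges shrink this 11-byte string to 5 symbols; the first
# merge collapses the most frequent pair, "aa".
compressed, merges = bpe_compress(list(b"aaabdaaabac"), 3)
```

Each merge trades one vocabulary slot for shorter sequences, which is exactly the trade language models later inherited.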

C Users Journal · 1994
Gage, Philip. "A New Algorithm for Data Compression."
The original BPE paper. A compression utility, not an NLP method: merging repeats until no pair is frequent enough to be worth replacing. The NLP adaptation instead iterates pair-merge operations until a target vocabulary size is reached.

Twenty-two years later, Rico Sennrich and colleagues at the University of Edinburgh repurposed BPE for neural machine translation. The problem they were solving was different: how do you handle rare or unseen words in a fixed-vocabulary model? Their insight was that BPE's subword units could replace the word-level vocabulary. Instead of encoding "unhappiness" as a single token (or failing on it entirely), a BPE-based model encodes it as "un" + "happi" + "ness" — three learned subword units.
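
The segmentation step the paper describes, applying an ordered list of learned merges to a new word, looks like this in miniature. The merge list here is hand-picked to reproduce the article's example; a real tokenizer learns tens of thousands of merges and also tracks end-of-word markers, which are omitted for simplicity.

```python
def bpe_segment(word, merges):
    """Split a word into subwords by applying merge rules in priority order."""
    symbols = list(word)
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append("".join(pair))  # merge the pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, ordered by priority.
merges = [("u", "n"), ("h", "a"), ("p", "p"), ("ha", "pp"),
          ("happ", "i"), ("n", "e"), ("s", "s"), ("ne", "ss")]
print(bpe_segment("unhappiness", merges))  # ['un', 'happi', 'ness']
```

No word is ever out of vocabulary: anything the merges don't cover falls back to single characters.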

ACL 2016
Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units."
The paper that brought BPE into NLP. It solved the out-of-vocabulary problem for machine translation and became the template for tokenization in subsequent large language models.

This design worked. The approach was adopted by GPT-2, GPT-3, and eventually became the de facto standard across the industry. OpenAI's tiktoken library (cl100k_base, ~100k vocabulary), Google's SentencePiece implementations, and Meta's LLaMA tokenizer are all descendants of this architecture. Each provider trains their tokenizer on their own corpus. Each corpus has its own language distribution.

The corpus dominance problem

BPE learns its vocabulary from data. The more frequently a byte sequence appears across the training corpus, the more likely it is to be merged into a single token. This means the efficiency of any tokenizer is a direct function of the language distribution of its training data.

The training corpora for major commercial tokenizers are heavily weighted toward English. Common Crawl, the primary web-scraping dataset underlying most LLM training data, is estimated to be 45–60% English by volume. The next largest languages — Russian, German, French — collectively account for perhaps 20–25%. Hindi, Arabic, Swahili, and the other languages spoken by the majority of the world's population represent fractions of a percent.

The consequence is mechanical: BPE learns rich, high-coverage merge rules for English sequences. For a language like Hindi, written in Devanagari script, fewer merge rules are learned — which means more bytes remain unmerged, which means more tokens per sentence.
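
The mechanism can be made concrete with a toy greedy tokenizer: give it long vocabulary entries for English words only, with single characters as the fallback (as byte-level BPE guarantees), and equivalent Spanish text fragments into far more pieces. The vocabulary below is hand-built for illustration; real merge tables are learned from corpus statistics.

```python
def greedy_tokenize(text, vocab):
    """Longest-match tokenization against a fixed vocabulary.
    Single characters always match, mimicking BPE's byte-level fallback."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in vocab or len(piece) == 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# English-biased toy vocabulary: long merges exist for English words only.
vocab = {"process", " this", " document", "pro", "es"}
en = greedy_tokenize("process this document", vocab)    # 3 tokens
es = greedy_tokenize("procesar este documento", vocab)  # 11 tokens
```

The Spanish sentence is not harder to model; it simply has fewer learned merges covering it, so more of it survives as raw characters.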

Same semantic content — "I need to process this document" — across four languages (tiktoken cl100k_base)
EN · "I need to process this document" · 7 tokens
ES · "Necesito procesar este documento" · 10 tokens
AR · "أحتاج إلى معالجة هذه الوثيقة" · 22 tokens
HI · "मुझे इस दस्तावेज़ को प्रोसेस करना है" · 34 tokens

Illustrative token counts. Exact values vary by tokenizer version and input text. The directional pattern is consistent across providers.

The empirical evidence

Several independent research groups have measured this systematically rather than anecdotally.

Rust et al. (ACL 2021) examined how tokenization quality affects downstream performance in multilingual models. Their key finding: tokenization fertility — the average number of tokens a tokenizer produces per word — is inversely correlated with model performance on low-resource languages. Languages that tokenize inefficiently also perform worse on NLP tasks, because the model must allocate more of its context window to encoding the same amount of meaning.

ACL 2021
Rust, Phillip, et al. "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models."
Introduced "fertility" as a measure of tokenization efficiency and demonstrated its correlation with downstream model performance across 99 languages. Languages with high fertility scores (more tokens per word) consistently showed worse performance on tasks like NER and POS tagging.
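
Fertility is straightforward to compute for any tokenizer. A minimal sketch of the metric as tokens per whitespace-delimited word; the two mock tokenizers below are stand-ins for real ones, bracketing the best case (one token per word) and worst case (one token per character).

```python
def fertility(tokenize, texts):
    """Average tokens per whitespace-delimited word across a corpus."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

word_level = lambda s: s.split()                    # best case: 1 token/word
char_level = lambda s: [c for c in s if c != " "]   # worst case: 1 token/char

corpus = ["I need to process this document"]
print(fertility(word_level, corpus))  # 1.0
print(fertility(char_level, corpus))  # ~4.33
```

Real tokenizers land between these extremes, and where they land depends on the language.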

Ahia et al. (EMNLP 2023) took the question directly to the economic domain. They computed cost disparities across seven commercial LLM providers — OpenAI, Cohere, AI21, Anthropic, and others — using parallel corpora: texts with the same meaning in multiple languages. Their findings were consistent across providers. Spanish costs roughly 1.6 times more than English to process. Arabic costs approximately 3.1 times more. Some low-resource languages showed multipliers above 10.

EMNLP 2023
Ahia, Orevaoghene, et al. "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models."
The most direct economic analysis of tokenization-driven price disparities. Used parallel corpora across providers and showed that cost differences are systematic across all major commercial APIs, not artifacts of any single provider's choices.

Petrov et al. (NeurIPS 2023) framed the same finding as a fairness problem. They introduced a metric called "intrinsic dimension" of tokenization — roughly, how many tokens the model needs to represent a unit of information — and showed that the gap between English and high-fertility languages creates structural inequities in access to AI systems. A research team in Lagos or Chennai pays a higher price per query not because of anything about their use case, but because of how their language was represented in training data assembled primarily in the United States.

NeurIPS 2023
Petrov, Aleksandar, et al. "Language Model Tokenizers Introduce Unfairness between Languages."
Proposes formal fairness criteria for tokenization and demonstrates violations across all major tokenizers. Introduces fertility-normalized cost as a metric for cross-language price comparisons.

Zouhar et al. (ACL Findings 2023) examined a related problem: how tokenization affects the quality of the model's output distribution, independent of cost. They found that higher fertility correlates with noisier distributions — the model's probability estimates for tokens in high-fertility languages are systematically less reliable than for low-fertility languages. The user pays more and gets less precise predictions.

ACL Findings 2023
Zouhar, Vilém, et al. "Tokenization and the Noisiness of Language Model Distributions."
Demonstrates that high-fertility tokenization (many tokens per word) degrades the model's output distribution quality, compounding the economic disadvantage with a quality disadvantage.

The economic scale of the disparity

Current LLM pricing (March 2026) spans from approximately $0.10 per million tokens on the lower end to over $100 per million tokens for the most capable models. Apply the ~4.9× fertility multiplier for Hindi at the premium end: the same semantic content costs a Hindi user nearly five times what it costs an English user, and the absolute difference per query exceeds what the entire query would cost on the cheapest models in the market.

There is no ISO standard for tokens. Each provider defines the unit differently: OpenAI's tiktoken cl100k has approximately 100,000 vocabulary entries; Google's SentencePiece implementations use vocabularies up to 256,000 entries; Meta's LLaMA BPE tokenizer uses 32,000. The same input string produces different token counts across these systems. Comparing prices across providers requires knowing the fertility ratio for the specific language and text type — information that is rarely disclosed in pricing pages.

The measurement problem: Most pricing calculators estimate token counts for English text. For any other language, the estimate can be off by a factor of 2 to 5. A budget built on English estimates becomes a significant underestimate for a multilingual application.
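
The arithmetic is worth making explicit. A hedged sketch using the ~4.9× Hindi multiplier cited above and a hypothetical $10-per-million-token price:

```python
def effective_cost_usd(price_per_m_tokens, english_equiv_tokens,
                       fertility_multiplier):
    """Cost of content whose English rendering would use the given token
    count, in a language with the given fertility multiplier."""
    actual_tokens = english_equiv_tokens * fertility_multiplier
    return price_per_m_tokens * actual_tokens / 1_000_000

# One million English-equivalent tokens per month at $10 per million.
budget_en = effective_cost_usd(10.0, 1_000_000, 1.0)  # $10.00
cost_hi   = effective_cost_usd(10.0, 1_000_000, 4.9)  # $49.00
```

A budget planned from the English estimate undercounts the Hindi workload by nearly a factor of five.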

A parallel from cloud computing history

The situation has a precedent. In the early 2000s, as cloud computing emerged, each major provider invented its own billing unit: AWS EC2 hours, Azure Compute Credits, Google Compute Units. The heterogeneity was not technically necessary — CPU time is CPU time. But proprietary units made direct price comparison difficult, created vendor lock-in, and delayed the development of neutral benchmarking.

It took years and the emergence of independent benchmarking efforts (SPEC's cloud benchmarks, STAC Research) before standardized comparison became routine. Even then, standardization of the unit did not happen — what emerged was transparency about what each unit means in practice.

The token is in the same position. It is not a unit of meaning, not a unit of computation, not a unit of information in any formally defined sense. It is a byproduct of a compression algorithm applied to a specific training corpus. The same algorithm applied to a different corpus produces a different tokenizer. Different tokenizers produce different token counts for the same string. There is no external standard against which to verify claims.

Why this does not self-correct

The natural question is why market forces have not pushed providers toward more equitable tokenization. Several mechanisms work against it.

R1
Revenue is positively correlated with inefficiency
A provider billing per token earns more revenue from a non-English user submitting the same semantic content as an English user. There is no revenue incentive to reduce fertility for underrepresented languages.
R2
Training data assembly follows the same bias
Collecting high-quality multilingual data at scale is expensive. English web data is cheap and abundant. The tokenizer's language distribution reflects the economics of data collection, not the distribution of the world's speakers.
R3
Tokenizer changes break downstream systems
A tokenizer change invalidates existing prompt templates, token budgets, and context-window calculations for existing customers. The cost of migration is borne by users, creating pressure to keep tokenizers stable even when better alternatives exist.
R4
The harmed users are not yet the primary market
Non-English markets are growing but still represent a smaller fraction of enterprise LLM revenue than English markets. The pressure to fix the disparity has not yet reached the threshold where it overrides the migration cost.

What would a solution require?

It is worth being precise about what "solving" this problem would actually mean, because the problem has multiple distinct components that require different solutions.

The fertility disparity can be reduced — not eliminated — through better training data. A tokenizer trained on a corpus with a more equitable language distribution would learn more efficient merge rules for non-English languages. Research on this exists: multilingual tokenizers trained on more balanced corpora show significantly lower fertility disparities. The tradeoff is vocabulary size: covering more languages efficiently requires a larger vocabulary, which increases model memory requirements and inference cost.

The billing opacity problem requires a different solution. Providers could publish fertility tables for their tokenizers — how many tokens per word, measured across a standardized corpus of languages. This would allow users to build accurate cost estimates and make meaningful cross-provider comparisons. There is no technical obstacle to doing this. It is an information disclosure choice.
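
Such a fertility table is simple to produce from a parallel corpus. A sketch of the computation, where the character-level tokenizer is a placeholder for a provider's real tokenizer and the corpus is a toy:

```python
def fertility_table(tokenize, parallel_corpus):
    """Tokens-per-word fertility for each language in a parallel corpus,
    where the corpus maps language codes to equivalent sentences."""
    table = {}
    for lang, sentences in parallel_corpus.items():
        tokens = sum(len(tokenize(s)) for s in sentences)
        words = sum(len(s.split()) for s in sentences)
        table[lang] = round(tokens / words, 2)
    return table

def char_tokenize(s):
    return [c for c in s if c != " "]  # placeholder tokenizer

corpus = {"en": ["process this document"],
          "es": ["procesar este documento"]}
print(fertility_table(char_tokenize, corpus))
```

Run against a standardized multilingual corpus with the provider's actual tokenizer, this is the whole disclosure being asked for.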

A third component — the pricing structure itself — is the hardest to address. Pricing per token is a sensible unit when the token is a consistent measure of computational work. It is a less defensible unit when the same semantic content costs 4.9 times more depending on which language the user speaks. Alternatives exist: pricing per byte of input, pricing per character, or pricing per semantic unit (however that is defined). Each has different tradeoffs for providers and users.
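
One caveat on the per-byte option: UTF-8 is itself biased by script. Latin characters encode in one byte each, Devanagari in three, so per-byte billing would reintroduce a script penalty that per-code-point billing avoids. A quick check:

```python
samples = {
    "en": "process this document",
    "hi": "नमस्ते",  # "namaste" (hello)
}
for lang, text in samples.items():
    # code points vs. UTF-8 bytes for the same string
    print(lang, len(text), len(text.encode("utf-8")))
```

For the Hindi sample, six code points become eighteen UTF-8 bytes; the English sample is one byte per character.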

Q1
What vocabulary size is required for a tokenizer to achieve fertility parity (within ±20%) across the 50 most-spoken languages? Is this compatible with current model architectures?
Q2
At what point does the non-English enterprise market reach sufficient scale to make tokenizer migration economically rational for a major provider?
Q3
Is there a provider-neutral "fertility audit" methodology that would allow independent verification of tokenization claims — analogous to what SPEC does for compute benchmarks?
Q4
Would character-based pricing (per Unicode code point) be a tractable alternative to token-based pricing, given that it removes the tokenizer as a billing variable?
Q5
How does the fertility disparity interact with context window constraints? An 8,192-token context represents less semantic content in Arabic than in English — which means the effective capability of the model is also lower for high-fertility languages.

The structural implication

The disparity documented by Ahia, Petrov, Rust, and Zouhar is not primarily a story about individual pricing decisions. It is a story about how design choices made in one context — a compression algorithm, a training corpus assembled for English-language NLP — propagate downstream into a pricing structure that now affects millions of users who had no input into those choices.

The 1994 paper that introduced BPE was solving a data compression problem for C programmers. The 2016 paper that adapted it for neural machine translation was solving a vocabulary coverage problem for English-German and English-Russian translation pairs. Neither paper was designing a billing unit for a global AI market. The fact that BPE became the billing unit for that market is an accident of path dependency, not of deliberate design.

Accidents of path dependency are among the hardest things to fix in technology. They become embedded in APIs, pricing pages, SDK defaults, and documentation. They create constituencies with a stake in stability. They compound over time as more systems are built on top of them. The history of computing is full of them — from character encodings to file formats to network protocols — and the pattern of resolution is usually slow, partial, and driven by pressure from outside the established vendors.

For LLM tokenization, that pressure is starting to build. The papers cited here represent the beginning of a systematic research program, not a mature literature. The measurement methodology is still being developed. The proposed remedies are still being evaluated. Whether the incentive structure changes before the disparity becomes a defining issue for non-English AI adoption is an empirical question — one that will be answered in the next few years.


References

[1] Gage, Philip. "A New Algorithm for Data Compression." C Users Journal, 12(2), 1994.
[2] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." Proceedings of ACL 2016. arXiv:1508.07909.
[3] Rust, Phillip, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models." Proceedings of ACL-IJCNLP 2021. arXiv:2012.15613.
[4] Ahia, Orevaoghene, Sachin Kumar, Hila Gonen, Julia Kreutzer, David Chiang, Noah A. Smith, and Yulia Tsvetkov. "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models." Proceedings of EMNLP 2023. arXiv:2305.13707.
[5] Petrov, Aleksandar, Emilio La Malfa, Philip H. S. Torr, and Adel Bibi. "Language Model Tokenizers Introduce Unfairness between Languages." Advances in Neural Information Processing Systems (NeurIPS) 2023. arXiv:2305.15425.
[6] Zouhar, Vilém, Clara Meister, Juan Gastaldi, Li Du, Mrinmaya Sachan, and Ryan Cotterell. "Tokenization and the Noisiness of Language Model Distributions." Findings of ACL 2023. arXiv:2305.11628.

This article was prepared in response to feedback from the Hacker News moderation team regarding an earlier post on tokenization pricing. The goal here is purely analytical: documenting a well-evidenced structural problem in LLM billing, grounded in peer-reviewed research. No commercial solutions are discussed or implied. Questions or corrections welcome.

Published at tokenstree.com · March 2026