The observation
Translate the sentence "The model failed to produce a coherent output on the third attempt" into Spanish: "El modelo no logró producir una salida coherente en el tercer intento." Feed both to GPT-4's tokenizer. The English version produces 16 tokens. The Spanish version produces 26 — 62% more, for semantically equivalent content. You pay for 26.
This is not a corner case. It holds across providers, languages, and query types. Multiple research groups have documented the pattern independently, and the mechanism is well understood. What is less discussed is why the architecture that produces this disparity was chosen, what it would take to change it, and whether the incentives to do so actually exist.
Ratios derived from Ahia et al. (EMNLP 2023) and Petrov et al. (NeurIPS 2023). Measured across OpenAI, Cohere, and open-source tokenizers.
Where BPE came from
Byte Pair Encoding was not designed for language models. Philip Gage introduced it in 1994 as a lossless data compression algorithm. The idea is simple: find the most frequent pair of adjacent bytes in a corpus, replace every occurrence of that pair with a single new symbol, and repeat. Each iteration reduces total data size by eliminating one redundant sequence.
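Gage's loop is short enough to sketch directly. The following is a toy Python illustration, not tuned for performance: raw bytes keep ids 0–255 and each merge mints a fresh symbol id above that range.

```python
from collections import Counter

def bpe_compress(data: bytes, num_merges: int):
    """Gage's 1994 loop: repeatedly replace the most frequent
    pair of adjacent symbols with a single new symbol."""
    seq = list(data)      # work on a list of symbol ids
    merges = {}           # (a, b) -> new symbol id
    next_id = 256         # ids 0..255 are reserved for raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:     # no pair repeats; nothing left to compress
            break
        merges[(a, b)] = next_id
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_id = out, next_id + 1
    return seq, merges

seq, merges = bpe_compress(b"abababab", num_merges=2)
# First merge: (97, 98) i.e. "ab" -> 256; second: (256, 256) -> 257.
print(seq)  # → [257, 257]
```

One detail glossed over here: ties in pair frequency are broken arbitrarily by `Counter.most_common`, whereas real tokenizers fix a deterministic order so that training is reproducible.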
Twenty-two years later, Rico Sennrich and colleagues at the University of Edinburgh repurposed BPE for neural machine translation. The problem they were solving was different: how do you handle rare or unseen words in a fixed-vocabulary model? Their insight was that BPE's subword units could replace the word-level vocabulary. Instead of encoding "unhappiness" as a single token (or failing on it entirely), a BPE-based model encodes it as "un" + "happi" + "ness" — three learned subword units.
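The inference side of this scheme can be sketched the same way: split the word into characters, then repeatedly apply the best-ranked learned merge until none applies. The merge table below is hand-picked to reproduce the "unhappiness" example, not learned from any real corpus.

```python
def bpe_segment(word: str, merge_ranks: dict) -> list:
    """Greedily apply the best-ranked (lowest-numbered) learned
    merge until no adjacent pair is in the merge table."""
    parts = list(word)                       # start from characters
    while len(parts) > 1:
        best_i, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = merge_ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_i is None:                   # no learned merge applies
            break
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts

# Hand-picked merge ranks; a real table is learned from corpus statistics.
ranks = {("u", "n"): 0, ("h", "a"): 1, ("p", "p"): 2, ("ha", "pp"): 3,
         ("happ", "i"): 4, ("n", "e"): 5, ("ne", "s"): 6, ("nes", "s"): 7}
print(bpe_segment("unhappiness", ranks))  # → ['un', 'happi', 'ness']
```

A word whose pairs never appear in the table simply falls back to characters, which is exactly the property that made subword vocabularies attractive for rare words.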
This design worked. The approach was adopted by GPT-2, GPT-3, and eventually became the de facto standard across the industry. OpenAI's tiktoken library (cl100k_base, ~100k vocabulary), Google's SentencePiece implementations, and Meta's LLaMA tokenizer are all descendants of this architecture. Each provider trains their tokenizer on their own corpus. Each corpus has its own language distribution.
The corpus dominance problem
BPE learns its vocabulary from data. The more frequently a byte sequence appears across the training corpus, the more likely it is to be merged into a single token. This means the efficiency of any tokenizer is a direct function of the language distribution of its training data.
The training corpora for major commercial tokenizers are heavily weighted toward English. Common Crawl, the primary web-scraping dataset underlying most LLM training data, is estimated to be 45–60% English by volume. The next largest languages — Russian, German, French — collectively account for perhaps 20–25%. Hindi, Arabic, Swahili, and the other languages spoken by the majority of the world's population represent fractions of a percent.
The consequence is mechanical: BPE learns rich, high-coverage merge rules for English sequences. For a language like Hindi, written in Devanagari script, fewer merge rules are learned — which means more bytes remain unmerged, which means more tokens per sentence.
Illustrative token counts. Exact values vary by tokenizer version and input text. The directional pattern is consistent across providers.
The empirical evidence
Several independent research groups have measured this systematically rather than anecdotally.
Rust et al. (ACL 2021) examined how tokenization quality affects downstream performance in multilingual models. Their key finding: tokenization fertility — the number of tokens a tokenizer produces for a given text — is inversely correlated with model performance on low-resource languages. Languages that tokenize inefficiently also perform worse on NLP tasks, because the model must allocate more of its context window to encoding the same amount of meaning.
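Fertility is simple to compute. The sketch below uses tokens per whitespace-delimited word, one common normalization (papers differ on the denominator); the token counts are the ones from the opening example.

```python
def fertility(n_tokens: int, text: str) -> float:
    """Tokens per whitespace-delimited word; higher means the
    tokenizer spends more tokens to encode the same content."""
    return n_tokens / len(text.split())

en = "The model failed to produce a coherent output on the third attempt"
es = "El modelo no logró producir una salida coherente en el tercer intento"

# 16 and 26 are the GPT-4 token counts from the opening example.
print(round(fertility(16, en), 2))  # → 1.33
print(round(fertility(26, es), 2))  # → 2.17
```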
Ahia et al. (EMNLP 2023) took the question directly to the economic domain. They computed cost disparities across seven commercial LLM providers — OpenAI, Cohere, AI21, Anthropic, and others — using parallel corpora: texts with the same meaning in multiple languages. Their findings were consistent across providers. Spanish costs roughly 1.6 times more than English to process. Arabic costs approximately 3.1 times more. Some low-resource languages showed multipliers above 10.
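Under per-token pricing, the measurement reduces to a ratio of token counts over parallel text. A minimal sketch, using the opening example's English and Spanish counts plus a hypothetical Hindi count chosen to illustrate a multiplier near 4.9×:

```python
def cost_multipliers(token_counts: dict) -> dict:
    """Per-token cost premium of each language relative to English
    for semantically equivalent parallel text."""
    base = token_counts["en"]
    return {lang: n / base for lang, n in token_counts.items()}

# en/es counts are from the opening example; the hi count is
# hypothetical, chosen to illustrate a ~4.9x multiplier.
counts = {"en": 16, "es": 26, "hi": 78}
print({lang: round(m, 2) for lang, m in cost_multipliers(counts).items()})
# → {'en': 1.0, 'es': 1.62, 'hi': 4.88}
```

The published numbers average such ratios over full parallel corpora and multiple providers rather than a single sentence.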
Petrov et al. (NeurIPS 2023) framed the same finding as a fairness problem. They introduced a metric called "intrinsic dimension" of tokenization — roughly, how many tokens the model needs to represent a unit of information — and showed that the gap between English and high-fertility languages creates structural inequities in access to AI systems. A research team in Lagos or Chennai pays a higher price per query not because of anything about their use case, but because of how their language was represented in training data assembled primarily in the United States.
Zouhar et al. (ACL 2023) examined a related problem: how tokenization affects the quality of the model's output distribution, independent of cost. They found that higher fertility correlates with noisier distributions — the model's probability estimates for tokens in high-fertility languages are systematically less reliable than for low-fertility languages. The user pays more and gets less precise predictions.
The economic scale of the disparity
Current LLM pricing (March 2026) runs from approximately $0.10 per million tokens at the low end to over $100 per million tokens for the most capable models. Apply a 4.9× fertility multiplier for Hindi and the absolute gap between what a Hindi user and an English user pay for the same query on a premium model exceeds the entire cost of running that query on a budget model, by orders of magnitude.
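The arithmetic behind that claim, for an assumed 1,000-token (English-equivalent) query, using the prices and the 4.9× multiplier from the paragraph above:

```python
def query_cost(en_tokens: int, fertility_mult: float,
               usd_per_million_tokens: float) -> float:
    """Cost of one query, given the token count an English phrasing
    would need and the language's fertility multiplier."""
    return en_tokens * fertility_mult * usd_per_million_tokens / 1e6

en_premium = query_cost(1000, 1.0, 100.0)   # English on a premium model
hi_premium = query_cost(1000, 4.9, 100.0)   # same query in Hindi
low_end    = query_cost(1000, 1.0, 0.10)    # English on a budget model

print(f"premium-model gap: ${hi_premium - en_premium:.2f}/query")  # → $0.39
print(f"budget-model cost: ${low_end:.4f}/query")                  # → $0.0001
```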
There is no ISO standard for tokens. Each provider defines the unit differently: OpenAI's tiktoken cl100k has approximately 100,000 vocabulary entries; Google's SentencePiece implementations use vocabularies up to 256,000 entries; Meta's LLaMA BPE tokenizer uses 32,000. The same input string produces different token counts across these systems. Comparing prices across providers requires knowing the fertility ratio for the specific language and text type — information that is rarely disclosed in pricing pages.
A parallel from cloud computing history
The situation has a precedent. In the mid-2000s, as cloud computing emerged, each major provider invented its own billing unit: AWS EC2 instance-hours, Azure Compute Credits, Google Compute Units. The heterogeneity was not technically necessary (CPU time is CPU time), but proprietary units made direct price comparison difficult, created vendor lock-in, and delayed the development of neutral benchmarking.
It took years and the emergence of independent benchmarking organizations (SPECcloud, STAC Research) before standardized comparison became routine. Even then, standardization of the unit did not happen — what emerged was transparency about what each unit means in practice.
The token is in the same position. It is not a unit of meaning, not a unit of computation, not a unit of information in any formally defined sense. It is a byproduct of a compression algorithm applied to a specific training corpus. The same algorithm applied to a different corpus produces a different tokenizer. Different tokenizers produce different token counts for the same string. There is no external standard against which to verify claims.
Why this does not self-correct
The natural question is why market forces have not pushed providers toward more equitable tokenization. Several mechanisms work against it. The tokenizer is frozen into the model: the embedding matrix is indexed by the tokenizer's vocabulary, so replacing the tokenizer means retraining, or at least substantially re-engineering, the model built on top of it. The cost of high fertility falls on users rather than providers, so it never appears on a provider's balance sheet as a problem to solve. And because token counts are opaque, the users paying the premium often cannot see that they are paying it, which blunts whatever competitive pressure would otherwise reward a more efficient tokenizer.
What would a solution require?
It is worth being precise about what "solving" this problem would actually mean, because the problem has multiple distinct components that require different solutions.
The fertility disparity can be reduced — not eliminated — through better training data. A tokenizer trained on a corpus with a more equitable language distribution would learn more efficient merge rules for non-English languages. Research on this exists: multilingual tokenizers trained on more balanced corpora show significantly lower fertility disparities. The tradeoff is vocabulary size: covering more languages efficiently requires a larger vocabulary, which increases model memory requirements and inference cost.
The billing opacity problem requires a different solution. Providers could publish fertility tables for their tokenizers — how many tokens per word, measured across a standardized corpus of languages. This would allow users to build accurate cost estimates and make meaningful cross-provider comparisons. There is no technical obstacle to doing this. It is an information disclosure choice.
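What such a disclosure could look like is straightforward to sketch. The function below takes any tokenizer (a callable from string to token list) and a parallel corpus keyed by language; the character-level stand-in tokenizer and two-sentence corpus are demo assumptions only. A real table would plug in the provider's actual tokenizer and a standard parallel corpus such as FLORES.

```python
def fertility_table(tokenize, parallel_corpus: dict) -> dict:
    """Tokens per whitespace-delimited word, per language, measured
    on the same parallel content."""
    table = {}
    for lang, sentences in parallel_corpus.items():
        n_tokens = sum(len(tokenize(s)) for s in sentences)
        n_words = sum(len(s.split()) for s in sentences)
        table[lang] = n_tokens / n_words
    return table

# Character-level stand-in tokenizer (list) and a toy demo corpus.
demo = fertility_table(list, {
    "en": ["the cat sat"],
    "de": ["die Katze saß"],
})
print({lang: round(f, 2) for lang, f in demo.items()})  # → {'en': 3.67, 'de': 4.33}
```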
A third component — the pricing structure itself — is the hardest to address. Pricing per token is a sensible unit when the token is a consistent measure of computational work. It is a less sensible unit when the same computational work can cost 4.9 times more depending on which language the user speaks. Alternatives exist: pricing per byte of input, pricing per character, or pricing per semantic unit (however that is defined). Each has different tradeoffs for providers and users.
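The difference between billing bases is easy to make concrete with the opening example: under per-token pricing the Spanish user pays the 26/16 token ratio, while under per-character pricing the ratio would simply track the texts themselves.

```python
en = "The model failed to produce a coherent output on the third attempt"
es = "El modelo no logró producir una salida coherente en el tercer intento"

# Token counts from the opening example (GPT-4's tokenizer).
token_ratio = 26 / 16           # what per-token billing charges
char_ratio = len(es) / len(en)  # what per-character billing would charge

print(round(token_ratio, 2), round(char_ratio, 2))  # → 1.62 1.05
```

Per-byte pricing would sit between the two for Latin scripts but would reintroduce a penalty for scripts like Devanagari, where each character takes three bytes in UTF-8.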
The structural implication
The disparity documented by Ahia, Petrov, Rust, and Zouhar is not primarily a story about individual pricing decisions. It is a story about how design choices made in one context — a compression algorithm, a training corpus assembled for English-language NLP — propagate downstream into a pricing structure that now affects millions of users who had no input into those choices.
The 1994 paper that introduced BPE was solving a data compression problem for C programmers. The 2016 paper that adapted it for neural machine translation was solving a vocabulary coverage problem for English-French and English-German translation pairs. Neither paper was designing a billing unit for a global AI market. The fact that BPE became the billing unit for that market is an accident of path dependency, not of deliberate design.
Accidents of path dependency are among the hardest things to fix in technology. They become embedded in APIs, pricing pages, SDK defaults, and documentation. They create constituencies with a stake in stability. They compound over time as more systems are built on top of them. The history of computing is full of them — from character encodings to file formats to network protocols — and the pattern of resolution is usually slow, partial, and driven by pressure from outside the established vendors.
For LLM tokenization, that pressure is starting to build. The papers cited here represent the beginning of a systematic research program, not a mature literature. The measurement methodology is still being developed. The proposed remedies are still being evaluated. Whether the incentive structure changes before the disparity becomes a defining issue for non-English AI adoption is an empirical question — one that will be answered in the next few years.
References
Ahia et al., "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models," EMNLP 2023.
Gage, "A New Algorithm for Data Compression," The C Users Journal, 1994.
Petrov et al., "Language Model Tokenizers Introduce Unfairness Between Languages," NeurIPS 2023.
Rust et al., "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models," ACL 2021.
Sennrich, Haddow, and Birch, "Neural Machine Translation of Rare Words with Subword Units," ACL 2016.
Zouhar et al., "Tokenization and the Noiseless Channel," ACL 2023.
This article was prepared in response to feedback from the Hacker News moderation team regarding an earlier post on tokenization pricing. The goal here is purely analytical: documenting a well-evidenced structural problem in LLM billing, grounded in peer-reviewed research. No commercial solutions are discussed or implied. Questions or corrections welcome.
Published at tokenstree.com · March 2026