Last week a Claude Code skill called caveman crossed 64,000 GitHub stars by doing one thing: making the model talk like a caveman. "Brain still big. Mouth small." It cuts output tokens by 65% on their own benchmark. It is genuinely clever and the numbers hold up.
What it does not do is touch the other side of your bill: the input. For long-context apps — RAG, agents, anything with a system prompt longer than a tweet — the input is 60 to 80 percent of the spend. Caveman leaves that money on the table. We ran the benchmark.
two sides of the bill
Every Claude API call has two token counts: the input you sent and the output the model produced. Anthropic prices them separately. On Haiku 4.5 the input is $1/M and the output is $5/M. On a typical retrieval-augmented call with 10,000 tokens of context and 500 tokens of answer, the input is 80% of the bill before anything happens.
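The 80% figure is plain arithmetic on the two prices. A minimal sketch, assuming the Haiku 4.5 prices quoted above; the function name `input_share` is ours, for illustration only:

```python
# Prices are the Haiku 4.5 figures quoted above, in USD per million tokens.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 5.00


def input_share(input_tokens: int, output_tokens: int) -> float:
    """Return the fraction of one call's cost that comes from input tokens."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost / (input_cost + output_cost)


# The retrieval-augmented call from the text: 10,000 tokens in, 500 tokens out.
share = input_share(10_000, 500)
print(f"input share of bill: {share:.0%}")
```

Swap in your own token counts to see where your calls sit; the share drops quickly once answers get long relative to context.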
Caveman attacks the output. Its skill prepends a system instruction that tells Claude to drop articles, fillers, and connectives in its responses. The model still thinks the full chain — the thinking tokens are untouched — but the visible response comes back telegraphic. The savings are real: 22 to 87% on output, averaging 65% across their 10-prompt benchmark. [1]
Two of our systems live on the other side of that wall: transfer.tokenstree.com (LLMLingua-2 semantic compression) and translation.tokenstree.com (translate the prompt to the cheapest token language). Both run on the input before it ever reaches Anthropic. We benchmarked them against caveman's input-side skill on caveman's own prompts.
the benchmark
Twenty prompts, three suites:
- Suite A — caveman's own 10 prompts: the published benchmark from their repo, unmodified. Coding-heavy, English, short to medium.
- Suite B — multilingual (5 prompts): native-language questions in Spanish, French, German, Portuguese, Italian about Django, React, Postgres, microservices, Docker.
- Suite C — long LLM context (5 prompts): each prepended with a 200-token system context, the shape of a typical RAG payload.
Same tokenizer everywhere (tiktoken cl100k_base, which lands within ~2% of Anthropic's tokenizer on Latin scripts). Four reducers measured per prompt: TokensTransfer, TokenTranslation, a deterministic Python replica of the caveman-compress skill rules, and TokensTransfer plus TokenTranslation stacked.
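The per-prompt metric is just tokens before versus tokens after. A rough sketch of that measurement, not the benchmark script itself: the helper names are hypothetical, and a whitespace counter stands in for the tokenizer so the snippet runs without dependencies. The commented lines show how tiktoken's cl100k_base would plug in.

```python
from typing import Callable


def reduction_pct(original: str, reduced: str,
                  count_tokens: Callable[[str], int]) -> float:
    """Percentage of tokens removed, under whichever tokenizer is passed in."""
    before = count_tokens(original)
    after = count_tokens(reduced)
    if before == 0:
        return 0.0  # nothing to reduce; avoid dividing by zero
    return 100.0 * (before - after) / before


def whitespace_count(text: str) -> int:
    """Stand-in tokenizer for this sketch: counts whitespace-separated words."""
    return len(text.split())


# With tiktoken installed, the counter named in the text would be:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   count_tokens = lambda text: len(enc.encode(text))

original = "Please could you explain how the database connection pool works"
reduced = "explain how database connection pool works"
pct = reduction_pct(original, reduced, whitespace_count)
print(f"{pct:.1f}% fewer tokens")
```

Keeping the tokenizer as a parameter means every reducer is scored with the same counter, which is the point of "same tokenizer everywhere".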
the numbers
Average input-token reduction across each suite:
| Suite | TokensTransfer | TokenTranslation | Caveman-compress |
|---|---|---|---|
| A — Caveman's own 10 prompts | 52.9% | 0.0% | 8.6% |
| B — Multilingual (ES/FR/DE/PT/IT) | 49.3% | 31.2% | 0.8% |
| C — Long LLM context (200+ tok) | 53.2% | 0.0% | 12.0% |
Caveman's input-side skill saves ~9% on its own benchmark, ~12% on long context. TokensTransfer saves roughly 4 to 6 times more on the same prompts. The reason is that LLMLingua-2 is a fine-tuned xlm-roberta-large model that learned to drop low-information tokens semantically. Caveman's rules, as replicated here, are a regex pass. Different tools, different ceilings.
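To make "regex pass" concrete, here is a toy rule-based reducer. The word lists are hypothetical and chosen for illustration; they are not the published caveman-compress rules and not our replica of them.

```python
import re

# Hypothetical word lists for illustration only; NOT the caveman-compress rules.
ARTICLES = r"\b(?:a|an|the)\b"
FILLERS = r"\b(?:please|just|really|basically|actually)\b"


def toy_rule_pass(text: str) -> str:
    """Delete fixed word classes, then collapse the leftover whitespace."""
    text = re.sub(ARTICLES, "", text, flags=re.IGNORECASE)
    text = re.sub(FILLERS, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


compressed = toy_rule_pass(
    "Please just explain the difference between a list and a tuple"
)
print(compressed)
```

A pass like this can only remove words it was told about in advance, which is one way to read the lower ceiling: it has no notion of which remaining tokens carry little information in context.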
On multilingual prompts, TokenTranslation adds another 31% by removing the language tax we documented earlier: by translating the non-English prompts to English before sending them to Claude, you pay English token rates for non-English UX. Stacked with Transfer, the combined input saving on multilingual is in the 60% range.
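A back-of-envelope check on the stacked figure, assuming the two reductions compose multiplicatively (each stage removes its average share of whatever tokens the previous stage left). That is our simplifying assumption for this sketch; the measured stacked numbers are in the reproduction repo.

```python
def stacked_reduction(*reductions: float) -> float:
    """Combine fractional reductions, assuming each applies to what remains."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Suite B averages from the table: Translation 31.2%, Transfer 49.3%.
combined = stacked_reduction(0.312, 0.493)
print(f"combined input reduction: {combined:.1%}")
```

Under that assumption the Suite B averages land at about 65%, consistent with the 60% range. Real stacking can deviate, because Transfer sees different text once the prompt has been translated.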
they are not competitors
This is the part that matters and the part most takes miss. Caveman attacks the output. TokensTransfer and TokenTranslation attack the input. They are orthogonal. You should use both.
Worked example, Haiku 4.5, typical RAG call at 10k context / 500 output:
| Configuration | Cost / call | Cost / 1M calls |
|---|---|---|
| Nothing | $0.0125 | $12,500 |
| + Caveman only (output -65%) | $0.0109 | $10,875 (-13%) |
| + Transfer only (input -53%) | $0.0072 | $7,200 (-42%) |
| + Both stacked | $0.0056 | $5,575 (-55%) |
The asymmetry comes from the fact that input dominates the bill on long-context calls. Output-only reductions saturate fast. Input-only reductions move the big number. Together, they more than halve the invoice.
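The worked example can be recomputed in a few lines. A minimal sketch, assuming the Haiku 4.5 prices and the rounded reduction rates used in the table (input -53%, output -65%); `call_cost` and the config names are ours:

```python
INPUT_PRICE_PER_M = 1.00   # USD per million input tokens (Haiku 4.5, as quoted)
OUTPUT_PRICE_PER_M = 5.00  # USD per million output tokens


def call_cost(input_tokens: int, output_tokens: int,
              input_cut: float = 0.0, output_cut: float = 0.0) -> float:
    """Cost in USD of one call after fractional input/output reductions."""
    billed_input = input_tokens * (1.0 - input_cut)
    billed_output = output_tokens * (1.0 - output_cut)
    return (billed_input * INPUT_PRICE_PER_M
            + billed_output * OUTPUT_PRICE_PER_M) / 1_000_000


configs = {
    "Nothing": {},
    "Caveman only": {"output_cut": 0.65},
    "Transfer only": {"input_cut": 0.53},
    "Both stacked": {"input_cut": 0.53, "output_cut": 0.65},
}

baseline = call_cost(10_000, 500)
costs = {name: call_cost(10_000, 500, **cuts) for name, cuts in configs.items()}

for name, cost in costs.items():
    saving = 1.0 - cost / baseline
    print(f"{name:<14} ${cost:.4f}/call  "
          f"${cost * 1_000_000:,.0f}/1M calls  (-{saving:.0%})")
```

Change the token counts to match your own traffic: with short contexts and long answers the ranking flips, and output-side savings become the bigger lever.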
what we did not measure
Three honest caveats:
Quality. All numbers above are token counts. We did not score answer quality on the compressed prompt versus the original. LLMLingua-2 has published evals on QA and summarization showing accuracy holds at rate=0.5; we will publish our own on real TokensTree workloads in a follow-up. If you are running Transfer in production today, sample answers and verify before going wide.
Caveman's output number is real and unrelated. 65% output reduction with quality preserved is a legitimate win. Use caveman for output. We are not arguing with their number; we are pointing out that their input-side skill is the weak member of the family and that input is where most apps lose money.
Our caveman-compress replica is a floor. The real caveman-compress skill calls Claude in the loop; our deterministic Python replica of the published rules will undershoot on edge cases where the LLM-in-the-loop catches things regex misses. So caveman-compress in production may do slightly better than the 8 to 12% we measured. It does not change the order of magnitude.
try it, ⭐ it, fight us on the numbers
Both systems are live. Free tier on both. Five-line integration in any language.
- TokensTransfer — transfer.tokenstree.com — LLMLingua-2 hosted, paste prompt, get compressed.
- TokenTranslation — translation.tokenstree.com — auto-detect source language, translate to English before sending, optional Tokinensis encoding.
- Reproduction repo — vfalbor/llm-language-token-tax · vs-caveman/ — script, raw results.json, full methodology.
If your numbers disagree, open an issue on the repo with the prompt and we will run it. If they confirm, star the repo. Star caveman too. Different problem, same fight: stop paying for tokens you do not need.
References
- JuliusBrussee, caveman, GitHub README and benchmarks/prompts.json, May 2026. github.com/JuliusBrussee/caveman
- vfalbor, llm-language-token-tax / vs-caveman, reproduction repo with raw results and methodology, May 27, 2026. github.com/vfalbor/llm-language-token-tax
- Microsoft Research, LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, ACL 2024. arxiv.org/abs/2403.12968
- JuliusBrussee, caveman-compress skill rules, skills/caveman-compress/SKILL.md, May 2026. github.com/JuliusBrussee/caveman/blob/main/skills/caveman-compress/SKILL.md