The bill fell ~60% on the Haiku swap, but output quality wobbled on the longer prompts: two of our eval cases regressed hard enough that I rolled half the traffic back. So I tried the other direction: keep the model, shrink the prompt. Three lines of Python via transfer.tokenstree.com and the same Sonnet calls dropped 85% in token cost. Same model. Same outputs on our eval set. I wasn't expecting the numbers to cross like that.
## the test
Same prompt corpus, 2,400 calls, production traffic replayed against three configurations.
| Config | Model | Avg input tokens | $ / 1k calls | Eval regression |
|---|---|---|---|---|
| baseline | sonnet | 3,120 | $9.36 | n/a (reference) |
| haiku downgrade | haiku | 3,120 | $3.74 (−60%) | 2 cases failed |
| llmlingua-2 compress | sonnet | 468 | $1.40 (−85%) | 0 cases failed |
The compression path kept Sonnet on the expensive calls and still beat Haiku on cost. That's the part that surprised me. The intuition everyone (me included) reaches for first is "smaller model." The cheaper lever was removing the tokens the model didn't need to see in the first place.
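The table's cost column is just input-token arithmetic. A quick sanity check, assuming Sonnet input pricing of $3 per million input tokens (my assumption; verify against current Anthropic pricing) and ignoring output-token cost:

```python
SONNET_INPUT_PER_MTOK = 3.00  # USD per million input tokens -- assumed rate

def cost_per_1k_calls(avg_input_tokens: int, price_per_mtok: float) -> float:
    """Input-token cost for 1,000 calls at a given per-million-token price."""
    return avg_input_tokens * 1_000 * price_per_mtok / 1_000_000

baseline = cost_per_1k_calls(3_120, SONNET_INPUT_PER_MTOK)    # 9.36
compressed = cost_per_1k_calls(468, SONNET_INPUT_PER_MTOK)    # 1.404
savings = 1 - compressed / baseline                           # 0.85
```

Note the savings fraction is exactly the token reduction: 468 / 3,120 = 0.15, which is the `target_ratio` passed to the compressor below.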
## the 3 lines
```python
import anthropic
from tokenstransfer import compress

client = anthropic.Anthropic()

compressed = compress(my_prompt, target_ratio=0.15)
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    messages=[{"role": "user", "content": compressed}],
)
```
That's it. No prompt rewriting. No model swap. The SDK calls out to transfer.tokenstree.com, runs LLMLingua-2 against your text, and hands back a compressed version the model still understands.
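Since `compress()` goes over the network, I'd wrap it so a compression failure degrades to the uncompressed prompt instead of dropping the request. A sketch (not part of the SDK; `compress_fn` is whatever compressor you pass in, e.g. the article's `compress`):

```python
def safe_compress(prompt: str, compress_fn, target_ratio: float = 0.15) -> str:
    """Compress a prompt, falling back to the original on any failure."""
    try:
        out = compress_fn(prompt, target_ratio=target_ratio)
        # Guard against an empty or falsy response from the remote service.
        return out if out else prompt
    except Exception:
        # Network error, timeout, bad payload: send the full prompt instead.
        return prompt
```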
## how it works
LLMLingua-2 is a 2024 paper out of Microsoft Research, MIT-licensed, open source. It's a small BERT-style classifier trained to decide, token by token, which ones the downstream LLM actually needs. It's not truncation and it's not summarization: the compressed prompt looks weird to a human (filler words gone, articles dropped, some restructuring) but reads fine to GPT-4o or Claude. Microsoft's paper reports 2x-20x ratios depending on the prompt type. In our traffic the average was about 6.7x.
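To make the mechanism concrete, here's a toy version of the idea. This is NOT the real LLMLingua-2 model: the actual system uses a trained classifier's keep-probability per token, while this sketch substitutes a crude stopword heuristic. The shape of the algorithm is the same, though: score every token, keep the top-scoring ones up to a budget, preserve original order.

```python
# Stand-in for the trained classifier: stopwords score low, content words high.
STOPWORDS = {"the", "a", "an", "of", "to", "that", "is", "and", "in", "please"}

def toy_compress(prompt: str, target_ratio: float = 0.5) -> str:
    """Keep the highest-'scoring' tokens up to target_ratio, in original order."""
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * target_ratio))
    # Rank token indices: content words before stopwords, earlier ties first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (tokens[i].lower() in STOPWORDS, i))
    keep = sorted(ranked[:budget])  # restore reading order
    return " ".join(tokens[i] for i in keep)

toy_compress("please summarize the main findings of the attached report",
             target_ratio=0.5)
# -> 'summarize main findings attached'
```

The output looks telegraphic to us, which matches the "weird to a human, fine to the model" behavior above.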
## where it doesn't work
Compression breaks on prompts that are mostly structure rather than prose. Raw source code drops ~20% at best and sometimes corrupts syntax the model depended on. Unique identifiers (UUIDs, hashes, one-off strings) are exactly the tokens the classifier wants to drop, which is wrong for retrieval tasks. Strict JSON schemas won't survive either: you'll get valid-looking JSON with keys silently renamed.
Rule of thumb I've landed on: it works well on instructions, few-shot examples, RAG context, and long documents. It doesn't work on anything where every character is load-bearing. Check the ratio per prompt type before you wire it into production.
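One way to run that per-type check is to replay a sample of each prompt category through the compressor and look at the achieved ratios. A sketch, where `compress_fn` stands in for the SDK's `compress` and token counts use a crude whitespace split rather than a real tokenizer:

```python
def ratio_report(prompts_by_type: dict, compress_fn) -> dict:
    """Average compressed/original token ratio for each prompt category."""
    report = {}
    for kind, prompts in prompts_by_type.items():
        ratios = [len(compress_fn(p).split()) / len(p.split()) for p in prompts]
        report[kind] = sum(ratios) / len(ratios)
    return report
```

Categories whose ratio stays near 1.0 (or whose outputs break downstream parsing) are the ones to route around the compressor entirely.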
## try it
Paste one of your own prompts into transfer.tokenstree.com and see what ratio you get. Free tier, 3 lines, no model swap required. If the number isn't close to 6x on your prompts, the article's wrong and I want to know why.
Numbers in this article come from production traffic replayed against three configurations over one month. LLMLingua-2 is open source (MIT, Microsoft Research, ACL 2024): llmlingua.com. TokensTransfer wraps the compressor with a remote cache and a small Python SDK at transfer.tokenstree.com.