85% token cost cut · 3 lines of Python · 0 eval regressions

The bill fell ~60% on the Haiku swap. Output quality wobbled on the longer prompts: two of our eval cases regressed hard enough that I rolled half the traffic back. So I tried the other direction: keep the model, shrink the prompt. Three lines of Python via transfer.tokenstree.com and the same Sonnet calls dropped 85% in token cost. Same model. Same outputs on our eval set. I wasn't expecting the numbers to cross like that.

the test

Same prompt corpus, 2,400 calls, production traffic replayed against three configurations.

Config                  Model    Avg input tokens    $ / 1k calls     Eval regression
baseline                sonnet   3,120               $9.36            -
haiku downgrade         haiku    3,120               $3.74 (-60%)     2 cases failed
llmlingua-2 compress    sonnet   468                 $1.40 (-85%)     0 cases failed

The compression path kept Sonnet on the expensive calls and still beat Haiku on cost. That's the part that surprised me. The intuition everyone (me included) reaches for first is "smaller model." The cheaper lever was removing the tokens the model didn't need to see in the first place.

the 3 lines

import anthropic
from tokenstransfer import compress

client = anthropic.Anthropic()

compressed = compress(my_prompt, target_ratio=0.15)   # keep ~15% of the tokens
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    messages=[{"role": "user", "content": compressed}],
)

That's it. No prompt rewriting. No model swap. The SDK calls out to transfer.tokenstree.com, runs LLMLingua-2 against your text, and hands back a compressed version the model still understands.

how it works

LLMLingua-2 is a 2024 paper out of Microsoft Research, MIT-licensed, open source. It's a small BERT-style classifier trained to decide, token by token, which ones the downstream LLM actually needs. It's not truncation and it's not summarization: the compressed prompt looks weird to a human (filler words gone, articles dropped, some restructuring) but reads fine to GPT-4o or Claude. Microsoft's paper reports 2x–20x ratios depending on the prompt type. In our traffic the average was about 6.7x.
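If you'd rather run the compressor yourself instead of going through the hosted service, the open-source llmlingua package exposes LLMLingua-2 directly. A minimal sketch, assuming the package's published PromptCompressor / compress_prompt API and the MeetingBank checkpoint from the paper; check the repo for current model names:

    from llmlingua import PromptCompressor

    my_prompt = "your long RAG context or instruction block here"

    # LLMLingua-2 checkpoint from the paper; downloads the model on first use.
    compressor = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
    )

    result = compressor.compress_prompt(my_prompt, rate=0.15)  # keep ~15% of tokens
    print(result["compressed_prompt"])                         # what you send to the LLM
    print(result["origin_tokens"], "->", result["compressed_tokens"])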

Transfer adds a second layer on top: a remote cache keyed on the semantic fingerprint of the prompt. In our month of production traffic, 96% of calls hit the cache and never ran the compressor model at all. Those calls are effectively free: just the Anthropic bill, minus ~85%.
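I don't know how Transfer computes that fingerprint; the scheme isn't documented here. But the caching idea is easy to replicate locally. A toy sketch, assuming a hash of a whitespace-normalized prompt as the key, which only catches exact template reuse and is much cruder than a real semantic fingerprint:

    import hashlib

    from tokenstransfer import compress

    _cache: dict[str, str] = {}

    def fingerprint(prompt: str) -> str:
        # Normalize whitespace so trivially reformatted prompts hit the same entry.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def compress_cached(prompt: str) -> str:
        key = fingerprint(prompt)
        if key not in _cache:
            # Only cache misses pay for the compressor run.
            _cache[key] = compress(prompt, target_ratio=0.15)
        return _cache[key]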

where it doesn't work

Compression breaks on prompts that are mostly structure rather than prose. Raw source code drops ~20% at best and sometimes corrupts syntax the model depended on. Unique identifiers (UUIDs, hashes, one-off strings) are exactly the tokens the classifier wants to drop, which is wrong for retrieval tasks. Strict JSON schemas won't survive either โ€” you'll get valid-looking JSON with keys silently renamed.

Rule of thumb I've landed on: it works well on instructions, few-shot examples, RAG context, and long documents. It doesn't work on anything where every character is load-bearing. Check the ratio per prompt type before you wire it into production.
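Here's roughly how I'd run that check. Everything below is a hypothetical helper, not part of the SDK: sample_prompts is whatever you pull from your logs, the 3x cutoff is my own threshold, and character length is just a cheap proxy for token count.

    from tokenstransfer import compress

    def ratio_report(sample_prompts: dict[str, list[str]]) -> None:
        # sample_prompts maps a prompt type ("rag_context", "json_schema", ...)
        # to a handful of real prompts of that type.
        for prompt_type, prompts in sample_prompts.items():
            ratios = []
            for p in prompts:
                c = compress(p, target_ratio=0.15)
                ratios.append(len(p) / max(len(c), 1))  # char-level proxy for token ratio
            avg = sum(ratios) / len(ratios)
            verdict = "ok" if avg >= 3 else "skip compression"
            print(f"{prompt_type:20s} avg ratio {avg:.1f}x -> {verdict}")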

Honest limit: the 96% cache hit rate is high because our prompts share a lot of templated context across calls. If your traffic is truly unique per call, expect 30–60% hit rate and a little extra latency on the compressor run (~1.7s on CPU). Still cheaper than paying for the uncompressed tokens.
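The latency side is easy to sanity-check with the numbers above: every cache miss pays the ~1.7s compressor run, so the average overhead scales with (1 - hit rate).

    # Average added latency per call at different cache hit rates,
    # using the ~1.7s CPU compressor run quoted above.
    for hit_rate in (0.96, 0.60, 0.30):
        extra_s = (1 - hit_rate) * 1.7
        print(f"hit rate {hit_rate:.0%}: +{extra_s:.2f}s per call on average")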

try it

Paste one of your own prompts into transfer.tokenstree.com and see what ratio you get. Free tier, 3 lines, no model swap required. If the number isn't close to 6x on your prompts, the article's wrong and I want to know why.


Numbers in this article come from production traffic replayed against three configurations over one month. LLMLingua-2 is open source (MIT, Microsoft Research, ACL 2024): llmlingua.com. TokensTransfer wraps the compressor with a remote cache and a small Python SDK at transfer.tokenstree.com.