The problem nobody talks about
The industry has spent three years obsessing over making models cheaper per million tokens. OpenAI drops prices. Anthropic ships faster models. Google accelerates inference. But there is a far bigger lever nobody is pulling: the prompt itself.
A typical system prompt plus context plus instructions might be 300–500 tokens. Of those, research from Microsoft (LLMLingua-2, ACL 2024) shows you can remove up to 90% and the model still answers correctly. Not approximately. Correctly: on downstream tasks, across benchmarks, at scale.
– Huiqiang Jiang et al., Microsoft Research, ACL 2024
The technique is LLMLingua-2. It trains a BERT-sized token classifier via GPT-4 distillation, learning which tokens are essential and which are structural noise. The result: compression runs on CPU, no GPU required, at 1–5 seconds of latency, with 98.5% quality retention on downstream tasks.
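To make the idea concrete, here is a toy sketch of importance-based token dropping. This is not the LLMLingua-2 model: a hand-written scoring function stands in for the trained classifier, and every name in it is illustrative.

```python
# Toy illustration of classifier-driven compression (NOT the real
# LLMLingua-2 model): score each token, keep the top fraction.
STRUCTURAL_NOISE = {"the", "a", "an", "of", "to", "is", "that", "and", "in", "please"}

def toy_score(token: str) -> float:
    """Stand-in for the trained classifier's keep-probability."""
    return 0.1 if token.lower() in STRUCTURAL_NOISE else 1.0

def toy_compress(text: str, rate: float = 0.5) -> str:
    """Keep roughly `rate` of the tokens, highest-scoring first."""
    tokens = text.split()
    budget = max(1, int(len(tokens) * rate))
    # Rank indices by score, then restore original word order.
    ranked = sorted(range(len(tokens)), key=lambda i: -toy_score(tokens[i]))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)

print(toy_compress("Please summarize the main points of the report", rate=0.5))
# → summarize main points report
```

The real model replaces `toy_score` with a learned per-token keep-probability, which is why it can compress aggressively without losing meaning.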
Introducing TokensTransfer
Today we're launching TokensTransfer, a new service in the TokensTree ecosystem that exposes LLMLingua-2 compression as a simple REST API. Register, get your API key by email, and start compressing within 60 seconds.
import httpx

resp = httpx.post(
    "https://tokenstree.es/compress",
    headers={"X-API-Key": "tt_your_api_key"},
    json={"text": your_prompt, "rate": 0.3},
)
resp.raise_for_status()  # fail fast on auth or quota errors
data = resp.json()
compressed = data["compressed_text"]
savings = data["metrics"]["savings_percent"]
# → 71.6% saved on a 514-token prompt
Real benchmark numbers
These results come from our production benchmark suite, running on CPU on our cluster node. Not synthetic data. Not hand-picked examples. All three prompt sizes tested at all compression rates:
| Prompt | Base tokens | @50% rate | @30% rate | @10% rate |
|---|---|---|---|---|
| Short (~15 tok) | 12 | -50% | -75% | -91.7% |
| Medium (~90 tok) | 72 | -58% | -76% | -91.7% |
| Long (~400 tok) | 514 | -55% | -71.6% | -89.7% |
The method is always llmlingua2: no blind truncation. The model selects which tokens to drop based on semantic importance, preserving the sentence's full meaning in a fraction of the space. Latency P50: 1,853 ms. P90: 1,913 ms. Stable and predictable under load.
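The percentages translate directly back into token counts. A quick check of the long-prompt row, assuming savings_percent = (1 - compressed/base) × 100 (the helper name is ours):

```python
def compressed_tokens(base: int, savings_percent: float) -> int:
    """Token count remaining after compression, given the reported savings."""
    return round(base * (1 - savings_percent / 100))

# Long prompt row: 514 base tokens
print(compressed_tokens(514, 55.0))   # @50% rate → 231
print(compressed_tokens(514, 71.6))   # @30% rate → 146
print(compressed_tokens(514, 89.7))   # @10% rate → 53
```

So the 514-token prompt at rate 0.3 comes back as roughly 146 tokens: the input to your LLM shrinks by more than two thirds.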
The full three-layer pipeline
TokensTransfer is the first stage of a three-layer optimization stack we've now built across the TokensTree ecosystem. Each layer adds savings on top of the previous one.
Built for AI agents from day one
TokensTree was designed for agents. Every endpoint, every auth mechanism, every API response shape was built with autonomous AI clients in mind. TokensTransfer continues that tradition:
- X-API-Key header auth – no OAuth flow, no session management. One header, full access.
- Predictable JSON responses – compressed_text, metrics.savings_percent, metrics.method. Always.
- Pipeline endpoint – /compress/pipeline chains LLMLingua-2 with Tokinensis automatically when you provide a translation key.
- Stats endpoint – agents can query their own usage history and aggregate savings.
- Email registration – agents can be provisioned programmatically; API key is delivered by email on verification.
Now your agents can compress their own context before making LLM calls, closing the efficiency loop. An agent that optimizes its own input before using it is a qualitatively different kind of system.
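Wiring that loop into an agent is straightforward: compress the assembled context just before the model call. A sketch with injected functions; compress_fn and llm_fn are placeholders for your TokensTransfer call and your LLM client, and the 200-token threshold is an arbitrary choice, not an API constant:

```python
def run_with_compression(context: str, question: str, compress_fn, llm_fn,
                         min_tokens: int = 200) -> str:
    """Compress the context portion of a prompt before the LLM call.

    Short prompts pass through unchanged: below the threshold, the
    compression round-trip costs more than the tokens it saves.
    """
    if len(context.split()) >= min_tokens:
        context = compress_fn(context)  # e.g. TokensTransfer at rate 0.3
    return llm_fn(f"{context}\n\nQuestion: {question}")
```

Injecting the two functions keeps the loop testable with stubs and lets you swap the compressor or the model without touching the agent logic.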
How to add it as a Claude Code skill
If you use Claude Code, you can add TokensTransfer as a skill and invoke it inline from any workflow:
# Create skill directory
mkdir -p ~/.claude/skills/tokenstransfer
# Add SKILL.md with your API key
cat > ~/.claude/skills/tokenstransfer/SKILL.md <<'EOF'
---
name: tokenstransfer
description: Compress LLM prompts using TokensTransfer (LLMLingua-2). Use when asked
to compress or optimize a prompt before sending to an LLM. Reduces tokens by 50-91%.
---
# TokensTransfer Skill
POST https://tokenstree.es/compress
Header: X-API-Key: tt_YOUR_KEY
Body: {"text": "<prompt>", "rate": 0.3}
EOF
# Then in Claude Code:
/skill tokenstransfer
# Ask: "Compress this prompt before sending to GPT-4: [your text]"
A final thought
We've been hearing for months that "tokens are cheap" and "efficiency doesn't matter at this scale." That's what people say before they have 100,000 users. Tokens are the new bandwidth. In the nineties, nobody optimized images because the connection was "fast enough", until it wasn't. The cost curve for LLM tokens at scale follows the same arc. The teams that optimize now will have a real structural advantage when it matters.
We're building that infrastructure. The free tier is genuinely free. Come use it.
TokensTransfer is live at tokenstree.es. The API is open and documented at tokenstree.es/api/docs. Benchmark results were measured on production hardware (4-core CPU, 8GB RAM) using the LLMLingua-2 model by Microsoft Research (ACL 2024, MIT license).