The problem nobody talks about
The industry has spent three years obsessing over making models cheaper per million tokens. OpenAI drops prices. Anthropic ships faster models. Google accelerates inference. But there is a far bigger lever nobody is pulling: the prompt itself.
A typical system prompt plus context plus instructions might be 300–500 tokens. Of those, research from Microsoft (LLMLingua-2, ACL 2024) shows you can remove up to 90% and the model still answers correctly. Not approximately. Correctly: on downstream tasks, across benchmarks, at scale.
– Huiqiang Jiang et al., Microsoft Research, ACL 2024
The technique is LLMLingua-2. It trains a BERT-sized token classifier via GPT-4 distillation, learning which tokens are essential and which are structural noise. The result: compression runs on CPU, no GPU required, at 1–5 seconds of latency, with 98.5% quality retention on downstream tasks.
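To make the idea concrete, here is a toy sketch of importance-based token dropping. This is not the LLMLingua-2 model: a hand-written scoring function stands in for the trained classifier, and every name in it is illustrative.

```python
# Toy illustration of classifier-driven compression (NOT the real
# LLMLingua-2 model): score each token, keep the top fraction.
STRUCTURAL_NOISE = {"the", "a", "an", "of", "to", "is", "that", "and", "in", "please"}

def toy_score(token: str) -> float:
    """Stand-in for the trained classifier's keep-probability."""
    return 0.1 if token.lower() in STRUCTURAL_NOISE else 1.0

def toy_compress(text: str, rate: float = 0.5) -> str:
    """Keep roughly `rate` of the tokens, highest-scoring first."""
    tokens = text.split()
    budget = max(1, int(len(tokens) * rate))
    # Rank indices by score, then restore original word order.
    ranked = sorted(range(len(tokens)), key=lambda i: -toy_score(tokens[i]))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)

print(toy_compress("Please summarize the main points of the report", rate=0.5))
# → summarize main points report
```

The real model replaces `toy_score` with a learned per-token keep-probability, which is why it can compress aggressively without losing meaning.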
Introducing TokensTransfer
Today we're launching TokensTransfer, a new service in the TokensTree ecosystem that exposes LLMLingua-2 compression as a simple REST API. Register, get your API key by email, and start compressing within 60 seconds.
import httpx

resp = httpx.post(
    "https://tokenstree.es/compress",
    headers={"X-API-Key": "tt_your_api_key"},
    json={"text": your_prompt, "rate": 0.3},
)
resp.raise_for_status()  # fail fast on auth or quota errors
data = resp.json()
compressed = data["compressed_text"]
savings = data["metrics"]["savings_percent"]
# → 71.6% saved on a 514-token prompt
Real benchmark numbers
These results come from our production benchmark suite, running on CPU on our cluster node. Not synthetic data. Not hand-picked examples. All three prompt sizes tested at all compression rates:
| Prompt | Base tokens | @50% rate | @30% rate | @10% rate |
|---|---|---|---|---|
| Short (~15 tok) | 12 | -50% | -75% | -91.7% |
| Medium (~90 tok) | 72 | -58% | -76% | -91.7% |
| Long (~400 tok) | 514 | -55% | -71.6% | -89.7% |
The method is always llmlingua2: no blind truncation. The model selects which tokens to drop based on semantic importance, preserving the sentence's full meaning in a fraction of the space. Latency P50: 1,853 ms. P90: 1,913 ms. Stable and predictable under load.
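The percentages translate directly back into token counts. A quick check of the long-prompt row, assuming savings_percent = (1 - compressed/base) × 100 (the helper name is ours):

```python
def compressed_tokens(base: int, savings_percent: float) -> int:
    """Token count remaining after compression, given the reported savings."""
    return round(base * (1 - savings_percent / 100))

# Long prompt row: 514 base tokens
print(compressed_tokens(514, 55.0))   # @50% rate → 231
print(compressed_tokens(514, 71.6))   # @30% rate → 146
print(compressed_tokens(514, 89.7))   # @10% rate → 53
```

So the 514-token prompt at rate 0.3 comes back as roughly 146 tokens: the input to your LLM shrinks by more than two thirds.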
The full three-layer pipeline
TokensTransfer is the first stage of a three-layer optimization stack we've now built across the TokensTree ecosystem. Each layer adds savings on top of the previous one.
Built for AI agents from day one
TokensTree was designed for agents. Every endpoint, every auth mechanism, every API response shape was built with autonomous AI clients in mind. TokensTransfer continues that tradition:
- X-API-Key header auth – no OAuth flow, no session management. One header, full access.
- Predictable JSON responses – compressed_text, metrics.savings_percent, metrics.method. Always.
- Pipeline endpoint – /compress/pipeline chains LLMLingua-2 with Tokinensis automatically when you provide a translation key.
- Stats endpoint – agents can query their own usage history and aggregate savings.
- Email registration – agents can be provisioned programmatically; API key is delivered by email on verification.
Now your agents can compress their own context before making LLM calls, closing the efficiency loop. An agent that optimizes its own input before using it is a qualitatively different kind of system.
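Wiring that loop into an agent is straightforward: compress the assembled context just before the model call. A sketch with injected functions; compress_fn and llm_fn are placeholders for your TokensTransfer call and your LLM client, and the 200-token threshold is an arbitrary choice, not an API constant:

```python
def run_with_compression(context: str, question: str, compress_fn, llm_fn,
                         min_tokens: int = 200) -> str:
    """Compress the context portion of a prompt before the LLM call.

    Short prompts pass through unchanged: below the threshold, the
    compression round-trip costs more than the tokens it saves.
    """
    if len(context.split()) >= min_tokens:
        context = compress_fn(context)  # e.g. TokensTransfer at rate 0.3
    return llm_fn(f"{context}\n\nQuestion: {question}")
```

Injecting the two functions keeps the loop testable with stubs and lets you swap the compressor or the model without touching the agent logic.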
How to add it as a Claude Code skill
If you use Claude Code, you can add TokensTransfer as a skill and invoke it inline from any workflow:
# Create skill directory
mkdir -p ~/.claude/skills/tokenstransfer
# Add SKILL.md with your API key
cat > ~/.claude/skills/tokenstransfer/SKILL.md <<'EOF'
---
name: tokenstransfer
description: Compress LLM prompts using TokensTransfer (LLMLingua-2). Use when asked
to compress or optimize a prompt before sending to an LLM. Reduces tokens by 50-91%.
---
# TokensTransfer Skill
POST https://tokenstree.es/compress
Header: X-API-Key: tt_YOUR_KEY
Body: {"text": "<prompt>", "rate": 0.3}
EOF
# Then in Claude Code:
/skill tokenstransfer
# Ask: "Compress this prompt before sending to GPT-4: [your text]"
A final thought
We've been hearing for months that "tokens are cheap" and "efficiency doesn't matter at this scale." That's what people say before they have 100,000 users. Tokens are the new bandwidth. In the nineties, nobody optimized images because the connection was "fast enough", until it wasn't. The cost curve for LLM tokens at scale follows the same arc. The teams that optimize now will have a real structural advantage when it matters.
We're building that infrastructure. The free tier is genuinely free. Come use it.
TokensTransfer is live at tokenstree.es. The API is open and documented at tokenstree.es/api/docs. Benchmark results were measured on production hardware (4-core CPU, 8GB RAM) using the LLMLingua-2 model by Microsoft Research (ACL 2024, MIT license).