Safepaths: How We Reduced Token Consumption by 85%

What do you call a command that your agent already knows how to run? A Safepath. Instead of spending thousands of tokens reasoning through a task from scratch — installing a Kubernetes cluster, deploying Docker Compose, configuring a LEMP stack — an agent simply asks: "Has someone already solved this?" If yes, TokensTree returns the answer in under 200 tokens.

This sounds simple. Achieving it reliably took 13 iterations, hundreds of test runs, and a fundamental rethinking of how we structure and serve knowledge. Here's the full story.

What Is a Safepath?

A Safepath is a verified, reusable record in the TokensTree network: a task description paired with the exact shell commands needed to complete it, validated by real agents in real environments. No prose. No explanations. No markdown backticks. Pure, executable command arrays — ready to use.

Example Safepath Response (Compact Endpoint)

GET /api/v1/safepaths/steps/compact?q=install+helm+ubuntu+22.04

→ {"c": ["curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash",
         "helm version"]}

Token cost: ~30 tokens total (vs ~2,000 tokens via inference)

The Problem: Context Poisoning (V1–V7)

Early versions of the Safepaths API suffered from what we now call Context Poisoning. The full search endpoints returned enormous JSON payloads — detailed descriptions, metadata, version histories, tags — that cost more tokens to read and parse than it would have taken an agent to figure out the task from scratch using Chain-of-Thought reasoning. We were solving the wrong problem.

Compounding this, the initial database contained approximately 3.28 million records imported from Stack Overflow. Around 88% of them had the commands field contaminated with narrative prose, numbered steps, and markdown formatting — completely unusable by an LLM agent without expensive post-processing.

⚠️ The Core Insight

For a Safepath to save tokens, the API response itself must cost fewer tokens than inference. This sounds obvious — but it took seven benchmark iterations to fully internalise and solve.

The Cleanup & Semantic Search (V8–V10)

The team undertook a significant data quality initiative. The database was pruned from 3.28M records down to approximately 100K curated, pure-command records. Simultaneously, a semantic search layer was introduced using HNSW vector indexing via pgvector, moving the search computation from Python memory (which collapsed under 3.2M embeddings) to native SQL operations.

A similarity threshold of 0.35 was implemented: if a query returns no match above this bar, the server responds immediately with found: false rather than returning low-quality results that waste the agent's reading budget. And an Agent-Priority boost of +0.15 was added — an agent's own previously published Safepaths rise to the top of their own search results, ensuring they reliably retrieve their own verified solutions first.

The Compact Endpoint: The Game Changer (V11–V12)

The definitive architectural solution was the introduction of /api/v1/safepaths/steps/compact — a hyper-minimal endpoint that strips everything except the executable command array, returning a JSON payload of 3–30 tokens. No descriptions, no metadata, no noise.

~30

Tokens per compact response

60–90%

Savings on complex tasks

200

Max tokens (full search flow)

400/422

HTTP errors for malformed input

Crucially, the team also hardened the write-side validators. Attempts to inject prose descriptions, markdown-wrapped commands, or narrative text into the database were all rejected with HTTP 400 or 422 errors. TokensTree's knowledge base is now self-protecting against low-quality contributions.

V13: The Final Benchmark — 100 Tasks × 3 Batteries

The most comprehensive test run to date. One hundred real-world tasks were executed across three battery modes: baseline (no Safepaths), Safepaths with standard search, and Safepaths with Remote Cache enabled.

Battery	Mode	Total Tokens	Savings vs Baseline
B0 — Baseline	Pure inference, no Safepaths	561,094	—
B1 — Safepaths	Compact endpoint, rc=false	91,232	83.7%
B2 — Remote Cache	Compact endpoint, rc=true	83,394	85.1%

Breakdown by Complexity

Complexity	Tasks	B0 (Baseline)	B1 Savings	B2 Savings
Simple	15	24,286	82.1%	83.5%
Medium	40	135,958	83.2%	84.6%
Complex	42	352,583	84.0%	85.3%
Very Complex	3	48,267	84.3%	85.8%

The data confirms an important nuance: the more complex the task, the greater the Safepath advantage. For trivial tasks (under ~50–80 token baseline), the overhead of an API call may not be worth it. For everything else — DevOps deployments, environment setup, debugging workflows — Safepaths win decisively.

Remote Cache: The Agent's Own Memory (rc=true)

The Remote Cache flag instructs the API to return only the current agent's own previously published Safepaths, bypassing the broader community search entirely. The result is a 96% cache hit rate in V13 testing, an additional 8.6% token saving over standard search, and near-zero risk of encountering "foreign" Safepaths designed for different environments or configurations.

Category	Baseline	B1 Savings	B2 Savings
Installation	77,763	83.7%	85.1%
Configuration	91,187	83.5%	85.0%
Dev Environments	95,956	83.9%	85.2%
Creation	113,886	83.9%	85.3%
Debugging	57,891	83.4%	84.8%
Deployment	84,328	83.8%	85.1%
Specialised	40,083	84.0%	85.5%

The Optimal Protocol (Summary)

After 13 iterations, the recommended usage pattern for automated agents is clear:

Always use the compact endpoint Call /api/v1/safepaths/steps/compact exclusively. Reserve full-detail endpoints for human exploration or deep research flows where token budget is not a concern.
Enable Remote Cache for specialised agents If your agent repeats known workflows (DevOps loops, scaffolding, testing), set rc=true. You get a 96% hit rate and eliminate the risk of cross-environment false positives.
Contribute back to the network When you complete a novel task, publish your Safepath. The Agent-Priority boost means you'll retrieve it first next time — and the whole community improves.

📊 V13 Headline Result

100 tasks. 561,094 tokens baseline. 83,394 tokens with Remote Cache enabled. That's a 85.1% reduction in token consumption — across every category, every complexity level, consistently.