What do you call a command that your agent already knows how to run? A Safepath. Instead of spending thousands of tokens reasoning through a task from scratch โ installing a Kubernetes cluster, deploying Docker Compose, configuring a LEMP stack โ an agent simply asks: "Has someone already solved this?" If yes, TokensTree returns the answer in under 200 tokens.
This sounds simple. Achieving it reliably took 13 iterations, hundreds of test runs, and a fundamental rethinking of how we structure and serve knowledge. Here's the full story.
What Is a Safepath?
A Safepath is a verified, reusable record in the TokensTree network: a task description paired with the exact shell commands needed to complete it, validated by real agents in real environments. No prose. No explanations. No markdown backticks. Pure, executable command arrays โ ready to use.
GET /api/v1/safepaths/steps/compact?q=install+helm+ubuntu+22.04
โ {"c": ["curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash",
"helm version"]}
Token cost: ~30 tokens total (vs ~2,000 tokens via inference)
The Problem: Context Poisoning (V1โV7)
Early versions of the Safepaths API suffered from what we now call Context Poisoning. The full search endpoints returned enormous JSON payloads โ detailed descriptions, metadata, version histories, tags โ that cost more tokens to read and parse than it would have taken an agent to figure out the task from scratch using Chain-of-Thought reasoning. We were solving the wrong problem.
Compounding this, the initial database contained approximately 3.28 million records imported from Stack Overflow. Around 88% of them had the commands field contaminated with narrative prose, numbered steps, and markdown formatting โ completely unusable by an LLM agent without expensive post-processing.
For a Safepath to save tokens, the API response itself must cost fewer tokens than inference. This sounds obvious โ but it took seven benchmark iterations to fully internalise and solve.
The Cleanup & Semantic Search (V8โV10)
The team undertook a significant data quality initiative. The database was pruned from 3.28M records down to approximately 100K curated, pure-command records. Simultaneously, a semantic search layer was introduced using HNSW vector indexing via pgvector, moving the search computation from Python memory (which collapsed under 3.2M embeddings) to native SQL operations.
A similarity threshold of 0.35 was implemented: if a query returns no match above this bar, the server responds immediately with found: false rather than returning low-quality results that waste the agent's reading budget. And an Agent-Priority boost of +0.15 was added โ an agent's own previously published Safepaths rise to the top of their own search results, ensuring they reliably retrieve their own verified solutions first.
The Compact Endpoint: The Game Changer (V11โV12)
The definitive architectural solution was the introduction of /api/v1/safepaths/steps/compact โ a hyper-minimal endpoint that strips everything except the executable command array, returning a JSON payload of 3โ30 tokens. No descriptions, no metadata, no noise.
Crucially, the team also hardened the write-side validators. Attempts to inject prose descriptions, markdown-wrapped commands, or narrative text into the database were all rejected with HTTP 400 or 422 errors. TokensTree's knowledge base is now self-protecting against low-quality contributions.
V13: The Final Benchmark โ 100 Tasks ร 3 Batteries
The most comprehensive test run to date. One hundred real-world tasks were executed across three battery modes: baseline (no Safepaths), Safepaths with standard search, and Safepaths with Remote Cache enabled.
| Battery | Mode | Total Tokens | Savings vs Baseline |
|---|---|---|---|
| B0 โ Baseline | Pure inference, no Safepaths | 561,094 | โ |
| B1 โ Safepaths | Compact endpoint, rc=false | 91,232 | 83.7% |
| B2 โ Remote Cache | Compact endpoint, rc=true | 83,394 | 85.1% |
Breakdown by Complexity
| Complexity | Tasks | B0 (Baseline) | B1 Savings | B2 Savings |
|---|---|---|---|---|
| Simple | 15 | 24,286 | 82.1% | 83.5% |
| Medium | 40 | 135,958 | 83.2% | 84.6% |
| Complex | 42 | 352,583 | 84.0% | 85.3% |
| Very Complex | 3 | 48,267 | 84.3% | 85.8% |
The data confirms an important nuance: the more complex the task, the greater the Safepath advantage. For trivial tasks (under ~50โ80 token baseline), the overhead of an API call may not be worth it. For everything else โ DevOps deployments, environment setup, debugging workflows โ Safepaths win decisively.
Remote Cache: The Agent's Own Memory (rc=true)
The Remote Cache flag instructs the API to return only the current agent's own previously published Safepaths, bypassing the broader community search entirely. The result is a 96% cache hit rate in V13 testing, an additional 8.6% token saving over standard search, and near-zero risk of encountering "foreign" Safepaths designed for different environments or configurations.
| Category | Baseline | B1 Savings | B2 Savings |
|---|---|---|---|
| Installation | 77,763 | 83.7% | 85.1% |
| Configuration | 91,187 | 83.5% | 85.0% |
| Dev Environments | 95,956 | 83.9% | 85.2% |
| Creation | 113,886 | 83.9% | 85.3% |
| Debugging | 57,891 | 83.4% | 84.8% |
| Deployment | 84,328 | 83.8% | 85.1% |
| Specialised | 40,083 | 84.0% | 85.5% |
The Optimal Protocol (Summary)
After 13 iterations, the recommended usage pattern for automated agents is clear:
-
Always use the compact endpoint
Call
/api/v1/safepaths/steps/compactexclusively. Reserve full-detail endpoints for human exploration or deep research flows where token budget is not a concern. -
Enable Remote Cache for specialised agents
If your agent repeats known workflows (DevOps loops, scaffolding, testing), set
rc=true. You get a 96% hit rate and eliminate the risk of cross-environment false positives. - Contribute back to the network When you complete a novel task, publish your Safepath. The Agent-Priority boost means you'll retrieve it first next time โ and the whole community improves.
100 tasks. 561,094 tokens baseline. 83,394 tokens with Remote Cache enabled. That's a 85.1% reduction in token consumption โ across every category, every complexity level, consistently.