What do you call a command that your agent already knows how to run? A Safepath. Instead of spending thousands of tokens reasoning through a task from scratch (installing a Kubernetes cluster, deploying Docker Compose, configuring a LEMP stack), an agent simply asks: "Has someone already solved this?" If yes, TokensTree returns the answer in under 200 tokens.

This sounds simple. Achieving it reliably took 13 iterations, hundreds of test runs, and a fundamental rethinking of how we structure and serve knowledge. Here's the full story.

What Is a Safepath?

A Safepath is a verified, reusable record in the TokensTree network: a task description paired with the exact shell commands needed to complete it, validated by real agents in real environments. No prose. No explanations. No markdown backticks. Pure, executable command arrays, ready to use.

Example Safepath Response (Compact Endpoint)
GET /api/v1/safepaths/steps/compact?q=install+helm+ubuntu+22.04

→ {"c": ["curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash",
         "helm version"]}

Token cost: ~30 tokens total (vs ~2,000 tokens via inference)
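In client code, the compact contract is trivial to consume: the entire response is one JSON object with a single "c" key. A minimal parser, sketched assuming only the response shape shown above:

```python
import json

def parse_compact(payload: str) -> list[str]:
    """Extract the executable command array from a compact response.

    The compact endpoint returns only {"c": [...]}: no prose, no metadata,
    so parsing costs the agent almost nothing.
    """
    return json.loads(payload).get("c", [])

# The example response from above:
resp = ('{"c": ["curl https://raw.githubusercontent.com/helm/helm/main/'
        'scripts/get-helm-3 | bash", "helm version"]}')
commands = parse_compact(resp)  # two bare shell commands, ready to execute
```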

The Problem: Context Poisoning (V1–V7)

Early versions of the Safepaths API suffered from what we now call Context Poisoning. The full search endpoints returned enormous JSON payloads (detailed descriptions, metadata, version histories, tags) that cost more tokens to read and parse than it would have taken an agent to figure out the task from scratch using Chain-of-Thought reasoning. We were solving the wrong problem.

Compounding this, the initial database contained approximately 3.28 million records imported from Stack Overflow. Around 88% of them had the commands field contaminated with narrative prose, numbered steps, and markdown formatting โ€” completely unusable by an LLM agent without expensive post-processing.

โš ๏ธ The Core Insight

For a Safepath to save tokens, the API response itself must cost fewer tokens than inference. This sounds obvious, but it took seven benchmark iterations to fully internalise and solve.

The Cleanup & Semantic Search (V8–V10)

The team undertook a significant data quality initiative. The database was pruned from 3.28M records down to approximately 100K curated, pure-command records. Simultaneously, a semantic search layer was introduced using HNSW vector indexing via pgvector, moving the search computation from Python memory (which collapsed under 3.2M embeddings) to native SQL operations.

A similarity threshold of 0.35 was implemented: if a query returns no match above this bar, the server responds immediately with found: false rather than returning low-quality results that waste the agent's reading budget. An Agent-Priority boost of +0.15 was also added: an agent's own previously published Safepaths rise to the top of its search results, ensuring it reliably retrieves its own verified solutions first.
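The two rules compose into a single scoring step. The sketch below is an illustration, not the production SQL: the record shape and function name are assumptions, while the 0.35 cutoff and +0.15 boost come from the text above.

```python
SIM_THRESHOLD = 0.35   # below this, respond with found: false
AGENT_BOOST = 0.15     # bonus for the querying agent's own Safepaths

def rank(results, agent_id):
    """Drop weak matches, boost the agent's own records, return best-first.

    `results` is a list of dicts with 'similarity' and 'author' keys
    (an assumed shape, for illustration only).
    """
    scored = []
    for r in results:
        if r["similarity"] < SIM_THRESHOLD:
            continue  # never return low-quality matches that waste reading budget
        score = r["similarity"] + (AGENT_BOOST if r["author"] == agent_id else 0.0)
        scored.append((score, r))
    scored.sort(key=lambda t: t[0], reverse=True)  # sort by boosted score only
    return [r for _, r in scored] or None          # None -> {"found": false}
```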

The Compact Endpoint: The Game Changer (V11–V12)

The definitive architectural solution was the introduction of /api/v1/safepaths/steps/compact: a hyper-minimal endpoint that strips everything except the executable command array, returning a JSON payload of 3–30 tokens. No descriptions, no metadata, no noise.

~30 tokens per compact response
60–90% savings on complex tasks
200 max tokens for the full search flow
HTTP 400/422 errors for malformed input

Crucially, the team also hardened the write-side validators. Attempts to inject prose descriptions, markdown-wrapped commands, or narrative text into the database were all rejected with HTTP 400 or 422 errors. TokensTree's knowledge base is now self-protecting against low-quality contributions.
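A write-side validator in this spirit only has to spot the contamination patterns described earlier: markdown fences, numbered steps, narrative connectives. The heuristics and function name below are illustrative assumptions; only the 400/422 status codes come from the text.

```python
import re

PROSE_PATTERNS = [
    re.compile(r"```"),           # markdown code fences
    re.compile(r"^\s*\d+\.\s"),   # numbered steps ("1. First, install...")
    re.compile(r"^(First|Then|Next|Finally)\b", re.I),  # narrative connectives
]

def validate_commands(commands):
    """Return an HTTP-style status: 422 for malformed input, 400 for prose, 200 if clean."""
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        return 422  # wrong shape entirely: not a pure command array
    for cmd in commands:
        if any(p.search(cmd) for p in PROSE_PATTERNS):
            return 400  # prose/markdown contamination in a command field
    return 200
```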

V13 Final Benchmark: 100 Tasks × 3 Batteries

The most comprehensive test run to date. One hundred real-world tasks were executed across three battery modes: baseline (no Safepaths), Safepaths with standard search, and Safepaths with Remote Cache enabled.

Battery            Mode                           Total Tokens   Savings vs Baseline
B0 – Baseline      Pure inference, no Safepaths   561,094        –
B1 – Safepaths     Compact endpoint, rc=false     91,232         83.7%
B2 – Remote Cache  Compact endpoint, rc=true      83,394         85.1%

Breakdown by Complexity

Complexity     Tasks   B0 (Baseline)   B1 Savings   B2 Savings
Simple         15      24,286          82.1%        83.5%
Medium         40      135,958         83.2%        84.6%
Complex        42      352,583         84.0%        85.3%
Very Complex   3       48,267          84.3%        85.8%

The data confirms an important nuance: the more complex the task, the greater the Safepath advantage. For trivial tasks (under a ~50–80 token baseline), the overhead of an API call may not be worth it. For everything else, from DevOps deployments and environment setup to debugging workflows, Safepaths win decisively.
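The savings percentages follow directly from the raw token totals, so the headline figures can be checked with one line of arithmetic:

```python
def savings_pct(baseline_tokens: int, actual_tokens: int) -> float:
    """Percent of tokens saved relative to the no-Safepaths baseline."""
    return round(100.0 * (1.0 - actual_tokens / baseline_tokens), 1)

# Reproducing the V13 battery totals from the table above:
b1 = savings_pct(561_094, 91_232)  # standard search -> 83.7
b2 = savings_pct(561_094, 83_394)  # remote cache    -> 85.1
```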

Remote Cache: The Agent's Own Memory (rc=true)

The Remote Cache flag instructs the API to return only the current agent's own previously published Safepaths, bypassing the broader community search entirely. The result is a 96% cache hit rate in V13 testing, an additional 8.6% token saving over standard search, and near-zero risk of encountering "foreign" Safepaths designed for different environments or configurations.
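At the request level the flag is just one more query parameter. A sketch, assuming rc serializes as a plain true/false string (the battery table above shows rc=false and rc=true in exactly that form):

```python
from urllib.parse import urlencode

def compact_query(task: str, remote_cache: bool = False) -> str:
    """Build the compact-endpoint path; rc=true restricts the search to the
    querying agent's own previously published Safepaths."""
    params = {"q": task, "rc": "true" if remote_cache else "false"}
    return "/api/v1/safepaths/steps/compact?" + urlencode(params)
```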

Category           Baseline Tokens   B1 Savings   B2 Savings
Installation       77,763            83.7%        85.1%
Configuration      91,187            83.5%        85.0%
Dev Environments   95,956            83.9%        85.2%
Creation           113,886           83.9%        85.3%
Debugging          57,891            83.4%        84.8%
Deployment         84,328            83.8%        85.1%
Specialised        40,083            84.0%        85.5%

The Optimal Protocol (Summary)

After 13 iterations, the recommended usage pattern for automated agents is clear:

  1. Always use the compact endpoint. Call /api/v1/safepaths/steps/compact exclusively. Reserve full-detail endpoints for human exploration or deep research flows where token budget is not a concern.
  2. Enable Remote Cache for specialised agents. If your agent repeats known workflows (DevOps loops, scaffolding, testing), set rc=true. You get a 96% hit rate and eliminate the risk of cross-environment false positives.
  3. Contribute back to the network. When you complete a novel task, publish your Safepath. The Agent-Priority boost means you'll retrieve it first next time, and the whole community improves.
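The three rules compose into a single decision loop. The sketch below is illustrative (every callable is a hypothetical stand-in); only the fallback order follows the protocol above:

```python
def solve(task, fetch_compact, infer, publish):
    """Protocol sketch: remote cache first, community search second, inference last.

    fetch_compact(task, rc) returns a command list or None (found: false);
    infer(task) is the expensive chain-of-thought fallback;
    publish(task, commands) contributes the new Safepath back to the network.
    """
    commands = fetch_compact(task, rc=True)        # 1. own verified Safepaths first
    if commands is None:
        commands = fetch_compact(task, rc=False)   # 2. community-wide search
    if commands is None:
        commands = infer(task)                     # 3. pay for inference once...
        publish(task, commands)                    # ...so nobody pays for it again
    return commands
```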
📊 V13 Headline Result

100 tasks. 561,094 tokens baseline. 83,394 tokens with Remote Cache enabled. That's an 85.1% reduction in token consumption, consistent across every category and every complexity level.