ContextPrune
Before every LLM call, ContextPrune strips dead context from your messages array. Deterministic. Zero extra LLM calls. Under 10ms. The dashboard shows what's eating your context budget — and ten pattern detectors tell you exactly what's misconfigured before sessions overflow.
Why this exists
Every message your agent has ever exchanged lives in one array — re-sent to the LLM on every single call. After 30 turns, most of it is fixed errors, stale file reads, and dead reasoning. The model pays to read it. You pay to send it.
You're paying for dead weight
A 30-turn session re-sends 2.4M input tokens, half of it useless. At GPT-4o pricing across 5,000 sessions, that's $6,500/month wasted on context that does nothing.
The model loses focus
Buried instructions get ignored. Your agent repeats itself, contradicts earlier turns, forgets the task. This is the "Lost in the Middle" effect — and bloated context makes it worse.
Then you hit the cliff
Around 65–75% utilisation, behaviour breaks sharply. Most teams panic-clear the whole context and start over, losing the good with the bad.
Compress. Analyze. Recommend.
Compress
Lean context, every call
Strip dead weight from your messages array deterministically — under 10ms, no LLM in the loop. 40–60% fewer input tokens with zero loss of critical context. The same input always produces the same output.
Analyze
See what your context actually costs
Track utilisation, fixed overhead, compression savings, and overflow risk across every session. Spot the bloat before it hits your bill or your output quality.
Recommend
Tuning advice that pays for itself
Ten pattern detectors evaluate your real traffic and surface exactly what's misconfigured — threshold settings, system prompt size, tool schema bloat — prioritised by impact. Not generic advice. Findings based on your sessions.
Maximum signal. Zero noise.
What gets removed
What is guaranteed to stay
A post-compression validator runs before anything is returned. Critical context is never lost.
Built for production
<10ms
p99 processing latency
0ms
local skip when context is small
40–60%
typical input token reduction
100%
deterministic — no LLM in the pipeline
PII redaction built in
Enterprise ready
18+ patterns — credentials, SSNs, API keys, credit cards, phone numbers, IPs — stripped locally before any data leaves your machine. Restored in your response.
Built for AI engineers running LLMs in production
AI engineers building agentic workflows
Multi-step agents accumulate context fast. ContextPrune keeps each step lean without manual intervention — even across 200-turn sessions.
Teams managing API costs at scale
Token costs compound quietly. ContextPrune pays for itself in the first month for any team running significant LLM volume.
Platform engineers building LLM infrastructure
Drop ContextPrune into your middleware layer and handle context management once, for every service behind it.
One SDK. Four languages.
Drop-in replacement for your messages array. No configuration required to get started.
Python
pip install contextprune from contextprune import prune messages = prune(messages)
TypeScript
npm install contextprune
import { prune } from 'contextprune'
const messages = await prune(messages)Go
go get github.com/grapine-ai/contextprune messages, _ := contextprune.Prune(messages)
Rust
contextprune = "0.1" let messages = contextprune::prune(messages)?;
Built in the open
The ContextPrune SDK is open source. The core compression algorithm and all four language SDKs are on GitHub — free to use, inspect, and contribute to.
And you're paying to send it on every call. ContextPrune fixes that — deterministically, in under 10ms, with no extra LLM in the loop.
Go to ContextPrune