What is ContextPrune?

ContextPrune is an SDK (available in Python, TypeScript, Go, and Rust) that performs garbage collection on LLM context windows. It strips dead context from your messages array before every LLM call — deterministically, in under 10ms, with no extra LLM calls — reducing input tokens by 40–60%.

How does ContextPrune reduce token costs?

ContextPrune removes stale tool results, resolved errors, intermediate reasoning, redundant status updates, and duplicate tool calls from your messages array. It preserves system prompts, user corrections, unresolved errors, and the session goal. The result is 40–60% fewer input tokens per call.

What languages does ContextPrune support?

ContextPrune has SDKs for Python, TypeScript, Go, and Rust. Each SDK has an identical one-function API: pass in your messages array, get back the pruned array.

How fast is ContextPrune?

ContextPrune runs deterministically in under 10ms at p99. It skips processing entirely (0ms) when context is small enough to not need pruning. There is no LLM in the pipeline — it is pure algorithmic compression.

What does ContextPrune remove from context?

ContextPrune removes: stale tool results, resolved errors, intermediate reasoning, redundant status updates, and duplicate tool calls.

What does ContextPrune preserve?

ContextPrune never touches system prompts, user corrections and overrides, the session's original goal, or any unresolved errors. Critical context is always preserved.

ContextPrune

Garbage collection for LLM context windows.

Before every LLM call, ContextPrune strips dead context from your messages array. Deterministic. Zero extra LLM calls. Under 10ms. The dashboard shows what's eating your context budget — and ten pattern detectors tell you exactly what's misconfigured before sessions overflow.

deterministiczero extra LLMproactive recommendations

See the docs

Why this exists

Every message your agent has ever exchanged lives in one array — re-sent to the LLM on every single call. After 30 turns, most of it is fixed errors, stale file reads, and dead reasoning. The model pays to read it. You pay to send it.

You're paying for dead weight

A 30-turn session re-sends 2.4M input tokens, half of it useless. At GPT-4o pricing across 5,000 sessions, that's $6,500/month wasted on context that does nothing.

The model loses focus

Buried instructions get ignored. Your agent repeats itself, contradicts earlier turns, forgets the task. This is the "Lost in the Middle" effect — and bloated context makes it worse.

Then you hit the cliff

Around 65–75% utilisation, behaviour breaks sharply. Most teams panic-clear the whole context and start over, losing the good with the bad.

Compress. Analyze. Recommend.

What it does

Compress

Lean context, every call

Strip dead weight from your messages array deterministically — under 10ms, no LLM in the loop. 40–60% fewer input tokens with zero loss of critical context. The same input always produces the same output.

Analyze

See what your context actually costs

Track utilisation, fixed overhead, compression savings, and overflow risk across every session. Spot the bloat before it hits your bill or your output quality.

Recommend

Tuning advice that pays for itself

Ten pattern detectors evaluate your real traffic and surface exactly what's misconfigured — threshold settings, system prompt size, tool schema bloat — prioritised by impact. Not generic advice. Findings based on your sessions.

Maximum signal. Zero noise.

What you're guaranteed

What gets removed

Stale tool results — file reads no longer referenced
Resolved errors — stack traces from fixed bugs
Intermediate reasoning — collapsed to one line
Status updates and redundant confirmations
Duplicate tool calls — deduplicated to most recent

What is guaranteed to stay

System prompts and instructions — pinned, never touched
User corrections and overrides — preserved across all turns
The session's original goal
Any error that is still unresolved

A post-compression validator runs before anything is returned. Critical context is never lost.

Built for production

<10ms

p99 processing latency

0ms

local skip when context is small

40–60%

typical input token reduction

100%

deterministic — no LLM in the pipeline

PII redaction built in

Enterprise ready

18+ patterns — credentials, SSNs, API keys, credit cards, phone numbers, IPs — stripped locally before any data leaves your machine. Restored in your response.

Built for AI engineers running LLMs in production

Who it's for

AI engineers building agentic workflows

Multi-step agents accumulate context fast. ContextPrune keeps each step lean without manual intervention — even across 200-turn sessions.

Teams managing API costs at scale

Token costs compound quietly. ContextPrune pays for itself in the first month for any team running significant LLM volume.

Platform engineers building LLM infrastructure

Drop ContextPrune into your middleware layer and handle context management once, for every service behind it.

One SDK. Four languages.

Install once. Works the same way across your entire stack.

Drop-in replacement for your messages array. No configuration required to get started.

Python

pip install contextprune

from contextprune import prune
messages = prune(messages)

TypeScript

npm install contextprune

import { prune } from 'contextprune'
const messages = await prune(messages)

go get github.com/grapine-ai/contextprune

messages, _ := contextprune.Prune(messages)

Rust

contextprune = "0.1"

let messages = contextprune::prune(messages)?;

Built in the open

The ContextPrune SDK is open source. The core compression algorithm and all four language SDKs are on GitHub — free to use, inspect, and contribute to.

View on GitHub

Your context window is mostly garbage.

And you're paying to send it on every call. ContextPrune fixes that — deterministically, in under 10ms, with no extra LLM in the loop.

Go to ContextPrune