LLM Context Compression Benchmarks

How SuperCompress compares against truncation, H2O, and summarization on oracle recall, token savings, and latency. The fixed-budget benchmarks run on 8 seeds at a 35% token budget; adaptive-mode results are reported separately below.

Policy Comparison at 35% Token Budget

| Policy | Oracle Recall | Entity Recall | Latency | KV Savings | Model Size |
|---|---|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms | ~65% | 0 (rule-based) |
| Summarization | 61% | 65% | ~63 ms | ~65% | LLM call |
| H2O (Heavy Hitter Oracle) | 98% | 73% | ~56 ms | ~65% | attention-based |
| SuperCompress (best) | 100% | 73% | ~60 ms | ~65% | ~5K params |

Benchmarked on 8 project seeds. Oracle recall = answer-critical lines preserved. Entity recall = named entities retained. KV savings = cache reduction at fixed budget.
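As a rough illustration of the entity recall and savings definitions above, here is a minimal sketch. It assumes the named entities are already available as a list of strings and uses a case-insensitive substring match; the function names and the sample data are hypothetical and are not taken from the SuperCompress benchmark code.

```python
def entity_recall(entities, compressed_text):
    """Fraction of the given named entities that still appear in the compressed text."""
    if not entities:
        return 1.0  # nothing could be lost
    text = compressed_text.lower()
    kept = sum(1 for e in entities if e.lower() in text)
    return kept / len(entities)


def token_savings(original_tokens, kept_tokens):
    """Fraction of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - kept_tokens / original_tokens


# Made-up example data, for illustration only.
entities = ["Atticus", "Maycomb", "Scout", "Boo Radley"]
compressed = "Scout walked through Maycomb with Atticus."

print(entity_recall(entities, compressed))   # 0.75 (3 of 4 entities retained)
print(round(token_savings(1000, 350), 2))    # 0.65 (keeping 35% removes 65%)
```

Keeping 35% of the tokens removes 65%, which is why every policy in the table reports the same ~65% KV savings at the fixed budget.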

Adaptive Mode — Real-World Long Context

On long agent contexts (conversations, documents, code), adaptive mode typically removes 85–95% of tokens while keeping query-critical lines intact. This differs from the fixed-budget measurement above because adaptive mode picks the budget dynamically.

| Context Type | Original Tokens | After Compression | Tokens Removed | Savings |
|---|---|---|---|---|
| To Kill a Mockingbird (full chapter) | ~4,200 | ~420 | ~3,780 | ~90% |
| Long coding session log | ~1,800 | ~270 | ~1,530 | ~85% |
| Markdown documentation | ~2,100 | ~315 | ~1,785 | ~85% |
| Average | ~2,700 | ~335 | ~2,365 | ~87% |
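The removed-token and savings columns follow directly from the first two columns. A small sketch of that arithmetic, using the approximate token counts from the table (the helper name `savings` is only for illustration):

```python
# Approximate (original, after-compression) token counts from the table above.
rows = {
    "To Kill a Mockingbird (full chapter)": (4200, 420),
    "Long coding session log": (1800, 270),
    "Markdown documentation": (2100, 315),
}


def savings(original, after):
    """Return (tokens removed, fraction of tokens removed)."""
    removed = original - after
    return removed, removed / original


for name, (original, after) in rows.items():
    removed, frac = savings(original, after)
    print(f"{name}: removed {removed}, savings {frac:.0%}")

# Column averages across the three contexts.
avg_original = sum(o for o, _ in rows.values()) / len(rows)
avg_after = sum(a for _, a in rows.values()) / len(rows)
print(f"Average: {avg_original:.0f} -> {avg_after:.0f} tokens")
```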

Visual Benchmarks

[Figure: Oracle recall, answer-critical lines retained at 35% token budget. SuperCompress 100%, H2O ~98%, FIFO / truncation ~25%.]

[Figure: Adaptive mode token savings on real-world contexts. Adaptive KV savings on long-context presets are typically 85–95%.]

Environmental Impact at Scale

These estimates are based on the documented SuperCompress assumptions: 2,500 tokens per GPU-second, a 150 W GPU, a 55% KV share, and 0.417 kg CO₂/kWh.

| Scale | Tokens Avoided | kWh Saved | CO₂ Avoided | Water Saved (est.) |
|---|---|---|---|---|
| 1 model call | ~800 | ~0.00003 | ~0.00001 kg | ~0.0001 L |
| 1,000 calls | ~800K | ~0.03 | ~0.01 kg | ~0.1 L |
| 1M calls | ~800M | ~29 | ~12 kg | ~100 L |
| 10M calls | ~8B | ~290 | ~120 kg | ~1,000 L |
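The CO₂ column can be reproduced from the kWh column and the stated grid intensity of 0.417 kg CO₂/kWh. The sketch below does only that conversion: the kWh figures are taken as given from the table, and the full derivation from tokens to kWh is left to the methodology guide. The function and constant names are illustrative.

```python
GRID_INTENSITY_KG_PER_KWH = 0.417  # grid intensity from the stated assumptions


def co2_avoided_kg(kwh_saved):
    """Convert energy saved (kWh) into CO2 avoided (kg)."""
    return kwh_saved * GRID_INTENSITY_KG_PER_KWH


# (number of calls, kWh saved) pairs copied from the table above.
for calls, kwh in [(1_000, 0.03), (1_000_000, 29), (10_000_000, 290)]:
    print(f"{calls:>10,} calls: {kwh} kWh -> {co2_avoided_kg(kwh):.2f} kg CO2")
```

At 1M calls this gives roughly 12.09 kg, in line with the ~12 kg shown in the table.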

Full methodology: Environment guide.

Frequently Asked Questions

What is oracle recall?

Oracle recall measures how many of the lines that contain the answer to a specific question are preserved after compression. 100% oracle recall means every answer-critical line is kept. This is the most important quality metric for context compression.
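A minimal sketch of the metric, assuming answer-critical lines and kept lines are both identified by line index (the function name and the sample indices are hypothetical):

```python
def oracle_recall(critical_lines, kept_lines):
    """Share of answer-critical line indices that survive compression."""
    critical = set(critical_lines)
    if not critical:
        return 1.0  # no critical lines, so nothing could be lost
    return len(critical & set(kept_lines)) / len(critical)


# All three critical lines were kept.
print(oracle_recall({3, 17, 42}, {0, 1, 3, 17, 42, 99}))  # 1.0

# Only one of four critical lines was kept.
print(oracle_recall({3, 17, 42, 60}, {0, 1, 3, 98, 99}))  # 0.25
```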

How is SuperCompress different from truncation?

Truncation keeps only the head and tail of the context, dropping the middle. If the answer-critical line sits in the middle (which it often does), truncation loses it. SuperCompress scores every line against the question and keeps only the most relevant ones, regardless of position.
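The difference can be sketched on a toy context. Note the assumption: SuperCompress uses a small learned policy, whereas the scorer below is a simple word-overlap count swapped in purely for illustration. All function names and the sample lines are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def head_tail_truncate(lines, keep):
    """Keep the first and last lines, dropping the middle. Returns kept indices."""
    if keep >= len(lines):
        return list(range(len(lines)))
    head = (keep + 1) // 2
    tail = keep - head
    return list(range(head)) + list(range(len(lines) - tail, len(lines)))


def relevance_select(lines, question, keep):
    """Keep the `keep` lines sharing the most words with the question.

    Word overlap is a stand-in scorer, not the SuperCompress policy.
    """
    q = tokens(question)
    ranked = sorted(range(len(lines)), key=lambda i: (-len(q & tokens(lines[i])), i))
    return sorted(ranked[:keep])


lines = [
    "Session started at 09:00.",
    "User opened the settings page.",
    "The database password was rotated to use vault key K7.",
    "User closed the settings page.",
    "Session ended at 09:30.",
]
question = "Which vault key does the database password use?"

print(head_tail_truncate(lines, 2))          # [0, 4]: the answer on line 2 is dropped
print(relevance_select(lines, question, 2))  # [1, 2]: the answer on line 2 is kept
```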

Does SuperCompress require GPU or extra LLM calls?

No. SuperCompress runs entirely on CPU, with ~5K parameters and ~60 ms latency on the benchmark seeds. It requires no GPU time and no extra LLM calls: it is a small learned policy that runs before the language model.

How does adaptive mode differ from fixed budget?

Fixed budget mode compresses to a target token percentage (e.g., keep 35% of tokens). Adaptive mode analyzes the context and query to dynamically pick the budget, often removing 85–95% of tokens on long contexts while still preserving answer-critical lines.
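The two modes can be contrasted with a short sketch. Everything here is an assumption for illustration: the per-line scores are made up, whitespace splitting stands in for a real tokenizer, and the threshold rule in `select_adaptive` is a hypothetical stand-in, since this page does not describe how SuperCompress actually picks its adaptive budget.

```python
def count_tokens(line):
    """Whitespace word count, a crude stand-in for a real tokenizer."""
    return len(line.split())


def select_fixed_budget(lines, scores, budget=0.35):
    """Greedily keep the highest-scoring lines until `budget` of the tokens is used."""
    limit = budget * sum(count_tokens(line) for line in lines)
    kept, used = [], 0
    for i in sorted(range(len(lines)), key=lambda i: (-scores[i], i)):
        cost = count_tokens(lines[i])
        if used + cost <= limit:
            kept.append(i)
            used += cost
    return sorted(kept)


def select_adaptive(lines, scores, threshold=0.5):
    """Hypothetical adaptive rule: keep every line scoring at or above `threshold`."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Ten 4-token lines with made-up relevance scores.
lines = [f"line {i} filler text" for i in range(10)]
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.3, 0.1, 0.1, 0.1]

print(select_fixed_budget(lines, scores))  # [1, 4, 6]: fills up to 35% of 40 tokens
print(select_adaptive(lines, scores))      # [1, 4]: keeps only the clearly relevant lines
```

In this toy case the fixed budget keeps a weakly relevant line just to fill its quota, while the threshold rule ends up keeping 20% of the tokens because only two lines score highly.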

Can I run SuperCompress with any LLM?

Yes. SuperCompress is model-agnostic. It compresses the context before sending it to the language model, so it works with OpenAI, Anthropic, open-weight models, or any LLM that accepts text input.

Is there a hosted API?

Yes. The SuperCompress hosted API is available at supercompress.vercel.app/api/v1/compress. Get a free API key from the dashboard to get started. The Python client library wraps both local and API modes.

Try SuperCompress on your own context

Paste your long prompts and see exactly how much can be removed while keeping what matters.
