LLM Context Compression Benchmarks
This page compares SuperCompress against truncation, H2O, and summarization on oracle recall, token savings, and latency. All benchmarks were run on 8 seeds at a fixed 35% token budget.
Policy Comparison at 35% Token Budget
| Policy | Oracle Recall | Entity Recall | Latency | KV Savings | Model Size / Mechanism |
|---|---|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms | ~65% | 0 (rule-based) |
| Summarization | 61% | 65% | ~63 ms | ~65% | LLM call |
| H2O (Heavy Hitter Oracle) | 98% | 73% | ~56 ms | ~65% | attention-based |
| SuperCompress Best | 100% | 73% | ~60 ms | ~65% | ~5K params |
Adaptive Mode — Real-World Long Context
On long agent contexts (conversations, documents, code), adaptive mode typically removes 85–95% of tokens while keeping query-critical lines intact. This differs from the fixed-budget measurement above because adaptive mode picks the budget dynamically.
| Context Type | Original Tokens | After Compression | Tokens Removed | Savings |
|---|---|---|---|---|
| To Kill a Mockingbird (full chapter) | ~4,200 | ~420 | ~3,780 | ~90% |
| Long coding session log | ~1,800 | ~270 | ~1,530 | ~85% |
| Markdown documentation | ~2,100 | ~315 | ~1,785 | ~85% |
| Average | ~2,700 | ~335 | ~2,365 | ~87% |
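The savings column follows directly from the token counts. As a quick check, the short sketch below recomputes the per-row savings and the averages from the approximate figures reported in the table; the variable and function names are illustrative only.

```python
# Approximate token counts copied from the table above: (original, after compression).
rows = {
    "To Kill a Mockingbird (full chapter)": (4200, 420),
    "Long coding session log": (1800, 270),
    "Markdown documentation": (2100, 315),
}


def savings(original: int, compressed: int) -> float:
    """Fraction of tokens removed by compression."""
    return (original - compressed) / original


per_row = {name: savings(o, c) for name, (o, c) in rows.items()}

# Column averages, matching the "Average" row.
avg_original = sum(o for o, _ in rows.values()) / len(rows)
avg_compressed = sum(c for _, c in rows.values()) / len(rows)
avg_savings = sum(per_row.values()) / len(per_row)

for name, value in per_row.items():
    print(f"{name}: {value:.0%}")
print(f"Average: {avg_original:.0f} -> {avg_compressed:.0f} tokens ({avg_savings:.0%} saved)")
```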
Environmental Impact at Scale
These estimates are based on the documented SuperCompress assumptions: 2,500 tok/GPU-s, a 150 W GPU, a 55% KV share, and 0.417 kg CO₂/kWh.
| Scale | Tokens Avoided | kWh Saved | CO₂ Avoided | Water Saved (est.) |
|---|---|---|---|---|
| 1 model call | ~800 | ~0.00003 | ~0.00001 kg | ~0.0001 L |
| 1,000 calls | ~800K | ~0.03 | ~0.01 kg | ~0.1 L |
| 1M calls | ~800M | ~29 | ~12 kg | ~100 L |
| 10M calls | ~8B | ~290 | ~120 kg | ~1,000 L |
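Two columns of this table can be reproduced with simple arithmetic: tokens avoided scale linearly with the number of calls (about 800 per call), and CO₂ avoided is the kWh figure multiplied by the stated 0.417 kg CO₂/kWh. The sketch below takes the kWh values from the table as given. It does not try to derive them from the throughput, power, and KV-share assumptions, because the exact formula is not spelled out here. Function names are illustrative.

```python
TOKENS_AVOIDED_PER_CALL = 800  # approximate per-call figure from the table
CO2_KG_PER_KWH = 0.417         # grid intensity from the stated assumptions


def tokens_avoided(calls: int) -> int:
    """Tokens avoided scale linearly with the number of model calls."""
    return calls * TOKENS_AVOIDED_PER_CALL


def co2_avoided_kg(kwh_saved: float) -> float:
    """Convert saved energy to avoided CO2 using the stated grid intensity."""
    return kwh_saved * CO2_KG_PER_KWH


# kWh values are copied from the table, not derived here.
kwh_by_scale = {"1,000 calls": 0.03, "1M calls": 29, "10M calls": 290}
co2 = {scale: co2_avoided_kg(kwh) for scale, kwh in kwh_by_scale.items()}

for scale, kg in co2.items():
    print(f"{scale}: ~{kg:.2f} kg CO2 avoided")
```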
Frequently Asked Questions
What is oracle recall?
Oracle recall is the fraction of answer-critical lines (the lines that contain the answer to a specific question) that are preserved after compression. 100% oracle recall means every answer-critical line is kept. This is the most important quality metric for context compression.
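As a minimal sketch of the definition above, oracle recall can be computed from two sets of line indices: the answer-critical ("oracle") lines and the lines kept by the compressor. The function name and the handling of an empty oracle set are assumptions made for illustration, not part of the SuperCompress benchmark code.

```python
def oracle_recall(oracle_lines: set[int], kept_lines: set[int]) -> float:
    """Fraction of answer-critical lines that survive compression."""
    if not oracle_lines:
        # Assumption: with nothing to recall, treat recall as perfect.
        return 1.0
    return len(oracle_lines & kept_lines) / len(oracle_lines)


# Hypothetical example: four answer-critical lines, three of them kept.
oracle = {3, 7, 12, 20}
kept = {0, 1, 3, 12, 19, 20}
print(f"{oracle_recall(oracle, kept):.0%}")  # 3 of 4 oracle lines are kept
```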
How is SuperCompress different from truncation?
Truncation keeps only the head and tail of the context, dropping the middle. If the answer-critical line sits in the middle (which it often does), truncation loses it. SuperCompress scores every line against the question and keeps only the most relevant ones, regardless of position.
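The difference is easy to see on a toy example. The sketch below contrasts head-and-tail truncation with position-independent scoring. The scorer here is a plain word-overlap count used as a stand-in; SuperCompress uses a small learned policy, which this example does not reproduce. All function names and the sample context are hypothetical.

```python
import re


def head_tail_truncate(lines: list[str], keep: int) -> list[str]:
    """Keep the first and last lines of the context, dropping the middle."""
    head = (keep + 1) // 2
    tail = keep - head
    return lines[:head] + (lines[-tail:] if tail else [])


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(line: str, question: str) -> int:
    """Stand-in scorer: number of words shared with the question."""
    return len(words(line) & words(question))


def score_and_keep(lines: list[str], question: str, keep: int) -> list[str]:
    """Keep the highest-scoring lines regardless of position, in original order."""
    ranked = sorted(range(len(lines)),
                    key=lambda i: overlap_score(lines[i], question),
                    reverse=True)
    return [lines[i] for i in sorted(ranked[:keep])]


context = [
    "Session started at 09:00.",
    "User opened the billing dashboard.",
    "The deploy key for staging is rotated every Friday.",
    "User closed the billing dashboard.",
    "Session ended at 09:30.",
]
question = "When is the staging deploy key rotated?"
answer_line = context[2]

truncated = head_tail_truncate(context, keep=2)
scored = score_and_keep(context, question, keep=2)
print("truncation keeps answer:", answer_line in truncated)
print("scoring keeps answer:", answer_line in scored)
```

With the answer in the middle of the context, truncation drops it while the scored selection retains it.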
Does SuperCompress require GPU or extra LLM calls?
No. SuperCompress runs entirely on CPU with ~5K parameters and ~60 ms latency on benchmark seeds. It requires zero GPU time and zero extra LLM calls: it is a small learned policy that runs before the language model.
How does adaptive mode differ from fixed budget?
Fixed budget mode compresses to a target token percentage (e.g., keep 35% of tokens). Adaptive mode analyzes the context and query to dynamically pick the budget, often removing 85–95% of tokens on long contexts while still preserving answer-critical lines.
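To make fixed-budget mode concrete, here is a minimal sketch that keeps the highest-scoring lines until a 35% token budget is used up. It assumes per-line relevance scores are already available and counts tokens by whitespace splitting, both of which are simplifications for illustration. How adaptive mode picks its budget is not described here, so the sketch covers only the fixed case.

```python
def keep_within_budget(lines: list[str], scores: list[float],
                       budget_fraction: float = 0.35) -> list[str]:
    """Greedily keep top-scoring lines whose total tokens fit the budget."""
    def count(line: str) -> int:
        return len(line.split())  # assumption: whitespace tokens

    budget = budget_fraction * sum(count(line) for line in lines)
    kept, used = [], 0
    # Visit lines from most to least relevant; skip any that would overflow.
    for i in sorted(range(len(lines)), key=lambda i: scores[i], reverse=True):
        cost = count(lines[i])
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [lines[i] for i in sorted(kept)]  # restore original order


# Hypothetical context: 21 tokens in total, so a 35% budget allows 7.35 tokens.
lines = ["a b c d e", "f g h", "i j k l", "m n o p q r", "s t u"]
scores = [0.1, 0.9, 0.8, 0.2, 0.5]
print(keep_within_budget(lines, scores))
```

The greedy loop keeps checking lower-scoring lines after one fails to fit, which fills the budget more fully but is only one of several reasonable policies.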
Can I run SuperCompress with any LLM?
Yes. SuperCompress is model-agnostic. It compresses the context before sending it to the language model, so it works with OpenAI, Anthropic, open-weight models, or any LLM that accepts text input.
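Because compression happens before the model call, the pipeline can be sketched as two interchangeable callables. In the example below, `compress` and `llm` are placeholders supplied by the caller; neither is the actual SuperCompress client nor any vendor SDK, and the prompt layout is an assumption.

```python
from typing import Callable


def answer_with_compression(
    context: str,
    question: str,
    compress: Callable[[str, str], str],  # placeholder: (context, question) -> shorter context
    llm: Callable[[str], str],            # placeholder: prompt -> model response
) -> str:
    """Compress the context first, then send the shorter prompt to any text-in LLM."""
    compact = compress(context, question)
    prompt = f"{compact}\n\nQuestion: {question}"
    return llm(prompt)


# Stand-ins for demonstration only.
def demo_compress(context: str, question: str) -> str:
    return "\n".join(line for line in context.splitlines() if "key" in line)


def demo_llm(prompt: str) -> str:
    return f"prompt had {len(prompt.splitlines())} lines"


print(answer_with_compression("intro\nthe key is 42\noutro", "What is the key?",
                              demo_compress, demo_llm))
```

Swapping providers means changing only the `llm` callable; the compression step stays the same.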
Is there a hosted API?
Yes. The SuperCompress hosted API is available at supercompress.vercel.app/api/v1/compress. Get a free API key from the dashboard to get started. The Python client library wraps both local and API modes.