Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory and compute that growing context demands. Most existing solutions either degrade model accuracy, require the full context to load before compression begins, or produce memory savings that don’t translate into real speedups in standard serving infrastructure.

A research team from NYU, Columbia, Princeton, University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a novel fix. The researchers introduce Latent Context Language Models, or LCLMs, a family of encoder-decoder models that compress input context before it reaches the decoder. The models are open-sourced on Hugging Face.

KV cache compression, the dominant approach in the field, still materializes the full KV cache before evicting entries. LCLMs instead compress the input token sequence before decoder prefill, so higher compression ratios directly reduce decoder-side compute and memory. The paper reports that LCLMs at 16x compression produced output 8.8 times faster than KV cache baselines on the RULER long-context benchmark.

“These ballooning contexts take up memory and compute, and they are becoming a computational bottleneck for LLMs,” Micah Goldblum, co-lead advisor on the project and a researcher at Columbia University, told VentureBeat. “Our goal was to train language models end-to-end that can handle very long contexts efficiently and accurately. If you can make such a language model, everything becomes cheaper and faster.”

What LCLMs can do

LCLMs let models process much longer contexts than would otherwise be practical, at a fraction of the memory and compute cost, without the accuracy degradation that makes most compression methods a poor tradeoff in production.

At 4x compression, the paper reports accuracy of 91.76% on the RULER benchmark, compared to 94.41% with no compression at all. That is a drop of less than 3 percentage points for cutting context to a quarter of its original size. At 16x compression, where the decoder sees 93.75% fewer input positions, accuracy fell to 75.06%. Every KV cache method tested at the same compression ratio scored lower.
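The arithmetic behind those figures is easy to check. The short Python sketch below recomputes the share of input positions eliminated at each ratio and the accuracy gap against the uncompressed run, using only the numbers quoted above. The function names are illustrative and do not come from the paper.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input positions eliminated at a given compression ratio."""
    return 1.0 - 1.0 / ratio


def compressed_length(tokens: int, ratio: int) -> int:
    """Positions the decoder sees, rounding up for a partial final block."""
    return -(-tokens // ratio)


# RULER accuracy figures as quoted in this article (ratio -> accuracy in %).
ruler_accuracy = {1: 94.41, 4: 91.76, 16: 75.06}

for ratio, acc in ruler_accuracy.items():
    drop = ruler_accuracy[1] - acc
    print(f"{ratio:>2}x: {fraction_removed(ratio):.2%} removed, "
          f"{acc:.2f}% accuracy, {drop:.2f} points below uncompressed")
```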

The gains hold on shorter inputs too. On GSM8K math word problems, where the full prompt is compressed rather than just retrieved documents, LCLMs outscored every other method tested regardless of compression ratio.

Credit: End-to-End Context Compression at Scale research paper https://arxiv.org/pdf/2606.09659

How it was built

The architecture pairs a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes those in place of the original tokens. Training ran across more than 350 billion tokens.
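To make the shape of that pipeline concrete, here is a toy sketch in Python. It is not the paper's encoder: mean pooling stands in for the learned 0.6B model, and the function name, the two-dimensional vectors and the block size are all assumptions chosen for illustration. The only point is that each block of input vectors becomes a single latent vector, so the decoder receives a sequence that is shorter by the compression ratio.

```python
from typing import List

Vector = List[float]


def compress_blocks(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Collapse each run of `ratio` token vectors into one latent vector.

    Mean pooling is a placeholder for the learned encoder. The final
    block may hold fewer than `ratio` vectors.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    latents: List[Vector] = []
    for start in range(0, len(embeddings), ratio):
        block = embeddings[start:start + ratio]
        dim = len(block[0])
        latents.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return latents


# 32 toy "token embeddings" of dimension 2 (made-up values).
tokens = [[float(i), float(i) * 2.0] for i in range(32)]
latents = compress_blocks(tokens, ratio=16)
print(len(tokens), "input vectors ->", len(latents), "latent vectors")
```

In the real system the encoder is trained end-to-end with the decoder, which is what lets the latents carry far more information than a simple average would.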

The training recipe mixes three data types:

Continual pre-training data with compressed and uncompressed spans interleaved throughout
Supervised fine-tuning data covering reasoning and long-context tasks
An auxiliary reconstruction task that pushes the encoder to retain fine-grained detail

The combination addresses a tradeoff that limited earlier compression work, where preserving reconstruction accuracy came at the cost of general task performance.

An architecture search identified the optimal configuration. The paper found that scaling the decoder matters more than scaling the encoder.

Where it fits in an agentic stack

An LCLM is not an abstract research concept. It is designed to work with an existing stack. “You can simply swap out LCLMs for any existing LLM,” Goldblum said. “Whenever you retrieve data such as documents and want to dump it into your model’s context, simply run those documents through the LCLM’s compressor first.”

He noted that in the research paper, the researchers demonstrated how to build agents that selectively decompress useful text.

“Think about this like a human skimming content before zooming in on relevant details,” Goldblum said.
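The control flow Goldblum describes can be sketched in a few lines of Python. Everything here is a stand-in: `skim` truncates text instead of producing latent embeddings, `relevance` counts shared words instead of asking a model, and none of these names come from the LCLM code. The sketch only shows the skim-then-zoom pattern, in which every document enters the context in cheap form and just the most useful one is expanded in full.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    doc_id: str
    text: str


def skim(doc: Doc, keep_words: int = 8) -> str:
    """Stand-in for the compressed view: the first few words only."""
    return " ".join(doc.text.split()[:keep_words])


def relevance(query: str, skimmed: str) -> int:
    """Stand-in scorer: number of query words present in the skimmed view."""
    return len(set(query.lower().split()) & set(skimmed.lower().split()))


def build_context(query: str, docs: List[Doc], zoom_k: int = 1) -> List[str]:
    """Keep every doc in skimmed form, then expand the top `zoom_k` in full."""
    ranked = sorted(docs, key=lambda d: relevance(query, skim(d)), reverse=True)
    zoomed = {d.doc_id for d in ranked[:zoom_k]}
    return [d.text if d.doc_id in zoomed else skim(d) for d in docs]


docs = [
    Doc("a", "Quarterly revenue report for the retail division covering stores in the northeast region"),
    Doc("b", "GPU memory limits during long context inference and how cache size grows with every token"),
]
context = build_context("gpu memory inference", docs)
for entry in context:
    print(entry)
```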

Goldblum also cautioned that teams integrating the approach into existing agentic pipelines will need to tune their RAG systems accordingly.

“We also haven’t worked on online compression of reasoning traces,” he said. “The naive approach of just occasionally compressing the trace while generating it might work, but that remains to be determined.”

What this means for enterprises

Context windows are growing faster than inference infrastructure can keep up, and enterprises are already spending to fix it. VB Pulse Q1 2026 survey data from organizations with 100 or more employees shows intent to adopt hybrid retrieval more than tripling, from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents.

Three things stand out for teams evaluating production fit:

Inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports LCLMs at 16x compression remain within memory bounds at that context length.
RAG pipeline integration requires tuning. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale.
Reasoning trace compression is unsolved. For agents running long reasoning chains, context growth from the trace is a separate problem from document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression might work but has not been tested.
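For the first of those points, a back-of-the-envelope estimate shows why context length drives memory. The Python sketch below uses the standard KV cache size formula (keys plus values, per layer, per position). The decoder shape in `CONFIG` is hypothetical and was not taken from the paper, and the estimate ignores model weights, activations and the encoder's own cost, so treat the output as illustrative only.

```python
def kv_cache_bytes(positions: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys plus values stored for every position in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * positions


# Hypothetical decoder shape with 16-bit values, chosen only for illustration.
CONFIG = dict(layers=36, kv_heads=8, head_dim=128)

context_tokens = 1_000_000
for ratio in (1, 4, 16):
    positions = context_tokens // ratio
    gb = kv_cache_bytes(positions, **CONFIG) / 1e9
    print(f"{ratio:>2}x compression: {positions:>9,} positions, "
          f"~{gb:.1f} GB of KV cache")
```

Under these assumed dimensions, the uncompressed cache alone comes to roughly 147 GB, more than the 141 GB of memory on a single H200, while the 16x case needs about 9 GB. A different model shape would shift the numbers, but the linear dependence on the number of positions is what makes compressing the input before prefill pay off.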

The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM.

“The biggest things our architectures do is give your model access to much larger contexts, but they also unlock multiscale approaches where your model can skim vast amounts of text or code super fast and then only zooms in and fully reads a small portion of the most useful text,” Goldblum said.
