TradePoint.io

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

March 27, 2026
in AI & Technology
Reading Time: 11 mins read

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.

The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already demonstrated in preliminary tests on the 744-billion-parameter GLM-5 model.


The DSA bottleneck

Large language models rely on the self-attention mechanism, a process in which the model computes the relationship between each token in its context and all of the tokens that precede it in order to predict the next token.

However, self-attention has a severe limitation. Its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish inference speeds and significant compute and memory costs.
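To make the scaling concrete, here is a minimal sketch (illustrative only, not from the paper) of why causal self-attention is quadratic: each token is scored against itself and every predecessor, so doubling the context roughly quadruples the work.

```python
def attention_pairs(n):
    """Causal self-attention scores each token against itself and every
    predecessor: n*(n+1)/2 score computations, i.e. O(n^2) in context
    length."""
    return n * (n + 1) // 2

# Doubling the context roughly quadruples the attention work:
ratio = attention_pairs(200_000) / attention_pairs(100_000)
print(round(ratio, 2))  # ≈ 4.0
```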

Sparse attention offers a principled solution to this scaling problem. Instead of calculating the relationship between every token and all preceding ones, sparse attention optimizes the process by having each query select and attend to only the most relevant subset of tokens.
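A minimal sketch of this idea in plain Python (an illustration of top-k sparse attention in general, not DeepSeek's implementation; the dot-product relevance score and the toy vectors are our own):

```python
import math

def sparse_attention_step(q, keys, values, k=4):
    """One query attends only to its k highest-scoring past tokens.

    Illustrative sketch: relevance is a plain dot product, and
    non-selected tokens are skipped entirely, so the core attention
    cost per query is O(k) instead of O(n)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(key, q) for key in keys]
    # Indices of the k highest-scoring tokens.
    topk = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    m = max(scores[i] for i in topk)
    w = [math.exp(scores[i] - m) for i in topk]   # softmax over the subset
    z = sum(w)
    d = len(values[0])
    out = [sum(w[j] / z * values[i][c] for j, i in enumerate(topk))
           for c in range(d)]
    return out, topk

# Toy example: the query is most similar to keys 0 and 3.
out, picked = sparse_attention_step(
    q=[1.0, 0.0],
    keys=[[5.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 0.0]],
    values=[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    k=2,
)
print(sorted(picked))  # → [0, 3]
```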

DeepSeek Sparse Attention (DSA) architecture (source: arXiv)

DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this concept, first introduced in DeepSeek-V3.2. To determine which tokens matter most, DSA introduces a lightweight “lightning indexer module” at every layer of the model. This indexer scores all preceding tokens and selects a small batch for the main core attention mechanism to process. By doing this, DSA slashes the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality.

But the researchers identified a lingering flaw: the DSA indexer itself still operates at a quadratic complexity at every single layer. Even though the indexer is computationally cheaper than the main attention process, as context lengths grow, the time the model spends running these indexers skyrockets. This severely slows down the model, especially during the initial “prefill” stage where the prompt is first processed.


The DSA indexing tax increases with context length (source: arXiv)

Caching attention with IndexCache

To solve the indexer bottleneck, the research team discovered a crucial characteristic of how DSA models process data. The subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests on DSA models revealed that adjacent layers share between 70% and 100% of their selected tokens.
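The overlap the researchers measured can be expressed as a simple set statistic; the index sets below are hypothetical, chosen only to illustrate the computation:

```python
def index_overlap(selected_a, selected_b):
    """Fraction of one layer's selected token indices that the adjacent
    layer also selects (illustrative; these index sets are made up)."""
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a)

# Two hypothetical adjacent layers agreeing on 3 of 4 selected tokens:
print(index_overlap([3, 7, 12, 40], [3, 7, 12, 55]))  # → 0.75
```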

To capitalize on this cross-layer redundancy, the researchers developed IndexCache. The technique partitions the model’s layers into two categories. A small number of full (F) layers retain their indexers, actively scoring the tokens and choosing the most important ones to cache. The rest of the layers become shared (S), performing no indexing and reusing the cached indices from the nearest preceding F layer.


IndexCache splits layers into full and shared layers

During inference, the model simply checks the layer type. If it reaches an F layer, it calculates and caches fresh indices. If it is an S layer, it skips the math and copies the cached data.
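The control flow described above can be sketched in a few lines (a toy model of the mechanism, not the production code; `compute_indices` stands in for a real lightning indexer):

```python
def run_layers(layer_types, compute_indices):
    """IndexCache control flow: F layers compute and cache fresh
    indices; S layers skip indexing and reuse the most recent
    F layer's cached result."""
    cached = None
    per_layer = []
    for i, t in enumerate(layer_types):
        if t == "F":
            cached = compute_indices(i)   # fresh indexing, result cached
        per_layer.append(cached)          # S layers reuse `cached` as-is
    return per_layer

# Toy run: pretend each F layer selects indices derived from its depth.
plan = ["F", "S", "S", "F", "S"]
out = run_layers(plan, lambda i: {i, i + 1})
print(out)  # → [{0, 1}, {0, 1}, {0, 1}, {3, 4}, {3, 4}]
```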

A wide range of optimization techniques address the attention bottleneck by compressing the KV cache, where the computed attention keys and values are stored. Instead of shrinking the memory footprint like standard KV cache compression, IndexCache attacks the compute bottleneck.

“IndexCache is not a traditional KV cache compression or sharing technique,” Yushi Bai, co-author of the paper, told VentureBeat. “It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them.”

The researchers developed two deployment approaches for IndexCache. (It is worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)

For developers working with off-the-shelf DSA models where retraining is infeasible or too expensive, they created a training-free method relying on a “greedy layer selection” algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal placement of F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.
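One plausible reading of such a greedy search, sketched under our own assumptions (the calibration metric, the tolerance check, and the toy quality function below are all hypothetical, not the paper's exact algorithm):

```python
def greedy_share(num_layers, quality_of, budget, tol):
    """Training-free greedy search sketch: repeatedly convert to 'S'
    the layer whose conversion hurts a calibration metric the least,
    until `budget` indexers are removed or quality drops more than
    `tol` below the all-F baseline. Layer 0 stays F so every S layer
    has a preceding cache to reuse."""
    types = ["F"] * num_layers
    base = quality_of(types)
    removed = 0
    while removed < budget:
        best, best_q = None, -float("inf")
        for i in range(1, num_layers):
            if types[i] == "F":
                trial = types[:i] + ["S"] + types[i + 1:]
                q = quality_of(trial)
                if q > best_q:
                    best, best_q = i, q
        if best is None or base - best_q > tol:
            break                         # quality budget exhausted
        types[best] = "S"
        removed += 1
    return types

# Toy calibration metric (made up): every shared layer costs a little,
# and layer 2 is critical, so the search refuses to share it.
quality = lambda t: 1.0 - 0.01 * t.count("S") - (0.5 if t[2] == "S" else 0.0)
print(greedy_share(4, quality, budget=3, tol=0.1))  # → ['F', 'S', 'F', 'S']
```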

For teams pre-training or heavily fine-tuning their own foundation models, the researchers propose a training-aware version that optimizes the network parameters to natively support cross-layer sharing. This approach introduces a “multi-layer distillation loss” during training. It forces each retained indexer to learn how to select a consensus subset of tokens that will be highly relevant for all the subsequent layers it serves.
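A hedged sketch of what such a loss could look like (our interpretation, not the paper's exact formulation: an averaged KL divergence pulling the retained indexer's score distribution toward the target selection distribution of each layer it serves):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def multi_layer_distill_loss(f_scores, served_targets):
    """Sketch of a multi-layer distillation loss: the F-layer indexer's
    score distribution p is pulled toward the target distribution q of
    every shared layer it serves via an averaged KL(q || p) term, so the
    indexer learns to select a consensus token subset."""
    p = softmax(f_scores)
    total = 0.0
    for target in served_targets:          # one target per served layer
        q = softmax(target)
        total += sum(qi * (math.log(qi) - math.log(pi))
                     for qi, pi in zip(q, p))
    return total / len(served_targets)

# Identical indexer and target score distributions give zero loss:
print(round(multi_layer_distill_loss([1.0, 2.0, 3.0], [[1.0, 2.0, 3.0]]), 6))  # → 0.0
```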

Real-world speedups on production models

To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline.

At a 200K context length, removing 75% of the indexers slashed the prefill latency from 19.5 seconds down to just 10.7 seconds, delivering a 1.82x speedup. The researchers note these speedups are expected to be even greater in longer contexts.

During the decoding phase, where the model generates its response, IndexCache boosted per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, yielding a 1.48x speedup. When the server’s memory is fully saturated with requests, total decode throughput jumped by up to 51%.


IndexCache speeds up the prefill and decode stages significantly (source: arXiv)

For enterprise teams, these efficiency gains translate directly into cost savings. “In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” Bai said. “In these cases, we observe at least an approximate 20% reduction in deployment cost and similar improvements in user-perceived latency.” He added that for very short-context tasks, the benefits hover around 5%.

Remarkably, these efficiency gains did not compromise reasoning capabilities. Using the training-free approach to eliminate 75% of indexers, the 30B model matched the original baseline’s average score on long-context benchmarks, scoring 49.9 against the original 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original baseline, scoring 92.6 compared to 91.0.

The team also ran preliminary experiments on the production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens. At the same time, the model maintained a nearly identical quality average on long-context tasks.


IndexCache increases the speed of GLM-5 by 20% while maintaining accuracy (source: arXiv)

Getting IndexCache into production

For development teams wanting to implement the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.

“We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with real workloads,” Bai said.

Once calibrated, the optimization is highly accessible for production environments. Open-source patches are already available on GitHub for major serving engines. “Integration is relatively straightforward — developers can apply the patch to existing inference stacks, such as vLLM or SGLang, and enable IndexCache with minimal configuration changes,” Bai said.

While IndexCache provides an immediate fix for today’s compute bottlenecks, its underlying philosophy points to a broader shift in how the AI industry will approach model design.

“Future foundation models will likely be architected with downstream inference constraints in mind from the beginning,” Bai concluded. “This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating these as post-hoc concerns.”

