Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

May 27, 2026

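Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens, and the large target model verifies them in parallel. Accepted tokens are emitted for the cost of a single target pass, which is where the speedup comes from. Rejected tokens are discarded, and generation continues from the target model's own prediction, so output quality is unchanged.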
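The propose-then-verify loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not how EAGLE or vLLM implement it: both "models" are hypothetical greedy next-token functions, verification uses exact token matching (production systems that sample use a probabilistic acceptance rule instead), and the target is called in a Python loop where a real engine would score all positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    """Generate max_new tokens with repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    """Made-up target: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Made-up drafter: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = generate([0], toy_draft, toy_target, k=3, max_new=8)
print(result)
```

Because every emitted token is either confirmed or supplied by the target, the result matches plain greedy decoding with the target alone. The drafter only changes how many target passes are needed.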

The EAGLE series, including EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE team, vLLM team, and TorchSpec team are giving that family a targeted reliability upgrade with the introduction of EAGLE 3.1.

What Was Going Wrong

While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

The EAGLE team traced this fragility to a phenomenon called attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.

In simpler terms: the drafter is a small model that predicts future tokens. As speculation gets deeper, it starts attending to its own prior outputs instead of the original context. This degrades acceptance length and output stability.
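One way to make the idea concrete is to measure how much attention probability lands on sink tokens versus the drafter's own outputs. The sketch below is purely illustrative: the position layout and the attention logits are invented for the example and are not measurements from EAGLE, and `attention_mass` is a hypothetical helper rather than part of any library.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(weights: Sequence[float], positions: Sequence[int]) -> float:
    """Fraction of attention probability assigned to the given positions."""
    return sum(weights[i] for i in positions)


# Hypothetical layout: position 0 is the sink token, 1-5 are prompt tokens,
# and 6-8 are tokens the drafter generated during speculation.
SINK = [0]
DRAFTED = [6, 7, 8]

# Invented logits for a shallow and a deep speculation step.
shallow = softmax([4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5])
deep = softmax([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0])

for name, weights in [("shallow", shallow), ("deep", deep)]:
    sink = attention_mass(weights, SINK)
    drafted = attention_mass(weights, DRAFTED)
    print(f"{name}: sink={sink:.3f} drafted={drafted:.3f}")
```

Tracking these two numbers per speculation step would show drift as a falling sink share and a rising drafted-token share.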

Two underlying issues were identified. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.

Two Architectural Fixes in EAGLE 3.1

To address attention drift, EAGLE 3.1 introduces two architectural changes: FC normalization, which normalizes each target hidden state before it enters the FC (fully connected) layer, and post-norm feedback, which feeds post-norm hidden states into the next decoding step.

FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, making the drafter increasingly unreliable. Applying normalization at each step keeps the inputs bounded.
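A minimal sketch of the idea, assuming RMS normalization without a learned gain and three target-layer states of very different scale. The article does not specify the normalization type, the number of fused layers, or any dimensions, so those choices, the random weights, and the helper names here are all assumptions made for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x to unit root-mean-square (learned gain omitted)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def energy_share(parts: List[Vector]) -> List[float]:
    """Fraction of the concatenated vector's squared norm from each part."""
    energies = [sum(v * v for v in p) for p in parts]
    total = sum(energies)
    return [e / total for e in energies]


def fuse(parts: List[Vector], weight: List[Vector]) -> Vector:
    """Concatenate the parts and apply a bias-free FC layer."""
    concat = [v for p in parts for v in p]
    return [sum(w * v for w, v in zip(row, concat)) for row in weight]


rng = random.Random(0)
dim = 64
# Hypothetical low-, mid-, and high-layer target states with growing scale.
states = [[rng.gauss(0.0, s) for _ in range(dim)] for s in (1.0, 4.0, 16.0)]
# Random stand-in for trained FC weights: maps 3 * dim inputs to dim outputs.
weight = [[rng.gauss(0.0, 0.1) for _ in range(3 * dim)] for _ in range(dim)]

normed = [rms_norm(s) for s in states]
raw_share = energy_share(states)      # dominated by the high-layer state
normed_share = energy_share(normed)   # close to one third each
drafter_input = fuse(normed, weight)  # normalize first, then the FC layer

print("raw share:   ", [round(s, 3) for s in raw_share])
print("normed share:", [round(s, 3) for s in normed_share])
```

Without normalization, the largest-magnitude state contributes most of the energy entering the FC layer. Normalizing each state first gives every source a comparable share regardless of its raw scale.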

The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending additional layers to the target model.
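The effect of the feedback choice on hidden-state magnitude can be shown with a toy recursion. This is a sketch under stated assumptions rather than the EAGLE 3.1 drafter: `toy_drafter_block` is a made-up pre-norm residual block, RMS normalization is assumed, and the vector size and step count are arbitrary.

```python
import math
import random
from typing import List

Vector = List[float]


def rms(x: Vector) -> float:
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x to unit root-mean-square (learned gain omitted)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def toy_drafter_block(h: Vector) -> Vector:
    """Made-up pre-norm residual block: h + f(norm(h))."""
    n = rms_norm(h)
    shifted = n[-1:] + n[:-1]  # rotate by one position to mix neighbors
    update = [0.5 * a + 0.5 * b for a, b in zip(n, shifted)]
    return [a + b for a, b in zip(h, update)]  # residual path is not normalized


def run(steps: int, feed_post_norm: bool, seed: int = 0) -> List[float]:
    """Return the RMS of the drafter input at each speculation step."""
    rng = random.Random(seed)
    h = rms_norm([rng.gauss(0.0, 1.0) for _ in range(64)])
    trace: List[float] = []
    for _ in range(steps):
        trace.append(rms(h))
        out = toy_drafter_block(h)
        # Post-norm feedback normalizes the state before the next step;
        # the alternative passes the raw residual stream forward.
        h = rms_norm(out) if feed_post_norm else out
    return trace


raw_trace = run(steps=8, feed_post_norm=False)
post_norm_trace = run(steps=8, feed_post_norm=True)
print("raw feedback:      ", [round(v, 2) for v in raw_trace])
print("post-norm feedback:", [round(v, 2) for v in post_norm_trace])
```

With raw feedback, every step adds to the residual stream and the input magnitude keeps climbing. With post-norm feedback, each step starts from an input of the same scale, which is what makes the loop look like invoking the same drafter recursively.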

https://vllm.ai/blog/2026-05-26-eagle-3-1

What These Fixes Deliver

Compared with EAGLE 3, EAGLE 3.1 demonstrates: better training-time to inference-time extrapolation, stronger long-context robustness, higher resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments.

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3.
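The article does not define acceptance length. A common convention, assumed here, is the average number of tokens emitted per target verification pass, counting the one token the target itself contributes. The sketch below computes that average from per-step counts and also evaluates the standard closed form for the expected tokens per pass when each draft token is accepted independently with probability `p`. The independence assumption and both function names are illustrative; real acceptance rates vary by position and workload.

```python
from typing import Sequence


def mean_acceptance_length(tokens_per_step: Sequence[int]) -> float:
    """Average tokens emitted per target verification pass."""
    if not tokens_per_step:
        raise ValueError("need at least one verification step")
    return sum(tokens_per_step) / len(tokens_per_step)


def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens per pass with k draft tokens and i.i.d. acceptance p.

    Sums 1 + p + p^2 + ... + p^k, which includes the target's own token.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


# Hypothetical log of tokens emitted in five verification passes (k = 3).
observed = [4, 1, 2, 4, 3]
print(mean_acceptance_length(observed))
print(expected_tokens_per_step(p=0.8, k=3))
```

Under this model, acceptance length is capped at `k + 1`, so a longer acceptance length at fixed `k` means the drafter's proposals survive verification more often.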

Training Infrastructure: TorchSpec

TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.

Based on TorchSpec and vLLM, the research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on Hugging Face. The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.

vLLM Integration: Config-Driven and Backward-Compatible

EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden-state feedback, and removal of hardcoded assumptions around target hidden states.

Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models plug directly into the same speculative-decoding code path.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
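The JSON passed to `--speculative-config` is easy to break with shell quoting when it is edited by hand. As a convenience sketch, the snippet below builds the same command from a Python dictionary and quotes it safely. The model names and flags are copied verbatim from the command above and are not independently verified here; only `json` and `shlex` from the standard library are used.

```python
import json
import shlex

# Speculative decoding settings, copied from the command in the article.
spec_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

argv = [
    "vllm", "serve", "nvidia/Kimi-K2.6-NVFP4",
    "--trust-remote-code",
    "--tensor-parallel-size", "4",
    "--tool-call-parser", "kimi_k2",
    "--enable-auto-tool-choice",
    "--reasoning-parser", "kimi_k2",
    "--attention-backend", "tokenspeed_mla",
    # Compact separators reproduce the single-line JSON form used above.
    "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
    "--language-model-only",
]

# shlex.join quotes each argument so the JSON survives a POSIX shell intact.
command = shlex.join(argv)
print(command)
```

The `argv` list can also be passed directly to `subprocess.run`, which avoids the shell entirely. Note that the method stays `eagle3` for an EAGLE 3.1 draft model, consistent with the config-driven integration described above.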

Benchmark Results on Kimi K2.6

The research team benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1. The speedup stays meaningful as concurrency scales: 1.71× at C=4 and 1.66× at C=16.
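To put the three speedups in perspective, the short calculation below converts each one into the implied reduction in time per output token and into the fraction of the concurrency-1 speedup that is retained. It uses only the figures reported above, and the helper names are hypothetical. Treating per-user throughput as the inverse of time per output token is an assumption of this sketch.

```python
from typing import Dict

# Per-user output throughput speedups reported above, keyed by concurrency.
speedups: Dict[int, float] = {1: 2.03, 4: 1.71, 16: 1.66}


def time_saved(speedup: float) -> float:
    """Fractional reduction in time per token implied by a throughput speedup."""
    return 1.0 - 1.0 / speedup


def retained(speedup: float, reference: float) -> float:
    """Share of a reference speedup that remains at another setting."""
    return speedup / reference


for concurrency, s in speedups.items():
    saved = time_saved(s)
    kept = retained(s, speedups[1])
    print(f"C={concurrency}: {s:.2f}x, time per token -{saved:.1%}, "
          f"{kept:.1%} of the C=1 speedup")
```

Read this way, the gain shrinks as concurrency rises but most of it remains at C=16.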

Marktechpost’s Visual Explainer

vLLM · May 26, 2026


The EAGLE team, vLLM team, and TorchSpec team jointly released EAGLE 3.1 — a targeted fix for speculative decoding instability in production LLM serving.

Background

What is Speculative Decoding?


A technique for speeding up LLM inference using two models working together.

  • A small, fast draft model proposes several tokens ahead
  • The large target model verifies all proposed tokens in one pass
  • Accepted tokens are kept — rejected tokens fall back gracefully
  • Result: higher output throughput with no change in output quality

The Problem

Attention Drift in EAGLE 3


EAGLE 3 performance degraded in real-world deployments under three conditions:

  • Different chat templates
  • Long-context inputs
  • Out-of-distribution system prompts

Root cause: attention drift — as speculation depth increases, the drafter shifts attention away from sink tokens toward its own generated tokens.

Root Cause

Two Underlying Issues

  • The fused input representation becomes increasingly imbalanced — higher-layer hidden states dominate the drafter input
  • Hidden-state magnitude grows across speculation steps due to the unnormalized residual path
  • Together, these make the drafter progressively less stable at deeper speculation depths

Architecture

Two Architectural Fixes

Fix 1
FC normalization applied after each target hidden state and before the FC layer. Keeps hidden-state magnitude bounded across decoding steps.

Fix 2
Post-norm hidden-state feedback — normalized hidden states fed into the next decoding step, making the drafter behave like recursive invocation rather than appended layers.

Benchmarks · SPEED-Bench Coding · GB200 TP=4

Per-User Throughput vs. No-Spec Baseline

2.03×Concurrency 1

1.71×Concurrency 4

1.66×Concurrency 16

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3. Tested on Kimi-K2.6-NVFP4 with vLLM.

Deployment · vLLM v0.22.0

How to Deploy EAGLE 3.1


Backward-compatible with EAGLE 3 checkpoints. Already merged in vLLM main. Stable release: v0.22.0.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config \
    '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla",
      "method":"eagle3",
      "num_speculative_tokens":3}' \
  --language-model-only

Key Takeaways

  • EAGLE 3.1 fixes attention drift — a newly identified instability where the drafter loses focus on sink tokens at deeper speculation depths.
  • Two architectural changes — FC normalization and post-norm hidden-state feedback — stabilize the drafter across speculation steps.
  • In long-context workloads, EAGLE 3.1 delivers up to 2× longer acceptance length compared with EAGLE 3.
  • Benchmarks on Kimi-K2.6-NVFP4 show 2.03× per-user output throughput at concurrency 1, dropping to 1.66× at C=16.
  • EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM main, shipping in v0.22.0.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
