
Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

April 29, 2026
in AI & Technology

The race to make large language models faster and cheaper to run has largely been fought at two levels: the model architecture and the hardware. But there is a third, often underappreciated frontier — the GPU kernel. A kernel is the low-level computational routine that actually executes a mathematical operation on the GPU. Writing a good one requires understanding not just the math, but the exact memory layout, instruction scheduling, and hardware quirks of the chip you are targeting. Most ML professionals never write kernels directly; they rely on libraries like FlashAttention or Triton to do it for them.

Meet FlashQLA, QwenLM’s contribution to this layer. Released under the MIT License and built on the TileLang compiler framework, it is a high-performance linear attention kernel library optimized specifically for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.

What is Linear Attention and Why Does It Matter?

To understand what FlashQLA solves, it helps to understand what standard softmax attention costs. In a conventional Transformer, the attention mechanism has O(n²) complexity — meaning that doubling the sequence length quadruples the computation. This is the fundamental bottleneck that makes processing long documents, long code files, or long conversations expensive.

Linear attention replaces the softmax with a formulation that reduces this to O(n) complexity, making it scale much more favorably with sequence length. The Gated Delta Network (GDN) is one such linear attention mechanism, and it has been integrated into Qwen’s hybrid model architecture, where GDN layers alternate with standard full attention layers. This hybrid design attempts to get the best of both worlds: the expressiveness of full attention where it is most needed, and the efficiency of linear attention everywhere else.
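A toy cost model makes the scaling difference concrete. The constants below are illustrative; only the growth rates matter.

```python
# Toy cost model: softmax attention materializes an n x n score matrix,
# so its work grows quadratically with sequence length n.
def softmax_attn_cost(n: int, d: int = 64) -> int:
    # QK^T (n*n*d multiply-adds) plus the attention-weighted V product (n*n*d)
    return 2 * n * n * d

def linear_attn_cost(n: int, d: int = 64) -> int:
    # A running d x d state is updated once per token: O(n * d^2)
    return 2 * n * d * d

n = 4096
print(softmax_attn_cost(2 * n) // softmax_attn_cost(n))  # doubling n quadruples the cost
print(linear_attn_cost(2 * n) // linear_attn_cost(n))    # linear attention only doubles
```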

GDN uses what is called a ‘gated’ formulation — it applies an exponentially decaying gate to control how much past context is carried forward. This gate is key to how FlashQLA achieves its performance gains.
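The gated recurrence can be sketched in a few lines. This is a simplified gated linear attention update, not the exact GDN formulation (which adds a delta-rule correction term not shown here); all names and dimensions are illustrative.

```python
import math

# Simplified gated linear attention step (illustrative only).
# State S is a d_k x d_v matrix; each step applies an exponential decay gate
# before accumulating the new key-value outer product:
#   S_t = exp(-g_t) * S_{t-1} + k_t v_t^T
#   o_t = S_t^T q_t
def gated_linear_attn_step(S, q, k, v, g):
    d_k, d_v = len(S), len(S[0])
    decay = math.exp(-g)  # g > 0, so older context decays geometrically
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] = decay * S[i][j] + k[i] * v[j]
    # output: o_j = sum_i q_i * S[i][j]
    return [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]

S = [[0.0, 0.0], [0.0, 0.0]]
o = gated_linear_attn_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0], g=0.5)
print(o)  # [2.0, 3.0] on the first step (empty state, so decay has nothing to erase)
```

A token seen t steps ago contributes with weight roughly exp(-sum of the intervening gates), which is the decay property FlashQLA exploits below.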

The Problem with Existing Kernels

Before FlashQLA, the standard implementation for GDN operations came from the Flash Linear Attention (FLA) library, which uses Triton kernels — Triton being OpenAI’s Python-based GPU programming language. While Triton makes kernel authoring more accessible, it comes with trade-offs: the kernels it produces are not always optimally scheduled for specific hardware, particularly on NVIDIA’s Hopper architecture (the H100 and H200 GPU generation).

The Hopper architecture introduced new features like warpgroup-level Tensor Core operations and asynchronous data pipelines that Triton cannot always exploit to their full potential. This is the gap FlashQLA is designed to fill.

What FlashQLA Does Differently

FlashQLA applies operator fusion and performance optimization to both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN Chunked Prefill. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared to the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs.

Three technical innovations drive these gains:

1. Gate-driven automatic intra-card context parallelism: Context parallelism (CP) refers to splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split mathematically valid — because the gate’s decay means that tokens far apart in a sequence have diminishing influence on each other. This allows FlashQLA to automatically enable intra-card CP under tensor parallelism (TP), long-sequence, and small-head-count settings, improving GPU Streaming Multiprocessor (SM) utilization without requiring manual configuration.
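The reason this split is valid: the decay-gated update is an affine map in the state, and affine maps compose, so each chunk can be reduced independently and stitched together afterwards. A scalar sketch of the idea (illustrative only; the real state is matrix-valued and the real scheduling happens across SMs on the GPU):

```python
# The gate makes the per-token state update an affine map s_t = a_t * s_{t-1} + b_t
# (scalar stand-in for the matrix-valued GDN state). Affine maps compose, so
# independent workers can each reduce their chunk to a single (A, B) pair and
# the results are combined with a cheap serial pass at the end.

def sequential(updates, s0=0.0):
    s = s0
    for a, b in updates:
        s = a * s + b
    return s

def reduce_chunk(chunk):
    # Compose a chunk of affine updates into one map s -> A*s + B
    A, B = 1.0, 0.0
    for a, b in chunk:
        A, B = a * A, a * B + b
    return A, B

def parallel(updates, n_chunks=4, s0=0.0):
    k = len(updates) // n_chunks
    chunks = [updates[i * k:(i + 1) * k] for i in range(n_chunks)]
    s = s0
    for A, B in map(reduce_chunk, chunks):  # chunk reductions are independent
        s = A * s + B                       # short serial stitch at the end
    return s

updates = [(0.9, 0.1 * t) for t in range(64)]  # decaying gate a = 0.9
print(abs(sequential(updates) - parallel(updates)) < 1e-9)  # True
```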

2. Hardware-friendly algebraic reformulation: FlashQLA partially reformulates the mathematical computation of GDN Chunked Prefill’s forward and backward flows to reduce overhead on three types of GPU hardware units: Tensor Cores (which handle matrix multiplications), CUDA Cores (which handle scalar and vector operations), and the Special Function Unit (SFU, which handles operations like exponentials and square roots). Critically, this is done without sacrificing numerical precision, an important guarantee when the reformulation is used during model training.
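One common flavor of SFU-reducing reformulation (illustrative; not necessarily the exact transform FlashQLA applies): a running product of per-token decay gates needs one transcendental evaluation per token, but the same quantity can be computed as a cheap sum followed by a single exponential.

```python
import math

# prod_i exp(-g_i) costs one SFU exponential per token...
gates = [0.1, 0.25, 0.05, 0.4]
naive = 1.0
for g in gates:
    naive *= math.exp(-g)          # one transcendental per token

# ...but algebraically it equals exp(-sum_i g_i): a cumulative sum
# (cheap CUDA Core work) plus a single transcendental.
reformulated = math.exp(-sum(gates))

print(abs(naive - reformulated) < 1e-12)  # True: same value, less SFU pressure
```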

3. TileLang fused warp-specialized kernels: Rather than decomposing the computation into independent sequential kernels (too slow) or fusing everything into a single monolithic kernel (too rigid to optimize), FlashQLA takes a middle path. It uses TileLang to build several key fused kernels and manually implements warpgroup specialization — a technique that assigns different warpgroups (groups of 128 threads on Hopper) to specialized roles, such as one warpgroup moving data from global memory to shared memory while another simultaneously runs Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA Core computation is what allows FlashQLA to approach the theoretical peak throughput of the hardware.
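The producer/consumer overlap can be illustrated with a CPU-side analogy. This is not GPU code, just a sketch of the specialization pattern; `load_tile` and `compute` are hypothetical stand-ins for a global-to-shared-memory copy and a Tensor Core matmul.

```python
import threading, queue

# One worker (the "producer warpgroup") streams tiles into a small staging
# buffer while another (the "consumer warpgroup") runs the math on tiles
# already staged. The bounded queue plays the role of shared memory, and the
# overlap is what hides the load latency.

def load_tile(i):
    return [float(i)] * 4            # stand-in for a global->shared copy

def compute(tile):
    return sum(x * x for x in tile)  # stand-in for a Tensor Core matmul

def pipelined(n_tiles, depth=2):
    staged = queue.Queue(maxsize=depth)  # double-buffered staging area

    def producer():
        for i in range(n_tiles):
            staged.put(load_tile(i))
        staged.put(None)                 # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    total = 0.0
    while (tile := staged.get()) is not None:
        total += compute(tile)           # overlaps with the next load
    return total

print(pipelined(8))  # same result as a purely sequential loop, with overlap
```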

Benchmarks

FlashQLA was benchmarked against two baselines: the FLA Triton kernel (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8, on NVIDIA H200 GPUs. The benchmarks used the head configurations from the Qwen3.5 and Qwen3.6 model families, with head dimensions hv ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 through TP8.

The forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths. The backward (BWD) benchmarks examine the relationship between total token count within a batch and latency during a single update step.
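For readers who want to reproduce this style of measurement, a generic single-kernel latency harness looks roughly like the following. `run_kernel` is a hypothetical stand-in for the actual kernel launch, and real GPU timing also needs a device synchronize inside the timed region.

```python
import time, statistics

# Generic single-kernel latency pattern: warm up first, then time many
# repetitions and report the median to suppress outliers.
def bench(run_kernel, warmup=10, iters=100):
    for _ in range(warmup):
        run_kernel()                     # warm caches / trigger JIT before timing
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_kernel()
        samples.append((time.perf_counter() - t0) * 1e3)  # milliseconds
    return statistics.median(samples)

# On a GPU, also synchronize the device (e.g. torch.cuda.synchronize())
# inside the timed region so asynchronous launches are actually measured.
latency_ms = bench(lambda: sum(range(10_000)))
print(latency_ms > 0)  # True
```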

https://qwen.ai/blog?id=flashqla

Key Takeaways

  • FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang, specifically optimized for the Gated Delta Network (GDN) Chunked Prefill forward and backward passes.
  • It achieves 2–3× forward speedup and 2× backward speedup over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs (SM90+), with efficiency gains most pronounced in pretraining and edge-side agentic inference.
  • Three core innovations drive the performance gains: gate-driven automatic intra-card context parallelism, hardware-friendly algebraic reformulation that reduces Tensor Core, CUDA Core, and SFU overhead without losing numerical precision, and TileLang fused warp-specialized kernels that overlap data movement, Tensor Core computation, and CUDA Core computation.
  • GDN is a linear attention mechanism with O(n) complexity, used in Qwen’s hybrid model architecture alongside standard full attention layers — making efficient GDN kernels critical for both training and long-context inference at scale.
  • FlashQLA is open-source under the MIT License and requires SM90 or above, CUDA 12.8+, and PyTorch 2.8+, with a simple pip install and both high-level and low-level Python APIs available for integration.

Check out the GitHub Repo and Technical details.


The post Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs appeared first on MarkTechPost.
