TradePoint.io

Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

August 7, 2025
in AI & Technology

Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, keeping training stable and reliable becomes difficult as RL is scaled up with more compute. Current state-of-the-art algorithms such as GRPO show serious stability problems when training very large language models, often ending in catastrophic failure. The instability stems from a misapplication of importance-sampling weights, which introduces high-variance noise. This noise accumulates as responses get longer and is amplified by the clipping mechanism, eventually causing model collapse.

Existing methods such as PPO and GRPO rely on clipping to handle off-policy learning, where the responses used for an update were sampled from an outdated policy. However, these approaches are limited by ill-posed objectives, and the problem is most visible in large models trained on long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise that can trigger irreversible model collapse: attempts to recover by tuning hyperparameters or restoring an earlier checkpoint fail, which points to a fundamental design flaw. Because the reward is assigned to a whole sequence while the correction is applied per token, the two are mismatched, which motivates an approach that optimizes directly at the sequence level for stability and scalability.
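The difference between a token-level and a sequence-level correction can be sketched in a few lines. This is an illustration only, not the authors' code: the log-probabilities are made-up values, the function names are hypothetical, and the sequence-level ratio is assumed to be length-normalized (the geometric mean of the token ratios), which is how the GSPO paper is understood to define it.

```python
import math

def token_ratios(new_logps, old_logps):
    """GRPO-style correction: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """Sequence-level correction: a single ratio for the whole response.

    Assumption: the ratio is length-normalized, i.e. the exponential of the
    mean per-token log-ratio (the geometric mean of the token ratios).
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("expected two non-empty lists of equal length")
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Hypothetical per-token log-probabilities of one 4-token response under
# the old (sampling) policy and the current policy.
old_logps = [-1.0, -2.0, -0.5, -1.5]
new_logps = [-0.8, -2.3, -0.5, -1.4]

per_token = token_ratios(new_logps, old_logps)
per_sequence = sequence_ratio(new_logps, old_logps)

print("token-level ratios:  ", [round(r, 3) for r in per_token])
print("sequence-level ratio:", round(per_sequence, 3))
```

In this toy case the individual token ratios swing both above and below 1, while the single sequence-level ratio stays close to 1 because the per-token fluctuations average out. That is the intuition behind applying one weight per response instead of one weight per token.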

Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm for training LLMs. Its main innovation is an importance ratio defined on sequence likelihood, which is theoretically grounded and consistent with the principles of importance sampling. It also computes normalized rewards as the advantages of multiple responses to the same query, so that sequence-level rewards and the optimization objective are aligned. In the reported experiments, GSPO clearly outperforms GRPO in stability, efficiency, and overall performance. Because it resolves the stability problems of training large Mixture-of-Experts (MoE) models, it removes the need for complex stabilization techniques.
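As a rough illustration of "normalized rewards as advantages", the sketch below standardizes the rewards of a group of responses sampled for one query. The reward values, the function name `group_advantages`, the use of the population standard deviation, and the small `eps` guard against a zero denominator are all assumptions made for this example rather than details taken from the paper.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same query.

    Each response gets (reward - group mean) / group std. `eps` is a
    hypothetical safeguard for groups in which every reward is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifier scores for a group of G = 4 responses to one query.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive a positive advantage and the rest a negative one. Since the advantage is a single number per response, it pairs naturally with a single importance ratio per response.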

The researchers run the experiment on a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report training reward curves together with model performance curves on the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with the left and right clipping ranges in its formulation set to 3e-4 and 4e-4. As a result, the fraction of clipped tokens differs from GRPO’s by two orders of magnitude, with GSPO clipping far more. Despite discarding more tokens from gradient estimation, GSPO achieves higher training efficiency, which highlights how inefficient GRPO’s noisy token-level estimates are.
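Response-level clipping can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a PPO-style surrogate (the minimum of the unclipped and clipped terms) applied to one ratio per response, uses the 3e-4 and 4e-4 ranges quoted above as defaults, and runs on invented ratios, advantages, and response lengths. The function name `gspo_surrogate` is hypothetical.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))

def gspo_surrogate(seq_ratios, advantages, lengths,
                   eps_left=3e-4, eps_right=4e-4):
    """Clipped surrogate with one importance ratio per response.

    Returns the mean surrogate value over the group and the fraction of
    tokens that belong to clipped responses, i.e. responses for which the
    clipped branch is selected and therefore contributes no gradient.
    """
    terms = []
    clipped_tokens = 0
    for ratio, adv, n_tokens in zip(seq_ratios, advantages, lengths):
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - eps_left, 1.0 + eps_right) * adv
        terms.append(min(unclipped, clipped))
        if clipped < unclipped:
            # The whole response is clipped, so all of its tokens drop out.
            clipped_tokens += n_tokens
    return sum(terms) / len(terms), clipped_tokens / sum(lengths)

# Hypothetical group of four responses to one query.
seq_ratios = [1.0010, 0.9990, 1.0001, 0.9999]
advantages = [1.0, -1.0, -1.0, 1.0]
lengths = [100, 200, 300, 400]

objective, clipped_fraction = gspo_surrogate(seq_ratios, advantages, lengths)
print(objective, clipped_fraction)
```

With these made-up numbers the first two responses fall outside the narrow clipping window in the direction that matters, so all 300 of their tokens are excluded from the gradient at once. Clipping at the response level removes tokens in whole-response blocks, which is consistent with the much larger clipped-token fraction reported for GSPO.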

GSPO is particularly helpful for MoE training: expert activations stay consistent across gradient updates, whereas GRPO struggles with expert-activation volatility. This removes the need for workarounds such as Routing Replay, which simplifies the infrastructure and lets models use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization depends less on token-level likelihoods, making it more robust to precision mismatch. This makes it possible to use the likelihoods returned by the inference engine directly, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL. Together, these properties streamline the RL infrastructure needed for large-scale language model training.

In conclusion, the researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm for training LLMs. GSPO builds on the principles of importance sampling and applies clipping, rewarding, and optimization at the sequence level to overcome the instability and inefficiency seen in GRPO. Its advantages in training stability, efficiency, and scalability, particularly for MoE models, make it a strong algorithmic foundation. According to the article's account, GSPO has played a key role in the performance of the Qwen3 models, and the researchers plan to build on it as they expand their RL methods.


Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
