Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, keeping training dynamics stable and reliable becomes difficult as RL is scaled up with more compute. Current state-of-the-art algorithms such as GRPO suffer from serious stability issues when training very large language models, often ending in catastrophic failure. These instabilities arise from a misapplication of importance sampling weights, which introduces high-variance noise. The noise accumulates as responses get longer and is amplified by the clipping mechanism, eventually causing model collapse.

Existing methods such as PPO and GRPO rely on clipping to handle off-policy learning, where responses are sampled from an outdated policy. However, these approaches are limited by ill-posed objectives, particularly for large models on long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise that can lead to irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, which points to a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards calls for an approach that optimizes directly at the sequence level to ensure stability and scalability.
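
To make the token-level correction concrete, here is a minimal Python sketch of per-token importance ratios for a single response. The function names, the log-probability values, and the 0.2 clipping range are illustrative assumptions, not values from the article, and the mask only flags ratios outside the range (a full PPO-style objective also depends on the sign of the advantage).

```python
import math

def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t) for one response."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def outside_clip_range(ratios, eps_low=0.2, eps_high=0.2):
    """True where a token's ratio falls outside [1 - eps_low, 1 + eps_high].

    The 0.2 range is an assumed, illustrative value.
    """
    return [r < 1.0 - eps_low or r > 1.0 + eps_high for r in ratios]

# Hypothetical per-token log-probabilities for one 4-token response.
logp_old = [-1.0, -2.0, -0.5, -3.0]
logp_new = [-1.1, -1.5, -0.5, -3.4]

ratios = token_level_ratios(logp_new, logp_old)
mask = outside_clip_range(ratios)
```

Each token gets its own weight from a single sample, so the weights within one response can differ widely even though the reward is assigned to the response as a whole.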

Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed to train LLMs. GSPO’s main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Moreover, it calculates normalized rewards as advantages for multiple responses to a query, promoting consistency between sequence-level rewards and optimization goals. Empirical evaluations reveal that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
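
The two ingredients described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the length normalization of the sequence ratio follows the definition in the GSPO paper as we understand it, the use of the population standard deviation is an assumption, and all names and numbers are hypothetical.

```python
import math
import statistics

def sequence_ratio(logp_new, logp_old):
    """Sequence-level importance ratio with length normalization (assumed form):
    exp of the mean per-token log-probability difference."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def group_advantages(rewards):
    """Normalize rewards across the responses sampled for one query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    if std == 0.0:
        # All responses scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical group of four responses to one query, with 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Hypothetical log-probabilities of one 2-token response under both policies.
ratio = sequence_ratio([-1.0, -2.0], [-1.2, -2.2])
```

Every token in a response shares the same ratio and the same advantage, which is what keeps the unit of optimization aligned with the unit of reward.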

For the experiments, the researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report training reward curves along with model performance on the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with the left and right clipping ranges in its formulation set to 3e-4 and 4e-4. As a result, the fraction of clipped tokens differs from GRPO’s by two orders of magnitude. Even though GSPO clips more tokens and so excludes them from gradient estimation, it achieves higher training efficiency. This result highlights the inefficiency of GRPO’s noisy token-level estimates.
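
Response-level clipping can be sketched as follows. This is a hedged illustration using a PPO-style clipped term applied to one ratio per response; the 3e-4 and 4e-4 ranges come from the text above, while the function names, ratios, advantages, and response lengths are made up for the example.

```python
def clipped_sequence_term(ratio, advantage, eps_left=3e-4, eps_right=4e-4):
    """PPO-style clipped objective term, applied to a whole response."""
    clipped = min(max(ratio, 1.0 - eps_left), 1.0 + eps_right)
    return min(ratio * advantage, clipped * advantage)

def is_clipped(ratio, advantage, eps_left=3e-4, eps_right=4e-4):
    """True when the clipped branch is active, so the response
    contributes no gradient."""
    if advantage > 0:
        return ratio > 1.0 + eps_right
    if advantage < 0:
        return ratio < 1.0 - eps_left
    return False

def clipped_token_fraction(ratios, advantages, lengths):
    """Share of tokens that belong to clipped responses."""
    clipped = sum(n for s, a, n in zip(ratios, advantages, lengths)
                  if is_clipped(s, a))
    return clipped / sum(lengths)

# Hypothetical group: sequence ratios, advantages, and response lengths.
ratios = [1.0010, 0.9990, 1.0001, 0.9999]
advantages = [1.0, -1.0, -1.0, 1.0]
lengths = [100, 300, 200, 400]

fraction = clipped_token_fraction(ratios, advantages, lengths)
```

Because a clipped response drops all of its tokens at once, the clipped-token fraction can be large even when the clipping range itself is very narrow.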

GSPO offers significant advantages for MoE training: it stabilizes the process by keeping expert activations consistent across gradient updates, whereas GRPO struggles with expert-activation volatility. This removes the need for complex workarounds such as Routing Replay, simplifying the infrastructure and letting models use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization reduces the dependency on token-level likelihoods, making it more robust to precision mismatch between training and inference engines. This allows likelihoods from the inference engine to be used directly, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL.

In conclusion, researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior performance in training stability, efficiency, and scalability, particularly for MoE models, emphasizes its importance as a strong algorithmic foundation. The advancements made possible by GSPO have played a key role in the remarkable performance of the Qwen3 models. Building on GSPO as a foundational approach, researchers plan to expand RL methods, opening the door for groundbreaking progress in AI.

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.