Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

May 5, 2025
AI & Technology

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse tasks, with Reinforcement Learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have shown particular success in mathematical reasoning and coding domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulties in ensuring cross-domain generalization.

Evolution of Reasoning in LLMs

The development of Chain-of-Thought (CoT) methodology marked a significant advancement in LLM reasoning capabilities. CoT has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning processes before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.

While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior work suggests that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, affects RL training effectiveness has yet to be systematically investigated.

Challenges in Diversifying Reasoning Domains

Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of various sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains that lack deterministic solutions. Domain-specific reasoning processes also demand different cognitive approaches: rule-based and symbolic in mathematics, contextual and heuristic in fields like law and history. In addition, question formats (open-ended versus multiple-choice) call for distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly enhance LLMs’ broad cognitive capabilities.

Nemotron-CrossThink: A Multi-Domain Approach

Researchers from NVIDIA, Carnegie Mellon University, and Boston University introduce Nemotron-CrossThink, a systematic framework for incorporating multi-domain corpora into RL training to enhance cross-task generalization. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs across STEM, humanities, law, and social sciences. By applying templated formats (MCQ/Open-Ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.

Key Results and Innovations

Nemotron-CrossThink significantly enhances LLM reasoning capabilities by integrating multi-domain data with different question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies, generating concise answers for general-purpose questions while providing detailed responses for mathematical problems. This optimizes inference costs while maintaining task-specific precision.

The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer space diversity. It also provides an efficient filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more challenging samples amplifies RL impact across all domains. These innovations have led to substantial performance gains in both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%).
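
Concretely, constraining answers to an option letter or a short string makes the reward a simple normalized match. The following is a minimal sketch under that assumption; the normalization details are ours, not the paper's implementation.

```python
def normalize(text: str) -> str:
    # Case-fold and trim whitespace/trailing punctuation before matching.
    return text.strip().rstrip(".").lower()

def verifiable_reward(prediction: str, reference: str) -> float:
    """Binary rule-based reward: 1.0 on an exact match after normalization.

    Works for both templated formats: for MCQs the strings are option
    letters (e.g. "B"); for open-ended questions they are short answers,
    which the pipeline keeps under ten words so this check stays reliable.
    """
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

print(verifiable_reward("Paris.", "paris"))  # 1.0
```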

Comprehensive Data Curation

Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data from CommonCrawl and publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesized QA pairs spanning STEM fields, economics, social sciences, and humanities, while mathematical reasoning incorporates datasets like MATH and Numina-Math alongside synthetically generated problems.

Template Application and Data Filtering

To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer formats: Multiple Choice Questions (MCQ) and Open-Ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based reward functions, discarding MCQs where correct answers aren’t among the choices and open-ended responses exceeding ten words.
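
As a rough illustration of that step, the sketch below renders samples into the two templates and applies the two filtering rules just described. The record fields and prompt wording are hypothetical; only the filtering rules come from the text.

```python
# Hypothetical unified record format; the paper's actual schema may differ.
raw_samples = [
    {"question": "Which planet is largest?",
     "choices": ["Mars", "Jupiter", "Venus"], "answer": "Jupiter"},
    {"question": "In what year did World War II end?", "answer": "1945"},
]

def apply_template(sample: dict) -> str:
    """Render a sample into the MCQ or Open-Ended template."""
    if "choices" in sample:
        # MCQ template: lettered options constrain the answer space.
        options = "\n".join(
            f"{chr(65 + i)}. {c}" for i, c in enumerate(sample["choices"]))
        return f"{sample['question']}\n{options}\nAnswer with the option letter only."
    # Open-Ended template: ask for a short, directly checkable answer.
    return f"{sample['question']}\nAnswer in a few words."

def keep_sample(sample: dict) -> bool:
    """The two rule-based filters described above."""
    if "choices" in sample:
        # Drop MCQs whose correct answer is missing from the choices.
        return sample["answer"] in sample["choices"]
    # Drop open-ended samples whose reference answer exceeds ten words.
    return len(sample["answer"].split()) <= 10

filtered = [s for s in raw_samples if keep_sample(s)]
prompts = [apply_template(s) for s in filtered]
```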

Strategic Data Blending and Reinforcement Learning

Nemotron-CrossThink employs Group Relative Policy Optimization (GRPO) for reinforcement learning, which improves efficiency by estimating baselines from group scores rather than using a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data usefulness through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.
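
The group-relative baseline is simple to state: sample several responses per prompt, score each with the verifiable reward, and normalize against the group's own mean and standard deviation instead of a learned value function. A minimal sketch of that advantage computation (the clipped policy update and KL penalty that complete GRPO are omitted):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for G responses sampled from one prompt:
    each reward is standardized by its group's statistics, so no separate
    critic model is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: binary verifiable rewards for G = 4 sampled responses.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))
# -> approximately [ 1., -1., -1.,  1.]: correct answers get positive advantage
```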

Technical Contributions

The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:

  1. Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
  2. Strategic data-blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
  3. Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B; a minimal sketch of this filtering follows the list.
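
A hedged sketch of that third technique, model-driven difficulty filtering: `small_model_solves` stands in for generating an answer with the smaller reference model and verifying it with the rule-based reward, and the all-attempts-fail threshold is an assumption rather than the paper's exact criterion.

```python
import random

def small_model_solves(sample: dict) -> bool:
    # Hypothetical stub: in practice, generate with the smaller model and
    # verify its answer with the same rule-based reward function.
    return random.random() < 0.5  # placeholder behavior only

def is_challenging(sample: dict, attempts: int = 4) -> bool:
    # Keep a sample only if the smaller model fails on every attempt
    # (the attempt count and threshold are assumptions).
    return not any(small_model_solves(sample) for _ in range(attempts))

candidate_samples = [{"question": "example?", "answer": "example"}]  # placeholder
hard_training_set = [s for s in candidate_samples if is_challenging(s)]
```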

These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional focus on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.

Experiments and Results

Experimental results demonstrate that different datasets significantly impact model performance across reasoning benchmarks. NuminaMath produced the highest overall average, outperforming the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answering data improved performance by approximately 1.0%, showing strong accuracy in MMLU-PRO, AGIEVAL, and MATH-500 tasks, confirming that synthetically generated instruction-style data can effectively generalize when aligned with benchmark distributions.

The Nemotron-CrossThink approach consistently outperformed the base model across various blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding OPEN-REASONER-ZERO by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-PRO, +15.12% on AGIEVAL). Though Bonly_math performed slightly better on strictly mathematical tasks, it lagged on non-mathematical reasoning benchmarks, demonstrating Bgpr↑’s superior versatility through strong cross-domain transfer.

Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data transferred well to structured reasoning tasks, while general-purpose data proved less effective in isolation; somewhat counterintuitively, the best general-purpose reasoning performance required including mathematical problems in the training blend.

Conclusion

Nemotron-CrossThink introduces a scalable framework that enhances LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research demonstrates that data diversity, not merely volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.
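
To make the headline recipe concrete, here is a minimal sketch of assembling the reported 2:1 blend of general-purpose to mathematical data; uniform sampling within each pool is our assumption, not the paper's recipe.

```python
import random

def blend(general_pool: list, math_pool: list, total: int, seed: int = 0) -> list:
    """Sample a training blend at a 2:1 ratio of general-purpose
    reasoning data to mathematical data."""
    rng = random.Random(seed)
    n_general = (2 * total) // 3
    data = (rng.sample(general_pool, n_general)
            + rng.sample(math_pool, total - n_general))
    rng.shuffle(data)
    return data

general_pool = [f"gpr-{i}" for i in range(100)]   # placeholder samples
math_pool = [f"math-{i}" for i in range(100)]
mix = blend(general_pool, math_pool, total=30)    # 20 general, 10 math
```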


Check out the Paper and Project Page.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.

