TradePoint.io

Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals

July 30, 2025

Reinforcement Learning with Verifiable Rewards (RLVR) allows LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, and it shows strong performance in mathematics and coding. However, many real-world scenarios lack such explicit verifiable answers, which makes it hard to train models without a direct reward signal. Current methods address this gap through RLHF with preference ranking, where human judgments are collected over pairs or lists of model outputs. Preference-based reward models can boost performance in the early stages of training, but they tend to overfit to superficial artifacts such as response length, formatting quirks, and annotator biases. They also require large volumes of pairwise comparisons, which makes them both costly and brittle.

RLVR methods now extend beyond mathematics and coding: General-Reasoner shows strong performance in physics, finance, and policy, achieving a ten-point gain on MMLU-Pro through GRPO fine-tuning. Rubric-based evaluation has become standard practice for advanced LLMs; frameworks like HealthBench pair clinician-written criteria with automated judges to evaluate factuality, safety, and empathy. However, these rubrics are used only at evaluation time, not during training. Separately, process supervision methods try to provide finer-grained feedback by rewarding intermediate reasoning steps, using MCTS-generated labels or generative reward models such as ThinkPRM.


Researchers from Scale AI have proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics to guide training on multi-criteria tasks. The method generates prompt-specific rubrics based on carefully designed principles; each rubric outlines clear standards for a high-quality response and provides human-interpretable supervision signals. The researchers apply it to the medicine and science domains, producing two specialized training datasets, RaR-Medicine-20k and RaR-Science-20k. By transforming rubrics into structured reward signals, RaR enables smaller judge models to achieve better alignment with human preferences while maintaining robust performance across model scales.

The researchers used LLMs as expert proxies to generate the rubrics, guided by four desiderata: grounding in expert guidance, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, a specialized prompt instructs the LLM to generate 7 to 20 rubric items, depending on the complexity of the input question. Each item is assigned a categorical weight, such as Essential Criteria or Important Criteria, indicating how much it matters for a correct answer. Training uses the GRPO algorithm with Qwen2.5-7B as the base policy model, and the pipeline has three core components: Response Generation, Reward Computation, and Policy Update.
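
The Reward Computation step, and the group-relative advantage that GRPO passes to the Policy Update, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the numeric values for the categorical weights, the extra "Optional" category, and all function names are hypothetical; per-item verdicts are taken as given rather than produced by a real judge model; and response generation and the gradient step itself are omitted. Rewards are standardized with the population standard deviation here, which is one of several choices implementations make.

```python
import statistics

# Hypothetical numeric values for the categorical weights. The article names
# "Essential" and "Important" criteria but gives no numbers; "Optional" is an
# extra illustrative category.
CATEGORY_WEIGHTS = {"Essential": 5.0, "Important": 3.0, "Optional": 1.0}


def rubric_reward(categories: list[str], verdicts: list[bool]) -> float:
    """Reward Computation: weighted fraction of rubric items a response satisfies.

    `verdicts[j]` is a yes/no judgment (e.g. from an LLM judge) on item j.
    """
    if len(categories) != len(verdicts):
        raise ValueError("need exactly one verdict per rubric item")
    weights = [CATEGORY_WEIGHTS[c] for c in categories]
    earned = sum(w for w, ok in zip(weights, verdicts) if ok)
    return earned / sum(weights)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's response group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: one rubric and a group of four sampled responses to the same prompt.
rubric = ["Essential", "Important", "Optional"]
group_verdicts = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
]
rewards = [rubric_reward(rubric, v) for v in group_verdicts]
# rewards: 9/9, 6/9, 3/9 and 0/9 of the total rubric weight
advantages = group_relative_advantages(rewards)
```

Responses that satisfy more heavily weighted criteria than their siblings receive positive advantages and are reinforced; the rest are pushed down, so no separate learned reward model is needed.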

RaR-Implicit outperforms baselines such as Simple-Likert, with the best variant achieving up to a 28% relative improvement on HealthBench-1k and 13% on GPQA. It also outperforms both the base and instruction-tuned policy models and matches or exceeds the Reference-Likert baseline, showing that rubric-guided training is effective for nuanced response evaluation. Beyond raw benchmark scores, rubric-guided evaluation provides clearer and more accurate signals across model scales, achieving higher accuracy as measured by how often the preferred response receives the appropriate rating. Expert guidance also proves essential for synthetic rubric generation: rubrics generated with access to reference answers achieve higher accuracy than those generated without such human insight.
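
The article does not define how this alignment accuracy is computed. One common way to measure it, and an assumption here, is pairwise accuracy: over pairs where humans prefer one response, count how often the judge scores the preferred response strictly higher. The function name, the tie-handling rule (ties count as errors), and the scores in the usage lines are all hypothetical.

```python
def preference_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs the judge ranks correctly.

    A pair is correct only if the human-preferred response gets a strictly
    higher score; ties count as errors.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Hypothetical judge scores for four (preferred, rejected) response pairs.
scores = [(0.9, 0.4), (0.6, 0.6), (0.3, 0.7), (0.8, 0.5)]
print(preference_accuracy(scores))  # 0.5: the tie and the inverted pair are misses
```

Counting ties as errors is a deliberate choice: a judge that gives every response the same score would otherwise look well aligned while providing no useful training signal.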

In summary, the researchers introduced RaR, which advances post-training of language models by using structured, checklist-style rubrics as reward signals. It provides stable training signals while maintaining human interpretability and alignment. However, the study is limited to the medical and science domains, and the approach still needs validation on tasks such as open-ended dialogue. The researchers explored only two reward aggregation strategies, implicit and explicit, leaving alternative weighting schemes unexplored. They also did not conduct a controlled analysis of reward-hacking risks, and the reliance on off-the-shelf LLMs as judges suggests future work could benefit from dedicated evaluators with stronger reasoning capabilities.
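
The implicit aggregation strategy is not spelled out in the article. As we understand it, the judge receives the full rubric together with the response and returns a single holistic score, instead of per-item verdicts that are then combined with explicit weights. The sketch below assumes a 1 to 10 rating mapped linearly onto [0, 1]; the prompt wording, the function names, and the stand-in `judge` callable (any function from prompt text to reply text, used in place of a real LLM call) are hypothetical.

```python
import re
from typing import Callable


def build_implicit_prompt(question: str, response: str, rubric: list[str]) -> str:
    """Pack the whole rubric into a single judge prompt (hypothetical wording)."""
    criteria = "\n".join(f"- {item}" for item in rubric)
    return (
        f"Question:\n{question}\n\n"
        f"Response:\n{response}\n\n"
        f"Rubric:\n{criteria}\n\n"
        "Considering all rubric items together, rate the response "
        "from 1 to 10. Reply with the number only."
    )


def implicit_rubric_reward(
    question: str,
    response: str,
    rubric: list[str],
    judge: Callable[[str], str],
) -> float:
    """Ask the judge for one holistic 1-10 score and rescale it to [0, 1]."""
    reply = judge(build_implicit_prompt(question, response, rubric))
    # Take the first standalone integer between 1 and 10 in the reply.
    match = re.search(r"\b(10|[1-9])\b", reply)
    if match is None:
        raise ValueError(f"could not parse a 1-10 score from {reply!r}")
    return (int(match.group(1)) - 1) / 9


# Usage with a stand-in judge that always answers "8".
reward = implicit_rubric_reward(
    "What is the first-line treatment for condition X?",
    "A candidate answer.",
    ["Names the first-line drug", "Mentions key contraindications"],
    judge=lambda prompt: "8",
)
print(reward)  # 7/9, about 0.778
```

The trade-off is that the judge, not a fixed formula, decides how much each criterion matters, which avoids hand-set weights but makes the reward harder to audit.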


Check out the Paper. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.

Credit: Source link




©2020- TradePoint.io - All rights reserved!

Tradepoint.io, being just a publishing and technology platform, is not a registered broker-dealer or investment adviser. So we do not provide investment advice. Rather, brokerage services are provided to clients of Tradepoint.io by independent SEC-registered broker-dealers and members of FINRA/SIPC. Every form of investing carries some risk and past performance is not a guarantee of future results. “Tradepoint.io“, “Instant Investing” and “My Trading Tools” are registered trademarks of Apperbuild, LLC.

This website is operated by Apperbuild, LLC. We have no link to any brokerage firm and we do not provide investment advice. Every information and resource we provide is solely for the education of our readers. © 2020 Apperbuild, LLC. All rights reserved.
