LLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled Data

April 23, 2025
in AI & Technology


Despite significant advances in reasoning capabilities through reinforcement learning (RL), most large language models (LLMs) remain fundamentally dependent on supervised data pipelines. RL frameworks such as RLHF have advanced model alignment and instruction following, but they rely heavily on human feedback and labeled datasets. As LLMs are increasingly applied in dynamic environments, from educational settings to scientific workflows, they must generalize beyond curated training data.

However, existing models often exhibit performance gaps when confronted with distribution shifts or novel reasoning tasks. While techniques like Test-Time Scaling (TTS) and Test-Time Training (TTT) have been proposed to mitigate this, the absence of reliable reward signals during inference poses a core challenge for deploying RL in unsupervised settings.

Test-Time Reinforcement Learning (TTRL): Leveraging Model Priors for Self-Adaptation

Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL). TTRL is a training framework that applies RL during inference, using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs.

Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision.

TTRL follows a two-stage approach, sketched in code after this list:

  • Label Estimation via Majority Voting: For each prompt, the model samples multiple outputs. The most frequent prediction is treated as the estimated label.
  • Reward Assignment and Policy Optimization: A binary reward is assigned based on whether each sampled response matches the estimated label. The model is updated using gradient-based RL algorithms (e.g., PPO or GRPO) to maximize agreement with the pseudo-labels.
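To make the two stages concrete, here is a minimal Python sketch of the reward construction. The helper names and the toy answers list are hypothetical illustrations, not the authors' implementation, and it assumes final answers have already been extracted from the sampled generations.

    from collections import Counter

    def estimate_label(responses):
        # Stage 1: the most frequent answer among the sampled
        # responses becomes the pseudo-label (majority voting).
        pseudo_label, _ = Counter(responses).most_common(1)[0]
        return pseudo_label

    def assign_rewards(responses, pseudo_label):
        # Stage 2: binary reward, 1.0 if a response matches the
        # majority-voted pseudo-label, 0.0 otherwise.
        return [1.0 if r == pseudo_label else 0.0 for r in responses]

    # Hypothetical usage on answers extracted for a single prompt:
    answers = ["42", "41", "42", "42"]
    label = estimate_label(answers)           # -> "42"
    rewards = assign_rewards(answers, label)  # -> [1.0, 0.0, 1.0, 1.0]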

This approach is notable for its simplicity and compatibility with standard RL methods. The reward function, though approximate, provides sufficient learning signal when aggregated over multiple samples. Experimental setups used temperature-controlled sampling (typically temperature = 1.0), with 64 samples for voting and 16 subsampled responses for training updates. No ground-truth labels are involved at any stage.
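Wiring the stated hyperparameters together, an end-to-end step for one prompt might look like the sketch below. The decoding stub and prompt are placeholders, and the advantage computation uses the standard GRPO group-normalization as one concrete choice, not necessarily the paper's exact implementation.

    import random
    import statistics
    from collections import Counter

    TEMPERATURE = 1.0  # temperature-controlled sampling, as reported
    N_VOTES = 64       # samples aggregated for majority voting
    N_TRAIN = 16       # responses subsampled for the policy update

    def sample_responses(prompt, n, temperature):
        # Placeholder for LLM decoding: a real system would generate n
        # completions at the given temperature and extract final answers.
        return [random.choice(["42", "41", "43"]) for _ in range(n)]

    def grpo_advantages(rewards):
        # GRPO-style group normalization: standardize each reward
        # within its group, so no learned value function is required.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
        return [(r - mean) / std for r in rewards]

    # One TTRL step for a single prompt:
    votes = sample_responses("What is 6 * 7?", N_VOTES, TEMPERATURE)
    pseudo_label = Counter(votes).most_common(1)[0][0]  # majority vote
    train_batch = random.sample(votes, N_TRAIN)
    rewards = [1.0 if v == pseudo_label else 0.0 for v in train_batch]
    advantages = grpo_advantages(rewards)  # inputs to the gradient update

Because rewards are binary and judged against the group's own consensus, group normalization gives responses that match the consensus positive advantages and dissenting ones negative advantages, without requiring a critic model.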

Empirical Findings across Mathematical Reasoning Tasks

TTRL was evaluated on three mathematical benchmarks: AIME 2024, AMC, and MATH-500. The results are consistent across both smaller and larger models:

  • For Qwen2.5-Math-7B, pass@1 accuracy on AIME 2024 rose from 16.7% to 43.3%, a relative improvement of 159.3% ((43.3 − 16.7) / 16.7 ≈ 1.593), achieved without any labeled data.
  • On average, across the three benchmarks, the same model achieved a relative gain of 84.1%.
  • Notably, even a smaller model, Qwen2.5-Math-1.5B, improved from 33.0% to 80.0% on MATH-500.

These gains demonstrate that TTRL supports model improvement even in the absence of supervised training signals. Moreover, TTRL often outperforms the upper bound implied by its own training signal—i.e., the accuracy of the majority-voted predictions. This suggests a self-reinforcing learning loop that can extract richer supervision from noisy consensus signals.

Additional analyses showed that TTRL generalizes beyond the dataset it was applied to. When trained on one benchmark and evaluated on others, performance improvements persisted. This cross-task transfer indicates that TTRL does not lead to narrow overfitting but supports broader generalization.

Conclusion: Toward Self-Adaptive and Label-Free Learning

TTRL represents a novel shift in how reinforcement learning can be applied to LLMs in real-world settings. By reusing the model’s own generations as a proxy for supervision, it removes the need for expensive human annotations while enabling continual adaptation. The approach scales naturally with model size, is compatible with different RL algorithms, and shows promising robustness across tasks of varying difficulty.

While this study focuses on mathematical reasoning, the underlying ideas—self-estimated supervision, test-time adaptation, and reinforcement learning without labels—may generalize to other domains. As language models increasingly encounter tasks beyond their pre-training distribution, frameworks like TTRL offer a scalable path forward.

Further exploration is needed to understand the theoretical convergence properties of TTRL and to evaluate its applicability in interactive or multi-agent scenarios. Nonetheless, TTRL provides a technically sound and computationally efficient foundation for enabling LLMs to evolve continuously from their own outputs.


Check out the Paper and GitHub Page.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

