TradePoint.io
RoR-Bench: Revealing Recitation Over Reasoning in Large Language Models Through Subtle Context Shifts

April 11, 2025
in AI & Technology

In recent years, the rapid progress of LLMs has given the impression that Artificial General Intelligence (AGI) is within reach, with models seemingly capable of solving increasingly complex tasks. However, a fundamental question remains: are LLMs genuinely reasoning like humans, or merely repeating patterns learned during training? Since the release of models like GPT-3 and ChatGPT, LLMs have reshaped the research landscape across AI and science. Improvements in data quality, model scaling, and multi-step reasoning have brought LLMs close to passing high-level AGI benchmarks. Yet their true reasoning capabilities are not fully understood. Cases in which advanced models fail on simple math problems raise concerns about whether they are truly reasoning or just mimicking familiar solution patterns.

Although various benchmarks exist to evaluate LLMs across domains like general knowledge, coding, math, and reasoning, many rely on tasks that can be solved by applying memorized templates. As a result, the actual intelligence and robustness of LLMs remain debatable. Studies show that LLMs struggle with subtle context shifts, simple calculations, symbolic reasoning, and out-of-distribution prompts, and that these weaknesses are amplified under perturbed conditions or misleading cues. Multi-modal LLMs, including vision-language models like GPT-4V and LLaVA, show the same tendency to recite instead of reason when tested with subtly altered visual or textual inputs. This suggests that issues like spurious correlations, memorization, and inefficient decoding might underlie these failures, indicating a gap between observed performance and genuine understanding.

Researchers from ByteDance Seed and the University of Illinois Urbana-Champaign introduce RoR-Bench, a new multi-modal benchmark designed to identify whether LLMs rely on recitation rather than genuine reasoning when solving simple problems with subtly altered conditions. The benchmark includes 158 text and 57 image problem pairs, each featuring a basic reasoning task alongside a slightly modified version. Experiments reveal that leading models like OpenAI-o1 and DeepSeek-R1 suffer drastic performance drops, often over 60%, when the problems are changed only slightly. Most models also struggle to recognize unsolvable problems. Preliminary fixes like prompt engineering offer limited improvement, emphasizing the need for deeper solutions.

RoR-Bench is a Chinese multimodal benchmark created to assess whether LLMs rely on memorized solution patterns rather than true reasoning. It contains 215 problem pairs (158 text-based and 57 image-based), where each pair includes an original and a subtly altered version. The original problems are simple, often from children’s puzzle sets, while the modified ones introduce minor changes that require entirely different reasoning. Annotators ensured minimal wording changes and no ambiguity. Notably, some problems are designed to have no solution or to feature unrelated information, testing LLMs’ ability to recognize illogical conditions and resist recitation-based answers.
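
To make the pair structure concrete, here is a minimal Python sketch of how one such problem pair might be represented. The class name, field names, and the sample puzzle are all assumptions made for illustration; they are not taken from the benchmark's actual data format or contents.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProblemPair:
    """Hypothetical record for one original/modified problem pair."""
    pair_id: str
    modality: str                   # "text" or "image"
    original_question: str
    original_answer: str
    modified_question: str
    modified_answer: Optional[str]  # None marks a modified problem with no solution
    image_path: Optional[str] = None

    @property
    def is_unsolvable(self) -> bool:
        # The modified variant is treated as unsolvable when it has no answer.
        return self.modified_answer is None


def count_by_modality(pairs: List[ProblemPair]) -> Dict[str, int]:
    """Count how many pairs belong to each modality."""
    counts: Dict[str, int] = {}
    for pair in pairs:
        counts[pair.modality] = counts.get(pair.modality, 0) + 1
    return counts


# Invented example, not from the benchmark: changing one number
# turns a classic puzzle into one with no valid answer.
example = ProblemPair(
    pair_id="demo-001",
    modality="text",
    original_question=(
        "A snail in a 10 m well climbs 3 m each day and slips 2 m each "
        "night. On which day does it get out?"
    ),
    original_answer="Day 8",
    modified_question=(
        "A snail in a 10 m well climbs 3 m each day and slips 3 m each "
        "night. On which day does it get out?"
    ),
    modified_answer=None,
)
```

A model that recites the familiar solution would still answer with a day number for the modified question, which is the failure mode the benchmark is described as probing.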

The study empirically evaluates leading LLMs and VLMs on RoR-Bench, focusing on their ability to reason through subtle problem changes rather than merely recalling learned patterns. Results reveal that most models suffer a significant performance drop, often over 50%, when tested on slightly modified problems, suggesting a reliance on memorization rather than genuine reasoning. Even techniques like Chain-of-Thought prompting or “Forced Correct” instructions provide limited improvement. Few-shot in-context learning shows some gains, especially with more examples or added instructions, but still fails to close the gap. Overall, these findings highlight the limitations of current models in adaptive reasoning.
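
A performance drop of this kind can be measured by scoring the same model on the original and the modified problems and comparing the two accuracies. The Python sketch below shows one way to do that. The function names are hypothetical, the numbers are invented for illustration and are not results from the study, and the article does not say whether its percentages are absolute points or relative to the original accuracy, so the sketch reports both.

```python
from typing import Dict, Sequence


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of problems answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one graded answer")
    return sum(correct_flags) / len(correct_flags)


def performance_drop(
    original_flags: Sequence[bool],
    modified_flags: Sequence[bool],
) -> Dict[str, float]:
    """Compare accuracy on original problems with accuracy on modified ones."""
    acc_original = accuracy(original_flags)
    acc_modified = accuracy(modified_flags)
    absolute = acc_original - acc_modified
    # Relative drop is undefined when the original accuracy is zero;
    # report 0.0 in that case as a simplifying assumption.
    relative = absolute / acc_original if acc_original > 0 else 0.0
    return {
        "original_accuracy": acc_original,
        "modified_accuracy": acc_modified,
        "absolute_drop": absolute,
        "relative_drop": relative,
    }


# Invented grading results for ten problem pairs (illustration only).
original_results = [True] * 9 + [False]      # 9 of 10 correct
modified_results = [True] * 3 + [False] * 7  # 3 of 10 correct

report = performance_drop(original_results, modified_results)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Reporting both figures avoids ambiguity: a fall from 90% to 30% accuracy is a 60-point absolute drop but roughly a 67% relative drop.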

In conclusion, the study introduces RoR-Bench, a Chinese multimodal benchmark designed to uncover a critical flaw in current large language models: their inability to handle simple reasoning tasks when problem conditions are slightly altered. The significant performance drop, often over 50%, suggests that these models rely on memorization rather than true reasoning. Even with added prompts or few-shot examples, the issue remains largely unresolved. Although the benchmark itself is limited to Chinese, initial English results indicate similar weaknesses. The findings challenge assumptions about LLM intelligence and call for future research to develop models that reason genuinely rather than reciting learned patterns from training data.


All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
