
NVIDIA AI Presents ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

July 30, 2025
in AI & Technology

Introduction

Embodied AI agents are increasingly called upon to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from NVIDIA and National Taiwan University, tackles this challenge for vision-language-action (VLA) reasoning by introducing reinforced visual latent planning to bridge high-level multimodal reasoning and low-level robot control.

Typical VLA models map raw visual and language inputs directly to actions through end-to-end training, which limits reasoning, long-term planning, and adaptability. Recent methods incorporate intermediate chain-of-thought (CoT) reasoning or attempt RL-based optimization, but they struggle with scalability, grounding, or generalization when confronted with highly variable, long-horizon robotic manipulation tasks.


The ThinkAct Framework

Dual-System Architecture

ThinkAct consists of two tightly integrated components:

  • Reasoning Multimodal LLM (MLLM): Performs structured, step-by-step reasoning over visual scenes and language instructions, outputting a visual plan latent that encodes high-level intent and planning context.
  • Action Model: A Transformer-based policy conditioned on the visual plan latent, executing the decoded trajectory as robot actions in the environment.

This design allows asynchronous operation: the LLM “thinks” and generates plans at a slow cadence, while the action module carries out fine-grained control at higher frequency.
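This slow/fast split can be sketched as a simple control loop. The function names (`reason`, `act`), the latent size, and the replanning ratio below are illustrative assumptions, not the paper's actual interfaces:

```python
import numpy as np

REPLAN_EVERY = 10  # reasoning runs once per 10 control steps (assumed ratio)

def reason(observation, instruction):
    """Stand-in for the reasoning MLLM: returns a visual plan latent."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(64)  # 64-d latent; dimensionality assumed

def act(observation, plan_latent):
    """Stand-in for the Transformer policy conditioned on the latent."""
    return np.tanh(plan_latent[:7])  # e.g. a 7-DoF action vector

def rollout(env_steps, observation, instruction):
    actions, plan = [], None
    for t in range(env_steps):
        if t % REPLAN_EVERY == 0:            # slow "thinking" cadence
            plan = reason(observation, instruction)
        actions.append(act(observation, plan))  # fast control cadence
    return actions

acts = rollout(30, observation=None, instruction="pick up the cup")
print(len(acts))  # 30 actions from only 3 reasoning calls
```

The key design point is that the expensive LLM forward pass amortizes over many cheap policy steps, which is what makes real-time control feasible.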

Reinforced Visual Latent Planning

A core innovation is the reinforcement learning (RL) approach leveraging action-aligned visual rewards:

  • Goal Reward: Encourages the model to align the start and end positions predicted in the plan with those in demonstration trajectories, supporting goal completion.
  • Trajectory Reward: Regularizes the predicted visual trajectory to closely match distributional properties of expert demonstrations using dynamic time warping (DTW) distance.

The total reward r blends these visual rewards with a format-correctness score, pushing the LLM to produce not only accurate answers but also plans that translate into physically plausible robot actions.
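The reward shaping above can be illustrated with a minimal numpy sketch. The exact DTW variant, weighting coefficients, and reward scaling in the paper may differ; everything numeric below is an assumption:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 2-D trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def goal_reward(pred, expert):
    """Reward alignment of the plan's start and end points with the demo."""
    start_err = np.linalg.norm(pred[0] - expert[0])
    end_err = np.linalg.norm(pred[-1] - expert[-1])
    return np.exp(-(start_err + end_err))

def trajectory_reward(pred, expert):
    """Reward low DTW distance to the expert trajectory."""
    return np.exp(-dtw_distance(pred, expert) / len(expert))

def total_reward(pred, expert, format_ok, w_visual=0.9):
    """Blend visual rewards with a format-correctness score (weights assumed)."""
    r_visual = 0.5 * goal_reward(pred, expert) + 0.5 * trajectory_reward(pred, expert)
    return w_visual * r_visual + (1 - w_visual) * float(format_ok)

expert = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
print(total_reward(expert.copy(), expert, format_ok=True))  # 1.0 (perfect match)
```

A plan identical to the demonstration with correct formatting scores the maximum, and any deviation in endpoints or shape decays the reward smoothly, which gives the RL stage a dense learning signal.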

Training Pipeline

The multi-stage training procedure includes:

  1. Supervised Fine-Tuning (SFT): Cold-start with manually annotated visual trajectory and QA data to teach trajectory prediction, reasoning, and answer formatting.
  2. Reinforced Fine-Tuning: RL optimization (using Group Relative Policy Optimization, GRPO) further incentivizes high-quality reasoning by maximizing the newly defined action-aligned rewards.
  3. Action Adaptation: The downstream action policy is trained using imitation learning, leveraging the frozen LLM’s latent plan output to guide control across varied environments.
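The GRPO step in stage 2 works by scoring a group of sampled plans per instruction and normalizing each reward against the group statistics. The snippet below is a minimal sketch of only the group-relative advantage computation, not the full policy update:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group's
    mean and standard deviation (the core of GRPO's advantage)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. four sampled plans for one instruction, each scored by the
# action-aligned reward described above (values illustrative)
adv = group_relative_advantages([0.9, 0.4, 0.6, 0.5])
print(adv.round(2))  # the best-scoring plan receives a positive advantage
```

Because advantages are computed relative to the group rather than a learned value function, GRPO avoids training a separate critic, which is part of what makes RL fine-tuning of large reasoning models tractable.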

Inference

At inference time, given an observed scene and a language instruction, the reasoning module generates a visual plan latent, which then conditions the action module to execute a full trajectory—enabling robust performance even in new, previously unseen settings.

Experimental Results

Robot Manipulation Benchmarks

Experiments on SimplerEnv and LIBERO benchmarks demonstrate ThinkAct’s superiority:

  • SimplerEnv: Outperforms strong baselines (e.g., OpenVLA, DiT-Policy, TraceVLA) by 11–17% in various settings, especially excelling in long-horizon and visually diverse tasks.
  • LIBERO: Achieves the highest overall success rates (84.4%), excelling in spatial, object, goal, and long-horizon challenges, confirming its ability to generalize and adapt to novel skills and layouts.

Embodied Reasoning Benchmarks

On EgoPlan-Bench2, RoboVQA, and OpenEQA, ThinkAct demonstrates:

  • Superior multi-step and long-horizon planning accuracy.
  • State-of-the-art BLEU and LLM-based QA scores, reflecting improved semantic understanding and grounding for visual question answering tasks.

Few-Shot Adaptation

ThinkAct enables effective few-shot adaptation: with as few as 10 demonstrations, it achieves substantial success rate gains over other methods, highlighting the power of reasoning-guided planning for quickly learning new skills or environments.

Self-Reflection and Correction

Beyond task success, ThinkAct exhibits emergent behaviors:

  • Failure Detection: Recognizes execution errors (e.g., dropped objects).
  • Replanning: Automatically revises plans to recover and complete the task, thanks to reasoning on recent visual input sequences.

Ablation Studies and Model Analysis

  • Reward Ablations: Both goal and trajectory rewards are essential for structured planning and generalization. Removing either significantly drops performance, and relying only on QA-style rewards limits multi-step reasoning capability.
  • Reduction in Update Frequency: ThinkAct achieves a balance between reasoning (slow, planning) and action (fast, control), allowing robust performance without excessive computational demand.
  • Smaller Models: The approach generalizes to smaller MLLM backbones, maintaining strong reasoning and action capabilities.

Implementation Details

  • Main backbone: Qwen2.5-VL 7B MLLM.
  • Datasets: Diverse robot and human demonstration videos (Open X-Embodiment, Something-Something V2), plus multimodal QA sets (RoboVQA, EgoPlan-Bench, Video-R1-CoT, etc.).
  • Uses a vision encoder (DINOv2), text encoder (CLIP), and a Q-Former for connecting reasoning output to action policy input.
  • Extensive experiments on real and simulated settings confirm scalability and robustness.
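The Q-Former-style connector mentioned above can be sketched as cross-attention of a small set of fixed queries over the reasoning module's variable-length latent tokens. This is a hypothetical single-head numpy illustration of the general mechanism, not the paper's architecture; all dimensions and weight matrices are assumed:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_pool(latent_tokens, queries, W_k, W_v):
    """Fixed learnable queries cross-attend to variable-length latent
    tokens and emit a fixed-size conditioning input for the policy."""
    K = latent_tokens @ W_k                                # (T, d)
    V = latent_tokens @ W_v                                # (T, d)
    attn = softmax(queries @ K.T / np.sqrt(K.shape[-1]))   # (Q, T)
    return attn @ V                                        # (Q, d), fixed size

rng = np.random.default_rng(0)
d, T, Q = 32, 17, 4  # dims are illustrative
out = qformer_pool(rng.standard_normal((T, d)),
                   rng.standard_normal((Q, d)),
                   rng.standard_normal((d, d)),
                   rng.standard_normal((d, d)))
print(out.shape)  # (4, 32): same output size regardless of T
```

The point of such a connector is the fixed-size output: however long the reasoning trace, the action policy always receives a conditioning tensor of the same shape.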

Conclusion

NVIDIA's ThinkAct sets a new standard for embodied AI agents, showing that reinforced visual latent planning—where agents "think before they act"—delivers robust, scalable, and adaptive performance in complex, real-world reasoning and robot manipulation tasks. Its dual-system design, reward shaping, and strong empirical results pave the way for intelligent, generalist robots capable of long-horizon planning, few-shot adaptation, and self-correction in diverse environments.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.

