TradePoint.io

Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images

March 21, 2025
in AI & Technology
Reading Time: 5 mins read

Artificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.

Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshi—a speech-text foundation model designed for real-time dialogue—MoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.


Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshi’s speech token stream. This design ensures that Moshi’s original conversational abilities remain intact while introducing the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules enables the model to selectively engage with visual data, maintaining efficiency and responsiveness. Notably, MoshiVis adds approximately 7 milliseconds of latency per inference step on consumer-grade devices such as a Mac mini with an M4 Pro chip, for a total of 55 milliseconds per inference step. This stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.
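To make the mechanism concrete, here is a minimal NumPy sketch of gated cross-attention of the kind described above: speech tokens attend to image features, and a learned scalar gate (initialized closed) scales the visual contribution so the pretrained speech stream is untouched until the gate opens. The class and parameter names are illustrative, not Kyutai's actual implementation, and real models use multi-head attention with trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Illustrative gated cross-attention block (single head).

    Speech tokens form the queries; image features form keys and values.
    A tanh-squashed scalar gate, initialized at zero, multiplies the visual
    term, so at initialization the block is an identity over the speech
    stream and the base model's behavior is preserved.
    """

    def __init__(self, d_model, d_visual, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0.0, s, (d_model, d_model))
        self.Wk = rng.normal(0.0, s, (d_visual, d_model))
        self.Wv = rng.normal(0.0, s, (d_visual, d_model))
        self.gate = 0.0  # trained to open when visual context helps

    def __call__(self, speech_tokens, image_feats):
        # speech_tokens: (T, d_model); image_feats: (N, d_visual)
        q = speech_tokens @ self.Wq                      # (T, d_model)
        k = image_feats @ self.Wk                        # (N, d_model)
        v = image_feats @ self.Wv                        # (N, d_model)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, N)
        visual = attn @ v                                # (T, d_model)
        # Residual add, scaled by the gate: gate=0 leaves speech unchanged.
        return speech_tokens + np.tanh(self.gate) * visual
```

Because the gate starts closed, such an adapter can be trained on vision-speech data without degrading the frozen speech backbone, which matches the article's point that Moshi's original conversational abilities remain intact.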

In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:

“I see two green metal structures with a mesh top, and they’re surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.”

This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.

In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.


Check out the technical details and try the model online. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.



