TradePoint.io
  • Main
  • AI & Technology
  • Stock Charts
  • Market & News
  • Business
  • Finance Tips
  • Trade Tube
  • Blog
  • Shop

Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images

March 21, 2025
in AI & Technology
Reading Time: 5 mins read

Artificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.

Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshi—a speech-text foundation model designed for real-time dialogue—MoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.


Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshi's speech token stream. This design keeps Moshi's original conversational abilities intact while adding the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules lets the model selectively engage with visual data, preserving efficiency and responsiveness. Notably, MoshiVis adds only about 7 milliseconds of latency per inference step on consumer-grade hardware such as a Mac mini with an M4 Pro chip, for a total of 55 milliseconds per inference step. This stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.
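The gated cross-attention idea described above can be illustrated with a minimal NumPy sketch. This is a simplified single-head approximation with assumed shapes and a sigmoid gate, not Kyutai's actual implementation; in MoshiVis the modules sit inside Moshi's transformer layers and the projection weights are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_attention(speech, image_feats, Wq, Wk, Wv, Wg, bg=0.0):
    """One gated cross-attention step: the speech token stream queries
    features produced by a frozen visual encoder.

    speech:      (T, d) hidden states of the speech token stream
    image_feats: (N, d) visual encoder outputs
    Wq, Wk, Wv:  (d, d) projection matrices; Wg: (d, 1) gate weights
    When the gate saturates to 0, the speech stream passes through
    unchanged, leaving the base model's behavior intact.
    """
    d = Wq.shape[1]
    Q = speech @ Wq                       # (T, d) queries from speech
    K = image_feats @ Wk                  # (N, d) keys from image
    V = image_feats @ Wv                  # (N, d) values from image
    attn = softmax(Q @ K.T / np.sqrt(d))  # (T, N) attention over image
    visual = attn @ V                     # (T, d) per-token visual context
    gate = sigmoid(speech @ Wg + bg)      # (T, 1) learned scalar gate
    return speech + gate * visual         # gated residual injection

# Toy usage with random weights
rng = np.random.default_rng(0)
T, N, d = 4, 6, 8
out = gated_cross_attention(
    rng.normal(size=(T, d)), rng.normal(size=(N, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, 1)),
)
print(out.shape)  # (4, 8)
```

The residual-plus-gate form is what makes the augmentation non-destructive: driving the gate toward zero recovers the original speech model exactly, which is why the base conversational abilities survive the addition of vision.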

In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:

“I see two green metal structures with a mesh top, and they’re surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.”

This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.

In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.


Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

Credit: Source link





©2020- TradePoint.io - All rights reserved!

Tradepoint.io, being just a publishing and technology platform, is not a registered broker-dealer or investment adviser. So we do not provide investment advice. Rather, brokerage services are provided to clients of Tradepoint.io by independent SEC-registered broker-dealers and members of FINRA/SIPC. Every form of investing carries some risk and past performance is not a guarantee of future results. “Tradepoint.io“, “Instant Investing” and “My Trading Tools” are registered trademarks of Apperbuild, LLC.

This website is operated by Apperbuild, LLC. We have no link to any brokerage firm and we do not provide investment advice. Every information and resource we provide is solely for the education of our readers. © 2020 Apperbuild, LLC. All rights reserved.
