• Kinza Babylon Staked BTCKinza Babylon Staked BTC(KBTC)$83,270.000.00%
  • Steakhouse EURCV Morpho VaultSteakhouse EURCV Morpho Vault(STEAKEURCV)$0.000000-100.00%
  • Stride Staked InjectiveStride Staked Injective(STINJ)$16.51-4.18%
  • Vested XORVested XOR(VXOR)$3,404.231,000.00%
  • FibSwap DEXFibSwap DEX(FIBO)$0.0084659.90%
  • ICPanda DAOICPanda DAO(PANDA)$0.003106-39.39%
  • TruFin Staked APTTruFin Staked APT(TRUAPT)$8.020.00%
  • bitcoinBitcoin(BTC)$102,770.001.46%
  • VNST StablecoinVNST Stablecoin(VNST)$0.0000400.67%
  • ethereumEthereum(ETH)$2,313.0112.89%
  • tetherTether(USDT)$1.00-0.02%
  • rippleXRP(XRP)$2.344.09%
  • binancecoinBNB(BNB)$633.301.99%
  • solanaSolana(SOL)$170.206.54%
  • Wrapped SOLWrapped SOL(SOL)$143.66-2.32%
  • usd-coinUSDC(USDC)$1.000.00%
  • dogecoinDogecoin(DOGE)$0.2028366.59%
  • cardanoCardano(ADA)$0.786.89%
  • tronTRON(TRX)$0.2611772.18%
  • staked-etherLido Staked Ether(STETH)$2,315.2013.14%
  • wrapped-bitcoinWrapped Bitcoin(WBTC)$102,618.001.44%
  • SuiSui(SUI)$3.88-1.21%
  • Gaj FinanceGaj Finance(GAJ)$0.0059271.46%
  • Content BitcoinContent Bitcoin(CTB)$24.482.55%
  • USD OneUSD One(USD1)$1.000.11%
  • chainlinkChainlink(LINK)$15.893.59%
  • UGOLD Inc.UGOLD Inc.(UGOLD)$3,042.460.08%
  • avalanche-2Avalanche(AVAX)$22.997.74%
  • Wrapped stETHWrapped stETH(WSTETH)$2,775.8812.92%
  • ParkcoinParkcoin(KPK)$1.101.76%
  • stellarStellar(XLM)$0.2949064.20%
  • shiba-inuShiba Inu(SHIB)$0.0000156.54%
  • hedera-hashgraphHedera(HBAR)$0.1982223.63%
  • HyperliquidHyperliquid(HYPE)$24.5711.76%
  • ToncoinToncoin(TON)$3.251.09%
  • bitcoin-cashBitcoin Cash(BCH)$406.40-3.04%
  • leo-tokenLEO Token(LEO)$8.69-1.34%
  • USDSUSDS(USDS)$1.000.00%
  • litecoinLitecoin(LTC)$98.315.67%
  • polkadotPolkadot(DOT)$4.749.78%
  • Yay StakeStone EtherYay StakeStone Ether(YAYSTONE)$2,671.07-2.84%
  • wethWETH(WETH)$2,314.8312.99%
  • Pundi AIFXPundi AIFX(PUNDIAI)$16.000.00%
  • PengPeng(PENG)$0.60-13.59%
  • moneroMonero(XMR)$304.992.62%
  • Wrapped eETHWrapped eETH(WEETH)$2,471.0913.13%
  • Bitget TokenBitget Token(BGB)$4.460.31%
  • Binance Bridged USDT (BNB Smart Chain)Binance Bridged USDT (BNB Smart Chain)(BSC-USD)$1.00-0.35%
  • PepePepe(PEPE)$0.00001219.91%
  • Pi NetworkPi Network(PI)$0.7213.62%
TradePoint.io
  • Main
  • AI & Technology
  • Stock Charts
  • Market & News
  • Business
  • Finance Tips
  • Trade Tube
  • Blog
  • Shop
No Result
View All Result
TradePoint.io
No Result
View All Result

Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities

May 9, 2025
in AI & Technology
Reading Time: 4 mins read
A A
Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities
ShareShareShareShareShare

LLMs have made significant strides in language-related tasks such as conversational AI, reasoning, and code generation. However, human communication extends beyond text, often incorporating visual elements to enhance understanding. To create a truly versatile AI, models need the ability to process and generate text and visual information simultaneously. Training such unified vision-language models from scratch using methods like autoregressive token prediction or a hybrid approach combining diffusion and language losses has shown strong performance. Still, it requires vast computational resources and retraining for each new modality. An alternative approach adapts pretrained LLMs with vision capabilities, which offers a more efficient path but often compromises the language model’s original performance.

Current research has focused on three main strategies: merging LLMs with standalone image generation models, training large multimodal models end-to-end, or using a combination of diffusion and autoregressive losses. While these methods have achieved state-of-the-art results, they either require retraining large models or result in degradation of the LLM’s core capabilities. Despite these challenges, leveraging pretrained LLMs with added vision components has demonstrated significant potential, particularly in tasks involving image understanding and generation. However, these methods still face limitations in terms of efficiency and flexibility. 

YOU MAY ALSO LIKE

Scopely appoints Shlomi Aizenberg as chief business officer

Minimalism stretched to the point of frustration

Researchers from UCLA, the University of Wisconsin-Madison, and Adobe Research propose X-Fusion, which adapts pretrained LLMs for multimodal tasks while preserving language capabilities. X-Fusion utilizes a dual-tower architecture, freezing the LLM’s language weights while adding a vision-specific tower to process visual information. The approach aligns text and vision features at multiple levels, improving performance in image-to-text and text-to-image tasks. Through ablation studies, the researchers emphasize the importance of clean image data for training and show that aligning vision features with pre-trained representations accelerates convergence, especially for smaller models. 

X-Fusion is a unified framework that adapts pretrained LLMs for vision tasks while retaining their language capabilities. It uses a dual-tower design, freezing the LLM’s text weights while introducing a separate vision tower for processing visual information. Images are tokenized using a pretrained encoder, and image and text tokens are jointly optimized. The model incorporates an optional X-Fuse operation to merge features from both towers for enhanced performance. X-Fusion is trained with autoregressive and image denoising losses, and its performance is evaluated on image generation (text-to-image) and image understanding (image-to-text) tasks. 

The study evaluates the Dual Tower architecture against alternative transformer variants for multimodal integration. It compares the Single Tower, Gated Tower, and Dual Projection designs, highlighting the flexibility of the Dual Tower for image and text tasks. The Dual Tower performs best in image generation and understanding, outperforming other designs by 23% in FID without increasing training parameters. The study also investigates the effects of noise and data ratios on performance, finding that clean images improve understanding and generation. Additionally, aligning vision features with a pretrained encoder like CLIP boosts performance, especially for smaller models. 

In conclusion, X-Fusion is a framework that adapts pretrained LLMs to multimodal tasks, such as image understanding and generation, while preserving language capabilities. It introduces a Dual Tower architecture where language weights remain fixed, and a separate trainable vision tower processes visual features. Experimental results show that X-Fusion outperforms alternative designs in image and text-to-image tasks. Key findings include the benefits of incorporating understanding-focused data, reducing noise in image data, and the positive impact of feature alignment, especially for smaller models. The research contributes valuable insights into building efficient multimodal models. 


Check out the Paper. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

Credit: Source link

ShareTweetSendSharePin

Related Posts

Scopely appoints Shlomi Aizenberg as chief business officer
AI & Technology

Scopely appoints Shlomi Aizenberg as chief business officer

May 9, 2025
Minimalism stretched to the point of frustration
AI & Technology

Minimalism stretched to the point of frustration

May 9, 2025
Ming-Lite-Uni: An Open-Source AI Framework Designed to Unify Text and Vision through an Autoregressive Multimodal Structure
AI & Technology

Ming-Lite-Uni: An Open-Source AI Framework Designed to Unify Text and Vision through an Autoregressive Multimodal Structure

May 9, 2025
Square Enix’s Symbiogenesis onchain game debuts on Sony’s Soneium blockchain
AI & Technology

Square Enix’s Symbiogenesis onchain game debuts on Sony’s Soneium blockchain

May 9, 2025
Next Post
Akamai Technologies, Inc. (AKAM) Q1 2025 Earnings Call Transcript

Akamai Technologies, Inc. (AKAM) Q1 2025 Earnings Call Transcript

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Search

No Result
View All Result
What history reveals about presidents with rocky market starts

What history reveals about presidents with rocky market starts

May 3, 2025
Conclave to Choose Next Pope Is Predicted to ‘Be Very Short’ – The Daily Beast

Conclave to Choose Next Pope Is Predicted to ‘Be Very Short’ – The Daily Beast

May 3, 2025
We Hired 400 People Out of 15,000 Job Applications Last Year

We Hired 400 People Out of 15,000 Job Applications Last Year

May 3, 2025

About

Learn more

Our Services

Legal

Privacy Policy

Terms of Use

Bloggers

Learn more

Article Links

Contact

Advertise

Ask us anything

©2020- TradePoint.io - All rights reserved!

Tradepoint.io, being just a publishing and technology platform, is not a registered broker-dealer or investment adviser. So we do not provide investment advice. Rather, brokerage services are provided to clients of Tradepoint.io by independent SEC-registered broker-dealers and members of FINRA/SIPC. Every form of investing carries some risk and past performance is not a guarantee of future results. “Tradepoint.io“, “Instant Investing” and “My Trading Tools” are registered trademarks of Apperbuild, LLC.

This website is operated by Apperbuild, LLC. We have no link to any brokerage firm and we do not provide investment advice. Every information and resource we provide is solely for the education of our readers. © 2020 Apperbuild, LLC. All rights reserved.

No Result
View All Result
  • Main
  • AI & Technology
  • Stock Charts
  • Market & News
  • Business
  • Finance Tips
  • Trade Tube
  • Blog
  • Shop

© 2023 - TradePoint.io - All Rights Reserved!