TradePoint.io

One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

May 21, 2026
in AI & Technology

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

https://arxiv.org/pdf/2605.18678

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.

This breadth is what sets Lance apart. Most unified architectures stop at basic image understanding and text-to-image generation; Lance is among the few that natively cover both images and videos across understanding and generation tasks.

How the Architecture Works

The architecture is based on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs — text, images, and videos — into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types — text, semantic visual, and latent visual — live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
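The mixed attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch under assumptions, not Lance's implementation: the source only states that text tokens attend causally and visual tokens bidirectionally, so the rule used here (every token sees all earlier tokens, and a visual token additionally sees later tokens in its own segment) is one plausible reading. The `Segment` type and `build_attention_mask` function are hypothetical names.

```python
from typing import List, Tuple

# Each segment is (kind, length); kind is "text" or "visual".
Segment = Tuple[str, int]


def build_attention_mask(segments: List[Segment]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    # Record, for every token, which segment it belongs to and its kind.
    seg_id: List[int] = []
    kind: List[str] = []
    for index, (segment_kind, length) in enumerate(segments):
        seg_id.extend([index] * length)
        kind.extend([segment_kind] * length)

    total = len(seg_id)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k <= q:
                # Causal part: any earlier token, or the token itself.
                mask[q][k] = True
            elif kind[q] == "visual" and seg_id[q] == seg_id[k]:
                # Assumed rule: bidirectional only inside one visual segment.
                mask[q][k] = True
    return mask


# Toy interleaved sequence: 2 text tokens, 3 visual tokens, 2 text tokens.
segments = [("text", 2), ("visual", 3), ("text", 2)]
mask = build_attention_mask(segments)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three visual rows share the same visible span, while each text row grows by one position at a time.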

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
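To make the two-objective training concrete, here is a toy sketch of how a next-token loss and a flow matching loss could be combined with weights. Several details are assumptions: the source does not specify the flow matching path, so a straight-line path (velocity target equal to clean latent minus noise) is used, and the function names and the weights `w_und` and `w_gen` are hypothetical.

```python
import math
from typing import List


def next_token_loss(logits: List[float], target: int) -> float:
    """Cross-entropy for one next-token prediction, in log-sum-exp form."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def flow_matching_loss(pred_velocity: List[float],
                       noise: List[float],
                       clean: List[float]) -> float:
    """Mean squared error against the straight-line velocity (clean - noise)."""
    target = [c - n for c, n in zip(clean, noise)]
    squared = [(p - t) ** 2 for p, t in zip(pred_velocity, target)]
    return sum(squared) / len(squared)


def combined_loss(l_und: float, l_gen: float,
                  w_und: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of the understanding and generation losses."""
    return w_und * l_und + w_gen * l_gen


# Toy values: a two-way vocabulary and a two-dimensional latent.
l_und = next_token_loss(logits=[0.0, 0.0], target=0)
l_gen = flow_matching_loss(pred_velocity=[0.5, 0.0],
                           noise=[0.0, 1.0],
                           clean=[1.0, 1.0])
total = combined_loss(l_und, l_gen, w_und=1.0, w_gen=0.5)
print(l_und, l_gen, total)
```

In a real system each loss would be computed only over the tokens routed to the corresponding expert; the sketch leaves routing out to keep the weighting step visible.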

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone — it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
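The offset idea can be sketched on plain (t, h, w) position triples before any rotary embedding is applied. This is an illustration under assumptions rather than the paper's code: the offset is taken to be the group index times a fixed stride, and `GROUP_STRIDE`, `grid_positions`, and `apply_modality_offset` are hypothetical names with an arbitrary stride value.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w)

GROUP_STRIDE = 1000  # hypothetical fixed temporal offset per modality group


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain 3D positions for a token grid, as standard 3D-RoPE would see them."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def apply_modality_offset(positions: List[Position],
                          group_index: int,
                          stride: int = GROUP_STRIDE) -> List[Position]:
    """Shift only the temporal coordinate; spatial coordinates stay unchanged."""
    return [(t + group_index * stride, h, w) for (t, h, w) in positions]


# Three visual token groups in sequence order.
groups = [
    ("vit_semantic", grid_positions(1, 2, 2)),
    ("vae_condition", grid_positions(1, 2, 2)),
    ("vae_target", grid_positions(2, 2, 2)),
]

shifted = {
    name: apply_modality_offset(positions, index)
    for index, (name, positions) in enumerate(groups)
}

# Without the offset, the first two groups collide on identical positions.
print(set(groups[0][1]) == set(groups[1][1]))
# With the offset, no position is shared between them.
print(set(shifted["vit_semantic"]) & set(shifted["vae_condition"]))
```

Within `vae_target`, the two frames still sit at consecutive temporal indices after the shift, which mirrors the claim that ordering inside a video is not disturbed.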

Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data — editing samples, subject-driven generation samples, and multimodal understanding data — across approximately 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
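A progressive data-mixture schedule of this kind could look like the sketch below. The source does not give the schedule's shape or the mixture ratios, so the linear ramp, the task names, and every number here are assumptions chosen only for illustration.

```python
from typing import Dict


def mixture_at(step: int, total_steps: int,
               start: Dict[str, float],
               end: Dict[str, float]) -> Dict[str, float]:
    """Linearly interpolate sampling weights from start to end, then renormalize."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    raw = {task: (1.0 - frac) * start[task] + frac * end[task] for task in start}
    norm = sum(raw.values())
    return {task: weight / norm for task, weight in raw.items()}


# Hypothetical mixtures: editing grows from 20% to 40% of sampled data.
start_mix = {"understanding": 0.5, "subject_driven": 0.3, "editing": 0.2}
end_mix = {"understanding": 0.3, "subject_driven": 0.3, "editing": 0.4}

early = mixture_at(0, 10_000, start_mix, end_mix)
middle = mixture_at(5_000, 10_000, start_mix, end_mix)
late = mixture_at(10_000, 10_000, start_mix, end_mix)
print(early["editing"], middle["editing"], late["editing"])
```

A data loader would draw each batch's task according to the weights returned for the current step.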

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
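The group-relative part of GRPO can be shown in a few lines: several samples are generated for one prompt, each is scored, and each reward is normalized against the group's mean and standard deviation. In this sketch the reward is a stand-in, a string similarity between text that an OCR system might read from a generated image and the text the prompt requested. The OCR strings are made up, no PaddleOCR call is shown, and `text_reward` is a hypothetical scoring function rather than the reward used by the authors.

```python
import statistics
from difflib import SequenceMatcher
from typing import List


def text_reward(ocr_text: str, target_text: str) -> float:
    """Similarity in [0, 1] between recognized text and requested text."""
    return SequenceMatcher(None, ocr_text, target_text).ratio()


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt asking for a sign that reads "OPEN", with three sampled images.
target = "OPEN"
ocr_outputs = ["OPEN", "OPEM", ""]  # made-up OCR readings, best to worst
rewards = [text_reward(text, target) for text in ocr_outputs]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples that render the text more accurately than their group average receive positive advantages, and the others receive negative ones, so no separate value network is needed to form the baseline.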

All four stages fit within a maximum training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling — though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters — notable given that it is simultaneously trained for generation and editing.

Key Takeaways

  1. Lance is a natively unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
  2. A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples the understanding and generation pathways while keeping them in a shared interleaved multimodal context.
  3. Lance achieves 0.90 on GenEval and a Total Score of 85.11 on VBench, the highest VBench score among unified models, and was trained within a maximum budget of 128 GPUs.
  4. On MVBench, Lance scores 62.0, the highest among unified models — outperforming Show-o2 (7B) at 55.7, while also supporting generation and editing.
  5. Lance is open-source under Apache 2.0, with weights available on Hugging Face.
