TradePoint.io

Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

June 8, 2026
in AI & Technology

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes a 1-trillion-parameter model at more than 1000 tokens per second, which the Xiaomi team describes as a first at trillion-parameter scale. Demos show generation peaking near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability. It changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
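The block-scaling idea behind MXFP4 can be illustrated with a minimal sketch. It assumes the OCP Microscaling layout (blocks of 32 elements sharing one 8-bit power-of-two scale, with 4-bit E2M1 elements). The function names are hypothetical, and this is not the Xiaomi or TileRT implementation, which the article does not describe.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale in the MX formats
ELEM_EMAX = 2     # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)
SCALE_BITS = 8    # one E8M0 (power-of-two) scale per block


def quantize_block(block):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the power-of-two scale so the largest value lands in FP4 range.
    shared_exp = math.floor(math.log2(amax)) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_MAGNITUDES[-1])  # clamp to the largest FP4 value
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - mag))
        out.append(math.copysign(nearest, x) * scale)
    return shared_exp, out


def bits_per_weight(block_size=BLOCK_SIZE):
    """Average storage cost per weight, including the shared scale."""
    return (4 * block_size + SCALE_BITS) / block_size


if __name__ == "__main__":
    exp, deq = quantize_block([12.0, 1.0, -3.1, 0.2] + [0.0] * 28)
    print("shared exponent:", exp)
    print("first values after round trip:", deq[:4])
    print("bits per weight:", bits_per_weight())
```

Under these assumptions each quantized weight costs 4.25 bits instead of 8 or 16, which is where the bandwidth saving comes from. The rounding error visible in the round trip is what quantization-aware training has to absorb.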

The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps the output distribution identical to that of normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass.
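The verification step can be sketched with the generic speculative-sampling acceptance rule: accept a draft token with probability min(1, p/q), and on the first rejection resample from the normalized residual max(0, p - q). This is a minimal illustration with hypothetical names and toy dictionary distributions. It is not the DFlash or MiMo code, and how DFlash obtains the per-position draft probabilities is not covered here.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of non-negative weights."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify one drafted block against the target model.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs needs one extra entry for the bonus position that is sampled
    when every draft token is accepted.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]
        p = target_probs[i].get(tok, 0.0)
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)  # draft token accepted
            continue
        # First rejection: resample from the residual distribution and stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        emitted.append(sample(residual, rng))
        return emitted
    # All draft tokens accepted: take one bonus token from the target model.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    target = [{"a": 0.7, "b": 0.3}, {"c": 0.9, "d": 0.1}, {"e": 1.0}]
    draft = [{"a": 0.6, "b": 0.4}, {"c": 0.5, "d": 0.5}]
    print(verify_draft(["a", "d"], draft, target, rng))
```

The number of tokens emitted per call is what the article calls acceptance length, up to the question of whether the final resampled or bonus token is counted.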

Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and raise concurrency.
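The "constant rather than growing" property of Sliding Window Attention can be seen from the attention mask alone. The sketch below builds a causal sliding-window mask and counts how many keys each query position attends to. The window size is a hypothetical value chosen for illustration, since the article does not state the one used by the draft model.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and limited to the most recent `window` positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def keys_attended(mask):
    """Number of key positions visible to each query position."""
    return [sum(row) for row in mask]


if __name__ == "__main__":
    WINDOW = 3  # hypothetical window size, for illustration only
    print(keys_attended(sliding_window_mask(6, WINDOW)))
```

Once the context is longer than the window, every position attends to exactly `window` keys, so the draft model's attention cost per prediction stops growing with context length.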

Acceptance length measures how many draft tokens survive verification each round.

Scenario | Acceptance Length
Coding | 6.30
Math / Reasoning | 5.56
Agent | 4.29

In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14.
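A rough way to read these numbers is as a time budget per draft-and-verify round. The sketch below assumes a single decoding stream in which each round emits the reported acceptance length in tokens; whether that figure includes the extra token produced by the target model is not stated in the article, so treat the results as approximate. The function name is hypothetical.

```python
def round_budget_ms(acceptance_length, target_tps):
    """Time available per draft-and-verify round to sustain target_tps.

    tokens per second = tokens per round / seconds per round, so
    seconds per round = acceptance_length / target_tps.
    """
    return acceptance_length * 1000.0 / target_tps


# Acceptance lengths as reported in the article.
ACCEPTANCE = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}

if __name__ == "__main__":
    for scenario, length in ACCEPTANCE.items():
        budget = round_budget_ms(length, 1000)
        print(f"{scenario}: {budget:.2f} ms per round at 1000 tokens/s")
```

Under these assumptions, a lower acceptance length leaves proportionally less time for each round, which is why the agent scenario is harder to keep at the same speed than coding.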

TileRT: Squeezing the Microseconds

At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward.
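The launch-gap argument can be made concrete with back-of-the-envelope arithmetic. Every number in the sketch below is a hypothetical placeholder (launches per round, cost per launch, and the round budget); none comes from Xiaomi or TileRT. It only shows how per-launch overhead adds up when each round lasts a few milliseconds.

```python
def launch_overhead_ms(num_launches, launch_cost_us):
    """Total time spent on kernel launches in one round, in milliseconds."""
    return num_launches * launch_cost_us / 1000.0


def overhead_fraction(num_launches, launch_cost_us, round_budget_ms):
    """Share of the round budget consumed by launch overhead alone."""
    return launch_overhead_ms(num_launches, launch_cost_us) / round_budget_ms


if __name__ == "__main__":
    # Hypothetical inputs: 600 operator launches per round, 5 microseconds
    # each, and a 6.3 ms round budget (1000 tokens/s at ~6.3 tokens per round).
    launches, cost_us, budget_ms = 600, 5.0, 6.3
    print(f"launch overhead: {launch_overhead_ms(launches, cost_us):.1f} ms")
    print(f"share of budget: {overhead_fraction(launches, cost_us, budget_ms):.0%}")
```

With these placeholder values, launches alone would consume close to half of the round before any useful work is done. A kernel that stays resident on the GPU avoids paying that cost per operator.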

Use Cases

The release targets latency-sensitive work where waiting breaks the loop:

  • Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
  • Coding agents: faster code generation cuts the wait between agent steps.
  • Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
  • Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

These are decode-bound workloads where raw per-request token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to extreme decode speed: custom silicon and model-system codesign on commodity GPUs.

Approach | Hardware | How speed is achieved
Cerebras | Wafer-Scale integration (custom) | Scale on a single custom wafer
Groq | Custom architecture | Pure on-chip SRAM
MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT

The second table compares the standard model with the UltraSpeed mode.

Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed
Decode speed | Baseline | ~10× faster (1000+ TPS)
Price | 1× | 3×
Weight precision | Standard | FP4 MoE Experts via QAT
Decoding | Standard autoregressive | DFlash speculative decoding
Access | Standard model plans | API only, application-based trial
Token Plan | Supported | Not supported

Access, Pricing, and Open Source

UltraSpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.

Strengths and Limitations

Strengths

  • 1000+ TPS on a 1T model without custom silicon.
  • Lossless decoding through rejection sampling in DFlash.
  • FP4 applied only where tolerance is highest, preserving quality.
  • An open checkpoint lets the community test the claims.

Limitations

  • Access is gated, short, and approval-based at launch.
  • Pricing triples per token versus the standard model.
  • Acceptance length drops in open-ended conversation.
  • Independent third-party speed verification is not yet public.

Key Takeaways

  • Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
  • The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
  • FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
  • DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
  • UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.

Marktechpost’s Visual Explainer

What It Is

  • Xiaomi’s MiMo team built it with the TileRT systems group.
  • It decodes over 1000 tokens/s on a 1-trillion-parameter model.
  • Demos show generation peaks near 1200 tokens/s.
  • It runs on commodity GPUs, a single standard 8-GPU node.
  • Released June 8, 2026.

1000+ tokens / second

1T parameters (MoE)

8 commodity GPUs

Three Layers Working Together

  • FP4 quantization shrinks weights and eases bandwidth pressure.
  • DFlash speculative decoding predicts many tokens in parallel.
  • TileRT executes the whole pipeline at microsecond scale.
  • Xiaomi calls this approach extreme model-system codesign.
  • No single technique is enough; all three must align.

Layer 1 — FP4 Quantization

  • Uses the MXFP4 format to lower memory and bandwidth cost.
  • Applied selectively to the MoE Experts only.
  • Other modules keep higher precision (FP8, per TileRT).
  • Experts hold most parameters and tolerate quantization best.
  • QAT keeps capability essentially on par with the original.

Layer 2 — DFlash Speculative Decoding

  • A research-community method using block-level masked parallel prediction.
  • The draft model fills a whole block in one forward pass.
  • It uses Sliding Window Attention; block size capped at 8.
  • Rejection sampling keeps the output lossless.
Scenario | Acceptance Length
Coding | 6.30
Math / Reasoning | 5.56
Agent | 4.29

Layer 3 — TileRT Runtime

  • At 1000 TPS, each operator runs for only microseconds.
  • A Persistent Engine Kernel stays resident on the GPU.
  • Warp Specialization splits data movement, compute, and communication.
  • Small ops like RMSNorm and RoPE become bottlenecks here.
  • The runtime was co-designed with the FP4 and DFlash choices.

Where It Fits

  • Parallel reasoning: many Best-of-N or tree-search paths at once.
  • Coding agents: less wait between agent steps.
  • Real-time loops: trading signals, fraud interception, live dialogue.
  • Interactive prototyping: a Snake game in about 10 seconds.

Standard vs UltraSpeed

Dimension | MiMo-V2.5-Pro | UltraSpeed
Decode speed | Baseline | ~10× (1000+ TPS)
Price | 1× | 3×
Weights | Standard | FP4 MoE Experts (QAT)
Decoding | Autoregressive | DFlash speculative
Access | Standard plans | API only, by application

Access, Pricing & Open Source

  • API trial runs June 9 to June 23, 2026 (Beijing time).
  • Pricing is 3× the standard rate for roughly 10× speed.
  • API only; the Token Plan is not supported.
  • Checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
  • TileRT has open-sourced select modules on GitHub.

