• bitcoinBitcoin(BTC)$80,249.000.98%
  • ethereumEthereum(ETH)$2,316.861.89%
  • tetherTether(USDT)$1.000.00%
  • rippleXRP(XRP)$1.423.09%
  • binancecoinBNB(BNB)$650.862.47%
  • usd-coinUSDC(USDC)$1.00-0.01%
  • solanaSolana(SOL)$93.796.73%
  • tronTRON(TRX)$0.3515000.08%
  • Figure HelocFigure Heloc(FIGR_HELOC)$1.032.53%
  • dogecoinDogecoin(DOGE)$0.1103363.82%
  • whitebitWhiteBIT Coin(WBT)$59.241.26%
  • HyperliquidHyperliquid(HYPE)$43.953.65%
  • zcashZcash(ZEC)$615.387.63%
  • USDSUSDS(USDS)$1.00-0.02%
  • cardanoCardano(ADA)$0.2748485.12%
  • leo-tokenLEO Token(LEO)$10.29-0.75%
  • bitcoin-cashBitcoin Cash(BCH)$450.490.19%
  • chainlinkChainlink(LINK)$10.486.52%
  • moneroMonero(XMR)$410.023.42%
  • the-open-networkToncoin(TON)$2.56-6.15%
  • CantonCanton(CC)$0.1544546.54%
  • stellarStellar(XLM)$0.1655184.74%
  • litecoinLitecoin(LTC)$58.553.97%
  • MemeCoreMemeCore(M)$3.47-8.87%
  • daiDai(DAI)$1.000.00%
  • USD1USD1(USD1)$1.00-0.02%
  • avalanche-2Avalanche(AVAX)$9.975.27%
  • suiSui(SUI)$1.0711.55%
  • hedera-hashgraphHedera(HBAR)$0.0934663.96%
  • Ethena USDeEthena USDe(USDE)$1.000.00%
  • shiba-inuShiba Inu(SHIB)$0.0000062.89%
  • RainRain(RAIN)$0.0075260.14%
  • paypal-usdPayPal USD(PYUSD)$1.00-0.01%
  • crypto-com-chainCronos(CRO)$0.0712182.55%
  • Circle USYCCircle USYC(USYC)$1.120.00%
  • BittensorBittensor(TAO)$311.633.21%
  • tether-goldTether Gold(XAUT)$4,699.090.03%
  • Global DollarGlobal Dollar(USDG)$1.00-0.01%
  • BlackRock USD Institutional Digital Liquidity FundBlackRock USD Institutional Digital Liquidity Fund(BUIDL)$1.000.00%
  • World Liberty FinancialWorld Liberty Financial(WLFI)$0.0751811.53%
  • uniswapUniswap(UNI)$3.687.85%
  • polkadotPolkadot(DOT)$1.373.93%
  • mantleMantle(MNT)$0.693.86%
  • pax-goldPAX Gold(PAXG)$4,704.290.11%
  • OndoOndo(ONDO)$0.43752420.36%
  • internet-computerInternet Computer(ICP)$3.7513.79%
  • nearNEAR Protocol(NEAR)$1.574.46%
  • SkySky(SKY)$0.0819992.15%
  • AsterAster(ASTER)$0.729.29%
  • okbOKB(OKB)$88.133.58%
TradePoint.io
  • Main
  • AI & Technology
  • Stock Charts
  • Market & News
  • Business
  • Finance Tips
  • Trade Tube
  • Blog
  • Shop
No Result
View All Result
TradePoint.io
No Result
View All Result

Qwen Researchers Release Qwen3-TTS: an Open Multilingual TTS Suite with Real-Time Latency and Fine-Grained Voice Control

January 23, 2026
in AI & Technology
Reading Time: 8 mins read
A A
Qwen Researchers Release Qwen3-TTS: an Open Multilingual TTS Suite with Real-Time Latency and Fine-Grained Voice Control
ShareShareShareShareShare

Alibaba Cloud’s Qwen team has open-sourced Qwen3-TTS, a family of multilingual text-to-speech models that target three core tasks in one stack, voice clone, voice design, and high quality speech generation.

https://arxiv.org/pdf/2601.15621v1

Model family and capabilities

Qwen3-TTS uses a 12Hz speech tokenizer and 2 language model sizes, 0.6B and 1.7B, packaged into 3 main tasks. The open release exposes 5 models, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS, Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset speakers, and Qwen3-TTS-12Hz-1.7B-VoiceDesign for free form voice creation from natural language descriptions, along with the Qwen3-TTS-Tokenizer-12Hz codec.

YOU MAY ALSO LIKE

9 Best AI Tools for Spec-Driven Development in 2026: Kiro, BMAD, GSD, and More Compare

Meet GitHub Spec-Kit: An Open Source Toolkit for Spec-Driven Development with AI Coding Agents

All models support 10 languages, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. CustomVoice variants ship with 9 curated timbres, such as Vivian, a bright young Chinese female voice, Ryan, a dynamic English male voice, and Ono_Anna, a playful Japanese female voice, each with a short description that encodes timbre and speaking style.

The VoiceDesign model maps text instructions directly to new voices, for example ‘speak in a nervous teenage male voice with rising intonation’ and can then be combined with the Base model by first generating a short reference clip and reusing it via create_voice_clone_prompt.

https://arxiv.org/pdf/2601.15621v1

Architecture, tokenizer, and streaming path

Qwen3-TTS is a dual track language model, one track predicts discrete acoustic tokens from text, the other handles alignment and control signals. The system is trained on more than 5 million hours of multilingual speech in 3 pre training stages that move from general mapping, to high quality data, to long context support up to 32,768 tokens.

A key component is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, about 80 ms per token, and uses 16 quantizers with a 2048 entry codebook. On LibriSpeech test clean it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireredTTS 2 and other recent semantic tokenizers, while using a similar or lower frame rate.

The tokenizer is implemented as a pure left context streaming decoder, so it can emit waveforms as soon as enough tokens are available. With 4 tokens per packet, each streaming packet carries 320 ms of audio. The non-DiT decoder and BigVGAN free design reduces decode cost and simplifies batching.

On the language model side, the research team reports end to end streaming measurements on a single vLLM backend with torch.compile and CUDA Graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, the first packet latency is around 97 ms and 101 ms, with real time factors of 0.288 and 0.313 respectively. Even at concurrency 6, first packet latency stays around 299 ms and 333 ms.

https://arxiv.org/pdf/2601.15621v1

Alignment and control

Post training uses a staged alignment pipeline. First, Direct Preference Optimization aligns generated speech with human preferences on multilingual data. Then GSPO with rule based rewards improves stability and prosody. A final speaker fine tuning stage on the Base model yields target speaker variants while preserving the core capabilities of the general model.

Instruction following is implemented in a ChatML style format, where text instructions about style, emotion or tempo are prepended to the input. This same interface powers VoiceDesign, CustomVoice style prompts, and fine grained edits for cloned speakers.

Benchmarks, zero shot cloning, and multilingual speech

On the Seed-TTS test set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base model reaches a Word Error Rate of 0.77 on test-zh and 1.24 on test-en. The research team highlights the 1.24 WER on test-en as state of the art among the compared systems, while the Chinese WER is close to, but not lower than, the best CosyVoice 3 score.

https://arxiv.org/pdf/2601.15621v1

On a multilingual TTS test set covering 10 languages, Qwen3-TTS achieves the lowest WER in 6 languages, Chinese, English, Italian, French, Korean, and Russian, and competitive performance on the remaining 4 languages, while also obtaining the highest speaker similarity in all 10 languages compared to MiniMax-Speech and ElevenLabs Multilingual v2.

Cross-lingual evaluations show that Qwen3-TTS-12Hz-1.7B-Base reduces mixed error rate for several language pairs, such as zh-to-ko, where the error drops from 14.4 for CosyVoice3 to 4.82, about a 66 percent relative reduction.

On InstructTTSEval, the Qwen3TTS-12Hz-1.7B-VD VoiceDesign model sets new state of the art scores among open source models on Description-Speech Consistency and Response Precision in both Chinese and English, and is competitive with commercial systems like Hume and Gemini on several metrics.

Key Takeaways

  • Full open source multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite that covers 3 tasks in one stack, high quality TTS, 3 second voice cloning, and instruction based voice design across 10 languages using the 12Hz tokenizer family.
  • Efficient discrete codec and real time streaming: The Qwen3-TTS-Tokenizer-12Hz uses 16 codebooks at 12.5 frames per second, reaches strong PESQ, STOI and UTMOS scores, and supports packetized streaming with about 320 ms of audio per packet and sub 120 ms first packet latency for the 0.6B and 1.7B models in the reported setup.
  • Task specific model variants: The release offers Base models for cloning and generic TTS, CustomVoice models with 9 predefined speakers and style prompts, and a VoiceDesign model that generates new voices directly from natural language descriptions which can then be reused by the Base model.
  • Strong alignment and multilingual quality: A multi stage alignment pipeline with DPO, GSPO and speaker fine tuning gives Qwen3-TTS low word error rates and high speaker similarity, with lowest WER in 6 of 10 languages and the best speaker similarity in all 10 languages among the evaluated systems, and state of the art zero shot English cloning on Seed TTS.

Check out the Model Weights, Repo and Playground. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

The post Qwen Researchers Release Qwen3-TTS: an Open Multilingual TTS Suite with Real-Time Latency and Fine-Grained Voice Control appeared first on MarkTechPost.

Credit: Source link

ShareTweetSendSharePin

Related Posts

9 Best AI Tools for Spec-Driven Development in 2026: Kiro, BMAD, GSD, and More Compare
AI & Technology

9 Best AI Tools for Spec-Driven Development in 2026: Kiro, BMAD, GSD, and More Compare

May 9, 2026
Meet GitHub Spec-Kit: An Open Source Toolkit for Spec-Driven Development with AI Coding Agents
AI & Technology

Meet GitHub Spec-Kit: An Open Source Toolkit for Spec-Driven Development with AI Coding Agents

May 9, 2026
OpenAI Adds Chrome Extension to Codex, Letting Its AI Agent Access LinkedIn, Salesforce, Gmail, and Internal Tools via Signed-In Sessions
AI & Technology

OpenAI Adds Chrome Extension to Codex, Letting Its AI Agent Access LinkedIn, Salesforce, Gmail, and Internal Tools via Signed-In Sessions

May 8, 2026
Anthropic says it hit a  billion revenue run rate after ‘crazy’ 80x growth
AI & Technology

Anthropic says it hit a $30 billion revenue run rate after ‘crazy’ 80x growth

May 8, 2026
Next Post
Live updates: Vance blames Minnesota officials for ICE presence but concedes misconduct should be investigated – CNN

Live updates: Vance blames Minnesota officials for ICE presence but concedes misconduct should be investigated - CNN

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Search

No Result
View All Result
Mark Zuckerberg is trying to wiggle out of testifying in person at a slew of social media trials

Mark Zuckerberg is trying to wiggle out of testifying in person at a slew of social media trials

May 4, 2026
Tesla's Robotaxi Opportunity Is Dead In Light Of Waymo's Dominance

Tesla's Robotaxi Opportunity Is Dead In Light Of Waymo's Dominance

May 4, 2026
Iran, U.S. ceasefire in jeopardy after new attacks

Iran, U.S. ceasefire in jeopardy after new attacks

May 9, 2026

About

Learn more

Our Services

Legal

Privacy Policy

Terms of Use

Bloggers

Learn more

Article Links

Contact

Advertise

Ask us anything

©2020- TradePoint.io - All rights reserved!

Tradepoint.io, being just a publishing and technology platform, is not a registered broker-dealer or investment adviser. So we do not provide investment advice. Rather, brokerage services are provided to clients of Tradepoint.io by independent SEC-registered broker-dealers and members of FINRA/SIPC. Every form of investing carries some risk and past performance is not a guarantee of future results. “Tradepoint.io“, “Instant Investing” and “My Trading Tools” are registered trademarks of Apperbuild, LLC.

This website is operated by Apperbuild, LLC. We have no link to any brokerage firm and we do not provide investment advice. Every information and resource we provide is solely for the education of our readers. © 2020 Apperbuild, LLC. All rights reserved.

No Result
View All Result
  • Main
  • AI & Technology
  • Stock Charts
  • Market & News
  • Business
  • Finance Tips
  • Trade Tube
  • Blog
  • Shop

© 2023 - TradePoint.io - All Rights Reserved!