TradePoint.io

Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags

May 15, 2026
in AI & Technology

Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech (TTS) system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. Supertone positions it as a fast, accurate, multilingual TTS system that runs entirely on-device.

What Changed from v2 to v3

Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds 26 more: Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 ISO language codes in total. There is also a special na fallback code for text whose language is unknown or outside the supported set.

The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is much smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched the Voice Builder, allowing developers to create custom, edge-native TTS models from their own voice recordings.

One capability in v3 that was not present in v2 is expression tag support. Supertonic 3 accepts simple expression tags written inline in the input text. These let you embed prosodic cues directly, without a separate preprocessing step or a separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means cues such as breathing pauses or laughter can be specified in the text payload itself.

Architecture and Runtime

The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field transporting a simple distribution to a target distribution. It typically needs fewer sampling steps than diffusion models, which is why Supertonic can produce usable output in just 2 inference steps. v3 also integrates Length-Aware Rotary Position Embedding (LARoPE) to improve text-speech alignment, and it is trained with a Self-Purifying Flow Matching technique to stay robust against noisy data labels.
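
To make the few-step sampling idea concrete, here is a toy sketch of flow-matching sampling with Euler integration. It is not Supertonic's model: the velocity field below is an analytic stand-in (a straight path toward a single fixed target point) where a real system would use a learned network, and the names velocity and euler_sample are hypothetical.

```python
def velocity(x, t, target):
    """Analytic velocity for a straight-line path from the current point to `target`.

    Assumption: the target is one fixed point, so the field is known in closed
    form. A real flow-matching model would predict this with a neural network.
    """
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate the velocity field from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # t stays below 1, so the division in velocity() is safe
        v = velocity(x, t, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Two Euler steps already land on the target because the path is straight.
    print(euler_sample([1.0, -2.0], [0.5, 3.0], num_steps=2))
```

Because the path in this toy case is exactly straight, two steps reach the target. With a learned field and a real data distribution the paths are only approximately straight, so a low step count is an approximation whose quality depends on the model.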

On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on A100 GPU, and uses substantially less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.

Reading Accuracy

Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard measures of TTS reading accuracy: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER.

The system's efficiency shows most clearly on extreme edge hardware: it achieves an average real-time factor (RTF) of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. The ecosystem has also expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation uses onnxruntime-web for pure client-side execution.
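
The WER and CER metrics described above can be sketched in a few lines. This is a minimal illustration using a plain Levenshtein distance, not the evaluation code behind the reported numbers; the function names are hypothetical, and ignoring spaces in CER is an assumption (conventions vary).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # `reference` is the input text; `hypothesis` is the ASR transcript of the audio.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("こんにちは", "こんにちわ"))
```

A skipped word shows up as a deletion and a repeated word as an insertion, so the repeat and skip failures mentioned earlier both raise these scores.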

Text Normalization

A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms without any preprocessing pipeline or phonetic annotations: financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. For example, “$5.2M” must be read as “five point two million dollars” and “$450K” as “four hundred fifty thousand dollars”; “2.3h” must be read as “two point three hours” and “30kph” as “thirty kilometers per hour.” All four competing systems failed both of these categories. The systems evaluated were ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
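
To show what the financial-expression case demands, the sketch below spells out tokens like $5.2M the way the article says they must be read. It is only an illustration of the expected output, not Supertonic's normalizer (which is built into the model pipeline); read_money and its helpers are hypothetical names, and the sketch covers only the pattern of a dollar sign, up to three digits, an optional decimal part, and a K/M/B suffix.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = {"K": "thousand", "M": "million", "B": "billion"}


def int_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + int_to_words(rest) if rest else "")


def read_money(token):
    """Expand tokens such as "$5.2M" or "$450K" into their spoken form."""
    match = re.fullmatch(r"\$(\d{1,3})(?:\.(\d+))?([KMB])", token)
    if match is None:
        raise ValueError(f"unsupported token: {token!r}")
    whole, frac, scale = match.groups()
    words = int_to_words(int(whole))
    if frac:
        # Decimal digits are read one at a time: ".25" -> "point two five".
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return f"{words} {SCALES[scale]} dollars"


if __name__ == "__main__":
    print(read_money("$5.2M"))  # five point two million dollars
    print(read_money("$450K"))  # four hundred fifty thousand dollars
```

Note that the currency word moves from the front of the token to the end of the phrase and the suffix expands to a scale word, which is the kind of reordering a character-by-character reader gets wrong.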

https://github.com/supertone-inc/supertonic

Getting Started

Install the Python SDK with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:

from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
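
Building on the minimal example above, a small helper can synthesize several texts in different languages with one voice. This is a sketch that relies only on the three calls shown above (get_voice_style, synthesize, save_audio) with the same arguments; the helper name synthesize_batch and the output file names are hypothetical, and reusing one voice style across languages is an assumption.

```python
def synthesize_batch(tts, items, voice_name="M1"):
    """Synthesize each (text, lang, out_path) item and save it as a WAV file.

    `tts` is expected to behave like the TTS object in the example above.
    Returns a list of (out_path, duration_in_seconds) pairs.
    """
    style = tts.get_voice_style(voice_name=voice_name)
    results = []
    for text, lang, out_path in items:
        wav, duration = tts.synthesize(text, voice_style=style, lang=lang)
        tts.save_audio(wav, out_path)
        results.append((out_path, duration))
    return results


# Usage, assuming the SDK is installed and assets can be downloaded:
#
#   from supertonic import TTS
#   tts = TTS(auto_download=True)
#   jobs = [
#       ("A gentle breeze moved through the open window.", "en", "breeze_en.wav"),
#       ("Une brise légère traversait la fenêtre ouverte.", "fr", "breeze_fr.wav"),
#   ]
#   for path, seconds in synthesize_batch(tts, jobs):
#       print(f"{path}: {seconds:.2f}s")
```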

Marktechpost’s Visual Explainer

Supertonic 3 — Developer Guide

Overview

Supertonic 3: On-Device TTS, Now in 31 Languages

Supertonic 3 is a lightweight, open-weight text-to-speech system by Supertone Inc. It runs entirely via ONNX Runtime on your device — no cloud, no API call, no data leaving your machine. v3 expands from 5 to 31 languages, adds expressive tags, reduces reading failures, and stays compatible with the v2 ONNX interface.

  • 31 languages
  • ~99M parameters
  • 404 MB of ONNX assets
  • MIT code license

What’s New in v3

Four Core Improvements Over Supertonic 2

Version 3 is a focused upgrade — same inference contract, meaningfully better output.

  • 🌐 31 languages: Expanded from the 5-language v2 release (en, ko, es, pt, fr). Now includes Japanese, Arabic, German, Hindi, Russian, Turkish, Vietnamese, and 19 more ISO codes, plus a special na fallback for unknown languages.
  • ✅ More stable reading: Fewer repeat and skip failures, especially on short and long utterances. This was a known limitation in v2 that v3 directly addresses.
  • 🎭 Expression tags: Supports simple expression tags inline in text, without any separate preprocessing or external model.
  • 🔊 Higher speaker similarity: Improved similarity across the shared-language set compared with Supertonic 2. Voices are more consistent across languages.

Installation

Get Running in Under a Minute

Install the Python SDK via pip. On first run, model assets are downloaded automatically from Hugging Face — no manual setup required.

pip install supertonic

Quick Start

Basic Python Usage

The SDK auto-downloads model assets on first run. Specify a voice, pass your text with a language code, and save the WAV output.

from supertonic import TTS

# Auto-downloads ONNX assets on first run
tts = TTS(auto_download=True)

# Select a preset voice (M1 to M5 male, F1 to F5 female)
style = tts.get_voice_style(voice_name="M1")

text = "A gentle breeze moved through the open window."

# synthesize() returns (wav_array, duration_in_seconds)
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")

# Expression tags go inline in the text; the tag belongs at the gap after "it"
text = "I can't believe it  that actually worked!"
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

Languages

31 Supported Languages + na Fallback

All 31 languages share the same model architecture and ONNX inference pipeline. Use the na code for text whose language is unknown or outside the supported set.

en English

ko Korean

ja Japanese

ar Arabic

bg Bulgarian

cs Czech

da Danish

de German

el Greek

es Spanish

et Estonian

fi Finnish

fr French

hi Hindi

hr Croatian

hu Hungarian

id Indonesian

it Italian

lt Lithuanian

lv Latvian

nl Dutch

pl Polish

pt Portuguese

ro Romanian

ru Russian

sk Slovak

sl Slovenian

sv Swedish

tr Turkish

uk Ukrainian

vi Vietnamese
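
The language list and the na fallback can be combined into a small guard that picks a code before calling the synthesizer. This is a sketch, not part of the SDK: resolve_lang is a hypothetical helper, and reducing locale tags such as pt-BR to their primary subtag is an assumption about how callers might want to handle them.

```python
# The 31 language codes listed above, plus the documented fallback code.
SUPPORTED_LANGS = frozenset({
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et",
    "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl",
    "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi",
})
FALLBACK_LANG = "na"


def resolve_lang(code):
    """Return a supported language code, or "na" when the language is unknown."""
    if not code:
        return FALLBACK_LANG
    # Assumption: "pt-BR" or "pt_BR" should map to the primary subtag "pt".
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    return primary if primary in SUPPORTED_LANGS else FALLBACK_LANG


if __name__ == "__main__":
    for candidate in ["en", "pt-BR", "zh", None]:
        print(candidate, "->", resolve_lang(candidate))
```

The resolved value can then be passed as the lang argument shown in the Quick Start example.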

Text Normalization

Handles Complex Inputs Without Pre-Processing

Supertonic 3 reads financial expressions, dates, phone numbers, and technical units correctly out of the box — no G2P module or phonetic annotations required. Below: Supertonic vs. four major commercial/open-source systems.

Category             | Input Example           | Supertonic 3 | ElevenLabs / OpenAI / Gemini / Microsoft
Financial Expression | $5.2M / $450K           | ✓            | ✗ All four failed
Time & Date          | 4:45 PM, Wed Apr 3      | ✓            | ✗ All four failed
Phone Number         | (212) 555-0142 ext. 402 | ✓            | ✗ All four failed
Technical Unit       | 2.3h at 30kph           | ✓            | ✗ All four failed

Deployment & Resources

Runs Everywhere — 11 Platforms, No GPU Required

The public ONNX assets run on CPU in fixed-voice mode with no GPU dependency. Browser support is via WebGPU and WASM through onnxruntime-web. Audio output is 16-bit WAV; batch inference is supported.

  • 🐍 Python: ONNX Runtime
  • 🟨 Node.js: server-side JS
  • 🌐 Browser: WebGPU / WASM
  • ☕ Java: JVM
  • ⚙️ C++: high-performance
  • 🔷 C#: .NET
  • 🔵 Go: Go runtime
  • 🍎 Swift / iOS: native
  • 🦀 Rust: systems
  • 💙 Flutter: cross-platform

📄 Code license: MIT

🤖 Model license: OpenRAIL-M
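
For readers handling raw samples themselves, the sketch below shows what "16-bit WAV" output amounts to, using only the Python standard library. It is an illustration rather than the SDK's save_audio implementation: write_wav_16bit is a hypothetical helper, it assumes mono float samples in the range -1.0 to 1.0, and the sample rate in the usage example is a placeholder, not a documented Supertonic value.

```python
import math
import struct
import wave


def write_wav_16bit(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


if __name__ == "__main__":
    rate = 16000  # placeholder sample rate for the demo tone
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    write_wav_16bit("tone.wav", tone, rate)
```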

Key Takeaways

  • Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
  • New in v3: inline expression tags, more stable reading on short and long utterances, and improved speaker similarity vs. v2
  • v2-compatible public ONNX interface — existing integrations upgrade without changing inference code
  • Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being substantially smaller
  • v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
  • Audio is output natively as 16-bit WAV files
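
Since v3 throughput numbers are not published, readers may want to measure real-time factor (RTF) on their own hardware. RTF is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time; "167× faster than real time" corresponds to an RTF of about 0.006, and an RTF of 0.3 to roughly 3.3× real time. The sketch below is a generic timing harness: real_time_factor and speedup are hypothetical helpers, and the callable passed in is assumed to return the audio duration in seconds.

```python
import time


def real_time_factor(synthesize, text):
    """Time one call and return elapsed_seconds / audio_seconds.

    `synthesize` is any callable that takes text and returns the duration
    (in seconds) of the audio it generated.
    """
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed / audio_seconds


def speedup(rtf):
    """Convert an RTF value into an 'x times faster than real time' figure."""
    return 1.0 / rtf


# Usage with the SDK calls shown earlier (wrapper written for illustration):
#
#   def run(text):
#       wav, duration = tts.synthesize(text, voice_style=style, lang="en")
#       return duration
#
#   print(real_time_factor(run, "A gentle breeze moved through the open window."))
```

Averaging over several texts, and discarding the first call so that model loading is not counted, gives a steadier number than a single measurement.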
