Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Simultaneous interpretation is one of the harder problems in applied AI. You’re asking a model to translate speech before the speaker has finished a sentence. Every extra second of delay breaks the illusion of real-time communication. Alibaba’s Qwen team has been chipping away at this with each release. Their latest model, Qwen3.5-LiveTranslate-Flash, brings that latency down to 2.8 seconds and expands input language coverage to 60 languages.

https://qwen.ai/blog?id=qwen3.5-livetranslate

A Meaningful Jump From the Previous Release

The previous release, Qwen3-LiveTranslate-Flash, handled 18 input languages at roughly three seconds of latency. Qwen3.5-LiveTranslate-Flash brings that down to 2.8 seconds, expands input coverage to 60 languages, and adds speech output in 29 languages. That’s more than a 3× expansion in language coverage on the input side. For developers building multilingual products, this reduces the need for per-language model switching in most global enterprise scenarios.

The latency improvement comes from a technique for processing what the team calls ‘reading units.’ Rather than waiting for a full sentence to arrive before producing output, the model decides when enough meaning has accumulated in a segment to commit to a translation. It streams output continuously while the speaker is still talking. This is the same underlying logic as semantic unit prediction but with a tighter implementation that shaves off that extra 200 milliseconds.
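
As a rough illustration of the commit-early idea (not the Qwen team's implementation, which the post does not describe in detail), the sketch below buffers incoming words and emits a segment as soon as a simple boundary rule fires. The function name `commit_units`, the punctuation rule, and the `max_words` cutoff are all assumptions made up for this example; the real model decides boundaries from meaning, not punctuation.

```python
from typing import Iterable, Iterator, List

# Toy stand-in for a semantic boundary (assumption for illustration only)
BOUNDARY_MARKS = (",", ".", "?", "!", ";")

def commit_units(words: Iterable[str], max_words: int = 6) -> Iterator[str]:
    """Yield a segment as soon as it looks complete, instead of waiting
    for the whole sentence to arrive."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        hit_boundary = word.endswith(BOUNDARY_MARKS)
        too_long = len(buffer) >= max_words
        if hit_boundary or too_long:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the speaker stops
        yield " ".join(buffer)

stream = "After the merger closes, we will move the team to Berlin next spring.".split()
segments = list(commit_units(stream))
print(segments)
```

Each yielded segment could be handed to a translator while later words are still arriving, which is where the latency saving would come from in a design like this.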

Vision Is Now a First-Class Input

Most translation systems treat audio as the only input signal. That works fine in clean studio conditions. It breaks down in a crowded conference room, a noisy trade floor, or anywhere with overlapping voices and bad acoustics.

Qwen3.5-LiveTranslate-Flash takes a different approach. It analyzes visual information in parallel with audio: on-screen text, physically shown objects, lip movements, and gestures. When a word is phonetically ambiguous or the audio stream degrades, the visual context fills the gap and sharpens the translation decision. This is not a minor feature. In real-world deployment, audio quality is rarely guaranteed. Having a vision channel means the model handles the messy reality of live interpretation more gracefully than audio-only systems.
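
To make the "visual context fills the gap" idea concrete, here is a toy sketch. The model fuses audio and vision internally and does not expose anything like this, so the function `pick_with_visual_context`, the score values, and the `boost` parameter are purely hypothetical. It only shows how a term visible on a slide could tip the choice between two similar-sounding candidates.

```python
from typing import Dict, Set

def pick_with_visual_context(
    audio_scores: Dict[str, float],
    visual_terms: Set[str],
    boost: float = 0.2,
) -> str:
    """Return the best candidate after boosting any candidate that also
    appears in the visual context (for example, on-screen text)."""
    def adjusted(candidate: str) -> float:
        bonus = boost if candidate.lower() in visual_terms else 0.0
        return audio_scores[candidate] + bonus
    return max(audio_scores, key=adjusted)

# Noisy audio slightly prefers "fifty", but the slide shows "fifteen"
audio_scores = {"fifteen": 0.48, "fifty": 0.52}
print(pick_with_visual_context(audio_scores, {"fifteen"}))
```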

Voice Cloning Happens in Real Time

This is the part that stands out most in the Qwen3.5 release. Standard translation systems replace the speaker’s voice with a generic synthesis voice. Qwen3.5-LiveTranslate-Flash instead clones the characteristic voice features of the original speaker during the translation itself. A single spoken sentence is enough for the model to perform this acoustic adaptation.

For listeners on the receiving end, the translated output sounds like the same person speaking the target language and not a robotic substitute. In live conference interpretation, multilingual livestreams, or international customer calls, this is important. The experience feels noticeably more human than what current systems deliver.

Configure Domain-Specific Keywords

One persistent failure mode for translation models in professional settings is proper nouns and specialized vocabulary. A model translating a medical briefing might consistently mistranslate a drug name. A legal interpretation session breaks down over a technical statute term.

Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can inject a glossary of brand names, medical terms, legal terminology, or technical vocabulary, and the model handles those terms significantly more reliably. This isn’t available in most general-purpose translation APIs and it closes a real gap for domain-specific enterprise deployments.

Benchmark Performance

On FLEURS and CoVoST2 — two established benchmarks for multilingual speech translation — Qwen3.5-LiveTranslate-Flash outperforms major commercial alternatives. FLEURS tests translation quality across a wide variety of language pairs under real acoustic conditions. CoVoST2 covers 21 translation directions from speech, making it a practical proxy for multilingual pipeline performance.

Marktechpost’s Visual Explainer

What it does

Qwen3.5-LiveTranslate-Flash at a glance

Qwen3.5-LiveTranslate-Flash is an API-only, closed-weight real-time translation model from Alibaba’s Qwen team. It takes audio and video frames as simultaneous inputs and outputs translated text and speech. The model uses a WebSocket-based protocol over Alibaba Cloud Model Studio.

Latency: 2.8s (per token to audio out)
Input languages: 60 (speech + visual input)
Speech output: 29 (languages with voice)
Protocol: WebSocket (persistent connection)

✓ Vision-enhanced comprehension: lip movements, gestures, and on-screen text all feed into the translation decision alongside audio
◆ Real-time voice cloning: clones the original speaker’s voice profile in the translated output from a single spoken sentence
◆ Semantic unit prediction: commits to output segments before a full sentence ends, enabling continuous streaming without waiting for complete utterances
◆ Dynamic keyword configuration: inject domain-specific glossaries at runtime for technical, medical, or legal terminology

Before you start

Prerequisites

You need an Alibaba Cloud account with Model Studio access and a valid DashScope API key. The model is available through the qwen3-livetranslate-flash-realtime model ID.

Create an Alibaba Cloud account

Get your DashScope API key

Navigate to Model Studio → API Keys. Generate a key and store it as the environment variable DASHSCOPE_API_KEY. Never hardcode it in source files.

Install the Python dependency

Install the websocket-client package for the WebSocket connection. For audio capture, also install pyaudio.

Check your audio setup

The model accepts 16kHz, 16-bit PCM mono audio on input. Confirm your microphone or audio source can output in this format before connecting.

BASH

# Install dependencies
pip install websocket-client pyaudio

# Set your API key as an environment variable
export DASHSCOPE_API_KEY="your_key_here"
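
For the audio check in the prerequisites, a pre-recorded WAV file can be verified with Python's standard `wave` module before any connection is made. This is a minimal sketch; the helper name `check_wav_format` is made up for this example, and it only covers WAV files, not live microphone devices.

```python
import wave

def check_wav_format(path: str) -> list:
    """Return a list of problems. An empty list means the file is
    16 kHz, 16-bit, mono PCM, the input format described above."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"sample width is {wav.getsampwidth() * 8} bits, expected 16")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    return problems

# Usage (the path is a placeholder):
#   issues = check_wav_format("meeting.wav")
#   if issues: print("Resample first:", issues)
```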

Step 3 — Connection

Establish the WebSocket connection

The model uses the WebSocket protocol for a persistent, bidirectional connection. You authenticate via a Bearer token in the connection header using your DashScope API key.

PYTHON

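import json, os
import websocket

API_KEY = os.getenv("DASHSCOPE_API_KEY")
if not API_KEY:
    # Fail early with a clear message instead of a TypeError when building the header
    raise SystemExit("Set the DASHSCOPE_API_KEY environment variable first")

API_URL = (
    "wss://dashscope-intl.aliyuncs.com"
    "/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)

def on_open(ws):
    print("Connected to Qwen3.5-LiveTranslate-Flash")

def on_message(ws, message):
    # Every server event arrives as a JSON string
    data = json.loads(message)
    print("Translation event:", data)

def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)
ws.run_forever()  # blocks and dispatches callbacks until the socket closes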

ⓘ

The connection stays open for the full session. You do not reconnect per utterance. Send audio chunks and image frames continuously over the same socket.
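
Because audio chunks and image frames share one socket, it can help to route every outgoing event through a single small wrapper. The sketch below is one way to do that; the class name `SessionSender` and its `send_event` method are made up for this example, and `ws` can be any object with a `send(str)` method, such as the `WebSocketApp` from the connection step.

```python
import json
import threading
from typing import Any, Dict

class SessionSender:
    """Serialize events to JSON and send them over one shared socket.

    The lock keeps events from the audio and video threads from being
    written at the same time.
    """

    def __init__(self, ws: Any) -> None:
        self._ws = ws
        self._lock = threading.Lock()
        self.sent_count = 0

    def send_event(self, event_type: str, **fields: Any) -> None:
        payload: Dict[str, Any] = {"type": event_type, **fields}
        with self._lock:
            self._ws.send(json.dumps(payload))
            self.sent_count += 1

# Usage (ws comes from the connection step):
#   sender = SessionSender(ws)
#   sender.send_event("input_audio_buffer.append", audio=audio_b64)
```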

Step 4 — Audio streaming

Configure and stream audio input

After connecting, send a session configuration event to set the source and target languages. Then stream PCM audio chunks continuously. The model uses session.input_audio_transcription.language to identify the input language.

PYTHON

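import base64, json, threading
import pyaudio

# Audio input config: 16kHz, 16-bit PCM mono
INPUT_RATE     = 16000
INPUT_CHUNK    = 1600  # 1600 frames at 16kHz = 100ms per chunk
INPUT_FORMAT   = pyaudio.paInt16
INPUT_CHANNELS = 1

def stream_microphone(ws):
    # Runs in its own thread so the WebSocket callback thread stays
    # free to receive translation events in on_message.
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=INPUT_RATE, channels=INPUT_CHANNELS,
        format=INPUT_FORMAT, input=True,
        frames_per_buffer=INPUT_CHUNK
    )
    try:
        while True:
            chunk = stream.read(INPUT_CHUNK)
            audio_b64 = base64.b64encode(chunk).decode()
            ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": audio_b64
            }))
    finally:
        # Release the audio device when the loop ends or the socket closes
        stream.stop_stream()
        stream.close()
        pa.terminate()

def on_open(ws):
    # 1. Send session config first
    session_cfg = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "language": "zh"  # source: Chinese
            },
            "translation": {
                "target_language": "en"  # target: English
            }
        }
    }
    ws.send(json.dumps(session_cfg))

    # 2. Stream microphone audio from a background thread.
    # Started here for brevity; in production, start it only after the
    # server's session confirmation event arrives in on_message.
    threading.Thread(
        target=stream_microphone,
        args=(ws,), daemon=True
    ).start()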

⚠

Do not send audio before the session.update event is acknowledged. Wait for the server’s session confirmation event before streaming audio chunks.
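
One way to enforce that ordering is a `threading.Event` that the message handler sets and the audio thread waits on. This is a sketch under an assumption: the post does not name the confirmation event, so `"session.updated"` below is a guess that should be checked against the Model Studio documentation. The helper names are also made up for this example.

```python
import json
import threading

# Assumption: the server's confirmation event type. Verify in the official docs.
CONFIRMATION_EVENT = "session.updated"

session_ready = threading.Event()

def handle_message(message: str) -> None:
    """Call this from on_message. Opens the gate once the session is confirmed."""
    event = json.loads(message)
    if event.get("type") == CONFIRMATION_EVENT:
        session_ready.set()

def wait_until_ready(timeout: float = 10.0) -> bool:
    """Block the audio thread until confirmation arrives.
    Returns False if the timeout passes first."""
    return session_ready.wait(timeout)

# Usage inside the audio thread, before the first chunk is sent:
#   if not wait_until_ready():
#       raise RuntimeError("session was never confirmed")
```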

Step 5 — Vision input

Send video frames for vision-enhanced comprehension

Qwen3.5-LiveTranslate-Flash reads lip movements, gestures, and on-screen text from video frames alongside audio. Send base64-encoded JPEG frames at a regular interval during the session. Even a low frame rate significantly improves accuracy in noisy audio conditions.

PYTHON

import base64, json, threading, time
import cv2

def stream_video_frames(ws):
    cap = cv2.VideoCapture(0)  # 0 = default camera
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Encode frame as JPEG → base64
            ok, buf = cv2.imencode(".jpg", frame)
            if not ok:
                continue  # skip frames that fail to encode
            img_b64 = base64.b64encode(buf).decode()
            ws.send(json.dumps({
                "type": "input_image_buffer.append",
                "image": img_b64
            }))
            time.sleep(0.5)  # ~2fps is sufficient
    finally:
        cap.release()  # free the camera when streaming stops

# Run video streaming in a separate thread.
# Start it once the socket is open (for example, from on_open).
threading.Thread(
    target=stream_video_frames,
    args=(ws,), daemon=True
).start()
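
If frames come from a source that sets its own pace, such as a video file or a capture callback, a small time-based throttle can hold the send rate near 2 fps without sleeping in the capture loop. This is a hypothetical helper, not part of any SDK; the class name `FrameThrottle` and the 0.5-second default are assumptions taken from the example above.

```python
from typing import Optional

class FrameThrottle:
    """Decide whether enough time has passed to send another frame."""

    def __init__(self, interval_s: float = 0.5) -> None:
        self.interval_s = interval_s
        self._last_sent: Optional[float] = None

    def should_send(self, now: float) -> bool:
        # Always send the first frame, then one per interval
        if self._last_sent is None or now - self._last_sent >= self.interval_s:
            self._last_sent = now
            return True
        return False

# Usage inside a capture loop:
#   throttle = FrameThrottle()
#   if throttle.should_send(time.monotonic()):
#       ...encode and send the frame...
```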

ⓘ

Vision input is optional but recommended for live human speech scenarios. For pre-recorded audio files without a camera feed, you can omit image frames entirely and rely on audio alone.
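
For the pre-recorded case, raw PCM can be cut into the same 100 ms chunks used for microphone input and wrapped in `input_audio_buffer.append` events. The sketch below assumes the audio is already 16 kHz, 16-bit mono PCM bytes; the function name `iter_audio_events` is made up for this example.

```python
import base64
import json
from typing import Iterator

def iter_audio_events(
    pcm: bytes,
    sample_rate: int = 16000,
    chunk_ms: int = 100,
    sample_width: int = 2,  # bytes per sample (16-bit)
) -> Iterator[str]:
    """Split mono PCM bytes into fixed-duration chunks and wrap each one
    as an input_audio_buffer.append event (a JSON string)."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * sample_width
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })

one_second = bytes(16000 * 2)  # one second of silence
events = list(iter_audio_events(one_second))
print(len(events))
```

Whether the service expects file audio to be paced in real time is not stated in the post, so adding a `time.sleep(chunk_ms / 1000)` between sends is a cautious default rather than a documented requirement.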

Step 6 — Domain accuracy

Dynamic keyword configuration

For technical, medical, legal, or brand-specific vocabulary, you can inject a keyword glossary at session start. The model uses this list to significantly improve translation reliability for terms that standard training data may handle inconsistently.

PYTHON

# Add to your session.update payload
session_cfg = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "language": "zh"
        },
        "translation": {
            "target_language": "en"
        },
        # Inject domain keywords here
        "keywords": [
            {"source": "达芬奇机器人",  "target": "da Vinci Surgical System"},
            {"source": "腹腔镜",      "target": "laparoscope"},
            {"source": "实体瘤",      "target": "solid tumor"}
        ]
    }
}
ws.send(json.dumps(session_cfg))

✓ Works for brand names, drug names, legal statutes, and technical model numbers
✓ Keywords are scoped to the session and do not persist across connections
◆ Keep the list focused: include only terms where mistranslation would cause real errors
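
If the glossary lives in a plain dictionary or a spreadsheet export, a small helper can turn it into the `keywords` list shape shown above and attach it to a session payload. This is a sketch; the helper names `build_keywords` and `with_keywords` are made up for this example, and the payload fields are the ones from the explainer.

```python
from typing import Dict, List

def build_keywords(glossary: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn {source_term: target_term} into the keywords list shape,
    skipping entries where either side is blank."""
    keywords = []
    for source, target in glossary.items():
        source, target = source.strip(), target.strip()
        if source and target:
            keywords.append({"source": source, "target": target})
    return keywords

def with_keywords(session_cfg: dict, glossary: Dict[str, str]) -> dict:
    """Return a copy of a session.update payload with keywords attached.
    The original payload is left unchanged."""
    updated = {**session_cfg, "session": {**session_cfg["session"]}}
    updated["session"]["keywords"] = build_keywords(glossary)
    return updated

base_cfg = {
    "type": "session.update",
    "session": {"translation": {"target_language": "en"}},
}
medical_cfg = with_keywords(base_cfg, {"腹腔镜": "laparoscope", "实体瘤": "solid tumor"})
```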

Reference

Supported languages

Qwen3.5-LiveTranslate-Flash understands 60 input languages and can produce speech output in 29 languages. All languages listed below are supported as input; speech output is confirmed for a subset of them.

Chinese, English, French, German, Spanish, Japanese, Korean, Russian, Portuguese, Italian, Arabic, Hindi, Turkish, Indonesian, Thai, Vietnamese, Greek, Mandarin, Cantonese, Wu dialect, Sichuanese, Tianjin dialect, Beijing dialect, + 37 more

ⓘ

Not every input language has confirmed speech (audio) output support. Verify your specific target language pair in the Alibaba Cloud Model Studio documentation before building audio-output pipelines.

⚠

The model supports text output for all 60 input languages. Speech output is available for 29 languages only. If your pipeline requires audio delivery and your target language is not in the confirmed list, plan for a fallback TTS step.
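
A pipeline can make that fallback decision explicit at session setup. In the sketch below, the set of speech-capable language codes is a placeholder for illustration only; the post does not list the 29 confirmed languages, so it should be filled in from the official documentation. The function name and return labels are also made up for this example.

```python
from typing import Set

# Placeholder only: replace with the confirmed speech output list from the docs
SPEECH_OUTPUT_LANGS: Set[str] = {"en", "zh"}

def plan_output(target_language: str,
                speech_langs: Set[str] = SPEECH_OUTPUT_LANGS) -> str:
    """Choose between the model's own speech output and a text-only
    session followed by a separate TTS step."""
    if target_language in speech_langs:
        return "model_speech"
    return "text_plus_fallback_tts"

print(plan_output("en"))
print(plan_output("xx"))  # "xx" is a made-up code standing in for an unsupported language
```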

Key Takeaways

Qwen3.5-LiveTranslate-Flash delivers real-time multimodal interpretation across 60 input languages and 29 speech output languages at 2.8 seconds of latency.
The model uses vision-enhanced comprehension — reading lip movements, gestures, and on-screen text — to maintain accuracy in noisy or degraded audio environments.
Real-time voice cloning replicates the original speaker’s voice profile in the translated output using just a single spoken sentence.
Semantic unit prediction via “reading units” processing enables continuous streaming output without waiting for full sentences, reducing latency to 2.8 seconds.
Dynamic keyword configuration allows developers to inject domain-specific glossaries at runtime, improving translation reliability for technical, medical, and legal terminology.
