Most AI systems today work in turns: you type or speak, the model waits, processes your input, and then responds. That's the entire interaction loop. Thinking Machines Lab, an AI research lab, argues that this model of interaction is a fundamental bottleneck. To address it, the team has introduced a research preview of a new class of system it calls interaction models. The central idea of the research is that interactivity should be native to the model itself, not bolted on as an afterthought.
What’s Wrong with Turn-Based AI
If you’ve built anything with a language model or voice API, you’ve worked around the limitations of turn-based interaction. The model has no awareness of what’s happening while you’re still typing or speaking. It can’t see you pause mid-sentence, notice your camera feed, or react to something visual in real time. While the model is generating, it’s equally blind — perception freezes until it finishes or gets interrupted.
This creates a narrow channel for human-AI collaboration that limits how much of a person’s knowledge, intent, and judgment can reach the model, and how much of the model’s work can be understood.
To work around this, most real-time AI systems use a harness — a collection of separate components stitched together to simulate responsiveness. A common example is voice-activity detection (VAD), which predicts when a user has finished speaking so a turn-based model knows when to start generating. The harness is built from components that are meaningfully less intelligent than the model itself, and it precludes capabilities like proactive visual reactions, speaking while listening, or responding to cues that are never explicitly stated aloud.
Thinking Machines Lab’s argument is a version of the ‘bitter lesson’ in machine learning: hand-crafted systems will eventually be outpaced by scaling general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it smarter and a better collaborator.

The Architecture: Multi-Stream, Micro-Turn Design
The system has two components working in parallel: an interaction model that maintains constant real-time exchange with the user, and a background model that handles deeper reasoning tasks asynchronously.
The interaction model is always on — continuously taking in audio, video, and text and producing responses in real time. When a task requires sustained reasoning (tool use, web search, longer-horizon planning), it delegates to the background model by sending a rich context package containing the full conversation — not a standalone query. Results stream back as the background model produces them, and the interaction model interleaves those updates into the conversation at a moment appropriate to what the user is currently doing, rather than as an abrupt context switch. Both models share their context throughout.
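The delegation pattern described above can be sketched with Python's asyncio. This is a toy illustration, not the real system: `background_model`, `interaction_loop`, and the string-based "delegation trigger" are all hypothetical stand-ins. What it shows is the shape of the design — the live loop keeps replying while a background task streams partial results through a queue, and those results are woven into the transcript rather than blocking the exchange.

```python
import asyncio

async def background_model(context: list[str], updates: asyncio.Queue) -> None:
    # Stand-in for the background model: it receives the full conversation
    # context (not a standalone query) and streams results as it produces them.
    for step in ("searching...", "found 3 sources", "summary ready"):
        await asyncio.sleep(0)      # yield control, as real I/O would
        await updates.put(step)
    await updates.put(None)         # sentinel: background work is done

async def interaction_loop(user_chunks: list[str]) -> list[str]:
    context: list[str] = []
    updates: asyncio.Queue = asyncio.Queue()
    transcript: list[str] = []
    background = None
    for chunk in user_chunks:
        context.append(chunk)
        # Delegate sustained reasoning without pausing the live exchange.
        if "search" in chunk and background is None:
            background = asyncio.create_task(background_model(context, updates))
        await asyncio.sleep(0)      # let the background task make progress
        # Weave any ready background results into the conversation.
        while not updates.empty():
            if (update := updates.get_nowait()) is not None:
                transcript.append(f"[background] {update}")
        transcript.append(f"[reply] {chunk}")
    if background is not None:      # drain whatever is still in flight
        await background
        while not updates.empty():
            if (update := updates.get_nowait()) is not None:
                transcript.append(f"[background] {update}")
    return transcript

transcript = asyncio.run(interaction_loop(["hi", "please search for X", "thanks"]))
```

The essential property is that the reply to each user chunk never waits on the background task; results surface whenever they happen to be ready.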
Think of it like one person who keeps you engaged in conversation while a colleague in the background looks something up and passes notes forward in real time.
The key architectural decision enabling this is time-aligned micro-turns. The interaction model continuously interleaves the processing of 200ms worth of input with the generation of 200ms worth of output. Rather than consuming a complete user turn and generating a complete response, both input and output are treated as streams processed in 200ms chunks. This is what allows the model to speak while listening, react to visual cues without being prompted verbally, handle true simultaneous speech, and make tool calls and browse the web while the conversation is still in progress — weaving results back in as they arrive.
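A minimal sketch of the micro-turn idea, assuming a hypothetical `respond` callback standing in for the model: each step of the loop ingests one 200ms chunk of input and immediately produces one 200ms chunk of output, so there is no notion of a completed user turn anywhere in the loop.

```python
CHUNK_MS = 200  # each micro-turn covers 200 ms of real time

def micro_turn_stream(input_chunks, respond):
    """Interleave 200 ms of input processing with 200 ms of output generation.

    `respond` is a hypothetical stand-in for the model: given everything
    heard so far, it returns the next 200 ms worth of output (or "" to
    stay silent). Input and output are both streams of small chunks.
    """
    heard = []
    timeline = []
    for t, chunk in enumerate(input_chunks):
        heard.append(chunk)        # ingest 200 ms of input...
        out = respond(heard)       # ...and generate 200 ms of output
        timeline.append((t * CHUNK_MS, chunk, out))
    return timeline

# Toy responder: speaks the moment it hears a greeting, while input continues.
def toy_respond(heard):
    return "hello!" if "hi" in heard[-1] else ""

timeline = micro_turn_stream(["hi", "how are", "you"], toy_respond)
```

Because output is emitted per chunk rather than per turn, the same loop structure accommodates speaking while listening and reacting mid-utterance.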
Encoder-free early fusion is the specific design choice that makes multimodal processing work at this cadence. Rather than routing audio and video through large, separate pretrained encoders (like a Whisper-style ASR model or a standalone TTS decoder), the architecture uses minimal pre-processing. Audio signals are ingested as dMel (discretized mel-spectrogram) features and transformed via a lightweight embedding layer. Video frames are split into 40×40 patches encoded by an hMLP. Audio output uses a flow head for decoding. All components are co-trained from scratch together with the transformer — there is no separately pretrained encoder or decoder at any stage.
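The video side of this pre-processing reduces to non-overlapping patch tokenization. The sketch below shows only that step, in plain Python; in the real system each flattened patch would then pass through a small hMLP embedding rather than a large pretrained vision encoder. The function name and frame representation here are illustrative.

```python
PATCH = 40  # patch side length, per the described 40x40 patching

def split_into_patches(frame, patch=PATCH):
    """Split an H x W frame (list of rows of pixel values) into
    non-overlapping patch x patch tiles, row-major, each flattened
    into a single token vector.
    """
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tile = [frame[py + dy][px + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(tile)
    return patches

# An 80x120 frame yields a 2x3 grid of 40x40 patches -> 6 tokens of 1600 values.
frame = [[0] * 120 for _ in range(80)]
patches = split_into_patches(frame)
```

Keeping this step this cheap is what lets frames enter the transformer within a 200ms budget instead of waiting on a heavyweight encoder forward pass.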
On the inference side, the 200ms chunk design creates engineering challenges. Existing LLM inference libraries aren’t optimized for frequent small prefills — they carry significant per-turn overhead. Thinking Machines implemented streaming sessions, where the client sends each 200ms chunk as a separate request while the inference server appends chunks into a persistent sequence in GPU memory, avoiding repeated memory reallocations and metadata computations. They’ve upstreamed a version of this to SGLang, the open-source inference framework. Additionally, they use a gather+gemv strategy for MoE kernels instead of standard grouped gemm, following prior work from PyTorch and Cursor, to optimize for the latency-sensitive shapes required by bidirectional serving.
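The streaming-session idea can be illustrated with a small sketch. The class and method names below are hypothetical, not SGLang's actual API; the point is the memory pattern — each 200ms chunk arrives as its own request and is appended in place to one persistent pre-allocated sequence, instead of each request triggering a fresh prefill with its own allocations and metadata.

```python
class StreamingSession:
    """Illustrative model of a streaming session on the inference server."""

    def __init__(self, capacity: int = 1 << 16):
        # Pre-allocated buffer standing in for the persistent sequence
        # (KV cache) kept in GPU memory across requests.
        self._buf = [0] * capacity
        self._len = 0
        self.appends = 0  # how many chunk requests this session has served

    def append_chunk(self, tokens: list[int]) -> int:
        # Append in place: no reallocation, no rebuilt metadata —
        # only the newly arrived tokens need processing.
        end = self._len + len(tokens)
        self._buf[self._len:end] = tokens
        self._len = end
        self.appends += 1
        return self._len  # sequence length after this chunk

session = StreamingSession()
for chunk in ([1, 2, 3], [4, 5], [6]):
    total = session.append_chunk(list(chunk))
```

Three small requests extend one sequence to six tokens; a naive server would instead have re-prefixed the whole history three times.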


Benchmarks: Where It Stands
The model, named TML-Interaction-Small, is a 276B parameter Mixture-of-Experts (MoE) with 12B active parameters.
The benchmark table distinguishes between Instant models (no extended reasoning) and Thinking models (with reasoning). TML-Interaction-Small is an Instant model. Among all Instant models in the comparison, it achieves the highest score on Audio MultiChallenge APR at 43.4% — above GPT-realtime-2.0 (minimal) at 37.6%, GPT-realtime-1.5 at 34.7%, and Gemini-3.1-flash-live-preview (minimal) at 26.8%. The Thinking models, GPT-realtime-2.0 (xhigh) at 48.5% and Gemini-3.1-flash-live (high) at 36.1%, use extended reasoning to achieve their scores.
On FD-bench v1.5, which measures interaction quality across user interruption, backchanneling, talking-to-others, and background speech scenarios, TML-Interaction-Small scores 77.8 average quality — compared to 54.3 for Gemini-3.1-flash-live (minimal), 48.3 for GPT-realtime-1.5, and 47.8 for GPT-realtime-2.0 (xhigh).
On FD-bench v1 turn-taking latency, the model responds in 0.40 seconds — compared to 0.57s for Gemini, 0.59s for GPT-realtime-1.5, and 1.18s for GPT-realtime-2.0 (minimal).
On FD-bench v3, which evaluates response quality and tool use (audio + tools combined), TML-Interaction-Small (with background agent enabled) scores 82.8% Response Quality / 68.0% Pass@1 — the highest in the comparison table.


The Thinking Machines research team also introduced new internal benchmarks targeting capabilities that no existing model handles:
- TimeSpeak — Tests whether the model initiates speech at user-specified times with correct content. TML: 64.7 macro-accuracy vs. 4.3 for GPT-realtime-2.0 (minimal).
- CueSpeak — Tests whether the model responds to verbal cues at the correct moment. TML: 81.7 vs. 2.9.
- RepCount-A (adapted from an existing repetition-counting dataset) — Tests visual counting of repeated physical actions in a streaming setting. TML: 35.4 off-by-one accuracy vs. 1.3.
- ProactiveVideoQA (adapted benchmark) — Tests whether the model answers a question at the exact moment the answer becomes visually available in a streamed video. TML: 33.5 PAUC@ω=0.5 vs. 25.0 (the no-response baseline).
- Charades (adapted for temporal action localization) — The model is asked to say “start” and “stop” as an action begins and ends in a streamed video. TML: 32.4 mIoU vs. 0 for GPT-realtime-2.0 (minimal) — a clean zero.
So far, no existing model can meaningfully perform any of these tasks.
Key Takeaways
- Thinking Machines Lab’s interaction model handles real-time audio, video, and text natively — no VAD harness, no turn boundaries, no stitched components.
- The architecture splits into two models: an interaction model that stays live with the user, and a background model that handles reasoning and tool use asynchronously — sharing full conversation context throughout.
- 200ms micro-turns replace the standard request-response loop, enabling simultaneous speech, visual proactivity, and live tool calls without waiting for a user turn to end.
- On FD-bench v1.5 (interaction quality), TML-Interaction-Small scores 77.8 — versus 54.3 for Gemini and 47.8 for GPT-realtime-2.0 (xhigh) — while also leading all Instant models on Audio MultiChallenge intelligence benchmarks.
- Existing real-time APIs score near zero on time-awareness and visual proactivity benchmarks (TimeSpeak, CueSpeak, Charades, RepCount-A) — TML-Interaction-Small is the only model that can meaningfully perform these tasks today.
























