Most AI systems today work in turns: you type or speak, the model waits, processes your input, and then responds. That's the entire interaction loop. Thinking Machines Lab, an AI research lab, argues that this model of interaction is a fundamental bottleneck. To address it, the team has introduced a research preview of a new class of system it calls interaction models. The central idea of the research is that interactivity should be native to the model itself, not bolted on as an afterthought.
What’s Wrong with Turn-Based AI
If you’ve built anything with a language model or voice API, you’ve worked around the limitations of turn-based interaction. The model has no awareness of what’s happening while you’re still typing or speaking. It can’t see you pause mid-sentence, notice your camera feed, or react to something visual in real time. While the model is generating, it’s equally blind — perception freezes until it finishes or gets interrupted.
This creates a narrow channel for human-AI collaboration that limits how much of a person’s knowledge, intent, and judgment can reach the model, and how much of the model’s work can be understood.
To work around this, most real-time AI systems use a harness — a collection of separate components stitched together to simulate responsiveness. A common example is voice-activity detection (VAD), which predicts when a user has finished speaking so a turn-based model knows when to start generating. The harness is built from components that are meaningfully less intelligent than the model itself, and it precludes capabilities like proactive visual reactions, speaking while listening, or responding to cues that are never explicitly stated aloud.
Thinking Machines Lab’s argument is a version of the ‘bitter lesson’ in machine learning: hand-crafted systems will eventually be outpaced by scaling general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it smarter and a better collaborator.

The Architecture: Multi-Stream, Micro-Turn Design
The system has two components working in parallel: an interaction model that maintains constant real-time exchange with the user, and a background model that handles deeper reasoning tasks asynchronously.
The interaction model is always on — continuously taking in audio, video, and text and producing responses in real time. When a task requires sustained reasoning (tool use, web search, longer-horizon planning), it delegates to the background model by sending a rich context package containing the full conversation — not a standalone query. Results stream back as the background model produces them, and the interaction model interleaves those updates into the conversation at a moment appropriate to what the user is currently doing, rather than as an abrupt context switch. Both models share their context throughout.
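The delegation pattern described above can be sketched with Python's asyncio. This is a toy illustration, not the real system: `background_model`, `interaction_loop`, and the string-based "delegation trigger" are all hypothetical stand-ins. What it shows is the shape of the design — the live loop keeps replying while a background task streams partial results through a queue, and those results are woven into the transcript rather than blocking the exchange.

```python
import asyncio

async def background_model(context: list[str], updates: asyncio.Queue) -> None:
    # Stand-in for the background model: it receives the full conversation
    # context (not a standalone query) and streams results as it produces them.
    for step in ("searching...", "found 3 sources", "summary ready"):
        await asyncio.sleep(0)      # yield control, as real I/O would
        await updates.put(step)
    await updates.put(None)         # sentinel: background work is done

async def interaction_loop(user_chunks: list[str]) -> list[str]:
    context: list[str] = []
    updates: asyncio.Queue = asyncio.Queue()
    transcript: list[str] = []
    background = None
    for chunk in user_chunks:
        context.append(chunk)
        # Delegate sustained reasoning without pausing the live exchange.
        if "search" in chunk and background is None:
            background = asyncio.create_task(background_model(context, updates))
        await asyncio.sleep(0)      # let the background task make progress
        # Weave any ready background results into the conversation.
        while not updates.empty():
            if (update := updates.get_nowait()) is not None:
                transcript.append(f"[background] {update}")
        transcript.append(f"[reply] {chunk}")
    if background is not None:      # drain whatever is still in flight
        await background
        while not updates.empty():
            if (update := updates.get_nowait()) is not None:
                transcript.append(f"[background] {update}")
    return transcript

transcript = asyncio.run(interaction_loop(["hi", "please search for X", "thanks"]))
```

The essential property is that the reply to each user chunk never waits on the background task; results surface whenever they happen to be ready.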
Think of it like one person who keeps you engaged in conversation while a colleague in the background looks something up and passes notes forward in real time.
The key architectural decision enabling this is time-aligned micro-turns. The interaction model continuously interleaves the processing of 200ms worth of input with the generation of 200ms worth of output. Rather than consuming a complete user turn and generating a complete response, both input and output are treated as streams processed in 200ms chunks. This is what allows the model to speak while listening, react to visual cues without being prompted verbally, handle true simultaneous speech, and make tool calls and browse the web while the conversation is still in progress — weaving results back in as they arrive.
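A minimal sketch of the micro-turn idea, assuming a hypothetical `respond` callback standing in for the model: each step of the loop ingests one 200ms chunk of input and immediately produces one 200ms chunk of output, so there is no notion of a completed user turn anywhere in the loop.

```python
CHUNK_MS = 200  # each micro-turn covers 200 ms of real time

def micro_turn_stream(input_chunks, respond):
    """Interleave 200 ms of input processing with 200 ms of output generation.

    `respond` is a hypothetical stand-in for the model: given everything
    heard so far, it returns the next 200 ms worth of output (or "" to
    stay silent). Input and output are both streams of small chunks.
    """
    heard = []
    timeline = []
    for t, chunk in enumerate(input_chunks):
        heard.append(chunk)        # ingest 200 ms of input...
        out = respond(heard)       # ...and generate 200 ms of output
        timeline.append((t * CHUNK_MS, chunk, out))
    return timeline

# Toy responder: speaks the moment it hears a greeting, while input continues.
def toy_respond(heard):
    return "hello!" if "hi" in heard[-1] else ""

timeline = micro_turn_stream(["hi", "how are", "you"], toy_respond)
```

Because output is emitted per chunk rather than per turn, the same loop structure accommodates speaking while listening and reacting mid-utterance.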
Encoder-free early fusion is the specific design choice that makes multimodal processing work at this cadence. Rather than routing audio and video through large, separate pretrained encoders (like a Whisper-style ASR model or a standalone TTS decoder), the architecture uses minimal pre-processing. Audio signals are ingested as dMel (discretized mel-spectrogram) features and transformed via a lightweight embedding layer. Video frames are split into 40×40 patches encoded by an hMLP. Audio output uses a flow head for decoding. All components are co-trained from scratch together with the transformer — there is no separately pretrained encoder or decoder at any stage.
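The video side of this pre-processing reduces to non-overlapping patch tokenization. The sketch below shows only that step, in plain Python; in the real system each flattened patch would then pass through a small hMLP embedding rather than a large pretrained vision encoder. The function name and frame representation here are illustrative.

```python
PATCH = 40  # patch side length, per the described 40x40 patching

def split_into_patches(frame, patch=PATCH):
    """Split an H x W frame (list of rows of pixel values) into
    non-overlapping patch x patch tiles, row-major, each flattened
    into a single token vector.
    """
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tile = [frame[py + dy][px + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(tile)
    return patches

# An 80x120 frame yields a 2x3 grid of 40x40 patches -> 6 tokens of 1600 values.
frame = [[0] * 120 for _ in range(80)]
patches = split_into_patches(frame)
```

Keeping this step this cheap is what lets frames enter the transformer within a 200ms budget instead of waiting on a heavyweight encoder forward pass.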
On the inference side, the 200ms chunk design creates engineering challenges. Existing LLM inference libraries aren’t optimized for frequent small prefills — they carry significant per-turn overhead. Thinking Machines implemented streaming sessions, where the client sends each 200ms chunk as a separate request while the inference server appends chunks into a persistent sequence in GPU memory, avoiding repeated memory reallocations and metadata computations. They’ve upstreamed a version of this to SGLang, the open-source inference framework. Additionally, they use a gather+gemv strategy for MoE kernels instead of standard grouped gemm, following prior work from PyTorch and Cursor, to optimize for the latency-sensitive shapes required by bidirectional serving.
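The streaming-session idea can be illustrated with a small sketch. The class and method names below are hypothetical, not SGLang's actual API; the point is the memory pattern — each 200ms chunk arrives as its own request and is appended in place to one persistent pre-allocated sequence, instead of each request triggering a fresh prefill with its own allocations and metadata.

```python
class StreamingSession:
    """Illustrative model of a streaming session on the inference server."""

    def __init__(self, capacity: int = 1 << 16):
        # Pre-allocated buffer standing in for the persistent sequence
        # (KV cache) kept in GPU memory across requests.
        self._buf = [0] * capacity
        self._len = 0
        self.appends = 0  # how many chunk requests this session has served

    def append_chunk(self, tokens: list[int]) -> int:
        # Append in place: no reallocation, no rebuilt metadata —
        # only the newly arrived tokens need processing.
        end = self._len + len(tokens)
        self._buf[self._len:end] = tokens
        self._len = end
        self.appends += 1
        return self._len  # sequence length after this chunk

session = StreamingSession()
for chunk in ([1, 2, 3], [4, 5], [6]):
    total = session.append_chunk(list(chunk))
```

Three small requests extend one sequence to six tokens; a naive server would instead have re-prefixed the whole history three times.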


Benchmarks: Where It Stands
The model, named TML-Interaction-Small, is a 276B parameter Mixture-of-Experts (MoE) with 12B active parameters.
The benchmark table distinguishes between Instant models (no extended reasoning) and Thinking models (with reasoning). TML-Interaction-Small is an Instant model. Among all Instant models in the comparison, it achieves the highest score on Audio MultiChallenge APR at 43.4% — above GPT-realtime-2.0 (minimal) at 37.6%, GPT-realtime-1.5 at 34.7%, and Gemini-3.1-flash-live-preview (minimal) at 26.8%. The Thinking models, GPT-realtime-2.0 (xhigh) at 48.5% and Gemini-3.1-flash-live (high) at 36.1%, use extended reasoning to achieve their scores.
On FD-bench v1.5, which measures interaction quality across user interruption, backchanneling, talking-to-others, and background speech scenarios, TML-Interaction-Small scores 77.8 average quality — compared to 54.3 for Gemini-3.1-flash-live (minimal), 48.3 for GPT-realtime-1.5, and 47.8 for GPT-realtime-2.0 (xhigh).
On FD-bench v1 turn-taking latency, the model responds in 0.40 seconds — compared to 0.57s for Gemini, 0.59s for GPT-realtime-1.5, and 1.18s for GPT-realtime-2.0 (minimal).
On FD-bench v3, which evaluates response quality and tool use (audio + tools combined), TML-Interaction-Small (with background agent enabled) scores 82.8% Response Quality / 68.0% Pass@1 — the highest in the comparison table.


The Thinking Machines research team also introduced new internal benchmarks targeting capabilities that no existing model handles:
- TimeSpeak — Tests whether the model initiates speech at user-specified times with correct content. TML: 64.7 macro-accuracy vs. 4.3 for GPT-realtime-2.0 (minimal).
- CueSpeak — Tests whether the model responds to verbal cues at the correct moment. TML: 81.7 vs. 2.9.
- RepCount-A (adapted from an existing repetition-counting dataset) — Tests visual counting of repeated physical actions in a streaming setting. TML: 35.4 off-by-one accuracy vs. 1.3.
- ProactiveVideoQA (adapted benchmark) — Tests whether the model answers a question at the exact moment the answer becomes visually available in a streamed video. TML: 33.5 PAUC@ω=0.5 vs. 25.0 (the no-response baseline).
- Charades (adapted for temporal action localization) — The model is asked to say “start” and “stop” as an action begins and ends in a streamed video. TML: 32.4 mIoU vs. 0 for GPT-realtime-2.0 (minimal) — a clean zero.
So far, no existing model can meaningfully perform any of these tasks.
Key Takeaways
- Thinking Machines Lab’s interaction model handles real-time audio, video, and text natively — no VAD harness, no turn boundaries, no stitched components.
- The architecture splits into two models: an interaction model that stays live with the user, and a background model that handles reasoning and tool use asynchronously — sharing full conversation context throughout.
- 200ms micro-turns replace the standard request-response loop, enabling simultaneous speech, visual proactivity, and live tool calls without waiting for a user turn to end.
- On FD-bench v1.5 (interaction quality), TML-Interaction-Small scores 77.8 — versus 54.3 for Gemini and 47.8 for GPT-realtime-2.0 (xhigh) — while also leading all Instant models on Audio MultiChallenge intelligence benchmarks.
- Existing real-time APIs score near zero on time-awareness and visual proactivity benchmarks (TimeSpeak, CueSpeak, Charades, RepCount-A) — TML-Interaction-Small is the only model that can meaningfully perform these tasks today.
























