TML Interaction Models: Murati's Real-Time AI Play
Thinking Machines Lab launched TML-Interaction-Small, a 276B MoE model with 0.40s latency and full-duplex conversation. Here's what it changes.
Mira Murati's first model isn't a chatbot; it's a conversation engine
Thinking Machines Lab (TML), the startup founded by former OpenAI CTO Mira Murati, launched TML-Interaction-Small on May 11, and it's not competing where you'd expect. Instead of chasing benchmark supremacy on coding or reasoning, Murati's team built a 276-billion-parameter mixture-of-experts model designed from the ground up for real-time multimodal interaction. The headline number: 0.40-second latency with native full-duplex support, meaning the model can listen and respond simultaneously (per TML's official blog post).
The launch thread on X pulled 6.8 million views in its first days. That's not just curiosity about Murati's post-OpenAI moves; it's a signal that the real-time AI interaction space is heating up fast, with OpenAI's GPT-Realtime-2 and Google's Gemini Live both pushing hard on the same problem.
What "interaction model" actually means
Most AI voice and multimodal products today are bolted-on systems: a speech-to-text layer feeds a language model, which feeds a text-to-speech layer. Each handoff adds latency, drops conversational nuance, and creates the awkward turn-taking that makes talking to AI feel like a walkie-talkie call. You speak, you wait, it responds. Interrupting it either breaks the flow or gets ignored.
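Here's a toy illustration of why that stacking hurts. The per-stage numbers below are made up, not measurements of any specific product; the structure is the point: sequential handoffs add, so the user hears nothing until every stage has finished its part.

```python
# Toy latency budget for a bolted-on voice pipeline. Stage numbers are
# hypothetical, chosen only to show how sequential handoffs stack.
STAGES_MS = {
    "speech_to_text": 300,   # wait for end-of-utterance, then transcribe
    "language_model": 450,   # time to first token
    "text_to_speech": 250,   # time to first audio frame
}

def pipeline_latency_ms(stages: dict) -> int:
    # Each stage blocks on the previous one, so latencies add rather than overlap.
    return sum(stages.values())

print(f"{pipeline_latency_ms(STAGES_MS)} ms before the user hears anything")
# -> 1000 ms with these illustrative numbers, well above TML's claimed 0.40s
```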
TML's pitch, per their blog post, is that TML-Interaction-Small was trained natively for interaction, not as a text model with voice bolted on afterward. The key architectural claims:
- 276B mixture-of-experts: Large total parameter count but only a fraction active per forward pass, which is how they hit the latency target
- 0.40s end-to-end latency: From user input to model output beginning, per TML's published specs
- 200ms micro-turns: The model can produce brief acknowledgments, backchannels, and partial responses while still processing, mimicking how humans say "mm-hm" or "right" during conversation
- Full-duplex interaction: The model processes incoming audio while simultaneously generating output, rather than waiting for a "your turn" signal
The micro-turns detail is the one that caught my attention. Current voice AI products, including GPT-Realtime-2 (which OpenAI launched the same week), operate in half-duplex or near-half-duplex modes. You talk, then the model talks. TML is claiming something closer to how actual human conversation works: overlapping, interruptible, with continuous feedback signals.
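To make the half-duplex vs. full-duplex distinction concrete, here's a minimal asyncio sketch of what a full-duplex session implies at the client level. TML hasn't published an API, so every name below is a placeholder; what matters is the shape: input and output run as concurrent tasks rather than alternating turns.

```python
import asyncio

# Hypothetical sketch -- TML has not published an API. The point is the
# structure: ingestion and generation are concurrent, not turn-gated.

async def microphone_stream():
    for chunk in ("so I was thinking", "about the", "...actually, wait"):
        await asyncio.sleep(0.15)   # stand-in for live audio capture
        yield chunk

async def model_output():
    for utterance in ("mm-hm", "right", "go ahead, I'm listening"):
        await asyncio.sleep(0.2)    # stand-in for generation, incl. short backchannels
        yield utterance

async def full_duplex_session():
    async def ingest():
        # Audio flows in continuously; there is no end-of-turn gate.
        async for chunk in microphone_stream():
            print(f"[user ] {chunk}")

    async def emit():
        # Output overlaps with ingestion instead of waiting for silence.
        async for utterance in model_output():
            print(f"[model] {utterance}")

    await asyncio.gather(ingest(), emit())

asyncio.run(full_duplex_session())
```

Run it and the `[user ]` and `[model]` lines interleave, which is exactly the behavior a half-duplex pipeline structurally can't produce.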
How it compares to GPT-Realtime-2 and Gemini Live
TML launched TML-Interaction-Small into a week that also saw OpenAI release GPT-Realtime-2 with GPT-5-class reasoning for voice agents. That's either terrible timing or deliberate positioning. My read: it's deliberate. TML is making the argument that reasoning quality isn't the bottleneck for real-time AI; interaction quality is.
Here's what we know from published specs and announcements:
| Feature | TML-Interaction-Small | GPT-Realtime-2 | Gemini Live |
|---|---|---|---|
| Architecture | 276B MoE (native interaction) | GPT-5 backbone + voice layer | Gemini 2.5 + streaming |
| Reported latency | 0.40s (per TML blog) | ~0.5-0.8s (per developer reports) | ~0.5-1.0s (varies by region) |
| Full-duplex | Yes (native) | Partial (interruption support) | Partial (interruption support) |
| Micro-turns | 200ms backchannels | Not documented | Not documented |
| Reasoning depth | Not emphasized | GPT-5-class (per OpenAI) | Gemini 2.5-class |
Important caveat: TML's latency and interaction claims come from their own blog post and demo. Independent benchmarks and developer reports will need to validate whether 0.40s holds under real-world conditions with diverse inputs, network variability, and concurrent load. We don't have those yet; the model is days old.
The architectural bet: why MoE matters here
Using a mixture-of-experts architecture for a real-time interaction model is a smart engineering choice, and it's worth understanding why.
In a dense model (like Mistral Medium 3.5's 128B), every parameter fires on every input. That's great for maximizing quality per token but terrible for latency: you're doing the maximum possible compute on every forward pass. MoE models route each input to a subset of "expert" subnetworks, typically activating 20-40% of total parameters. The result: you get the knowledge capacity of a 276B model with the inference cost of something much smaller.
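A toy top-k router shows the mechanism. The dimensions below are arbitrary and tiny (TML hasn't published its expert count, routing scheme, or active-parameter fraction), but the compute story is visible: only the selected experts' matmuls execute per token.

```python
import numpy as np

# Toy top-k MoE layer. Dimensions are arbitrary; TML's actual configuration
# is unpublished. The point: only k of E expert matmuls run per token, so
# compute scales with *active* parameters, not total parameters.
E, k, d = 16, 2, 8                      # experts, active experts per token, hidden dim
rng = np.random.default_rng(0)
experts = rng.normal(size=(E, d, d))    # one weight matrix per expert
router = rng.normal(size=(d, E))        # gating projection

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                 # score every expert for this token
    top = np.argsort(scores)[-k:]       # select the k best-scoring experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                # softmax over the selected experts only
    # 2 of 16 expert matmuls execute here; the other 14 cost nothing this pass.
    return sum(g * (x @ experts[i]) for i, g in zip(top, gates))

print(moe_forward(rng.normal(size=d)))  # a (d,)-shaped output from 1/8 of the expert compute
```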
For a model that needs to produce 200ms micro-turn responses while simultaneously processing incoming audio, this tradeoff makes total sense. You need fast inference more than you need maximum reasoning depth. A dense 276B model doing full-duplex interaction would require either massive hardware or unacceptable latency. MoE lets TML have the parameter count (and the knowledge it encodes) without paying the full compute cost on every tick.
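Rough arithmetic, assuming the generic 20-40% activation range above (TML hasn't published its actual figure), makes the per-token savings concrete:

```python
# Back-of-envelope active-compute math, using the generic 20-40% activation
# range discussed above -- TML has not published its real active-parameter count.
TOTAL_PARAMS = 276e9
for frac in (0.20, 0.30, 0.40):
    active = TOTAL_PARAMS * frac
    # Forward-pass FLOPs per token ~ 2 x active params (standard rule of thumb).
    print(f"{frac:.0%} active: ~{active / 1e9:.0f}B params, "
          f"~{2 * active / 1e12:.2f} TFLOPs per token")
print(f"dense baseline: ~{2 * TOTAL_PARAMS / 1e12:.2f} TFLOPs per token")
# At 20% activation, each token costs roughly 5x less than a dense 276B pass.
```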
This is the opposite design philosophy from what OpenAI is doing with GPT-Realtime-2, where the priority is bringing GPT-5-level reasoning into the voice pipeline. OpenAI is betting that users want smarter voice AI. TML is betting they want more natural voice AI. Both could be right โ they might be building for different use cases.
Why Murati is building this, not a chatbot
When Murati left OpenAI in September 2024, the speculation was that she'd build a competitor to ChatGPT. Instead, TML-Interaction-Small suggests a more specific thesis: the next unlock in AI isn't better answers, it's better conversations.
This tracks with what TechCrunch reported on the launch: TML is positioning interaction models as a new category, distinct from the reasoning-focused models that dominate benchmarks today. The framing isn't "our model is smarter than GPT-5" but "our model actually listens while it talks."
My read on why this matters strategically: the major labs (OpenAI, Anthropic, Google) have optimized relentlessly for benchmark performance on reasoning, coding, and knowledge tasks. That's created a crowded field where differentiation is hard. By defining a new axis of competition (interaction quality), Murati is trying to avoid fighting on territory where incumbents have years of infrastructure advantage.
It's a high-risk play. If users decide they'd rather have a slightly laggy GPT-5-class voice assistant than a fluid-but-less-brilliant interaction model, TML's differentiation evaporates. But if natural-feeling conversation turns out to be what unlocks mainstream adoption of voice AI (which currently has mediocre retention rates despite heavy investment), then TML is early to the right problem.
What we don't know yet
The launch is three days old. Several critical questions remain unanswered:
- API availability and pricing: TML hasn't published API pricing or general availability details. For developers building voice products, this is table-stakes information
- Reasoning quality: TML deliberately isn't emphasizing benchmark scores on standard LLM evaluations. That could mean the model trades reasoning depth for interaction speed (a valid tradeoff, but one developers need to understand before building on it)
- Scaling trajectory: "Small" in the model name implies larger variants are coming. TML hasn't confirmed what's next
- Independent validation: The 0.40s latency and full-duplex claims need third-party confirmation. Demo conditions and real-world deployment conditions are different things
- Multilingual support: No details published on language coverage. For a model pitching itself as the future of human-AI conversation, monolingual English would be a significant limitation
The real-time AI race is now three-way
A month ago, the real-time AI interaction space was a two-horse race between OpenAI (Realtime API) and Google (Gemini Live). TML just made it three, and from an unexpected angle.
OpenAI's approach: take the smartest model available and pipe it through a voice layer. Fast enough, smart as possible. Google's approach: similar architecture, with the advantage of Android integration and global infrastructure. TML's approach: purpose-build the model for interaction from day one, sacrificing some reasoning ceiling for genuine conversational fluidity.
For developers building voice-first products (customer service agents, accessibility tools, tutoring systems, companion apps), this is genuinely good news. Competition on interaction quality means the "talking to a slightly slow chatbot" era should end faster. The question is whether TML can ship reliable API access and developer tooling fast enough to matter before OpenAI and Google incorporate similar interaction-native features into their next model generations.
The honest take: TML-Interaction-Small is the most interesting model launch of May 2026 so far, not because it's the most capable model, but because it's the first one built around a different definition of what "capable" means for real-time AI. Whether that bet pays off depends on details we won't have for months: real-world latency, developer adoption, and whether the "Small" model has enough reasoning depth to be useful beyond demos.
What to watch next
Three things will determine whether TML's interaction models become a real force or an interesting footnote:
- API launch and pricing. Developers can't build on demos. TML needs public API access with competitive pricing, and soon, before GPT-Realtime-2 locks in the early-mover developer base
- Independent latency benchmarks. When third-party developers report real-world latency under load, we'll know whether 0.40s holds or was a best-case demo number (a minimal measurement sketch follows this list)
- The "Medium" and "Large" variants. If TML can scale up reasoning quality while maintaining the interaction architecture, the competitive picture changes dramatically. If "Small" is all they can ship at these latency targets, the reasoning gap becomes harder to paper over
The real-time AI interaction space just got a lot more interesting. Murati didn't build a better chatbot; she's arguing the chatbot paradigm itself is the problem. That's either visionary or premature, and we'll find out which within the next quarter.