🟢 News

Nemotron 3 Ultra: NVIDIA's Open Agent Model Explained

NVIDIA's Nemotron 3 Ultra packs 550B params into 55B active with a Mamba-Transformer MoE design built for long-running agents.

The AI Dude · June 6, 2026 · 8 min read

550 Billion Parameters, 55 Billion Active

NVIDIA dropped Nemotron 3 Ultra on June 4, 2026, and it's one of the more architecturally interesting open models we've seen this year. The headline numbers: 550 billion total parameters, but only 55 billion active at inference time thanks to a Mixture-of-Experts (MoE) design. That 10:1 ratio is aggressive — it means you get the knowledge capacity of a massive model with the compute footprint of something much smaller.

What makes Nemotron 3 Ultra genuinely different from most open models isn't just the parameter count. It's the hybrid Mamba-Transformer architecture underneath. NVIDIA isn't just slapping "open weights" on a standard Transformer and calling it innovation. They're shipping a production model that blends state-space models (Mamba) with traditional attention layers, specifically engineered for the workloads that matter most right now: long-running agentic tasks with heavy tool use.

The model is already live on Perplexity Pro and Max, with API access and open weights available for developers who want to self-host (per NVIDIA's announcement and Perplexity's own posts on X).

Why the Mamba-Transformer Hybrid Matters

Standard Transformers have a well-known problem: attention scales quadratically with sequence length. For a chatbot answering one-off questions, that's manageable. For an agent that's running a 50-step workflow — reading documents, calling APIs, iterating on code, checking results — it becomes a real bottleneck. Every new step adds to the context, and the cost of processing that context grows faster than linearly.

Mamba (and state-space models generally) handle long sequences differently. They process tokens in linear time relative to sequence length, which makes them dramatically more efficient for extended contexts. The tradeoff has historically been that pure SSMs don't match Transformers on tasks requiring precise attention to specific details within the context.

NVIDIA's approach with Nemotron 3 Ultra is to use both: Mamba layers for efficient long-range processing, Transformer attention layers where precise retrieval and reasoning matter. The result, per NVIDIA's published technical details, is a model that maintains strong performance on reasoning benchmarks while being substantially more efficient on the long, multi-turn interactions that agents require.

My read: This is the architecture direction the entire industry is heading. Pure Transformers are hitting efficiency walls for agentic use cases. The fact that NVIDIA is shipping a hybrid at this scale — and open-sourcing it — signals they think this design pattern is ready for production, not just research.

Built for Agents, Not Chat

The positioning here is deliberate. NVIDIA isn't marketing Nemotron 3 Ultra as a ChatGPT competitor or a general-purpose chatbot. The emphasis, across their developer blog and research page, is on long-running agents with tool use.

What that means in practice:

Tool calling as a first-class capability: The model is trained and optimized for structured tool use — function calling, API interactions, and multi-step orchestration where the model decides which tools to invoke and when.
Extended context efficiency: The Mamba-Transformer hybrid keeps inference costs manageable even as agent conversations stretch into thousands of turns or accumulate large working contexts from retrieved documents and tool outputs.
Reasoning under constraints: Agentic tasks often require the model to plan, backtrack, and reason about intermediate results. NVIDIA specifically highlights reasoning performance as a design target, not an afterthought.

This positioning makes strategic sense. The chatbot market is crowded — OpenAI, Anthropic, Google, and xAI all have strong offerings. But the agent infrastructure layer is still wide open, and the models purpose-built for it are few. Most developers building agents today are using general-purpose chat models and working around their limitations. A model designed from the ground up for agentic workflows is a different value proposition.

The MoE Efficiency Angle

The 550B total / 55B active split deserves more attention than it usually gets in these announcements. MoE models route each token to a subset of "expert" sub-networks, so only a fraction of the total parameters are used for any given computation. The benefits are real:

Lower inference cost per token: You're running 55B active parameters, not 550B. For self-hosted deployments, this translates directly to fewer GPUs required.
Knowledge capacity of a much larger model: The full 550B parameter set stores more knowledge and capability than a 55B dense model could. The routing mechanism selects the most relevant experts for each input.
Better scaling economics for agents: Agents make many sequential calls. If each call costs 10x less than a dense 550B model, the total cost of a 50-step agent workflow becomes viable for production use.

For context, this efficiency profile matters enormously for the agent use case. A single user interaction with an agent might involve dozens of model calls — planning, tool selection, result interpretation, error correction. If each call is expensive, agents become cost-prohibitive at scale. The MoE design directly addresses this.

Availability and Access

NVIDIA is making Nemotron 3 Ultra available through multiple channels:

Perplexity Pro and Max: The model is already integrated as an option for Perplexity subscribers, making it immediately accessible without any infrastructure setup.
API access: Available through NVIDIA's API endpoints for developers building applications.
Open weights: The model weights are publicly available, meaning teams can self-host, fine-tune, and modify the model for their specific use cases.

The open-weights release is the most consequential part. Closed APIs are useful, but they don't let you fine-tune for domain-specific agent behaviors, run inference on your own hardware for data privacy, or modify the architecture. For enterprises building proprietary agent systems — particularly in regulated industries where data can't leave their infrastructure — open weights are a requirement, not a nice-to-have.

How This Compares to the Open Model Landscape

Nemotron 3 Ultra enters a competitive open-weight field. Mistral's Medium 3.5 (128B dense) made waves in May with strong SWE-Bench scores. Meta's Llama family continues to expand. DeepSeek has been pushing price-performance boundaries aggressively.

Model	Total Params	Active Params	Architecture	Primary Focus
Nemotron 3 Ultra	550B	55B	Mamba-Transformer MoE	Agents / Tool Use
Mistral Medium 3.5	128B	128B (dense)	Transformer	Coding / General
Llama 4 Maverick	400B	17B	Transformer MoE	General Purpose
DeepSeek V4	685B	37B	Transformer MoE	Reasoning / Code

What stands out is that Nemotron 3 Ultra has the highest active parameter count among the MoE models listed. More active parameters generally means more compute per token but also more capacity for complex reasoning within a single forward pass. NVIDIA is betting that for agentic workloads, you need that extra capacity — agents can't afford to be "almost right" on tool selection or reasoning steps.

What This Tells Us About NVIDIA's AI Strategy

NVIDIA selling GPUs to everyone else building AI models is the obvious business. What's less obvious is why they keep investing in building their own models — particularly open ones they give away for free.

The honest take: Nemotron 3 Ultra is a reference implementation. It proves what's possible on NVIDIA hardware, it drives adoption of NVIDIA's inference stack (TensorRT-LLM, Triton Inference Server), and it establishes NVIDIA as a player in the model layer, not just the chip layer. Every developer who downloads Nemotron 3 Ultra and runs it is likely running it on NVIDIA GPUs. The model is free; the hardware isn't.

There's also a standards-setting play. By releasing a production-quality Mamba-Transformer hybrid, NVIDIA is pushing the industry toward architectures that happen to run well on their hardware. If hybrid SSM-Transformer models become the norm for agents, NVIDIA's software stack — which they've already optimized for these workloads — becomes harder to replace.

Open Questions

A few things NVIDIA hasn't fully addressed yet that developers should watch for:

Fine-tuning requirements: How much compute does it take to fine-tune a 550B MoE hybrid? The Mamba layers add complexity to the training pipeline that standard Transformer fine-tuning recipes don't cover. Community reports on this will be important.
Quantization behavior: MoE models can be tricky to quantize effectively because expert routing is sensitive to weight precision. Whether Nemotron 3 Ultra holds up under INT4/INT8 quantization for local deployment remains to be seen from independent testing.
Real-world agent benchmarks: NVIDIA highlights reasoning and tool-use performance, but standardized agent benchmarks are still maturing. How Nemotron 3 Ultra performs on extended SWE-bench runs, multi-tool orchestration tasks, and real production agent frameworks will determine whether the "built for agents" claim holds up beyond the spec sheet.
Licensing specifics: The exact license terms for commercial use, modification, and redistribution matter for enterprise adoption. Developers should check the model card on NVIDIA's research page for the fine print.

The Bottom Line

Nemotron 3 Ultra is NVIDIA's bet that the next wave of AI isn't chatbots — it's agents. The Mamba-Transformer hybrid architecture is a genuine technical contribution, not just a marketing differentiator. The 550B/55B MoE split makes economic sense for the multi-call patterns agents require. And the open-weights release means developers can actually build on it without API lock-in.

Whether it overtakes the current crop of closed frontier models on raw benchmarks matters less than whether it's the best open option for teams building agent systems today. With Perplexity already integrating it and open weights available for self-hosting, the answer to that question should become clear quickly. If you're building agents and you've been waiting for an open model that takes tool use and long-context efficiency seriously, Nemotron 3 Ultra is worth evaluating now rather than later.

NVIDIA Nemotron 3 UltraNemotron open modelMamba Transformer MoENVIDIA agent model 2026open-weight LLMagentic AI

Share 𝕏 / Twitter Reddit LinkedIn

← Back to blog

Keep reading

News

AI21 Labs Cuts 60% of Staff, Bets on Maestro

AI21 Labs slashes over 60% of staff, drops foundation models, and pivots to its Maestro agent optimization platform after Nebius acquisition talks collapse.

News

Alibaba Bans Claude Code Over Security Concerns

Alibaba told staff to remove Anthropic's Claude Code by July 10 over security concerns. Here's what triggered the ban and what it signals.

News

Anthropic Acquires Stainless: What It Means for AI

Anthropic bought Stainless, the SDK generator behind OpenAI and Cloudflare's client libraries. Here's the strategic play for AI agents.