🔌 News · Beginner

OpenAI's MRC: Networking for 100K+ GPU Clusters

OpenAI's Multipath Reliable Connection protocol rewrites GPU cluster networking with packet spraying and microsecond failover for 100K+ GPU training runs.

The AI Dude · May 12, 2026 · 8 min read

The Boring Infrastructure That Changes Everything

On May 5, 2026, OpenAI announced something that won't trend on social media the way a new model drop does: a networking protocol. Specifically, Multipath Reliable Connection (MRC), an open transport protocol designed to keep 100,000+ GPU clusters running without choking on their own network traffic. OpenAI developed it alongside AMD, Broadcom, Intel, Microsoft, and NVIDIA, and they're releasing it through the Open Compute Project (OCP) so anyone can use it.

This matters more than most model announcements. Here's why: every frontier lab is racing to build clusters with hundreds of thousands of GPUs. The models are scaling. The hardware is scaling. But the networking between those GPUs has been the quiet bottleneck threatening to turn these billion-dollar clusters into very expensive paperweights.

What Problem MRC Actually Solves

Training a frontier model across tens of thousands of GPUs means those GPUs need to constantly exchange data: gradient updates, activations, parameter syncs. Traditional networking approaches route traffic along fixed paths. When you have a few hundred GPUs, that works fine. At 100,000+ GPUs, it becomes a nightmare for two reasons:

  • Path congestion: Fixed routing concentrates traffic on a small number of network links. Some links get hammered while others sit idle. This creates bottlenecks that slow the entire training run.
  • Failure fragility: When a link or switch fails (and at this scale, something is always failing), traditional protocols take seconds or even minutes to reconverge. During that time, thousands of GPUs sit idle waiting for data that isn't arriving. Given what these clusters cost to run, every second of downtime is money burning.

The standard industry answer has been to overbuild the network: add more switches, more redundant links, more capacity headroom. That works, but it's expensive and wasteful. You're essentially paying for a network sized for worst-case failures rather than typical operation.

How MRC Works: Packet Spraying Across Hundreds of Paths

MRC's core innovation, per OpenAI's announcement, is multipath packet spraying. Instead of routing a flow along a single path, MRC sprays individual packets across hundreds of available network paths simultaneously. Think of it as the difference between sending all your mail through one post office versus distributing it across every post office in the city.

This approach delivers two key benefits:

Near-perfect load balancing. Because packets are spread across all available paths, no single link becomes a bottleneck. The network operates much closer to its theoretical maximum throughput. Traditional ECMP (Equal-Cost Multi-Path) routing balances at the flow level, meaning entire flows get pinned to one path. MRC balances at the packet level, which is far more granular.
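The flow-level vs. packet-level difference is easy to see in a toy simulation. The sketch below is an illustration of the general idea, not MRC's actual algorithm: it pins 64 flows to 16 paths the way an ECMP flow hash would, then sends the same packet count with per-packet spraying, and compares the hottest link in each case.

```python
import random
from collections import Counter

NUM_PATHS, NUM_FLOWS, PKTS_PER_FLOW = 16, 64, 100
rng = random.Random(1)  # fixed seed so the toy run is repeatable

# ECMP-style: a flow hash pins every packet of a flow to one path.
ecmp = Counter()
for _flow in range(NUM_FLOWS):
    path = rng.randrange(NUM_PATHS)  # stand-in for a 5-tuple hash
    ecmp[path] += PKTS_PER_FLOW

# Packet spraying: every packet independently picks a path.
spray = Counter()
for _pkt in range(NUM_FLOWS * PKTS_PER_FLOW):
    spray[rng.randrange(NUM_PATHS)] += 1

ideal = NUM_FLOWS * PKTS_PER_FLOW // NUM_PATHS  # perfectly even share
print(f"ideal per-path load: {ideal} packets")
print(f"ECMP hottest path:   {max(ecmp.values())} packets")
print(f"spray hottest path:  {max(spray.values())} packets")
```

With flow pinning, whole 100-packet flows collide on the same link, so the hottest path typically carries several flows' worth of traffic; per-packet spraying lands far closer to the even split.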

Microsecond failure recovery. When a link fails, packets that were using that link simply get redistributed across the remaining paths. There's no reconvergence delay, no routing table recalculation, no multi-second outage. According to OpenAI's announcement, recovery happens at the microsecond scale, orders of magnitude faster than traditional failover mechanisms. For a 100,000-GPU training run that might cost millions of dollars, the difference between microsecond and multi-second recovery is enormous.

The key insight behind MRC is treating the network as a pool of paths rather than a set of fixed routes. When everything is a path, losing one path barely matters.
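The pool-of-paths idea can be sketched in a few lines. This is a hypothetical illustration of the failover behavior described above, not OpenAI's implementation: the sender round-robins over whichever paths are currently alive, so a failure is just a local list update rather than a network-wide reconvergence.

```python
class PathPool:
    """Toy sender that sprays packets over a pool of live paths.

    Hypothetical sketch of the pool-of-paths idea; MRC's real
    data structures are not publicly specified.
    """

    def __init__(self, num_paths):
        self.live = list(range(num_paths))
        self._i = 0

    def next_path(self):
        # Round-robin over whatever paths are alive right now.
        path = self.live[self._i % len(self.live)]
        self._i += 1
        return path

    def fail(self, path):
        # A failure is a local list update: no routing-table
        # recalculation; the very next packet simply avoids the path.
        self.live.remove(path)


pool = PathPool(8)
before = {pool.next_path() for _ in range(8)}    # all 8 paths in use
pool.fail(3)
after = {pool.next_path() for _ in range(16)}    # path 3 never chosen
print(sorted(before), sorted(after))
```

When everything is a path, losing one barely matters: the remaining seven paths absorb the traffic on the very next send.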

Enabling Two-Tier Networks at Scale

There's a subtler but arguably more important implication of MRC: it enables two-tier network topologies for clusters that would traditionally require three tiers.

In conventional datacenter networking, as you scale beyond a few thousand nodes, you need a three-tier spine-leaf-superspine architecture. Each tier adds cost, latency, and complexity. The superspine layer alone can represent a massive chunk of the total network investment.

Because MRC can effectively utilize all available paths and recover from failures instantly, it reduces the need for that extra tier of redundancy. OpenAI's announcement indicates MRC enables two-tier architectures for 100,000+ GPU clusters. Eliminating an entire network tier at this scale isn't an incremental saving; it's a fundamental reduction in infrastructure cost and complexity.
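The tier math here is textbook folded-Clos arithmetic, not figures from the announcement, but it shows why two tiers and packet spraying go together: a non-blocking leaf-spine fabric built from switches with `radix` ports tops out at radix²/2 hosts, with one equal-cost path per spine. Assuming a modern 51.2 Tb/s ASIC configured as roughly 512 lower-speed ports, two tiers already clear the 100K mark:

```python
def two_tier_clos(radix):
    """Host and path counts for a non-blocking two-tier leaf-spine
    fabric. Standard folded-Clos arithmetic, not OpenAI's figures."""
    down = radix // 2    # leaf ports facing hosts
    up = radix - down    # leaf uplinks: one per spine
    max_leaves = radix   # a spine with `radix` ports reaches `radix` leaves
    max_hosts = max_leaves * down
    return max_hosts, up  # `up` equal-cost paths between any two leaves

for radix in (64, 128, 512):
    hosts, paths = two_tier_clos(radix)
    print(f"radix {radix:3}: up to {hosts:,} hosts, {paths} equal-cost paths")
```

At radix 512 the two-tier ceiling is 131,072 hosts with 256 equal-cost paths between any pair of leaves, which is exactly the regime where spraying across "hundreds of paths" pays off. Below that path count, a third (superspine) tier is what conventionally buys more scale.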

My read: this is the detail that should get CFOs at AI labs excited. The GPUs are the headline cost of a training cluster, but networking is typically 15-25% of total cluster cost. Dropping from three tiers to two could meaningfully change the economics of building at this scale.

The Partner List Tells a Story

OpenAI didn't build MRC alone. The partnership roster (AMD, Broadcom, Intel, Microsoft, and NVIDIA) reads like a who's who of the AI infrastructure stack:

| Partner | Role in the Stack |
| --- | --- |
| NVIDIA | Dominant GPU supplier; networking via Mellanox/ConnectX NICs and Spectrum switches |
| AMD | GPU competitor (MI300X/MI400); networking via Pensando DPUs |
| Broadcom | Largest merchant switch silicon vendor (Tomahawk/Tomahawk Ultra ASICs) |
| Intel | CPUs, Gaudi accelerators, Ethernet networking silicon |
| Microsoft | Azure cloud infrastructure, OpenAI's primary compute partner |

The fact that NVIDIA and AMD are both on this list is notable. These companies compete fiercely on GPUs, but they're collaborating on the networking layer. That signals MRC is aiming to be a vendor-neutral standard, not a proprietary lock-in play. Releasing through OCP reinforces this โ€” OCP specs are open and freely implementable.

I think this matters strategically. If MRC becomes the standard transport for AI clusters, it reduces one of NVIDIA's competitive moats (their proprietary InfiniBand networking stack). A world where GPU clusters use open, commodity Ethernet networking with MRC on top is a world where it's easier to mix GPU vendors โ€” which is exactly what labs want as they diversify supply chains.

Already Running in Production

This isn't vaporware. OpenAI's announcement states that MRC is already powering frontier model training. Given the timing (GPT-5 and its successors have been training on massive clusters throughout 2025-2026), it's reasonable to infer that MRC has been battle-tested on real workloads at serious scale.

That's a significant credibility marker. Networking protocols that work in lab conditions and networking protocols that work when 100,000 GPUs are hammering the fabric during a multi-week training run are very different things. OpenAI claiming production deployment suggests they've worked through the hard engineering problems: packet reordering at the receiver, congestion control across sprayed paths, and interoperability with existing RDMA stacks.

Where MRC Fits in the Compute Arms Race

Context matters here. We've covered the AI infrastructure buildout extensively: Anthropic's $1.8B Akamai deal for compute, IREN's $3.4B NVIDIA partnership for AI cloud capacity, and the wider GPU arms race. Every major lab is locked in a spending war to build the biggest clusters possible.

MRC addresses what happens after you buy all those GPUs. Having 200,000 H100s doesn't help if your network can't keep them all fed with data. The networking layer has quietly been one of the biggest engineering challenges in scaling AI training, and it's been largely addressed through brute-force overprovisioning rather than protocol innovation.

A few implications worth tracking:

  • Ethernet over InfiniBand: MRC is designed for Ethernet fabrics. If it delivers on its promises, it weakens the case for InfiniBand in AI clusters. InfiniBand's main advantage has been superior congestion handling and reliability, exactly what MRC aims to bring to Ethernet. Given that Ethernet switches are cheaper and more widely available than InfiniBand, this could shift buying patterns.
  • Cluster efficiency gains: Better network utilization means the same GPU cluster can train faster. For labs spending billions on compute, even a few percentage points of improved GPU utilization translates to significant savings.
  • Democratization potential: As an open OCP specification, MRC could benefit smaller labs and cloud providers who can't afford to build NVIDIA-proprietary InfiniBand networks. Whether this actually plays out depends on how quickly switch vendors and NIC makers implement MRC support.
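To put "a few percentage points" in perspective, here is back-of-the-envelope arithmetic. Every input is a hypothetical assumption for scale (cluster size, blended $/GPU-hour, run length), not a figure from any announcement:

```python
# All inputs are illustrative assumptions, not published figures.
gpus = 100_000
cost_per_gpu_hour = 2.00   # assumed blended $/GPU-hour
run_days = 30              # one multi-week training run

run_cost = gpus * cost_per_gpu_hour * 24 * run_days
print(f"baseline run cost: ${run_cost:,.0f}")

# If less congestion lifts effective GPU utilization from 90% to 93%,
# the same training work needs ~3.2% fewer GPU-hours.
speedup = 0.93 / 0.90
savings = run_cost * (1 - 1 / speedup)
print(f"savings from +3 points of utilization: ${savings:,.0f}")
```

Under these assumed numbers, a three-point utilization gain on a single month-long run is worth several million dollars, which is why network efficiency shows up on the balance sheet and not just in benchmarks.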

What We Don't Know Yet

OpenAI's announcement is light on some details that matter:

  • Specific performance benchmarks. We know MRC enables microsecond failover and better link utilization, but OpenAI hasn't published detailed throughput comparisons against InfiniBand or conventional ECMP Ethernet at equivalent scale. Those numbers would tell us exactly how much improvement MRC delivers.
  • Implementation complexity. Does MRC require new NIC hardware, or can it be implemented in software/firmware on existing NICs? The answer dramatically affects adoption speed. A firmware update is months; new silicon is years.
  • Congestion control details. Packet spraying creates packet reordering at the receiver, which can confuse traditional TCP congestion control. MRC presumably has a solution, but the specifics haven't been publicly detailed.
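A reorder buffer is the standard answer to out-of-order arrival, and presumably something like it sits in MRC's receiver; the sketch below is a generic illustration of the technique, not MRC's specification. Packets carry sequence numbers, arrive in whatever order their paths deliver them, and are released to the application only once contiguous:

```python
def deliver_in_order(packets):
    """Reassemble sprayed packets into sequence order at the receiver.

    Minimal reorder-buffer sketch (an assumption, not MRC's documented
    mechanism). `packets` is an iterable of (seq, payload) pairs
    arriving in arbitrary order; sequence numbers start at 0.
    """
    buffer = {}
    next_seq = 0
    delivered = []
    for seq, payload in packets:
        buffer[seq] = payload
        # Flush every consecutive packet we can now deliver.
        while next_seq in buffer:
            delivered.append(buffer.pop(next_seq))
            next_seq += 1
    return delivered

# Packets sprayed across different paths arrive out of order...
arrivals = [(2, "c"), (0, "a"), (3, "d"), (1, "b"), (4, "e")]
print(deliver_in_order(arrivals))  # payloads back in sequence order
```

The open question for any real transport is how big that buffer must be and how loss is told apart from mere reordering, which is precisely the congestion-control detail OpenAI hasn't yet published.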
  • OCP timeline. The spec is going through OCP, but OCP specifications can take months to finalize. When vendors will ship MRC-compatible products to the broader market remains unclear.

These gaps aren't criticisms; it's a first announcement, and OpenAI clearly intends to publish more through OCP. But anyone planning infrastructure purchases should wait for the technical details before making bets.

The Honest Take

Networking protocols don't generate hype. Nobody's posting "MRC is the new GPT" on X. But infrastructure innovation at this layer is what separates labs that can actually train frontier models from labs that just own a lot of GPUs they can't fully utilize.

OpenAI making this open through OCP is genuinely interesting. They could have kept MRC proprietary as a competitive advantage in training efficiency. By open-sourcing it, they're betting that a rising tide lifts all boats, or, more cynically, that commoditizing the networking layer benefits them by increasing competition among their hardware suppliers.

Either way, MRC is worth watching. The AI compute buildout is a trillion-dollar infrastructure wave, and the protocols that make it actually work at scale are as important as the chips themselves. OpenAI just published theirs. The question is whether the rest of the industry adopts it or builds competing alternatives.

OpenAI MRC · Multipath Reliable Connection · AI networking protocol · GPU clusters · Open Compute Project
