MLCommons released MLPerf Training 6.0 on June 16, adding two mixture-of-experts benchmarks that make the industry’s AI training race look less like a raw GPU contest and more like a test of whole infrastructure systems.
The new round matters because its workloads now resemble the sparse architectures frontier AI is moving toward. MLPerf Training 6.0 adds DeepSeek V3, a 671-billion-parameter MoE model with 37 billion parameters active per token, and GPT-OSS 20B, a smaller MoE benchmark with 21 billion total parameters and 3.6 billion active per token. Those tests stress token routing, memory movement, networking, low-precision training, software frameworks, and multi-node coordination, not only peak accelerator performance.
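To put those parameter counts in perspective, the short calculation below works out what share of each model is actually used per token. It is a back-of-the-envelope sketch that uses only the figures quoted above; the `active_fraction` helper and the dictionary layout are illustrative names, not part of any MLPerf tooling.

```python
# Parameter counts in billions, taken from the figures quoted in the article.
MODELS = {
    "DeepSeek V3": {"total_b": 671.0, "active_b": 37.0},
    "GPT-OSS 20B": {"total_b": 21.0, "active_b": 3.6},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of a model's parameters that are active per token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By this arithmetic DeepSeek V3 touches roughly 5.5% of its parameters per token and GPT-OSS 20B roughly 17%, which is why total parameter count alone says little about the compute, memory, and network load of a training step.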
NVIDIA used the round to claim the strongest headline result. Its Blackwell platform led across all seven benchmark categories, including the new sparse-model workloads, and the company highlighted an 8,192-GPU Blackwell NVL72 submission on DeepSeek V3. Microsoft Azure also scaled Llama 3.1 405B training to 8,192 GPUs on GB200 NVL72 systems, while CoreWeave posted the fastest DeepSeek V3 671B training result at that same 8,192-GPU scale using GB300 NVL72 systems.
Why Sparse Models Change the Benchmark
Mixture-of-experts models do not use every parameter for every token. A routing layer sends each token to selected expert subnetworks, which can make huge models cheaper to run than dense models of similar total size. That efficiency comes with an infrastructure bill: tokens and expert activations have to move across accelerators quickly enough that routing does not erase the gains.
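The routing idea can be sketched in a few lines of Python. This is a generic softmax top-k router with random scores, offered as an illustration only: it is not the router used by DeepSeek V3 or GPT-OSS 20B, and the expert count, `k`, device layout, and function names are all assumptions chosen to keep the toy small. It counts how often a token's chosen expert lives on a different device from the token, which is the traffic the paragraph above describes.

```python
import math
import random
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-probability experts for one token.

    Returns [(expert_index, weight), ...] with weights renormalised to sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def simulate(num_tokens=1000, num_experts=8, k=2, experts_per_device=2, seed=0):
    """Route random tokens and count per-expert load and cross-device dispatches.

    Assumed toy layout: experts are placed on devices in contiguous groups,
    and tokens are spread round-robin over the same devices.
    """
    rng = random.Random(seed)
    num_devices = num_experts // experts_per_device
    load = Counter()
    remote = 0
    for t in range(num_tokens):
        home_device = t % num_devices
        # Stand-in for a learned router: random scores per expert.
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        for expert, _weight in route_top_k(scores, k):
            load[expert] += 1
            if expert // experts_per_device != home_device:
                remote += 1
    return load, remote


load, remote = simulate()
total = sum(load.values())
print(f"{remote} of {total} token-to-expert dispatches crossed devices")
```

Even in this toy setup most dispatches leave the token's home device, because only a small share of experts sit locally. Real systems add load balancing and capacity limits on top, but the underlying all-to-all traffic pattern is the same.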
That is why MLPerf’s new tests are useful beyond vendor bragging rights. A system that performs well on dense LLM pretraining may not behave the same way when the workload demands all-to-all communication across many GPUs. NVIDIA’s explanation of its results points directly to that issue: GB200 and GB300 NVL72 racks connect 72 GPUs through NVLink switches so the rack can act more like one large pool of compute and memory. The company also points to NVFP4 training methods and Blackwell Ultra’s higher compute density, memory capacity, and power envelope as reasons GB300 NVL72 improved over GB200 NVL72 at the same scale.
For model builders, the practical question is not whether one chip is faster in isolation. It is whether the training run reaches its target reliably, uses expensive accelerators efficiently, and can be repeated when the model, dataset, or checkpointing strategy changes. That is where networking, failure recovery, kernel tuning, data pipelines, and cloud orchestration start to matter as much as accelerator specs.
Cloud Training Is Becoming Part of the Main Event
MLCommons reported 95 unique systems in the v6.0 round, using 13 hardware accelerator types and 19 host processors. Sixty percent of submitted systems were multi-node, and the number of cloud systems more than doubled compared with MLPerf Training 5.1 six months earlier. That shift is important for companies that will not build their own frontier-scale clusters but still need credible training capacity for fine-tuning, multimodal workloads, or domain-specific models.
Nebius, for example, published MLPerf Training 6.0 results on NVIDIA Blackwell Ultra systems and emphasized reproducible cloud performance. Its HGX B300 submissions led comparable single-node B300 results for Llama 3.1 8B and GPT-OSS 20B pretraining, while its GB300 NVL72 submissions landed within a few percentage points of the fastest results at 72-GPU scale across Llama 3.1 8B, FLUX.1, and GPT-OSS 20B.
That kind of cloud-provider submission gives buyers something more concrete than a spec sheet. If a cloud platform can show verified results on the same benchmark suite as chip vendors and on-premises server makers, customers get a better view of whether the provider’s virtualization, networking, storage, and orchestration layers are preserving the performance promised by the underlying hardware.
AMD’s Results Show the Race Is Not Frozen
NVIDIA still owns the strongest headline position, but AMD’s MLPerf Training 6.0 submission shows a maturing alternative path. AMD said its Instinct MI355X platform came within 5% of NVIDIA B200 on Llama 2 70B fine-tuning and within 6% on Llama 3.1 8B pretraining in the closed division. It also emphasized a 3.5x generational improvement from its first MI300X training submission to MI355X on Llama 2 70B fine-tuning.
The more interesting AMD detail is scale-out validation. AMD and Oracle Cloud Infrastructure submitted FLUX.1 results at 512 GPUs, while AMD listed partner participation from Oracle, Dell, HPE, Cisco, Supermicro, MiTAC, Akash, KRAI, Vultr, and GigaComputing. Partner submissions close to AMD’s own results are a sign that ROCm, kernels, communication libraries, and server integrations are becoming more repeatable across the ecosystem.
That does not erase NVIDIA’s lead. It does suggest the training market is becoming less binary. Buyers with enough scale will still care about absolute time-to-train, but they will also compare memory capacity, software maturity, cloud availability, pricing, procurement risk, and whether a given model stack is already tuned for their workload.
The Next Benchmark Problem Is Agents
Training is only one side of the infrastructure story. NVIDIA also pointed this week to AgentPerf results from Artificial Analysis, a benchmark designed around agentic workloads rather than single-turn inference. The first published results use DeepSeek V4 Pro and measure how many concurrent agent tasks a system can support while meeting responsiveness and output-rate thresholds.
That workload behaves differently from a chatbot prompt. Coding agents read files, call tools, run commands, revise code, and carry expanding context across many model calls. AgentPerf simulates tool-call delays and CPU processing time, so differences between systems reflect the model-serving part of the chain rather than tool execution. NVIDIA says GB300 NVL72 ran up to 20 times more agents per megawatt than HGX H200 in the first round.
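A metric of this shape can be sketched as follows. The code is a hedged illustration of the general idea (highest concurrency that still meets the thresholds, divided by power draw), not AgentPerf's actual methodology. The `LoadPoint` fields, the threshold names, and every number in the example are made-up placeholders rather than published results.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    """One hypothetical measurement at a fixed number of concurrent agents."""
    concurrent_agents: int
    response_time_s: float         # assumed responsiveness measure
    tokens_per_s_per_agent: float  # assumed output-rate measure


def max_supported_agents(points, max_response_s, min_tokens_per_s):
    """Highest tested concurrency that meets both thresholds, or 0 if none does."""
    passing = [
        p.concurrent_agents
        for p in points
        if p.response_time_s <= max_response_s
        and p.tokens_per_s_per_agent >= min_tokens_per_s
    ]
    return max(passing, default=0)


def agents_per_megawatt(agents, system_power_kw):
    """Normalise a supported-agent count by system power (1 MW = 1000 kW)."""
    return agents * 1000.0 / system_power_kw


# Placeholder sweep: illustrative values only, not benchmark data.
sweep = [
    LoadPoint(64, 1.0, 40.0),
    LoadPoint(128, 2.0, 30.0),
    LoadPoint(256, 5.0, 10.0),
]
supported = max_supported_agents(sweep, max_response_s=3.0, min_tokens_per_s=20.0)
print(supported, agents_per_megawatt(supported, system_power_kw=100.0))
```

Normalising by power rather than by accelerator count is what lets a rack-scale system be compared with an eight-GPU server on the same axis.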
The timing is useful. AI infrastructure buyers are being asked to fund training clusters, inference fleets, private AI clouds, and agent platforms at the same time. MLPerf Training 6.0 and early agent benchmarks are not perfect proxies for any one company’s workload, but they push the conversation toward the right questions: how sparse models train, how rack-scale systems communicate, how cloud providers reproduce vendor performance, and how much useful AI work can be delivered per watt and per dollar.
That is the real takeaway from this benchmark round. The AI infrastructure race is no longer only about who has the fastest accelerator. It is about which full stack can keep increasingly irregular AI workloads moving without wasting the most expensive hardware in the data center.