Read Time: 1 Minute
Microsoft has set a new industry benchmark by running inference at an aggregate 1.1 million tokens per second on its Azure ND GB300 system, pushing the boundaries of large-scale AI performance.
The milestone was achieved using Azure’s ND GB300 v6 virtual machines, deployed across a single NVIDIA GB300 NVL72 rack (72 GPUs), running Meta’s Llama 2 70B model in an unverified MLPerf Inference v5.1 test. It beats Microsoft’s earlier record of 865,000 tokens/sec on the older ND GB200 v6 system.
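As a rough sanity check (the per-GPU breakdown below is our arithmetic, not a figure Microsoft reported), dividing the aggregate result by the 72 GPUs in an NVL72 rack gives the implied per-GPU throughput:

```python
# Back-of-envelope per-GPU throughput, assuming the reported figure is
# an aggregate across all 72 GPUs in the GB300 NVL72 rack.
aggregate_tokens_per_sec = 1_100_000  # reported rack-level result
gpus_per_rack = 72                    # NVIDIA GB300 NVL72

per_gpu = aggregate_tokens_per_sec / gpus_per_rack
print(f"~{per_gpu:,.0f} tokens/sec per GPU")  # ~15,278 tokens/sec per GPU
```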
Under the hood, the GB300 architecture is designed for inference workloads. It offers 50 percent more GPU memory and 16 percent higher thermal design power (TDP) than its predecessor, enabling better throughput. Performance gains also stem from efficiency improvements in memory, interconnect, and compute.
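Those percentages line up with commonly cited per-GPU specs; here is a quick sketch assuming 288 GB of HBM3e at roughly 1,400 W for GB300 versus 192 GB at roughly 1,200 W for GB200 (the absolute figures are assumptions on our part, not stated in the article):

```python
# Illustrative GB300 vs GB200 comparison. The absolute specs below are
# commonly cited figures, assumed here for illustration, not sourced.
gb200 = {"hbm_gb": 192, "tdp_w": 1200}
gb300 = {"hbm_gb": 288, "tdp_w": 1400}

mem_gain = gb300["hbm_gb"] / gb200["hbm_gb"] - 1  # 0.50 -> the "50 percent more"
tdp_gain = gb300["tdp_w"] / gb200["tdp_w"] - 1    # ~0.167 -> the ~16 percent cited
print(f"memory: +{mem_gain:.0%}, TDP: +{tdp_gain:.1%}")  # memory: +50%, TDP: +16.7%
```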
This achievement underscores Microsoft’s deep collaboration with NVIDIA. The two companies have been co-engineering optimizations across hardware, software, and networking to extract peak performance from AI workloads.
Industry analysts see this as a signal: high-throughput inference is becoming more viable and cost-efficient as organizations scale AI deployments. The ability to process more tokens per second translates into faster responses and more concurrent users served per rack, supporting real-time AI use cases in areas like chat, agents, and recommendation systems.
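To make that concrete, here is a crude upper-bound sketch of what rack-level throughput means for serving capacity; the 50 tokens/sec per-user rate is an assumed illustrative value, and the estimate ignores batching overhead, prompt processing, and scheduling:

```python
# Upper-bound estimate of concurrent generation streams a rack could feed.
# per_user_rate is an assumed comfortable streaming speed, not a sourced
# figure; real deployments serve fewer streams than this ceiling.
aggregate = 1_100_000   # tokens/sec, reported rack-level figure
per_user_rate = 50      # tokens/sec per stream (assumption)

print(f"~{aggregate // per_user_rate:,} concurrent streams (upper bound)")
# -> ~22,000 concurrent streams (upper bound)
```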
However, it’s worth noting that the 1.1 million tokens/sec result is currently unverified by MLCommons, meaning it has not passed the benchmark’s full standard validation process. That said, the run was observed by third parties such as the analyst firm Signal65, lending credence to the claim.
Looking ahead, this breakthrough could reshape how enterprises think about deploying large language models in production. If this kind of performance becomes stable and broadly available, it will reduce infrastructure barriers and open doors for more advanced AI at scale.