NVIDIA Sets New MLPerf Inference Records: 2.5M Tokens/Sec on Blackwell Ultra, 3x Speedups and 60% VRAM Reduction via Software Optimizations

NVIDIA announced PyTorch-CUDA software optimizations that deliver up to 3x higher performance and 60% lower VRAM use for video and image generative AI workloads, with native NVFP4 and FP8 precision support. Blackwell Ultra submissions reached a record throughput of 2.5M tokens/sec in the MLPerf Inference benchmarks, while RTX AI infrastructure demonstrated 35% faster inference for small language models via Ollama and llama.cpp. NVIDIA also announced optimizations for Google's Gemma 4 on RTX PCs, DGX Spark, and edge devices, and introduced new local agent models, including Nemotron 3 Nano 4B and Nemotron 3 Super 120B. Separately, NVIDIA's DLSS 5 and Neural Texture Compression technology, which reduces VRAM use from 6.5 GB to 970 MB (roughly 85%), are facing backlash from game developers who label AI-generated frames 'AI slop'; Jensen Huang has publicly defended the technology.