NVIDIA unveils Nemotron 3 Ultra, a 550B-parameter open model for AI agents

NVIDIA introduced Nemotron 3 Ultra, a 550-billion-parameter mixture-of-experts (MoE) open language model aimed at long-running autonomous AI agents for coding, research and enterprise workflows. NVIDIA reports up to 5x faster inference and roughly 30% lower running costs than comparable models, with weights scheduled to land on Hugging Face, ModelScope and OpenRouter around June 4.
The MoE design is the cost lever: because only a subset of the 550B parameters is activated for each token, NVIDIA aims to deliver frontier-scale capability at materially lower serving cost. That directly addresses the inference-cost anxieties that surfaced this week around Anthropic's agent-heavy Dynamic Workflows. Releasing open weights also seeds adoption on the very hardware NVIDIA sells.
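The announcement does not describe Nemotron 3 Ultra's expert count or routing scheme, so the following is only a generic illustration of why sparse activation cuts compute. It sketches top-k gating, a common MoE routing approach: a router scores every expert for a token, only the k highest-scoring experts run, and their outputs are blended by softmax weights. The function names `top_k_route` and `moe_layer`, the four toy experts and the gate scores are all hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Hypothetical router: pick the k highest-scoring experts for one token
    and softmax-normalize their scores so the selected weights sum to 1."""
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected logits (subtract the max for numerical stability).
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, gate_logits, k):
    """Combine outputs of only the routed experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Toy usage: four hypothetical experts, each scaling its input by a constant.
calls = []

def make_expert(c):
    def expert(x):
        calls.append(c)  # record which experts actually run
        return c * x
    return expert

experts = [make_expert(c) for c in range(4)]
y = moe_layer(1.0, experts, [0.1, 2.0, 0.5, 1.5], k=2)
print(calls)        # [1, 3]: only the two routed experts were evaluated
print(round(y, 4))  # 1.7551
```

With k=2 of 4 experts, half of the expert computation is skipped for this token; in a real MoE model the skipped fraction depends on the expert count, k, and any shared (always-on) layers, none of which are stated for Nemotron 3 Ultra here.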
Competitively, Nemotron 3 Ultra enters a crowded open-model field alongside MiniMax M3, Alibaba's Qwen 3.7, DeepSeek and Meta's Llama line, all chasing the agentic-coding use case. NVIDIA's differentiators are co-optimization with its own silicon and the efficiency claims. The skeptical read: vendor-reported speed and cost figures need independent verification, and a 550B model, even a sparse one, still demands serious infrastructure to run. Watch the June 4 weight release and early third-party benchmarks to see whether the 5x/30% claims hold up on real workloads.