NVIDIA releases Nemotron-3-Ultra 550B LatentMoE model with 1M context
NVIDIA released Nemotron-3-Ultra-550B-A55B-Base-BF16, a large open base model built on a hybrid Latent Mixture-of-Experts (LatentMoE) architecture with 55B active and 550B total parameters. It includes Multi-Token Prediction (MTP) layers to improve generation quality and throughput, was pre-trained on 20 trillion tokens, and supports context windows up to 1 million tokens.
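As a rough back-of-the-envelope check on those figures, the sketch below computes the active-parameter fraction and the storage needed for the weights alone. It assumes 2 bytes per parameter for BF16 and takes the 55B/550B counts at face value. The function names are illustrative, and the estimate ignores KV cache, activations, and any runtime overhead.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each value in 16 bits

TOTAL_PARAMS = 550e9   # total parameters, as stated in the release
ACTIVE_PARAMS = 55e9   # parameters active per token, as stated in the release


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_params / total_params


def weight_bytes(total_params: float,
                 bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Storage for the weights only (no KV cache, no activations)."""
    return total_params * bytes_per_param


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    size_tb = weight_bytes(TOTAL_PARAMS) / 1e12  # decimal terabytes
    print(f"active fraction: {frac:.0%}")
    print(f"BF16 weights: {size_tb:.1f} TB")
```

Under these assumptions, every parameter still has to sit in memory even though only about a tenth of them participate in any one token, which is why sparse activation lowers compute cost but not the storage footprint.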
The LatentMoE design activates only a small fraction of total parameters per token (here 55B of 550B, about 10%), so the model aims for large-model quality at a per-token inference cost closer to that of a much smaller dense model. MTP layers, which predict multiple future tokens per step, are an increasingly common technique (used by DeepSeek and others) to boost decoding speed and quality.
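To make sparse activation concrete, here is a toy top-k router in plain Python. It is a generic Mixture-of-Experts sketch, not NVIDIA's LatentMoE implementation, whose internals are not described here. The expert count, the value of k, the gate scores, and all function names are made-up values for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, gate_scores, k):
    """Weighted sum over the selected experts only; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))


# Hypothetical setup: 8 tiny "experts", each of which just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]

# In a real model these scores come from a learned gate applied to the token's
# hidden state; here they are hard-coded for illustration.
gate_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(gate_scores, k=2)
output = moe_forward(1.0, experts, gate_scores, k=2)
```

With k=2 of 8 experts, only a quarter of the expert parameters are touched for this token. The same idea, at a different ratio, underlies the 55B-active / 550B-total split.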
Distribution came fast. Perplexity CEO Aravind Srinivas announced that Nemotron 3 Ultra is available to all Pro and Max users, billing it 'America's leading open-source model.' NVIDIA also expanded its Nemotron coalition with new members including H Company, Nous Research, and Prime Intellect, alongside existing partners Mistral, Cursor, LangChain, and Perplexity.
Strategically, Nemotron is NVIDIA's open-weights flag-plant: it builds an ecosystem that runs best on NVIDIA hardware while courting the open community in competition with Llama, Qwen, and Gemma. Caveats: a 550B base model is heavy to deploy, the open release is a base (not instruction-tuned) checkpoint, and 'leading open-source' claims await independent evals. What to watch: fine-tunes built on it and real-world 1M-context performance.