NVIDIA | June 12, 2026 | 1 source

NVIDIA Releases DiffusionGemma 26B on Hugging Face with 1,100 Tokens/Sec on H100

AI Analysis

DiffusionGemma 26B applies diffusion-based generation to text and ships as an NVFP4-quantized release optimized for NVIDIA Hopper hardware. The model is built on Google DeepMind's Gemma 4 26B A4B Mixture-of-Experts architecture (26B total parameters, with roughly 4B active per token, as the A4B suffix indicates). NVIDIA claims throughput exceeding 1,100 tokens/second on H100 GPUs alongside a 256K-token context window.

The collaboration is notable: NVIDIA is packaging and quantizing a Google DeepMind model for high-speed inference on its own silicon, which shows how cross-vendor open-weight model distribution on Hugging Face has become. NVFP4, NVIDIA's 4-bit floating-point format, is the mechanism credited with the high throughput: it trades some numerical precision for a smaller memory footprint and faster inference.
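To put the memory side of that trade-off in rough numbers, here is a back-of-envelope sketch rather than a measurement of the released checkpoint. It assumes exactly 26B weights, and it assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-value block (about 4.5 bits per weight). It ignores per-tensor scales and any layers left in higher precision. The helper name weight_gigabytes is hypothetical.

```python
# Back-of-envelope weight-memory estimate; not a measurement of the real checkpoint.

def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Return weight storage in GB (10^9 bytes) for a given bit width."""
    return params * bits_per_weight / 8 / 1e9

# "26B" taken from the model name; assumed exact for illustration only.
TOTAL_PARAMS = 26e9

# 16-bit baseline (BF16/FP16).
BF16_BITS = 16.0

# Assumption: 4-bit values plus one 8-bit scale shared by each 16-value block,
# giving 4 + 8/16 = 4.5 effective bits per weight. Per-tensor scales and any
# unquantized layers are ignored here.
NVFP4_BITS = 4.0 + 8.0 / 16.0

bf16_gb = weight_gigabytes(TOTAL_PARAMS, BF16_BITS)
nvfp4_gb = weight_gigabytes(TOTAL_PARAMS, NVFP4_BITS)

print(f"BF16 : {bf16_gb:.1f} GB")
print(f"NVFP4: {nvfp4_gb:.1f} GB")
print(f"ratio: {bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from about 52 GB to under 15 GB, enough to fit on a single 80 GB H100 with room left for activations and cache. The real file sizes on the model page may differ.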

Diffusion for text remains a comparatively experimental direction next to standard autoregressive LLMs. The LocalLLaMA community has shown interest in such releases as locally runnable, uncensorable alternatives, a sentiment amplified this week by the Anthropic Fable 5 ban. Skeptics will want quality benchmarks, not just speed numbers. Watch next: independent quality evals of diffusion text generation against autoregressive peers, and adoption in local-inference toolchains.

Sources

huggingface.co

https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4