NVIDIA Releases Quantized DiffusionGemma 26B on Hugging Face

NVIDIA published DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, an NVFP4-quantized version of Google DeepMind's diffusion-based DiffusionGemma model. Built on the Gemma 4 26B A4B Mixture-of-Experts architecture, the model reportedly exceeds 1,100 tokens per second on NVIDIA Hopper H100 GPUs and supports a 256K-token context window.
The release shows the diffusion-LM approach Google introduced this week reaching the open ecosystem, already paired with hardware-optimized quantization. Diffusion-based language generation promises faster output by producing tokens in parallel rather than one at a time, as traditional autoregressive decoding does. The 1,100+ tokens/sec figure is the headline draw for latency-sensitive workloads.
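To make the throughput claim concrete, here is a rough back-of-the-envelope sketch in Python. It converts the reported 1,100 tokens/sec into response latency and compares the number of sequential model passes under a simplified decoding model. The block size and denoising-step count are hypothetical values chosen for illustration; they are not DiffusionGemma's published schedule.

```python
import math

REPORTED_TOKENS_PER_SECOND = 1100.0  # figure quoted for H100 in this article


def response_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding emits one token per sequential forward pass."""
    return num_tokens


def diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Simplified diffusion decoding: each block of tokens is refined in
    parallel over a fixed number of denoising steps (hypothetical schedule)."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    for n in (550, 1100, 4400):
        secs = response_seconds(n, REPORTED_TOKENS_PER_SECOND)
        print(f"{n} tokens -> {secs:.2f} s")

    # Assumed values for illustration only: 64-token blocks, 8 steps each.
    n = 512
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes:", diffusion_passes(n, block_size=64, steps_per_block=8))
```

Under these assumed parameters the diffusion decoder needs far fewer sequential passes for the same output length. That is the source of the promised speedup, though each pass does more work and real schedules may differ.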
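The r/LocalLLaMA community picked it up quickly, and Transformers v5.11.0 added native DiffusionGemma support, smoothing local deployment. NVFP4 quantization stores weights in a 4-bit floating-point format, cutting weight memory to roughly a quarter of a 16-bit checkpoint and making the 26B model practical to run on a single high-end accelerator.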
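A quick estimate shows why 4-bit quantization matters for a model of this size. The sketch below assumes, for simplicity, that all 26B parameters are quantized and that NVFP4 stores one 8-bit scale per 16-value block. That layout is our understanding of the format, not something stated in the release. Real checkpoints typically keep some layers in higher precision, and activations and cache memory are not counted here.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float,
                        block_size: int = 0, scale_bits: int = 0) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If block_size > 0, each block of weights carries one shared scale of
    scale_bits bits, which adds scale_bits / block_size bits per weight.
    """
    overhead = scale_bits / block_size if block_size else 0.0
    return num_params * (bits_per_weight + overhead) / 8 / 1e9


TOTAL_PARAMS = 26e9  # "26B" from the model name; all experts must be resident

# 16-bit baseline (e.g. BF16), no block scales.
bf16_gb = weight_footprint_gb(TOTAL_PARAMS, 16)

# Assumed NVFP4 layout: 4-bit values, one 8-bit scale per 16-value block.
nvfp4_gb = weight_footprint_gb(TOTAL_PARAMS, 4, block_size=16, scale_bits=8)

if __name__ == "__main__":
    print(f"16-bit weights: {bf16_gb:.1f} GB")
    print(f"NVFP4 weights (assumed layout): {nvfp4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 52 GB to under 15 GB, leaving headroom on one accelerator for activations and long-context state.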
Watch for real-world throughput and quality benchmarks from independent testers, and for how diffusion LMs compare with autoregressive models on reasoning tasks, where they have historically lagged.