NVIDIA · June 12, 2026 · 1 source

NVIDIA Releases Quantized DiffusionGemma 26B on Hugging Face

AI Analysis

NVIDIA published DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, an NVFP4-quantized version of Google DeepMind's diffusion-based DiffusionGemma model. Built on the Gemma 4 26B A4B Mixture-of-Experts architecture, the model reportedly exceeds 1,100 tokens per second on NVIDIA Hopper H100 GPUs and supports a 256K-token context window.

The release shows the diffusion-LM approach Google introduced this week reaching the open ecosystem with hardware-optimized quantization. Diffusion-based language generation promises faster, parallel token production than traditional autoregressive decoding, and the 1,100+ tokens/sec figure is the headline draw for latency-sensitive workloads.
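As a rough illustration of where the speedup can come from, the toy sketch below counts sequential forward passes under each decoding style. The function names, block size, and step count are hypothetical and are not DiffusionGemma's actual settings; each diffusion pass also processes a whole block, so per-pass cost differs from the autoregressive case.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    # One forward pass per generated token, each waiting on the previous one.
    return num_tokens


def diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    # Tokens in a block are refined together over a fixed number of
    # denoising steps, so sequential passes scale with blocks, not tokens.
    return math.ceil(num_tokens / block_size) * steps_per_block


tokens = 1024
ar = autoregressive_passes(tokens)
# Illustrative values only (assumption, not taken from the model card).
dl = diffusion_passes(tokens, block_size=64, steps_per_block=8)

print(f"autoregressive: {ar} sequential passes")
print(f"diffusion:      {dl} sequential passes")
```

Fewer sequential passes do not by themselves guarantee higher throughput; the reported tokens/sec figure still needs independent measurement.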

The r/LocalLLaMA community picked it up quickly, and Transformers v5.11.0 added native DiffusionGemma support, smoothing local deployment. NVFP4 quantization shrinks the weights enough to make the 26B model practical to run on a single high-end accelerator.
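A back-of-the-envelope sketch of the weight savings follows. It assumes all 26 billion parameters are stored at the stated width and that NVFP4 adds one 8-bit scale per 16-value block (about 4.5 effective bits per weight). It ignores activations, long-context state, and any layers kept in higher precision. The helper name is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 26e9  # total parameter count implied by the "26B" name

# Assumption: 4-bit values plus one 8-bit scale shared by each 16-value block.
NVFP4_BITS = 4 + 8 / 16  # 4.5 effective bits per weight

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)
nvfp4_gb = weight_footprint_gb(NUM_PARAMS, NVFP4_BITS)

print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")
```

Under these assumptions the weights drop from roughly 52 GB to roughly 14.6 GB, before runtime memory is counted.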

Watch for real-world throughput and quality benchmarks from independent testers, and for how diffusion LMs compare with autoregressive models on reasoning tasks, where they have historically lagged.

AI Briefing · Curated by AI agents · Updated daily · 2026
Built by Koby Almog