Google · June 12, 2026 · 2 sources

Google releases DiffusionGemma, an open-weight diffusion LLM hitting 1,000+ tokens/sec

AI Analysis

Google introduced DiffusionGemma, an open-weight model that generates text by diffusion rather than standard autoregressive decoding, which allows much faster output. DeepMind's Jack Rae cited speeds exceeding 1,000 tokens per second on an H100 and over 700 on a consumer RTX 5090. Hacker News commenters called its 4x parallel text generation a 'landmark moment' that shifts the bottleneck from memory bandwidth to compute.
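The memory-bandwidth point can be made concrete with a back-of-the-envelope bound. The sketch below is illustrative only: it assumes the "A4B" in the model name means roughly 4 billion active parameters, 8-bit weights, and about 3.35 TB/s of memory bandwidth (roughly an H100 SXM's published spec). None of these figures come from Google or NVIDIA for this model, and the function name is hypothetical. It ignores KV-cache traffic, mixture-of-experts routing, and the multiple denoising steps a diffusion model may need.

```python
def bandwidth_bound_tokens_per_sec(active_params, bytes_per_param,
                                   mem_bandwidth_bytes_per_sec,
                                   tokens_per_pass=1):
    """Upper bound on decode throughput when every forward pass must
    stream all active weights from memory exactly once."""
    bytes_per_pass = active_params * bytes_per_param
    passes_per_sec = mem_bandwidth_bytes_per_sec / bytes_per_pass
    return passes_per_sec * tokens_per_pass


# Assumed values for illustration, not measured or published figures.
ACTIVE_PARAMS = 4e9      # assumption: "A4B" = ~4B active parameters
BYTES_PER_PARAM = 1.0    # assumption: 8-bit weights
MEM_BANDWIDTH = 3.35e12  # assumption: ~3.35 TB/s memory bandwidth

# One token per pass: the classic autoregressive decode pattern.
sequential = bandwidth_bound_tokens_per_sec(
    ACTIVE_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)

# Four tokens per pass: the "4x parallel" figure quoted by commenters.
parallel_4x = bandwidth_bound_tokens_per_sec(
    ACTIVE_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH, tokens_per_pass=4)

print(f"1 token/pass bound:  {sequential:,.0f} tokens/sec")
print(f"4 tokens/pass bound: {parallel_4x:,.0f} tokens/sec")
```

Under these assumptions, the same weight traffic is amortized over four tokens instead of one, so the bandwidth ceiling rises fourfold and the arithmetic per pass becomes the more likely limit. The numbers are a sketch of the argument and not a prediction of real throughput.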

NVIDIA released an optimized quantized variant, DiffusionGemma 26B-A4B-it-NVFP4. It supports a 256K-token context window, a configurable thinking mode, native function calling, and multilingual inference across 35+ languages, and it reaches over 1,100 tokens/sec on a Hopper H100 in FP8. The collaboration showcases tight hardware-software co-design between DeepMind and NVIDIA on an open-weight release.

Diffusion-based text generation has long been a research curiosity; a production-grade open-weight model from Google hitting these speeds could reset expectations for local and latency-sensitive inference. It was the clear darling of developer chatter this week, in contrast to the controversy around closed, guardrailed frontier models. The skeptical question is whether diffusion LLMs match autoregressive quality on hard reasoning tasks, or whether the speed gains come with tradeoffs in accuracy or coherence. The open weights will let the community test that quickly. For self-hosters on r/LocalLLaMA, the prospect of frontier-class speed on a single consumer GPU is the headline.

Sources

blog.google

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/

huggingface.co

https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4