Google DeepMind releases DiffusionGemma for faster local AI text generation

Google DeepMind has released DiffusionGemma, an experimental open AI model built to speed up text generation on local machines. Google says the model can produce output about four times faster than similarly sized autoregressive Gemma models by generating blocks of text in parallel.

According to Google, the model joins its Gemma 4 open model family. Unlike standard text models that write one token after another from left to right, DiffusionGemma uses a diffusion-style process closer to the one used by image-generation systems.

How DiffusionGemma generates text

Google says DiffusionGemma begins with placeholder tokens across a text canvas, then revises that canvas over multiple passes. During those passes, the model estimates likely tokens and uses those guesses to improve other positions before finalizing a larger block of output.
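The pass-by-pass revision described above can be illustrated with a toy sketch. It is not Google's algorithm: the `toy_predict` function is a hypothetical stand-in for a real model, the confidence scores are invented, and the rule of committing two positions per pass is an assumption chosen only to make the loop easy to follow.

```python
import random

MASK = "<mask>"

def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for a model: one (guess, confidence) per masked slot."""
    guesses = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        # Invented rule: a slot is easier to guess once its neighbors are filled.
        filled_neighbors = sum(
            1 for j in (i - 1, i + 1)
            if 0 <= j < len(canvas) and canvas[j] != MASK
        )
        guesses[i] = (target[i], rng.random() + filled_neighbors)
    return guesses

def denoise(target, per_pass=2, seed=0):
    """Start from an all-placeholder canvas and fill it over several passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        guesses = toy_predict(canvas, target, rng)
        # Commit only the most confident guesses; the rest wait for a later pass.
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:per_pass]
        for i in best:
            canvas[i] = guesses[i][0]
        passes += 1
    return canvas, passes

target = "the quick brown fox jumps over the lazy dog".split()
canvas, passes = denoise(target)
print(passes, " ".join(canvas))
```

In this toy version the positions fill in an order set by confidence rather than from left to right, which is the property the article attributes to the diffusion approach.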

That design changes the speed limits for local AI systems, according to Google. Instead of being constrained mainly by memory bandwidth, the model can make heavier use of available compute and generate as many as 256 tokens at the same time.
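A rough way to see the bandwidth-versus-compute point is a simple roofline-style estimate, in which one forward pass takes as long as the slower of reading the weights and doing the arithmetic. The sketch below uses made-up round numbers for hardware and workload, not specifications of any GPU or of DiffusionGemma, and the function name `pass_throughput` is hypothetical.

```python
def pass_throughput(tokens_per_pass, bytes_per_pass, flops_per_token,
                    bandwidth_bytes_s, compute_flops_s):
    """Return (tokens per second, limiting resource) for one forward pass."""
    memory_time = bytes_per_pass / bandwidth_bytes_s
    compute_time = tokens_per_pass * flops_per_token / compute_flops_s
    limit = "memory" if memory_time >= compute_time else "compute"
    return tokens_per_pass / max(memory_time, compute_time), limit

# Illustrative assumptions only: 1 TB/s of bandwidth, 100 TFLOP/s of compute,
# 4 GB of weights read per pass, 8 GFLOP of arithmetic per token.
HW = dict(bandwidth_bytes_s=1e12, compute_flops_s=1e14)
WORK = dict(bytes_per_pass=4e9, flops_per_token=8e9)

one_at_a_time = pass_throughput(1, **WORK, **HW)
block_of_256 = pass_throughput(256, **WORK, **HW)
print(one_at_a_time, block_of_256)
```

Under these assumptions the single-token pass is limited by memory and the 256-token pass by compute. A diffusion model needs several passes per block, so per-pass figures like these overstate end-to-end speed and should not be compared with the throughput Google reported.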

DiffusionGemma is a Mixture of Experts model with 26 billion total parameters, Google said. Only 3.8 billion parameters are active during inference, which Google says allows the model to fit within the 18GB of memory available on a high-end GPU.
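For a sense of scale, a weights-only estimate multiplies the parameter count by the bits stored per parameter. The article does not say which precision Google had in mind, so the bit widths below are assumptions, and the estimate ignores activations, caches and runtime overhead. It also counts all 26 billion parameters, on the usual assumption that inactive experts still have to be held in memory and that the active count mainly affects per-token compute.

```python
TOTAL_PARAMS = 26_000_000_000  # total parameters, as reported by Google
BUDGET_GB = 18                 # memory figure cited in the article

def weight_gb(params, bits_per_param):
    """Weights-only footprint in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9

# Assumed precisions; the source does not name one.
footprints = {bits: weight_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in footprints.items():
    verdict = "fits" if gb <= BUDGET_GB else "does not fit"
    print(f"{bits}-bit weights: {gb:.1f} GB, {verdict} in {BUDGET_GB} GB")
```

By this estimate, only the 4-bit case lands under the 18GB figure.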

Google reported that DiffusionGemma reaches about 700 tokens per second on an Nvidia RTX 5090. On a single Nvidia H100 accelerator, Google said the model can exceed 1,000 tokens per second.

Where Google says diffusion helps

Google says the model’s parallel generation can improve performance on tasks that do not map cleanly to left-to-right text production. The company cited in-line editing, molecular sequencing and mathematical graphing as examples where the approach showed gains.

Ars Technica reported that Google also tuned DiffusionGemma to solve Sudoku puzzles. The task is difficult for standard autoregressive models, the report said, because each token can depend on information that appears later in the output.

Google says DiffusionGemma can repeatedly correct large groups of tokens before producing its final answer. That process gives the model a different way to address tasks where later information changes earlier predictions.

Why Google still uses other approaches

Diffusion-based text generation has trade-offs, according to Ars Technica. The report said text diffusion can have a higher error rate than autoregressive generation because language is discrete, and one bad token can make a larger output unusable.

Short answers can also be inefficient for diffusion models, Ars Technica reported. A model may still perform broad parallel work even when the desired response contains only a few tokens, while an autoregressive model can finish such a response in a small number of steps.
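The short-answer trade-off can be sketched by counting work under simplified assumptions. The 256-token block size comes from Google's figure cited earlier; the eight passes per block is a hypothetical value, since the article does not give one, and both cost functions are illustrative rather than a description of either system.

```python
import math

def autoregressive_cost(n_tokens):
    """One sequential step, and one new position, per generated token."""
    return {"sequential_steps": n_tokens, "positions_computed": n_tokens}

def block_diffusion_cost(n_tokens, block_size=256, passes_per_block=8):
    """Every pass recomputes a full block, however short the answer is."""
    blocks = math.ceil(n_tokens / block_size)
    steps = blocks * passes_per_block
    return {"sequential_steps": steps, "positions_computed": steps * block_size}

for n in (5, 1024):
    print(n, autoregressive_cost(n), block_diffusion_cost(n))
```

With these assumptions, a five-token reply costs the autoregressive model five small steps, while the block approach still runs eight passes over 256 positions. At 1,024 tokens the picture reverses in terms of sequential steps: 1,024 for the autoregressive model against 32 for the block approach.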

Ars Technica reported that cloud systems give autoregressive models advantages that local hardware often lacks. Large services can batch jobs from many users, and high-bandwidth memory in data center systems can keep token generation running efficiently.

Local systems face more idle compute time and lower memory bandwidth, according to Ars Technica. Google has also paired Gemma models with Multi-Token Prediction (MTP) drafters, which put spare compute toward faster token prediction, though Ars Technica reported that DiffusionGemma is faster than those MTP versions.

Google says DiffusionGemma is available under the Apache 2.0 license used for other fourth-generation Gemma models. The model weights are available on Hugging Face, and Google said it worked with Nvidia to optimize the model for setups including quantized high-end RTX GPUs, H100 systems and the DGX Spark platform.

This story draws on original reporting from Ars Technica.