Google's AI team, including Google DeepMind researchers, has released DiffusionGemma, an experimental open model for text generation. It uses text diffusion instead of standard autoregressive decoding. The model ships under the permissive Apache 2.0 license. Google positions it for developers and researchers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.
Most language models in use today are autoregressive. They generate one token at a time, left to right, and each new token is conditioned on the tokens before it. DiffusionGemma works differently. It generates entire blocks of text in parallel. On dedicated GPUs, this delivers up to 4x faster generation.
What is DiffusionGemma
DiffusionGemma is a 26B Mixture of Experts (MoE) model. It activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture. Google integrated a diffusion head onto that base.
The model is multimodal. It processes interleaved text, image, and video inputs. It generates text outputs from those inputs. The context window is 256K tokens, and it supports 140+ languages.
Quantized, the model fits within 18GB of VRAM. That places it inside high-end consumer GPU limits. On a single NVIDIA H100, it reaches 1000+ tokens per second. On an NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.
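The 18GB figure can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: the article does not say which quantization format is used, so the bit widths are assumptions, and the calculation ignores the KV cache, activations, and runtime overhead. It also assumes all expert weights stay resident in memory, even though only 3.8B parameters are active per token.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # total MoE parameter count, from the article

# Hypothetical precisions; the source does not state the quantization scheme.
for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gb:.0f} GB")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) leaves headroom inside an 18GB budget.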
Google is direct about the trade-off. DiffusionGemma prioritizes speed and parallel layout generation, and its overall output quality is lower than standard Gemma 4. For production work where quality matters most, Google still recommends autoregressive Gemma 4.
How Text Diffusion Works
Text diffusion borrows its core idea from AI image generators. Those models start with visual static and refine it iteratively. DiffusionGemma applies the same pattern to text generation.
The process runs in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it makes multiple passes over that canvas. It locks in high-confidence tokens and uses them as context. Third, the text converges into the final output.
Google calls the core mechanism Uniform State Diffusion. Highly confident tokens help resolve adjacent positions during denoising. The full sequence then snaps into focus over several passes.
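The three stages can be illustrated with a toy loop. This is a conceptual sketch, not DiffusionGemma's sampler: `toy_predict` is a hypothetical stand-in for the model that already knows the answer, and its confidence rule (more locked neighbours means higher confidence) is invented purely to mimic confident tokens helping adjacent positions.

```python
import random

TARGET = "the quick brown fox jumps over the lazy dog".split()
VOCAB = sorted(set(TARGET))


def toy_predict(locked, rng):
    """Hypothetical model stand-in: one (token, confidence) pair per position."""
    preds = []
    for i in range(len(TARGET)):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(TARGET)]
        support = sum(locked[j] for j in neighbours)
        # Invented rule: each locked neighbour raises confidence by 0.3.
        confidence = min(1.0, rng.random() * 0.6 + 0.3 * support)
        preds.append((TARGET[i], confidence))
    return preds


def denoise(threshold=0.5, seed=0):
    rng = random.Random(seed)
    # Stage 1: a canvas of random placeholder tokens.
    canvas = [rng.choice(VOCAB) for _ in TARGET]
    locked = [False] * len(canvas)
    passes = 0
    # Stage 2: repeated passes that lock in high-confidence tokens.
    while not all(locked):
        passes += 1
        preds = toy_predict(locked, rng)  # all positions predicted in one pass
        open_positions = [i for i in range(len(canvas)) if not locked[i]]
        # Always commit the most confident open position so the loop terminates.
        best = max(open_positions, key=lambda i: preds[i][1])
        for i in open_positions:
            token, conf = preds[i]
            if conf >= threshold or i == best:
                canvas[i], locked[i] = token, True
    # Stage 3: every position is locked, so the canvas has converged.
    return canvas, passes


if __name__ == "__main__":
    text, passes = denoise()
    print(" ".join(text), f"({passes} passes)")
```

Because several positions can cross the threshold in the same pass, the loop finishes in at most as many passes as there are tokens, and usually far fewer.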
In practice, the model denoises a 256-token canvas in parallel. It finalizes roughly 15-20 tokens per forward pass. That parallelism is what drives the throughput gains.
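A quick back-of-the-envelope calculation shows what those numbers imply for the pass count. The sketch below only uses the figures quoted above, and it assumes the finalization rate stays constant across passes, which the article does not claim.

```python
import math

CANVAS_TOKENS = 256  # canvas size quoted in the article


def passes_needed(tokens_per_pass: int, canvas: int = CANVAS_TOKENS) -> int:
    """Forward passes to finalize a full canvas at a fixed rate (an assumption)."""
    return math.ceil(canvas / tokens_per_pass)


for rate in (15, 20):
    print(f"{rate} tokens/pass -> {passes_needed(rate)} passes")
print(f"autoregressive -> {passes_needed(1)} passes")
```

That is 13 to 18 passes per canvas instead of 256. Each diffusion pass processes the whole canvas and is therefore more expensive than a single autoregressive step, which is consistent with the end-to-end speedup being quoted as up to 4x rather than matching the reduction in pass count.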
The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models. Those models can only look backward at prior tokens.
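The difference is easiest to see as attention masks. The sketch below builds both masks as plain boolean matrices, where entry `[i][j]` says whether position `i` may attend to position `j`. The function names are illustrative and not taken from any library.

```python
def causal_mask(n):
    """Position i may attend only to positions j <= i (autoregressive)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position (denoising)."""
    return [[True] * n for _ in range(n)]


def show(mask):
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)))
    print()
    print(show(bidirectional_mask(4)))
```

For four positions, the causal mask allows 10 of the 16 possible pairs, while the bidirectional mask allows all 16.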
That bidirectional context enables real-time self-correction. If a token’s confidence drops, the sampler can re-noise it. The model then replaces that token on a later pass. Autoregressive models cannot do this, since they commit each token once.
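A minimal sketch of the re-noising idea follows. It is an illustration under assumptions: the confidence floor, the function name, and the choice to reset a token to a random vocabulary entry are all hypothetical, and the article does not describe the actual sampler rule.

```python
import random


def renoise_low_confidence(canvas, confidences, vocab, floor=0.3, rng=None):
    """Reset tokens whose confidence fell below `floor` (hypothetical rule).

    Returns the updated canvas and the list of positions that were re-noised,
    so a later denoising pass can refine them again.
    """
    rng = rng or random.Random(0)
    new_canvas, reset = list(canvas), []
    for i, conf in enumerate(confidences):
        if conf < floor:
            new_canvas[i] = rng.choice(vocab)  # back to a random placeholder
            reset.append(i)
    return new_canvas, reset


if __name__ == "__main__":
    canvas = ["the", "cat", "sat", "on", "teh", "mat"]
    confidences = [0.9, 0.8, 0.85, 0.9, 0.1, 0.95]
    vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    updated, reset = renoise_low_confidence(canvas, confidences, vocab)
    print("re-noised positions:", reset)
```

Only the low-confidence position is touched. The rest of the canvas stays in place and keeps serving as context for the next pass.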
The Architecture
The technical advancement here is hardware utilization. For local GPU inference, the main bottleneck is memory bandwidth. Autoregressive models reload the model weights from memory for every generated token. During single-user serving, the GPU spends most of its time waiting on those loads rather than computing.
DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines a 256-token canvas in parallel. This gives idle tensor cores a large parallel workload.
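The memory-bandwidth argument can be made concrete with a simple ceiling estimate: if every forward pass must stream a fixed number of weight bytes, throughput cannot exceed bandwidth divided by bytes per pass, times the tokens finalized per pass. The sketch below is illustrative only. The bandwidth value and the 16-bit weight size are assumptions rather than measured or vendor figures, and the function name is hypothetical.

```python
def bandwidth_bound_tokens_per_s(bandwidth_bytes_per_s, bytes_per_pass, tokens_per_pass):
    """Upper bound on throughput when weight loading is the only cost."""
    passes_per_s = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_s * tokens_per_pass


HYPOTHETICAL_BANDWIDTH = 3.0e12  # bytes/s, illustrative value only
ACTIVE_PARAMS = 3.8e9            # active parameters, from the article
BYTES_PER_PARAM = 2              # assumes 16-bit weights

bytes_per_pass = ACTIVE_PARAMS * BYTES_PER_PARAM
one_token = bandwidth_bound_tokens_per_s(HYPOTHETICAL_BANDWIDTH, bytes_per_pass, 1)
many_tokens = bandwidth_bound_tokens_per_s(HYPOTHETICAL_BANDWIDTH, bytes_per_pass, 15)
print(f"1 token per pass:   ~{one_token:.0f} tokens/s ceiling")
print(f"15 tokens per pass: ~{many_tokens:.0f} tokens/s ceiling")
```

The second ceiling is not a prediction. A 256-token canvas may route to more experts per pass than a single token does, which raises the bytes loaded, and once the weight loads are amortized the pass becomes limited by compute instead. The point of the sketch is only that finalizing more tokens per weight load raises the bandwidth-imposed limit.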
The model alternates two attention modes during inference. Prefill uses causal attention to ingest the prompt and write the KV cache. Denoising uses bidirectional attention to refine the canvas.
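One way to picture the two modes together is a single combined mask over prompt and canvas positions. This is a sketch under assumptions: it supposes prompt positions attend causally, and canvas positions attend to the whole prompt plus the whole canvas. The article does not spell out the exact masking, and the function name is hypothetical.

```python
def two_mode_mask(prompt_len, canvas_len):
    """Boolean mask where entry [i][j] means position i may attend to j."""
    total = prompt_len + canvas_len
    mask = []
    for i in range(total):
        if i < prompt_len:
            # Prefill: causal attention over the prompt only.
            row = [j <= i for j in range(total)]
        else:
            # Denoising: the canvas sees the cached prompt and the full canvas.
            row = [True] * total
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in two_mode_mask(prompt_len=2, canvas_len=3):
        print("".join("x" if cell else "." for cell in row))
```

In this layout the prompt rows never look at the canvas, so their keys and values can be computed once and reused from the cache across every denoising pass.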
For longer outputs, DiffusionGemma uses Block Autoregressive Diffusion. Once a 256-token block is fully denoised, it is committed to the KV cache. The model then starts a fresh canvas conditioned on the prior history. This pairs the speed of parallel blocks with the stability of sequential, autoregressive generation.
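The outer loop of block-wise generation can be sketched as follows. This shows control flow only: `denoise_block` is a hypothetical placeholder that emits dummy tokens, and the KV cache is modeled as a plain list of committed tokens rather than real key and value tensors.

```python
BLOCK = 256  # canvas size quoted in the article


def denoise_block(history, length):
    """Hypothetical stand-in for the parallel denoiser; returns dummy tokens."""
    start = len(history)
    return [f"tok{start + k}" for k in range(length)]


def generate(prompt_tokens, max_new_tokens, block=BLOCK):
    kv_cache = list(prompt_tokens)  # prefill: the prompt is written to the cache
    output = []
    while len(output) < max_new_tokens:
        size = min(block, max_new_tokens - len(output))
        # Fresh canvas, conditioned on everything committed so far.
        finished = denoise_block(kv_cache, size)
        kv_cache.extend(finished)  # commit the finished block
        output.extend(finished)
    return output


if __name__ == "__main__":
    tokens = generate(["p0", "p1"], max_new_tokens=600)
    print(len(tokens), tokens[0], tokens[-1])
```

Within a block the tokens are produced in parallel, while the blocks themselves are produced strictly in order, each one seeing all earlier blocks through the cache.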
The architecture shares the same backbone as Gemma 4 26B A4B. Developers mainly need to implement a denoising step. That makes integration into existing serving frameworks simpler.
A clear example is the Sudoku showcase from Google’s developer guide. Autoregressive models struggle with strict, multivariable constrained puzzles. The base DiffusionGemma model solves roughly 0% of Sudoku puzzles. After a simple JAX supervised fine-tuning recipe, correctness rises to 80%. The fine-tuned model also stops earlier, cutting inference steps.
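To make "correctness" on such a puzzle concrete, the sketch below checks whether a completed 9x9 grid satisfies every row, column, and box constraint. It is an illustrative checker, not the evaluation code from Google's guide, and the example grid is generated from a simple shifting pattern rather than taken from any benchmark.

```python
def is_valid_solution(grid):
    """True if every row, column, and 3x3 box contains the digits 1-9 once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


# A known-valid grid built from a shifting pattern (illustrative only).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

if __name__ == "__main__":
    print(is_valid_solution(solved))
```

Every cell participates in three constraints at once, so a single wrong digit invalidates the grid. That is the kind of global consistency a left-to-right decoder cannot revise after the fact.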
Use Cases
DiffusionGemma targets specific workloads rather than general-purpose production use. Google and ecosystem partners highlight several practical applications:
- In-line editing and code infilling: Bidirectional attention suits non-linear text structures well.
- Rapid iteration: Low local latency supports interactive, single-user developer loops.
- Long-context document analysis: The 256K window supports large input processing.
- OCR and document parsing: Multimodal input handles images and scanned documents.
- Code generation, tool calling, and agentic workflows: Unsloth lists these as supported tasks.
- Constrained generation: Sudoku, mathematical graphs, and amino acid sequences benefit from parallel attention.
One caveat shapes all of these. The speedup is designed for local, low-concurrency inference. In high-QPS cloud serving, autoregressive models already saturate compute efficiently by batching many requests together. There, parallel decoding offers diminishing returns and can raise serving costs.

DiffusionGemma vs Standard Gemma 4
| Attribute | DiffusionGemma (26B-A4B) | Standard Gemma 4 (26B A4B) |
|---|---|---|
| Generation method | Discrete text diffusion (parallel) | Autoregressive (token-by-token) |
| Decode bottleneck | Compute-bound | Memory-bandwidth-bound |
| Parallel unit | 256-token canvas per pass | One token per step |
| Attention during decode | Bidirectional | Causal (backward only) |
| Self-correction | Yes, via re-noising | No, tokens are committed once |
| Speed on dedicated GPU | Up to 4x faster | Baseline |
| H100 throughput | 1000+ tokens/sec | Lower (baseline) |
| RTX 5090 throughput | 700+ tokens/sec | Lower (baseline) |
| Output quality | Lower than Gemma 4 | Higher; recommended for production |
| Best fit | Local, low-concurrency, interactive | High-quality and high-QPS cloud serving |
| License | Apache 2.0 | Gemma terms |
Key Takeaways
- DiffusionGemma is a 26B MoE open model (3.8B active) that generates text via parallel diffusion, not token-by-token.
- It runs up to 4x faster on dedicated GPUs: 1000+ tokens/sec on H100, 700+ on RTX 5090.
- Bidirectional attention over a 256-token canvas enables real-time self-correction, unlike autoregressive models.
- Quantized, it fits in 18GB VRAM with day-zero support in vLLM, Transformers, MLX, and Unsloth.
- It’s experimental and lower-quality than standard Gemma 4; Google recommends Gemma 4 for production.