docs/src/content/docs/guides/models/use-block-diffusion.mdx
import { Tabs, TabItem } from '@astrojs/starlight/components';

Block-diffusion models generate text by iteratively denoising whole blocks of tokens in parallel instead of sampling one token at a time. Because each denoising pass commits many tokens at once, decode throughput can be higher than that of a comparable autoregressive model.
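
As a rough illustration of the throughput argument, the sketch below counts forward passes for the two decoding styles. The function names, the block size, and the number of denoising passes per block are hypothetical values chosen for the arithmetic, not settings of any particular model. It also assumes a fixed number of passes per block, whereas a real denoising loop runs until the block converges.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block diffusion: each block of `block_size` tokens is refined in parallel
    for `steps_per_block` passes (assumed fixed here for simplicity)."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    tokens, block, steps = 1024, 32, 8
    print(autoregressive_passes(tokens))                 # 1024
    print(block_diffusion_passes(tokens, block, steps))  # 32 blocks * 8 passes = 256
```

Under these assumed numbers the block-diffusion decoder needs a quarter of the passes. The advantage shrinks as the passes per block approach the block size, and actual wall-clock gains also depend on the cost of each pass.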

Currently supported:

- `google/diffusiongemma-26B-A4B-it`, a 26B-A4B MoE (Mixture of Experts) model with vision input, built on the Gemma 4 architecture.

No special flags or APIs are needed: block-diffusion models are detected automatically and served through the standard endpoints.

<Tabs>
<TabItem label="CLI">

```bash
mistralrs run -m google/diffusiongemma-26B-A4B-it
```

```bash
mistralrs serve -p 1234 -m google/diffusiongemma-26B-A4B-it
```

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 1024
  }'
```

</TabItem>
<TabItem label="Python">

The standard chat-completion API works unchanged. See the Python example (vision input).

```python
from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="google/diffusiongemma-26B-A4B-it",
        arch=MultimodalArchitecture.DiffusionGemma,
    )
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
        max_tokens=1024,
    )
)
print(response.choices[0].message.content)
```

</TabItem>
<TabItem label="Rust">

The standard chat API works unchanged. See the Rust example (streaming, shows block-at-a-time output).

```rust
use mistralrs::{MultimodalModelBuilder, TextMessageRole, TextMessages};

let model = MultimodalModelBuilder::new("google/diffusiongemma-26B-A4B-it")
    .build()
    .await?;

let messages = TextMessages::new().add_message(TextMessageRole::User, "Why is the sky blue?");
let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
```

</TabItem>
</Tabs>
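
For the HTTP endpoint started with `mistralrs serve -p 1234`, block-at-a-time output can also be observed from a small streaming client. The following is a minimal sketch using only the Python standard library. It assumes the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) when `"stream": true` is set; the helper names `build_payload`, `parse_sse_line`, and `stream_chat` are hypothetical and not part of any library.

```python
import json
import urllib.request
from typing import Iterator, Optional

# Assumed to match the port passed to `mistralrs serve -p 1234`.
URL = "http://localhost:1234/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completion request body with streaming enabled."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if it has none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel (assumed OpenAI-style)
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(prompt: str) -> Iterator[str]:
    """POST the request and yield text pieces as the server sends them."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response object iterates line by line
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                yield text


if __name__ == "__main__":
    # Requires a running server; each piece is printed as soon as it arrives.
    for piece in stream_chat("Why is the sky blue?"):
        print(piece, end="", flush=True)
    print()
```

Keeping the line parser separate from the network call makes it possible to check the parsing logic without a running server.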

- Output arrives one block (`canvas_length`) at a time, after that block's denoising loop converges, rather than token by token.
- Sampling settings come from the model's `generation_config.json`. Request-level `temperature`, `top_p`, and penalties are ignored. `max_tokens` still caps output length.
- `tool_choice: required`, named tools, and JSON schema outputs are unconstrained: the model relies on its trained formatting instead. (Per-token grammars are incompatible with parallel token refinement.)

See also: model family notes and the supported models reference.