<!--Copyright 2026 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

This model was released on {release_date} and added to Hugging Face Transformers on 2026-05-04.

# Gemma 4 Assistant

## Overview

Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and its associated candidate generator. Pre-trained models are provided for the IT variants of the Gemma 4 E2B, E4B, 31B, and 26B-A4B (MoE) models.
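
As background for how the assistant is used, a simplified greedy form of the draft-and-verify loop behind speculative decoding is sketched below. This is illustrative pseudocode under assumed names (`speculative_step`, `num_draft_tokens`), not the actual MTP candidate generator: it omits KV caching, sampling, and the cache and hidden-state sharing described in the next section.

```py
# Illustrative greedy draft-and-verify loop; all names are hypothetical and
# this omits KV caching, sampling, and the KV/hidden-state sharing used by
# the real MTP candidate generator.
import torch

@torch.no_grad()
def speculative_step(target_model, assistant_model, input_ids, num_draft_tokens=5):
    # 1. The assistant cheaply drafts several candidate tokens, one forward pass each.
    draft = input_ids
    for _ in range(num_draft_tokens):
        logits = assistant_model(draft).logits
        draft = torch.cat([draft, logits[:, -1:].argmax(dim=-1)], dim=-1)

    # 2. The target verifies all drafted tokens in a single forward pass.
    target_pred = target_model(draft).logits.argmax(dim=-1)

    # 3. Accept the longest draft prefix the target agrees with (batch size 1 assumed).
    accepted = input_ids
    for i in range(input_ids.shape[-1], draft.shape[-1]):
        if draft[0, i] != target_pred[0, i - 1]:
            break
        accepted = torch.cat([accepted, draft[:, i : i + 1]], dim=-1)

    # 4. The target's own prediction at the first disagreement comes for free.
    bonus = target_pred[:, accepted.shape[-1] - 1].unsqueeze(-1)
    return torch.cat([accepted, bonus], dim=-1)
```

Because the target verifies all drafted tokens in one forward pass, every accepted draft token saves a full target decoding step.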

Architecturally, the Gemma 4 Assistant shares the same `Gemma4TextModel` backbone as other Gemma 4 models, but differs in a few key ways:

- The entire model uses KV sharing. This technique, originally introduced with Gemma 3n, lets the model reuse the KV cache populated by the target model it supports, allowing the assistant to skip the prefill phase entirely and considerably reducing attention compute during the forward pass.
- The `position_ids` values are constant. Since the KV cache is shared and the assistant has no means of updating it, the assistant predicts all tokens from the same position ID.
- Inputs are the concatenation of embeddings and hidden states. To accommodate the static KV cache and constant `position_ids`, the model takes as input the concatenation of the embedding and the hidden states of the last seen token from the target model, and projects them into the assistant's hidden space with an `nn.Linear` transform (see the sketch after this list). The definition of the last seen token changes throughout the assisted decoding loop: for the first token drafted after prefill, it is the last token of the prompt; for subsequent drafting steps, it is the last token generated by the assistant (within a drafting round) or the last token accepted by the target model (between drafting rounds).
- Cross-attention is used to make the most of the target model's context. Cross-attention lets the query states generated by the assistant attend to the shared KV cache values from the target model, allowing the assistant to accurately predict more drafted tokens per drafting round.
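
To make the input construction concrete, here is a minimal sketch of the concatenate-and-project step, assuming hypothetical module and parameter names (`AssistantInputProjection`, `hidden_size`); the actual Gemma 4 Assistant implementation may differ.

```py
# Minimal sketch of the assistant's input construction. All names and sizes
# here are illustrative assumptions, not the actual Gemma4Assistant code.
import torch
import torch.nn as nn

class AssistantInputProjection(nn.Module):
    """Projects [token embedding; target hidden state] into assistant space."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # 2 * hidden_size inputs: the token embedding and the target's hidden state
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_embeds, target_hidden):
        # token_embeds, target_hidden: (batch, 1, hidden_size) for the last seen token
        return self.proj(torch.cat([token_embeds, target_hidden], dim=-1))

hidden_size = 2048  # illustrative
proj = AssistantInputProjection(hidden_size)
token_embeds = torch.randn(1, 1, hidden_size)
target_hidden = torch.randn(1, 1, hidden_size)
assistant_inputs = proj(token_embeds, target_hidden)  # (1, 1, hidden_size)

# Because the shared KV cache is never updated by the assistant, every drafted
# token can use the same constant position id (value chosen for illustration).
position_ids = torch.zeros(1, 1, dtype=torch.long)
```

Reusing the target's cache trades a small projection layer for skipping prefill entirely, which keeps the assistant cheap even for long prompts.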

You can find all the original Gemma 4 Assistant checkpoints under the Gemma 4 release.

## Usage examples

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-text-to-text",
    model="google/gemma-4-E2B-it",
    assistant_model="google/gemma-4-E2B-it-assistant",
    dtype=torch.bfloat16,
)
pipeline(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="<|image|>\n\nWhat is shown in this image?"
)
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-E2B-it",
    dtype=torch.bfloat16,
    device_map="auto",
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it-assistant",
    dtype=torch.bfloat16,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)
messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

output = model.generate(**inputs, max_new_tokens=50, assistant_model=assistant_model)
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```

</hfoption>
</hfoptions>

## Gemma4AssistantConfig

[[autodoc]] Gemma4AssistantConfig

## Gemma4AssistantForCausalLM

[[autodoc]] Gemma4AssistantForCausalLM