docs/source/en/model_doc/gemma4.md
This model was released on {release_date} and added to Hugging Face Transformers on 2026-04-01.
Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in E2B, E4B, 31B, and 26B-A4B (MoE) parameter sizes. Gemma 4 models accept text, image, video, and audio inputs.
You can find all the original Gemma 4 checkpoints under the Gemma 4 release.
The key difference from previous Gemma releases on the vision side is a new design that processes images of different sizes with a fixed budget of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it to fit that budget. There are a couple of constraints to keep in mind, described below.
> [!IMPORTANT]
> Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
The number of "soft tokens" (also known as vision tokens) the image processor produces per image is configurable. The supported budgets are listed below; the default is 280 soft tokens per image.
| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
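To make the budget concrete, here is a minimal sketch of how a target size could be chosen so an image fits a given soft-token budget while keeping its aspect ratio. The 16×16 pixel patch size and the 9 patches pooled per soft token are assumptions inferred from the table above (630 patches ≈ 161K pixels, 630 / 70 = 9), not values read from the model config, and the helper is illustrative rather than the image processor's actual logic.

```python
import math

# Minimal sketch of the fixed-budget resizing idea; NOT the library's image processor.
# Assumptions inferred from the table above (not read from the model config):
#   - each patch covers 16x16 pixels (630 patches ~= 161K pixels)
#   - each soft token pools a 3x3 block of patches (630 / 70 = 9)
PATCH = 16
PATCHES_PER_SOFT_TOKEN = 9

def fit_to_budget(width: int, height: int, soft_tokens: int = 280) -> tuple[int, int]:
    """Pick a size that keeps the aspect ratio and stays within the patch budget."""
    max_pixels = soft_tokens * PATCHES_PER_SOFT_TOKEN * PATCH * PATCH
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    # snap each side down to a whole number of patches
    new_width = max(PATCH, int(width * scale) // PATCH * PATCH)
    new_height = max(PATCH, int(height * scale) // PATCH * PATCH)
    return new_width, new_height

print(fit_to_budget(1920, 1080))                    # downscaled to fit the default 280-token budget
print(fit_to_budget(1920, 1080, soft_tokens=1120))  # larger budget, no downscaling needed
```

With the default budget of 280 soft tokens, images up to roughly 800×800 pixels (about 645K pixels) pass through without downscaling; larger budgets trade more prefill tokens for more preserved detail.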
To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The table stores up to 10,240 positions per axis, which allows the model to handle very large images, and each position is a learned vector with the same dimensions as the patch embedding. In addition, Gemma 4 applies 2D RoPE in attention: half of each attention head's dimensions are rotated according to the patch's x-coordinate and the other half according to its y-coordinate. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
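The snippet below sketches the 2D RoPE idea: the first half of each attention head's dimensions is rotated by the patch's x coordinate and the second half by its y coordinate. The function names and base frequency are illustrative, not the model's actual internals.

```python
import torch

# Sketch of the 2D RoPE described above: half of each head's dimensions are rotated
# by the patch's x coordinate, the other half by its y coordinate. Names and the
# base frequency are illustrative, not the model's actual internals.
def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dimension of x (which must be even)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * inv_freq[None, :]   # [num_patches, dim // 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(q: torch.Tensor, x_pos: torch.Tensor, y_pos: torch.Tensor) -> torch.Tensor:
    """q: [num_patches, head_dim]; first half encodes x positions, second half y."""
    half = q.shape[-1] // 2
    return torch.cat([rope_1d(q[..., :half], x_pos), rope_1d(q[..., half:], y_pos)], dim=-1)

# a 3x2 grid of patches (width 3, height 2), head_dim = 8
x_pos, y_pos = torch.meshgrid(torch.arange(3), torch.arange(2), indexing="xy")
q = torch.randn(6, 8)
print(rope_2d(q, x_pos.flatten(), y_pos.flatten()).shape)  # torch.Size([6, 8])
```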
Gemma 4 introduces a Per-Layer Embeddings (PLE) system that feeds an auxiliary residual signal into each decoder layer, rather than relying solely on a single shared embedding at the input.
PLE combines two components that are summed and scaled by 1/√2 before being fed to each decoder layer:
- Token identity (`get_per_layer_inputs`): looks up `input_ids` in `embed_tokens_per_layer`, a `Gemma4TextScaledWordEmbedding` that multiplies by √(hidden_size_per_layer_input). The packed output is reshaped from `[batch, seq, num_hidden_layers * hidden_size_per_layer_input]` to `[batch, seq, num_hidden_layers, hidden_size_per_layer_input]`.
- Context-aware projection (`project_per_layer_inputs`): projects `inputs_embeds` through `per_layer_model_projection` (a Linear layer), scales by 1/√(hidden_size), reshapes to `[batch, seq, num_layers, ple_dim]`, and normalizes with `per_layer_projection_norm` (RMSNorm).

When both components are available, the final per-layer input is `(token_identity + context_aware) * (1/√2)`. For multimodal inputs where `input_ids` are not available, only the context-aware projection is used.
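The toy sketch below shows how the two signals described above could be combined. Dimensions are made up, module names simply mirror the prose, and `torch.nn.RMSNorm` (PyTorch 2.4+) stands in for the model's own norm; this is not the actual `Gemma4TextModel` code.

```python
import math
import torch
import torch.nn as nn

# Toy sketch of combining the two per-layer signals described above. Shapes and names
# mirror the prose with made-up sizes; this is not the actual Gemma4TextModel code.
# nn.RMSNorm requires PyTorch 2.4+.
batch, seq, num_layers, ple_dim, hidden_size = 2, 5, 4, 8, 32

embed_tokens_per_layer = nn.Embedding(1000, num_layers * ple_dim)  # packed lookup table
per_layer_model_projection = nn.Linear(hidden_size, num_layers * ple_dim, bias=False)
per_layer_projection_norm = nn.RMSNorm(ple_dim)

input_ids = torch.randint(0, 1000, (batch, seq))
inputs_embeds = torch.randn(batch, seq, hidden_size)

# token-identity path: scaled lookup, unpacked into one slice per decoder layer
token_identity = embed_tokens_per_layer(input_ids) * math.sqrt(ple_dim)
token_identity = token_identity.view(batch, seq, num_layers, ple_dim)

# context-aware path: project the input embeddings, scale, reshape, normalize
context_aware = per_layer_model_projection(inputs_embeds) / math.sqrt(hidden_size)
context_aware = per_layer_projection_norm(context_aware.view(batch, seq, num_layers, ple_dim))

# final per-layer input: sum of both signals, scaled by 1/sqrt(2)
per_layer_inputs = (token_identity + context_aware) / math.sqrt(2)
print(per_layer_inputs.shape)  # torch.Size([2, 5, 4, 8])
```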
The examples below demonstrate how to generate text from an image with [`Pipeline`] or the [`AutoModel`] class.
```python
from transformers import pipeline
pipeline = pipeline(
task="image-text-to-text",
model="google/gemma-4-E2B-it",
)
pipeline(
images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
text="<|image|>\n\nWhat is shown in this image?"
)
```
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
"google/gemma-4-E2B-it",
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
"google/gemma-4-E2B-it",
padding_side="left"
)
messages = [
{
"role": "user", "content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
{"type": "text", "text": "What is shown in this image?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_dict=True,
return_tensors="pt",
add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```
Gemma 4 instruction-tuned models can also call tools. The example below passes a tool schema through the chat template and lets the model produce a function call.

```python
from transformers import AutoModelForCausalLM, AutoProcessor
WEATHER_TOOL = {
"type": "function",
"function": {
"name": "get_n_day_weather_forecast",
"description": "Get an N-day weather forecast",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use",
},
"num_days": {
"type": "integer",
"description": "The number of days to forecast",
},
},
"required": ["location", "format", "num_days"],
},
},
}
messages = [
{
"role": "user",
"content": "What's the weather like the next 3 days in San Francisco, CA (using F)?",
},
]
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-E2B-it",
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
"google/gemma-4-E2B-it",
padding_side="left"
)
text = processor.apply_chat_template(
messages,
tools=[WEATHER_TOOL],
tokenize=False,
add_generation_prompt=True,
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))
```
Gemma 4 also accepts audio inputs. The example below asks the model to transcribe an audio clip.

```python
from transformers import AutoModelForMultimodalLM, AutoProcessor
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Please transcribe the following audio:"},
{
"type": "audio",
"url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
},
],
}
]
model = AutoModelForMultimodalLM.from_pretrained(
"google/gemma-4-E2B-it",
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
"google/gemma-4-E2B-it",
padding_side="left"
)
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=model.dtype)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))
```
[[autodoc]] Gemma4AudioConfig
[[autodoc]] Gemma4VisionConfig
[[autodoc]] Gemma4TextConfig
[[autodoc]] Gemma4Config
[[autodoc]] Gemma4AudioFeatureExtractor
    - __call__

[[autodoc]] Gemma4ImageProcessorPil
    - preprocess

[[autodoc]] Gemma4ImageProcessor
    - preprocess

[[autodoc]] Gemma4VideoProcessor
    - preprocess

[[autodoc]] Gemma4Processor
    - __call__

[[autodoc]] Gemma4PreTrainedModel
    - forward

[[autodoc]] Gemma4AudioModel
    - forward

[[autodoc]] Gemma4VisionModel
    - forward

[[autodoc]] Gemma4TextModel
    - forward

[[autodoc]] Gemma4ForCausalLM

[[autodoc]] Gemma4Model
    - forward

[[autodoc]] Gemma4ForConditionalGeneration
    - forward