This model was contributed to Hugging Face Transformers on 2026-06-03.
Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.
Key differences from standard Gemma 4:
- Dense + LayerNorm pipeline with factorized 2D positional embeddings, replacing the vision encoder.
- RMSNorm → Linear pipeline, replacing the mel spectrogram + Conformer encoder.
- `Gemma4UnifiedMultimodalEmbedder` (RMSNorm → Linear) for the final projection to text hidden space.

You can find the original Gemma 4 12B Unified checkpoints under the Gemma 4 release.
The key architectural difference from standard Gemma 4 is the removal of the vision encoder tower. Instead, Gemma 4 12B Unified processes images through a lightweight pipeline:
1. The image is split into 16×16 pixel patches.
2. 3×3 patches are merged into 48×48 model patches, each with 48² × 3 = 6,912 raw pixel channels.
3. LayerNorm → Dense → LayerNorm projects each merged patch into the LM embedding dimension.
4. LayerNorm is applied.
5. RMSNorm → Linear projects to the text hidden size.

Like standard Gemma 4, the model processes images of different sizes using a fixed token budget. The same constraints apply:
> [!IMPORTANT]
> Gemma 4 12B Unified does not apply mean/std normalization. The model's own patch embedding layer handles the final scaling internally.
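
As a rough illustration of the patch geometry described above (16×16 pixel patches merged 3×3 into 48×48 model patches), the sketch below cuts an image into 48×48 blocks and flattens each one into 6,912 raw values. It is not the library's image processor: the function name `merge_into_model_patches`, the nested-list image layout, the row-major channel-last flattening order, and the requirement that both sides be multiples of 48 are all assumptions made for the example. Pixel values are left untouched, in line with the note that no mean/std normalization is applied.

```python
BASE_PATCH = 16  # side of a base patch in pixels (from the text)
MERGE = 3        # 3x3 base patches form one model patch (from the text)
MODEL_PATCH = BASE_PATCH * MERGE  # 48


def merge_into_model_patches(image):
    """Flatten an image into 48x48 model patches.

    `image` is a nested list indexed as image[row][col][channel] with 3 channels.
    Returns one flat list of 48 * 48 * 3 = 6912 values per model patch.
    The flattening order (row-major, channel last) is an assumption.
    """
    height, width = len(image), len(image[0])
    if height % MODEL_PATCH or width % MODEL_PATCH:
        raise ValueError("this sketch expects sides that are multiples of 48")

    patches = []
    for top in range(0, height, MODEL_PATCH):
        for left in range(0, width, MODEL_PATCH):
            flat = [
                value
                for row in image[top:top + MODEL_PATCH]
                for pixel in row[left:left + MODEL_PATCH]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# A synthetic 96x144 image whose pixels encode their own coordinates.
image = [[[r, c, 0] for c in range(144)] for r in range(96)]
patches = merge_into_model_patches(image)
print(len(patches), len(patches[0]))
```

Each flat patch is what the LayerNorm → Dense → LayerNorm step would then project into the LM embedding dimension. With a tensor library the same operation is usually a reshape followed by a transpose.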
The number of soft tokens per image is configurable. The supported options and default (280 soft tokens) are:
| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
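
The table rows follow from the patch geometry stated earlier: each soft token covers 3×3 base patches, and each base patch is 16×16 pixels. The short sketch below reproduces the numbers; the helper name `image_budget` is hypothetical and not part of the library.

```python
BASE_PATCH = 16  # pixels per side of a base patch (from the text)
MERGE = 3        # 3x3 base patches are merged into one soft token (from the text)


def image_budget(soft_tokens):
    """Return (patches before pooling, pixel area) for a soft-token budget."""
    patches = soft_tokens * MERGE * MERGE
    pixels = patches * BASE_PATCH * BASE_PATCH
    return patches, pixels


for soft_tokens in (70, 140, 280, 560, 1120):
    patches, pixels = image_budget(soft_tokens)
    print(f"{soft_tokens:>5} soft tokens -> {patches:>6} patches -> {pixels:>9,} pixels")
```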
The audio pipeline is similarly simplified. Instead of computing mel spectrograms and processing them through a Conformer encoder, raw 16 kHz waveform samples are:
- projected through RMSNorm → Linear via the shared `Gemma4UnifiedMultimodalEmbedder`

Since there is no downsampling, the number of output soft tokens equals the number of input frames: `ceil(num_samples / 640)`.
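
The frame-count formula can be checked with a few lines of arithmetic. At 16 kHz, 640 samples correspond to 40 ms, so one second of audio yields 25 soft tokens. The function names below are hypothetical helpers for illustration, not library APIs.

```python
import math

SAMPLE_RATE = 16_000     # Hz (from the text)
SAMPLES_PER_FRAME = 640  # divisor in the formula above, i.e. 40 ms per frame


def audio_soft_tokens(num_samples):
    """Number of soft tokens for a waveform of `num_samples` samples."""
    return math.ceil(num_samples / SAMPLES_PER_FRAME)


def audio_soft_tokens_for_seconds(seconds):
    """Same count, starting from a duration in seconds."""
    return audio_soft_tokens(round(seconds * SAMPLE_RATE))


print(audio_soft_tokens(16_000))          # one second of audio
print(audio_soft_tokens(16_001))          # one extra sample starts a new frame
print(audio_soft_tokens_for_seconds(30))  # a 30-second clip
```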
The examples below demonstrate how to generate text based on an image and an audio sample with [`Pipeline`] or the [`AutoModel`] class.
```python
from transformers import pipeline

# One pipeline instance serves both the image and the audio prompt.
pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-12B-it",
)

# Image + text prompt
image_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {
                "type": "text",
                "text": "What is shown in this image?",
            },
        ],
    }
]
image_output = pipe(image_messages, return_full_text=False)
print(image_output[0]["generated_text"])

# Audio + text prompt
audio_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe the following audio:"},
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/bcn_weather.mp3",
            },
        ],
    }
]
audio_output = pipe(audio_messages, return_full_text=False)
print(audio_output[0]["generated_text"])
```
```python
from transformers import AutoModelForMultimodalLM, AutoProcessor

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it")

# Image + text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```
```python
from transformers import AutoModelForMultimodalLM, AutoProcessor

# Audio + text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe the following audio:"},
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/bcn_weather.mp3",
            },
        ],
    }
]

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it")

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens; special tokens are kept in the output here.
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))
```
[[autodoc]] Gemma4UnifiedAudioConfig
[[autodoc]] Gemma4UnifiedConfig
[[autodoc]] Gemma4UnifiedTextConfig
[[autodoc]] Gemma4UnifiedVisionConfig
[[autodoc]] Gemma4UnifiedAudioFeatureExtractor
    - __call__
[[autodoc]] Gemma4UnifiedImageProcessor
[[autodoc]] Gemma4UnifiedVideoProcessor
[[autodoc]] Gemma4UnifiedProcessor
    - __call__
[[autodoc]] Gemma4UnifiedPreTrainedModel
    - forward
[[autodoc]] Gemma4UnifiedModel
    - forward
[[autodoc]] Gemma4UnifiedTextModel
    - forward
[[autodoc]] Gemma4UnifiedForCausalLM
[[autodoc]] Gemma4UnifiedForConditionalGeneration
    - forward