# ModernVBert
*This model was released on 2025-10-01 and added to Hugging Face Transformers on 2026-02-23.*
ModernVBert is a vision-language encoder that combines a ModernBERT text backbone with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks.

The model was introduced in the paper *ModernVBERT: Towards Smaller Visual Document Retrievers*.
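To make the encoder pairing above concrete, the sketch below shows one common way a vision-language encoder can fuse the two streams: project patch features from the vision encoder into the text model's hidden size and concatenate them with the token embeddings before the bidirectional encoder attends over the joint sequence. The dimensions and the single linear projector here are illustrative assumptions, not ModernVBert's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical hidden sizes for the vision and text encoders
vision_dim, text_dim = 768, 512
num_patches, num_tokens = 16, 8

# Assumed fusion module: a linear projection into the text embedding space
projector = nn.Linear(vision_dim, text_dim)

patch_features = torch.randn(1, num_patches, vision_dim)  # from the vision encoder
token_embeddings = torch.randn(1, num_tokens, text_dim)   # from the text embedding layer

# Concatenate projected image features with token embeddings into one sequence
# that a bidirectional text encoder can then attend over jointly
fused = torch.cat([projector(patch_features), token_embeddings], dim=1)
print(fused.shape)  # torch.Size([1, 24, 512])
```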
<hfoptions id="usage">
<hfoption id="Python">

```py
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModelForMaskedLM, AutoProcessor

processor = AutoProcessor.from_pretrained("./mvb")
model = AutoModelForMaskedLM.from_pretrained("./mvb", device_map="auto")

image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This [MASK] is on the wall."

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ],
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Decode the prediction at the [MASK] position
masked_index = inputs["input_ids"][0].tolist().index(processor.tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = processor.tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)  # Predicted token: painting
```

</hfoption>
</hfoptions>
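The decoding step above keeps only the single highest-scoring token. If you want the top-k candidates for the `[MASK]` position instead, you can apply `torch.topk` to the same logits vector. The snippet below demonstrates this on a small hand-made logits tensor standing in for `outputs.logits[0, masked_index]`; the vocabulary and scores are hypothetical.

```python
import torch

# Stand-in for the model's logits at the [MASK] position and its vocabulary;
# in the real example these come from outputs.logits and the tokenizer
vocab = ["painting", "mirror", "clock", "window", "door"]
mask_logits = torch.tensor([4.2, 1.1, 0.3, 2.5, 0.9])

# Take the k highest-scoring token ids (argmax is the k=1 case)
top = torch.topk(mask_logits, k=3)
predictions = [vocab[i] for i in top.indices.tolist()]
print(predictions)  # ['painting', 'window', 'mirror']
```

With the real model you would pass `outputs.logits[0, masked_index]` in place of `mask_logits` and decode each index with `processor.tokenizer.decode`.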
## ModernVBertConfig

[[autodoc]] ModernVBertConfig

## ModernVBertModel

[[autodoc]] ModernVBertModel
    - forward

## ModernVBertForMaskedLM

[[autodoc]] ModernVBertForMaskedLM
    - forward

## ModernVBertForSequenceClassification

[[autodoc]] ModernVBertForSequenceClassification
    - forward

## ModernVBertForTokenClassification

[[autodoc]] ModernVBertForTokenClassification
    - forward