# ModernVBert
*This model was released on 2025-10-01 and added to Hugging Face Transformers on 2026-02-23.*
ModernVBert is a vision-language encoder that combines a ModernBERT text backbone with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks.

The model was introduced in the paper *ModernVBERT: Towards Smaller Visual Document Retrievers*.
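To make the encoder pairing above concrete, the sketch below shows one common way a vision-language encoder can fuse the two streams: project patch features from the vision encoder into the text model's hidden size and concatenate them with the token embeddings before the bidirectional encoder attends over the joint sequence. The dimensions and the single linear projector here are illustrative assumptions, not ModernVBert's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical hidden sizes for the vision and text encoders
vision_dim, text_dim = 768, 512
num_patches, num_tokens = 16, 8

# Assumed fusion module: a linear projection into the text embedding space
projector = nn.Linear(vision_dim, text_dim)

patch_features = torch.randn(1, num_patches, vision_dim)  # from the vision encoder
token_embeddings = torch.randn(1, num_tokens, text_dim)   # from the text embedding layer

# Concatenate projected image features with token embeddings into one sequence
# that a bidirectional text encoder can then attend over jointly
fused = torch.cat([projector(patch_features), token_embeddings], dim=1)
print(fused.shape)  # torch.Size([1, 24, 512])
```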
<hfoptions id="usage">
<hfoption id="Python">

```py
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModelForMaskedLM, AutoProcessor

processor = AutoProcessor.from_pretrained("./mvb")
model = AutoModelForMaskedLM.from_pretrained("./mvb", device_map="auto")

image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This [MASK] is on the wall."

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ],
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Decode the prediction at the [MASK] position
masked_index = inputs["input_ids"][0].tolist().index(processor.tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = processor.tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)  # Predicted token: painting
```

</hfoption>
</hfoptions>
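The decoding step above keeps only the single highest-scoring token. If you want the top-k candidates for the `[MASK]` position instead, you can apply `torch.topk` to the same logits vector. The snippet below demonstrates this on a small hand-made logits tensor standing in for `outputs.logits[0, masked_index]`; the vocabulary and scores are hypothetical.

```python
import torch

# Stand-in for the model's logits at the [MASK] position and its vocabulary;
# in the real example these come from outputs.logits and the tokenizer
vocab = ["painting", "mirror", "clock", "window", "door"]
mask_logits = torch.tensor([4.2, 1.1, 0.3, 2.5, 0.9])

# Take the k highest-scoring token ids (argmax is the k=1 case)
top = torch.topk(mask_logits, k=3)
predictions = [vocab[i] for i in top.indices.tolist()]
print(predictions)  # ['painting', 'window', 'mirror']
```

With the real model you would pass `outputs.logits[0, masked_index]` in place of `mask_logits` and decode each index with `processor.tokenizer.decode`.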
## ModernVBertConfig

[[autodoc]] ModernVBertConfig

## ModernVBertModel

[[autodoc]] ModernVBertModel
    - forward

## ModernVBertForMaskedLM

[[autodoc]] ModernVBertForMaskedLM
    - forward

## ModernVBertForSequenceClassification

[[autodoc]] ModernVBertForSequenceClassification
    - forward

## ModernVBertForTokenClassification

[[autodoc]] ModernVBertForTokenClassification
    - forward