docs/source/en/model_doc/minicpmv4_6.md
This model was released on 2025-09-16 and added to Hugging Face Transformers on 2026-04-28.
MiniCPM-V is a series of efficient multimodal large language models developed by OpenBMB. The MiniCPM-V 4.6 architecture uses a SigLIP vision encoder with a window-attention merger and a Qwen3.5 language model backbone, supporting both 4x and 16x visual downsampling modes.
This model was contributed by OpenBMB. The original code can be found here.
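The snippet below is a minimal sketch of how the composite configuration can be inspected; the `vision_config` attribute name is an assumption based on the usual Transformers convention for composite models and the config classes documented below, so double-check it against the released checkpoint.

```python
from transformers import AutoConfig

# Minimal sketch (assumed attribute name): load the composite config and look at
# the vision sub-config that wraps the SigLIP-based encoder.
config = AutoConfig.from_pretrained("openbmb/MiniCPM-V-4_6")
print(type(config).__name__)                # MiniCPMV4_6Config
print(type(config.vision_config).__name__)  # expected: MiniCPMV4_6VisionConfig
```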
```python
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="openbmb/MiniCPM-V-4_6")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
outputs[0]["generated_text"]
```
> [!NOTE]
> The model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
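As a quick sanity check, a minimal sketch (reusing the `messages` list from the pipeline example above) renders the prompt without tokenizing it so you can see the exact chat format:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openbmb/MiniCPM-V-4_6")
# tokenize=False returns the formatted prompt string instead of token ids,
# which makes it easy to inspect the chat template the model expects.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```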
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_checkpoint = "openbmb/MiniCPM-V-4_6"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=100)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
```
MiniCPM-V 4.6 supports two visual downsampling modes: `4x` and `16x`. You can change the downsampling mode at runtime by passing `downsample_mode` to `apply_chat_template` via `processor_kwargs` and to `model.generate`:
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    processor_kwargs={"downsample_mode": "4x"},
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=100, downsample_mode="4x")
```
The model supports a thinking mode controlled by `enable_thinking` in the chat template. When enabled, the model generates internal reasoning before providing the final answer:
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    enable_thinking=True,
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=1024)
```
To disable thinking (default for evaluation):
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    enable_thinking=False,
).to(model.device, dtype=model.dtype)
```
MiniCPM-V 4.6 provides two image processing backends:

- The default backend uses `torchvision.transforms` for image resizing.
- The PIL backend uses `PIL.Image.resize`, matching the original implementation.

To use the PIL backend:
```python
from transformers import AutoProcessor, AutoImageProcessor

processor = AutoProcessor.from_pretrained(model_checkpoint)
processor.image_processor = AutoImageProcessor.from_pretrained(model_checkpoint, backend="pil")
```
MiniCPM-V 4.6 supports video understanding.
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=200)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
```
If you already have the rendered prompt string, you can call `processor(text=..., videos=[...])` directly instead.
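For example, here is a minimal sketch of that direct call, assuming the video has already been decoded into a list of frames (the dummy NumPy frames below are placeholders, not values the model requires):

```python
import numpy as np

# Render the prompt string first, then pass decoded frames to the processor directly.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(8)]  # placeholder frames

inputs = processor(text=prompt, videos=[frames], return_tensors="pt").to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```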
[[autodoc]] MiniCPMV4_6Config

[[autodoc]] MiniCPMV4_6VisionConfig

[[autodoc]] MiniCPMV4_6Model
    - forward
    - get_image_features

[[autodoc]] MiniCPMV4_6ForConditionalGeneration
    - forward
    - get_image_features

[[autodoc]] MiniCPMV4_6Processor
    - __call__

[[autodoc]] MiniCPMV4_6ImageProcessor
    - preprocess

[[autodoc]] MiniCPMV4_6ImageProcessorPil
    - preprocess

[[autodoc]] MiniCPMV4_6VideoProcessor
    - preprocess