This model was released on 2025-02-19 and added to Hugging Face Transformers on 2025-09-15.

Qwen3-VL-Moe

Qwen3-VL is a series of multimodal vision-language models that includes dense and mixture-of-experts (MoE) variants, as well as Instruct and Thinking versions. Compared with its predecessors, Qwen3-VL improves visual understanding while retaining strong text-only capabilities. The key architectural changes are an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling, DeepStack integration to leverage multi-level features from the Vision Transformer (ViT), and text-based time alignment for video, which moves from T-RoPE to text timestamp alignment for more precise temporal grounding. Together, these changes are intended to improve performance on complex multimodal tasks.
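To make the "interleaved layout" idea concrete, the sketch below compares two ways of assigning rotary frequency slots to the temporal (t), height (h), and width (w) position axes. It is a minimal illustration in plain Python, not the library's implementation: the helper names `chunked_layout` and `interleaved_layout` and the toy section sizes are assumptions made for this example.

```python
# Illustrative sketch only: which position axis (t = time, h = height,
# w = width) drives each rotary frequency slot under two layouts.
# The helper names and the section sizes below are hypothetical.

AXES = ("t", "h", "w")


def chunked_layout(sections):
    """Contiguous blocks: all t slots, then all h slots, then all w slots."""
    return [axis for axis, count in zip(AXES, sections) for _ in range(count)]


def interleaved_layout(sections):
    """Round-robin t, h, w across slots; leftover slots stay with t."""
    layout = ["t"] * sum(sections)
    for offset, axis in enumerate(AXES):
        if axis == "t":
            continue
        # h takes slots 1, 4, 7, ...; w takes slots 2, 5, 8, ...
        for slot in range(offset, sections[offset] * 3, 3):
            layout[slot] = axis
    return layout


toy_sections = (4, 2, 2)  # hypothetical (t, h, w) slot counts
print("chunked:    ", "".join(chunked_layout(toy_sections)))      # tttthhww
print("interleaved:", "".join(interleaved_layout(toy_sections)))  # thwthwtt
```

In the chunked layout each axis sees only one contiguous band of frequencies, whereas in the interleaved layout every axis is spread across the low and high ends of the spectrum, which is the property the description above attributes to the enhanced MRoPE.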

Model usage

```python
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

# Model id as written in this page; swap in the checkpoint you want to load.
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Moe",
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Moe")

# A single user turn containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    }
]

# Render the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# generate() returns prompt + completion; keep only the newly generated tokens.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
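
The trimming step in the example above relies on `generate` returning each prompt followed by its newly generated tokens. A minimal sketch of that slicing, using plain Python lists with made-up token ids in place of real tensors:

```python
# Toy stand-ins for token ids; real values come from the processor and the model.
prompt_ids = [[101, 7, 8, 9], [101, 5, 6, 102]]
generated_ids = [[101, 7, 8, 9, 40, 41], [101, 5, 6, 102, 50, 51]]

# Drop the first len(prompt) ids from each output so only the completion remains.
completions = [out[len(inp):] for inp, out in zip(prompt_ids, generated_ids)]
print(completions)  # [[40, 41], [50, 51]]
```

Without this step, the decoded text would repeat the rendered prompt before the model's answer.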

Qwen3VLMoeConfig

[[autodoc]] Qwen3VLMoeConfig

Qwen3VLMoeVisionConfig

[[autodoc]] Qwen3VLMoeVisionConfig

Qwen3VLMoeTextConfig

[[autodoc]] Qwen3VLMoeTextConfig

Qwen3VLMoeVisionModel

[[autodoc]] Qwen3VLMoeVisionModel
    - forward

Qwen3VLMoeTextModel

[[autodoc]] Qwen3VLMoeTextModel
    - forward

Qwen3VLMoeModel

[[autodoc]] Qwen3VLMoeModel
    - forward
    - get_video_features
    - get_image_features

Qwen3VLMoeForConditionalGeneration

[[autodoc]] Qwen3VLMoeForConditionalGeneration
    - forward
    - get_video_features
    - get_image_features