docs/source/en/model_doc/cosmos3_omni.md
This model was contributed to Hugging Face Transformers on 2026-06-04.
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
    </div>
</div>
Cosmos3 is a mixture-of-transformers (MoT) Vision Foundation Model from NVIDIA, composed of a Reasoner tower and a Generator tower. The two towers share the same input embedding and visual encoder but use disjoint MoT experts for understanding and generation. Cross-modal adapters (`proj_out`, `audio_proj_out`, `action_proj_out`, etc.) connect the language model to the image, audio, and action heads.
The Transformers integration loads only the Reasoner tower from a unified Cosmos3 checkpoint. The Reasoner is architecturally identical to Qwen3-VL, so `Cosmos3OmniForConditionalGeneration` is a thin subclass of `Qwen3VLForConditionalGeneration`.
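To make "loads only the Reasoner tower" more concrete, here is a minimal sketch of how a loader might select a subset of weights from a unified checkpoint by key prefix. The prefixes `shared.`, `reasoner.`, and `generator.`, the helper `select_reasoner_weights`, and the toy key names are all hypothetical, chosen for illustration. The real Cosmos3 checkpoint layout and the actual loading logic in Transformers may differ.

```python
# Hypothetical prefixes: the real checkpoint may name its parameters differently.
KEEP_PREFIXES = ("shared.", "reasoner.")


def select_reasoner_weights(state_dict, keep_prefixes=KEEP_PREFIXES):
    """Return only the entries whose key starts with one of `keep_prefixes`."""
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if name.startswith(keep_prefixes)  # str.startswith accepts a tuple
    }


# Toy "unified checkpoint": strings stand in for real tensors.
unified = {
    "shared.embed_tokens.weight": "tensor_a",
    "reasoner.layers.0.mlp.weight": "tensor_b",
    "generator.layers.0.mlp.weight": "tensor_c",
    "generator.proj_out.weight": "tensor_d",
}

reasoner_only = select_reasoner_weights(unified)
print(sorted(reasoner_only))
```

In this toy layout, the Generator experts and the `proj_out` adapter are dropped, while the shared embedding and the Reasoner weights are kept.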
import torch
from transformers import AutoProcessor, Cosmos3OmniForConditionalGeneration

# Load the Reasoner tower and its matching processor.
model = Cosmos3OmniForConditionalGeneration.from_pretrained("nvidia/Cosmos3-Nano", device_map="auto")
processor = AutoProcessor.from_pretrained("nvidia/Cosmos3-Nano")

# A single user turn containing one image and one text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Caption the image in detail."},
        ],
    },
]

# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens from each sequence so only the new tokens are decoded.
output = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output[0])
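The slicing step before `batch_decode` is easy to misread, so here is a small sketch of it in isolation. For decoder-only models, `generate` returns each sequence as the prompt followed by the newly generated tokens, and `out[len(inp):]` drops the echoed prompt. The token IDs and the helper name `strip_prompt` are made up for illustration, and plain lists stand in for tensors.

```python
# Toy token IDs standing in for real tensors; all values are made up.
prompt_ids = [[101, 7, 8, 9], [101, 5, 6, 0]]
# Each output row is the prompt followed by two newly generated tokens.
generated_ids = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 0, 77, 2]]


def strip_prompt(prompts, outputs):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(prompts, outputs)]


new_tokens = strip_prompt(prompt_ids, generated_ids)
print(new_tokens)  # [[42, 43], [77, 2]]
```

Decoding only the trimmed sequences keeps the prompt text out of the printed caption.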
[[autodoc]] Cosmos3OmniConfig
[[autodoc]] Cosmos3OmniModel
    - forward
    - get_video_features
    - get_image_features

[[autodoc]] Cosmos3OmniForConditionalGeneration
    - forward
    - get_video_features
    - get_image_features