docs/source/en/model_doc/radio.md
This model was contributed to Hugging Face Transformers on 2026-06-30.
<div style="float: right;"> <div class="flex flex-wrap space-x-1"></div> </div>
RADIO (Reduce All Domains Into One) is a family of vision foundation models from NVIDIA, trained by distilling multiple teacher models (e.g. CLIP, DINOv2, SAM) into a single ViT backbone. It produces both an image-level summary embedding and dense spatial features, and supports variable input resolutions through a Cropped Position Embedding (CPE) patch generator.
The example below demonstrates how to extract image features with the [`RadioModel`] class.
```python
import requests
import torch
from PIL import Image
from transformers import CLIPImageProcessor, RadioModel

hf_repo = "nvidia/C-RADIOv4-H"
model = RadioModel.from_pretrained(hf_repo)
model.eval().cuda()

image_processor = CLIPImageProcessor(
    size={"height": 224, "width": 224}, do_resize=True, do_center_crop=False, do_normalize=False
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.cuda()

with torch.no_grad():
    outputs = model(pixel_values)

summary = outputs.summary    # (1, 2560) image-level embedding
features = outputs.features  # (1, 196, 1280) dense spatial features
```
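
Dense-prediction heads usually expect a 2D feature map instead of a flat token sequence. The snippet below is a minimal sketch of one way to reshape the `(1, 196, 1280)` dense features into a `(1, 1280, 14, 14)` map. It rests on assumptions not stated above: a patch size of 16 (inferred from 196 tokens for a 224x224 input), row-major token order, and `features` containing only patch tokens. `tokens_to_grid` is a hypothetical helper, not part of the library, and a random tensor stands in for `outputs.features` so the snippet runs without downloading the model.

```python
import torch


def tokens_to_grid(features: torch.Tensor, height: int, width: int, patch_size: int = 16) -> torch.Tensor:
    """Reshape (batch, num_patches, dim) tokens into a (batch, dim, rows, cols) map.

    Hypothetical helper. Assumes row-major patch order and a patch size of 16.
    """
    batch, num_tokens, dim = features.shape
    rows, cols = height // patch_size, width // patch_size
    if rows * cols != num_tokens:
        raise ValueError(
            f"Expected {rows * cols} tokens for a {height}x{width} input, got {num_tokens}"
        )
    # (batch, rows * cols, dim) -> (batch, rows, cols, dim) -> (batch, dim, rows, cols)
    return features.reshape(batch, rows, cols, dim).permute(0, 3, 1, 2).contiguous()


# Stand-in for `outputs.features` with the shape shown in the example above.
features = torch.randn(1, 196, 1280)
feature_map = tokens_to_grid(features, height=224, width=224)
print(feature_map.shape)  # torch.Size([1, 1280, 14, 14])
```

Under the same assumptions, the helper also covers other input resolutions: pass the height and width of the processed image, and the check raises an error if the token count does not match the expected grid.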
[[autodoc]] RadioConfig
[[autodoc]] RadioModel
    - forward