This model was contributed to Hugging Face Transformers on 2026-06-30.

</div>

</div>

RADIO

RADIO (Reduce All Domains Into One) is a family of vision foundation models from NVIDIA, trained by distilling multiple teacher models (e.g. CLIP, DINOv2, SAM) into a single ViT backbone. It produces both an image-level summary embedding and dense spatial features, and it supports variable input resolutions through a Cropped Position Embedding (CPE) patch generator.
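To make the link between input resolution and the size of the dense output concrete, here is a minimal sketch of the patch-grid arithmetic. It assumes a patch size of 16, which is consistent with the 196 dense tokens this document reports for a 224×224 input but is not read from `RadioConfig`. The helper name `dense_token_grid` is hypothetical, and the requirement that both sides be multiples of the patch size is a simplification made for this sketch.

```python
def dense_token_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid for an input of the given size.

    patch_size=16 is an assumption of this sketch, not a value taken from the model config.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return height // patch_size, width // patch_size


# Compare a square input with a larger, non-square one.
for height, width in [(224, 224), (512, 768)]:
    rows, cols = dense_token_grid(height, width)
    print(f"{height}x{width} -> {rows}x{cols} grid, {rows * cols} dense tokens")
```

Under these assumptions, the number of dense tokens grows with the input area, while the summary embedding stays a single vector per image.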

The example below demonstrates how to extract image features with the [`RadioModel`] class.

```python
import requests
import torch
from PIL import Image

from transformers import CLIPImageProcessor, RadioModel


hf_repo = "nvidia/C-RADIOv4-H"

# Load the pretrained backbone, switch to inference mode, and move it to the GPU.
model = RadioModel.from_pretrained(hf_repo)
model.eval().cuda()

# Resize to 224x224 without center-cropping or normalization.
image_processor = CLIPImageProcessor(
    size={"height": 224, "width": 224}, do_resize=True, do_center_crop=False, do_normalize=False
)

# Download a sample image and convert it to a batch of pixel values on the GPU.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.cuda()

# Run a forward pass without tracking gradients.
with torch.no_grad():
    outputs = model(pixel_values)

summary = outputs.summary    # (1, 2560) image-level embedding
features = outputs.features  # (1, 196, 1280) dense spatial features
```

</hfoption> </hfoptions>
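Dense prediction heads usually expect a 2D feature map rather than a flat token sequence. The sketch below shows one way to reshape features of shape `(1, 196, 1280)`, as in the example above, into a `(1, 1280, 14, 14)` map. It runs on a random stand-in tensor so that it does not need the model weights. It assumes a patch size of 16, that the tokens are stored in row-major order, and that `features` holds only patch tokens. None of these assumptions is confirmed by this document, and `tokens_to_feature_map` is a hypothetical helper, not part of the library.

```python
import torch


def tokens_to_feature_map(features: torch.Tensor, height: int, width: int, patch_size: int = 16) -> torch.Tensor:
    """Reshape (batch, tokens, channels) into (batch, channels, rows, cols).

    Assumes row-major token order and patch_size=16 (assumptions of this sketch).
    """
    batch, num_tokens, channels = features.shape
    rows, cols = height // patch_size, width // patch_size
    if rows * cols != num_tokens:
        raise ValueError(f"expected {rows * cols} tokens for a {rows}x{cols} grid, got {num_tokens}")
    # Split the token axis into (rows, cols), then move channels in front of the grid.
    return features.reshape(batch, rows, cols, channels).permute(0, 3, 1, 2)


# Random stand-in with the same shape as `outputs.features` for a 224x224 input.
dummy_features = torch.randn(1, 196, 1280)
feature_map = tokens_to_feature_map(dummy_features, height=224, width=224)
print(tuple(feature_map.shape))
```

The result of `permute` is a non-contiguous view, so call `.contiguous()` on it before passing it to code that requires contiguous memory.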

RadioConfig

[[autodoc]] RadioConfig

RadioModel

[[autodoc]] RadioModel
    - forward