docs/source/en/model_doc/videoprism.md
This model was published in HF papers on 2024-02-20 and contributed to Hugging Face Transformers on 2026-06-19.
The VideoPrism model was proposed in the paper VideoPrism: A Foundational Visual Encoder for Video Understanding by Google DeepMind (blog post).
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. The model is pretrained on a large-scale heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling the model to focus primarily on the video modality while leveraging text associated with videos. VideoPrism achieves state-of-the-art performance on 31 out of 33 video understanding benchmarks across four broad task groups, from web video question answering to computer vision for science.
You can find all original VideoPrism checkpoints under the VideoPrism collection.
Notes:
- VideoPrismClipModel combines a video encoder and a text encoder. VideoPrismConfig must be used with this model.
- VideoPrismForVideoClassification adds a classification head on top of the video encoder. VideoPrismVisionConfig must be used with this model.
- VideoPrismVisionModel extracts video features. VideoPrismVisionConfig must be used with this model.

This model was contributed by MHRDYN7 and reviewed by vasqu & zucchini-nlp. The original code can be found here.
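The class and configuration pairings from the notes above can be written down as a small lookup table. This is only an illustrative sketch: `MODEL_TO_CONFIG` and `config_for` are hypothetical names, not part of Transformers, and the table stores class names as plain strings so that it runs without the library installed.

```python
# Model class -> configuration class, transcribed from the notes above.
# The names are kept as strings; nothing here imports Transformers.
MODEL_TO_CONFIG = {
    "VideoPrismClipModel": "VideoPrismConfig",
    "VideoPrismForVideoClassification": "VideoPrismVisionConfig",
    "VideoPrismVisionModel": "VideoPrismVisionConfig",
}


def config_for(model_class: str) -> str:
    """Return the configuration class name that pairs with `model_class`."""
    try:
        return MODEL_TO_CONFIG[model_class]
    except KeyError:
        known = ", ".join(sorted(MODEL_TO_CONFIG))
        raise ValueError(
            f"Unknown model class {model_class!r}; expected one of: {known}"
        ) from None


print(config_for("VideoPrismClipModel"))  # VideoPrismConfig
```

The point to take away is that only the video-text model uses the combined VideoPrismConfig, while both video-only classes share VideoPrismVisionConfig.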
The snippet below shows how to load the VideoPrismVisionModel for feature extraction using the AutoModel class.
import torch
from transformers import AutoModel, AutoVideoProcessor
processor = AutoVideoProcessor.from_pretrained("google/videoprism-base-f16r288", revision="refs/pr/4")
model = AutoModel.from_pretrained(
    "google/videoprism-base-f16r288",
    revision="refs/pr/4",
    device_map="auto",
    # use "flash_attention_2" for faster inference on supported hardware
    # attn_implementation="flash_attention_2"
)
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
# With do_sample_frames=True, the processor samples 16 frames for base checkpoints and 8 frames for large checkpoints by default.
processed_video_inputs = processor(videos=[video_url], return_metadata=True, do_sample_frames=True)
video_metadata = processed_video_inputs["video_metadata"]
video_inputs = processed_video_inputs["pixel_values_videos"].to(model.device)
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(video_inputs)
# VideoPrism encoder outputs
encoder_outputs = outputs.last_hidden_state
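As a rough sanity check on the snippet above, the sketch below shows one plausible way to pick 16 evenly spaced frames from a clip and how many tokens `last_hidden_state` would then be expected to contain. Both helpers (`uniform_frame_indices`, `expected_num_tokens`) are hypothetical and not part of Transformers. The patch size of 18 and the one-token-per-patch-per-frame layout are assumptions based on the original VideoPrism release rather than anything stated on this page, and the video processor's actual sampling strategy may differ.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    # Take the first frame of each of `num_frames` equal-width segments.
    return [(i * total_frames) // num_frames for i in range(num_frames)]


def expected_num_tokens(num_frames: int, resolution: int, patch_size: int = 18) -> int:
    """Token count, assuming one token per (patch_size x patch_size) patch per frame."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    patches_per_side = resolution // patch_size
    return num_frames * patches_per_side * patches_per_side


# A 300-frame clip sampled down to the 16 frames used by the base checkpoint.
indices = uniform_frame_indices(total_frames=300, num_frames=16)
print(indices[:4], len(indices))  # [0, 18, 37, 56] 16

# f16r288: 16 frames at 288x288 resolution (patch size 18 is an assumption).
print(expected_num_tokens(num_frames=16, resolution=288))  # 4096
```

Under these assumptions, comparing `encoder_outputs.shape[1]` with `expected_num_tokens(...)` is a quick way to confirm that the expected number of frames reached the model.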
[[autodoc]] VideoPrismVisionConfig
[[autodoc]] VideoPrismTextConfig
[[autodoc]] VideoPrismConfig
[[autodoc]] VideoPrismTokenizer
[[autodoc]] VideoPrismProcessor
[[autodoc]] VideoPrismVisionModel
    - forward

[[autodoc]] VideoPrismVideoModel
    - forward

[[autodoc]] VideoPrismTextModel
    - forward

[[autodoc]] VideoPrismClipModel
    - forward

[[autodoc]] VideoPrismForVideoClassification
    - forward