VideoPrism

<!--Copyright 2026 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

This model was published on Hugging Face Papers on 2024-02-20 and contributed to Hugging Face Transformers on 2026-06-19.


VideoPrism

The VideoPrism model was proposed in the paper VideoPrism: A Foundational Visual Encoder for Video Understanding by Google DeepMind (blog post).

VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. The model is pretrained on a large-scale heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling the model to focus primarily on the video modality while leveraging text associated with videos. VideoPrism achieves state-of-the-art performance on 31 out of 33 video understanding benchmarks across four broad task groups, from web video question answering to computer vision for science.


You can find all original VideoPrism checkpoints under the VideoPrism collection.

Notes:

  • VideoPrism uses a factorized spatio-temporal encoder architecture, processing videos through separate spatial and temporal transformers.
  • The model supports video-text contrastive learning through VideoPrismClipModel, which combines a video encoder and a text encoder. VideoPrismConfig must be used with this model.
  • For video classification tasks, use VideoPrismForVideoClassification, which adds a classification head on top of the video encoder. VideoPrismVisionConfig must be used with this model.
  • The vision encoder can be used standalone via VideoPrismVisionModel for extracting video features. VideoPrismVisionConfig must be used with this model.
  • The default input resolution is 288x288 pixels, with 16 frames per video clip for the base models and 8 frames for the large models. Set interpolate_pos_encoding=True to run the models at a custom resolution or with a different number of frames per clip.
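
To make the notes on the factorized encoder and the default input size more concrete, the sketch below counts tokens for a default-sized clip and compares the number of query-key pairs in joint space-time attention with a spatial pass followed by a temporal pass. It is plain Python arithmetic and does not touch the library. The patch size of 18 and the helper names `patch_grid` and `attention_pairs` are assumptions made for illustration, so check the checkpoint's configuration for the real value.

```python
def patch_grid(num_frames, height, width, patch_size):
    """Return (frames, patches_per_frame) for a clip split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    patches_per_frame = (height // patch_size) * (width // patch_size)
    return num_frames, patches_per_frame


def attention_pairs(num_frames, patches_per_frame):
    """Count query-key pairs for joint vs. factorized attention (one layer, one head)."""
    total_tokens = num_frames * patches_per_frame
    joint = total_tokens ** 2
    # Spatial pass: every frame attends over its own patches only.
    spatial = num_frames * patches_per_frame ** 2
    # Temporal pass: every patch position attends across the frames.
    temporal = patches_per_frame * num_frames ** 2
    return {"joint": joint, "factorized": spatial + temporal}


# PATCH_SIZE is an assumed value for illustration, not read from a checkpoint.
PATCH_SIZE = 18
frames, patches = patch_grid(num_frames=16, height=288, width=288, patch_size=PATCH_SIZE)
costs = attention_pairs(frames, patches)
print(frames, patches, costs)
# 16 256 {'joint': 16777216, 'factorized': 1114112}
```

Under these assumptions, a 360x360 input would yield a 20x20 patch grid instead of 16x16. A token layout that differs from the default like this is presumably the case interpolate_pos_encoding=True is meant to handle.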

This model was contributed by MHRDYN7 and reviewed by vasqu & zucchini-nlp. The original code can be found here.

Usage example

The snippet below shows how to load the VideoPrismVisionModel for feature extraction using the AutoModel class.

```py
import torch
from transformers import AutoModel, AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("google/videoprism-base-f16r288", revision="refs/pr/4")
model = AutoModel.from_pretrained(
    "google/videoprism-base-f16r288",
    revision="refs/pr/4",
    device_map="auto",
    # use "flash_attention_2" for faster inference on supported hardware
    # attn_implementation="flash_attention_2"
)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

# When do_sample_frames=True, the processor samples 16 frames for base
# checkpoints and 8 frames for large checkpoints by default.
processed_video_inputs = processor(videos=[video_url], return_metadata=True, do_sample_frames=True)
video_metadata = processed_video_inputs["video_metadata"]
video_inputs = processed_video_inputs["pixel_values_videos"].to(model.device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(video_inputs)

# VideoPrism encoder outputs
encoder_outputs = outputs.last_hidden_state
```
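
The example above relies on the processor to choose which frames to keep when do_sample_frames=True. As a rough illustration of what fixed-count sampling can look like, here is a minimal sketch of evenly spaced index selection in plain Python. The helper `uniform_frame_indices` and the 300-frame clip length are hypothetical, and the processor's actual sampling logic may differ.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices from a video of total_frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if total_frames < num_frames:
        raise ValueError("video has fewer frames than requested")
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * total_frames) // num_frames for i in range(num_frames)]


# Hypothetical 300-frame clip; 16 and 8 mirror the base and large defaults.
base_indices = uniform_frame_indices(300, 16)
large_indices = uniform_frame_indices(300, 8)
print(base_indices)
# [0, 18, 37, 56, 75, 93, 112, 131, 150, 168, 187, 206, 225, 243, 262, 281]
print(large_indices)
# [0, 37, 75, 112, 150, 187, 225, 262]
```

Evenly spaced indices cover the whole clip regardless of its length, which is one reason a fixed frame budget such as 16 or 8 can work for videos of different durations.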

VideoPrismVisionConfig

[[autodoc]] VideoPrismVisionConfig

VideoPrismTextConfig

[[autodoc]] VideoPrismTextConfig

VideoPrismConfig

[[autodoc]] VideoPrismConfig

VideoPrismTokenizer

[[autodoc]] VideoPrismTokenizer

VideoPrismProcessor

[[autodoc]] VideoPrismProcessor

VideoPrismVisionModel

[[autodoc]] VideoPrismVisionModel
    - forward

VideoPrismVideoModel

[[autodoc]] VideoPrismVideoModel
    - forward

VideoPrismTextModel

[[autodoc]] VideoPrismTextModel
    - forward

VideoPrismClipModel

[[autodoc]] VideoPrismClipModel
    - forward

VideoPrismForVideoClassification

[[autodoc]] VideoPrismForVideoClassification
    - forward