docs/source/en/model_doc/pe_audio.md
This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.
PE Audio is the audio branch of Meta's Perception Encoder family. It contrastively aligns raw waveforms with text into a shared embedding space, trained on paired audio–caption data for cross-modal retrieval and zero-shot audio classification.
Two heads are exposed on top of the same encoder. [PeAudioModel] returns one pooled embedding per clip for clip-level retrieval, while [PeAudioFrameLevelModel] returns one embedding every 40 ms for event localization and fine-grained temporal analysis.
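To make the 40 ms frame spacing concrete, the sketch below turns a list of per-frame scores into time spans. It is only an illustration and not part of the library: the `frames_to_segments` helper, the `threshold` value, and the example scores are hypothetical. Only the 40 ms spacing comes from the description above.

```python
FRAME_SECONDS = 0.040  # one frame-level embedding every 40 ms (from the text above)


def frames_to_segments(scores, threshold=0.5, frame_seconds=FRAME_SECONDS):
    """Group consecutive frames scoring >= threshold into (start, end) spans in seconds."""
    segments = []
    start = None  # index of the first frame of the span currently being built
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((round(start * frame_seconds, 3), round(i * frame_seconds, 3)))
            start = None
    if start is not None:
        # The last span runs to the end of the clip.
        segments.append((round(start * frame_seconds, 3), round(len(scores) * frame_seconds, 3)))
    return segments


# Hypothetical per-frame scores for one label (8 frames = 320 ms of audio).
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
print(frames_to_segments(scores))
```

With real model outputs, `scores` would be one row of per-frame probabilities for a single clip and label.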
You can find all the official PE Audio checkpoints under the perception-encoder-audio-visual collection.
```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoProcessor, PeAudioModel

processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
model = PeAudioModel.from_pretrained(
    "facebook/pe-av-large",
    device_map="auto",
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# LibriSpeech is stored at 16 kHz. Resample on load, because the feature
# extractor expects 48 kHz audio and does not resample for you.
ds = ds.cast_column("audio", Audio(sampling_rate=48_000))
audio = ds[0]["audio"]["array"]
labels = ["a dog barking", "a person speaking", "music playing"]

# Prepare the audio and the candidate captions separately, then merge them.
audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
inputs = {**audio_inputs, **text_inputs}

with torch.no_grad():
    outputs = model(**inputs)
    # Sigmoid gives an independent probability for each audio-text pair.
    probs = outputs.logits_audio_text.sigmoid()

print({label: p.item() for label, p in zip(labels, probs[0])})
```
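The example applies a sigmoid to the logits, so each audio-text pair gets its own probability and the values across labels need not sum to 1. The sketch below shows this on made-up logits using only the standard library. The logit values and the `sigmoid` helper are hypothetical and not part of Transformers.

```python
import math


def sigmoid(x):
    """Logistic function: maps a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


labels = ["a dog barking", "a person speaking", "music playing"]
logits = [-3.0, 2.0, 0.0]  # hypothetical audio-text logits for one clip

probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
best = max(probs, key=probs.get)  # label with the highest probability

print(best)
print(sum(probs.values()))  # generally not 1.0, unlike a softmax
```

Because the scores are independent, several labels can be above a threshold at once, or none of them.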
- Audio must be mono (feature_size=1) and resampled to 48 kHz. The feature extractor warns but does not resample for you. Stereo input is not supported.
- The padding mask is passed as padding_mask (not the usual attention_mask). The mask is downsampled internally by dac_config.hop_length before it reaches the encoder, so pass the raw waveform-resolution mask that the feature extractor returns.
- [PeAudioModel] returns logits of shape (n_audio, n_text). [PeAudioFrameLevelModel] returns (n_audio, n_text, n_frames) with one frame every 40 ms. Pick the class that matches the task; they share weights, so swapping is cheap.
- The text encoder is loaded with AutoModel from config.text_config. The tokenizer is attached to the processor via AutoTokenizer, not a dedicated class.

[[autodoc]] PeAudioConfig
[[autodoc]] PeAudioEncoderConfig
[[autodoc]] PeAudioFeatureExtractor
    - __call__
[[autodoc]] PeAudioProcessor
[[autodoc]] PeAudioEncoder
    - forward
[[autodoc]] PeAudioModel
    - forward
[[autodoc]] PeAudioFrameLevelModel
    - forward
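As a rough illustration of the padding_mask behaviour described earlier, the sketch below shows one way a waveform-resolution mask could be reduced by a hop length. Everything here is an assumption made for illustration: the `downsample_mask` helper, the rule that a window counts as valid if any of its samples is valid, and the hop length of 4. The reduction inside the model may differ, and the real hop length comes from dac_config.hop_length.

```python
def downsample_mask(mask, hop_length):
    """Reduce a per-sample 0/1 mask to one value per hop_length-sized window.

    Assumed rule (hypothetical): a window is valid if any sample in it is valid.
    """
    return [
        int(any(mask[i : i + hop_length]))
        for i in range(0, len(mask), hop_length)
    ]


# 10 real samples followed by 6 padded samples, at waveform resolution.
waveform_mask = [1] * 10 + [0] * 6
frame_mask = downsample_mask(waveform_mask, hop_length=4)
print(frame_mask)
```

The point of the sketch is the direction of the conversion: the caller supplies the per-sample mask, and the coarser mask is derived from it rather than built by hand.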