This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.

PE Audio (Perception Encoder Audio)

Overview

PE Audio (Perception Encoder Audio) is a multimodal model that embeds audio and text into a shared embedding space, enabling cross-modal retrieval and understanding between the two modalities.

Text input

The model produces a single embedding that represents the full text.

Audio input

- PeAudioFrameLevelModel
  - Produces a sequence of embeddings, one for every 40 ms of audio.
  - Suitable for audio event localization and fine-grained temporal analysis.
- PeAudioModel
  - Produces a single embedding for the entire audio clip.
  - Suitable for global audio-text retrieval tasks.
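
To make the 40 ms frame rate concrete, the sketch below estimates how many frame-level embeddings a clip would yield and which timestamp each frame index corresponds to. It is plain arithmetic, not a call into the library: the helper names `expected_num_frames` and `frame_start_time_s` are hypothetical, and the ceiling rounding for a partial final frame is an assumption, so the real sequence length may differ by a frame depending on how the model pads or truncates.

```python
FRAME_HOP_MS = 40  # from the text above: one embedding for every 40 ms of audio


def expected_num_frames(duration_s: float, hop_ms: int = FRAME_HOP_MS) -> int:
    """Estimate the number of frame-level embeddings for a clip.

    Assumption: a trailing partial frame still produces an embedding
    (ceiling rounding). The actual model may round differently.
    """
    duration_ms = round(duration_s * 1000)
    return -(-duration_ms // hop_ms)  # integer ceiling division


def frame_start_time_s(frame_index: int, hop_ms: int = FRAME_HOP_MS) -> float:
    """Start time, in seconds, of the audio covered by a given frame index."""
    return frame_index * hop_ms / 1000


if __name__ == "__main__":
    print(expected_num_frames(10.0))  # frames for a 10-second clip
    print(frame_start_time_s(25))     # start time of frame 25
```

Under these assumptions, a 10-second clip maps to 250 frame embeddings, which is the resolution available for temporal analysis.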

The resulting embeddings can be used for:

- Audio event localization
- Cross-modal (audio–text) retrieval and matching
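
As an illustration of both uses, the sketch below works on placeholder embeddings written as plain Python lists. It does not call the model: in practice the vectors would come from PeAudioModel (one per clip) or PeAudioFrameLevelModel (one per frame). Cosine similarity, the `0.5` threshold, and the helper names `rank_clips` and `localize_event` are assumptions chosen for illustration, not something this page specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_clips(text_embedding, clip_embeddings):
    """Retrieval: clip indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, c) for c in clip_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def localize_event(text_embedding, frame_embeddings, threshold=0.5, hop_ms=40):
    """Localization: (start_s, end_s) spans where frames match the text.

    Assumes one frame embedding per `hop_ms` of audio; the threshold is
    a hypothetical value that would need tuning on real data.
    """
    spans, start = [], None
    for i, frame in enumerate(frame_embeddings):
        active = cosine_similarity(text_embedding, frame) >= threshold
        if active and start is None:
            start = i  # a matching run begins at this frame
        elif not active and start is not None:
            spans.append((start * hop_ms / 1000, i * hop_ms / 1000))
            start = None
    if start is not None:  # the run extends to the end of the clip
        spans.append((start * hop_ms / 1000, len(frame_embeddings) * hop_ms / 1000))
    return spans


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model outputs.
    text = [1.0, 0.0]
    clips = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.2], [0.0, 1.0]]
    print(rank_clips(text, clips))
    print(localize_event(text, frames))
```

Retrieval compares one text embedding against one embedding per clip, while localization compares it against every frame of a single clip and merges consecutive matching frames into time spans.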

Usage

Basic usage

TODO

PeAudioFeatureExtractor

[[autodoc]] PeAudioFeatureExtractor
    - __call__

PeAudioProcessor

[[autodoc]] PeAudioProcessor
    - __call__

PeAudioConfig

[[autodoc]] PeAudioConfig

PeAudioEncoderConfig

[[autodoc]] PeAudioEncoderConfig

PeAudioEncoder

[[autodoc]] PeAudioEncoder
    - forward

PeAudioFrameLevelModel

[[autodoc]] PeAudioFrameLevelModel
    - forward

PeAudioModel

[[autodoc]] PeAudioModel
    - forward