docs/source/en/model_doc/pe_audio.md
This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.
PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared embedding space. Because both modalities live in the same space, their embeddings can be compared directly, which enables cross-modal retrieval and understanding between audio and text.
Text input
Audio input
The resulting embeddings can be used for:
TODO
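One use named in the overview is cross-modal retrieval. The sketch below illustrates only the retrieval step, assuming text and audio embeddings have already been computed and share the same dimensionality. It does not call the PE Audio API: the vectors are hand-written placeholders, and `rank_audio_for_text` is a hypothetical helper, not part of Transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rank_audio_for_text(text_embed, audio_embeds):
    """Hypothetical helper: rank audio clips by similarity to one text query.

    Returns (order, scores), where `order` lists audio indices from best
    to worst match and `scores` holds the cosine similarity per clip.
    """
    scores = [cosine_similarity(text_embed, a) for a in audio_embeds]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Placeholder embeddings standing in for real model outputs (assumption).
audio_embeds = [
    [1.0, 0.0, 0.0],  # clip 0
    [0.0, 1.0, 0.0],  # clip 1
    [0.0, 0.0, 1.0],  # clip 2
]
text_embed = [0.1, 0.9, 0.2]  # query that points mostly toward clip 1

order, scores = rank_audio_for_text(text_embed, audio_embeds)
print(order)  # audio indices, best match first
```

In practice the placeholder lists would be replaced by the text and audio embeddings produced by the model, and the same ranking works in the other direction (one audio query against many text candidates).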
[[autodoc]] PeAudioFeatureExtractor
    - __call__

[[autodoc]] PeAudioProcessor
    - __call__

[[autodoc]] PeAudioConfig

[[autodoc]] PeAudioEncoderConfig

[[autodoc]] PeAudioEncoder
    - forward

[[autodoc]] PeAudioFrameLevelModel
    - forward

[[autodoc]] PeAudioModel
    - forward