This model was released on 2026-03-26 and added to Hugging Face Transformers on 2026-03-26.
Cohere ASR, released by Cohere on March 26th, 2026, is a 2B parameter Conformer-based encoder-decoder speech recognition model.
This model was contributed by Eustache Le Bihan.
```python
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

revision = "refs/pr/6"
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026", revision=revision)
model = CohereAsrForConditionalGeneration.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026", device_map="auto", revision=revision
)

# Load and resample the audio to the model's expected 16 kHz sampling rate
audio = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
    sampling_rate=16000,
)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
```
Pass `punctuation=False` to obtain lower-cased output without punctuation marks.
```python
# Keep punctuation and casing in the transcription
inputs_pnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=True).to(model.device)

# Lower-cased transcription without punctuation marks
inputs_nopnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=False).to(model.device)
```
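To compare the two settings end to end, run generation on both batches. This is a minimal sketch that reuses the model and the decode pattern from the example above:

```python
# Generate and decode both variants to compare punctuated vs. plain output
for label, batch in (("punctuated", inputs_pnc), ("plain", inputs_nopnc)):
    batch = batch.to(model.device, dtype=model.dtype)
    out = model.generate(**batch, max_new_tokens=256)
    print(label, processor.decode(out, skip_special_tokens=True))
```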
For audio longer than the feature extractor's `max_audio_clip_s`, the feature extractor automatically splits the waveform into chunks. The processor then reassembles the per-chunk transcriptions using the returned `audio_chunk_index`.
```python
audio_long = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
    sampling_rate=16000,
)

# Long audio is split into chunks by the feature extractor;
# audio_chunk_index records which source audio each chunk came from
inputs = processor(audio=audio_long, sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)
audio_chunk_index = inputs.get("audio_chunk_index")

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en")
print(text)
```
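To confirm that chunking actually happened, you can inspect the processor output before calling `generate`. This sketch only relies on `audio_chunk_index`, which the processor is documented to return; use `inputs.keys()` to see the exact feature names:

```python
# audio_chunk_index maps each chunk in the batch back to its source audio;
# for a single long recording it holds one entry per chunk
print(inputs.keys())
print(audio_chunk_index)
```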
Multiple audio files can be processed in a single call. When the batch mixes short-form and long-form audio, the processor handles chunking and reassembly.
```python
audio_short = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
    sampling_rate=16000,
)
audio_long = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
    sampling_rate=16000,
)

# Pass a list of waveforms to transcribe several audios in one call
inputs = processor([audio_short, audio_long], sampling_rate=16000, return_tensors="pt", language="en")
inputs = inputs.to(model.device, dtype=model.dtype)
audio_chunk_index = inputs.get("audio_chunk_index")

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en"
)
print(text)
```
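After the chunks are reassembled, `processor.decode` should yield one transcription per source audio. A quick sanity check, assuming the batched return value is a list of strings:

```python
# One reassembled transcription per input audio
# (assumes the batched decode returns a list of str)
for i, transcription in enumerate(text):
    print(f"audio {i}: {transcription}")
```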
Specify the `language` code to transcribe audio in any of the 14 supported languages.
```python
audio_es = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/fleur_es_sample.wav",
    sampling_rate=16000,
)

# Transcribe Spanish audio by setting language="es"
inputs = processor(audio_es, sampling_rate=16000, return_tensors="pt", language="es", punctuation=True)
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
```
[[autodoc]] CohereAsrConfig

[[autodoc]] CohereAsrFeatureExtractor
    - __call__

[[autodoc]] CohereAsrProcessor
    - __call__

[[autodoc]] CohereAsrPreTrainedModel
    - forward

[[autodoc]] CohereAsrModel
    - forward

[[autodoc]] CohereAsrForConditionalGeneration
    - forward