# VoxtralRealtime

*This model was released on {release_date} and added to Hugging Face Transformers on 2026-02-15.*
VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model, which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription: it processes audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
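The padding cache is what makes chunked inference match processing the full sequence at once. Below is a minimal, self-contained sketch of the general technique, a causal convolution with a padding cache; the class and its attributes are hypothetical illustrations, not VoxtralRealtime's actual layers:

```python
import torch
import torch.nn as nn

class StreamingCausalConv1d(nn.Module):
    """Hypothetical sketch of a causal 1D convolution with a padding cache.

    The cache stores the last `kernel_size - 1` frames of the previous chunk,
    so convolving chunk by chunk matches convolving the full sequence at once.
    """

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.pad = kernel_size - 1
        self.cache = None

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk has shape (batch, channels, time)
        if self.cache is None:
            # First chunk: causal left padding with zeros
            self.cache = chunk.new_zeros(*chunk.shape[:2], self.pad)
        x = torch.cat([self.cache, chunk], dim=-1)
        # Remember the trailing frames as context for the next chunk
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)

# Streaming two chunks produces the same output as one full pass
torch.manual_seed(0)
layer = StreamingCausalConv1d(channels=2, kernel_size=3)
x = torch.randn(1, 2, 8)
streamed = torch.cat([layer(x[..., :5]), layer(x[..., 5:])], dim=-1)
layer.cache = None  # reset the stream
assert torch.allclose(streamed, layer(x), atol=1e-6)
```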
For transcribing complete audio files, use the processor and model directly. The generation length is automatically determined from the audio length.
```python
from datasets import load_dataset
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration

repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]

inputs = processor(audio, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs)
decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
print(decoded_outputs[0])
```
Multiple audio samples can be transcribed in a single forward pass:
```python
from datasets import load_dataset
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration

repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = [ds[i]["audio"]["array"] for i in range(2)]

inputs = processor(audio, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs)
decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
for decoded_output in decoded_outputs:
    print(decoded_output)
```
> [!NOTE]
> This is an experimental feature and the API is subject to change.
For real-time transcription, the audio is split into chunks that are fed to the model incrementally:
```python
from threading import Thread

import numpy as np
from datasets import load_dataset
from transformers import (
    TextIteratorStreamer,
    VoxtralRealtimeForConditionalGeneration,
    VoxtralRealtimeProcessor,
)

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

processor = VoxtralRealtimeProcessor.from_pretrained(model_id)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]

# Manually pad the audio to account for the right padding tokens required by the model
audio = np.pad(audio, (0, processor.num_right_pad_tokens * processor.raw_audio_length_per_tok))

# The first chunk has its own size and also provides the prompt (`input_ids`)
# and `num_delay_tokens` consumed by `generate` below
first_chunk_inputs = processor(
    audio[: processor.num_samples_first_audio_chunk],
    is_streaming=True,
    is_first_audio_chunk=True,
    return_tensors="pt",
)
first_chunk_inputs = first_chunk_inputs.to(model.device, dtype=model.dtype)

def input_features_generator():
    yield first_chunk_inputs.input_features

    mel_frame_idx = processor.num_mel_frames_first_audio_chunk
    hop_length = processor.feature_extractor.hop_length
    win_length = processor.feature_extractor.win_length
    start_idx = mel_frame_idx * hop_length - win_length // 2

    while (end_idx := start_idx + processor.num_samples_per_audio_chunk) < audio.shape[0]:
        inputs = processor(
            audio[start_idx:end_idx],
            is_streaming=True,
            is_first_audio_chunk=False,
            return_tensors="pt",
        )
        inputs = inputs.to(model.device, dtype=model.dtype)
        yield inputs.input_features

        # Advance by one audio token's worth of mel frames, then map the mel
        # frame index back to a raw sample index (offset by half an STFT window)
        mel_frame_idx += processor.audio_length_per_tok
        start_idx = mel_frame_idx * hop_length - win_length // 2

streamer = TextIteratorStreamer(processor.tokenizer, skip_special_tokens=True, clean_up_tokenization_spaces=True)

generate_kwargs = {
    "input_ids": first_chunk_inputs.input_ids,
    "input_features": input_features_generator(),
    "num_delay_tokens": first_chunk_inputs.num_delay_tokens,
    "streamer": streamer,
}

# Run generation in a background thread so the streamer can be consumed here
thread = Thread(target=model.generate, kwargs=generate_kwargs)
thread.start()

# Iterate over the streamer to get text chunks as they are generated
print("Model output (streaming):", end=" ", flush=True)
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
print()
thread.join()
```
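To make the chunk indexing concrete, here is a small worked example of the arithmetic above. All numeric values are illustrative assumptions (Whisper-style 16 kHz features); in practice, read the real values from `processor` and `processor.feature_extractor`:

```python
# Illustrative values only; read the real ones from the processor
hop_length = 160                    # assumed samples between consecutive mel frames
win_length = 400                    # assumed STFT window size in samples
audio_length_per_tok = 8            # assumed mel frames consumed per audio token
num_samples_per_audio_chunk = 1280  # assumed raw samples per streamed chunk

mel_frame_idx = 16  # assumed stand-in for processor.num_mel_frames_first_audio_chunk
for step in range(3):
    start_idx = mel_frame_idx * hop_length - win_length // 2
    end_idx = start_idx + num_samples_per_audio_chunk
    print(f"chunk {step}: samples [{start_idx}, {end_idx})")
    mel_frame_idx += audio_length_per_tok

# chunk 0: samples [2360, 3640)
# chunk 1: samples [3640, 4920)
# chunk 2: samples [4920, 6200)
```

With these assumed values, each step advances by exactly `audio_length_per_tok * hop_length = 1280` samples, so successive chunks tile the waveform, while the `win_length // 2` term shifts each start index back by half an analysis window.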
This model was contributed by Eustache Le Bihan.
## VoxtralRealtimeConfig

[[autodoc]] VoxtralRealtimeConfig

## VoxtralRealtimeEncoderConfig

[[autodoc]] VoxtralRealtimeEncoderConfig

## VoxtralRealtimeTextConfig

[[autodoc]] VoxtralRealtimeTextConfig

## VoxtralRealtimeFeatureExtractor

[[autodoc]] VoxtralRealtimeFeatureExtractor

## VoxtralRealtimeProcessor

[[autodoc]] VoxtralRealtimeProcessor
    - __call__

## VoxtralRealtimeEncoder

[[autodoc]] VoxtralRealtimeEncoder
    - forward

## VoxtralRealtimeForConditionalGeneration

[[autodoc]] VoxtralRealtimeForConditionalGeneration
    - forward
    - get_audio_features