import { Tabs, TabItem } from '@astrojs/starlight/components';
mistral.rs supports two speech-related model families:
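- Voxtral: audio understanding (audio in, text out), served through `/v1/chat/completions`. It uses a Whisper-style audio encoder.
- Dia: text-to-speech (text in, audio out), served through `/v1/audio/speech`.

Voxtral is classified as a multimodal model (audio is one of its input modalities); Dia is classified as a dedicated speech model.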
mistralrs serve -m mistralai/Voxtral-Mini-3B-2507
`-m` alone is enough: the auto-loader detects Voxtral's native Mistral layout (`params.json`, `consolidated.safetensors`, `tekken.json`).
Voxtral fits the multimodal chat shape: audio goes in as a content part and the response comes back as text. The text prompt selects the task: transcription, summarization, speaker analysis, and so on.
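Because the task is chosen by the text part alone, the same audio part can be reused across requests. The sketch below builds the `messages` payload for a few tasks. `audio_message` is a hypothetical helper (not part of mistral.rs), the prompts other than "Transcribe this." are made up for illustration, and it assumes the server reads `file://` URLs as in the requests in this guide.

```python
from pathlib import Path


def audio_message(audio_path: str, prompt: str) -> dict:
    """Build one user message: an audio part followed by a text part.

    Hypothetical helper; it only mirrors the request shape used in this guide.
    """
    # as_uri() requires an absolute path, so resolve relative paths first.
    audio_url = Path(audio_path).resolve().as_uri()
    return {
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text", "text": prompt},
        ],
    }


# Same clip, different tasks: only the text prompt changes.
# These prompts are examples, not a fixed task list.
prompts = {
    "transcription": "Transcribe this.",
    "summarization": "Summarize this recording in two sentences.",
    "speaker_analysis": "How many speakers are there, and what is each one's tone?",
}
requests = {
    task: [audio_message("clip.wav", prompt)] for task, prompt in prompts.items()
}
```

Each value in `requests` can be passed as the `messages` field of a chat completion request.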
<Tabs>
<TabItem label="HTTP">
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "audio_url", "audio_url": {"url": "file:///clip.wav"}},
        {"type": "text", "text": "Transcribe this."}
      ]
    }]
  }'
from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="mistralai/Voxtral-Mini-3B-2507",
        arch=MultimodalArchitecture.Voxtral,
    )
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": "file:///absolute/path/clip.wav"}},
                    {"type": "text", "text": "Transcribe this audio."},
                ],
            }
        ],
        max_tokens=256,
        temperature=0,
    )
)
print(response.choices[0].message.content)
use mistralrs::{AudioInput, MultimodalMessages, MultimodalModelBuilder, TextMessageRole};

let model = MultimodalModelBuilder::new("mistralai/Voxtral-Mini-3B-2507")
    .build()
    .await?;

let audio_bytes = std::fs::read("clip.wav")?;
let audio = AudioInput::from_bytes(&audio_bytes)?;

let messages = MultimodalMessages::new().add_audio_message(
    TextMessageRole::User,
    "Transcribe this audio.",
    vec![audio],
);

let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
The `/v1/audio/speech` endpoint follows OpenAI's request shape. To serve Dia:
mistralrs serve -m nari-labs/Dia-1.6B
Dia understands dialogue speaker tags such as `[S1]` and `[S2]`, and nonverbal parentheticals such as `(laughs)` or `(coughs)`. Use them in the input string when you want dialogue or expressive speech.
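As a small illustration of the tag format, the sketch below joins a list of turns into one input string. `dia_script` is a hypothetical helper, not part of mistral.rs, and it restricts speakers to 1 and 2 only because those are the tags named here.

```python
def dia_script(turns: list[tuple[int, str]]) -> str:
    """Join (speaker, line) turns into one input string with [S1]/[S2] tags."""
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            # Only [S1] and [S2] are named in this guide; no other tags are assumed.
            raise ValueError(f"unsupported speaker: {speaker}")
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


script = dia_script([
    (1, "Did you hear the news?"),
    (2, "I did. (laughs) I still can't believe it."),
])
print(script)
# [S1] Did you hear the news? [S2] I did. (laughs) I still can't believe it.
```

The resulting string goes into the `input` field of the speech request unchanged.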
curl http://localhost:1234/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "[S1] Hello. This is a test of the text-to-speech system.",
    "response_format": "wav"
  }' \
  --output out.wav
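Since the endpoint reads only `model`, `input`, and `response_format`, a client can check a request before sending it. The following is a minimal sketch under that assumption; `speech_request_body` and its constants are hypothetical names, not part of mistral.rs or the OpenAI client.

```python
import json

# Formats this guide lists as accepted by the server.
SUPPORTED_FORMATS = {"wav", "pcm"}
# OpenAI-style fields this guide lists as silently ignored.
IGNORED_FIELDS = {"voice", "speed", "instructions"}


def speech_request_body(text: str, response_format: str = "wav", **extra) -> str:
    """Return the JSON body for /v1/audio/speech, failing early on bad formats."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported response_format {response_format!r}; use wav or pcm"
        )
    # Surface fields that would otherwise be dropped without any feedback.
    dropped = sorted(IGNORED_FIELDS.intersection(extra))
    if dropped:
        print(f"note: the server ignores {', '.join(dropped)}")
    return json.dumps(
        {"model": "default", "input": text, "response_format": response_format}
    )


body = speech_request_body("[S1] Hello.", response_format="wav", voice="alloy")
```

Rejecting `mp3` and similar formats locally avoids a round trip that would end in a validation error anyway.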
- `response_format`: only `wav` and `pcm` are accepted; `mp3`, `opus`, `aac`, and `flac` return a validation error.
- `voice`, `speed`, and `instructions` are silently ignored (the request reads only `model`, `input`, and `response_format`).

import struct
import wave
from pathlib import Path
from mistralrs import Runner, SpeechLoaderType, Which

runner = Runner(
    which=Which.Speech(
        model_id="nari-labs/Dia-1.6B",
        arch=SpeechLoaderType.Dia,
    )
)

response = runner.generate_audio("[S1] mistral r s can generate speech locally.")

output_path = Path("out.wav")
# Scale float samples to 16-bit signed integers, clamping to the valid range.
pcm_ints = [int(max(-32768, min(32767, sample * 32767))) for sample in response.pcm]

# Pass a str: wave.open does not accept Path objects on many Python versions.
with wave.open(str(output_path), "wb") as wav:
    wav.setnchannels(response.channels)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(response.rate)
    wav.writeframes(b"".join(struct.pack("<h", sample) for sample in pcm_ints))
use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};

let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
    .build()
    .await?;

let (pcm, rate, channels) = model
    .generate_speech("[S1] mistral r s can generate speech locally.")
    .await?;

let mut output = std::fs::File::create("out.wav")?;
speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16)?;