Speech models

import { Tabs, TabItem } from '@astrojs/starlight/components';

mistral.rs supports two speech-related model families:

  • Voxtral: multimodal model accepting audio input. Used for transcription and audio understanding through /v1/chat/completions. It uses a Whisper-style audio encoder.
  • Dia: dedicated text-to-speech model served via /v1/audio/speech.

mistral.rs classifies Voxtral as a multimodal model (audio is one of its input modalities) and Dia as a dedicated speech model, which is why the examples below load them through different builders.

Voxtral: audio in, text out

```bash
mistralrs serve -m mistralai/Voxtral-Mini-3B-2507
```

-m alone is enough: the auto-loader detects Voxtral's native Mistral layout (params.json, consolidated.safetensors, tekken.json).
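
To check ahead of time whether a locally downloaded checkpoint has this layout, a small script can look for the three files. This is only an illustrative sketch: the helper names `missing_native_files` and `has_native_mistral_layout` are hypothetical, and the auto-loader's real detection logic may differ.

```python
from pathlib import Path

# File names taken from the paragraph above. The check itself is an
# assumption about the layout, not the auto-loader's actual code.
NATIVE_LAYOUT_FILES = ("params.json", "consolidated.safetensors", "tekken.json")


def missing_native_files(model_dir):
    """Return the native-layout file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in NATIVE_LAYOUT_FILES if not (root / name).is_file()]


def has_native_mistral_layout(model_dir):
    """Return True when every expected native-layout file is present."""
    return not missing_native_files(model_dir)
```

`missing_native_files` returns the absent names rather than a bare boolean so that a failed check can report exactly what is missing.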

Voxtral fits the multimodal chat shape: audio goes in as a content part and the response comes back as text. The text prompt selects the task, such as transcription, summarization, or speaker analysis.

<Tabs> <TabItem label="HTTP">
```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "audio_url", "audio_url": {"url": "file:///clip.wav"}},
        {"type": "text", "text": "Transcribe this."}
      ]
    }]
  }'
```
</TabItem> <TabItem label="Python">
```python
from mistralrs import ChatCompletionRequest, MultimodalArchitecture, Runner, Which

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="mistralai/Voxtral-Mini-3B-2507",
        arch=MultimodalArchitecture.Voxtral,
    )
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": "file:///absolute/path/clip.wav"}},
                    {"type": "text", "text": "Transcribe this audio."},
                ],
            }
        ],
        max_tokens=256,
        temperature=0,
    )
)
print(response.choices[0].message.content)
```
</TabItem> <TabItem label="Rust">
```rust
use mistralrs::{AudioInput, MultimodalMessages, MultimodalModelBuilder, TextMessageRole};

let model = MultimodalModelBuilder::new("mistralai/Voxtral-Mini-3B-2507")
    .build()
    .await?;

let audio_bytes = std::fs::read("clip.wav")?;
let audio = AudioInput::from_bytes(&audio_bytes)?;

let messages = MultimodalMessages::new().add_audio_message(
    TextMessageRole::User,
    "Transcribe this audio.",
    vec![audio],
);

let response = model.send_chat_request(messages).await?;
println!("{}", response.choices[0].message.content.as_ref().unwrap());
```
</TabItem> </Tabs>
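
Because the text part selects the task, the same clip can be reused with different prompts. The sketch below builds the request body from the HTTP tab using only the Python standard library. The helper names `build_audio_chat_request` and `ask_about_audio` are hypothetical, and reading the reply from `choices[0].message.content` assumes the HTTP response follows the OpenAI chat-completion shape.

```python
import json
import urllib.request
from pathlib import Path


def build_audio_chat_request(audio_path, prompt, model="default"):
    """Build a /v1/chat/completions body with one audio part and one text part."""
    # as_uri() only works on absolute paths, so resolve() the input first.
    audio_url = Path(audio_path).resolve().as_uri()
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def ask_about_audio(audio_path, prompt, base_url="http://localhost:1234"):
    """POST the request to a running server and return the reply text."""
    body = json.dumps(build_audio_chat_request(audio_path, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Assumption: the server returns an OpenAI-style response body.
    return reply["choices"][0]["message"]["content"]
```

With a server running, `ask_about_audio("clip.wav", "Summarize this recording.")` would ask for a summary of the clip instead of a transcript; only the prompt changes.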

Dia: text-to-speech

The /v1/audio/speech endpoint follows OpenAI's speech API. Start the server with a Dia model:

```bash
mistralrs serve -m nari-labs/Dia-1.6B
```

Dia understands dialogue speaker tags such as [S1] and [S2], and nonverbal parentheticals such as (laughs) or (coughs). Use them in the input string when you want dialogue or expressive speech.
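
As a rough illustration of how such an input string might be assembled, the sketch below joins a list of turns into tagged text. The helper `format_dialogue` and its turn format are hypothetical conveniences, and joining turns with a single space is an assumption rather than a documented Dia requirement.

```python
def format_dialogue(turns):
    """Join (speaker_number, text) pairs into a speaker-tagged input string.

    Each turn is prefixed with its speaker tag, for example "[S1] Hello."
    Nonverbal cues such as "(laughs)" are passed through unchanged.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            # Only [S1] and [S2] are mentioned in this guide.
            raise ValueError(f"unsupported speaker number: {speaker}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


script = format_dialogue([
    (1, "Did the build pass?"),
    (2, "It did. (laughs) Finally."),
])
print(script)
```

The resulting string can be passed as the `input` field of a speech request or to the binding calls shown in the tabs.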

<Tabs> <TabItem label="HTTP">
```bash
curl http://localhost:1234/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "[S1] Hello. This is a test of the text-to-speech system.",
    "response_format": "wav"
  }' \
  --output out.wav
```
  • Output: raw audio bytes.
  • response_format: only wav and pcm are accepted; mp3, opus, aac, and flac return a validation error.
  • Extra OpenAI fields such as voice, speed, and instructions are silently ignored (the request reads only model, input, and response_format).
</TabItem> <TabItem label="Python">
```python
import struct
import wave
from pathlib import Path

from mistralrs import Runner, SpeechLoaderType, Which

runner = Runner(
    which=Which.Speech(
        model_id="nari-labs/Dia-1.6B",
        arch=SpeechLoaderType.Dia,
    )
)

response = runner.generate_audio("[S1] mistral r s can generate speech locally.")

output_path = Path("out.wav")
# Scale samples to 16-bit signed integers, clamping to the valid range.
pcm_ints = [int(max(-32768, min(32767, sample * 32767))) for sample in response.pcm]
# str() keeps this working on Python versions where wave.open does not accept Path objects.
with wave.open(str(output_path), "wb") as wav:
    wav.setnchannels(response.channels)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(response.rate)
    wav.writeframes(b"".join(struct.pack("<h", sample) for sample in pcm_ints))
```
</TabItem> <TabItem label="Rust">
```rust
use mistralrs::{speech_utils, SpeechLoaderType, SpeechModelBuilder};

let model = SpeechModelBuilder::new("nari-labs/Dia-1.6B", SpeechLoaderType::Dia)
    .build()
    .await?;

let (pcm, rate, channels) = model
    .generate_speech("[S1] mistral r s can generate speech locally.")
    .await?;

let mut output = std::fs::File::create("out.wav")?;
speech_utils::write_pcm_as_wav(&mut output, &pcm, rate as u32, channels as u16)?;
```
</TabItem> </Tabs>
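
The constraints listed in the HTTP tab can also be enforced on the client before a request is sent. The following sketch, using only the Python standard library, rejects formats other than wav and pcm and sends only the three fields the endpoint reads. The helper names `build_speech_request` and `synthesize` and the default base URL are illustrative assumptions.

```python
import json
import urllib.request

# Formats this guide says the endpoint accepts.
SUPPORTED_FORMATS = ("wav", "pcm")


def build_speech_request(text, response_format="wav", model="default"):
    """Build a /v1/audio/speech body containing only the fields that are read."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"response_format must be one of {SUPPORTED_FORMATS}, got {response_format!r}"
        )
    return {"model": model, "input": text, "response_format": response_format}


def synthesize(text, output_path, base_url="http://localhost:1234"):
    """POST the request to a running server and save the returned audio bytes."""
    body = json.dumps(build_speech_request(text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(output_path, "wb") as out:
        out.write(audio)
    # Return the byte count so callers can sanity-check the download.
    return len(audio)
```

Failing early on an unsupported format gives a local error message instead of a round trip to the server that ends in a validation error.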