Text-to-Speech & Speech-to-Text

LobeHub supports voice input and output: listen to Agent responses hands-free, speak your messages instead of typing, and hold natural back-and-forth voice conversations. Text-to-Speech (TTS) converts text to speech; Speech-to-Text (STT) converts your voice to text.

Overview

Voice features in LobeHub provide:

Text-to-Speech (TTS): Convert AI text responses into spoken audio
Speech-to-Text (STT): Speak your messages instead of typing
Hands-Free Mode: Continuous voice conversations combining both
Multiple Providers: Choose from various voice styles and languages
Per-Agent Voices: Configure unique voices for different agents

Text-to-Speech (TTS)

TTS converts AI text responses into spoken audio, so you can listen instead of reading.

To have text read aloud, highlight any content in the chat window and select Text-to-Speech. LobeHub uses a TTS model to convert the selected text into speech.

You can also configure an Agent to automatically read all responses as soon as each message completes — useful for hands-free workflows.

Voice Providers and Options

LobeHub supports two voice providers:

<Tabs>
<Tab title="OpenAI Voices">
Premium neural voices with natural prosody and intonation:

| Voice   | Character           |
| ------- | ------------------- |
| Alloy   | Neutral, balanced   |
| Echo    | Clear, professional |
| Fable   | Warm, friendly      |
| Onyx    | Deep, authoritative |
| Nova    | Energetic, engaging |
| Shimmer | Soft, gentle        |

**Best for**: Long-form content listening, professional use cases, content requiring natural flow.

</Tab>
<Tab title="Microsoft Edge Speech">
Azure Neural Voices: an extensive library with 100+ voices across languages, regional accents (US, UK, AU, and more), and male/female options.

**Best for**: Specific accent requirements, multi-language content, variety.

</Tab>
</Tabs>

Playback Controls

When audio is playing:

Play/Pause — Control playback
Progress bar — See and seek through audio
Speed control — Adjust playback speed (0.5× to 2×)
Volume — Adjust audio level
Download — Save audio file for offline use
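The speed range above can be enforced with a small clamp. This is a minimal sketch, not LobeHub's actual player code: the function name and the choice to fall back to normal speed on invalid input are assumptions made for illustration.

```typescript
// Hypothetical helper: keeps a requested speed inside the documented 0.5x to 2x range.
const MIN_PLAYBACK_RATE = 0.5;
const MAX_PLAYBACK_RATE = 2;

function clampPlaybackRate(requested: number): number {
  // Assumption: a NaN request falls back to normal speed (1x).
  if (Number.isNaN(requested)) return 1;
  return Math.min(MAX_PLAYBACK_RATE, Math.max(MIN_PLAYBACK_RATE, requested));
}
```

In a browser, the result could be assigned to an audio element's standard `playbackRate` property.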

TTS audio is cached automatically: the first playback generates audio in real time, and later playbacks of the same content load from the cache.
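The caching behavior can be pictured as a lookup keyed by text and voice. The sketch below is illustrative only: `TtsAudioCache`, `SynthesizeFn`, and the key format are hypothetical names, and LobeHub's real cache may be keyed and stored differently.

```typescript
// Hypothetical sketch: these names are illustrative and not taken from LobeHub's code.
type SynthesizeFn = (text: string, voice: string) => Promise<Uint8Array>;

class TtsAudioCache {
  private readonly entries = new Map<string, Promise<Uint8Array>>();
  private readonly synthesize: SynthesizeFn;

  constructor(synthesize: SynthesizeFn) {
    this.synthesize = synthesize;
  }

  // Number of cached (text, voice) pairs.
  get size(): number {
    return this.entries.size;
  }

  // The first request for a (text, voice) pair triggers synthesis;
  // later requests reuse the stored promise instead of generating again.
  getAudio(text: string, voice: string): Promise<Uint8Array> {
    const key = JSON.stringify([voice, text]);
    const cached = this.entries.get(key);
    if (cached !== undefined) return cached;

    const pending = this.synthesize(text, voice);
    this.entries.set(key, pending);
    // Drop failed generations so a later request can retry.
    pending.catch(() => this.entries.delete(key));
    return pending;
  }

  // Empties the cache, for example when the user wants to free local storage.
  clear(): void {
    this.entries.clear();
  }
}
```

Including the voice in the key means the same text spoken by two different voices is stored as two separate entries.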

Configuring Voice Settings

You can customize voice output by selecting your preferred voice service and model in the settings:

Open the Settings panel
Navigate to the Text-to-Speech section
Choose your desired voice service and AI model

Each Agent can have its own voice. To configure one, open the Agent's settings, go to the TTS section, select a voice provider, choose a voice, test it with sample text, and save.
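One way to think about per-Agent voices is as an optional override on top of the global setting. The sketch below assumes that model; the type names, the `resolveAgentVoice` function, the lowercase voice identifiers, and the fallback rule are all hypothetical and not confirmed by LobeHub.

```typescript
// Hypothetical types: the shape of LobeHub's real settings is not shown in this guide.
type VoiceProvider = "openai" | "edge";

interface VoiceSettings {
  provider: VoiceProvider;
  voice: string;
}

// The six OpenAI voices listed in the table above, written as lowercase identifiers (assumed).
const OPENAI_VOICE_IDS: readonly string[] = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

// Assumption: an Agent without its own value falls back to the global setting.
function resolveAgentVoice(
  globalDefault: VoiceSettings,
  agentOverride?: Partial<VoiceSettings>,
): VoiceSettings {
  const resolved: VoiceSettings = {
    provider: agentOverride?.provider ?? globalDefault.provider,
    voice: agentOverride?.voice ?? globalDefault.voice,
  };
  // Only the OpenAI list is validated here; the Edge library is too large to enumerate.
  if (resolved.provider === "openai" && !OPENAI_VOICE_IDS.includes(resolved.voice)) {
    throw new Error(`Unknown OpenAI voice: ${resolved.voice}`);
  }
  return resolved;
}
```

For example, `resolveAgentVoice({ provider: "openai", voice: "alloy" }, { voice: "nova" })` keeps the global provider and switches only the voice.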

Speech-to-Text (STT)

STT converts your spoken words into text, enabling voice input for messages.

To input text using your voice, click the voice input option in the message box. LobeHub will convert your speech into text and insert it into the input field. Once you're done, you can send it directly to the AI.

Supported Languages

STT supports a wide range of languages including English (US, UK, AU, CA, IN), Spanish, French, German, Italian, Portuguese, Chinese (Mandarin), Japanese, Korean, and many more. Language is typically auto-detected or set based on your interface language.
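When the language follows the interface language, the selection could work roughly as below. This is a sketch under assumptions: `resolveSttLanguage`, the matching rules, and the `en-US` fallback are hypothetical and do not describe LobeHub's actual logic.

```typescript
// Hypothetical helper: picks a recognition language from the interface locale.
function resolveSttLanguage(
  interfaceLocale: string | undefined,
  supported: readonly string[],
  fallback: string = "en-US",
): string {
  const wanted = (interfaceLocale ?? "").trim().toLowerCase();
  if (wanted === "") return fallback;

  // Prefer an exact match such as "en-GB".
  const exact = supported.find((tag) => tag.toLowerCase() === wanted);
  if (exact !== undefined) return exact;

  // Otherwise accept any supported variant of the same base language ("es-MX" -> "es-ES").
  const base = wanted.split("-")[0];
  const sameLanguage = supported.find((tag) => tag.toLowerCase().split("-")[0] === base);
  return sameLanguage ?? fallback;
}
```

In a browser, the interface locale could come from `navigator.language`.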

Tips for Best Results

<AccordionGroup>
  <Accordion title="Speak Clearly">
    Use a normal pace and enunciate. Avoid mumbling or speaking too quickly.
  </Accordion>
  <Accordion title="Optimize Your Environment">
    Use good microphone positioning and minimize background noise for better accuracy.
  </Accordion>
  <Accordion title="Use Complete Sentences">
    Speak in complete sentences and pause briefly between thoughts for more accurate transcription.
  </Accordion>
  <Accordion title="Review Before Sending">
    After voice input, review the transcribed text, edit any mistakes, and send when ready. This hybrid approach combines the speed of voice with the precision of text editing.
  </Accordion>
</AccordionGroup>

Voice Conversations (Hands-Free Mode)

Combine TTS and STT for natural, continuous voice conversations:

Configure your agent to use automatic TTS playback
Click the microphone and speak your message
Review the transcription and send
The AI response plays automatically via TTS
Speak your next message when ready
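The steps above form a simple cycle, which can be sketched as a small state machine. The state and event names here are hypothetical labels for the steps in this guide, not identifiers from LobeHub.

```typescript
// Hypothetical state machine mirroring the hands-free steps above.
type VoiceState = "idle" | "listening" | "reviewing" | "awaitingResponse" | "speaking";
type VoiceEvent =
  | "micClicked"
  | "transcriptReady"
  | "messageSent"
  | "responseComplete"
  | "playbackEnded";

const VOICE_TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { micClicked: "listening" },                // click the microphone and speak
  listening: { transcriptReady: "reviewing" },      // STT returns the transcription
  reviewing: { messageSent: "awaitingResponse" },   // review the text and send
  awaitingResponse: { responseComplete: "speaking" }, // automatic TTS playback starts
  speaking: { playbackEnded: "idle" },              // ready for the next message
};

// Events that do not apply to the current state leave it unchanged.
function nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState {
  return VOICE_TRANSITIONS[state][event] ?? state;
}
```

Ignoring events that do not apply keeps the loop predictable, for example when a stray `playbackEnded` arrives while the user is already speaking.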

Use Cases

<Tabs>
  <Tab title="Accessibility">
    Screen reader users, users with mobility limitations, or anyone who prefers audio interaction. Voice mode makes LobeHub accessible to a broader range of users.
  </Tab>
  <Tab title="Language Learning">
    Practice speaking in a target language and hear correct pronunciation via TTS. The AI can provide feedback on phrasing and suggest improvements.
  </Tab>
  <Tab title="Multitasking">
    Get AI assistance while cooking, commuting, or exercising. Hands-free mode lets you interact without looking at a screen.
  </Tab>
  <Tab title="Content Consumption">
    Listen to long articles, research papers, or learning materials at your own pace. Adjust speed to match your preference.
  </Tab>
</Tabs>

Performance

Sample rates: Voice output supports standard audio sample rates for high-quality playback
Latency: First playback requires real-time generation; cached playback is instant
Formats: Audio is generated in optimized formats for web playback
Caching: TTS output is cached locally to avoid redundant generation

Privacy Note

<Callout type={'warning'}> Voice input is processed by AI services for transcription. Avoid speaking sensitive information unless you are using a private or local deployment. </Callout>

Voice data handling:

STT audio is sent to the provider for transcription
TTS audio is cached locally for performance
Audio is not stored permanently by providers
Transcriptions become part of conversation data

Best practices: review transcriptions before sending, don't speak passwords or sensitive data, and clear the local audio cache periodically if you are concerned about storage.