Use Voice Mode with Hermes

This guide is the practical companion to the Voice Mode feature reference.

If the feature page explains what voice mode can do, this guide shows how to actually use it well.

:::tip Nous Portal bundles both the LLM and TTS through one OAuth — voice mode works end-to-end with no extra credentials. :::

What voice mode is good for

Voice mode is especially useful when:

you want a hands-free CLI workflow
you want spoken responses in Telegram or Discord
you want Hermes sitting in a Discord voice channel for live conversation
you want quick idea capture, debugging, or back-and-forth while walking around instead of typing

Choose your voice mode setup

There are really three different voice experiences in Hermes.

Mode	Best for	Platform
Interactive microphone loop	Personal hands-free use while coding or researching	CLI
Voice replies in chat	Spoken responses alongside normal messaging	Telegram, Discord
Live voice channel bot	Group or personal live conversation in a VC	Discord voice channels

A good path is:

get text working first
enable voice replies second
move to Discord voice channels last if you want the full experience

Step 1: make sure normal Hermes works first

Before touching voice mode, verify that:

Hermes starts
your provider is configured
the agent can answer text prompts normally

bash

hermes

Ask something simple:

text

What tools do you have available?

If that is not solid yet, fix text mode first.

Step 2: install the right extras

CLI microphone + playback

bash

pip install "hermes-agent[voice]"

Messaging platforms

bash

pip install "hermes-agent[messaging]"

Premium ElevenLabs TTS

bash

pip install "hermes-agent[tts-premium]"

Local NeuTTS (optional)

bash

python -m pip install -U neutts[all]

Everything

bash

pip install "hermes-agent[all]"

Step 3: install system dependencies

macOS

bash

brew install portaudio ffmpeg opus
brew install espeak-ng

Ubuntu / Debian

bash

sudo apt install portaudio19-dev ffmpeg libopus0
sudo apt install espeak-ng

Why these matter:

portaudio → microphone input / playback for CLI voice mode
ffmpeg → audio conversion for TTS and messaging delivery
opus → Discord voice codec support
espeak-ng → phonemizer backend for NeuTTS

Step 4: choose STT and TTS providers

Hermes supports both local and cloud speech stacks.

Easiest / cheapest setup

Use local STT and free Edge TTS:

STT provider: local
TTS provider: edge

This is usually the best place to start.

Environment file example

Add to ~/.hermes/.env:

bash

# Cloud STT options (local needs no key)
GROQ_API_KEY=***
VOICE_TOOLS_OPENAI_KEY=***

# Premium TTS (optional)
ELEVENLABS_API_KEY=***

Provider recommendations

Speech-to-text

local → best default for privacy and zero-cost use
groq → very fast cloud transcription
openai → good paid fallback

Text-to-speech

edge → free and good enough for most users
neutts → free local/on-device TTS
elevenlabs → best quality
openai → good middle ground
mistral → multilingual, native Opus

If you use `hermes setup`

If you choose NeuTTS in the setup wizard, Hermes checks whether neutts is already installed. If it is missing, the wizard tells you NeuTTS needs the Python package neutts and the system package espeak-ng, offers to install them for you, installs espeak-ng with your platform package manager, and then runs:

bash

python -m pip install -U neutts[all]

If you skip that install or it fails, the wizard falls back to Edge TTS.

Step 5: recommended config

yaml

voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

stt:
  provider: "local"
  local:
    model: "base"

tts:
  provider: "edge"
  edge:
    voice: "en-US-AriaNeural"

This is a good conservative default for most people.

If you want local TTS instead, switch the tts block to:

yaml

tts:
  provider: "neutts"
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu

Use case 1: CLI voice mode

Turn it on

Start Hermes:

bash

hermes

Inside the CLI:

text

/voice on

Recording flow

Default key:

Ctrl+B

Workflow:

press Ctrl+B
speak
wait for silence detection to stop recording automatically
Hermes transcribes and responds
if TTS is on, it speaks the answer
the loop can automatically restart for continuous use

Useful commands

text

/voice
/voice on
/voice off
/voice tts
/voice status

Good CLI workflows

Walk-up debugging

Say:

text

I keep getting a docker permission error. Help me debug it.

Then continue hands-free:

"Read the last error again"
"Explain the root cause in simpler terms"
"Now give me the exact fix"

Research / brainstorming

Great for:

walking around while thinking
dictating half-formed ideas
asking Hermes to structure your thoughts in real time

Accessibility / low-typing sessions

If typing is inconvenient, voice mode is one of the fastest ways to stay in the full Hermes loop.

Tuning CLI behavior

Silence threshold

If Hermes starts/stops too aggressively, tune:

yaml

voice:
  silence_threshold: 250

Higher threshold = less sensitive.

Silence duration

If you pause a lot between sentences, increase:

yaml

voice:
  silence_duration: 4.0

Record key

If Ctrl+B conflicts with your terminal or tmux habits:

yaml

voice:
  record_key: "ctrl+space"

Use case 2: voice replies in Telegram or Discord

This mode is simpler than full voice channels.

Hermes stays a normal chat bot, but can speak replies.

Start the gateway

bash

hermes gateway

Turn on voice replies

Inside Telegram or Discord:

text

/voice on

text

/voice tts

Modes

Mode	Meaning
`off`	text only
`voice_only`	speak only when the user sent voice
`all`	speak every reply

When to use which mode

/voice on if you want spoken replies only for voice-originating messages
/voice tts if you want a full spoken assistant all the time

Good messaging workflows

Telegram assistant on your phone

Use when:

you are away from your machine
you want to send voice notes and get quick spoken replies
you want Hermes to function like a portable research or ops assistant

Discord DMs with spoken output

Useful when you want private interaction without server-channel mention behavior.

Use case 3: Discord voice channels

This is the most advanced mode.

Hermes joins a Discord VC, listens to user speech, transcribes it, runs the normal agent pipeline, and speaks replies back into the channel.

Required Discord permissions

In addition to the normal text-bot setup, make sure the bot has:

Connect
Speak
preferably Use Voice Activity

Also enable privileged intents in the Developer Portal:

Presence Intent
Server Members Intent
Message Content Intent

Join and leave

In a Discord text channel where the bot is present:

text

/voice join
/voice leave
/voice status

What happens when joined

users speak in the VC
Hermes detects speech boundaries
transcripts are posted in the associated text channel
Hermes responds in text and audio
the text channel is the one where /voice join was issued

Best practices for Discord VC use

keep DISCORD_ALLOWED_USERS tight
use a dedicated bot/testing channel at first
verify STT and TTS work in ordinary text-chat voice mode before trying VC mode

Voice quality recommendations

Best quality setup

STT: local large-v3 or Groq whisper-large-v3
TTS: ElevenLabs

Best speed / convenience setup

STT: local base or Groq
TTS: Edge

Best zero-cost setup

STT: local
TTS: Edge

Common failure modes

"No audio device found"

Install portaudio.

"Bot joins but hears nothing"

Check:

your Discord user ID is in DISCORD_ALLOWED_USERS
you are not muted
privileged intents are enabled
the bot has Connect/Speak permissions

"It transcribes but does not speak"

Check:

TTS provider config
API key / quota for ElevenLabs or OpenAI
ffmpeg install for Edge conversion paths

"Whisper outputs garbage"

Try:

quieter environment
higher silence_threshold
different STT provider/model
shorter, clearer utterances

"It works in DMs but not in server channels"

That is often mention policy.

By default, the bot needs an @mention in Discord server text channels unless configured otherwise.

Suggested first-week setup

If you want the shortest path to success:

get text Hermes working
install hermes-agent[voice]
use CLI voice mode with local STT + Edge TTS
then enable /voice on in Telegram or Discord
only after that, try Discord VC mode

That progression keeps the debugging surface small.

Use Voice Mode with Hermes

Use Voice Mode with Hermes

What voice mode is good for

Choose your voice mode setup

Step 1: make sure normal Hermes works first

Step 2: install the right extras

CLI microphone + playback

Messaging platforms

Premium ElevenLabs TTS

Local NeuTTS (optional)

Everything

Step 3: install system dependencies

macOS

Ubuntu / Debian

Step 4: choose STT and TTS providers

Easiest / cheapest setup

Environment file example

Provider recommendations

Speech-to-text

Text-to-speech

If you use hermes setup

Step 5: recommended config

Use case 1: CLI voice mode

Turn it on

Recording flow

Useful commands

Good CLI workflows

Walk-up debugging

Research / brainstorming

Accessibility / low-typing sessions

Tuning CLI behavior

Silence threshold

Silence duration

Record key

Use case 2: voice replies in Telegram or Discord

Start the gateway

Turn on voice replies

Modes

When to use which mode

Good messaging workflows

Telegram assistant on your phone

Discord DMs with spoken output

Use case 3: Discord voice channels

Required Discord permissions

Join and leave

What happens when joined

Best practices for Discord VC use

Voice quality recommendations

Best quality setup

Best speed / convenience setup

Best zero-cost setup

Common failure modes

"No audio device found"

"Bot joins but hears nothing"

"It transcribes but does not speak"

"Whisper outputs garbage"

"It works in DMs but not in server channels"

Suggested first-week setup

Where to read next

If you use `hermes setup`