Local Speech-to-Text Setup


Run speech-to-text locally for free, private audio/video transcription using OpenAI-compatible STT servers.


Why Local STT?

| Benefit | Description |
|---|---|
| Free | No per-minute costs after setup |
| Private | Audio never leaves your machine |
| Unlimited | No rate limits or quotas |
| Offline | Works without internet |

Quick Start with Speaches

Speaches is an open-source, OpenAI-compatible server that supports both TTS and STT. It uses faster-whisper for transcription.

💡 Ready-made Docker Compose files are available, with complete setup instructions and configuration examples. Just copy and run!

Step 1: Create Docker Compose File

Create a folder and add docker-compose.yml:

```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cpu
    container_name: speaches
    ports:
      - "8969:8000"
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    restart: unless-stopped

volumes:
  hf-hub-cache:
```

Step 2: Start and Download Model

```bash
# Start Speaches
docker compose up -d

# Wait for startup
sleep 10

# Download Whisper model (~500 MB for small)
docker compose exec speaches uv tool run speaches-cli model download Systran/faster-whisper-small
```

Models can also be downloaded automatically on first use, but pre-downloading avoids delays.

Step 3: Test

```bash
# Create a test audio file (or use your own)
# Then transcribe it:
curl "http://localhost:8969/v1/audio/transcriptions" \
  -F "file=@test.wav" \
  -F "model=Systran/faster-whisper-small"
```

You should see the transcribed text in the response.

Step 4: Configure Open Notebook

Via Settings UI (Recommended):

  1. Go to Settings → API Keys
  2. Click Add Credential → Select OpenAI-Compatible
  3. Enter base URL for STT: http://host.docker.internal:8969/v1 (Docker) or http://localhost:8969/v1 (local)
  4. Click Save, then Test Connection

Legacy (Deprecated) — Environment variables:

```yaml
# In your Open Notebook docker-compose.yml
environment:
  - OPENAI_COMPATIBLE_BASE_URL_STT=http://host.docker.internal:8969/v1
```

```bash
# Local development
export OPENAI_COMPATIBLE_BASE_URL_STT=http://localhost:8969/v1
```

Step 5: Add Model in Open Notebook

  1. Go to Settings → Models
  2. Click Add Model in Speech-to-Text section
  3. Configure:
    • Provider: openai_compatible
    • Model Name: Systran/faster-whisper-small
    • Display Name: Local Whisper
  4. Click Save
  5. Set as default if desired

Available Models

Speaches supports various Whisper model sizes. Larger models are more accurate but slower:

| Model | Size | Speed | Accuracy | VRAM (GPU) |
|---|---|---|---|---|
| Systran/faster-whisper-tiny | ~75 MB | Fastest | Basic | ~1 GB |
| Systran/faster-whisper-base | ~150 MB | Fast | Good | ~1 GB |
| Systran/faster-whisper-small | ~500 MB | Medium | Better | ~2 GB |
| Systran/faster-whisper-medium | ~1.5 GB | Slow | Great | ~5 GB |
| Systran/faster-whisper-large-v3 | ~3 GB | Slowest | Best | ~10 GB |
| Systran/faster-distil-whisper-small.en | ~400 MB | Fast | Good (English only) | ~2 GB |

List Available Models

```bash
docker compose exec speaches uv tool run speaches-cli registry ls --task automatic-speech-recognition
```

Recommendations:

  • For speed: Systran/faster-whisper-tiny or Systran/faster-whisper-base
  • For balance: Systran/faster-whisper-small (recommended)
  • For accuracy: Systran/faster-whisper-large-v3

GPU Acceleration

For faster transcription with NVIDIA GPUs:

```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    container_name: speaches
    ports:
      - "8969:8000"
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    environment:
      - WHISPER__TTL=-1  # Keep model in VRAM (recommended if you have enough memory)
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  hf-hub-cache:
```

Keep Model in Memory

By default, Speaches unloads models after some time. To keep the Whisper model loaded for instant transcription:

```yaml
environment:
  - WHISPER__TTL=-1  # Never unload
```

This is recommended if you have enough RAM/VRAM, as loading the model can take a few seconds.


Docker Networking

When configuring your OpenAI-Compatible credential in Settings → API Keys, use the appropriate STT base URL for your setup:

Open Notebook in Docker (macOS/Windows)

STT Base URL: http://host.docker.internal:8969/v1

Open Notebook in Docker (Linux)

STT Base URL (Option 1 — Docker bridge IP): http://172.17.0.1:8969/v1

Option 2: Use host networking mode (docker run --network host ...), then use: http://localhost:8969/v1

Remote Server

Run Speaches on a different machine:

STT Base URL: http://server-ip:8969/v1 (replace with your server's IP)
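The three cases above can be summarized as a small lookup. The map below simply collects the base URLs described in this section under illustrative deployment keys (the keys are not Open Notebook settings); `server-ip` stays a placeholder for your server's address:

```python
# Illustrative map of the STT base URLs described above, keyed by where
# Open Notebook runs relative to Speaches. Replace "server-ip" with a real IP.
STT_BASE_URLS = {
    "docker-desktop": "http://host.docker.internal:8969/v1",  # macOS/Windows Docker
    "docker-linux": "http://172.17.0.1:8969/v1",              # default bridge network
    "host-network": "http://localhost:8969/v1",               # --network host or local dev
    "remote": "http://server-ip:8969/v1",                     # Speaches on another machine
}

def stt_base_url(deployment: str) -> str:
    """Pick the base URL for a deployment scenario; raises KeyError otherwise."""
    return STT_BASE_URLS[deployment]
```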


Language Support

Whisper supports 99+ languages. Specify the language for better accuracy:

```bash
curl "http://localhost:8969/v1/audio/transcriptions" \
  -F "file=@audio.wav" \
  -F "model=Systran/faster-whisper-small" \
  -F "language=ru"
```

Common language codes:

  • en - English
  • ru - Russian
  • es - Spanish
  • fr - French
  • de - German
  • zh - Chinese
  • ja - Japanese
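If you are wiring this up in code, a small lookup table keeps callers from passing full language names where Whisper expects ISO 639-1 codes. The map and helper below are hypothetical (not part of any library) and cover only the codes listed above:

```python
# Hypothetical helper: map common language names to the ISO 639-1 codes
# Whisper expects in the "language" form field.
LANGUAGE_CODES = {
    "english": "en",
    "russian": "ru",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "chinese": "zh",
    "japanese": "ja",
}

def language_code(name: str) -> str:
    """Case-insensitive lookup; raises KeyError for unknown languages."""
    return LANGUAGE_CODES[name.strip().lower()]

print(language_code("Russian"))  # ru (matches -F "language=ru" above)
```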

Troubleshooting

Service Won't Start

```bash
# Check logs
docker compose logs speaches

# Verify port available
lsof -i :8969

# Restart
docker compose down && docker compose up -d
```

Connection Refused

```bash
# Test Speaches is running
curl http://localhost:8969/v1/models

# From inside Open Notebook container
docker exec -it open-notebook curl http://host.docker.internal:8969/v1/models
```

Model Download Fails

Models are downloaded automatically on first use. If download fails:

```bash
# Check available disk space
df -h

# Check Docker logs for errors
docker compose logs speaches

# Restart and try again
docker compose restart speaches
```

Poor Transcription Quality

  • Use a larger model (faster-whisper-medium or large-v3)
  • Specify the correct language
  • Ensure audio quality is good (clear speech, minimal background noise)
  • Try different audio formats (WAV often works better than MP3)

Slow Transcription

| Solution | How |
|---|---|
| Use GPU | Switch to latest-cuda image |
| Smaller model | Use faster-whisper-tiny or base |
| More CPU | Allocate more cores in Docker |
| SSD storage | Move Docker volumes to SSD |

Performance Tips

| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4+ cores |
| RAM | 2 GB | 8+ GB |
| Storage | 5 GB | 10 GB (for multiple models) |
| GPU | None | NVIDIA (optional, much faster) |

Resource Limits

```yaml
services:
  speaches:
    # ... other config
    mem_limit: 4g
    cpus: 2
```

Monitor Usage

```bash
docker stats speaches
```

Comparison: Local vs Cloud

| Aspect | Local (Speaches) | Cloud (OpenAI Whisper) |
|---|---|---|
| Cost | Free | $0.006/min |
| Privacy | Complete | Data sent to provider |
| Speed | Depends on hardware | Usually faster |
| Quality | Excellent (same Whisper) | Excellent |
| Setup | Moderate | Simple API key |
| Offline | Yes | No |
| Languages | 99+ | 99+ |
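The break-even arithmetic behind the Cost row is easy to sketch. Assuming the $0.006/min cloud rate cited above (verify current pricing), cloud spend scales linearly with audio volume while local transcription is free after setup:

```python
# Back-of-envelope cost comparison, assuming the $0.006/min cloud rate
# from the table above.
CLOUD_RATE_PER_MIN = 0.006  # USD per minute of audio

def cloud_cost_usd(hours_of_audio: float) -> float:
    """Cloud transcription cost in USD for a given number of audio hours."""
    return round(hours_of_audio * 60 * CLOUD_RATE_PER_MIN, 2)

for hours in (10, 100, 1000):
    print(f"{hours:>5} h of audio -> ${cloud_cost_usd(hours):.2f}")
# 10 h costs $3.60, 100 h costs $36.00, 1000 h costs $360.00
```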

When to Use Local

  • Privacy-sensitive content
  • High-volume transcription
  • Development/testing
  • Offline environments
  • Cost control

When to Use Cloud

  • Limited hardware
  • Time-sensitive projects
  • No GPU available
  • Simple setup preferred

Using Both TTS and STT

Speaches supports both TTS and STT in one server. In Settings → API Keys, add a single OpenAI-Compatible credential and configure both the TTS and STT base URLs to point to the same Speaches server (e.g., http://localhost:8969/v1).
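Concretely, both features hit different paths under the same base URL. The sketch below (illustrative helper names, not Open Notebook internals) derives the two OpenAI-compatible endpoints from one Speaches base URL:

```python
# Both TTS and STT live under the same OpenAI-compatible base URL.
# Helper names are illustrative only.
def stt_endpoint(base_url: str) -> str:
    """POST audio here for transcription."""
    return base_url.rstrip("/") + "/audio/transcriptions"

def tts_endpoint(base_url: str) -> str:
    """POST text here for speech synthesis."""
    return base_url.rstrip("/") + "/audio/speech"

base = "http://localhost:8969/v1"
print(stt_endpoint(base))  # http://localhost:8969/v1/audio/transcriptions
print(tts_endpoint(base))  # http://localhost:8969/v1/audio/speech
```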

See Local TTS Setup for TTS configuration.


Other Local STT Options

Any OpenAI-compatible STT server works:

| Server | Description |
|---|---|
| Speaches | TTS + STT in one (recommended) |
| faster-whisper-server | Lightweight STT only |
| whisper.cpp | C++ implementation with server mode |
| LocalAI | Multi-model local AI server |

The key requirements:

  1. Server implements /v1/audio/transcriptions endpoint
  2. Add an OpenAI-Compatible credential in Settings → API Keys with the STT base URL
  3. Add model with provider openai_compatible
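Requirement 1 can be sanity-checked from code before touching the Open Notebook UI. The hypothetical helper below (stdlib only) GETs the `/models` listing under a configured base URL; a fuller check would also POST a sample file to `/audio/transcriptions`:

```python
# Hypothetical liveness check for any OpenAI-compatible STT server.
import urllib.error
import urllib.request

def models_url(base_url: str) -> str:
    """The model-listing endpoint OpenAI-compatible servers expose."""
    return base_url.rstrip("/") + "/models"

def server_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """True if the server answers GET /models, False on any network error."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run it with your STT base URL (e.g. `server_reachable("http://localhost:8969/v1")`) before saving the credential.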