ASR (Automatic Speech Recognition)

This package handles speech-to-text for PicoClaw voice input.

If you are new to ASR setup, the simplest mental model is:

  1. Add one or more ASR-capable entries to model_list.
  2. Point voice.model_name at the one you want to use.
  3. Put the API key in .security.yml.

Quick Recommendation

For most new users, start with one of these:

| Provider | Example model | Why start here |
| --- | --- | --- |
| Groq | groq/whisper-large-v3-turbo | Fast Whisper-style transcription and a straightforward OpenAI-compatible API. Groq currently advertises a free tier of 2,000 requests per day. |
| ElevenLabs | elevenlabs/scribe_v1 | Easy setup and strong speech-to-text quality. ElevenLabs currently advertises a free plan that includes speech-to-text usage. |

Pricing and free-plan limits can change, so check each provider's pricing page before depending on them in production.

How ASR Configuration Works

PicoClaw does not keep ASR API keys inside the voice section.

Instead:

  • voice.model_name chooses a named entry from model_list.
  • The matching model_list entry describes the actual provider and model.
  • .security.yml stores the API key for that named model entry.

This is the recommended pattern because it is explicit, reusable, and consistent with the rest of PicoClaw's model configuration.
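The lookup described above can be sketched in a few lines. This is a minimal Python illustration, not PicoClaw's Go implementation: `resolve_asr_model` is a hypothetical helper, and both files are assumed to be already parsed into dictionaries (parsing `.security.yml` would need a YAML library).

```python
# Illustrative sketch only: mirrors the lookup described above.
# resolve_asr_model is a hypothetical name, not a PicoClaw function.

def resolve_asr_model(config: dict, security: dict) -> dict:
    """Return the provider model and API keys selected by voice.model_name."""
    name = config.get("voice", {}).get("model_name")
    if not name:
        raise ValueError("voice.model_name is not set")

    # voice.model_name picks a named entry from model_list.
    entry = next(
        (m for m in config.get("model_list", []) if m.get("model_name") == name),
        None,
    )
    if entry is None:
        raise ValueError(f"no model_list entry named {name!r}")

    # The API key is stored in .security.yml under the same name.
    keys = security.get("model_list", {}).get(name, {}).get("api_keys", [])
    if not keys:
        raise ValueError(f"no api_keys for {name!r} in .security.yml")

    return {"model": entry["model"], "api_keys": keys}


# Parsed config.json and .security.yml, using the Groq values from this README.
config = {
    "voice": {"model_name": "groq-asr", "echo_transcription": True},
    "model_list": [
        {"model_name": "groq-asr", "model": "groq/whisper-large-v3-turbo"}
    ],
}
security = {"model_list": {"groq-asr": {"api_keys": ["gsk_your_groq_key"]}}}

resolved = resolve_asr_model(config, security)
print(resolved["model"])
```

Because the key is found by name, the same `model_list` entry can be referenced from other parts of the configuration without copying the key.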

Option A: Groq Whisper

config.json

```json
{
  "voice": {
    "model_name": "groq-asr",
    "echo_transcription": true
  },
  "model_list": [
    {
      "model_name": "groq-asr",
      "model": "groq/whisper-large-v3-turbo"
    }
  ]
}
```

.security.yml

```yaml
model_list:
  groq-asr:
    api_keys:
      - "gsk_your_groq_key"
```

Notes:

  • You can omit api_base and PicoClaw will use Groq's default API base automatically.
  • If you set api_base manually for Groq Whisper, both of these forms work:
    • https://api.groq.com/openai/v1
    • https://api.groq.com/openai/v1/audio/transcriptions
  • Any OpenAI-compatible model whose name contains whisper uses the Whisper transcription path, not only whisper-large-v3-turbo.
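One way to picture the two notes about api_base and model names is the Python sketch below. It is an illustration under assumptions, not PicoClaw's code: `transcription_url` and `is_whisper_model` are hypothetical helpers, the default base is assumed to be Groq's OpenAI-compatible URL listed above, and the case-insensitive name match is a guess.

```python
# Assumption: the default base is Groq's OpenAI-compatible URL from the notes above.
GROQ_DEFAULT_API_BASE = "https://api.groq.com/openai/v1"
TRANSCRIPTIONS_PATH = "/audio/transcriptions"


def transcription_url(api_base: str = "") -> str:
    """Accept either api_base form and return the full transcriptions URL."""
    base = (api_base or GROQ_DEFAULT_API_BASE).rstrip("/")
    if base.endswith(TRANSCRIPTIONS_PATH):
        return base  # already the full endpoint
    return base + TRANSCRIPTIONS_PATH


def is_whisper_model(model: str) -> bool:
    """True when the model name (after the provider prefix) contains 'whisper'."""
    name = model.split("/", 1)[-1]
    # Assumption: the match ignores case.
    return "whisper" in name.lower()


print(transcription_url())
print(transcription_url("https://api.groq.com/openai/v1/audio/transcriptions"))
print(is_whisper_model("groq/whisper-large-v3-turbo"))
```

Both calls to `transcription_url` resolve to the same endpoint, which is why either api_base form is harmless.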

Option B: ElevenLabs

config.json

```json
{
  "voice": {
    "model_name": "elevenlabs-asr",
    "echo_transcription": true
  },
  "model_list": [
    {
      "model_name": "elevenlabs-asr",
      "model": "elevenlabs/scribe_v1"
    }
  ]
}
```

.security.yml

```yaml
model_list:
  elevenlabs-asr:
    api_keys:
      - "sk-elevenlabs-your-key"
```

Option C: OpenAI Whisper

config.json

```json
{
  "voice": {
    "model_name": "openai-asr"
  },
  "model_list": [
    {
      "model_name": "openai-asr",
      "model": "openai/whisper-1"
    }
  ]
}
```

.security.yml

```yaml
model_list:
  openai-asr:
    api_keys:
      - "sk-openai-your-key"
```

Other ASR-Capable Model Types

PicoClaw currently supports three main ASR routes:

| Route | Example models | Behavior |
| --- | --- | --- |
| ElevenLabs ASR | elevenlabs/scribe_v1 | Uses the ElevenLabs transcription API. |
| Whisper endpoint models | openai/whisper-1, groq/whisper-large-v3 | Uses an OpenAI-compatible /audio/transcriptions endpoint. |
| Audio-capable chat models (Under construction) | openai/gpt-4o-audio-preview, gemini/gemini-2.5-flash | Sends audio to a multimodal chat model and asks it to transcribe. |

If you are unsure which one to pick, choose Groq Whisper or ElevenLabs first.
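For the Whisper endpoint route, it can help to see what an OpenAI-compatible transcription request looks like on the wire. The Python sketch below builds such a request without sending it. It relies on the publicly documented OpenAI-style contract (a multipart POST with `model` and `file` fields and a Bearer token); `build_transcription_request` is a hypothetical helper, and the sketch assumes the caller passes the provider-side model name without the `groq/` or `openai/` prefix. It does not show how PicoClaw itself issues the call.

```python
import urllib.request
import uuid


def build_transcription_request(api_base, api_key, model, filename, audio):
    """Build (but do not send) a multipart POST for /audio/transcriptions."""
    boundary = uuid.uuid4().hex
    marker = b"--" + boundary.encode()
    # Multipart body: one text field ("model") and one file field ("file").
    body = b"\r\n".join([
        marker,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        marker,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        audio,
        marker + b"--",
        b"",
    ])
    return urllib.request.Request(
        api_base.rstrip("/") + "/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder key and audio bytes, for illustration only.
req = build_transcription_request(
    "https://api.groq.com/openai/v1",
    "gsk_your_groq_key",
    "whisper-large-v3-turbo",
    "clip.wav",
    b"RIFF",
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. With a valid key and real audio, an OpenAI-compatible endpoint is expected to answer with JSON containing a `text` field.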

How PicoClaw Chooses a Transcriber

DetectTranscriber resolves ASR in this order:

  1. Preferred path: resolve voice.model_name against model_list.
  2. If that resolved model is:
    • elevenlabs/..., PicoClaw uses the ElevenLabs transcriber.
    • an OpenAI-compatible Whisper model, PicoClaw uses the Whisper transcriber.
    • an audio-capable chat model, PicoClaw uses AudioModelTranscriber.
  3. Fallback path: if voice.model_name is not set, PicoClaw performs a compatibility scan through model_list for legacy auto-detected ASR entries.

Fallback scanning exists for backward compatibility. New configurations should set voice.model_name explicitly.
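The resolution order above can be sketched as follows. This Python version is an approximation for illustration, not the Go `DetectTranscriber`: the `classify` helper, the `AUDIO_CHAT_MODELS` set, and the rule that the fallback scan returns the first entry that looks ASR-capable are all assumptions.

```python
# Hypothetical stand-in for however PicoClaw recognizes audio-capable chat models.
AUDIO_CHAT_MODELS = {"openai/gpt-4o-audio-preview", "gemini/gemini-2.5-flash"}


def classify(model: str):
    """Map a provider/model string to a transcriber route, or None."""
    if model.startswith("elevenlabs/"):
        return "elevenlabs"
    if "whisper" in model.split("/", 1)[-1]:
        return "whisper"
    if model in AUDIO_CHAT_MODELS:
        return "audio_model"
    return None


def detect_transcriber(config: dict):
    models = config.get("model_list", [])
    name = config.get("voice", {}).get("model_name")

    if name:
        # Preferred path: resolve voice.model_name against model_list.
        for entry in models:
            if entry.get("model_name") == name:
                return classify(entry.get("model", ""))
        return None

    # Fallback path (assumed behavior): first entry that looks ASR-capable.
    for entry in models:
        route = classify(entry.get("model", ""))
        if route is not None:
            return route
    return None


explicit = {
    "voice": {"model_name": "elevenlabs-asr"},
    "model_list": [
        {"model_name": "groq-asr", "model": "groq/whisper-large-v3-turbo"},
        {"model_name": "elevenlabs-asr", "model": "elevenlabs/scribe_v1"},
    ],
}
print(detect_transcriber(explicit))
```

With `voice.model_name` set, the order of `model_list` does not matter. Without it, the result depends on the scan, which is one reason to set the name explicitly.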

Common Mistakes

  • Defining an ASR model in model_list but forgetting to set voice.model_name.
  • Putting the API key in voice instead of .security.yml.
  • Using a non-ASR model and expecting Whisper-style transcription behavior.
  • Setting a custom api_base that points to the wrong provider endpoint.

Minimal Checklist

Before testing voice input, make sure:

  • voice.model_name matches a model_list[].model_name.
  • The matching .security.yml entry contains a valid API key.
  • The selected model is actually ASR-capable.
  • Voice input is enabled for the channel you are using.
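Part of this checklist can be automated. The Python sketch below is a hypothetical lint, not a PicoClaw tool: `check_asr_config` is an invented name, both files are assumed to be parsed into dictionaries, and treating any `voice` field containing `api_key` as a misplaced key is an assumption. It cannot tell whether a key is valid, whether the model is ASR-capable, or whether voice input is enabled for a channel.

```python
def check_asr_config(config: dict, security: dict) -> list:
    """Collect readable problems instead of stopping at the first one."""
    problems = []
    voice = config.get("voice", {})
    name = voice.get("model_name")
    known = {m.get("model_name") for m in config.get("model_list", [])}

    if not name:
        problems.append("voice.model_name is not set")
    elif name not in known:
        problems.append(f"voice.model_name {name!r} matches no model_list entry")
    else:
        # Only look for a key once the name is known to resolve.
        keys = security.get("model_list", {}).get(name, {}).get("api_keys") or []
        if not any(keys):
            problems.append(f".security.yml has no API key for {name!r}")

    # Assumption: any voice field containing "api_key" is a misplaced key.
    for field in voice:
        if "api_key" in field:
            problems.append(f"voice.{field} should move to .security.yml")
    return problems


broken = {
    "voice": {"api_key": "gsk_your_groq_key"},
    "model_list": [
        {"model_name": "groq-asr", "model": "groq/whisper-large-v3-turbo"}
    ],
}
for problem in check_asr_config(broken, {}):
    print(problem)
```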