# Audio Transcription
{: .d-inline-block .no_toc }

v1.9.0+
{: .label .label-green }

{{ page.description }}
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---
After reading this guide, you will know:

- How to transcribe audio files with `RubyLLM.transcribe`
- How to choose between the available transcription models
- How to identify speakers with diarization
- How to improve accuracy with language hints and prompts
- How to work with timestamps and segments
- How to handle long files, timeouts, and errors
## Basic Transcription

Transcribe audio with the global `RubyLLM.transcribe` method:

```ruby
transcription = RubyLLM.transcribe("meeting.wav")

puts transcription.text
# => "Welcome to today's meeting. Let's discuss..."

puts transcription.model
# => "whisper-1"
```

Supports MP3, M4A, WAV, WebM, OGG, and more.
## Choosing a Model

```ruby
# Whisper-1 (default, good for general use)
RubyLLM.transcribe("audio.mp3", model: "whisper-1")

# GPT-4o Transcribe (faster, better for technical content)
RubyLLM.transcribe("audio.mp3", model: "gpt-4o-transcribe")

# GPT-4o Mini Transcribe (fastest, lowest cost)
RubyLLM.transcribe("audio.mp3", model: "gpt-4o-mini-transcribe")

# Diarization model (identifies speakers)
RubyLLM.transcribe("meeting.wav", model: "gpt-4o-transcribe-diarize")

# Gemini 2.5 Flash/Pro (Google's multimodal transcription)
RubyLLM.transcribe(
  "lecture.wav",
  model: "gemini-2.5-flash",
  prompt: "Return only the verbatim transcript."
)
```
Configure the default globally:

```ruby
RubyLLM.configure do |config|
  config.default_transcription_model = "gpt-4o-transcribe"
end
```
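
Once the default is set, calls that omit `model:` use it:

```ruby
# Uses the configured default (gpt-4o-transcribe here)
RubyLLM.transcribe("audio.mp3")
```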
## Specifying the Language

Improve accuracy by specifying the language:

```ruby
RubyLLM.transcribe("entrevista.mp3", language: "es")
RubyLLM.transcribe("conference.mp3", language: "fr")
```

Use ISO 639-1 codes (`en`, `es`, `fr`, `de`, etc.).
## Speaker Diarization

The diarization model identifies different speakers:

```ruby
transcription = RubyLLM.transcribe(
  "team-meeting.wav",
  model: "gpt-4o-transcribe-diarize"
)

transcription.segments.each do |segment|
  puts "#{segment['speaker']}: #{segment['text']}"
  puts "  (#{segment['start']}s - #{segment['end']}s)"
end
# Output:
# A: Hi everyone.
#   (0.5s - 1.2s)
# B: Happy to be here.
#   (2.8s - 3.5s)
```
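
Since each segment exposes `speaker` and `text` keys, you can, for instance, collapse the output into one line per speaker (a small sketch, not part of the library API):

```ruby
transcription.segments
             .group_by { |segment| segment['speaker'] }
             .each do |speaker, segments|
  puts "#{speaker}: #{segments.map { |s| s['text'] }.join(' ')}"
end
```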
### Naming Speakers

Provide 2-10 second reference clips to map speakers to names:

```ruby
transcription = RubyLLM.transcribe(
  "team-meeting.wav",
  model: "gpt-4o-transcribe-diarize",
  speaker_names: ["Alice", "Bob"],
  speaker_references: ["alice-voice.wav", "bob-voice.wav"]
)

# Now segments use the provided names:
# Alice: Hi everyone.
# Bob: Happy to be here.
```

Speaker references accept file paths, URLs, IO objects, or ActiveStorage attachments.
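
For example, mixing reference types (a sketch; the file and URL are placeholders):

```ruby
RubyLLM.transcribe(
  "team-meeting.wav",
  model: "gpt-4o-transcribe-diarize",
  speaker_names: ["Alice", "Bob"],
  speaker_references: [
    File.open("alice-voice.wav"),        # IO object
    "https://example.com/bob-voice.wav"  # URL
  ]
)
```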
Note: Gemini models currently return plain text transcripts without segment metadata. Use OpenAI's diarization models when you need speaker labels or timestamps.
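
If your code may run against either provider, guard on the presence of segments (a defensive sketch; whether `segments` comes back `nil` or empty may vary):

```ruby
transcription = RubyLLM.transcribe("lecture.wav", model: "gemini-2.5-flash")

if transcription.segments&.any?
  transcription.segments.each { |s| puts "#{s['speaker']}: #{s['text']}" }
else
  puts transcription.text # plain text transcript only
end
```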
## Providing Context

Guide the model with context about technical terms or domain-specific vocabulary:

```ruby
RubyLLM.transcribe(
  "developer-talk.mp3",
  prompt: "Discussion about Ruby, Rails, PostgreSQL, and Redis."
)

RubyLLM.transcribe(
  "product-demo.mp3",
  prompt: "Product demo for ZyntriQix, Digique Plus, and CynapseFive."
)
```
### Prompting Gemini Models

Gemini treats transcription requests like any other conversation. Use the `prompt:` argument to steer formatting (for example, "Respond with plain text only."), and combine it with `language:` when you want a specific locale in the final transcript. RubyLLM automatically adds the language hint to the Gemini request.
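
Combining the two:

```ruby
RubyLLM.transcribe(
  "interview.mp3",
  model: "gemini-2.5-flash",
  prompt: "Respond with plain text only.",
  language: "de"
)
```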
## Timestamps

Access detailed timing information:

```ruby
transcription = RubyLLM.transcribe("interview.mp3", model: "gpt-4o-transcribe")

puts "Duration: #{transcription.duration} seconds"

transcription.segments.each do |segment|
  puts "#{segment['start']}s - #{segment['end']}s: #{segment['text']}"
end
```
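
The segment timings are enough to derive subtitle formats. Here is a minimal SRT sketch, assuming `start` and `end` are float seconds as shown above:

```ruby
# Convert float seconds to the SRT timestamp format HH:MM:SS,mmm
def srt_time(seconds)
  ms = (seconds * 1000).round
  format("%02d:%02d:%02d,%03d", ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
end

srt = transcription.segments.each_with_index.map do |segment, i|
  "#{i + 1}\n#{srt_time(segment['start'])} --> #{srt_time(segment['end'])}\n#{segment['text']}\n"
end.join("\n")

File.write("interview.srt", srt)
```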
## Handling Long Audio

The default request timeout is 5 minutes. Increase it for longer audio:

```ruby
RubyLLM.configure do |config|
  config.request_timeout = 600 # 10 minutes
end
```

The API supports files up to 25 MB. For larger files, use compressed formats (MP3, M4A) or split the audio into chunks.
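
One way to chunk is ffmpeg's segment muxer, then transcribing each piece and joining the text (a sketch; the 10-minute chunk length is arbitrary):

```ruby
# First, split on the command line:
#   ffmpeg -i long-recording.mp3 -f segment -segment_time 600 -c copy chunk-%03d.mp3

full_text = Dir.glob("chunk-*.mp3").sort.map do |chunk|
  RubyLLM.transcribe(chunk).text
end.join(" ")
```

Note that naive splitting can cut words at chunk boundaries; splitting on silence gives cleaner results.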
## Error Handling

```ruby
begin
  transcription = RubyLLM.transcribe("audio.mp3")
  puts transcription.text
rescue RubyLLM::BadRequestError => e
  puts "Invalid request: #{e.message}"
rescue RubyLLM::TimeoutError => e
  puts "Transcription timed out: #{e.message}"
rescue RubyLLM::Error => e
  puts "Transcription failed: #{e.message}"
end
```
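
For transient timeouts, a simple retry loop is often enough (a sketch; tune the attempt count to your workload):

```ruby
attempts = 0
begin
  transcription = RubyLLM.transcribe("long-audio.mp3")
rescue RubyLLM::TimeoutError
  attempts += 1
  retry if attempts < 3
  raise
end
```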