Pipelines and workflows

[VoicePipeline][agents.voice.pipeline.VoicePipeline] is a class that makes it easy to turn your agentic workflows into a voice app. You pass in a workflow to run, and the pipeline takes care of transcribing input audio, detecting when the audio ends, calling your workflow at the right time, and turning the workflow output back into audio.

```mermaid
graph LR
    %% Input
    A["🎤 Audio Input"]

    %% Voice Pipeline
    subgraph Voice_Pipeline [Voice Pipeline]
        direction TB
        B["Transcribe (speech-to-text)"]
        C["Your Code"]:::highlight
        D["Text-to-speech"]
        B --> C --> D
    end

    %% Output
    E["🎧 Audio Output"]

    %% Flow
    A --> Voice_Pipeline
    Voice_Pipeline --> E

    %% Custom styling
    classDef highlight fill:#ffcc66,stroke:#333,stroke-width:1px,font-weight:700;
```
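
The three stages in the diagram can be sketched in plain Python. This is a conceptual stand-in only, not the SDK's implementation: `fake_transcribe`, `echo_workflow`, `fake_synthesize`, and `run_once` are hypothetical names, and the "audio" is UTF-8 text so the example runs without any speech model.

```python
import asyncio
from typing import AsyncIterator


async def fake_transcribe(audio: bytes) -> str:
    # Hypothetical speech-to-text stage: treats the bytes as UTF-8 text.
    return audio.decode("utf-8")


async def echo_workflow(transcription: str) -> AsyncIterator[str]:
    # "Your code": receives the transcription and yields text to be spoken.
    yield f"You said: {transcription}"


async def fake_synthesize(text: str) -> bytes:
    # Hypothetical text-to-speech stage: encodes the text back to bytes.
    return text.encode("utf-8")


async def run_once(audio: bytes) -> list[bytes]:
    # Transcribe, run the workflow, then synthesize each piece of its output.
    text = await fake_transcribe(audio)
    chunks = []
    async for piece in echo_workflow(text):
        chunks.append(await fake_synthesize(piece))
    return chunks


output = asyncio.run(run_once(b"hello"))
```

Only the middle stage is application code; the outer two stages are what the pipeline supplies for you.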

Configuring a pipeline

When you create a pipeline, you can set a few things:

- The [workflow][agents.voice.workflow.VoiceWorkflowBase], which is the code that runs each time new audio is transcribed.
- The [speech-to-text][agents.voice.model.STTModel] and [text-to-speech][agents.voice.model.TTSModel] models used.
- The [config][agents.voice.pipeline_config.VoicePipelineConfig], which lets you configure things like:
    - A model provider, which can map model names to models
    - Tracing, including whether to disable tracing, whether audio files are uploaded, the workflow name, trace IDs, etc.
    - Settings on the TTS and STT models, such as the prompt, language, and data types used.
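
To make "a model provider, which can map model names to models" concrete, here is a minimal sketch of that idea. `StubModel` and `DictModelProvider` are hypothetical classes written for this illustration; they are not the SDK's provider interface, and the model names are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StubModel:
    # Hypothetical placeholder for an STT or TTS model object.
    name: str


class DictModelProvider:
    """Hypothetical provider that resolves a model name to a model object."""

    def __init__(self, models: dict[str, StubModel], default: str) -> None:
        self._models = models
        self._default = default

    def get_model(self, model_name: Optional[str]) -> StubModel:
        # Fall back to the default model when no name is given.
        key = model_name or self._default
        try:
            return self._models[key]
        except KeyError:
            raise ValueError(f"unknown model name: {key!r}") from None


provider = DictModelProvider(
    {"fast-stt": StubModel("fast-stt"), "accurate-stt": StubModel("accurate-stt")},
    default="fast-stt",
)
chosen = provider.get_model("accurate-stt")
fallback = provider.get_model(None)
```

Keeping the name-to-model lookup in one place means the rest of the configuration can refer to models by name only.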

Running a pipeline

You can run a pipeline via the [run()][agents.voice.pipeline.VoicePipeline.run] method, which lets you pass in audio input in two forms:

- [AudioInput][agents.voice.input.AudioInput] is used when you have a complete audio input and just want to produce a result for it. This is useful in cases where you don't need to detect when a speaker is done speaking; for example, when you have pre-recorded audio or in push-to-talk apps where it's clear when the user is done speaking.
- [StreamedAudioInput][agents.voice.input.StreamedAudioInput] is used when you might need to detect when a user is done speaking. It allows you to push audio chunks as they are detected, and the voice pipeline will automatically run the agent workflow at the right time, via a process called "activity detection".
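
The streamed case can be pictured as a consumer that reads chunks from a queue and decides when a turn has ended. The sketch below uses a simple amplitude threshold for illustration; it is an assumption for this example, not the detection method the SDK or its models use, and `SILENCE_THRESHOLD`, `SILENT_CHUNKS_TO_END`, and `detect_turns` are hypothetical names.

```python
import asyncio
from typing import Optional

SILENCE_THRESHOLD = 500    # hypothetical peak-amplitude cutoff for "quiet"
SILENT_CHUNKS_TO_END = 2   # hypothetical: this many quiet chunks end a turn


def is_silent(chunk: list[int]) -> bool:
    # A chunk is quiet when no sample reaches the threshold.
    return max((abs(sample) for sample in chunk), default=0) < SILENCE_THRESHOLD


async def detect_turns(queue: "asyncio.Queue[Optional[list[int]]]") -> list[list[int]]:
    turns: list[list[int]] = []
    current: list[int] = []
    quiet = 0
    while True:
        chunk = await queue.get()
        if chunk is None:  # sentinel: the stream has closed
            break
        if is_silent(chunk):
            quiet += 1
            # Enough silence after speech: the turn is complete.
            if current and quiet >= SILENT_CHUNKS_TO_END:
                turns.append(current)
                current = []
        else:
            quiet = 0
            current.extend(chunk)
    if current:  # flush speech still buffered when the stream closed
        turns.append(current)
    return turns


async def demo() -> list[list[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    loud, soft = [4000, -3500], [10, -20]
    for chunk in [loud, loud, soft, soft, loud, soft, None]:
        queue.put_nowait(chunk)
    return await detect_turns(queue)


turns = asyncio.run(demo())
```

Each completed turn is the point at which a pipeline would hand the buffered speech to the workflow. With a complete, pre-recorded input there is nothing to detect, which is why the non-streamed form skips this step.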

Results

The result of a voice pipeline run is a [StreamedAudioResult][agents.voice.result.StreamedAudioResult]. This is an object that lets you stream events as they occur. There are a few kinds of [VoiceStreamEvent][agents.voice.events.VoiceStreamEvent], including:

- [VoiceStreamEventAudio][agents.voice.events.VoiceStreamEventAudio], which contains a chunk of audio.
- [VoiceStreamEventLifecycle][agents.voice.events.VoiceStreamEventLifecycle], which informs you of lifecycle events like a turn starting or ending.
- [VoiceStreamEventError][agents.voice.events.VoiceStreamEventError], which is an error event.

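```python
# Run this inside an async function. `pipeline` is a VoicePipeline and
# `audio_input` is an AudioInput or StreamedAudioInput.
result = await pipeline.run(audio_input)

async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        # Play the audio chunk carried by this event.
        pass
    elif event.type == "voice_stream_event_lifecycle":
        # React to a lifecycle change, such as a turn starting or ending.
        pass
    elif event.type == "voice_stream_event_error":
        # Handle or report the error.
        pass
```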

Best practices

Interruptions

The Agents SDK currently does not provide built-in interruption handling for [StreamedAudioInput][agents.voice.input.StreamedAudioInput]. Instead, every detected turn triggers a separate run of your workflow. To handle interruptions inside your application, listen for [VoiceStreamEventLifecycle][agents.voice.events.VoiceStreamEventLifecycle] events. `turn_started` indicates that a new turn was transcribed and processing is beginning. `turn_ended` is emitted after all the audio for that turn has been dispatched. You could use these events to mute the speaker's microphone when the model starts a turn and unmute it after you have flushed all the related audio for that turn.
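
The mute-and-unmute approach might look roughly like the sketch below. It runs against stand-in objects rather than the SDK: `AudioEvent`, `LifecycleEvent`, `Microphone`, and `fake_stream` are hypothetical, and the `event` field holding the lifecycle name is an assumption made for this illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Union


@dataclass
class AudioEvent:
    # Stand-in for an audio event carrying one chunk of output audio.
    data: bytes
    type: str = "voice_stream_event_audio"


@dataclass
class LifecycleEvent:
    # Stand-in for a lifecycle event; `event` names the transition (assumed field).
    event: str
    type: str = "voice_stream_event_lifecycle"


class Microphone:
    """Hypothetical microphone control that only records state changes."""

    def __init__(self) -> None:
        self.muted = False
        self.log: list[str] = []

    def mute(self) -> None:
        self.muted = True
        self.log.append("mute")

    def unmute(self) -> None:
        self.muted = False
        self.log.append("unmute")


async def fake_stream() -> AsyncIterator[Union[AudioEvent, LifecycleEvent]]:
    # One simulated turn: start, two audio chunks, end.
    yield LifecycleEvent("turn_started")
    yield AudioEvent(b"a")
    yield AudioEvent(b"b")
    yield LifecycleEvent("turn_ended")


async def play_with_muting(mic: Microphone) -> list[bytes]:
    played: list[bytes] = []
    async for event in fake_stream():
        if event.type == "voice_stream_event_lifecycle":
            if event.event == "turn_started":
                mic.mute()      # stop capturing while the response plays
            elif event.event == "turn_ended":
                mic.unmute()    # all audio for the turn has been handled
        elif event.type == "voice_stream_event_audio":
            played.append(event.data)  # a real app would send this to a speaker
    return played


mic = Microphone()
played = asyncio.run(play_with_muting(mic))
```

In a real application the unmute should wait until the playback buffer has actually drained, since audio can be dispatched before it has finished playing.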