Processing audio and video

Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured DoclingDocument — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.

Under the hood, Docling transcribes with OpenAI Whisper. By default the pipeline auto-selects the best backend for your hardware — mlx-whisper on Apple Silicon and native Whisper everywhere else — so the basic example below needs no configuration. To change the model size, force a particular backend, or opt into the faster experimental WhisperS2T backend, see Choosing an ASR model and backend.

Supported formats

| Type | Formats |
| --- | --- |
| Audio | WAV, MP3, M4A, AAC, OGG, FLAC |
| Video | MP4, AVI, MOV |

For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg yourself, but the ffmpeg executable must be installed.

!!! note "ffmpeg required"
    Whisper audio decoding requires the `ffmpeg` executable to be installed and available on your `PATH`. This applies to common audio formats such as MP3, WAV, M4A, AAC, OGG, and FLAC, and to video files whose audio track is extracted before transcription. Install it with your system package manager, for example `brew install ffmpeg` on macOS, `apt-get install ffmpeg` on Debian-based Linux, or `winget install ffmpeg` on Windows.
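
A missing ffmpeg is easier to diagnose before conversion starts than from a failure partway through. The check below is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is illustrative and not part of Docling.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it before transcribing audio or video")
```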

Installation

The ASR pipeline is an optional extra. Install it alongside the base package:

```bash
pip install "docling[asr]"
```

Or with uv:

```bash
uv add "docling[asr]"
```

!!! note "WhisperS2T on Linux with CUDA"
    The optional WhisperS2T backend uses CTranslate2, which loads NVIDIA's cuBLAS shared library at runtime. On Linux, if WhisperS2T model loading fails because the library cannot be found, add it to your `LD_LIBRARY_PATH`. When cuBLAS is installed from a pip wheel (e.g. `nvidia-cublas-cu12`), the shared library lives under the `nvidia/cublas/lib` directory inside your environment's site-packages.
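
One way to find that directory is to look under the active environment's site-packages. This is a minimal sketch, assuming the pip wheel layout described in the note; `find_cublas_lib_dirs` is a hypothetical helper, not a Docling API.

```python
import sysconfig
from pathlib import Path


def find_cublas_lib_dirs(site_packages_dirs=None):
    """Return existing nvidia/cublas/lib directories under the given site-packages dirs."""
    if site_packages_dirs is None:
        # Default to the site-packages locations of the running interpreter.
        paths = sysconfig.get_paths()
        site_packages_dirs = [paths["purelib"], paths["platlib"]]
    found = []
    for base in site_packages_dirs:
        candidate = Path(base) / "nvidia" / "cublas" / "lib"
        if candidate.is_dir() and candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    for lib_dir in find_cublas_lib_dirs():
        # Run the printed line in your shell before starting Python.
        print(f'export LD_LIBRARY_PATH="{lib_dir}:$LD_LIBRARY_PATH"')
```

If nothing is printed, cuBLAS was probably not installed from a pip wheel in this environment, and the library has to be located some other way.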

Basic usage

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())
```

The same code works for video — pass an .mp4, .mov, or .avi path and Docling handles the rest.

Exporting to different formats

result.document is a DoclingDocument. You can export it to any supported format:

```python
doc.export_to_markdown()   # Markdown
doc.export_to_dict()       # JSON-serializable dict
doc.export_to_html()       # HTML
doc.export_to_doctags()    # DocTags
```

See Serialization for more on export options.
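
To keep a transcript on disk in more than one format, the export calls above can be wrapped in a small helper. This is a sketch rather than a Docling API: `save_exports` and its file naming are illustrative, and it relies only on the `export_to_markdown()` and `export_to_dict()` methods listed above.

```python
import json
from pathlib import Path


def save_exports(doc, out_dir, stem):
    """Write Markdown and JSON exports of a document; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"

    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    # export_to_dict() returns a JSON-serializable dict, so json.dumps can write it directly.
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With the converter from Basic usage, a call might look like `save_exports(result.document, "transcripts", "recording")`.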

Understanding the output

The ASR pipeline produces paragraph-level Markdown with timestamps per segment:

[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.

This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.
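
If a downstream stage needs the timestamps as numbers, the segment lines can be parsed back out of the Markdown. The sketch below assumes every segment line follows the `[time: start-end]  text` pattern shown in the sample above; the `Segment` type and `parse_segments` function are illustrative names, not part of Docling.

```python
import re
from typing import NamedTuple

# Matches lines such as "[time: 5.28-9.96]  This is a LibriVox recording."
SEGMENT_RE = re.compile(r"^\[time:\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


class Segment(NamedTuple):
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def parse_segments(markdown: str) -> list[Segment]:
    """Extract timestamped segments from the exported Markdown; skip other lines."""
    segments = []
    for line in markdown.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append(Segment(float(start), float(end), text.strip()))
    return segments
```

Each `Segment` can then be stored alongside its embedding, so a search hit can point to a position in the recording.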

A practical use case: searchable meeting archives

A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files.

Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive.

Standalone transcription script

For a full working example, see the example-docling-media repository, which processes a directory of audio/video files and writes each transcript to a Markdown file.

The core of that project is ~30 lines:

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def main():
    audio_path = Path("videoplayback.mp3")

    # Configure the ASR pipeline with an auto-selecting Whisper preset
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Transcribe, then write the Markdown transcript next to the script
    result = converter.convert(audio_path)
    md = result.document.export_to_markdown()
    Path("transcript.md").write_text(md, encoding="utf-8")
    print(md)


if __name__ == "__main__":
    main()
```
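
The script above handles a single file. One way to extend it to a whole directory is sketched below; it is not the code from the example repository. The extension set is taken from the Supported formats table, `find_media_files` and `transcribe_directory` are illustrative names, and the conversion step is passed in as a callable so the loop stays independent of how the converter is configured.

```python
from pathlib import Path
from typing import Callable

# Extensions from the "Supported formats" table.
MEDIA_EXTENSIONS = {
    ".wav", ".mp3", ".m4a", ".aac", ".ogg", ".flac",  # audio
    ".mp4", ".avi", ".mov",                            # video
}


def find_media_files(input_dir: Path) -> list[Path]:
    """Return all supported audio/video files under input_dir, sorted by path."""
    return sorted(
        p for p in input_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )


def transcribe_directory(
    input_dir: Path,
    output_dir: Path,
    to_markdown: Callable[[Path], str],
) -> list[Path]:
    """Write one Markdown transcript per media file; return the written paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for media_path in find_media_files(input_dir):
        # Keep the original extension in the name so a.mp3 and a.mp4 don't collide.
        out_path = output_dir / f"{media_path.name}.md"
        out_path.write_text(to_markdown(media_path), encoding="utf-8")
        written.append(out_path)
    return written
```

With the `converter` built in `main()`, the callable could be `lambda p: converter.convert(p).document.export_to_markdown()`.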

Building a RAG pipeline with LangChain

Docling integrates with LangChain via DoclingLoader, which wraps DocumentConverter and handles chunking automatically. To build a retrieval pipeline over your audio archive:

```python
from pathlib import Path

from langchain_community.vectorstores import FAISS
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings

# Load and chunk the audio files in a directory.
# DoclingLoader takes a file path or a list of paths, so list the files explicitly.
audio_files = [str(p) for p in sorted(Path("recordings").glob("*.mp3"))]
loader = DoclingLoader(file_path=audio_files)
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")
```

See the LangChain integration guide for more details on DoclingLoader options.

Choosing an ASR model and backend

Docling supports three interchangeable ASR backends. The asr extra shown above installs them, except that the WhisperS2T dependency is installed only on non-Apple-Silicon platforms:

| Backend | Library | Hardware | Notes |
| --- | --- | --- | --- |
| Native Whisper | openai-whisper (PyTorch) | CPU, CUDA | Default; broadest compatibility |
| MLX Whisper | mlx-whisper | Apple Silicon (MPS) | Optimized for M-series Macs |
| WhisperS2T | whisper-s2t-reborn (CTranslate2) | CPU, CUDA | Optional & experimental; batched decoding for high throughput |

Automatic backend selection

The auto-selecting presets — WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, WHISPER_LARGE, and WHISPER_TURBO — pick a backend for you based on the hardware they detect, in this priority order:

  1. MLX Whisper — on Apple Silicon, when mlx-whisper is installed.
  2. Native Whisper — on all other hardware.

WhisperS2T is never auto-selected; you opt into it explicitly (see below).

This is why the Basic usage example needs no hardware-specific code — asr_model_specs.WHISPER_TURBO runs on MLX on a Mac and on native Whisper on Linux and Windows. WHISPER_TURBO is a good default; to change the model size, swap in another auto-selecting preset:

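```python
from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

pipeline_options = AsrPipelineOptions()
# Swap in any auto-selecting preset to change the model size
pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE
```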
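
The priority order described in this section can be pictured as a small pure function. This is only an illustration of the documented behavior, not Docling's implementation: the platform checks (Darwin on arm64, an importable `mlx_whisper` module) are assumptions about how Apple Silicon and the MLX package might be detected, and both function names are hypothetical.

```python
import importlib.util
import platform


def pick_backend(system: str, machine: str, mlx_installed: bool) -> str:
    """Apply the documented priority: MLX on Apple Silicon if available, else native."""
    if system == "Darwin" and machine == "arm64" and mlx_installed:
        return "mlx"
    # WhisperS2T never appears here: it is opt-in only.
    return "native"


def detect_backend() -> str:
    """Run the same decision against the current machine."""
    mlx_installed = importlib.util.find_spec("mlx_whisper") is not None
    return pick_backend(platform.system(), platform.machine(), mlx_installed)


if __name__ == "__main__":
    print("Auto-selected backend would be:", detect_backend())
```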

Forcing a specific backend

Each size also has explicit variants that bypass hardware detection, suffixed _NATIVE, _MLX, and _S2T. Use them to pin a backend regardless of platform:

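```python
from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

pipeline_options = AsrPipelineOptions()

# Pick one of the following.

# Native OpenAI Whisper (CPU / CUDA)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_NATIVE

# MLX (Apple Silicon)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_MLX
```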

The rest of the setup — the DocumentConverter from Basic usage — is unchanged.

WhisperS2T: high-throughput transcription

WhisperS2T runs Whisper through CTranslate2 with batched, VAD-segmented decoding. On CPU and CUDA it is typically the fastest backend and uses less VRAM than native Whisper at the larger model sizes, which makes it well suited to transcribing large batches of files. It is experimental and opt-in — select a _S2T preset:

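```python
from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3_S2T
```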

Available _S2T presets:

| Preset | HuggingFace model | Multilingual? |
| --- | --- | --- |
| WHISPER_TINY_S2T | tiny | yes |
| WHISPER_TINY_EN_S2T | tiny.en | English-only |
| WHISPER_BASE_S2T | base | yes |
| WHISPER_BASE_EN_S2T | base.en | English-only |
| WHISPER_SMALL_S2T | small | yes |
| WHISPER_SMALL_EN_S2T | small.en | English-only |
| WHISPER_DISTIL_SMALL_EN_S2T | distil-small.en | English-only |
| WHISPER_MEDIUM_S2T | medium | yes |
| WHISPER_MEDIUM_EN_S2T | medium.en | English-only |
| WHISPER_DISTIL_MEDIUM_EN_S2T | distil-medium.en | English-only |
| WHISPER_LARGE_V3_S2T | large-v3 | yes |
| WHISPER_DISTIL_LARGE_V3_S2T | distil-large-v3 | English-only |
| WHISPER_DISTIL_LARGE_V3_5_S2T | distil-large-v3.5 | English-only |
| WHISPER_LARGE_V3_TURBO_S2T | large-v3-turbo | yes (no translate) |

The English-only presets reject a non-en language and the translate task; large-v3-turbo is multilingual but does not support translate. For multilingual transcription or speech translation, use a multilingual preset such as WHISPER_LARGE_V3_S2T.
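
To catch an unsupported combination before a long batch run, the rules in the table can be encoded as a lookup. The sketch below mirrors the table only; it is not Docling's own validation code, `check_s2t_request` is a hypothetical helper, and the task names `transcribe` and `translate` follow the usual Whisper vocabulary.

```python
ENGLISH_ONLY = {
    "WHISPER_TINY_EN_S2T", "WHISPER_BASE_EN_S2T", "WHISPER_SMALL_EN_S2T",
    "WHISPER_DISTIL_SMALL_EN_S2T", "WHISPER_MEDIUM_EN_S2T",
    "WHISPER_DISTIL_MEDIUM_EN_S2T", "WHISPER_DISTIL_LARGE_V3_S2T",
    "WHISPER_DISTIL_LARGE_V3_5_S2T",
}
MULTILINGUAL = {
    "WHISPER_TINY_S2T", "WHISPER_BASE_S2T", "WHISPER_SMALL_S2T",
    "WHISPER_MEDIUM_S2T", "WHISPER_LARGE_V3_S2T", "WHISPER_LARGE_V3_TURBO_S2T",
}
# Multilingual, but without speech translation.
NO_TRANSLATE = {"WHISPER_LARGE_V3_TURBO_S2T"}


def check_s2t_request(preset: str, language: str = "en", task: str = "transcribe") -> None:
    """Raise ValueError if the preset cannot serve this language/task combination."""
    if preset not in ENGLISH_ONLY | MULTILINGUAL:
        raise ValueError(f"unknown _S2T preset: {preset}")
    if preset in ENGLISH_ONLY:
        if language != "en":
            raise ValueError(f"{preset} is English-only; got language={language!r}")
        if task == "translate":
            raise ValueError(f"{preset} is English-only and cannot translate")
    elif preset in NO_TRANSLATE and task == "translate":
        raise ValueError(f"{preset} does not support the translate task")
```

For example, `check_s2t_request("WHISPER_LARGE_V3_S2T", "de", "translate")` passes, while the same call with `WHISPER_TINY_EN_S2T` raises.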

To tune throughput and accuracy, construct the options directly instead of using a preset:

```python
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel.pipeline_options_asr_model import (
    InferenceAsrFramework,
    InlineAsrWhisperS2TOptions,
)

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = InlineAsrWhisperS2TOptions(
    repo_id="large-v3",
    inference_framework=InferenceAsrFramework.WHISPER_S2T,
    language="en",
    torch_dtype="float16",  # float32 | float16 | bfloat16
    batch_size=8,           # higher = more throughput, more VRAM
    beam_size=1,            # 1 = greedy (fastest); higher may improve accuracy
)
```

!!! note "WhisperS2T is not available on Apple Silicon"
    The whisper-s2t-reborn dependency installs only on non-Apple-Silicon platforms, so `_S2T` presets can't be used on M-series Macs; use the native or MLX backends there. On Linux with CUDA, see the cuBLAS note above if model loading fails.

From the command line

The docling CLI selects any preset with --asr-model (values are the lower-case preset names). Audio and video inputs route to the ASR pipeline automatically, so no extra flag is required:

```bash
# auto-selecting default
docling --to md --asr-model whisper_turbo recording.mp3

# force native Whisper
docling --to md --asr-model whisper_turbo_native recording.mp3

# WhisperS2T, distilled large-v3
docling --to md --asr-model whisper_distil_large_v3_s2t recording.mp3
```

See the CLI reference for the complete list of --asr-model values.
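
When the CLI is driven from a Python script, the mapping from preset name to `--asr-model` value is just lower-casing. The sketch below builds the same command lines as the examples above; the two function names are illustrative, and only the `--to` and `--asr-model` flags shown in this section are used.

```python
import subprocess


def build_docling_command(media_path: str, preset: str = "WHISPER_TURBO", to: str = "md") -> list[str]:
    """Build the argv list for the docling CLI from a Python-side preset name."""
    return ["docling", "--to", to, "--asr-model", preset.lower(), media_path]


def run_docling(media_path: str, preset: str = "WHISPER_TURBO") -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_docling_command(media_path, preset), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.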

Limitations

| Limitation | Workaround |
| --- | --- |
| No SRT/WebVTT subtitle output | Use openai-whisper CLI: `whisper audio.mp3 --output_format srt` |
| No speaker diarization | Use pyannote-audio as a pre- or post-processing step |
| No word-level timestamps | Not available in current export formats |

For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.
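
Where only rough subtitles are needed, the segment timestamps can be reformatted into SRT cues by hand. This is a coarse sketch, not a Docling feature: it assumes the segments have already been extracted as `(start, end, text)` tuples in seconds, and because cues follow the transcript's segments rather than word timings, the result is far less precise than the Whisper CLI workaround in the table.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn (start, end, text) tuples into SRT text, one numbered cue per segment."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```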

See also