Processing audio and video

Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured DoclingDocument — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.

Under the hood, Docling transcribes with OpenAI Whisper. By default the pipeline auto-selects the best backend for your hardware — mlx-whisper on Apple Silicon and native Whisper everywhere else — so the basic example below needs no configuration. To change the model size, force a particular backend, or opt into the faster experimental WhisperS2T backend, see Choosing an ASR model and backend.

Supported formats

| Type | Formats |
|------|---------|
| Audio | WAV, MP3, M4A, AAC, OGG, FLAC |
| Video | MP4, AVI, MOV |

For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg yourself, but the ffmpeg executable must be installed (see the note below).

!!! note "ffmpeg required"
    Whisper audio decoding requires the `ffmpeg` executable to be installed and available on your `PATH`. This applies to common audio formats such as MP3, WAV, M4A, AAC, OGG, and FLAC, and to video files whose audio track is extracted before transcription. Install it with your system package manager: for example, `brew install ffmpeg` on macOS, `apt-get install ffmpeg` on Debian-based Linux, or `winget install ffmpeg` on Windows.
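Because a missing ffmpeg only shows up once decoding starts, it can help to check for it up front. The following is a minimal sketch using only the standard library; `find_ffmpeg` is a hypothetical helper name, not part of Docling.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg was not found on PATH; install it before transcribing.")
else:
    print(f"Using ffmpeg at {ffmpeg_path}")
```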

Installation

The ASR pipeline is an optional extra. Install it alongside the base package:

```bash
pip install "docling[asr]"
```

Or with uv:

```bash
uv add "docling[asr]"
```

!!! note "WhisperS2T on Linux with CUDA"
    The optional WhisperS2T backend uses CTranslate2, which loads NVIDIA's cuBLAS shared library at runtime. On Linux, if WhisperS2T model loading fails because the library cannot be found, add it to your `LD_LIBRARY_PATH`. When cuBLAS is installed from a pip wheel (e.g. `nvidia-cublas-cu12`), the shared library lives under the `nvidia/cublas/lib` directory inside your environment's site-packages.
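To locate that directory without hunting through the environment by hand, a small script can search the interpreter's site-packages for `nvidia/cublas/lib`. This is a sketch under the assumption that the wheel follows the layout described in the note; `find_cublas_lib_dirs` is a hypothetical helper name.

```python
import site
import sysconfig
from pathlib import Path
from typing import List


def find_cublas_lib_dirs() -> List[Path]:
    """Return every existing nvidia/cublas/lib directory under site-packages."""
    roots = set()
    # purelib/platlib point at the active environment's site-packages.
    paths = sysconfig.get_paths()
    for key in ("purelib", "platlib"):
        if key in paths:
            roots.add(Path(paths[key]))
    # getsitepackages() may be missing in some older virtualenv setups.
    getsitepackages = getattr(site, "getsitepackages", None)
    if getsitepackages is not None:
        roots.update(Path(p) for p in getsitepackages())
    lib_dirs = (root / "nvidia" / "cublas" / "lib" for root in roots)
    return sorted(d for d in lib_dirs if d.is_dir())


for lib_dir in find_cublas_lib_dirs():
    # Print a line that can be pasted into the shell before starting Python.
    print(f'export LD_LIBRARY_PATH="{lib_dir}:$LD_LIBRARY_PATH"')
```

The script prints the export line rather than modifying `os.environ`, since the dynamic linker generally reads `LD_LIBRARY_PATH` when the process starts.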

Basic usage

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

# Pick an auto-selecting preset; the backend is chosen for the current hardware
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

# Route audio (and video) inputs through the ASR pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())
```

The same code works for video — pass an .mp4, .mov, or .avi path and Docling handles the rest.
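Since one converter handles both audio and video, a folder of mixed recordings can be filtered by extension before conversion. The sketch below is illustrative only: `find_media_files` and `MEDIA_EXTENSIONS` are hypothetical names, and the extension set is simply copied from the Supported formats table.

```python
from pathlib import Path
from typing import List

# Extensions taken from the "Supported formats" table.
MEDIA_EXTENSIONS = {
    ".wav", ".mp3", ".m4a", ".aac", ".ogg", ".flac",  # audio
    ".mp4", ".avi", ".mov",  # video
}


def find_media_files(directory: Path) -> List[Path]:
    """Return supported audio/video files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )
```

Each returned path can then be passed to `converter.convert(...)` exactly as in the basic example.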

Exporting to different formats

result.document is a DoclingDocument. You can export it to any supported format:

```python
doc.export_to_markdown()   # Markdown
doc.export_to_dict()       # JSON-serializable dict
doc.export_to_html()       # HTML
doc.export_to_doctags()    # DocTags
```

See Serialization for more on export options.

Understanding the output

The ASR pipeline produces paragraph-level Markdown with timestamps per segment:

```
[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.
```

This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.
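If a downstream stage needs the timestamps as numbers rather than text, the segment lines can be parsed with a regular expression. This is a minimal sketch that assumes every segment line has exactly the `[time: start-end]  text` shape shown above; `Segment` and `parse_segments` are hypothetical names, not Docling APIs.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches lines such as "[time: 5.28-9.96]  This is a LibriVox recording."
SEGMENT_RE = re.compile(r"^\[time:\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


def parse_segments(markdown: str) -> List[Segment]:
    """Extract (start, end, text) triples from timestamped transcript lines."""
    segments = []
    for line in markdown.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append(Segment(float(start), float(end), text.strip()))
    return segments


sample = """[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.
"""
for seg in parse_segments(sample):
    print(f"{seg.start:.2f}s to {seg.end:.2f}s: {seg.text}")
```

Keeping the start and end times alongside each chunk makes it possible to link a search hit back to the right moment in the recording.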

A practical use case: searchable meeting archives

A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files.

Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive.

Standalone transcription script

For a full working example, see the example-docling-media repository, which processes a directory of audio/video files and writes each transcript to a Markdown file.

The core of that project is ~30 lines:

```python
from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def main():
    audio_path = Path("videoplayback.mp3")

    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    result = converter.convert(audio_path)
    md = result.document.export_to_markdown()
    # Write as UTF-8 explicitly so non-ASCII transcripts survive on any platform
    Path("transcript.md").write_text(md, encoding="utf-8")
    print(md)


if __name__ == "__main__":
    main()
```

Building a RAG pipeline with LangChain

Docling integrates with LangChain via DoclingLoader, which wraps DocumentConverter and handles chunking automatically. To build a retrieval pipeline over your audio archive:

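```python
from pathlib import Path

from langchain_community.vectorstores import FAISS
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings

# DoclingLoader takes a file path or an iterable of paths, so list the
# recordings explicitly rather than passing the directory itself.
audio_files = [str(p) for p in sorted(Path("recordings").glob("*.mp3"))]

# Load and chunk every MP3 recording in the directory
loader = DoclingLoader(file_path=audio_files)
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")
```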

See the LangChain integration guide for more details on DoclingLoader options.

Choosing an ASR model and backend

Docling ships three interchangeable ASR backends. The asr extra shown above installs them where the platform supports them (WhisperS2T, for example, is not installed on Apple Silicon):

| Backend | Library | Hardware | Notes |
|---------|---------|----------|-------|
| Native Whisper | `openai-whisper` (PyTorch) | CPU, CUDA | Default; broadest compatibility |
| MLX Whisper | `mlx-whisper` | Apple Silicon (MPS) | Optimized for M-series Macs |
| WhisperS2T | `whisper-s2t-reborn` (CTranslate2) | CPU, CUDA | Optional & experimental; batched decoding for high throughput |

Automatic backend selection

The auto-selecting presets — WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, WHISPER_LARGE, and WHISPER_TURBO — pick a backend for you based on the hardware they detect, in this priority order:

1. MLX Whisper: on Apple Silicon, when mlx-whisper is installed.
2. Native Whisper: on all other hardware.

WhisperS2T is never auto-selected; you opt into it explicitly (see below).
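The priority order above can be sketched in a few lines of standard-library Python. This is an illustration of the documented rule, not Docling's actual detection code; `pick_backend` is a hypothetical helper, and checking `platform.machine() == "arm64"` on macOS is an assumption about how Apple Silicon might be detected.

```python
import importlib.util
import platform


def pick_backend() -> str:
    """Mirror the documented priority: MLX on Apple Silicon if installed, else native."""
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    mlx_installed = importlib.util.find_spec("mlx_whisper") is not None
    if on_apple_silicon and mlx_installed:
        return "mlx"
    # WhisperS2T is deliberately absent: it is opt-in only.
    return "native"


print(pick_backend())
```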

This is why the Basic usage example needs no hardware-specific code — asr_model_specs.WHISPER_TURBO runs on MLX on a Mac and on native Whisper on Linux and Windows. WHISPER_TURBO is a good default; to change the model size, swap in another auto-selecting preset:

```python
from docling.datamodel import asr_model_specs

pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE
```

Forcing a specific backend

Each size also has explicit variants that bypass hardware detection, suffixed _NATIVE, _MLX, and _S2T. Use them to pin a backend regardless of platform:

```python
from docling.datamodel import asr_model_specs

# Native OpenAI Whisper (CPU / CUDA)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_NATIVE

# MLX (Apple Silicon)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_MLX
```

The rest of the setup — the DocumentConverter from Basic usage — is unchanged.

WhisperS2T: high-throughput transcription

WhisperS2T runs Whisper through CTranslate2 with batched, VAD-segmented decoding. On CPU and CUDA it is typically the fastest backend and uses less VRAM than native Whisper at the larger model sizes, which makes it well suited to transcribing large batches of files. It is experimental and opt-in — select a _S2T preset:

```python
from docling.datamodel import asr_model_specs

pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3_S2T
```

Available _S2T presets:

| Preset | HuggingFace model | Multilingual? |
|--------|-------------------|---------------|
| `WHISPER_TINY_S2T` | `tiny` | yes |
| `WHISPER_TINY_EN_S2T` | `tiny.en` | English-only |
| `WHISPER_BASE_S2T` | `base` | yes |
| `WHISPER_BASE_EN_S2T` | `base.en` | English-only |
| `WHISPER_SMALL_S2T` | `small` | yes |
| `WHISPER_SMALL_EN_S2T` | `small.en` | English-only |
| `WHISPER_DISTIL_SMALL_EN_S2T` | `distil-small.en` | English-only |
| `WHISPER_MEDIUM_S2T` | `medium` | yes |
| `WHISPER_MEDIUM_EN_S2T` | `medium.en` | English-only |
| `WHISPER_DISTIL_MEDIUM_EN_S2T` | `distil-medium.en` | English-only |
| `WHISPER_LARGE_V3_S2T` | `large-v3` | yes |
| `WHISPER_DISTIL_LARGE_V3_S2T` | `distil-large-v3` | English-only |
| `WHISPER_DISTIL_LARGE_V3_5_S2T` | `distil-large-v3.5` | English-only |
| `WHISPER_LARGE_V3_TURBO_S2T` | `large-v3-turbo` | yes (no `translate`) |

The English-only presets reject a non-en language and the translate task; large-v3-turbo is multilingual but does not support translate. For multilingual transcription or speech translation, use a multilingual preset such as WHISPER_LARGE_V3_S2T.
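When presets are chosen from configuration or user input, it can be convenient to catch an unsupported combination before any model is loaded. The check below is a sketch that encodes only the rules stated in the table above; `check_s2t_request` is a hypothetical helper, and Docling's own validation may behave differently.

```python
# Presets marked "English-only" in the table above.
ENGLISH_ONLY = {
    "WHISPER_TINY_EN_S2T",
    "WHISPER_BASE_EN_S2T",
    "WHISPER_SMALL_EN_S2T",
    "WHISPER_DISTIL_SMALL_EN_S2T",
    "WHISPER_MEDIUM_EN_S2T",
    "WHISPER_DISTIL_MEDIUM_EN_S2T",
    "WHISPER_DISTIL_LARGE_V3_S2T",
    "WHISPER_DISTIL_LARGE_V3_5_S2T",
}
# large-v3-turbo is multilingual but cannot translate.
NO_TRANSLATE = ENGLISH_ONLY | {"WHISPER_LARGE_V3_TURBO_S2T"}


def check_s2t_request(preset: str, language: str = "en", task: str = "transcribe") -> None:
    """Raise ValueError if the table rules out this preset/language/task combination."""
    if preset in ENGLISH_ONLY and language != "en":
        raise ValueError(f"{preset} is English-only; got language={language!r}")
    if preset in NO_TRANSLATE and task == "translate":
        raise ValueError(f"{preset} does not support the translate task")


# A multilingual preset accepts German input and the translate task.
check_s2t_request("WHISPER_LARGE_V3_S2T", language="de", task="translate")

# An English-only preset rejects it.
try:
    check_s2t_request("WHISPER_TINY_EN_S2T", language="de")
except ValueError as err:
    print(err)
```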

To tune throughput and accuracy, construct the options directly instead of using a preset:

```python
from docling.datamodel.pipeline_options_asr_model import (
    InferenceAsrFramework,
    InlineAsrWhisperS2TOptions,
)

pipeline_options.asr_options = InlineAsrWhisperS2TOptions(
    repo_id="large-v3",
    inference_framework=InferenceAsrFramework.WHISPER_S2T,
    language="en",
    torch_dtype="float16",  # float32 | float16 | bfloat16
    batch_size=8,           # higher = more throughput, more VRAM
    beam_size=1,            # 1 = greedy (fastest); higher may improve accuracy
)
```

!!! note "WhisperS2T is not available on Apple Silicon"
    The `whisper-s2t-reborn` dependency installs only on non-Apple-Silicon platforms, so `_S2T` presets can't be used on M-series Macs; use the native or MLX backends there. On Linux with CUDA, see the cuBLAS note above if model loading fails.

From the command line

The docling CLI selects any preset with --asr-model (values are the lower-case preset names). Audio and video inputs route to the ASR pipeline automatically, so no extra flag is required:

```bash
# auto-selecting default
docling --to md --asr-model whisper_turbo recording.mp3

# force native Whisper
docling --to md --asr-model whisper_turbo_native recording.mp3

# WhisperS2T, distilled large-v3
docling --to md --asr-model whisper_distil_large_v3_s2t recording.mp3
```

See the CLI reference for the complete list of --asr-model values.
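When driving the CLI from a script, the command can be assembled from the Python preset name by lower-casing it, as described above. This is a small sketch; `build_docling_command` is a hypothetical helper, and it uses only the `--to` and `--asr-model` flags shown in the examples.

```python
import shlex
from typing import List


def build_docling_command(preset_name: str, source: str, output_format: str = "md") -> List[str]:
    """Build the docling CLI argument list from a Python preset name."""
    return ["docling", "--to", output_format, "--asr-model", preset_name.lower(), source]


cmd = build_docling_command("WHISPER_TURBO_NATIVE", "recording.mp3")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with file names that contain spaces.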

Limitations

| Limitation | Workaround |
|------------|------------|
| No SRT/WebVTT subtitle output | Use `openai-whisper` CLI: `whisper audio.mp3 --output_format srt` |
| No speaker diarization | Use `pyannote-audio` as a pre- or post-processing step |
| No word-level timestamps | Not available in current export formats |

For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.
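For occasional subtitle needs, the segment-level timestamps in the Markdown export may be enough to produce a rough SRT file without a second transcription pass. The sketch below assumes the `[time: start-end]  text` line format shown under Understanding the output; `srt_timestamp` and `markdown_to_srt` are hypothetical helpers, and the cues are only as fine-grained as the segments themselves.

```python
import re

# Matches lines such as "[time: 5.28-9.96]  This is a LibriVox recording."
SEGMENT_RE = re.compile(r"^\[time:\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def markdown_to_srt(markdown: str) -> str:
    """Turn timestamped transcript lines into numbered SRT cues."""
    cues = []
    for line in markdown.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything without a timestamp
        start, end, text = match.groups()
        index = len(cues) + 1
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(float(start))} --> {srt_timestamp(float(end))}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(cues)


print(markdown_to_srt("[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde\n"))
```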
