docs/usage/processing_audio_media.md
Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured DoclingDocument — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.
Under the hood, Docling transcribes with OpenAI Whisper. By default the pipeline auto-selects the best backend for your hardware — mlx-whisper on Apple Silicon and native Whisper everywhere else — so the basic example below needs no configuration. To change the model size, force a particular backend, or opt into the faster experimental WhisperS2T backend, see Choosing an ASR model and backend.
| Type | Formats |
|---|---|
| Audio | WAV, MP3, M4A, AAC, OGG, FLAC |
| Video | MP4, AVI, MOV |
For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg yourself, but the ffmpeg executable must be installed, as the note below explains.
!!! note "ffmpeg required"
Whisper audio decoding requires the ffmpeg executable to be installed and available on your PATH. This applies to common audio formats such as MP3, WAV, M4A, AAC, OGG, and FLAC, and to video files whose audio track is extracted before transcription. Install it with your system package manager — e.g. brew install ffmpeg on macOS, apt-get install ffmpeg on Debian-based Linux, or winget install ffmpeg on Windows.
The ASR pipeline is an optional extra. Install it alongside the base package:
pip install "docling[asr]"
Or with uv:
uv add "docling[asr]"
!!! note "WhisperS2T on Linux with CUDA"
The optional WhisperS2T backend uses CTranslate2, which loads NVIDIA's cuBLAS shared library at runtime. On Linux, if WhisperS2T model loading fails because the library cannot be found, add it to your LD_LIBRARY_PATH. When cuBLAS is installed from a pip wheel (e.g. nvidia-cublas-cu12), the shared library lives under the nvidia/cublas/lib directory inside your environment's site-packages.
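A minimal sketch for locating that directory, assuming the pip-wheel layout described above (`nvidia/cublas/lib` under site-packages). The helper name `find_cublas_lib_dir` is hypothetical and not part of Docling.

```python
import site
from pathlib import Path
from typing import Iterable, Optional


def find_cublas_lib_dir(site_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first existing nvidia/cublas/lib directory, or None."""
    for base in site_dirs:
        candidate = Path(base) / "nvidia" / "cublas" / "lib"
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    # Search the global and per-user site-packages of the active interpreter.
    dirs = list(site.getsitepackages()) + [site.getusersitepackages()]
    lib_dir = find_cublas_lib_dir(dirs)
    if lib_dir is None:
        print("cuBLAS wheel not found in site-packages")
    else:
        # Run the printed line in your shell before starting Python.
        print(f'export LD_LIBRARY_PATH="{lib_dir}:$LD_LIBRARY_PATH"')
```

The variable has to be set in the shell before the Python process starts, because the dynamic loader reads it at process start-up.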
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert(Path("recording.mp3"))
doc = result.document
# Export to Markdown
print(doc.export_to_markdown())
The same code works for video — pass an .mp4, .mov, or .avi path and Docling handles the rest.
result.document is a DoclingDocument. You can export it to any supported format:
doc.export_to_markdown() # Markdown
doc.export_to_dict() # JSON-serializable dict
doc.export_to_html() # HTML
doc.export_to_doctags() # DocTags
See Serialization for more on export options.
The ASR pipeline produces paragraph-level Markdown with timestamps per segment:
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.
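If a downstream stage needs the timestamps as data rather than text, the segment lines can be parsed back out. This is a sketch that assumes every segment line follows the `[time: start-end] text` shape shown in the sample above; `Segment` and `parse_segments` are hypothetical names, not part of Docling.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float
    end: float
    text: str


# Matches lines such as "[time: 5.28-9.96] This is a LibriVox recording."
_SEGMENT_RE = re.compile(r"^\[time:\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


def parse_segments(markdown: str) -> List[Segment]:
    """Extract (start, end, text) tuples; lines without a time prefix are skipped."""
    segments = []
    for line in markdown.splitlines():
        match = _SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append(Segment(float(start), float(end), text))
    return segments


sample = (
    "[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n"
    "[time: 5.28-9.96] This is a LibriVox recording."
)
for seg in parse_segments(sample):
    print(f"{seg.start:>6.2f}s to {seg.end:>6.2f}s  {seg.text}")
```

Keeping the start and end times alongside each chunk lets a retrieval result point back to the right moment in the recording.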
A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files.
Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive.
For a full working example, see the example-docling-media repository, which processes a directory of audio/video files and writes each transcript to a Markdown file.
The core of that project is ~30 lines:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline
def main():
    audio_path = Path("videoplayback.mp3")
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    result = converter.convert(audio_path)
    md = result.document.export_to_markdown()
    # Write as UTF-8 so non-ASCII transcripts save correctly on every platform
    Path("transcript.md").write_text(md, encoding="utf-8")
    print(md)


if __name__ == "__main__":
    main()
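The snippet above handles a single file, while the project it comes from is described as processing a whole directory. A rough sketch of that directory walk follows; it is not the repository's actual code. The extension set is taken from the supported-formats table in this guide, and `find_media_files`, `transcribe_directory`, and the `convert_to_markdown` callable are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable, List

# Extensions taken from the supported-formats table in this guide.
MEDIA_SUFFIXES = {
    ".wav", ".mp3", ".m4a", ".aac", ".ogg", ".flac",  # audio
    ".mp4", ".avi", ".mov",                           # video
}


def find_media_files(root: Path) -> List[Path]:
    """Return media files under root (recursively), sorted for stable ordering."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_SUFFIXES
    )


def transcribe_directory(
    root: Path, out_dir: Path, convert_to_markdown: Callable[[Path], str]
) -> List[Path]:
    """Write one <stem>.md transcript per media file and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for media in find_media_files(root):
        # Note: files that share a stem (a.mp3 and a.mp4) overwrite each other.
        target = out_dir / f"{media.stem}.md"
        target.write_text(convert_to_markdown(media), encoding="utf-8")
        written.append(target)
    return written
```

With the `converter` built as in the example above, the callable could be `lambda p: converter.convert(p).document.export_to_markdown()`.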
Docling integrates with LangChain via DoclingLoader, which wraps DocumentConverter and handles chunking automatically. To build a retrieval pipeline over your audio archive:
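from pathlib import Path
from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Collect the audio files explicitly: DoclingLoader takes a file path or a list of paths
audio_files = sorted(str(p) for p in Path("recordings").glob("*.mp3"))
# Load and chunk every collected file
loader = DoclingLoader(file_path=audio_files)
docs = loader.load()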
# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")
See the LangChain integration guide for more details on DoclingLoader options.
Docling ships three interchangeable ASR backends. The asr extra shown above installs them, except that the WhisperS2T dependency is not installed on Apple Silicon:
| Backend | Library | Hardware | Notes |
|---|---|---|---|
| Native Whisper | openai-whisper (PyTorch) | CPU, CUDA | Default outside Apple Silicon; broadest compatibility |
| MLX Whisper | mlx-whisper | Apple Silicon (MPS) | Optimized for M-series Macs |
| WhisperS2T | whisper-s2t-reborn (CTranslate2) | CPU, CUDA | Optional & experimental; batched decoding for high throughput |
The auto-selecting presets — WHISPER_TINY, WHISPER_BASE, WHISPER_SMALL, WHISPER_MEDIUM, WHISPER_LARGE, and WHISPER_TURBO — pick a backend for you based on the hardware they detect, in this priority order:
1. MLX Whisper, when running on Apple Silicon and mlx-whisper is installed.
2. Native Whisper otherwise.

WhisperS2T is never auto-selected; you opt into it explicitly (see below).
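The selection rule described in this guide (MLX on Apple Silicon when mlx-whisper is installed, native Whisper otherwise) can be sketched as follows. This illustrates the documented behaviour and is not Docling's actual detection code. `pick_backend` is a hypothetical helper, and the sketch assumes mlx-whisper is importable as `mlx_whisper`.

```python
import importlib.util
import platform


def pick_backend(system: str, machine: str, mlx_available: bool) -> str:
    """Mirror the documented priority: MLX on Apple Silicon when installed, else native."""
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    if on_apple_silicon and mlx_available:
        return "mlx"
    return "native"  # WhisperS2T is never chosen automatically


if __name__ == "__main__":
    # Assumption: the mlx-whisper package exposes the module name `mlx_whisper`.
    has_mlx = importlib.util.find_spec("mlx_whisper") is not None
    print(pick_backend(platform.system(), platform.machine(), has_mlx))
```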
This is why the Basic usage example needs no hardware-specific code — asr_model_specs.WHISPER_TURBO runs on MLX on a Mac and on native Whisper on Linux and Windows. WHISPER_TURBO is a good default; to change the model size, swap in another auto-selecting preset:
from docling.datamodel import asr_model_specs
pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE
Each size also has explicit variants that bypass hardware detection, suffixed _NATIVE, _MLX, and _S2T. Use them to pin a backend regardless of platform:
from docling.datamodel import asr_model_specs
# Native OpenAI Whisper (CPU / CUDA)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_NATIVE
# MLX (Apple Silicon)
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO_MLX
The rest of the setup — the DocumentConverter from Basic usage — is unchanged.
WhisperS2T runs Whisper through CTranslate2 with batched, VAD-segmented decoding. On CPU and CUDA it is typically the fastest backend and uses less VRAM than native Whisper at the larger model sizes, which makes it well suited to transcribing large batches of files. It is experimental and opt-in — select a _S2T preset:
from docling.datamodel import asr_model_specs
pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3_S2T
Available _S2T presets:
| Preset | HuggingFace model | Multilingual? |
|---|---|---|
| WHISPER_TINY_S2T | tiny | yes |
| WHISPER_TINY_EN_S2T | tiny.en | English-only |
| WHISPER_BASE_S2T | base | yes |
| WHISPER_BASE_EN_S2T | base.en | English-only |
| WHISPER_SMALL_S2T | small | yes |
| WHISPER_SMALL_EN_S2T | small.en | English-only |
| WHISPER_DISTIL_SMALL_EN_S2T | distil-small.en | English-only |
| WHISPER_MEDIUM_S2T | medium | yes |
| WHISPER_MEDIUM_EN_S2T | medium.en | English-only |
| WHISPER_DISTIL_MEDIUM_EN_S2T | distil-medium.en | English-only |
| WHISPER_LARGE_V3_S2T | large-v3 | yes |
| WHISPER_DISTIL_LARGE_V3_S2T | distil-large-v3 | English-only |
| WHISPER_DISTIL_LARGE_V3_5_S2T | distil-large-v3.5 | English-only |
| WHISPER_LARGE_V3_TURBO_S2T | large-v3-turbo | yes (no translate) |
The English-only presets reject a non-en language and the translate task; large-v3-turbo is multilingual but does not support translate. For multilingual transcription or speech translation, use a multilingual preset such as WHISPER_LARGE_V3_S2T.
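The constraints in the table can be checked before a job is queued. The sketch below restates the table's rules in plain Python. It does not call Docling, and `check_request` is a hypothetical helper. The task names `"transcribe"` and `"translate"` are assumptions based on Whisper's usual task names.

```python
from typing import Optional

# Model names marked "English-only" in the preset table above.
ENGLISH_ONLY = {
    "tiny.en", "base.en", "small.en", "medium.en",
    "distil-small.en", "distil-medium.en",
    "distil-large-v3", "distil-large-v3.5",
}
# large-v3-turbo is multilingual but has no translate support.
NO_TRANSLATE = ENGLISH_ONLY | {"large-v3-turbo"}


def check_request(
    model: str, language: str = "en", task: str = "transcribe"
) -> Optional[str]:
    """Return a reason the combination is unsupported, or None if it is allowed."""
    if model in ENGLISH_ONLY and language != "en":
        return f"{model} is English-only; got language={language!r}"
    if model in NO_TRANSLATE and task == "translate":
        return f"{model} does not support the translate task"
    return None


print(check_request("large-v3", "de", "translate"))        # allowed
print(check_request("large-v3-turbo", "de", "translate"))  # rejected
```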
To tune throughput and accuracy, construct the options directly instead of using a preset:
from docling.datamodel.pipeline_options_asr_model import (
    InferenceAsrFramework,
    InlineAsrWhisperS2TOptions,
)
pipeline_options.asr_options = InlineAsrWhisperS2TOptions(
    repo_id="large-v3",
    inference_framework=InferenceAsrFramework.WHISPER_S2T,
    language="en",
    torch_dtype="float16",  # float32 | float16 | bfloat16
    batch_size=8,  # higher = more throughput, more VRAM
    beam_size=1,  # 1 = greedy (fastest); higher may improve accuracy
)
!!! note "WhisperS2T is not available on Apple Silicon"
The whisper-s2t-reborn dependency installs only on non-Apple-Silicon platforms, so _S2T presets can't be used on M-series Macs — use the native or MLX backends there. On Linux with CUDA, see the cuBLAS note above if model loading fails.
The docling CLI selects any preset with --asr-model (values are the lower-case preset names). Audio and video inputs route to the ASR pipeline automatically, so no extra flag is required:
# auto-selecting default
docling --to md --asr-model whisper_turbo recording.mp3
# force native Whisper
docling --to md --asr-model whisper_turbo_native recording.mp3
# WhisperS2T, distilled large-v3
docling --to md --asr-model whisper_distil_large_v3_s2t recording.mp3
See the CLI reference for the complete list of --asr-model values.
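To drive the CLI from a script, the command line can be assembled from a Python preset name. This sketch uses only the flags shown in the commands above. `build_docling_command` and `run_docling` are hypothetical helpers, and running the command assumes the docling executable is on your PATH.

```python
import subprocess
from typing import List


def build_docling_command(preset_name: str, source: str, to: str = "md") -> List[str]:
    """Build the argv for the docling CLI; preset names are passed in lower case."""
    return ["docling", "--to", to, "--asr-model", preset_name.lower(), source]


def run_docling(preset_name: str, source: str) -> None:
    """Run the CLI; check=True raises CalledProcessError if docling exits non-zero."""
    subprocess.run(build_docling_command(preset_name, source), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_docling(...) to execute it.
    print(" ".join(build_docling_command("WHISPER_TURBO", "recording.mp3")))
```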
| Limitation | Workaround |
|---|---|
| No SRT/WebVTT subtitle output | Use openai-whisper CLI: whisper audio.mp3 --output_format srt |
| No speaker diarization | Use pyannote-audio as a pre- or post-processing step |
| No word-level timestamps | Not available in current export formats |
For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.