Supported Formats

Docling can parse various document formats into a unified representation (Docling Document), which it can in turn export to several output formats. See Architecture for more details.

Below you can find a listing of all supported input and output formats.

Supported input formats

| Format | Description |
|---|---|
| PDF | |
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
| Markdown | |
| AsciiDoc | Human-readable, plain-text markup language for structured technical content |
| LaTeX | Scientific document preparation system |
| HTML, XHTML | |
| CSV | |
| PNG, JPEG, TIFF, BMP, WEBP | Image formats |
| WAV, MP3, M4A, AAC, OGG, FLAC | Audio formats (requires asr extra; see Processing audio and video) |
| MP4, AVI, MOV | Video formats; the audio track is extracted and transcribed (requires asr extra and ffmpeg) |
| WebVTT | Web Video Text Tracks format for displaying timed text |
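
As a quick pre-flight check, the table above can be turned into a lookup that tells you which format family a file belongs to before you hand it to Docling. The following is a minimal sketch, not part of Docling's API: the helper names (`input_family`, `needs_asr_extra`) are hypothetical, and the file extensions are assumed to be the conventional ones for each listed format, since the table names formats rather than extensions.

```python
from pathlib import Path
from typing import Optional

# Format family -> file extensions. The families follow the table above;
# the extensions themselves are an assumption (conventional suffixes).
FAMILY_EXTENSIONS = {
    "PDF": ("pdf",),
    "MS Office": ("docx", "xlsx", "pptx"),
    "Markdown": ("md",),
    "AsciiDoc": ("adoc", "asciidoc"),
    "LaTeX": ("tex",),
    "HTML": ("html", "htm", "xhtml"),
    "CSV": ("csv",),
    "Image": ("png", "jpg", "jpeg", "tif", "tiff", "bmp", "webp"),
    "Audio": ("wav", "mp3", "m4a", "aac", "ogg", "flac"),
    "Video": ("mp4", "avi", "mov"),
    "WebVTT": ("vtt",),
}

# Inverted index: extension -> family, for O(1) lookups.
EXTENSION_TO_FAMILY = {
    ext: family
    for family, extensions in FAMILY_EXTENSIONS.items()
    for ext in extensions
}

# Families the table marks as requiring the asr extra
# (video additionally requires ffmpeg).
ASR_FAMILIES = {"Audio", "Video"}


def input_family(path: str) -> Optional[str]:
    """Return the format family for a path, or None if it is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    return EXTENSION_TO_FAMILY.get(ext)


def needs_asr_extra(path: str) -> bool:
    """True if the file belongs to a family that the table ties to the asr extra."""
    return input_family(path) in ASR_FAMILIES
```

A lookup like this only inspects the file name; it makes no claim about how Docling itself detects formats.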

Schema-specific support:

| Format | Description |
|---|---|
| USPTO XML | XML format followed by USPTO patents |
| JATS XML | XML format followed by JATS articles |
| XBRL XML | XML format for business and financial reporting following XBRL standard |
| Docling JSON | JSON-serialized Docling Document |

Supported output formats

| Format | Description |
|---|---|
| HTML | Both image embedding and referencing are supported |
| Markdown | |
| JSON | Lossless serialization of Docling Document |
| Text | Plain text, i.e. without Markdown markers |
| Doctags | Markup format for efficiently representing the full content and layout characteristics of a document |
| WebVTT | Web Video Text Tracks format for displaying timed text |
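
To show how a single converted document might fan out to the output formats listed above, here is a minimal sketch of a dispatch helper. `export_document` and `EXPORT_METHODS` are hypothetical names introduced for illustration. The method names in the mapping are assumptions about the Docling Document export API and should be checked against your installed version. WebVTT is left out because this sketch makes no assumption about its export method.

```python
from typing import Any

# Output format -> name of the Docling Document method assumed to produce it.
# These method names are assumptions; verify them against your Docling version.
EXPORT_METHODS = {
    "html": "export_to_html",
    "markdown": "export_to_markdown",
    "json": "export_to_dict",  # returns a dict, which can be passed to json.dumps
    "text": "export_to_text",
    "doctags": "export_to_doctags",
}


def export_document(doc: Any, fmt: str) -> Any:
    """Export `doc` in the requested format by calling the mapped method."""
    try:
        method_name = EXPORT_METHODS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported output format: {fmt!r}") from None

    method = getattr(doc, method_name, None)
    if method is None:
        # The document object does not expose the assumed method.
        raise AttributeError(f"document has no method {method_name!r}")
    return method()


# Usage sketch (requires Docling; "report.pdf" is a placeholder path):
#   from docling.document_converter import DocumentConverter
#   result = DocumentConverter().convert("report.pdf")
#   print(export_document(result.document, "markdown"))
```

Looking the method up by name keeps the helper independent of any one Docling release: if a method is renamed, only the mapping needs to change.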