Supported Formats

Docling can parse various document formats into a unified representation (Docling Document), which it can then export to different formats. See Architecture for more details.

Below you can find a listing of all supported input and output formats.

Supported input formats

| Format | Description |
|---|---|
| PDF | |
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
| EPUB | Electronic Publication format for e-books |
| Markdown | |
| AsciiDoc | Human-readable, plain-text markup language for structured technical content |
| LaTeX | Scientific document preparation system |
| HTML, XHTML | |
| CSV | |
| PNG, JPEG, TIFF, BMP, WEBP | Image formats |
| WAV, MP3, M4A, AAC, OGG, FLAC | Audio formats (requires `asr` extra; see Processing audio and video) |
| MP4, AVI, MOV | Video formats; the audio track is extracted and transcribed (requires `asr` extra and `ffmpeg`) |
| WebVTT | Web Video Text Tracks format for displaying timed text |
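
As a rough illustration of how the table above might be used to pre-check files before handing them to Docling, the following sketch maps file extensions to the format families listed. It is not part of Docling's API: `INPUT_FAMILIES`, `input_family`, and `needs_asr_extra` are hypothetical names, the family labels are illustrative, and the extensions `md`, `adoc`, `tex`, `htm`, `jpg`, `tif`, and `vtt` are assumed conventional spellings that the table does not list explicitly.

```python
from pathlib import Path

# Extension -> format family, transcribed from the input formats table.
# Labels are illustrative names, not Docling identifiers. Extensions the
# table does not spell out (md, adoc, tex, htm, jpg, tif, vtt) are assumptions.
INPUT_FAMILIES = {
    "pdf": "PDF",
    "docx": "MS Office", "xlsx": "MS Office", "pptx": "MS Office",
    "epub": "EPUB",
    "md": "Markdown",
    "adoc": "AsciiDoc",
    "tex": "LaTeX",
    "html": "HTML", "htm": "HTML", "xhtml": "HTML",
    "csv": "CSV",
    "png": "image", "jpg": "image", "jpeg": "image",
    "tif": "image", "tiff": "image", "bmp": "image", "webp": "image",
    "wav": "audio", "mp3": "audio", "m4a": "audio",
    "aac": "audio", "ogg": "audio", "flac": "audio",
    "mp4": "video", "avi": "video", "mov": "video",
    "vtt": "WebVTT",
}


def input_family(path):
    """Return the format family for a file name, or None if its extension is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    return INPUT_FAMILIES.get(ext)


def needs_asr_extra(path):
    """Audio and video inputs are the ones the table marks as requiring the asr extra."""
    return input_family(path) in {"audio", "video"}
```

A caller could use `needs_asr_extra` to decide up front whether the `asr` extra (and, for video, `ffmpeg`) has to be available. The lookup only inspects the file name, so it says nothing about whether the file content is valid.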

Schema-specific support:

| Format | Description |
|---|---|
| DocLang XML | XML format following DocLang; supported extensions: `.dclg`, `.dclg.xml` |
| USPTO XML | XML format followed by USPTO patents |
| JATS XML | XML format followed by JATS articles |
| XBRL XML | XML format for business and financial reporting following XBRL standard |
| Docling JSON | JSON-serialized Docling Document |
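
One practical detail of the DocLang extensions is that `.dclg.xml` is a compound suffix, so `pathlib.Path.suffix` alone reports only `.xml`. A minimal sketch of a name-based check follows; `is_doclang_file` is a hypothetical helper, not a Docling function, and the case-insensitive match is an assumption.

```python
def is_doclang_file(name: str) -> bool:
    """Return True if the file name ends in one of the listed DocLang extensions."""
    # Match on the full name ending: Path("a.dclg.xml").suffix would be ".xml" only.
    # Lower-casing first makes the check case-insensitive (an assumption).
    return name.lower().endswith((".dclg", ".dclg.xml"))
```

A plain `.xml` file is deliberately not matched, since the other schema-specific formats in the table are also XML-based.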

Supported output formats

| Format | Description |
|---|---|
| HTML | Both image embedding and referencing are supported |
| Markdown | |
| JSON | Lossless serialization of Docling Document |
| DocLang XML | XML serialization following DocLang; CLI output format: `doclang` |
| Text | Plain text, i.e. without Markdown markers |
| Doctags | Markup format for efficiently representing the full content and layout characteristics of a document |
| WebVTT | Web Video Text Tracks format for displaying timed text |
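
To show how the output formats might be selected programmatically, here is a small dispatch sketch. It assumes the Docling Document object exposes the export methods named in `EXPORT_METHODS`, as recent docling-core releases do; check these names against the installed version. `EXPORT_METHODS` and `export` are hypothetical helpers. DocLang XML and WebVTT are left out because their Python method names are not given here.

```python
# Output format label -> name of the export method on a Docling Document.
# The method names are assumptions based on recent docling-core releases.
EXPORT_METHODS = {
    "html": "export_to_html",
    "markdown": "export_to_markdown",
    "json": "export_to_dict",      # returns a dict; serialize it with json.dumps
    "text": "export_to_text",
    "doctags": "export_to_doctags",
}


def export(document, fmt):
    """Call the export method mapped to `fmt` on `document` and return its result."""
    try:
        method_name = EXPORT_METHODS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no export method mapped for {fmt!r}") from None
    return getattr(document, method_name)()
```

In typical use, `document` would be the `document` attribute of the result returned by `DocumentConverter().convert(source)`, with `DocumentConverter` imported from `docling.document_converter`.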