File Format Support

This document provides detailed information about each file format supported by MarkItDown.

Document Formats

PDF (.pdf)

Capabilities:

Text extraction
Table detection
Metadata extraction
OCR for scanned documents (with dependencies)

Dependencies:

bash

pip install 'markitdown[pdf]'

Best For:

Scientific papers
Reports
Books
Forms

Limitations:

Formatting may not be fully preserved for complex layouts
Scanned PDFs require OCR setup
Some PDF features (annotations, interactive form fields) may not convert

Example:

python

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)

Enhanced with Azure Document Intelligence:

python

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")

Microsoft Word (.docx)

Capabilities:

Text extraction
Table conversion
Heading hierarchy
List formatting
Basic text formatting (bold, italic)

Dependencies:

bash

pip install 'markitdown[docx]'

Best For:

Research papers
Reports
Documentation
Manuscripts

Preserved Elements:

Headings (converted to Markdown headers)
Tables (converted to Markdown tables)
Lists (bulleted and numbered)
Basic formatting (bold, italic)
Paragraphs

Example:

python

result = md.convert("manuscript.docx")

PowerPoint (.pptx)

Capabilities:

Slide content extraction
Speaker notes
Table extraction
Image descriptions (with AI)

Dependencies:

bash

pip install 'markitdown[pptx]'

Best For:

Presentations
Lecture slides
Conference talks

Output Format:

markdown

# Slide 1: Title

Content from slide 1...

**Notes**: Speaker notes appear here

---

# Slide 2: Next Topic

...
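If the converted text follows the layout shown above, it can be split back into per-slide records for indexing or review. The following is a minimal sketch under that assumption: `split_slides` and the dictionary keys are hypothetical names, not part of MarkItDown, and the exact heading and notes markers may differ between versions.

```python
import re

# Sample text copied from the output format shown above (an assumption
# about the layout, not a guaranteed contract).
SAMPLE = """# Slide 1: Title

Content from slide 1...

**Notes**: Speaker notes appear here

---

# Slide 2: Next Topic

...
"""

SLIDE_HEADING = re.compile(r"^# Slide (\d+): (.*)$")
NOTES_PREFIX = "**Notes**:"


def split_slides(markdown):
    """Split converted presentation text into one dict per slide."""
    slides = []
    current = None
    for line in markdown.splitlines():
        match = SLIDE_HEADING.match(line)
        if match:
            current = {
                "number": int(match.group(1)),
                "title": match.group(2).strip(),
                "body": [],
                "notes": None,
            }
            slides.append(current)
        elif current is None or line.strip() == "---":
            # Skip text before the first slide and the slide separators.
            continue
        elif line.startswith(NOTES_PREFIX):
            current["notes"] = line[len(NOTES_PREFIX):].strip()
        else:
            current["body"].append(line)
    # Collapse the collected body lines into a single trimmed string.
    for slide in slides:
        slide["body"] = "\n".join(slide["body"]).strip()
    return slides


for slide in split_slides(SAMPLE):
    print(slide["number"], slide["title"], "| notes:", slide["notes"])
```

With a real conversion, the input would be `result.text_content` from `md.convert("presentation.pptx")`.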

With AI Image Descriptions:

python

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")

Excel (.xlsx, .xls)

Capabilities:

Sheet extraction
Table formatting
Data preservation
Formula values (calculated)

Dependencies:

bash

pip install 'markitdown[xlsx]'  # Modern Excel
pip install 'markitdown[xls]'   # Legacy Excel

Best For:

Data tables
Research data
Statistical results
Experimental data

Output Format:

markdown

# Sheet: Results

| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1      | 10.2    | 12.5      | 0.023   |
| 2      | 9.8     | 11.9      | 0.031   |
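Because each sheet arrives as a Markdown table, the values can be read back into Python for quick checks. The sketch below parses the first table in a string shaped like the example above. `parse_first_table` is a hypothetical helper, not a MarkItDown function, and it does not handle escaped pipe characters inside cells.

```python
# Sample text copied from the output format shown above.
SAMPLE = """# Sheet: Results

| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1      | 10.2    | 12.5      | 0.023   |
| 2      | 9.8     | 11.9      | 0.031   |
"""


def _cells(line):
    """Split one table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_first_table(markdown):
    """Return the first Markdown table as a list of {header: value} dicts."""
    table_lines = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            # The first contiguous run of table lines has ended.
            break
    if len(table_lines) < 2:
        return []
    header = _cells(table_lines[0])
    rows = []
    for line in table_lines[1:]:
        values = _cells(line)
        # Skip the |---|---| separator row.
        if all(value and set(value) <= set("-:") for value in values):
            continue
        rows.append(dict(zip(header, values)))
    return rows


rows = parse_first_table(SAMPLE)
# Cell values are strings; convert them explicitly before doing arithmetic.
mean_treatment = sum(float(row["Treatment"]) for row in rows) / len(rows)
print(rows[0])
print(mean_treatment)
```

All values come back as strings, so numeric columns need an explicit conversion, as in the mean calculation at the end.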

Example:

python

result = md.convert("experimental_data.xlsx")

Image Formats

Images (.jpg, .jpeg, .png, .gif, .webp)

Capabilities:

EXIF metadata extraction
OCR text extraction
AI-powered image descriptions

Dependencies:

bash

pip install 'markitdown[all]'  # Includes image support

Best For:

Scanned documents
Charts and graphs
Scientific diagrams
Photographs with text

Output Without AI:

markdown

![Image](image.jpg)

**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000

Output With AI:

python

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")

OCR for Text Extraction: requires Tesseract OCR to be installed on the system:

bash

# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr

Audio Formats

Audio (.wav, .mp3)

Capabilities:

Metadata extraction
Speech-to-text transcription
Duration and technical info

Dependencies:

bash

pip install 'markitdown[audio-transcription]'

Best For:

Lecture recordings
Interviews
Podcasts
Meeting recordings

Output Format:

markdown

# Audio: interview.mp3

**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz

**Transcription**:
[Transcribed text appears here...]
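When only the transcript or a single metadata field is needed, the converted text can be separated into its two parts. This is a rough sketch that assumes the `**Metadata**` and `**Transcription**` markers appear exactly as in the example above. `parse_audio_output` is a hypothetical name, and real output may be laid out differently.

```python
# Sample text copied from the output format shown above.
SAMPLE = """# Audio: interview.mp3

**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz

**Transcription**:
[Transcribed text appears here...]
"""


def parse_audio_output(markdown):
    """Return (metadata dict, transcription text) from converted audio text."""
    metadata = {}
    transcription_lines = []
    section = None
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("**Metadata**"):
            section = "metadata"
        elif stripped.startswith("**Transcription**"):
            section = "transcription"
        elif section == "metadata" and stripped.startswith("- "):
            # Split on the first colon only, so "45:32" stays intact.
            key, _, value = stripped[2:].partition(":")
            metadata[key.strip()] = value.strip()
        elif section == "transcription":
            transcription_lines.append(line)
    return metadata, "\n".join(transcription_lines).strip()


metadata, transcript = parse_audio_output(SAMPLE)
print(metadata["Duration"])
print(transcript)
```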

Example:

python

result = md.convert("lecture.mp3")

Web Formats

HTML (.html, .htm)

Capabilities:

Clean HTML to Markdown conversion
Link preservation
Table conversion
List formatting

Best For:

Web pages
Documentation
Blog posts
Online articles

Output Format: Clean Markdown with preserved links and structure

Example:

python

result = md.convert("webpage.html")

YouTube URLs

Capabilities:

Fetch video transcriptions
Extract video metadata
Caption download

Dependencies:

bash

pip install 'markitdown[youtube-transcription]'

Best For:

Educational videos
Lectures
Talks
Tutorials

Example:

python

result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")

Data Formats

CSV (.csv)

Capabilities:

Automatic table conversion
Delimiter detection
Header preservation

Output Format: Markdown tables

Example:

python

result = md.convert("data.csv")

Output:

markdown

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |

JSON (.json)

Capabilities:

Structured representation
Pretty formatting
Nested data visualization

Best For:

API responses
Configuration files
Data exports

Example:

python

result = md.convert("data.json")

XML (.xml)

Capabilities:

Structure preservation
Attribute extraction
Formatted output

Best For:

Configuration files
Data interchange
Structured documents

Example:

python

result = md.convert("config.xml")

Archive Formats

ZIP (.zip)

Capabilities:

Iterates through archive contents
Converts each file individually
Maintains directory structure in output

Best For:

Document collections
Project archives
Batch conversions

Output Format:

markdown

# Archive: documents.zip

## File: document1.pdf
[Content from document1.pdf...]

---

## File: document2.docx
[Content from document2.docx...]
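Since every archive member gets its own `## File:` heading in this layout, the combined text can be split back into one entry per file. The following is a sketch under that assumption. `split_archive_sections` is a hypothetical helper, and it will be confused if a converted file itself contains a line starting with `## File:`.

```python
import re

# Sample text copied from the output format shown above.
SAMPLE = """# Archive: documents.zip

## File: document1.pdf
[Content from document1.pdf...]

---

## File: document2.docx
[Content from document2.docx...]
"""

FILE_HEADING = re.compile(r"^## File:\s*(.+)$")


def split_archive_sections(markdown):
    """Map each archived file name to its converted text."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        match = FILE_HEADING.match(line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    cleaned = {}
    for name, lines in sections.items():
        text = "\n".join(lines).strip()
        # Drop the trailing "---" separator placed between files.
        if text == "---" or text.endswith("\n---"):
            text = text[:-3].rstrip()
        cleaned[name] = text
    return cleaned


for name, text in split_archive_sections(SAMPLE).items():
    print(name, "->", len(text), "characters")
```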

Example:

python

result = md.convert("archive.zip")

E-book Formats

EPUB (.epub)

Capabilities:

Full text extraction
Chapter structure
Metadata extraction

Best For:

E-books
Digital publications
Long-form content

Output Format: Markdown with preserved chapter structure

Example:

python

result = md.convert("book.epub")

Other Formats

Outlook Messages (.msg)

Capabilities:

Email content extraction
Attachment listing
Metadata (from, to, subject, date)

Dependencies:

bash

pip install 'markitdown[outlook]'

Best For:

Email archives
Communication records

Example:

python

result = md.convert("message.msg")

Format-Specific Tips

PDF Best Practices

Use Azure Document Intelligence for complex layouts:

```python
md = MarkItDown(docintel_endpoint="endpoint_url")
```

For scanned PDFs, ensure OCR is set up:

```bash
brew install tesseract  # macOS
```

Split very large PDFs before conversion for better performance

PowerPoint Best Practices

Use AI for visual content:

python

md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Speaker notes are included in the output, so check them when reviewing results
Animations are not captured; only static slide content is converted

Excel Best Practices

Large spreadsheets may take time to convert
Formulas are converted to their calculated values
Multiple sheets are all included in output
Charts become text descriptions (use AI for better descriptions)

Image Best Practices

Use AI for meaningful descriptions:

python

md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)

For text-heavy images, ensure OCR dependencies are installed
High-resolution images may take longer to process

Audio Best Practices

Clear audio produces better transcriptions
Long recordings may take significant time
Consider splitting long audio files for faster processing

Unsupported Formats

If you need to convert an unsupported format:

Create a custom converter (see api_reference.md)
Search GitHub for plugins tagged #markitdown-plugin
Pre-convert to a supported format (e.g., convert .rtf to .docx)
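Before choosing one of these options, it can help to check whether a file's extension appears in this document at all, and which optional extra its Dependencies entry names. The lookup below is a convenience sketch built only from the sections above. `EXTRA_BY_EXTENSION`, `NO_EXTRA_LISTED`, and `install_hint` are hypothetical names, and the table is not an authoritative registry of what MarkItDown supports.

```python
from pathlib import Path

# Extension -> optional extra, taken from the Dependencies entries in this
# document. YouTube URLs are omitted because they have no file extension.
EXTRA_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
    ".msg": "outlook",
    # Images are only mentioned together with the catch-all extra.
    ".jpg": "all",
    ".jpeg": "all",
    ".png": "all",
    ".gif": "all",
    ".webp": "all",
}

# Formats described in this document without a Dependencies entry.
NO_EXTRA_LISTED = {".html", ".htm", ".csv", ".json", ".xml", ".zip", ".epub"}


def install_hint(filename):
    """Return a pip command, a 'no extra listed' note, or None if unlisted."""
    extension = Path(filename).suffix.lower()
    if extension in EXTRA_BY_EXTENSION:
        return f"pip install 'markitdown[{EXTRA_BY_EXTENSION[extension]}]'"
    if extension in NO_EXTRA_LISTED:
        return "no extra listed"
    return None


for name in ["paper.pdf", "page.html", "notes.rtf"]:
    print(name, "->", install_hint(name))
```

A `None` result means the extension is not covered here, which is the case where a custom converter, a plugin, or pre-conversion becomes relevant.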

Format Detection

MarkItDown automatically detects the format from:

File extension (primary method)
MIME type (fallback)
File signature (magic bytes, fallback)
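The order above can be illustrated with a small standalone function. This is only a sketch of the extension, then MIME type, then signature fallback chain described here, not MarkItDown's actual implementation. `guess_extension`, `MIME_TO_EXTENSION`, and `MAGIC_SIGNATURES` are hypothetical names, and both tables are deliberately tiny.

```python
from pathlib import Path

# Illustrative subsets only; a real detector would know many more types.
MIME_TO_EXTENSION = {
    "application/pdf": ".pdf",
    "text/html": ".html",
    "text/csv": ".csv",
}

# Leading bytes ("magic bytes") of a few well-known formats.
# Note: .docx, .xlsx, .pptx and .epub are ZIP containers, so the ZIP
# signature alone cannot tell them apart.
MAGIC_SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
    b"PK\x03\x04": ".zip",
}


def guess_extension(filename, mime_type=None, header=b""):
    """Guess a file extension using the three-step order described above."""
    # 1. File extension (primary method).
    suffix = Path(filename).suffix.lower()
    if suffix:
        return suffix
    # 2. MIME type (fallback), for example from an HTTP Content-Type header.
    if mime_type in MIME_TO_EXTENSION:
        return MIME_TO_EXTENSION[mime_type]
    # 3. File signature (fallback), using the first bytes of the file.
    for signature, extension in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return extension
    return None


print(guess_extension("report.pdf"))
print(guess_extension("download", mime_type="text/csv"))
print(guess_extension("download", header=b"%PDF-1.7"))
```

When a guess like this is available but the file name carries no extension, the result can be passed as `file_extension` to override detection.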

Override detection:

python

# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")

# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")