This document provides detailed information about each file format supported by MarkItDown.
Capabilities:
Dependencies:
pip install 'markitdown[pdf]'
Best For:
Limitations:
Example:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
Enhanced with Azure Document Intelligence:
md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")
Capabilities:
Dependencies:
pip install 'markitdown[docx]'
Best For:
Preserved Elements:
Example:
result = md.convert("manuscript.docx")
Capabilities:
Dependencies:
pip install 'markitdown[pptx]'
Best For:
Output Format:
# Slide 1: Title
Content from slide 1...
**Notes**: Speaker notes appear here
---
# Slide 2: Next Topic
...
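If the converted presentation needs to be processed slide by slide, the layout above can be split back into records. The following is a minimal sketch, assuming the output really follows the format shown here (slides separated by a `---` line, a `# ` heading per slide, notes on a `**Notes**:` line). The helper name `split_slides` is hypothetical and not part of MarkItDown.

```python
import re


def split_slides(markdown: str) -> list[dict]:
    """Split presentation Markdown into one record per slide.

    Assumes the layout shown in this document: slides separated by a line
    containing only '---', a '# ' heading per slide, and optional speaker
    notes on a line starting with '**Notes**:'.
    """
    slides = []
    for chunk in re.split(r"^[ \t]*---[ \t]*$", markdown, flags=re.MULTILINE):
        lines = [line for line in chunk.strip().splitlines() if line.strip()]
        if not lines:
            continue
        title, body, notes = "", [], []
        for line in lines:
            if line.startswith("# ") and not title:
                title = line[2:].strip()
            elif line.startswith("**Notes**:"):
                notes.append(line[len("**Notes**:"):].strip())
            else:
                body.append(line)
        slides.append(
            {"title": title, "body": "\n".join(body), "notes": " ".join(notes)}
        )
    return slides


sample = """# Slide 1: Title
Content from slide 1...
**Notes**: Speaker notes appear here
---
# Slide 2: Next Topic
..."""

slides = split_slides(sample)
for slide in slides:
    print(slide["title"], "|", slide["notes"])
```

In practice `sample` would be `result.text_content` from a converted `.pptx` file. Check the real output first, since the exact heading and separator text may differ between MarkItDown versions.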
With AI Image Descriptions:
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")
Capabilities:
Dependencies:
pip install 'markitdown[xlsx]' # Modern Excel
pip install 'markitdown[xls]' # Legacy Excel
Best For:
Output Format:
# Sheet: Results
| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |
Example:
result = md.convert("experimental_data.xlsx")
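Because each sheet is rendered as a Markdown table, the values can be read back for quick checks without reopening the workbook. The sketch below is illustrative only: `parse_markdown_table` is a hypothetical helper, it assumes a single pipe-delimited table in the text, and it returns every cell as a string.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into a list of row dicts.

    Assumes the text contains one table: a header row, a |---| separator
    row, then data rows. All values are returned as strings.
    """

    def cells(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    table_lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(table_lines) < 2:
        return []
    header = cells(table_lines[0])
    # Skip the header row and the separator row.
    return [dict(zip(header, cells(line))) for line in table_lines[2:]]


sheet = """# Sheet: Results
| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1 | 10.2 | 12.5 | 0.023 |
| 2 | 9.8 | 11.9 | 0.031 |"""

rows = parse_markdown_table(sheet)
# Example filter: samples with a p-value below 0.03.
significant = [row["Sample"] for row in rows if float(row["P-value"]) < 0.03]
print(significant)
```

For a workbook with several sheets, split the output on the sheet headings first and parse each part separately.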
Capabilities:
Dependencies:
pip install 'markitdown[all]' # Includes image support
Best For:
Output Without AI:

**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000
Output With AI:
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")
OCR for Text Extraction (requires Tesseract OCR to be installed):
# macOS
brew install tesseract
# Ubuntu
sudo apt-get install tesseract-ocr
Capabilities:
Dependencies:
pip install 'markitdown[audio-transcription]'
Best For:
Output Format:
# Audio: interview.mp3
**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz
**Transcription**:
[Transcribed text appears here...]
Example:
result = md.convert("lecture.mp3")
Capabilities:
Best For:
Output Format: Clean Markdown with preserved links and structure
Example:
result = md.convert("webpage.html")
Capabilities:
Dependencies:
pip install 'markitdown[youtube-transcription]'
Best For:
Example:
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
Capabilities:
Output Format: Markdown tables
Example:
result = md.convert("data.csv")
Output:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
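To make the row-to-table mapping concrete, here is a standard-library sketch that turns CSV text into the same kind of table. It is an illustration of the idea, not MarkItDown's own implementation, and `csv_to_markdown` is a hypothetical name.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a pipe-delimited Markdown table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def render(cells: list[str]) -> str:
        # Pad or trim each row to the header width and escape literal pipes.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [render(header), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)


table = csv_to_markdown("Column1,Column2,Column3\nValue1,Value2,Value3\n")
print(table)
```

The sketch treats the first row as the header and escapes pipe characters inside cells so they do not break the table.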
Capabilities:
Best For:
Example:
result = md.convert("data.json")
Capabilities:
Best For:
Example:
result = md.convert("config.xml")
Capabilities:
Best For:
Output Format:
# Archive: documents.zip
## File: document1.pdf
[Content from document1.pdf...]
---
## File: document2.docx
[Content from document2.docx...]
Example:
result = md.convert("archive.zip")
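When only some files from an archive are of interest, the combined output can be split back into per-file sections. This is a rough sketch that assumes the `## File: <name>` headings and `---` separators shown above. `split_archive_sections` is a hypothetical helper, and it will misread content that itself contains such headings or separator lines.

```python
import re


def split_archive_sections(markdown: str) -> dict[str, str]:
    """Map each '## File: <name>' heading to the text that follows it.

    Assumes the archive layout shown in this document. Lines consisting
    only of '---' are treated as separators and dropped.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^## File: (.+)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None and line.strip() != "---":
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


archive_output = """# Archive: documents.zip
## File: document1.pdf
[Content from document1.pdf...]
---
## File: document2.docx
[Content from document2.docx...]"""

sections = split_archive_sections(archive_output)
for name, text in sections.items():
    print(name, "->", len(text), "characters")
```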
Capabilities:
Best For:
Output Format: Markdown with preserved chapter structure
Example:
result = md.convert("book.epub")
Capabilities:
Dependencies:
pip install 'markitdown[outlook]'
Best For:
Example:
result = md.convert("message.msg")
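The optional extras named in the Dependencies entries above can be collected into a small lookup, which helps when setting up an environment for a mixed batch of files. The sketch below is only a convenience: the mapping covers just the extras and example extensions that appear in this document (for instance, only `.mp3` for audio), and `pip_command_for` is a hypothetical helper.

```python
from pathlib import Path

# Extension -> optional extra, taken from the Dependencies entries in this
# document. Formats listed without a Dependencies entry (HTML, CSV, JSON,
# XML, ZIP, EPUB) are assumed to need no extra.
EXTRAS_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".mp3": "audio-transcription",
    ".msg": "outlook",
}


def pip_command_for(paths: list[str]) -> str:
    """Build a pip command that installs the extras needed for the given files."""
    extras = set()
    for path in paths:
        extension = Path(path).suffix.lower()
        if extension in EXTRAS_BY_EXTENSION:
            extras.add(EXTRAS_BY_EXTENSION[extension])
    if not extras:
        return "pip install markitdown"
    return f"pip install 'markitdown[{','.join(sorted(extras))}]'"


command = pip_command_for(["research_paper.pdf", "manuscript.docx", "data.csv"])
print(command)
```

Installing `markitdown[all]` remains the simpler option when install size is not a concern.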
Use Azure Document Intelligence for complex layouts:
md = MarkItDown(docintel_endpoint="endpoint_url")
For scanned PDFs, ensure OCR is set up:
brew install tesseract # macOS
Split very large PDFs before conversion for better performance
Use AI for visual content:
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
Check speaker notes - they're included in output
Complex animations won't be captured - static content only
Large spreadsheets may take time to convert
Formulas are converted to their calculated values
Multiple sheets are all included in output
Charts become text descriptions (use AI for better descriptions)
Use AI for meaningful descriptions:
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)
For text-heavy images, ensure OCR dependencies are installed
High-resolution images may take longer to process
Clear audio produces better transcriptions
Long recordings may take significant time
Consider splitting long audio files for faster processing
If you need to convert an unsupported format:
See api_reference.md.
MarkItDown automatically detects format from:
Override detection:
# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")
# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
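When a file has no usable extension, one way to choose the override is to look at its first few bytes. The sketch below is an assumption-laden illustration, not a description of MarkItDown's internal detection: `guess_extension` and `convert_with_hint` are hypothetical helpers, the signature table is deliberately tiny, and `md` is any object exposing `convert(source, file_extension=...)` as in the example above.

```python
# A few well-known file signatures ("magic bytes"). This list is illustrative
# and far from complete.
MAGIC_PREFIXES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    # ZIP is also the container for DOCX, PPTX, XLSX and EPUB, so this hint
    # alone cannot tell those formats apart.
    (b"PK\x03\x04", ".zip"),
]


def guess_extension(header: bytes):
    """Return an extension for a known signature, or None if unrecognized."""
    for prefix, extension in MAGIC_PREFIXES:
        if header.startswith(prefix):
            return extension
    return None


def convert_with_hint(md, path: str):
    """Convert `path`, passing file_extension only when a signature matches."""
    with open(path, "rb") as f:
        extension = guess_extension(f.read(16))
    if extension is None:
        return md.convert(path)
    return md.convert(path, file_extension=extension)


print(guess_extension(b"%PDF-1.7\n"))
print(guess_extension(b"plain text"))
```

Because the ZIP signature is shared by several Office and e-book formats, treat a `.zip` guess as a starting point and prefer a real file extension whenever one is available.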