# PDF Parsing
Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.
## PDFCrawlerStrategy

`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.
### When to Use

Use `PDFCrawlerStrategy` when you need to:

- Process PDF documents as part of a crawling workflow driven by `AsyncWebCrawler`.
- Handle both remote PDF URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
- Integrate PDF content into the common `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.

### Key Methods

- `__init__(self, logger: AsyncLogger = None)`: Initializes the strategy. `logger` is an optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
- `async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`: Called by `AsyncWebCrawler` during the `arun` process. It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`. The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy). It sets `response_headers` to indicate `"application/pdf"` and `status_code` to 200.
- `async close(self)`: Performs any necessary cleanup. For `PDFCrawlerStrategy`, this is usually minimal.
- `async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`: Allow the strategy to be used as an async context manager with `async with`.
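Based on the signatures above, direct use of the strategy (outside `AsyncWebCrawler`) might look like the following sketch. It assumes `__aenter__` returns the strategy instance, as is conventional for async context managers; the expected attribute values are taken from the method descriptions above.

```python
import asyncio
from crawl4ai.processors.pdf import PDFCrawlerStrategy

async def main():
    # Use the strategy as an async context manager, as described above
    async with PDFCrawlerStrategy() as strategy:
        response = await strategy.crawl("https://arxiv.org/pdf/2310.06825.pdf")
        # The response is a lightweight placeholder; the PDF itself is
        # parsed later by a PDF-aware scraping strategy.
        print(response.status_code)       # expected: 200
        print(response.response_headers)  # expected to indicate "application/pdf"
        print(len(response.html or ""))   # typically empty or a placeholder

asyncio.run(main())
```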
### Example

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    # Initialize the PDF crawler strategy
    pdf_crawler_strategy = PDFCrawlerStrategy()

    # PDFCrawlerStrategy is typically used in conjunction with
    # PDFContentScrapingStrategy, which handles the actual PDF content extraction
    pdf_scraping_strategy = PDFContentScrapingStrategy()
    run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)

    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
        # Example with a remote PDF URL
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # A public PDF from arXiv
        print(f"Attempting to process PDF: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_config)

        if result.success:
            print(f"Successfully processed PDF: {result.url}")
            print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
            # Further processing of result.markdown, result.media, etc. would be
            # done here, based on what PDFContentScrapingStrategy extracts.
            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
            else:
                print("No markdown (text) content extracted.")
        else:
            print(f"Failed to process PDF: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
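Local PDFs work the same way: pass a `file://` URL (or local path) to `arun`. A brief sketch, with a hypothetical path:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    run_config = CrawlerRunConfig(scraping_strategy=PDFContentScrapingStrategy())
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        # file:// URL pointing at a PDF on disk (hypothetical path)
        result = await crawler.arun(url="file:///path/to/your/document.pdf", config=run_config)
        if result.success:
            print(result.metadata.get("title", "N/A"))
        else:
            print(result.error_message)

asyncio.run(main())
```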
**Pros:**

- Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.

**Cons:**

- Relies on a separate scraping strategy (such as `PDFContentScrapingStrategy`) to actually process the PDF; on its own it only identifies and hands off the PDF source.

## PDFContentScrapingStrategy

`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. Internally, it uses `NaivePDFProcessorStrategy` to perform the low-level PDF parsing.
### When to Use

Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
- Extract the text content of a PDF.
- Optionally extract images embedded in the PDF.
- Retrieve PDF metadata (title, author, page count, and so on).
- Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).

### Configuration

When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:

- `extract_images: bool = False`: If `True`, the strategy will attempt to extract images from the PDF.
- `save_images_locally: bool = False`: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
- `image_save_dir: str = None`: The directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
- `batch_size: int = 4`: How many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
- `logger: AsyncLogger = None`: An optional `AsyncLogger` instance for logging.

### Key Methods

- `__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`: Initializes the strategy with the given configuration and sets up the internal `NaivePDFProcessorStrategy` instance that performs the actual PDF parsing.
- `scrap(self, url: str, html: str, **params) -> ScrapingResult`: The synchronous method (usually invoked via `ascrap`) that processes the PDF.
    - `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
    - `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
    - The PDF is downloaded to a temporary local file first if `url` is remote. The method returns a `ScrapingResult` object with:
        - `cleaned_html`: An HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
        - `media`: A dictionary where `media["images"]` contains information about extracted images if `extract_images` was `True`.
        - `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
        - `metadata`: A dictionary holding PDF metadata (e.g., `title`, `author`, `num_pages`).
- `async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`: The asynchronous counterpart to `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
- `_get_pdf_path(self, url: str) -> str`: An internal helper. If `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
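For a quick sense of the raw output, `scrap` can also be called directly, without `AsyncWebCrawler`. A minimal sketch, assuming a hypothetical local PDF path; the fields read are the `ScrapingResult` fields listed above:

```python
from crawl4ai.processors.pdf import PDFContentScrapingStrategy

strategy = PDFContentScrapingStrategy(extract_images=False)

# html is empty because the source is a PDF, not an HTML page
result = strategy.scrap(url="file:///path/to/sample.pdf", html="")

print(result.metadata.get("title", "N/A"))   # PDF metadata
print(result.cleaned_html[:200])             # HTML-like, page-wrapped text
print(result.media.get("images", []))        # empty here: extract_images=False
```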
### Example

```python
import asyncio
import os  # For creating the image directory

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    # Define the directory for saving extracted images
    image_output_dir = "./my_pdf_images"
    os.makedirs(image_output_dir, exist_ok=True)

    # Configure the PDF content scraping strategy:
    # enable image extraction and specify where to save the files
    pdf_scraping_cfg = PDFContentScrapingStrategy(
        extract_images=True,
        save_images_locally=True,
        image_save_dir=image_output_dir,
        batch_size=2  # Process 2 pages at a time for demonstration
    )

    # The PDFCrawlerStrategy tells AsyncWebCrawler how to "crawl" a PDF
    pdf_crawler_cfg = PDFCrawlerStrategy()

    # Configure the overall crawl run to use our PDF scraping strategy
    run_cfg = CrawlerRunConfig(scraping_strategy=pdf_scraping_cfg)

    # Initialize the crawler with the PDF-specific crawler strategy
    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF
        print(f"Starting PDF processing for: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_cfg)

        if result.success:
            print("\n--- PDF Processing Successful ---")
            print(f"Processed URL: {result.url}")

            print("\n--- Metadata ---")
            for key, value in result.metadata.items():
                print(f"  {key.replace('_', ' ').title()}: {value}")

            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print("\n--- Extracted Text (Markdown Snippet) ---")
                print(result.markdown.raw_markdown[:500].strip() + "...")
            else:
                print("\nNo text (markdown) content extracted.")

            if result.media and result.media.get("images"):
                print("\n--- Image Extraction ---")
                print(f"Extracted {len(result.media['images'])} image(s).")
                # Show info for the first 2 images
                for i, img_info in enumerate(result.media["images"][:2]):
                    print(f"  Image {i+1}:")
                    print(f"    Page: {img_info.get('page')}")
                    print(f"    Format: {img_info.get('format', 'N/A')}")
                    if img_info.get('path'):
                        print(f"    Saved at: {img_info.get('path')}")
            else:
                print("\nNo images were extracted (or extract_images was False).")
        else:
            print("\n--- PDF Processing Failed ---")
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
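If you only need the parsing step inside asynchronous code, `ascrap` can be awaited directly; a small sketch using the same arXiv URL:

```python
import asyncio
from crawl4ai.processors.pdf import PDFContentScrapingStrategy

async def extract_metadata(url: str):
    strategy = PDFContentScrapingStrategy()
    # ascrap delegates to the synchronous scrap in a worker thread,
    # so PDF parsing does not block the event loop
    result = await strategy.ascrap(url=url, html="")
    return result.metadata

print(asyncio.run(extract_metadata("https://arxiv.org/pdf/2310.06825.pdf")))
```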
**Pros:**

- Maps PDF content into the standard `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.

**Cons:**

- Saving extracted images to disk adds I/O and storage overhead (when `save_images_locally` is `True`).
- Uses `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
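If you need text from scanned PDFs, one common approach is an OCR pre-step performed entirely outside Crawl4AI. A rough sketch using the third-party `pdf2image` and `pytesseract` packages (an assumption about your environment; `pdf2image` also requires Poppler, and `pytesseract` requires a Tesseract install):

```python
from pdf2image import convert_from_path  # third-party; requires Poppler
import pytesseract                       # third-party; requires Tesseract

# Render each page of a scanned PDF (hypothetical path) to an image,
# then OCR the images into plain text
pages = convert_from_path("/path/to/scanned.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:300])
```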