# PDF Parsing
Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.
## PDFCrawlerStrategy

`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.
### When to Use

Use `PDFCrawlerStrategy` when you need to:

- Process PDF documents as part of a crawling workflow driven by `AsyncWebCrawler`.
- Handle both remote PDF URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
- Integrate PDF content into the common `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.

### Key Methods

- `__init__(self, logger: AsyncLogger = None)`: Initializes the strategy. `logger` is an optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
- `async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`: Called by `AsyncWebCrawler` during the `arun` process. It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`. The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy). It sets `response_headers` to indicate `"application/pdf"` and `status_code` to 200.
- `async close(self)`: Performs any necessary cleanup. For `PDFCrawlerStrategy`, this is usually minimal.
- `async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`: Allow the strategy to be used as an async context manager with `async with`.
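Based on the signatures above, direct use of the strategy (outside `AsyncWebCrawler`) might look like the following sketch. It assumes `__aenter__` returns the strategy instance, as is conventional for async context managers; the expected attribute values are taken from the method descriptions above.

```python
import asyncio
from crawl4ai.processors.pdf import PDFCrawlerStrategy

async def main():
    # Use the strategy as an async context manager, as described above
    async with PDFCrawlerStrategy() as strategy:
        response = await strategy.crawl("https://arxiv.org/pdf/2310.06825.pdf")
        # The response is a lightweight placeholder; the PDF itself is
        # parsed later by a PDF-aware scraping strategy.
        print(response.status_code)       # expected: 200
        print(response.response_headers)  # expected to indicate "application/pdf"
        print(len(response.html or ""))   # typically empty or a placeholder

asyncio.run(main())
```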
### Example

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    # Initialize the PDF crawler strategy
    pdf_crawler_strategy = PDFCrawlerStrategy()

    # PDFCrawlerStrategy is typically used in conjunction with
    # PDFContentScrapingStrategy, which handles the actual PDF content extraction
    pdf_scraping_strategy = PDFContentScrapingStrategy()
    run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)

    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
        # Example with a remote PDF URL
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # A public PDF from arXiv
        print(f"Attempting to process PDF: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_config)

        if result.success:
            print(f"Successfully processed PDF: {result.url}")
            print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
            # Further processing of result.markdown, result.media, etc. would be
            # done here, based on what PDFContentScrapingStrategy extracts.
            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
            else:
                print("No markdown (text) content extracted.")
        else:
            print(f"Failed to process PDF: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
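Local PDFs work the same way: pass a `file://` URL (or local path) to `arun`. A brief sketch, with a hypothetical path:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    run_config = CrawlerRunConfig(scraping_strategy=PDFContentScrapingStrategy())
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        # file:// URL pointing at a PDF on disk (hypothetical path)
        result = await crawler.arun(url="file:///path/to/your/document.pdf", config=run_config)
        if result.success:
            print(result.metadata.get("title", "N/A"))
        else:
            print(result.error_message)

asyncio.run(main())
```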
**Pros:**

- Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.

**Cons:**

- Relies on a separate scraping strategy (such as `PDFContentScrapingStrategy`) to actually process the PDF; on its own it only identifies and hands off the PDF source.

## PDFContentScrapingStrategy

`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. Internally, it uses `NaivePDFProcessorStrategy` to perform the low-level PDF parsing.
### When to Use

Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
- Extract the text content of a PDF.
- Optionally extract images embedded in the PDF.
- Retrieve PDF metadata (title, author, page count, and so on).
- Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).

### Configuration

When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:

- `extract_images: bool = False`: If `True`, the strategy will attempt to extract images from the PDF.
- `save_images_locally: bool = False`: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
- `image_save_dir: str = None`: The directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
- `batch_size: int = 4`: How many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
- `logger: AsyncLogger = None`: An optional `AsyncLogger` instance for logging.

### Key Methods

- `__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`: Initializes the strategy with the given configuration and sets up the internal `NaivePDFProcessorStrategy` instance that performs the actual PDF parsing.
- `scrap(self, url: str, html: str, **params) -> ScrapingResult`: The synchronous method (usually invoked via `ascrap`) that processes the PDF.
    - `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
    - `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
    - The PDF is downloaded to a temporary local file first if `url` is remote. The method returns a `ScrapingResult` object with:
        - `cleaned_html`: An HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
        - `media`: A dictionary where `media["images"]` contains information about extracted images if `extract_images` was `True`.
        - `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
        - `metadata`: A dictionary holding PDF metadata (e.g., `title`, `author`, `num_pages`).
- `async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`: The asynchronous counterpart to `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
- `_get_pdf_path(self, url: str) -> str`: An internal helper. If `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
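For a quick sense of the raw output, `scrap` can also be called directly, without `AsyncWebCrawler`. A minimal sketch, assuming a hypothetical local PDF path; the fields read are the `ScrapingResult` fields listed above:

```python
from crawl4ai.processors.pdf import PDFContentScrapingStrategy

strategy = PDFContentScrapingStrategy(extract_images=False)

# html is empty because the source is a PDF, not an HTML page
result = strategy.scrap(url="file:///path/to/sample.pdf", html="")

print(result.metadata.get("title", "N/A"))   # PDF metadata
print(result.cleaned_html[:200])             # HTML-like, page-wrapped text
print(result.media.get("images", []))        # empty here: extract_images=False
```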
### Example

```python
import asyncio
import os  # For creating the image directory

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    # Define the directory for saving extracted images
    image_output_dir = "./my_pdf_images"
    os.makedirs(image_output_dir, exist_ok=True)

    # Configure the PDF content scraping strategy:
    # enable image extraction and specify where to save the files
    pdf_scraping_cfg = PDFContentScrapingStrategy(
        extract_images=True,
        save_images_locally=True,
        image_save_dir=image_output_dir,
        batch_size=2  # Process 2 pages at a time for demonstration
    )

    # The PDFCrawlerStrategy tells AsyncWebCrawler how to "crawl" a PDF
    pdf_crawler_cfg = PDFCrawlerStrategy()

    # Configure the overall crawl run to use our PDF scraping strategy
    run_cfg = CrawlerRunConfig(scraping_strategy=pdf_scraping_cfg)

    # Initialize the crawler with the PDF-specific crawler strategy
    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF
        print(f"Starting PDF processing for: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_cfg)

        if result.success:
            print("\n--- PDF Processing Successful ---")
            print(f"Processed URL: {result.url}")

            print("\n--- Metadata ---")
            for key, value in result.metadata.items():
                print(f"  {key.replace('_', ' ').title()}: {value}")

            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print("\n--- Extracted Text (Markdown Snippet) ---")
                print(result.markdown.raw_markdown[:500].strip() + "...")
            else:
                print("\nNo text (markdown) content extracted.")

            if result.media and result.media.get("images"):
                print("\n--- Image Extraction ---")
                print(f"Extracted {len(result.media['images'])} image(s).")
                # Show info for the first 2 images
                for i, img_info in enumerate(result.media["images"][:2]):
                    print(f"  Image {i+1}:")
                    print(f"    Page: {img_info.get('page')}")
                    print(f"    Format: {img_info.get('format', 'N/A')}")
                    if img_info.get('path'):
                        print(f"    Saved at: {img_info.get('path')}")
            else:
                print("\nNo images were extracted (or extract_images was False).")
        else:
            print("\n--- PDF Processing Failed ---")
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
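If you only need the parsing step inside asynchronous code, `ascrap` can be awaited directly; a small sketch using the same arXiv URL:

```python
import asyncio
from crawl4ai.processors.pdf import PDFContentScrapingStrategy

async def extract_metadata(url: str):
    strategy = PDFContentScrapingStrategy()
    # ascrap delegates to the synchronous scrap in a worker thread,
    # so PDF parsing does not block the event loop
    result = await strategy.ascrap(url=url, html="")
    return result.metadata

print(asyncio.run(extract_metadata("https://arxiv.org/pdf/2310.06825.pdf")))
```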
**Pros:**

- Maps PDF content into the standard `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.

**Cons:**

- Saving extracted images to disk adds I/O and storage overhead (when `save_images_locally` is `True`).
- Uses `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
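If you need text from scanned PDFs, one common approach is an OCR pre-step performed entirely outside Crawl4AI. A rough sketch using the third-party `pdf2image` and `pytesseract` packages (an assumption about your environment; `pdf2image` also requires Poppler, and `pytesseract` requires a Tesseract install):

```python
from pdf2image import convert_from_path  # third-party; requires Poppler
import pytesseract                       # third-party; requires Tesseract

# Render each page of a scanned PDF (hypothetical path) to an image,
# then OCR the images into plain text
pages = convert_from_path("/path/to/scanned.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:300])
```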