Back to Crawl4ai

πŸš€ Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update

docs/blog/release-v0.7.4.md

0.8.610.1 KB
Original Source

πŸš€ Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update

August 17, 2025 β€’ 6 min read


Today I'm releasing Crawl4AI v0.7.4β€”the Intelligent Table Extraction & Performance Update. This release introduces revolutionary LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.

🎯 What's New at a Glance

  • πŸš€ LLMTableExtraction: Revolutionary table extraction with intelligent chunking for massive tables
  • ⚑ Enhanced Concurrency: True concurrency improvements for fast-completing tasks in batch operations
  • πŸ”§ Browser Manager Fixes: Resolved race conditions in concurrent page creation
  • ⌨️ Cross-Platform Browser Profiler: Improved keyboard handling and quit mechanisms
  • πŸ”— Advanced URL Processing: Better handling of raw URLs and base tag link resolution
  • πŸ›‘οΈ Enhanced Proxy Support: Flexible proxy configuration with dict and string formats
  • 🐳 Docker Improvements: Better API handling and raw HTML support

πŸš€ LLMTableExtraction: Revolutionary Table Processing

The Problem: Complex tables with rowspan, colspan, nested structures, or massive datasets that traditional HTML parsing can't handle effectively. Large tables that exceed token limits crash extraction processes.

My Solution: I developed LLMTableExtractionβ€”an intelligent table extraction strategy that uses Large Language Models with automatic chunking to handle tables of any size and complexity.

Technical Implementation

python
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig, 
    LLMConfig,
    LLMTableExtraction,
    CacheMode
)

# Configure LLM for table extraction
llm_config = LLMConfig(
    provider="openai/gpt-4.1-mini",
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=32000
)

# Create intelligent table extraction strategy
table_strategy = LLMTableExtraction(
    llm_config=llm_config,
    verbose=True,
    max_tries=2,
    enable_chunking=True,           # Handle massive tables
    chunk_token_threshold=5000,     # Smart chunking threshold
    overlap_threshold=100,          # Maintain context between chunks
    extraction_type="structured"    # Get structured data output
)

# Apply to crawler configuration
config = CrawlerRunConfig(
    table_extraction_strategy=table_strategy,
    cache_mode=CacheMode.BYPASS
)

async with AsyncWebCrawler() as crawler:
    # Extract complex tables with intelligence
    result = await crawler.arun(
        "https://en.wikipedia.org/wiki/List_of_countries_by_GDP", 
        config=config
    )
    
    # Access extracted tables directly
    for i, table in enumerate(result.tables):
        print(f"Table {i}: {len(table['data'])} rows Γ— {len(table['headers'])} columns")
        
        # Convert to pandas DataFrame instantly
        import pandas as pd
        df = pd.DataFrame(table['data'], columns=table['headers'])
        print(df.head())

Intelligent Chunking for Massive Tables:

python
# Handle tables that exceed token limits
large_table_strategy = LLMTableExtraction(
    llm_config=llm_config,
    enable_chunking=True,
    chunk_token_threshold=3000,    # Conservative threshold
    overlap_threshold=150,         # Preserve context
    max_concurrent_chunks=3,       # Parallel processing
    merge_strategy="intelligent"   # Smart chunk merging
)

# Process Wikipedia comparison tables, financial reports, etc.
config = CrawlerRunConfig(
    table_extraction_strategy=large_table_strategy,
    # Target specific table containers
    css_selector="div.wikitable, table.sortable",
    delay_before_return_html=2.0
)

result = await crawler.arun(
    "https://en.wikipedia.org/wiki/Comparison_of_operating_systems",
    config=config
)

# Tables are automatically chunked, processed, and merged
print(f"Extracted {len(result.tables)} complex tables")
for table in result.tables:
    print(f"Merged table: {len(table['data'])} total rows")

Advanced Features:

  • Intelligent Chunking: Automatically splits massive tables while preserving structure
  • Context Preservation: Overlapping chunks maintain column relationships
  • Parallel Processing: Concurrent chunk processing for speed
  • Smart Merging: Reconstructs complete tables from processed chunks
  • Complex Structure Support: Handles rowspan, colspan, nested tables
  • Metadata Extraction: Captures table context, captions, and relationships

Expected Real-World Impact:

  • Financial Analysis: Extract complex earnings tables and financial statements
  • Research & Academia: Process large datasets from Wikipedia, research papers
  • E-commerce: Handle product comparison tables with complex layouts
  • Government Data: Extract census data, statistical tables from official sources
  • Competitive Intelligence: Process competitor pricing and feature tables

⚑ Enhanced Concurrency: True Performance Gains

The Problem: The arun_many() method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations.

My Solution: I implemented true concurrency improvements in the dispatcher that enable genuine parallel processing for fast-completing tasks.

Performance Optimization

python
# Before v0.7.4: Sequential-like behavior for fast tasks
# After v0.7.4: True concurrency

async with AsyncWebCrawler() as crawler:
    # These will now run with true concurrency
    urls = [
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1", 
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1"
    ]
    
    # Processes in truly parallel fashion
    results = await crawler.arun_many(urls)
    
    # Performance improvement: ~4x faster for fast-completing tasks
    print(f"Processed {len(results)} URLs with true concurrency")

Expected Real-World Impact:

  • API Crawling: 3-4x faster processing of REST endpoints and API documentation
  • Batch URL Processing: Significant speedup for large URL lists
  • Monitoring Systems: Faster health checks and status page monitoring
  • Data Aggregation: Improved performance for real-time data collection

πŸ”§ Critical Stability Fixes

Browser Manager Race Condition Resolution

The Problem: Concurrent page creation in persistent browser contexts caused "Target page/context closed" errors during high-concurrency operations.

My Solution: Implemented thread-safe page creation with proper locking mechanisms.

python
# Fixed: Safe concurrent page creation
browser_config = BrowserConfig(
    browser_type="chromium",
    use_persistent_context=True,  # Now thread-safe
    max_concurrent_sessions=10    # Safely handle concurrent requests
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    # These concurrent operations are now stable
    tasks = [crawler.arun(url) for url in url_list]
    results = await asyncio.gather(*tasks)  # No more race conditions

Enhanced Browser Profiler

The Problem: Inconsistent keyboard handling across platforms and unreliable quit mechanisms.

My Solution: Cross-platform keyboard listeners with improved quit handling.

Advanced URL Processing

The Problem: Raw URL formats (raw:// and raw:) weren't properly handled, and base tag link resolution was incomplete.

My Solution: Enhanced URL preprocessing and base tag support.

python
# Now properly handles all URL formats
urls = [
    "https://example.com",
    "raw://static-html-content", 
    "raw:file://local-file.html"
]

# Base tag links are now correctly resolved
config = CrawlerRunConfig(
    include_links=True,  # Links properly resolved with base tags
    resolve_absolute_urls=True
)

πŸ›‘οΈ Enhanced Proxy Configuration

The Problem: Proxy configuration only accepted specific formats, limiting flexibility.

My Solution: Enhanced ProxyConfig to support both dictionary and string formats.

python
# Multiple proxy configuration formats now supported
from crawl4ai import BrowserConfig, ProxyConfig

# String format
proxy_config = ProxyConfig("http://proxy.example.com:8080")

# Dictionary format  
proxy_config = ProxyConfig({
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
})

# Use with crawler
browser_config = BrowserConfig(proxy_config=proxy_config)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://httpbin.org/ip")

🐳 Docker & Infrastructure Improvements

This release includes several Docker and infrastructure improvements:

  • Better API Token Handling: Improved Docker example scripts with correct endpoints
  • Raw HTML Support: Enhanced Docker API to handle raw HTML content properly
  • Documentation Updates: Comprehensive Docker deployment examples
  • Test Coverage: Expanded test suite with better coverage

πŸ“š Documentation & Examples

Enhanced documentation includes:

  • LLM Table Extraction Guide: Comprehensive examples and best practices
  • Migration Documentation: Updated patterns for new table extraction methods
  • Docker Deployment: Complete deployment guide with examples
  • Performance Optimization: Guidelines for concurrent crawling

πŸ™ Acknowledgments

Thanks to our contributors and community for feedback, bug reports, and feature requests that made this release possible.

πŸ“š Resources


Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impossible to process, while concurrency improvements make batch operations 3-4x faster. Try the intelligent table extractionβ€”it's a game changer for data extraction workflows!

Happy Crawling! πŸ•·οΈ

- The Crawl4AI Team