Migration Guide: Table Extraction v0.7.3

Overview

Version 0.7.3 introduces the Table Extraction Strategy Pattern, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.

What's New

Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

Consistent Architecture: Aligns with extraction, chunking, and markdown strategies
Extensibility: Easy to create custom table extraction strategies
Better Separation: Table logic moved from content scraping to a dedicated module
Full Control: Fine-grained control over table detection and extraction

New Classes

python

from crawl4ai import (
    TableExtractionStrategy,    # Abstract base class
    DefaultTableExtraction,      # Current implementation (default)
    NoTableExtraction           # Explicitly disable extraction
)

Backward Compatibility

✅ All existing code continues to work without changes.

No Changes Required

If your code looks like this, it will continue to work:

python

# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data

What Happens Behind the Scenes

When you don't specify a table_extraction strategy:

CrawlerRunConfig automatically creates DefaultTableExtraction
It uses your table_score_threshold parameter
Tables are extracted exactly as before
Results appear in result.tables with the same structure
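The fallback described in the steps above can be pictured with a small self-contained sketch. It does not import Crawl4AI and is not the library's source: `SketchTableStrategy`, `SketchDefaultExtraction`, and `SketchRunConfig` are hypothetical stand-ins, and the default value of 7 is simply the threshold used in this guide's examples.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


class SketchTableStrategy(ABC):
    """Hypothetical stand-in for a table extraction strategy base class."""

    @abstractmethod
    def extract_tables(self, element, **kwargs) -> list:
        ...


class SketchDefaultExtraction(SketchTableStrategy):
    """Hypothetical default strategy that only records its threshold."""

    def __init__(self, table_score_threshold: int = 7):
        self.table_score_threshold = table_score_threshold

    def extract_tables(self, element, **kwargs) -> list:
        return []  # Real scoring and extraction are omitted in this sketch.


@dataclass
class SketchRunConfig:
    """Hypothetical config that falls back to the default strategy."""

    table_score_threshold: int = 7
    table_extraction: Optional[SketchTableStrategy] = None

    def __post_init__(self):
        if self.table_extraction is None:
            # Legacy path: build the default strategy from the old parameter.
            self.table_extraction = SketchDefaultExtraction(
                table_score_threshold=self.table_score_threshold
            )


legacy = SketchRunConfig(table_score_threshold=5)
print(type(legacy.table_extraction).__name__)
print(legacy.table_extraction.table_score_threshold)
```

The point of the sketch is that legacy callers never touch the strategy object, yet the rest of the pipeline can always rely on one being present.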

New Capabilities

1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

python

# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,              # New: minimum row filter
    min_cols=2,              # New: minimum column filter
    verbose=True             # New: detailed logging
)

config = CrawlerRunConfig(
    table_extraction=strategy
)

2. Disable Table Extraction

Improve performance when tables aren't needed:

python

# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction

3. Custom Extraction Strategies

Create specialized extractors:

python

class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic goes here; collect one dict per table
        custom_tables = []
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)
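As an illustration of what the body of `extract_tables` might do, here is a standalone sketch that walks an element tree and builds table dicts. It makes two assumptions: the element exposes ElementTree-style `iter`, `findall`, and `itertext` methods (the standard library parser is used here so the example runs without Crawl4AI), and each table is a dict with the `headers` and `rows` keys used elsewhere in this guide. The helper name `tables_from_element` is hypothetical, and the library's own table dicts may carry additional keys.

```python
import xml.etree.ElementTree as ET


def tables_from_element(element, min_rows=1):
    """Collect every <table> under `element` as a dict with 'headers' and 'rows'."""
    tables = []
    for table in element.iter("table"):
        # Header cells: the text of every <th>, in document order.
        headers = ["".join(th.itertext()).strip() for th in table.iter("th")]
        rows = []
        for tr in table.iter("tr"):
            cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
            if cells:  # Rows without <td> cells (header rows) are skipped.
                rows.append(cells)
        if len(rows) >= min_rows:
            tables.append({"headers": headers, "rows": rows})
    return tables


sample = """
<div>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>19.50</td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(sample)
extracted = tables_from_element(root)
print(extracted)
```

A custom strategy could call a helper like this from `extract_tables(self, element, **kwargs)` and return its result.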

Migration Scenarios

Scenario 1: Basic Usage (No Changes Needed)

Before (v0.7.2):

python

config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])

After (v0.7.3):

python

# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])

Scenario 2: Custom Threshold (No Changes Needed)

Before (v0.7.2):

python

config = CrawlerRunConfig(
    table_score_threshold=5
)

After (v0.7.3):

python

# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)

Scenario 3: Advanced Filtering (New Feature)

Before (v0.7.2):

python

# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables 
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]

After (v0.7.3):

python

# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables already filtered
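The two approaches in this scenario apply the same size rule at different points. The sketch below writes that rule as a plain function over table dicts so it can be tested in isolation. `filter_tables` is a hypothetical helper, and counting columns by the number of headers mirrors the manual filter above; how the library itself counts columns is an assumption.

```python
def filter_tables(tables, min_rows=0, min_cols=0):
    """Keep tables with at least `min_rows` rows and `min_cols` header columns."""
    return [
        t for t in tables
        if len(t["rows"]) >= min_rows and len(t["headers"]) >= min_cols
    ]


tables = [
    {"headers": ["a", "b", "c"], "rows": [[1, 2, 3]] * 6},  # large enough
    {"headers": ["a", "b"], "rows": [[1, 2]] * 8},          # too few columns
    {"headers": ["a", "b", "c"], "rows": [[1, 2, 3]] * 2},  # too few rows
]

kept = filter_tables(tables, min_rows=5, min_cols=3)
print(len(kept))
```

With both limits left at 0 the function returns every table, which matches the "no filtering" settings shown in the troubleshooting section.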

Code Organization Changes

Module Structure

Before (v0.7.2):

crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()      # Table detection
      - extract_table_data() # Table extraction

After (v0.7.3):

crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      # Table methods removed, uses strategy
  
  table_extraction.py (NEW)
    - TableExtractionStrategy    # Base class
    - DefaultTableExtraction      # Moved logic here
    - NoTableExtraction          # New option

Import Changes

New imports available (optional):

python

# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction
)

Performance Implications

No Performance Impact

For existing code, performance remains identical:

Same extraction logic
Same scoring algorithm
Same processing time

Performance Improvements Available

New options for better performance:

python

# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)

Testing Your Migration

Verification Script

Run this to verify your extraction still works:

python

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def verify_extraction():
    url = "your_url_here"  # Replace with a page that contains tables
    
    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)
        
        # Test 2: New explicit approach
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)
        
        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")
        
        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']
        
        print("✓ Table content identical")

asyncio.run(verify_extraction())
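The assertions in the script stop at the first mismatch. When debugging, it can help to collect every difference instead. The following is a minimal sketch that works on plain lists of table dicts; `diff_tables` is a hypothetical helper and compares only the `headers` and `rows` keys used in this guide.

```python
def diff_tables(old_tables, new_tables):
    """Return human-readable descriptions of every difference found."""
    problems = []
    if len(old_tables) != len(new_tables):
        problems.append(
            f"table count differs: {len(old_tables)} vs {len(new_tables)}"
        )
    # zip stops at the shorter list, so extra tables are reported only by count.
    for i, (old, new) in enumerate(zip(old_tables, new_tables)):
        for key in ("headers", "rows"):
            if old.get(key) != new.get(key):
                problems.append(f"table {i}: '{key}' differs")
    return problems


old = [{"headers": ["a", "b"], "rows": [["1", "2"]]}]
same = [{"headers": ["a", "b"], "rows": [["1", "2"]]}]
changed = [{"headers": ["a", "b"], "rows": [["1", "3"]]}]

print(diff_tables(old, same))
print(diff_tables(old, changed))
```

In the verification script, the two arguments would be `result_old.tables` and `result_new.tables`.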

Deprecation Notes

No Deprecations

All existing parameters continue to work
table_score_threshold in CrawlerRunConfig is still supported
No breaking changes

Internal Changes (Transparent to Users)

LXMLWebScrapingStrategy.is_data_table() - Moved to DefaultTableExtraction
LXMLWebScrapingStrategy.extract_table_data() - Moved to DefaultTableExtraction

These methods were internal and not part of the public API.

Benefits of Upgrading

While not required, using the new pattern provides:

Better Control: Filter tables during extraction, not after
Performance Options: Skip extraction when not needed
Extensibility: Create custom extractors for specific needs
Consistency: Same pattern as other Crawl4AI strategies
Future-Proof: Ready for upcoming advanced strategies

Troubleshooting

Issue: Different Number of Tables

Cause: Threshold or filtering differences

Solution:

python

# Ensure same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,               # No filtering (default)
    min_cols=0                # No filtering (default)
)

Issue: Import Errors

Cause: Using the new classes without importing them

Solution:

python

# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy
)

Issue: Custom Strategy Not Working

Cause: Incorrect method signature

Solution:

python

class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        tables_list = []  # Populate with one dict per extracted table
        return tables_list
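One way to catch this kind of mistake before running a crawl is to inspect the method signature. The helper below is a sketch that uses only the standard library. `has_expected_signature` and the three demo classes are hypothetical names, and the check assumes the parameter is literally named `element`, as in the signature above.

```python
import inspect


def has_expected_signature(cls):
    """Check that cls.extract_tables accepts (self, element, **kwargs)."""
    method = getattr(cls, "extract_tables", None)
    if method is None:
        return False  # The method is missing or has a different name.
    params = list(inspect.signature(method).parameters.values())
    names = [p.name for p in params]
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)
    return names[:2] == ["self", "element"] and accepts_kwargs


class GoodExtractor:
    def extract_tables(self, element, **kwargs):
        return []


class WrongArguments:
    def extract_tables(self, html):
        return []


class WrongMethodName:
    def extract(self, element):
        return []


for cls in (GoodExtractor, WrongArguments, WrongMethodName):
    print(cls.__name__, has_expected_signature(cls))
```

Only `GoodExtractor` passes; the other two classes reproduce the incorrect forms listed in the comments above.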

Getting Help

If you encounter issues:

Check your table_score_threshold matches previous settings
Verify imports if using new classes
Enable verbose logging: DefaultTableExtraction(verbose=True)
Review the Table Extraction Documentation
Check the examples

Summary

✅ Full backward compatibility - No code changes required
✅ Same results - Identical extraction behavior by default
✅ New options - Additional control when needed
✅ Better architecture - Consistent with Crawl4AI patterns
✅ Ready for future - Foundation for advanced strategies

Migrating to v0.7.3 requires no code changes, and the new capabilities are available whenever you need them.