Back to Crawl4ai

πŸš€ Crawl4AI v0.7.0: The Adaptive Intelligence Update

docs/blog/release-v0.7.0.md

0.8.613.7 KB
Original Source

πŸš€ Crawl4AI v0.7.0: The Adaptive Intelligence Update

January 28, 2025 β€’ 10 min read


Today I'm releasing Crawl4AI v0.7.0β€”the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.

🎯 What's New at a Glance

  • Adaptive Crawling: Your crawler now learns and adapts to website patterns
  • Virtual Scroll Support: Complete content extraction from infinite scroll pages
  • Link Preview with Intelligent Scoring: Intelligent link analysis and prioritization
  • Async URL Seeder: Discover thousands of URLs in seconds with intelligent filtering
  • Performance Optimizations: Significant speed and memory improvements

🧠 Adaptive Crawling: Intelligence Through Pattern Learning

The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.

Technical Deep-Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:

  • Pattern success rates
  • Selector stability over time
  • Content structure variations
  • Extraction confidence scores
python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio

async def main():
    
    # Configure adaptive crawler
    config = AdaptiveConfig(
        strategy="statistical",  # or "embedding" for semantic understanding
        max_pages=10,
        confidence_threshold=0.7,  # Stop at 70% confidence
        top_k_links=3,  # Follow top 3 links per page
        min_gain_threshold=0.05  # Need 5% information gain to continue
    )
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        
        print("Starting adaptive crawl about Python decorators...")
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/glossary.html",
            query="python decorators functions wrapping"
        )
        
        print(f"\nβœ… Crawling Complete!")
        print(f"β€’ Confidence Level: {adaptive.confidence:.0%}")
        print(f"β€’ Pages Crawled: {len(result.crawled_urls)}")
        print(f"β€’ Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
        
        # Get most relevant content
        relevant = adaptive.get_relevant_content(top_k=3)
        print(f"\nMost Relevant Pages:")
        for i, page in enumerate(relevant, 1):
            print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")

asyncio.run(main())

Expected Real-World Impact:

  • News Aggregation: Maintain 95%+ extraction accuracy even as news sites update their templates
  • E-commerce Monitoring: Track product changes across hundreds of stores without constant maintenance
  • Research Data Collection: Build robust academic datasets that survive website redesigns
  • Reduced Maintenance: Cut selector update time by 80% for frequently-changing sites

🌊 Virtual Scroll: Complete Content Capture

The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.

My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

Implementation Details

python
from crawl4ai import VirtualScrollConfig

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,                    # Number of scrolls
    scroll_by="container_height",       # Smart scrolling by container size
    wait_after_scroll=1.0              # Let content load
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                     # Fixed pixel scrolling
    wait_after_scroll=1.5              # Images need time
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",           # Viewport-based scrolling
    wait_after_scroll=0.5              # Wait for content to load
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )
    
    print(f"Captured {len(result.extracted_content['tweets'])} tweets")

Key Capabilities:

  • DOM Recycling Awareness: Detects and handles virtual DOM element recycling
  • Smart Scroll Physics: Three modes - container height, page height, or fixed pixels
  • Content Preservation: Captures content before it's destroyed
  • Intelligent Stopping: Stops when no new content appears
  • Memory Efficient: Streams content instead of holding everything in memory

Expected Real-World Impact:

  • Social Media Analysis: Capture entire Twitter threads with hundreds of replies, not just top 10
  • E-commerce Scraping: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
  • News Aggregation: Get all articles from modern news sites, not just above-the-fold content
  • Research Applications: Complete data extraction from academic databases using virtual pagination

The Problem: You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

My Solution: I implemented a three-layer scoring system that analyzes links like a human wouldβ€”considering their position, context, and relevance to your goals.

python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig

async def main():
    # Configure intelligent link analysis
    link_config = LinkPreviewConfig(
        include_internal=True,
        include_external=False,
        max_links=10,
        concurrency=5,
        query="python tutorial",  # For contextual scoring
        score_threshold=0.3,
        verbose=True
    )
    # Use in your crawl
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.geeksforgeeks.org/",
            config=CrawlerRunConfig(
                link_preview_config=link_config,
                score_links=True,  # Enable intrinsic scoring
                cache_mode=CacheMode.BYPASS
            )
        )

        # Access scored and sorted links
        if result.success and result.links:
            for link in result.links.get("internal", []):
                text = link.get('text', 'No text')[:40]
                print(
                    text,
                    f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
                    f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
                    f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
                )

asyncio.run(main())

Scoring Components:

  1. Intrinsic Score: Based on link quality indicators

    • Position on page (navigation, content, footer)
    • Link attributes (rel, title, class names)
    • Anchor text quality and length
    • URL structure and depth
  2. Contextual Score: Relevance to your query using BM25 algorithm

    • Keyword matching in link text and title
    • Meta description analysis
    • Content preview scoring
  3. Total Score: Combined score for final ranking

Expected Real-World Impact:

  • Research Efficiency: Find relevant papers 10x faster by following only high-score links
  • Competitive Analysis: Automatically identify important pages on competitor sites
  • Content Discovery: Build topic-focused crawlers that stay on track
  • SEO Audits: Identify and prioritize high-value internal linking opportunities

🎣 Async URL Seeder: Automated URL Discovery at Scale

The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.

My Solution: I built Async URL Seederβ€”a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.

Technical Architecture

python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    async with AsyncUrlSeeder() as seeder:
        # Discover Python tutorial URLs
        config = SeedingConfig(
            source="sitemap",  # Use sitemap
            pattern="*python*",  # URL pattern filter
            extract_head=True,  # Get metadata
            query="python tutorial",  # For relevance scoring
            scoring_method="bm25",
            score_threshold=0.2,
            max_urls=10
        )
        
        print("Discovering Python async tutorial URLs...")
        urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
        
        print(f"\nβœ… Found {len(urls)} relevant URLs:")
        for i, url_info in enumerate(urls[:5], 1):
            print(f"\n{i}. {url_info['url']}")
            if url_info.get('relevance_score'):
                print(f"   Relevance: {url_info['relevance_score']:.3f}")
            if url_info.get('head_data', {}).get('title'):
                print(f"   Title: {url_info['head_data']['title'][:60]}...")

asyncio.run(main())

Discovery Methods:

  • Sitemap Mining: Parses robots.txt and all linked sitemaps
  • Common Crawl: Queries the Common Crawl index for historical URLs
  • Intelligent Crawling: Follows links with smart depth control
  • Pattern Analysis: Learns URL structures and generates variations

Expected Real-World Impact:

  • Migration Projects: Discover 10,000+ URLs from legacy sites in under 60 seconds
  • Market Research: Map entire competitor ecosystems automatically
  • Academic Research: Build comprehensive datasets without manual URL collection
  • SEO Audits: Find every indexable page with content scoring
  • Content Archival: Ensure no content is left behind during site migrations

⚑ Performance Optimizations

This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.

What We Optimized

python
# Optimized crawling with v0.7.0 improvements
results = []
for url in urls:
    result = await crawler.arun(
        url,
        config=CrawlerRunConfig(
            # Performance optimizations
            wait_until="domcontentloaded",  # Faster than networkidle
            cache_mode=CacheMode.ENABLED    # Enable caching
        )
    )
    results.append(result)

Performance Gains:

  • Startup Time: 70% faster browser initialization
  • Page Loading: 40% reduction with smart resource blocking
  • Extraction: 3x faster with compiled CSS selectors
  • Memory Usage: 60% reduction with streaming processing
  • Concurrent Crawls: Handle 5x more parallel requests

πŸ”§ Important Changes

Breaking Changes

  • link_extractor renamed to link_preview (better reflects functionality)
  • Minimum Python version now 3.9
  • CrawlerConfig split into CrawlerRunConfig and BrowserConfig

Migration Guide

python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

πŸ€– Coming Soon: Intelligent Web Automation

I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:

  • Crawl Agents: Autonomous crawlers that understand your goals and adapt their strategies
  • Auto JS Generation: Automatic JavaScript code generation for complex interactions
  • Smart Form Handling: Intelligent form detection and filling
  • Context-Aware Actions: Crawlers that understand page context and make decisions

These features are under active development and will revolutionize how we approach web automation. Stay tuned!

πŸš€ Get Started

bash
pip install crawl4ai==0.7.0

Check out the updated documentation.

Questions? Issues? I'm always listening:

Happy crawling! πŸ•·οΈ


P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.