šŸš€ Crawl4AI v0.7.5 - Complete Feature Walkthrough

docs/releases_review/v0.7.5_video_walkthrough.ipynb

Welcome to Crawl4AI v0.7.5! This notebook demonstrates all the new features introduced in this release.

šŸ“‹ What's New in v0.7.5

  1. šŸ”§ Docker Hooks System - NEW! Complete pipeline customization with user-provided Python functions
  2. šŸ¤– Enhanced LLM Integration - Custom providers with temperature control
  3. šŸ”’ HTTPS Preservation - Secure internal link handling
  4. šŸ› ļø Multiple Bug Fixes - Community-reported issues resolved

šŸ“¦ Setup and Installation

First, let's make sure we have the latest version installed:

python
# # Install or upgrade to v0.7.5
# !pip install -U crawl4ai==0.7.5 --quiet

# Import required modules
import asyncio
import nest_asyncio
nest_asyncio.apply()  # For Jupyter compatibility

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
from crawl4ai import hooks_to_string

print("āœ… Crawl4AI v0.7.5 ready!")

šŸ”§ Feature 1: Docker Hooks System (NEW!)

What is it?

v0.7.5 introduces a completely new Docker Hooks System that lets you inject custom Python functions at 8 key points in the crawling pipeline. This gives you full control over:

  • Authentication setup
  • Performance optimization
  • Content processing
  • Custom behavior at each stage

Three Ways to Use Docker Hooks

The Docker Hooks System offers three approaches, all part of this new feature:

  1. String-based hooks - Write hooks as strings for REST API
  2. Using hooks_to_string() utility - Convert Python functions to strings
  3. Docker Client auto-conversion - Pass functions directly (most convenient)

All three approaches are NEW in v0.7.5!
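
As a sketch of the first approach, a string-based hook is just the source code of an async function, keyed by its hook-point name, that you embed in a REST request payload. The surrounding payload structure is illustrative here, not the exact API shape:

```python
# Approach 1 (sketch): a hook written as a plain string, keyed by its
# hook-point name. The dict shape shown is illustrative; consult the
# Docker API reference for the exact request payload.
string_hooks = {
    "before_goto": """
async def before_goto(page, context, url, **kwargs):
    # Runs before navigation: add headers, cookies, auth tokens, etc.
    await page.set_extra_http_headers({"X-Demo": "string-hook"})
    return page
""",
}

print(list(string_hooks))  # ['before_goto']
```

This trades IDE support for simplicity; the `hooks_to_string()` utility shown later lets you keep real Python functions and convert them on demand.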


šŸ”’ Feature 2: HTTPS Preservation for Internal Links

Problem

When crawling HTTPS sites, internal links sometimes get downgraded to HTTP, breaking authentication and causing security warnings.

Solution

The new preserve_https_for_internal_links=True parameter maintains HTTPS protocol for all internal links.

python
async def demo_https_preservation():
    """
    Demonstrate HTTPS preservation with deep crawling
    """
    print("šŸ”’ Testing HTTPS Preservation\n")
    print("=" * 60)
    
    # Setup URL filter for quotes.toscrape.com
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
    )
    
    # Configure crawler with HTTPS preservation
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # šŸ†• NEW in v0.7.5
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )
    
    async with AsyncWebCrawler() as crawler:
        # With deep_crawl_strategy, arun() returns a list of CrawlResult objects
        results = await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        )
        
        # Analyze the first result
        if results and len(results) > 0:
            first_result = results[0]
            internal_links = [link['href'] for link in first_result.links['internal']]
            
            # Check HTTPS preservation
            https_links = [link for link in internal_links if link.startswith('https://')]
            http_links = [link for link in internal_links if link.startswith('http://')]
            
            print(f"\nšŸ“Š Results:")
            print(f"  Pages crawled: {len(results)}")
            print(f"  Total internal links (from first page): {len(internal_links)}")
            print(f"  HTTPS links: {len(https_links)} āœ…")
            print(f"  HTTP links: {len(http_links)} {'āš ļø' if http_links else ''}")
            if internal_links:
                print(f"  HTTPS preservation rate: {len(https_links)/len(internal_links)*100:.1f}%")
            
            print(f"\nšŸ”— Sample HTTPS-preserved links:")
            for link in https_links[:5]:
                print(f"  → {link}")
        else:
            print(f"\nāš ļø No results returned")
    
    print("\n" + "=" * 60)
    print("āœ… HTTPS Preservation Demo Complete!\n")

# Run the demo
await demo_https_preservation()

šŸ¤– Feature 3: Enhanced LLM Integration

What's New

  • Custom temperature parameter for creativity control
  • base_url for custom API endpoints
  • Better multi-provider support

Example with Custom Temperature

python
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field
import os

# Define extraction schema
class Article(BaseModel):
    title: str = Field(description="Article title")
    summary: str = Field(description="Brief summary of the article")
    main_topics: list[str] = Field(description="List of main topics covered")

async def demo_enhanced_llm():
    """
    Demonstrate enhanced LLM integration with custom temperature
    """
    print("šŸ¤– Testing Enhanced LLM Integration\n")
    print("=" * 60)
    
    # Check for API key
    api_key = os.getenv('OPENAI_API_KEY')
    if not api_key:
        print("āš ļø Note: Set OPENAI_API_KEY environment variable to test LLM extraction")
        print("For this demo, we'll show the configuration only.\n")
        
        print("šŸ“ Example LLM Configuration with new v0.7.5 features:")
        print("""
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="your-api-key",
        temperature=0.7,  # šŸ†• NEW: Control creativity (0.0-2.0)
        base_url="custom-endpoint"  # šŸ†• NEW: Custom API endpoint
    ),
    schema=Article.schema(),
    extraction_type="schema",
    instruction="Extract article information"
)
        """)
        return
    
    # Create LLM extraction strategy with custom temperature
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key,
            temperature=0.3,  # šŸ†• Lower temperature for more focused extraction
        ),
        schema=Article.schema(),
        extraction_type="schema",
        instruction="Extract the article title, a brief summary, and main topics discussed."
    )
    
    config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Artificial_intelligence",
            config=config
        )
        
        if result.success:
            print("\nāœ… LLM Extraction Successful!")
            print(f"\nšŸ“„ Extracted Content:")
            print(result.extracted_content)
        else:
            print(f"\nāŒ Extraction failed: {result.error_message}")
    
    print("\n" + "=" * 60)
    print("āœ… Enhanced LLM Demo Complete!\n")

# Run the demo
await demo_enhanced_llm()

Creating Reusable Hook Functions

First, let's create some hook functions that we can reuse:

python
# Define reusable hooks as Python functions

async def block_images_hook(page, context, **kwargs):
    """
    Performance optimization: Block images to speed up crawling
    """
    print("[Hook] Blocking images for faster loading...")
    await context.route(
        "**/*.{png,jpg,jpeg,gif,webp,svg,ico}",
        lambda route: route.abort()
    )
    return page

async def set_viewport_hook(page, context, **kwargs):
    """
    Set consistent viewport size for rendering
    """
    print("[Hook] Setting viewport to 1920x1080...")
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def add_custom_headers_hook(page, context, url, **kwargs):
    """
    Add custom headers before navigation
    """
    print(f"[Hook] Adding custom headers for {url}...")
    await page.set_extra_http_headers({
        'X-Crawl4AI-Version': '0.7.5',
        'X-Custom-Header': 'docker-hooks-demo',
        'Accept-Language': 'en-US,en;q=0.9'
    })
    return page

async def scroll_page_hook(page, context, **kwargs):
    """
    Scroll page to load lazy-loaded content
    """
    print("[Hook] Scrolling page to load lazy content...")
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    await page.evaluate("window.scrollTo(0, 0)")
    await page.wait_for_timeout(500)
    return page

async def log_page_metrics_hook(page, context, **kwargs):
    """
    Log page metrics before extracting HTML
    """
    metrics = await page.evaluate('''
        () => ({
            images: document.images.length,
            links: document.links.length,
            scripts: document.scripts.length,
            title: document.title
        })
    ''')
    print(f"[Hook] Page Metrics - Title: {metrics['title']}")
    print(f"        Images: {metrics['images']}, Links: {metrics['links']}, Scripts: {metrics['scripts']}")
    return page

print("āœ… Reusable hook library created!")
print("\nšŸ“š Available hooks:")
print("  • block_images_hook - Speed optimization")
print("  • set_viewport_hook - Consistent rendering")
print("  • add_custom_headers_hook - Custom headers")
print("  • scroll_page_hook - Lazy content loading")
print("  • log_page_metrics_hook - Page analytics")

Using hooks_to_string() Utility

The new hooks_to_string() utility converts Python function objects to strings that can be sent to the Docker API:

python
# Convert functions to strings using the NEW utility
hooks_as_strings = hooks_to_string({
    "on_page_context_created": block_images_hook,
    "before_goto": add_custom_headers_hook,
    "before_retrieve_html": scroll_page_hook,
})

print("āœ… Converted 3 hook functions to string format")
print("\nšŸ“ Example of converted hook (first 200 chars):")
print(hooks_as_strings["on_page_context_created"][:200] + "...")

print("\nšŸ’” Benefits of hooks_to_string():")
print("  āœ“ Write hooks as Python functions (IDE support, type checking)")
print("  āœ“ Automatically converts to string format for Docker API")
print("  āœ“ Reusable across projects")
print("  āœ“ Easy to test and debug")

8 Available Hook Points

The Docker Hooks System provides 8 strategic points where you can inject custom behavior:

  1. on_browser_created - Browser initialization
  2. on_page_context_created - Page context setup
  3. on_user_agent_updated - User agent configuration
  4. before_goto - Pre-navigation setup
  5. after_goto - Post-navigation processing
  6. on_execution_started - JavaScript execution start
  7. before_retrieve_html - Pre-extraction processing
  8. before_return_html - Final HTML processing

Complete Docker Hooks Demo

Note: For a complete demonstration of all Docker Hooks approaches including:

  • String-based hooks with REST API
  • hooks_to_string() utility usage
  • Docker Client with automatic conversion
  • Complete pipeline with all 8 hook points

See the separate file: v0.7.5_docker_hooks_demo.py

This standalone Python script provides comprehensive, runnable examples of the entire Docker Hooks System.


šŸ› ļø Feature 4: Bug Fixes Summary

Major Fixes in v0.7.5

  1. URL Processing - Fixed '+' sign preservation in query parameters
  2. Proxy Configuration - Enhanced proxy string parsing (old parameter deprecated)
  3. Docker Error Handling - Better error messages with status codes
  4. Memory Management - Fixed leaks in long-running sessions
  5. JWT Authentication - Fixed Docker JWT validation
  6. Playwright Stealth - Fixed stealth features
  7. API Configuration - Fixed config handling
  8. Deep Crawl Strategy - Resolved JSON encoding errors
  9. LLM Provider Support - Fixed custom provider integration
  10. Performance - Resolved backoff strategy failures

New Proxy Configuration Example

python
# OLD WAY (Deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# NEW WAY (v0.7.5)
browser_config_with_proxy = BrowserConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "optional-username",  # Optional
        "password": "optional-password"   # Optional
    }
)

print("āœ… New proxy configuration format demonstrated")
print("\nšŸ“ Benefits:")
print("  • More explicit and clear")
print("  • Better authentication support")
print("  • Consistent with industry standards")

šŸŽÆ Complete Example: Combining Multiple Features

Let's create a real-world example that uses multiple v0.7.5 features together:

python
async def complete_demo():
    """
    Comprehensive demo combining multiple v0.7.5 features
    """
    print("šŸŽÆ Complete v0.7.5 Feature Demo\n")
    print("=" * 60)
    
    # Use function-based hooks (NEW Docker Hooks System)
    print("\n1ļøāƒ£ Using Docker Hooks System (NEW!)")
    hooks = {
        "on_page_context_created": set_viewport_hook,
        "before_goto": add_custom_headers_hook,
        "before_retrieve_html": log_page_metrics_hook
    }
    
    # Convert to strings using the NEW utility
    hooks_strings = hooks_to_string(hooks)
    print(f"   āœ“ Converted {len(hooks_strings)} hooks to string format")
    print("   āœ“ Ready to send to Docker API")
    
    # Use HTTPS preservation
    print("\n2ļøāƒ£ Enabling HTTPS Preservation")
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?example\.com(\/.*)?$"]
    )
    
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # v0.7.5 feature
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=3,
            filter_chain=FilterChain([url_filter])
        )
    )
    print("   āœ“ HTTPS preservation enabled")
    
    # Use new proxy config format
    print("\n3ļøāƒ£ Using New Proxy Configuration Format")
    browser_config = BrowserConfig(
        headless=True,
        # proxy_config={  # Uncomment if you have a proxy
        #     "server": "http://proxy:8080"
        # }
    )
    print("   āœ“ New proxy config format ready")
    
    # Run the crawl
    print("\n4ļøāƒ£ Executing Crawl with All Features")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # With deep_crawl_strategy, returns a list
        results = await crawler.arun(
            url="https://example.com",
            config=config
        )
        
        if results and len(results) > 0:
            result = results[0]  # Get first result
            print("   āœ“ Crawl successful!")
            print(f"\nšŸ“Š Results:")
            print(f"   • Pages crawled: {len(results)}")
            print(f"   • Title: {result.metadata.get('title', 'N/A')}")
            print(f"   • Content length: {len(result.markdown.raw_markdown)} characters")
            print(f"   • Links found: {len(result.links['internal']) + len(result.links['external'])}")
        else:
            print(f"   āš ļø No results returned")
    
    print("\n" + "=" * 60)
    print("āœ… Complete Feature Demo Finished!\n")

# Run complete demo
await complete_demo()

šŸŽ“ Summary

What We Covered

āœ… HTTPS Preservation - Maintain secure protocols throughout crawling
āœ… Enhanced LLM Integration - Custom temperature and provider configuration
āœ… Docker Hooks System (NEW!) - Complete pipeline customization with 3 approaches
āœ… hooks_to_string() Utility (NEW!) - Convert functions for Docker API
āœ… Bug Fixes - New proxy config and multiple improvements

Key Highlight: Docker Hooks System 🌟

The Docker Hooks System is completely NEW in v0.7.5. It offers:

  • 8 strategic hook points in the pipeline
  • 3 ways to use hooks (strings, utility, auto-conversion)
  • Full control over crawling behavior
  • Support for authentication, optimization, and custom processing

Next Steps

  1. Docker Hooks Demo - See v0.7.5_docker_hooks_demo.py for complete Docker Hooks examples
  2. Documentation - Visit docs.crawl4ai.com for full reference
  3. Examples - Check GitHub examples
  4. Community - Join Discord for support


Happy Crawling with v0.7.5! šŸš€