šŸš€ Crawl4AI v0.7.5 - Complete Feature Walkthrough

docs/releases_review/v0.7.5_video_walkthrough.ipynb

Welcome to Crawl4AI v0.7.5! This notebook demonstrates all the new features introduced in this release.

šŸ“‹ What's New in v0.7.5

  1. šŸ”§ Docker Hooks System - NEW! Complete pipeline customization with user-provided Python functions
  2. šŸ¤– Enhanced LLM Integration - Custom providers with temperature control
  3. šŸ”’ HTTPS Preservation - Secure internal link handling
  4. šŸ› ļø Multiple Bug Fixes - Community-reported issues resolved

šŸ“¦ Setup and Installation

First, let's make sure we have the latest version installed:

python
# # Install or upgrade to v0.7.5
# !pip install -U crawl4ai==0.7.5 --quiet

# Import required modules
import asyncio
import nest_asyncio
nest_asyncio.apply()  # For Jupyter compatibility

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
from crawl4ai import hooks_to_string

print("āœ… Crawl4AI v0.7.5 ready!")

šŸ”§ Feature 1: Docker Hooks System (NEW!)

What is it?

v0.7.5 introduces a completely new Docker Hooks System that lets you inject custom Python functions at 8 key points in the crawling pipeline. This gives you full control over:

  • Authentication setup
  • Performance optimization
  • Content processing
  • Custom behavior at each stage

Three Ways to Use Docker Hooks

The Docker Hooks System offers three approaches, all part of this new feature:

  1. String-based hooks - Write hooks as strings for REST API
  2. Using hooks_to_string() utility - Convert Python functions to strings
  3. Docker Client auto-conversion - Pass functions directly (most convenient)

All three approaches are NEW in v0.7.5!
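
As a sketch of the first approach, a string-based hook is just the source code of an async function, keyed by its hook-point name, that you embed in a REST request payload. The surrounding payload structure is illustrative here, not the exact API shape:

```python
# Approach 1 (sketch): a hook written as a plain string, keyed by its
# hook-point name. The dict shape shown is illustrative; consult the
# Docker API reference for the exact request payload.
string_hooks = {
    "before_goto": """
async def before_goto(page, context, url, **kwargs):
    # Runs before navigation: add headers, cookies, auth tokens, etc.
    await page.set_extra_http_headers({"X-Demo": "string-hook"})
    return page
""",
}

print(list(string_hooks))  # ['before_goto']
```

This trades IDE support for simplicity; the `hooks_to_string()` utility shown later lets you keep real Python functions and convert them on demand.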


šŸ”’ Feature 2: HTTPS Preservation for Internal Links

Problem

When crawling HTTPS sites, internal links sometimes get downgraded to HTTP, breaking authentication and causing security warnings.

Solution

The new preserve_https_for_internal_links=True parameter maintains HTTPS protocol for all internal links.

python
async def demo_https_preservation():
    """
    Demonstrate HTTPS preservation with deep crawling
    """
    print("šŸ”’ Testing HTTPS Preservation\n")
    print("=" * 60)
    
    # Setup URL filter for quotes.toscrape.com
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
    )
    
    # Configure crawler with HTTPS preservation
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # šŸ†• NEW in v0.7.5
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )
    
    async with AsyncWebCrawler() as crawler:
        # With deep_crawl_strategy, arun() returns a list of CrawlResult objects
        results = await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        )
        
        # Analyze the first result
        if results and len(results) > 0:
            first_result = results[0]
            internal_links = [link['href'] for link in first_result.links['internal']]
            
            # Check HTTPS preservation
            https_links = [link for link in internal_links if link.startswith('https://')]
            http_links = [link for link in internal_links if link.startswith('http://')]
            
            print(f"\nšŸ“Š Results:")
            print(f"  Pages crawled: {len(results)}")
            print(f"  Total internal links (from first page): {len(internal_links)}")
            print(f"  HTTPS links: {len(https_links)} āœ…")
            print(f"  HTTP links: {len(http_links)} {'āš ļø' if http_links else ''}")
            if internal_links:
                print(f"  HTTPS preservation rate: {len(https_links)/len(internal_links)*100:.1f}%")
            
            print(f"\nšŸ”— Sample HTTPS-preserved links:")
            for link in https_links[:5]:
                print(f"  → {link}")
        else:
            print(f"\nāš ļø No results returned")
    
    print("\n" + "=" * 60)
    print("āœ… HTTPS Preservation Demo Complete!\n")

# Run the demo
await demo_https_preservation()

šŸ¤– Feature 3: Enhanced LLM Integration

What's New

  • Custom temperature parameter for creativity control
  • base_url for custom API endpoints
  • Better multi-provider support

Example with Custom Temperature

python
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field
import os

# Define extraction schema
class Article(BaseModel):
    title: str = Field(description="Article title")
    summary: str = Field(description="Brief summary of the article")
    main_topics: list[str] = Field(description="List of main topics covered")

async def demo_enhanced_llm():
    """
    Demonstrate enhanced LLM integration with custom temperature
    """
    print("šŸ¤– Testing Enhanced LLM Integration\n")
    print("=" * 60)
    
    # Check for API key
    api_key = os.getenv('OPENAI_API_KEY')
    if not api_key:
        print("āš ļø Note: Set OPENAI_API_KEY environment variable to test LLM extraction")
        print("For this demo, we'll show the configuration only.\n")
        
        print("šŸ“ Example LLM Configuration with new v0.7.5 features:")
        print("""
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="your-api-key",
        temperature=0.7,  # šŸ†• NEW: Control creativity (0.0-2.0)
        base_url="custom-endpoint"  # šŸ†• NEW: Custom API endpoint
    ),
    schema=Article.schema(),
    extraction_type="schema",
    instruction="Extract article information"
)
        """)
        return
    
    # Create LLM extraction strategy with custom temperature
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key,
            temperature=0.3,  # šŸ†• Lower temperature for more focused extraction
        ),
        schema=Article.schema(),
        extraction_type="schema",
        instruction="Extract the article title, a brief summary, and main topics discussed."
    )
    
    config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Artificial_intelligence",
            config=config
        )
        
        if result.success:
            print("\nāœ… LLM Extraction Successful!")
            print(f"\nšŸ“„ Extracted Content:")
            print(result.extracted_content)
        else:
            print(f"\nāŒ Extraction failed: {result.error_message}")
    
    print("\n" + "=" * 60)
    print("āœ… Enhanced LLM Demo Complete!\n")

# Run the demo
await demo_enhanced_llm()

Creating Reusable Hook Functions

First, let's create some hook functions that we can reuse:

python
# Define reusable hooks as Python functions

async def block_images_hook(page, context, **kwargs):
    """
    Performance optimization: Block images to speed up crawling
    """
    print("[Hook] Blocking images for faster loading...")
    await context.route(
        "**/*.{png,jpg,jpeg,gif,webp,svg,ico}",
        lambda route: route.abort()
    )
    return page

async def set_viewport_hook(page, context, **kwargs):
    """
    Set consistent viewport size for rendering
    """
    print("[Hook] Setting viewport to 1920x1080...")
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def add_custom_headers_hook(page, context, url, **kwargs):
    """
    Add custom headers before navigation
    """
    print(f"[Hook] Adding custom headers for {url}...")
    await page.set_extra_http_headers({
        'X-Crawl4AI-Version': '0.7.5',
        'X-Custom-Header': 'docker-hooks-demo',
        'Accept-Language': 'en-US,en;q=0.9'
    })
    return page

async def scroll_page_hook(page, context, **kwargs):
    """
    Scroll page to load lazy-loaded content
    """
    print("[Hook] Scrolling page to load lazy content...")
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    await page.evaluate("window.scrollTo(0, 0)")
    await page.wait_for_timeout(500)
    return page

async def log_page_metrics_hook(page, context, **kwargs):
    """
    Log page metrics before extracting HTML
    """
    metrics = await page.evaluate('''
        () => ({
            images: document.images.length,
            links: document.links.length,
            scripts: document.scripts.length,
            title: document.title
        })
    ''')
    print(f"[Hook] Page Metrics - Title: {metrics['title']}")
    print(f"        Images: {metrics['images']}, Links: {metrics['links']}, Scripts: {metrics['scripts']}")
    return page

print("āœ… Reusable hook library created!")
print("\nšŸ“š Available hooks:")
print("  • block_images_hook - Speed optimization")
print("  • set_viewport_hook - Consistent rendering")
print("  • add_custom_headers_hook - Custom headers")
print("  • scroll_page_hook - Lazy content loading")
print("  • log_page_metrics_hook - Page analytics")

Using hooks_to_string() Utility

The new hooks_to_string() utility converts Python function objects to strings that can be sent to the Docker API:

python
# Convert functions to strings using the NEW utility
hooks_as_strings = hooks_to_string({
    "on_page_context_created": block_images_hook,
    "before_goto": add_custom_headers_hook,
    "before_retrieve_html": scroll_page_hook,
})

print("āœ… Converted 3 hook functions to string format")
print("\nšŸ“ Example of converted hook (first 200 chars):")
print(hooks_as_strings["on_page_context_created"][:200] + "...")

print("\nšŸ’” Benefits of hooks_to_string():")
print("  āœ“ Write hooks as Python functions (IDE support, type checking)")
print("  āœ“ Automatically converts to string format for Docker API")
print("  āœ“ Reusable across projects")
print("  āœ“ Easy to test and debug")

8 Available Hook Points

The Docker Hooks System provides 8 strategic points where you can inject custom behavior:

  1. on_browser_created - Browser initialization
  2. on_page_context_created - Page context setup
  3. on_user_agent_updated - User agent configuration
  4. before_goto - Pre-navigation setup
  5. after_goto - Post-navigation processing
  6. on_execution_started - JavaScript execution start
  7. before_retrieve_html - Pre-extraction processing
  8. before_return_html - Final HTML processing

Complete Docker Hooks Demo

Note: For a complete demonstration of all Docker Hooks approaches including:

  • String-based hooks with REST API
  • hooks_to_string() utility usage
  • Docker Client with automatic conversion
  • Complete pipeline with all 8 hook points

See the separate file: v0.7.5_docker_hooks_demo.py

This standalone Python script provides comprehensive, runnable examples of the entire Docker Hooks System.


šŸ› ļø Feature 4: Bug Fixes Summary

Major Fixes in v0.7.5

  1. URL Processing - Fixed '+' sign preservation in query parameters
  2. Proxy Configuration - Enhanced proxy string parsing (old parameter deprecated)
  3. Docker Error Handling - Better error messages with status codes
  4. Memory Management - Fixed leaks in long-running sessions
  5. JWT Authentication - Fixed Docker JWT validation
  6. Playwright Stealth - Fixed stealth features
  7. API Configuration - Fixed config handling
  8. Deep Crawl Strategy - Resolved JSON encoding errors
  9. LLM Provider Support - Fixed custom provider integration
  10. Performance - Resolved backoff strategy failures

New Proxy Configuration Example

python
# OLD WAY (Deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# NEW WAY (v0.7.5)
browser_config_with_proxy = BrowserConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "optional-username",  # Optional
        "password": "optional-password"   # Optional
    }
)

print("āœ… New proxy configuration format demonstrated")
print("\nšŸ“ Benefits:")
print("  • More explicit and clear")
print("  • Better authentication support")
print("  • Consistent with industry standards")

šŸŽÆ Complete Example: Combining Multiple Features

Let's create a real-world example that uses multiple v0.7.5 features together:

python
async def complete_demo():
    """
    Comprehensive demo combining multiple v0.7.5 features
    """
    print("šŸŽÆ Complete v0.7.5 Feature Demo\n")
    print("=" * 60)
    
    # Use function-based hooks (NEW Docker Hooks System)
    print("\n1ļøāƒ£ Using Docker Hooks System (NEW!)")
    hooks = {
        "on_page_context_created": set_viewport_hook,
        "before_goto": add_custom_headers_hook,
        "before_retrieve_html": log_page_metrics_hook
    }
    
    # Convert to strings using the NEW utility
    hooks_strings = hooks_to_string(hooks)
    print(f"   āœ“ Converted {len(hooks_strings)} hooks to string format")
    print("   āœ“ Ready to send to Docker API")
    
    # Use HTTPS preservation
    print("\n2ļøāƒ£ Enabling HTTPS Preservation")
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?example\.com(\/.*)?$"]
    )
    
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # v0.7.5 feature
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=3,
            filter_chain=FilterChain([url_filter])
        )
    )
    print("   āœ“ HTTPS preservation enabled")
    
    # Use new proxy config format
    print("\n3ļøāƒ£ Using New Proxy Configuration Format")
    browser_config = BrowserConfig(
        headless=True,
        # proxy_config={  # Uncomment if you have a proxy
        #     "server": "http://proxy:8080"
        # }
    )
    print("   āœ“ New proxy config format ready")
    
    # Run the crawl
    print("\n4ļøāƒ£ Executing Crawl with All Features")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # With deep_crawl_strategy, returns a list
        results = await crawler.arun(
            url="https://example.com",
            config=config
        )
        
        if results and len(results) > 0:
            result = results[0]  # Get first result
            print("   āœ“ Crawl successful!")
            print(f"\nšŸ“Š Results:")
            print(f"   • Pages crawled: {len(results)}")
            print(f"   • Title: {result.metadata.get('title', 'N/A')}")
            print(f"   • Content length: {len(result.markdown.raw_markdown)} characters")
            print(f"   • Links found: {len(result.links['internal']) + len(result.links['external'])}")
        else:
            print(f"   āš ļø No results returned")
    
    print("\n" + "=" * 60)
    print("āœ… Complete Feature Demo Finished!\n")

# Run complete demo
await complete_demo()

šŸŽ“ Summary

What We Covered

āœ… HTTPS Preservation - Maintain secure protocols throughout crawling
āœ… Enhanced LLM Integration - Custom temperature and provider configuration
āœ… Docker Hooks System (NEW!) - Complete pipeline customization with 3 approaches
āœ… hooks_to_string() Utility (NEW!) - Convert functions for Docker API
āœ… Bug Fixes - New proxy config and multiple improvements

Key Highlight: Docker Hooks System 🌟

The Docker Hooks System is completely NEW in v0.7.5. It offers:

  • 8 strategic hook points in the pipeline
  • 3 ways to use hooks (strings, utility, auto-conversion)
  • Full control over crawling behavior
  • Support for authentication, optimization, and custom processing

Next Steps

  1. Docker Hooks Demo - See v0.7.5_docker_hooks_demo.py for complete Docker Hooks examples
  2. Documentation - Visit docs.crawl4ai.com for full reference
  3. Examples - Check GitHub examples
  4. Community - Join Discord for support


Happy Crawling with v0.7.5! šŸš€