docs/blog/release-v0.7.3.md
August 6, 2025 β’ 5 min read
Today I'm releasing Crawl4AI v0.7.3βthe Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.
The Problem: You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handlingβcaching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.
My Solution: I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
# Define specialized configs for different content types
configs = [
# Documentation sites - aggressive caching, include links
CrawlerRunConfig(
url_matcher=["*docs*", "*documentation*"],
cache_mode="write",
markdown_generator_options={"include_links": True}
),
# News/blog sites - fresh content, scroll for lazy loading
CrawlerRunConfig(
url_matcher=lambda url: 'blog' in url or 'news' in url,
cache_mode="bypass",
js_code="window.scrollTo(0, document.body.scrollHeight/2);"
),
# API endpoints - structured extraction
CrawlerRunConfig(
url_matcher=["*.json", "*api*"],
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
extraction_type="structured"
)
),
# Default fallback for everything else
CrawlerRunConfig() # No url_matcher = matches everything
]
# Crawl multiple URLs with appropriate configs
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=[
"https://docs.python.org/3/", # β Uses documentation config
"https://blog.python.org/", # β Uses blog config
"https://api.github.com/users", # β Uses API config
"https://example.com/" # β Uses default config
],
config=configs
)
Matching Capabilities:
"*.pdf", "*/blog/*"Expected Real-World Impact:
The Problem: Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
My Solution: I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
from crawl4ai import AsyncWebCrawler, BrowserConfig
# Enable undetected mode for stealth crawling
browser_config = BrowserConfig(
browser_type="undetected", # Use undetected Chrome
headless=True, # Can run headless with stealth
extra_args=[
"--disable-blink-features=AutomationControlled",
"--disable-web-security",
"--disable-features=VizDisplayCompositor"
]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# This will bypass most bot detection systems
result = await crawler.arun("https://protected-site.com")
if result.success:
print("β
Successfully bypassed bot detection!")
print(f"Content length: {len(result.markdown)}")
Advanced Anti-Bot Strategies:
# Combine multiple stealth techniques
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(
# Random user agents and headers
headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1"
},
# Human-like behavior simulation
js_code="""
// Random mouse movements
const simulateHuman = () => {
const event = new MouseEvent('mousemove', {
clientX: Math.random() * window.innerWidth,
clientY: Math.random() * window.innerHeight
});
document.dispatchEvent(event);
};
setInterval(simulateHuman, 100 + Math.random() * 200);
// Random scrolling
const randomScroll = () => {
const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
window.scrollTo(0, scrollY);
};
setTimeout(randomScroll, 500 + Math.random() * 1000);
""",
# Delay to appear more human
delay_before_return_html=2.0
)
result = await crawler.arun("https://bot-protected-site.com", config=config)
Expected Real-World Impact:
The Problem: Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites.
My Solution: Built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
# Monitor memory during crawling
monitor = MemoryMonitor()
async with AsyncWebCrawler() as crawler:
# Start monitoring
monitor.start_monitoring()
# Perform memory-intensive operations
results = await crawler.arun_many([
"https://heavy-js-site.com",
"https://large-images-site.com",
"https://dynamic-content-site.com"
])
# Get detailed memory report
memory_report = monitor.get_report()
print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
# Automatic cleanup suggestions
if memory_report['peak_mb'] > 1000: # > 1GB
print("π‘ Consider batch size optimization")
print("π‘ Enable aggressive garbage collection")
Expected Real-World Impact:
The Problem: Table data was accessed through the generic result.media interface, making DataFrame conversion cumbersome and unclear.
My Solution: Dedicated result.tables interface with direct DataFrame conversion and improved detection algorithms.
# Old way (deprecated)
# tables_data = result.media.get('tables', [])
# New way (v0.7.3+)
result = await crawler.arun("https://site-with-tables.com")
# Direct table access
if result.tables:
print(f"Found {len(result.tables)} tables")
# Convert to pandas DataFrame instantly
import pandas as pd
for i, table in enumerate(result.tables):
df = pd.DataFrame(table['data'])
print(f"Table {i}: {df.shape[0]} rows Γ {df.shape[1]} columns")
print(df.head())
# Table metadata
print(f"Source: {table.get('source_xpath', 'Unknown')}")
print(f"Headers: {table.get('headers', [])}")
Expected Real-World Impact:
I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.
Sponsorship Tiers:
Why Sponsor?
The Problem: Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
My Solution: Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
# Option 1: Direct environment variables
docker run -d \
-e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
-e GROQ_API_KEY="your-key" \
-p 11235:11235 \
unclecode/crawl4ai:latest
# Option 2: Using .llm.env file (recommended for production)
# Create .llm.env file:
# LLM_PROVIDER=openai/gpt-4o-mini
# OPENAI_API_KEY=your-openai-key
# GROQ_API_KEY=your-groq-key
docker run -d \
--env-file .llm.env \
-p 11235:11235 \
unclecode/crawl4ai:latest
Override per request when needed:
# Use default provider from .llm.env
response = requests.post("http://localhost:11235/crawl", json={
"url": "https://example.com",
"extraction_strategy": {"type": "llm"}
})
# Override to use different provider for this specific request
response = requests.post("http://localhost:11235/crawl", json={
"url": "https://complex-page.com",
"extraction_strategy": {
"type": "llm",
"provider": "openai/gpt-4" # Override default
}
})
Expected Real-World Impact:
.llm.env file, not in commandsThis release includes several important bug fixes that improve stability and reliability:
Based on community feedback, we've updated:
Thanks to our contributors and the entire community for feedback and bug reports.
Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deploymentβthey're game changers!
Happy Crawling! π·οΈ
- The Crawl4AI Team