docs/md_v2/blog/releases/0.5.0.md
## Release Theme: Power, Flexibility, and Scalability
Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability. Key improvements include a new deep crawling system, a memory-adaptive dispatcher for handling large-scale crawls, multiple crawling strategies (including a fast HTTP-only crawler), Docker deployment options, and a powerful command-line interface (CLI). This release also includes numerous bug fixes, performance optimizations, and documentation updates.
Important Note: This release contains several breaking changes. Please review the "Breaking Changes" section carefully and update your code accordingly.
## Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the initial URLs. This is controlled by the deep_crawl_strategy parameter in CrawlerRunConfig. Several strategies are available:

- BFSDeepCrawlStrategy (Breadth-First Search): Explores the website level by level. (Default)
- DFSDeepCrawlStrategy (Depth-First Search): Explores each branch as deeply as possible before backtracking.
- BestFirstCrawlingStrategy: Uses a scoring function to prioritize which URLs to crawl next.

```python
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import DomainFilter, ContentTypeFilter, FilterChain, URLPatternFilter, KeywordRelevanceScorer, BestFirstCrawlingStrategy
import asyncio
# Create a filter chain to filter URLs based on patterns, domains, and content type
filter_chain = FilterChain(
    [
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"],
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

# Create a keyword scorer that prioritizes pages containing certain keywords
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"], weight=0.7
)

# Set up the crawl configuration
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True,
)
async def main():
    async with AsyncWebCrawler() as crawler:
        start_time = time.perf_counter()
        results = []
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=deep_crawl_config):
            print(f"Crawled: {result.url} (Depth: {result.metadata['depth']}), score: {result.metadata['score']:.2f}")
            results.append(result)
        duration = time.perf_counter() - start_time
        print(f"\n✅ Crawled {len(results)} high-value pages in {duration:.2f} seconds")

asyncio.run(main())
```
Breaking Change: The max_depth parameter is now part of CrawlerRunConfig
and controls the depth of the crawl, not the number of concurrent crawls. The
arun() and arun_many() methods are now decorated to handle deep crawling
strategies. Imports for deep crawling strategies have changed. See the
Deep Crawling documentation for more details.
## Memory-Adaptive Dispatcher

The new MemoryAdaptiveDispatcher dynamically adjusts concurrency based on
available system memory and includes built-in rate limiting. This prevents
out-of-memory errors and avoids overwhelming target websites.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

# Configure the dispatcher (optional, defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,             # Check memory every 0.5 seconds
)

async def batch_mode():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=False),  # Batch mode
            dispatcher=dispatcher,
        )
        for result in results:
            print(f"Crawled: {result.url} with status code: {result.status_code}")

async def stream_mode():
    async with AsyncWebCrawler() as crawler:
        # Streaming mode: process results as they complete
        async for result in await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=True),
            dispatcher=dispatcher,
        ):
            print(f"Crawled: {result.url} with status code: {result.status_code}")

print("Dispatcher in batch mode:")
asyncio.run(batch_mode())
print("-" * 50)
print("Dispatcher in stream mode:")
asyncio.run(stream_mode())
```
Breaking Change: AsyncWebCrawler.arun_many() now uses
MemoryAdaptiveDispatcher by default. Existing code that relied on unbounded
concurrency may require adjustments.
## Multiple Crawling Strategies

Crawl4AI now offers two crawling strategies:

- AsyncPlaywrightCrawlerStrategy (Default): Uses Playwright for browser-based crawling, supporting JavaScript rendering and complex interactions.
- AsyncHTTPCrawlerStrategy: A lightweight, fast, and memory-efficient HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is unnecessary.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
import asyncio
# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async def main():
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status code: {result.status_code}")
        print(f"Content length: {len(result.html)}")

asyncio.run(main())
```
## Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a consistent and isolated environment. The Docker image includes a FastAPI server with both streaming and non-streaming endpoints.
```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```
API Endpoints:

- /crawl (POST): Non-streaming crawl.
- /crawl/stream (POST): Streaming crawl (NDJSON).
- /health (GET): Health check.
- /schema (GET): Returns configuration schemas.
- /md/{url} (GET): Returns markdown content of the URL.
- /llm/{url} (GET): Returns LLM-extracted content.
- /token (POST): Get a JWT token.

Breaking Changes:

- New .llm.env file for API keys.
- New config.yml structure.
- Uses supervisord instead of direct process management.

See the Docker deployment documentation for detailed instructions.
## Command-Line Interface (CLI)

A new CLI (crwl) provides convenient access to Crawl4AI's functionality from the terminal.
```bash
# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example
```
See the CLI documentation for more details.
## Other Changes and Improvements

Added: LXMLWebScrapingStrategy for faster HTML parsing using the lxml
library. This can significantly improve scraping performance, especially for
large or complex pages. Set scraping_strategy=LXMLWebScrapingStrategy() in
your CrawlerRunConfig.
Breaking Change: The ScrapingMode enum has been replaced with a strategy
pattern. Use WebScrapingStrategy (default) or LXMLWebScrapingStrategy.
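A minimal sketch of opting into the LXML scraper (the target URL is illustrative):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Swap the default scraper for the faster lxml-based implementation
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(f"Scraped {len(result.html)} characters of HTML")

asyncio.run(main())
```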
Added: ProxyRotationStrategy abstract base class with a RoundRobinProxyStrategy concrete implementation.
```python
import asyncio
import re

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    ProxyConfig,
    RoundRobinProxyStrategy,
)

async def main():
    # Load proxies from the environment and create a rotation strategy,
    # e.g. export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set the PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice
        print("\n🚀 Starting batch crawl with proxy rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)

        for result in results:
            if result.success:
                ip_match = re.search(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None
                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
            print("---")

asyncio.run(main())
```
Added: LLMContentFilter for intelligent markdown generation. This new filter uses an LLM to create more focused and relevant markdown output.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator, LLMConfig
from crawl4ai.content_filter_strategy import LLMContentFilter

llm_config = LLMConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llm_config=llm_config, instruction="Extract key concepts and summaries")
)

config = CrawlerRunConfig(markdown_generator=markdown_generator)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(result.markdown.fit_markdown)

asyncio.run(main())
```
Added: URL redirection tracking. The crawler now automatically follows
HTTP redirects (301, 302, 307, 308) and records the final URL in the
redirected_url field of the CrawlResult object. No code changes are
required to enable this; it's automatic.
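For example (a minimal sketch; the redirecting URL is just an illustration):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # This endpoint issues an HTTP redirect to https://example.com
        result = await crawler.arun("https://httpbin.org/redirect-to?url=https://example.com")
        print(f"Requested URL: {result.url}")
        print(f"Final URL:     {result.redirected_url}")

asyncio.run(main())
```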
Added: LLM-powered schema generation utility. A new generate_schema
method has been added to JsonCssExtractionStrategy and
JsonXPathExtractionStrategy. This greatly simplifies creating extraction
schemas.
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'><h2>Product Name</h2><span class='price'>$99</span></div>",
    llm_config=llm_config,
    query="Extract product name and price"
)

print(schema)
```
Expected output (may vary slightly depending on the LLM):

```json
{
  "name": "ProductExtractor",
  "baseSelector": "div.product",
  "fields": [
    {"name": "name", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"}
  ]
}
```
Added: robots.txt compliance support. The crawler can now respect
robots.txt rules. Enable this by setting check_robots_txt=True in
CrawlerRunConfig.
```python
config = CrawlerRunConfig(check_robots_txt=True)
```
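A disallowed URL is expected to come back as an unsuccessful result rather than being fetched. A minimal sketch (the target URL is illustrative, and the exact error message may differ):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success:
            print(f"Crawled: {result.url}")
        else:
            # When robots.txt disallows the URL, expect an unsuccessful result here
            print(f"Skipped: {result.error_message}")

asyncio.run(main())
```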
Added: PDF processing capabilities. Crawl4AI can now extract text, images,
and metadata from PDF files (both local and remote). This uses a new
PDFCrawlerStrategy and PDFContentScrapingStrategy.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        result = await crawler.arun(
            "https://arxiv.org/pdf/2310.06825.pdf",
            config=CrawlerRunConfig(
                scraping_strategy=PDFContentScrapingStrategy()
            )
        )
        print(result.markdown)  # Access extracted text
        print(result.metadata)  # Access PDF metadata (title, author, etc.)

asyncio.run(main())
```
Added: Support for frozenset serialization. Improves configuration serialization, especially for sets of allowed/blocked domains. No code changes required.
Added: New LLMConfig object. It can be passed to extraction, filtering, and schema generation tasks, and it simplifies passing provider strings, API tokens, and base URLs anywhere an LLM configuration is needed. It also enables reuse and quick experimentation across different LLM configurations.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy

# Example of using LLMConfig with LLMExtractionStrategy
llm_config = LLMConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
strategy = LLMExtractionStrategy(llm_config=llm_config, schema=...)  # supply your extraction schema

# Example usage within a crawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy)
        )
        print(result.extracted_content)

asyncio.run(main())
```
Breaking Change: Removed old parameters like provider, api_token,
base_url, and api_base from LLMExtractionStrategy and
LLMContentFilter. Users should migrate to using the LLMConfig object.
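As a rough before/after sketch of that migration (the schema dict is a made-up placeholder, and the commented-out block only illustrates the pre-0.5.0 call shape described above):

```python
from crawl4ai import LLMConfig, LLMExtractionStrategy

# Hypothetical schema used purely for illustration
my_schema = {"type": "object", "properties": {"title": {"type": "string"}}}

# Before 0.5.0 (no longer supported): provider details were passed directly
# strategy = LLMExtractionStrategy(
#     provider="openai/gpt-4o",
#     api_token="YOUR_API_KEY",
#     schema=my_schema,
# )

# From 0.5.0: bundle provider, token, and base URL into an LLMConfig instead
llm_config = LLMConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
strategy = LLMExtractionStrategy(llm_config=llm_config, schema=my_schema)
```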
Changed: Improved browser context management and added shared data support. Browser contexts are now managed more efficiently, reducing resource usage, and a new shared_data dictionary is available in the BrowserContext for passing data between different stages of the crawling process. Breaking Change: The BrowserContext API has changed, and the old get_context method is deprecated.
Changed: Renamed final_url to redirected_url in CrawledURL. This
improves consistency and clarity. Update any code referencing the old field
name.
Changed: Improved type hints and removed unused files. This is an internal improvement and should not require code changes.
Changed: Reorganized deep crawling functionality into a dedicated module.
(Breaking Change: Import paths for DeepCrawlStrategy and related classes
have changed). This improves code organization. Update imports to use the new
crawl4ai.deep_crawling module.
Changed: Improved HTML handling and cleaned up the codebase. (Breaking
Change: Removed ssl_certificate.json file). This removes an unused file.
If you were relying on this file for custom certificate validation, you'll
need to implement an alternative approach.
Changed: Enhanced serialization and config handling. (Breaking Change:
FastFilterChain has been replaced with FilterChain). This change
simplifies config and improves the serialization.
Changed: The license is now Apache 2.0 with a required attribution
clause. See the LICENSE file for details. All users must now clearly
attribute the Crawl4AI project when using, distributing, or creating
derivative works.
Fixed: Prevent memory leaks by ensuring proper closure of Playwright pages. No code changes required.
Fixed: Made model fields optional with default values (Breaking
Change: Code relying on all fields being present may need adjustment).
Fields in data models (like CrawledURL) are now optional, with default
values (usually None). Update code to handle potential None values.
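For instance, code that previously assumed a field was always populated should now guard against None. A small sketch (status_code is just one example of a field that may be absent):

```python
from typing import Optional

def describe_status(status_code: Optional[int]) -> str:
    # Fields like status_code may now default to None, so guard before comparing
    if status_code is None:
        return "no status code recorded"
    return "error response" if status_code >= 400 else "ok"

print(describe_status(None))  # -> "no status code recorded"
print(describe_status(200))   # -> "ok"
```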
Fixed: Adjust memory threshold and fix dispatcher initialization. This is an internal bug fix; no code changes are required.
Fixed: Ensure proper exit after running doctor command. No code changes are required.
Fixed: JsonCss selector handling and related crawler improvements.
Fixed: Long-page screenshots not working (#403).
Documentation: Updated documentation URLs to the new domain.
Documentation: Added SERP API project example.
Documentation: Added clarifying comments for CSS selector behavior.
Documentation: Added a Code of Conduct for the project (#410).
## Breaking Changes Summary

- MemoryAdaptiveDispatcher is now the default for arun_many(), changing concurrency behavior. The return type of arun_many() depends on the stream parameter.
- max_depth is now part of CrawlerRunConfig and controls crawl depth. Import paths for deep crawling strategies have changed.
- The BrowserContext API has been updated.
- The ScrapingMode enum has been replaced by a strategy pattern (WebScrapingStrategy, LXMLWebScrapingStrategy).
- Removed: the content_filter parameter from CrawlerRunConfig. Use extraction strategies or markdown generators with filters instead.
- Removed: the legacy WebCrawler, the old CLI, and docs management functionality.
- Import changes: DeepCrawlStrategy, BreadthFirstSearchStrategy, and related classes have moved due to the new deep_crawling module structure.

## Migration Guide

1. CrawlerRunConfig: Move max_depth to CrawlerRunConfig. If using content_filter, migrate to an extraction strategy or a markdown generator with a filter.
2. arun_many(): Adapt code to the new MemoryAdaptiveDispatcher behavior and the return type.
3. BrowserContext: Update code using the BrowserContext API.
4. Data models: Handle potential None values for optional fields in data models.
5. Scraping: Replace the ScrapingMode enum with WebScrapingStrategy or LXMLWebScrapingStrategy.
6. CLI: Switch to the new crwl command and update any scripts using the old CLI.
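For step 2 in particular, the return type of arun_many() now follows the stream flag, mirroring the dispatcher example earlier in these notes (URLs are illustrative):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://example.com", "https://docs.crawl4ai.com"]
    async with AsyncWebCrawler() as crawler:
        # stream=False: arun_many() returns a list once all URLs have finished
        results = await crawler.arun_many(urls=urls, config=CrawlerRunConfig(stream=False))
        for result in results:
            print("batch:", result.url, result.success)

        # stream=True: arun_many() returns an async generator of results
        async for result in await crawler.arun_many(urls=urls, config=CrawlerRunConfig(stream=True)):
            print("stream:", result.url, result.success)

asyncio.run(main())
```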