docs/blog/release-v0.8.5.md
March 2026 • 10 min read
I'm releasing Crawl4AI v0.8.5—our biggest release since v0.8.0. This update brings automatic anti-bot detection with proxy escalation, Shadow DOM flattening, deep crawl cancellation, and over 60 bug fixes from both our team and the community. If you're running crawls at scale or dealing with protected sites, this one's for you.
Also in this release: a cancel() / should_cancel callback for deep crawls, set_defaults() / get_defaults() / reset_defaults() on config classes, avoid_ads / avoid_css resource blocking, and proper | col1 | col2 | pipe delimiters in markdown table output.

The headline feature: Crawl4AI now automatically detects when a page is blocked by anti-bot protection and takes action—retrying with different proxies or falling back to an alternative fetch method.
The detection works in three tiers, and the escalation path is configured per crawl:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import ProxyConfig

config = CrawlerRunConfig(
    # Try direct first, then proxy on bot detection
    proxy_config=[
        ProxyConfig.DIRECT,
        ProxyConfig(server="http://my-proxy:8080"),
    ],
    max_retries=2,
    # Optional: fallback when all proxies fail
    fallback_fetch_function=my_web_unlocker_function,
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://protected-site.com", config=config)

    # Check what happened
    stats = result.crawl_stats
    print(f"Resolved by: {stats['resolved_by']}")  # "direct", "proxy", or "fallback_fetch"
    print(f"Proxies tried: {len(stats['proxies_used'])}")
```
The system errs on the side of caution—false positives are cheap (the fallback rescues them), but false negatives mean garbage results. After 5 iterations of real-world testing, it handles everything from Cloudflare challenges to Reddit's 180KB SPA block pages.
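If you route blocked pages through an external unlocker service, the fallback can be an ordinary coroutine. The sketch below is only an illustration: the (url) -> html call shape and the unlocker.example.com endpoint are assumptions, not part of the Crawl4AI API, so adapt it to whatever contract fallback_fetch_function expects in your version.

```python
import httpx

# Hypothetical fallback: fetch a blocked page through an external
# "web unlocker" HTTP API. The (url) -> html signature is an assumption;
# check the fallback_fetch_function docs for the exact contract.
async def my_web_unlocker_function(url: str) -> str:
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            "https://unlocker.example.com/render",  # placeholder endpoint
            json={"url": url},
        )
        resp.raise_for_status()
        return resp.text  # raw HTML handed back to the crawler
```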
Web components with shadow DOM hide their content from regular DOM traversal. The new flatten_shadow_dom option serializes shadow DOM content into the light DOM before extraction.
```python
config = CrawlerRunConfig(flatten_shadow_dom=True)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://some-web-component-site.com", config=config)
    # Shadow DOM content is now visible in result.html, cleaned_html, and markdown
```
The implementation patches attachShadow to force-open closed shadow roots, recursively resolves <slot> projections, and strips only shadow-scoped <style> tags. It also reorders the JS execution pipeline—js_code now runs after wait_for + delay_before_return_html so your scripts operate on the fully-hydrated page. If you need JS to run before waiting, use the new js_code_before_wait parameter.
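For example, if a click has to fire before hydration but your extraction script should only run on the settled page, the two hooks can be split like this (a minimal sketch; the selectors and scripts are placeholders):

```python
config = CrawlerRunConfig(
    flatten_shadow_dom=True,
    # Runs before wait_for: trigger lazy-loaded content
    js_code_before_wait="document.querySelector('#load-more')?.click();",
    # Wait until the hydrated content appears
    wait_for="css:.results-list",
    delay_before_return_html=1.0,
    # Runs after wait_for + delay_before_return_html, on the settled page
    js_code="window.scrollTo(0, document.body.scrollHeight);",
)
```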
All deep crawl strategies (BFS, DFS, BestFirst) now support graceful cancellation:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

pages_found = 0

def should_stop():
    return pages_found >= 50  # Stop after finding enough pages

async def on_state(state):
    global pages_found  # module-level counter updated by the state callback
    pages_found = state["pages_crawled"]

strategy = DFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=1000,
    should_cancel=should_stop,   # Sync or async callback
    on_state_change=on_state,
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://example.com", config=config)
    print(f"Cancelled: {strategy.cancelled}")
```
You can also call strategy.cancel() directly from another thread or coroutine.
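A time-boxed crawl is one concrete use: run the cancel call from a side task (a small sketch; the 30-second budget is arbitrary):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

async def cancel_after(strategy, seconds):
    # Sleep in a side task, then cancel the running deep crawl
    await asyncio.sleep(seconds)
    strategy.cancel()

async def main():
    strategy = DFSDeepCrawlStrategy(max_depth=3, max_pages=1000)
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        canceller = asyncio.create_task(cancel_after(strategy, 30))
        results = await crawler.arun("https://example.com", config=config)
        canceller.cancel()  # tidy up if the crawl finished before the deadline
    print(f"Cancelled: {strategy.cancelled}, pages: {len(results)}")

asyncio.run(main())
```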
Tired of repeating the same parameters? Set defaults once and they apply to every new instance:
```python
from crawl4ai import BrowserConfig, CrawlerRunConfig

# Set organization-wide defaults
BrowserConfig.set_defaults(headless=True, text_mode=True)
CrawlerRunConfig.set_defaults(verbose=False, remove_consent_popups=True)

# All new instances inherit defaults
bc = BrowserConfig()       # headless=True, text_mode=True
rc = CrawlerRunConfig()    # verbose=False, remove_consent_popups=True

# Explicit params always override
bc2 = BrowserConfig(text_mode=False)  # text_mode=False, headless still True

# Inspect and reset
print(BrowserConfig.get_defaults())   # {"headless": True, "text_mode": True}
BrowserConfig.reset_defaults()        # Back to normal
```
Many sites split a single item's data across sibling elements (think Hacker News, where title and score are in separate <tr> rows). The new "source" field navigates to a sibling before extracting:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "HackerNewsItems",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": ".titleline > a", "type": "text"},
        {"name": "link", "selector": ".titleline > a", "type": "attribute", "attribute": "href"},
        # Navigate to the NEXT sibling <tr> to get the score
        {"name": "score", "selector": ".score", "type": "text", "source": "+ tr"},
        {"name": "author", "selector": ".hnuser", "type": "text", "source": "+ tr"},
    ],
}

strategy = JsonCssExtractionStrategy(schema=schema)
```
Works in both JsonCssExtractionStrategy and JsonXPathExtractionStrategy. Falls back gracefully when siblings don't exist.
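To run the schema end-to-end, pass the strategy through CrawlerRunConfig and parse the JSON string in result.extracted_content (a short usage sketch):

```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(extraction_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    items = json.loads(result.extracted_content)  # one dict per baseSelector match
    for item in items[:5]:
        print(item["title"], "|", item.get("score"), "|", item.get("author"))
```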
A single flag auto-dismisses cookie consent banners from 40+ CMP platforms:
```python
config = CrawlerRunConfig(remove_consent_popups=True)
```
Covers OneTrust, Cookiebot, Didomi, Quantcast, Sourcepoint, Google FundingChoices, TrustArc, ConsentManager, Osano, Iubenda, Complianz, LiveRamp, CookieYes, Klaro, Termly, and many more.
Block ad trackers and CSS resources at the network level for faster, leaner crawls:
```python
config = BrowserConfig(
    avoid_ads=True,   # Blocks doubleclick, google-analytics, etc.
    avoid_css=True,   # Blocks .css, .less, .scss resources
)
```
For long-running crawl sessions:
```python
config = BrowserConfig(
    memory_saving_mode=True,        # Aggressive cache/V8 heap flags
    max_pages_before_recycle=100,   # Auto-restart browser after N pages
)
```
This prevents memory leaks during sustained crawling. The recycling uses a version-based approach that's safe under concurrent load—we fixed three separate deadlock bugs to get this right.
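One way to use this during sustained crawling is to pair the browser settings with arun_many and let the recycler restart the browser every 100 pages (a minimal sketch; the URL list is a placeholder):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    memory_saving_mode=True,        # aggressive cache/V8 heap flags
    max_pages_before_recycle=100,   # restart the browser after 100 pages
)
run_config = CrawlerRunConfig(verbose=False)

urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]  # placeholder URLs

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    ok = sum(1 for r in results if r.success)
    print(f"{ok}/{len(results)} pages crawled successfully")
```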
Tables in markdown output now have proper GitHub-Flavored Markdown pipe delimiters:
Before (v0.8.0):

```
Name | Age | City
---|---|---
Alice | 30 | NYC
```

After (v0.8.5):

```
| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
```
- query_llm_config: Separate LLM config for adaptive crawler query expansion (#1682)
- force_viewport_screenshot: Screenshot only the viewport, not the full page (see the sketch after this list)
- device_scale_factor: Configurable screenshot DPI via BrowserConfig (#1463)
- redirected_status_code: Now available on CrawlResult (#1435)
- wait_for_images: Wait for images to load before taking screenshots (#1792)
- score_threshold: Filter low-quality URLs in BestFirstCrawlingStrategy (#1804)
- link_preview_timeout: Configurable timeout in AdaptiveConfig (#1793)
- --json-ensure-ascii: CLI flag for Unicode preservation in JSON output (#1668)
- Type-list pipeline: Chained extraction like ["attribute", "regex"] in JsonCssExtractionStrategy (#1290)
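For instance, the new screenshot options can be combined with the existing screenshot flag (a small sketch; the values are arbitrary, and I'm assuming force_viewport_screenshot and wait_for_images sit on CrawlerRunConfig as described above):

```python
import base64

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(device_scale_factor=2)  # 2x-DPI screenshots
config = CrawlerRunConfig(
    screenshot=True,
    force_viewport_screenshot=True,  # capture only the visible viewport
    wait_for_images=True,            # let images finish loading first
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com", config=config)
    with open("viewport.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))  # screenshot is base64-encoded
```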
Severity: CRITICAL. Affected: Docker API deployment (v0.8.0 and earlier).

The /crawl endpoint's deserialization logic used eval() for certain object types. I removed this entirely and added an allowlist (ALLOWED_DESERIALIZE_TYPES) so only known config classes can be instantiated.
Affected: Docker deployments using Redis
Upgraded Redis to 7.2.7 which patches the Lua use-after-free vulnerability.
- /token endpoint now requires api_token when configured (#1795)
- sec-ch-ua synced with User-Agent; WebGL kept alive in stealth mode
- add_init_script with create_isolated_context=False (#1768)
- simulate_user destroying page content via ArrowDown keypress
- ERR_INVALID_AUTH_CREDENTIALS (#1281)
- can_process_url() now receives the normalized URL
- total_score not calculated for links that fail head extraction
- FilterChain.add_filter AttributeError on tuple immutability
- is_external_url port comparison (#1783)
- <base> tag ignored in html2text relative link resolution (#1721)
- cleaned_html (#1364)
- class and id attributes in cleaned_html (#1782)
- force_json_response path for LLM extraction
- finish_reason (#1788)
- agenerate_schema() JSON parsing for Anthropic models
- from_serializable_dict ignoring plain data dicts with a "type" key
- css_selector ignored in LXML scraping for raw:// URLs (#1484)
- CRAWL4_AI_BASE_DIRECTORY env var (#1296)
- UnicodeEncodeError in URL seeder; zero-width chars stripped (#1784)
- scroll_delay ignored in full-page screenshot scroller
- /llm per-request provider override; Redis config from host/port/password (#1611, #1817)
- scan_full_page=False (#1750)
- arun_many dispatcher bypass (#1818, #1509)
- tf-playwright-stealth replaced with playwright-stealth (#1553)
- script.js included in package distribution (#1711)
- text → string (#1077)
- chardet.detect run in a thread executor (#1751)
- mean_delay/max_range passed from CrawlerRunConfig into the dispatcher rate limiter (#1786)

Added a comprehensive 291-test regression suite covering all major subsystems: core crawl, content processing, extraction strategies, deep crawling, browser management, config serialization, utilities, and edge cases.
cleaned_html now preserves class and id attributes. If you have downstream code that parses cleaned_html and assumes those attributes are absent, it may need updating. This change enables CSS-based analysis on cleaned HTML.
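If you want to take advantage of the preserved attributes, any CSS-capable parser works on the cleaned output; here's a quick sketch with BeautifulSoup (the selector is a placeholder):

```python
from bs4 import BeautifulSoup

# result is a CrawlResult from a previous crawler.arun(...) call
soup = BeautifulSoup(result.cleaned_html, "html.parser")
# class/id attributes now survive cleaning, so CSS selectors keep working
cards = soup.select("div.product-card")  # placeholder selector
print(f"Found {len(cards)} product cards in cleaned_html")
```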
If you pin Redis versions in your deployment, update to 7.2.7 or later.
```bash
pip install --upgrade crawl4ai
# or
pip install crawl4ai==0.8.5
```
```bash
docker pull unclecode/crawl4ai:0.8.5
docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.8.5
```
Run the verification tests to confirm all features are working:
```bash
python docs/releases_review/demo_v0.8.5.py
```
This runs 13 actual tests that crawl real URLs and verify each feature end-to-end.
This release includes contributions from a large number of community members. Thank you to everyone who submitted PRs, reported issues, and provided reproduction steps. Special thanks to all contributors listed in CONTRIBUTORS.md.
Issues fixed: #462, #880, #943, #1031, #1077, #1183, #1213, #1251, #1281, #1290, #1296, #1308, #1354, #1364, #1370, #1374, #1424, #1435, #1463, #1484, #1487, #1489, #1494, #1503, #1509, #1512, #1520, #1553, #1594, #1601, #1606, #1611, #1622, #1635, #1640, #1658, #1666, #1667, #1668, #1671, #1682, #1686, #1711, #1715, #1716, #1721, #1730, #1731, #1746, #1750, #1751, #1754, #1758, #1762, #1768, #1770, #1776, #1782, #1783, #1784, #1786, #1788, #1789, #1790, #1792, #1793, #1794, #1795, #1796, #1797, #1801, #1803, #1804, #1805, #1815, #1817, #1818, #1824
This is a massive release—10 new features, critical security patches, and 60+ bug fixes. Whether you're dealing with anti-bot protection, shadow DOM sites, or just want more reliable crawls at scale, v0.8.5 has you covered. Thank you for your continued support!
Happy crawling!
- unclecode