# `arun()` Parameter Guide (New Approach)

In Crawl4AI’s latest configuration model, nearly all parameters that once went directly to `arun()` are now part of `CrawlerRunConfig`. When calling `arun()`, you provide:
```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, grouped by functional area. For browser settings (e.g., `headless`, `browser_type`), see `BrowserConfig`.
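For orientation, here is a minimal sketch of how the two configs typically pair up: browser-wide options go to the crawler itself, per-crawl options go to `arun()`. The parameter values are just placeholders.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(       # Global browser settings
        headless=True,
        browser_type="chromium",
    )
    run_config = CrawlerRunConfig(verbose=True)  # Per-crawl settings

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.success)

if __name__ == "__main__":
    asyncio.run(main())
```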
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Fields:**

- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.

**`cache_mode` (default: `CacheMode.ENABLED`)**
Use one of the built-in `CacheMode` values:
- `CacheMode.ENABLED`: Normal caching; reads from the cache if available, writes if missing.
- `CacheMode.DISABLED`: No caching at all; never reads or writes (pages are always refetched).
- `CacheMode.READ_ONLY`: Reads from the cache only; never writes new entries.
- `CacheMode.WRITE_ONLY`: Writes results to the cache but never reads existing data.
- `CacheMode.BYPASS`: Skips reading the cache for this crawl; always fetches fresh (the result may still be written to the cache).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```
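If you keep one base config around, a common pattern is to override just the cache mode for a single call. A minimal sketch, assuming the `clone()` helper on `CrawlerRunConfig` that returns a modified copy:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

base_config = CrawlerRunConfig(
    verbose=True,
    cache_mode=CacheMode.ENABLED,
)

# One-off fresh fetch; everything else stays as configured above
fresh_config = base_config.clone(cache_mode=CacheMode.BYPASS)
```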
**Content processing:**

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,      # Ignore text blocks <10 words
    only_text=False,              # If True, tries to remove non-text elements
    keep_data_attributes=False    # Keep or discard data-* attributes
)
```
**Content selection & cleanup:**

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",    # Focus on .main-content region only
    excluded_tags=["form", "nav"],   # Remove entire tag blocks
    remove_forms=True,               # Specifically strip <form> elements
    remove_overlay_elements=True,    # Attempt to remove modals/popups
)
```
**Link filtering:**

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```
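To see what survives the filtering, you can inspect the crawl result. A rough sketch, assuming the usual `result.links` layout with `"internal"`/`"external"` buckets:

```python
# Assuming `result` came from an arun() call using the link-filtering config above
internal = result.links.get("internal", [])
external = result.links.get("external", [])
print(f"{len(internal)} internal links kept, {len(external)} external links kept")
```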
**Image filtering:**

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True   # Strip images from other domains
)
```
**Timing & waiting:**

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",   # Wait for .dynamic-content
    delay_before_return_html=2.0,      # Wait 2s before capturing final HTML
    page_timeout=60000,                # Navigation & script timeout (ms)
)
```
**Key Fields:**

- `wait_for`: either `"css:selector"` or `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
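To make the `js:` form and the multi-URL fields concrete, here is a minimal sketch; the URLs and delay values are purely illustrative:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        # Wait until a JS condition is true instead of a CSS selector
        wait_for="js:() => document.querySelectorAll('.item').length > 10",
        mean_delay=1.0,       # ~1s average pause between requests
        max_range=0.5,        # plus/minus 0.5s of random jitter
        semaphore_count=5,    # At most 5 URLs crawled concurrently
    )

    urls = ["https://example.com/page1", "https://example.com/page2"]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            print(result.url, result.success)

if __name__ == "__main__":
    asyncio.run(main())
```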
**JavaScript execution:**

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```
`js_code` can be a single string or a list of strings. `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

**Anti-bot & stealth:**

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).

**session_id:**
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
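A minimal sketch of a two-step, same-session crawl, combining `session_id` with `js_only=True` from the previous section; the URL and selector are illustrative:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: initial navigation, opening the session
        step1 = CrawlerRunConfig(
            session_id="my_session123",
            cache_mode=CacheMode.BYPASS,
        )
        result1 = await crawler.arun(url="https://example.com/listing", config=step1)

        # Step 2: stay on the same page and just run more JS (no new navigation)
        step2 = CrawlerRunConfig(
            session_id="my_session123",
            js_code="document.querySelector('.load-more')?.click();",
            js_only=True,
            cache_mode=CacheMode.BYPASS,
        )
        result2 = await crawler.arun(url="https://example.com/listing", config=step2)

        print(len(result1.html), len(result2.html))

if __name__ == "__main__":
    asyncio.run(main())
```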
**Screenshots, PDFs & media:**

```python
run_config = CrawlerRunConfig(
    screenshot=True,                          # Grab a screenshot as base64
    screenshot_wait_for=1.0,                  # Wait 1s before capturing
    pdf=True,                                 # Also produce a PDF
    image_description_min_word_threshold=5,   # If analyzing alt text
    image_score_threshold=3,                  # Filter out low-score images
)
```
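Since the screenshot comes back base64-encoded and the PDF as raw bytes (see below), a quick sketch of persisting both; the file names are illustrative:

```python
import base64

# Assuming `result` came from an arun() call with screenshot=True and pdf=True
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```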
**Where they appear:**

- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.

For advanced data extraction (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
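With JSON-producing strategies such as `JsonCssExtractionStrategy`, the extracted content is typically a JSON string; a small sketch, assuming that shape:

```python
import json

# Assuming `result` came from an arun() call with a JSON-based extraction strategy
if result.extracted_content:
    items = json.loads(result.extracted_content)
    for item in items:
        print(item)
```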
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)

        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered:**

1. Crawling the main content region, ignoring external links.
2. Running JavaScript to click “.show-more”.
3. Waiting for “.loaded-block” to appear.
4. Generating a screenshot & PDF of the final page.
5. Extracting repeated “article.post” elements with a CSS-based extraction strategy.
**Best practices:**

1. Use `BrowserConfig` for global browser settings (headless, user agent).
2. Use `CrawlerRunConfig` to handle the specific crawl needs: content filtering, caching, JS, screenshots, extraction, etc.
3. Keep your parameters consistent in run configs, especially if you’re part of a large codebase with multiple crawls.
4. Limit large concurrency (`semaphore_count`) if the site or your system can’t handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content (see the sketch below).
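As a rough sketch of point 5, `scan_full_page` asks the crawler to scroll through the page so lazy-loaded content appears; the `scroll_delay` pause shown here is an assumption about the available knobs:

```python
from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(
    scan_full_page=True,   # Scroll through the whole page to trigger lazy loading
    scroll_delay=0.5,      # Assumed: pause (in seconds) between scroll steps
)
```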
All parameters that used to be direct arguments to `arun()` now belong in `CrawlerRunConfig`. This keeps each `arun()` call minimal and lets the same run configuration be reused and shared across crawls.
For a full reference, check out the CrawlerRunConfig Docs.
Happy crawling with your structured, flexible config approach!