docs/md_v2/core/browser-crawler-config.md
Crawl4AI's flexibility stems from three key classes:

- BrowserConfig – Dictates how the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
- CrawlerRunConfig – Dictates how each crawl operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
- LLMConfig – Dictates how LLM providers are configured (model, API token, base URL, temperature, etc.).

In most examples, you create one BrowserConfig for the entire crawler session, then pass a fresh or re-used CrawlerRunConfig whenever you call arun(). This tutorial covers the most commonly used parameters; if you need advanced or rarely used fields, see the Configuration Parameters reference.
```python
class BrowserConfig:
    def __init__(
        browser_type="chromium",
        headless=True,
        browser_mode="dedicated",
        use_managed_browser=False,
        cdp_url=None,
        debugging_port=9222,
        host="localhost",
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=(
            # "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
            # "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            # "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
        ),
        user_agent_mode="",
        text_mode=False,
        light_mode=False,
        extra_args=None,
        enable_stealth=False,
        # ... other advanced parameters omitted here
    ):
        ...
```
1. browser_type
- Options: "chromium", "firefox", or "webkit".
- Defaults to "chromium".
2. headless
- True: Runs the browser in headless mode (invisible browser).
- False: Runs the browser in visible mode, which helps with debugging.
3. browser_mode
- "dedicated" (default): Creates a new browser instance each time.
- "builtin": Uses the builtin CDP browser running in the background.
- "custom": Uses explicit CDP settings provided in cdp_url.
- "docker": Runs the browser in a Docker container with isolation.
4. use_managed_browser & cdp_url
- use_managed_browser=True: Launch the browser via the Chrome DevTools Protocol (CDP) for advanced control.
- cdp_url: URL for the CDP endpoint (e.g., "ws://localhost:9222/devtools/browser/").
- These work together with browser_mode (see above).
5. debugging_port & host
- debugging_port: Port for the browser debugging protocol (default: 9222).
- host: Host for the browser connection (default: "localhost").
6. proxy_config
- A ProxyConfig object or dictionary with fields like:

```json
{
    "server": "http://proxy.example.com:8080",
    "username": "...",
    "password": "..."
}
```

- Leave it as None if a proxy is not required. (A combined example using a proxy appears after this list.)
7. viewport_width & viewport_height
- The initial browser window size (defaults: 1080×600).
8. device_scale_factor
- Defaults to 1.0. Set to 2.0 for Retina-quality screenshots (e.g., a 1920×1080 viewport produces 3840×2160 images).
9. verbose
- If True, prints extra logs.
10. use_persistent_context
- If True, uses a persistent browser profile, storing cookies/local storage across runs.
- Set user_data_dir to point to a folder.
11. cookies & headers
- If you want to start with specific cookies or add universal HTTP headers to the browser context, set them here.
- E.g. cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}].
12. user_agent & user_agent_mode
- user_agent: Custom User-Agent string. If None, a default is used.
- user_agent_mode: Set to "random" for randomization (helps fight bot detection).
13. text_mode & light_mode
- text_mode=True disables images, possibly speeding up text-only crawls.
- light_mode=True turns off certain background features for performance.
14. avoid_ads & avoid_css
- avoid_ads=True blocks requests to common ad and tracker domains (Google Analytics, DoubleClick, Facebook, Hotjar, etc.) at the browser context level. Reduces network overhead and memory usage.
- avoid_css=True blocks loading of CSS files (.css, .less, .scss, .sass), useful when you only need text content and want faster, leaner crawls.
- Both default to False (opt-in). Can be combined with each other and with text_mode.
15. extra_args
- Additional flags for the underlying browser.
- E.g. ["--disable-extensions"].
16. enable_stealth
- If True, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is False. Recommended for sites with bot protection.
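The flags above are meant to be combined. As a rough sketch (the proxy, cookie, and header values below are placeholders, not recommendations), a config aimed at a bot-protected, text-heavy site might look like this:

```python
from crawl4ai import BrowserConfig

# Illustrative values only: swap in your own proxy, cookies, and headers.
stealth_browser = BrowserConfig(
    headless=True,
    text_mode=True,                   # skip images for faster text-only crawls
    user_agent_mode="random",         # randomize the User-Agent string
    enable_stealth=True,              # playwright-stealth fingerprint tweaks
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxy_user",
        "password": "proxy_pass",
    },
    cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}],
    headers={"Accept-Language": "en-US,en;q=0.9"},
    extra_args=["--disable-extensions"],
)
```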
Both configuration classes provide a clone() method to create modified copies:
# Create a base browser config
base_browser = BrowserConfig(
browser_type="chromium",
headless=True,
text_mode=True
)
# Create a visible browser config for debugging
debug_browser = base_browser.clone(
headless=False,
verbose=True
)
Both BrowserConfig and CrawlerRunConfig support class-level default overrides via set_defaults(). This is useful in server/cloud deployments where every config instance needs the same base settings — set them once at startup instead of repeating at every call site.
```python
from crawl4ai import BrowserConfig, CrawlerRunConfig

# At application startup — one time
BrowserConfig.set_defaults(
    cache_cdp_connection=True,
    cdp_close_delay=0,
    create_isolated_context=True,
)
CrawlerRunConfig.set_defaults(verbose=False)

# Every new instance automatically inherits those defaults
cfg = BrowserConfig(cdp_url="ws://localhost:9222")
# → cache_cdp_connection=True, cdp_close_delay=0, create_isolated_context=True

# Explicit values still win
cfg = BrowserConfig(cdp_url="ws://localhost:9222", cache_cdp_connection=False)
# → cache_cdp_connection=False (explicit overrides the class default)
```
Available methods (on both BrowserConfig and CrawlerRunConfig):
| Method | Description |
|---|---|
| set_defaults(**kwargs) | Set class-level defaults. Invalid parameter names raise ValueError. |
| get_defaults() | Return a copy of the current class-level defaults. |
| reset_defaults() | Clear all class-level defaults. |
| reset_defaults("param1", "param2") | Clear only the named defaults. |
Note: Class defaults are independent per class — BrowserConfig.set_defaults() does not affect CrawlerRunConfig, and vice versa. Defaults are stored in memory and apply for the lifetime of the process.
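For completeness, here is a quick sketch of the inspection and reset methods from the table above (the printed output is illustrative):

```python
from crawl4ai import BrowserConfig

BrowserConfig.set_defaults(cache_cdp_connection=True, cdp_close_delay=0)

# Inspect the current class-level defaults (returns a copy)
print(BrowserConfig.get_defaults())
# e.g. {'cache_cdp_connection': True, 'cdp_close_delay': 0}

# Clear only selected defaults...
BrowserConfig.reset_defaults("cdp_close_delay")

# ...or clear everything
BrowserConfig.reset_defaults()
```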
Minimal Example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_conf = BrowserConfig(
        browser_type="firefox",
        headless=False,
        text_mode=True
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())
```
```python
class CrawlerRunConfig:
    def __init__(
        word_count_threshold=200,
        extraction_strategy=None,
        chunking_strategy=RegexChunking(),
        markdown_generator=None,
        cache_mode=CacheMode.BYPASS,
        js_code=None,
        c4a_script=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        capture_mhtml=False,
        # Location and Identity Parameters
        locale=None,            # e.g. "en-US", "fr-FR"
        timezone_id=None,       # e.g. "America/New_York"
        geolocation=None,       # GeolocationConfig object
        # Proxy Configuration
        proxy_config=None,
        proxy_rotation_strategy=None,
        # Page Interaction Parameters
        scan_full_page=False,
        scroll_delay=0.2,
        wait_until="domcontentloaded",
        page_timeout=60000,
        delay_before_return_html=0.1,
        # URL Matching Parameters
        url_matcher=None,       # For URL-specific configurations
        match_mode=MatchMode.OR,
        verbose=True,
        stream=False,           # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
```
1. word_count_threshold:
- The minimum word count before a text block is considered (default: 200).
2. extraction_strategy:
- Where to plug in structured extraction (JSON CSS, LLM-based, etc.).
- If None, no structured extraction is done (only raw/cleaned HTML + markdown).
3. chunking_strategy:
- Defaults to RegexChunking(). Can be customized for different chunking approaches.
4. markdown_generator:
- E.g., DefaultMarkdownGenerator(...), controlling how HTML→Markdown conversion is done.
- If None, a default approach is used.
5. cache_mode:
- Controls caching behavior (ENABLED, BYPASS, DISABLED, etc.).
- Defaults to CacheMode.BYPASS.
6. js_code, js_code_before_wait, & c4a_script:
- js_code: JavaScript to run after wait_for completes — on the fully-loaded page.
- js_code_before_wait: JavaScript to run before wait_for — for triggering loading that wait_for then checks.
- c4a_script: C4A script that compiles to JavaScript.
- (A combined sketch using js_code and wait_for appears after this list.)
7. wait_for:
- E.g., wait_for="css:.main-loaded" or wait_for="js:() => window.loaded === true".
8. flatten_shadow_dom:
- If True, flattens Shadow DOM content into the light DOM before HTML capture.
9. screenshot, pdf, & capture_mhtml:
- If True, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- Results appear in result.screenshot (base64), result.pdf (bytes), or result.mhtml (string).
- Set force_viewport_screenshot=True to capture only the visible viewport instead of the full page. This is faster and produces smaller images when you don't need a full-page screenshot.
10. Location Parameters:
- locale: Browser's locale (e.g., "en-US", "fr-FR") for language preferences.
- timezone_id: Browser's timezone (e.g., "America/New_York", "Europe/Paris").
- geolocation: GPS coordinates via GeolocationConfig(latitude=48.8566, longitude=2.3522).
11. Proxy Configuration:
- proxy_config: Single ProxyConfig or list[ProxyConfig] — proxies tried in order. Pass a list for automatic escalation.
- proxy_rotation_strategy: Strategy for rotating proxies during crawls
12. Anti-Bot Retry & Fallback (see Anti-Bot & Fallback):
- max_retries: Number of retry rounds when blocking is detected (default: 0). Each round tries all proxies in proxy_config.
- fallback_fetch_function: Async function called as last resort — takes URL, returns raw HTML
13. Page Interaction Parameters:
- scan_full_page: If True, scroll through the entire page to load all content
- wait_until: Condition to wait for when navigating (e.g., "domcontentloaded", "networkidle")
- page_timeout: Timeout in milliseconds for page operations (default: 60000)
- delay_before_return_html: Delay in seconds before retrieving final HTML.
14. url_matcher & match_mode:
- Enable URL-specific configurations when used with arun_many().
- Set url_matcher to patterns (glob, function, or list) to match specific URLs.
- Use match_mode (OR/AND) to control how multiple patterns combine.
- See URL-Specific Configurations for examples.
15. verbose:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to True in BrowserConfig.
16. stream:
- If True, enables streaming mode for arun_many() to process URLs as they complete.
- Allows handling results incrementally instead of waiting for all URLs to finish.
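To make the interaction parameters concrete, here is a rough sketch that clicks a hypothetical "Load more" button, waits for the new content, and captures a screenshot (the selectors and URL are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

# Placeholder selectors: adapt them to the target page.
interactive_conf = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code="document.querySelector('button.load-more')?.click();",
    wait_for="css:.article-list-extended",   # wait until the extra items render
    scan_full_page=True,                     # scroll to trigger lazy-loaded content
    screenshot=True,                         # result.screenshot will hold a base64 image
    locale="en-US",
    page_timeout=60000,
)

async def demo():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/articles", config=interactive_conf)
        print(result.success, bool(result.screenshot))

asyncio.run(demo())
```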
The clone() method is particularly useful for creating variations of your crawler configuration:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)
```
The clone() method:
- Returns a new instance with all the same settings.
- Updates only the fields you explicitly pass.
- Leaves the original config unchanged, which makes it easy to create variations without repeating every parameter.
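As a usage sketch, the stream_config variant created above would typically be paired with arun_many() in streaming mode, roughly like this (the URLs are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

async def crawl_streaming():
    async with AsyncWebCrawler() as crawler:
        # stream_config comes from the clone() example above (stream=True)
        # so results are yielded as each URL finishes
        async for result in await crawler.arun_many(urls, config=stream_config):
            print(result.url, "ok" if result.success else result.error_message)

asyncio.run(crawl_streaming())
```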
LLMConfig controls how LLM providers are configured (see the intro above). Its key parameters:

1. provider:
- Which LLM provider/model to use, e.g. "ollama/llama3", "groq/llama3-70b-8192", "groq/llama3-8b-8192", "openai/gpt-4o-mini", "openai/gpt-4o", "openai/o1-mini", "openai/o1-preview", "openai/o3-mini", "openai/o3-mini-high", "anthropic/claude-3-haiku-20240307", "anthropic/claude-3-opus-20240229", "anthropic/claude-3-sonnet-20240229", "anthropic/claude-3-5-sonnet-20240620", "gemini/gemini-pro", "gemini/gemini-1.5-pro", "gemini/gemini-2.0-flash", "gemini/gemini-2.0-flash-exp", "gemini/gemini-2.0-flash-lite-preview-02-05", "deepseek/deepseek-chat".
- Default: "openai/gpt-4o-mini".
2. api_token:
- Optional. When not provided explicitly, api_token is read from environment variables based on the provider. For example, if a Gemini model is passed as the provider, "GEMINI_API_KEY" is read from the environment.
- You can pass the provider's API token directly, e.g. api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"
- Or reference an environment variable using the "env:" prefix, e.g. api_token = "env:GROQ_API_KEY"
3. base_url:
- Custom API endpoint, if your provider uses one.
4. Retry/backoff controls (optional):
- backoff_base_delay (default 2 seconds): base delay inserted before the first retry when the provider returns a rate-limit response.
- backoff_max_attempts (default 3): total number of attempts (the initial call plus retries) before the request is surfaced as an error.
- backoff_exponential_factor (default 2): growth rate for the retry delay (delay = base_delay * factor^attempt). For instance, with base_delay=1 and factor=3, successive retries wait roughly 1s, 3s, 9s.
- These values are passed through to the perform_completion_with_backoff helper, ensuring every strategy that consumes your LLMConfig honors the same throttling policy.

```python
import os
from crawl4ai import LLMConfig

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY"),
    backoff_base_delay=1,            # optional
    backoff_max_attempts=5,          # optional
    backoff_exponential_factor=3,    # optional
)
```
In a typical scenario, you define one BrowserConfig for your crawler session, then create one or more CrawlerRunConfig & LLMConfig depending on each call's needs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Example LLM content filtering
    gemini_config = LLMConfig(
        provider="gemini/gemini-1.5-pro",
        api_token="env:GEMINI_API_TOKEN"
    )

    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        llm_config=gemini_config,  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=500,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        markdown_generator=md_generator,
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 5) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
For a detailed list of available parameters (including advanced ones), see the Configuration Parameters reference, where you can also explore more advanced topics.

BrowserConfig, CrawlerRunConfig and LLMConfig give you straightforward ways to define:

- Which browser to launch, how it behaves, and any proxy or user-agent needs.
- How each crawl should run: caching, timeouts, JavaScript, extraction strategies, and so on.
- Which LLM provider and model to use, its API token, and any custom endpoint or retry settings.
Use them together for clear, maintainable code, and when you need more specialized behavior, check out the advanced parameters in the reference docs. Happy crawling!