docs/md_v2/core/quickstart.md
Welcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you'll:

1. Run your first crawl with minimal configuration.
2. Generate Markdown output and see how content filters shape it.
3. Try two extraction strategies (CSS-based and LLM-based).
4. Crawl a dynamic page that uses "Load More" buttons or JavaScript updates.

Crawl4AI provides:

- Asynchronous crawling via `AsyncWebCrawler`.
- Configurable browser and run behavior via `BrowserConfig` and `CrawlerRunConfig`.
- Automatic Markdown generation via `DefaultMarkdownGenerator` (supports optional filters).

By the end of this guide, you'll have performed a basic crawl, generated Markdown, tried out two extraction strategies, and crawled a dynamic page that uses "Load More" buttons or JavaScript updates.
Here’s a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```
What’s happening?
- `AsyncWebCrawler` launches a headless browser (Chromium by default).
- It fetches `https://example.com` and converts the page to Markdown.

You now have a simple, working crawl!
Crawl4AI’s crawler can be heavily customized using two main classes:
1. BrowserConfig: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. CrawlerRunConfig: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
Below is a minimal example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
IMPORTANT: By default, the cache mode is set to `CacheMode.BYPASS` so each crawl fetches fresh content. Set it to `CacheMode.ENABLED` to enable caching.
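For example, here's a minimal sketch of a run config that reuses cached results when available:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Read from and write to the cache instead of always fetching fresh content
cached_run_conf = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
```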
We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.
1. `result.markdown`: The raw Markdown generated from the page.
2. `result.markdown.fit_markdown`: A filtered ("fit") version of that Markdown, produced when you configure a content filter (e.g., `PruningContentFilter`).

Here's an example of using a filter with `DefaultMarkdownGenerator`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```
Note: If you do not specify a content filter or markdown generator, you'll typically see only the raw Markdown. `PruningContentFilter` may add around 50 ms of processing time. We'll dive deeper into these strategies in a dedicated Markdown Generation tutorial.
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
New! Crawl4AI now provides a powerful utility to automatically generate extraction schemas using LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```
For a complete guide on schema generation and advanced usage, see No-LLM Extraction Strategies.
Here's a basic extraction example:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
Why is this helpful?

- The schema is defined once and reused, so repeated extractions are fast and LLM-free.
- It works well for pages with a consistent, repeated structure.
- The extracted JSON is returned as a string in `result.extracted_content`, ready to parse or store.
Tip: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:
- Open-source models (e.g., `ollama/llama3.3`, no token needed)
- Closed-source models (e.g., `openai/gpt-4`, requires `api_token`)

Below is an example that works with either style via the `provider` argument:
```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```
What's happening?

- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.
- The `LLMExtractionStrategy` uses that schema and the instruction to turn the crawled content into structured JSON.

Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
What's special about adaptive crawling? Rather than fetching a fixed set of pages, the crawler keeps going only until it judges it has gathered enough information to answer your query, and exposes a confidence score along the way.

Learn more about Adaptive Crawling →
If you need to crawl multiple URLs in parallel, you can use arun_many(). By default, Crawl4AI employs a MemoryAdaptiveDispatcher, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
```
The example above shows two ways to handle multiple URLs:

1. Streaming mode (`stream=True`): Process results as they become available, using `async for`.
2. Batch mode (`stream=False`): Wait for all results to complete.

For more advanced concurrency (e.g., a semaphore-based approach, adaptive memory usage throttling, or customized rate limiting), see Advanced Multi-URL Crawling. A sketch of the dispatcher approach follows below.
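For instance, here's a hedged sketch of passing a custom dispatcher to `arun_many()`. The `MemoryAdaptiveDispatcher` import path and parameters below are assumptions drawn from the multi-URL crawling guide, so verify the exact signatures there:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # assumed import path

async def crawl_with_dispatcher(urls):
    # Assumed API: back off when system memory usage passes 70%
    # and cap concurrent browser sessions at 10.
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        max_session_permit=10,
    )
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=run_conf, dispatcher=dispatcher)
        for res in results:
            print(res.url, "OK" if res.success else res.error_message)
```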
Some sites require multiple "page clicks" or dynamic JavaScript updates. Below is an example that clicks through a set of tabs and waits for each tab's content to load via JavaScript, using BrowserConfig and CrawlerRunConfig:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )

        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} courses")
        print(json.dumps(courses[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```
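The example above reveals content with JavaScript inside a single navigation. For multi-step flows such as clicking a "Next Page" button, you can keep one browser tab alive across calls. Here's a hedged sketch of that session-based pattern; the URL and CSS selector are illustrative placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def pagination_sketch():
    session_id = "pagination_demo"  # any stable string identifies the tab
    async with AsyncWebCrawler() as crawler:
        # Page 1: normal navigation, keeping the tab open via session_id
        result = await crawler.arun(
            url="https://example.com/listing",  # placeholder URL
            config=CrawlerRunConfig(session_id=session_id, cache_mode=CacheMode.BYPASS),
        )
        # Page 2: click "Next" inside the existing tab instead of re-navigating
        result = await crawler.arun(
            url="https://example.com/listing",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="document.querySelector('a.next')?.click();",  # placeholder selector
                js_only=True,  # run JS in the current page; do not reload
                cache_mode=CacheMode.BYPASS,
            ),
        )
        print(len(result.markdown.raw_markdown))
        # Clean up the tab when finished
        await crawler.crawler_strategy.kill_session(session_id)

if __name__ == "__main__":
    asyncio.run(pagination_sketch())
```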
Key Points:
- `BrowserConfig(headless=True, java_script_enabled=True)`: JavaScript must be enabled for the tab clicks to run; set `headless=False` if you want to watch the browser click through the tabs.
- `CrawlerRunConfig(...)`: We specify the extraction strategy and pass `js_code` that clicks each tab and waits for its content to load.
- For multi-page flows, pass a `session_id` to reuse the same page, use `js_code` and `wait_for` on subsequent requests to click "Next" and wait for new content, set `js_only=True` to continue the existing session without re-navigating, and call `kill_session()` to clean up the page and browser session, as sketched above.

Congratulations! You have:

- Performed a basic crawl and printed Markdown.
- Used a content filter with a markdown generator to produce "fit" Markdown.
- Tried out two extraction strategies (CSS-based and LLM-based).
- Crawled a dynamic page that loads content via JavaScript.
If you're ready for more, check out the guides referenced above, including No-LLM Extraction Strategies, Adaptive Crawling, and Advanced Multi-URL Crawling.
Crawl4AI is a powerful, flexible tool. Enjoy building out your scrapers, data pipelines, or AI-driven extraction flows. Happy crawling!