# `arun_many(...)` Reference

> **Note**: This function is very similar to `arun()` but focused on concurrent or batch crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature

```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> RunManyReturn:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with `url_matcher` patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. `MemoryAdaptiveDispatcher`).
    ...
    :return: A `RunManyReturn` containing either a list of `CrawlResult` objects or an async generator if streaming is enabled.
    """
```
## Differences from `arun()`

1. **Multiple URLs**: Instead of a single URL, you pass a list of URLs (or tasks). The call returns a `RunManyReturn`, which contains either a list of `CrawlResult` objects or an async generator if streaming is enabled.
2. **Concurrency & Dispatchers**: The `dispatcher` param allows advanced concurrency control. If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
3. **Streaming Support**: Enable streaming by setting `stream=True` in your `CrawlerRunConfig`, then use `async for` to process results as they become available.
4. **Parallel Execution**: `arun_many()` can run multiple requests concurrently under the hood. Each `CrawlResult` might also include a `dispatch_result` with concurrency details (like memory usage, start/end times).

## Basic Example (Batch Mode)

```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
## Streaming Example

```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```
## With a Custom Dispatcher

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
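Each result from a dispatched run may carry a `dispatch_result` with the concurrency details mentioned above. A rough sketch of inspecting it; the attribute names (`memory_usage`, `start_time`, `end_time`) are assumptions based on the "memory usage, start/end times" description, so `getattr` is used defensively:

```python
for result in results:
    if not result.success:
        continue
    dr = result.dispatch_result  # may be None if no dispatcher stats were attached
    if dr is not None:
        # Field names below are assumptions; adjust to your installed version.
        print(result.url)
        print("  memory:", getattr(dr, "memory_usage", "n/a"))
        print("  started:", getattr(dr, "start_time", "n/a"))
        print("  finished:", getattr(dr, "end_time", "n/a"))
```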
## URL-Specific Configurations

Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Custom settings for JSON extraction
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher means it only applies as the fallback

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",                                                 # → blog_config
        "https://github.com/microsoft/playwright",                                  # → github_config
        "https://httpbin.org/json",                                                 # → api_config
        "https://example.com/"                                                      # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
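Matchers can also be combined within a single config. A minimal sketch, assuming `CrawlerRunConfig` accepts a `match_mode` parameter alongside `url_matcher` (the `MatchMode` import above suggests this pairing; the parameter name and the docs.python.org example are assumptions):

```python
# Hypothetical: treat a URL as a Python docs page only if BOTH matchers agree.
python_docs_config = CrawlerRunConfig(
    url_matcher=[
        "*docs.python.org*",                # glob pattern
        lambda url: url.endswith(".html"),  # custom function
    ],
    match_mode=MatchMode.AND,  # assumed parameter name; MatchMode.OR matches on either
)
```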
**URL Matching Features:**

- Glob-style patterns: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- Custom functions: `lambda url: 'api' in url`
- Mixed lists of patterns and functions, combined with `MatchMode.OR` or `MatchMode.AND` (as sketched above)
**Key Points:**

- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- When passing a list of configs, include a default config (one without a `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.

## Return Value

Returns a `RunManyReturn` object, which contains either a list of `CrawlResult` objects or an async generator if streaming is enabled. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
## Dispatcher Reference

- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit; simpler but less adaptive.

For advanced usage or custom settings, see the Multi-URL Crawling with Dispatchers docs.
## Common Pitfalls

1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons; always check `result.success` or the `error_message` before proceeding (a simple retry sketch follows below).
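For example, a minimal retry sketch (the `urls` and `run_config` names are placeholders; only `arun_many`, `result.success`, `result.url`, and `result.error_message` come from the API described above):

```python
results = await crawler.arun_many(urls=urls, config=run_config)

# Collect failures and report why they failed.
failed_urls = [r.url for r in results if not r.success]
for r in results:
    if not r.success:
        print("Failed:", r.url, "-", r.error_message)

# One simple retry pass over the failures (hypothetical policy; tune as needed).
if failed_urls:
    retry_results = await crawler.arun_many(urls=failed_urls, config=run_config)
```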
## Conclusion

Use `arun_many()` when you want to crawl multiple URLs simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate limiting), provide a dispatcher. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the Advanced Multi-URL Crawling docs.