docs/md_v2/core/adaptive-crawling.md
Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. Adaptive Crawling changes this paradigm by introducing intelligence into the crawling process.
Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
When crawling websites for specific information, you face two challenges:

- Crawling too little and missing important information
- Crawling too much and wasting time and resources

Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
The AdaptiveCrawler uses three metrics to measure information sufficiency:

- Coverage: how well the collected pages cover the terms and concepts in your query
- Consistency: whether the gathered information is coherent across pages
- Saturation: whether new pages are still adding new information, or returns are diminishing

When these metrics indicate sufficient information has been gathered, crawling stops automatically.
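To make the stopping condition concrete, here is a minimal sketch of how such metrics could feed a stop decision. The equal weighting and the `should_stop` helper are illustrative assumptions, not the library's internal formula:

```python
# Illustrative sketch only: combining three sufficiency metrics into a
# confidence score and a stop decision. The real AdaptiveCrawler computes
# its metrics internally; the equal weighting here is an assumption.
def should_stop(coverage: float, consistency: float, saturation: float,
                confidence_threshold: float = 0.7) -> bool:
    # Assume each metric is already normalized to the range [0, 1].
    confidence = (coverage + consistency + saturation) / 3
    return confidence >= confidence_threshold

print(should_stop(coverage=0.9, consistency=0.8, saturation=0.7))  # True
```

In practice the crawler performs this check for you; the quick-start example below shows the actual API.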
```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler (config is optional)
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")


if __name__ == "__main__":
    asyncio.run(main())
```
Crawling behaviour can be tuned through an `AdaptiveConfig`:

```python
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.8,   # Stop when 80% confident (default: 0.7)
    max_pages=30,               # Maximum pages to crawl (default: 20)
    top_k_links=5,              # Links to follow per page (default: 3)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config)
```
Adaptive Crawling supports two distinct strategies for determining information sufficiency:
The statistical strategy uses pure information theory and term-based analysis:
```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```
The embedding strategy uses semantic embeddings for deeper understanding:
```python
# Configure embedding strategy with local embeddings
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,                  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)
```
```python
# With separate LLM configs for embeddings and query expansion (recommended)
from crawl4ai import LLMConfig

config = AdaptiveConfig(
    strategy="embedding",
    # Embedding model: used for text-to-vector calls
    embedding_llm_config=LLMConfig(
        provider='openai/text-embedding-3-small',
        api_token='your-api-key'
    ),
    # Query model: used for chat completion (query expansion)
    query_llm_config=LLMConfig(
        provider='openai/gpt-4o-mini',
        api_token='your-api-key'
    )
)
```
```python
# Alternative: dictionary format (backward compatible)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    },
    query_llm_config={
        'provider': 'openai/gpt-4o-mini',
        'api_token': 'your-api-key'
    }
)
```
Note: The embedding strategy makes two types of API calls that need different model types:

- Embedding calls (text → vector) require an embedding model such as `text-embedding-3-small`.
- Query expansion (chat completion) requires a chat model such as `gpt-4o-mini`.

Use `embedding_llm_config` for the embedding model and `query_llm_config` for the chat model. If `query_llm_config` is not set, it falls back to `embedding_llm_config` for backward compatibility.
| Feature | Statistical | Embedding |
|---|---|---|
| Speed | Very fast | Moderate (API calls) |
| Cost | Free | Depends on provider |
| Accuracy | Good for exact terms | Excellent for concepts |
| Dependencies | None | Embedding model/API |
| Query Understanding | Literal | Semantic |
| Best Use Case | Technical docs, specific terms | Research, broad topics |
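If you want to encode that guidance in code, a small helper can pick a configuration based on the kind of query. This helper is purely illustrative and not part of the library; the decision criterion is an assumption:

```python
from crawl4ai import AdaptiveConfig

# Illustrative helper (not part of Crawl4AI): choose a strategy following
# the comparison table above.
def make_config(query_is_conceptual: bool) -> AdaptiveConfig:
    if query_is_conceptual:
        # Broad, research-style queries benefit from semantic matching.
        return AdaptiveConfig(strategy="embedding")
    # Exact terms and technical docs are handled well by the default.
    return AdaptiveConfig(strategy="statistical")

config = make_config(query_is_conceptual=True)
```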
The embedding strategy offers fine-grained control through several parameters:
```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings (embedding model)
    query_llm_config=None,      # Use for query expansion (chat completion model)

    # Query expansion
    n_query_variations=10,      # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,            # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,      # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,        # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,    # Min displayed confidence
    embedding_quality_max_confidence=0.95    # Max displayed confidence
)
```
The embedding strategy can detect when a query is completely unrelated to the content:
```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```
The confidence score (between 0 and 1) indicates how sufficient the gathered information is. You can inspect it, along with the rest of the crawl statistics, at any time:

```python
adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics
```
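For programmatic decisions, the same score can be read directly instead of printed. The sketch below assumes the crawler exposes it as an `adaptive.confidence` property; if your version surfaces it elsewhere (for example in the digest result's metrics), adapt accordingly:

```python
# Sketch: acting on the confidence score programmatically.
# Assumes `adaptive.confidence` holds the current 0-1 sufficiency score.
if adaptive.confidence >= 0.8:
    print("Enough information gathered; proceed to analysis.")
else:
    print("Coverage still thin; consider raising max_pages and re-running.")
```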
The summary gives a quick view of how many pages were crawled and how confident the crawler is that the gathered content answers the query.
Crawl progress can be saved automatically and resumed in a later session:

```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)

# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```
The pages collected during a crawl form a knowledge base that can be exported and imported into another session:

```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
await new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
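Because the export is plain JSONL (one JSON object per line), it can also be consumed outside Crawl4AI. The field names in this sketch (`url`, `content`) are assumptions; inspect one record to confirm the actual keys:

```python
import json

# Read the exported knowledge base line by line.
with open("knowledge_base.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Field names below are assumptions; adjust to the real schema.
for record in records[:3]:
    print(record.get("url", "<no url>"), "-", len(record.get("content", "")), "chars")
```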
A couple of practical tips: set an appropriate `max_pages` limit, and adjust `top_k_links` based on the site's structure.

Example: researching a programming concept and pulling the most relevant excerpts:

```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```
```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```
```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
Q: How is this different from traditional crawling?
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop, based on information gain.

Q: Can I use this with JavaScript-heavy sites?
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.

Q: How does it handle large websites?
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.

Q: Can I customize the scoring algorithms?
A: Advanced users can implement custom strategies. See Adaptive Strategies.