# digest()

The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.
## Method Signature

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `start_url` | `str` | required | URL where the crawl begins |
| `query` | `str` | required | Query that guides link selection and stopping |
| `resume_from` | `Optional[Union[str, Path]]` | `None` | Path to a previously saved crawl state to resume from |

## Return Value

Returns a `CrawlState` object containing:
- **crawled_urls** (`Set[str]`): All URLs that have been crawled
- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
- **pending_links** (`List[Link]`): Links discovered but not yet crawled
- **metrics** (`Dict[str, float]`): Performance and quality metrics
- **query** (`str`): The original query

## How It Works

The `digest()` method implements an intelligent crawling algorithm: it starts from `start_url`, scores each discovered link for relevance to the query, follows only the most promising ones (`top_k_links` per page), and stops automatically once enough information has been gathered (see the stopping conditions at the end of this page).
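The scoring and stopping logic live inside `AdaptiveCrawler`. To make the shape of the loop concrete, here is a self-contained toy that mirrors the control flow (illustrative only: the fake `SITE`, `coverage()`, and `adaptive_crawl()` names are invented for this sketch, not taken from the library):

```python
# Toy illustration of the adaptive crawl loop -- mirrors the control flow only,
# not the library's real scoring. Every name here is invented for this sketch.
SITE = {  # fake site: url -> (page text, outgoing links)
    "/": ("intro to async and await", ["/ctx", "/io"]),
    "/ctx": ("async context managers explained", ["/io"]),
    "/io": ("file io basics", []),
}

def coverage(texts: list[str], query: str) -> float:
    """Fraction of query terms found anywhere in the gathered text."""
    blob = " ".join(texts)
    terms = query.split()
    return sum(term in blob for term in terms) / len(terms)

def adaptive_crawl(start: str, query: str, threshold=0.9, max_pages=10, top_k=2):
    crawled, texts, frontier = set(), [], [start]
    while frontier and len(crawled) < max_pages:
        url = frontier.pop(0)
        if url in crawled:
            continue
        text, links = SITE[url]          # "fetch" one page
        crawled.add(url)
        texts.append(text)
        if coverage(texts, query) >= threshold:
            break                        # confident enough -- stop early
        ranked = sorted(links, key=lambda u: coverage([SITE[u][0]], query), reverse=True)
        frontier.extend(ranked[:top_k])  # follow only the most promising links
    return crawled, coverage(texts, query)

print(adaptive_crawl("/", "async context managers"))
# -> ({'/', '/ctx'}, 1.0): stops after two pages instead of crawling everything
```

The library's actual scoring is richer than plain term coverage, but the stop-as-soon-as-confident shape is the same.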
## Usage Examples

### Basic Usage

```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)

    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )

    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")
```
### With Custom Configuration

```python
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,              # Allow more pages
    top_k_links=3              # Follow top 3 links per page
)

adaptive = AdaptiveCrawler(crawler, config=config)
state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
### Resuming a Previous Crawl

```python
# First crawl - may be interrupted
state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
)

# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")

# Later, resume from saved state
state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
)
```
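Since `resume_from` accepts either a string or a `Path`, a convenient pattern is to resume only when a previous run actually left a state file behind (a minimal sketch reusing the file name from the example above):

```python
from pathlib import Path

state_file = Path("ml_crawl_state.json")

state = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    # Resume if an earlier run saved state; otherwise start fresh.
    resume_from=state_file if state_file.exists() else None,
)
```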
### Monitoring Progress

```python
state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")

# View detailed statistics
adaptive.print_stats(detailed=True)
```
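The aggregate numbers above summarize the crawl; the pages themselves sit in `state.knowledge_base`. A short sketch of inspecting them, assuming each `CrawlResult` exposes the `url` and `markdown` attributes it carries elsewhere in Crawl4AI:

```python
# Peek at what the crawl collected. The .url and .markdown fields are
# assumed from Crawl4AI's standard CrawlResult object.
for result in state.knowledge_base:
    preview = str(result.markdown or "")[:80]
    print(f"{result.url}: {preview!r}")
```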
## Query Guidelines

1. **Be Specific**: Use descriptive terms that appear in the target content.

    ```python
    # Good
    query = "python async context managers implementation"

    # Too broad
    query = "python programming"
    ```

2. **Include Key Terms**: Add technical terms you expect to find.

    ```python
    query = "oauth2 jwt refresh tokens authorization"
    ```

3. **Multiple Concepts**: Combine related concepts for comprehensive coverage.

    ```python
    query = "rest api pagination sorting filtering"
    ```
## Error Handling

```python
try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config
```
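When auto-save is enabled, a failed run can be retried from where it stopped. A sketch assuming state was saved to a known file; treat `"crawl_state.json"` as a placeholder path, since the save location depends on your configuration:

```python
# Retry once, resuming from previously saved state on the second attempt.
# "crawl_state.json" is a placeholder path, not a library default.
state = None
for attempt in range(2):
    try:
        state = await adaptive.digest(
            start_url="https://example.com",
            query="search terms",
            resume_from="crawl_state.json" if attempt > 0 else None,
        )
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
```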
## Stopping Conditions

The crawl stops when any of these conditions are met:

- The confidence score reaches the configured `confidence_threshold`
- The number of crawled pages reaches `max_pages`
- No relevant links remain to follow
- New pages stop contributing new terms (diminishing returns)