PROGRESSIVE_CRAWLING.md
This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets.
Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: "When do we have enough information to answer a query and similar queries in its domain?"
We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow.
Let:
- K = the knowledge base of documents crawled so far
- Q = the set of query terms
- θ = the sufficiency threshold at which crawling may stop
We define Information Sufficiency as:
IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q)
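As a quick worked example, the combination rule can be computed directly; the four component scores are assumed to be precomputed values in [0, 1]:

```python
def information_sufficiency(coverage, consistency, redundancy, domain_coverage):
    """IS(K, Q) per the definition above: the weakest of the three
    core signals, scaled by domain coverage."""
    return min(coverage, consistency, 1.0 - redundancy) * domain_coverage
```

Taking the minimum means a single weak signal (e.g. high redundancy) caps the overall score, so the crawler cannot stop early on the strength of coverage alone.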
Coverage measures how well current knowledge covers query terms and related concepts:
Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q|
Where:
- df(t, K) = document frequency of term t within the knowledge base K
- idf(t) = inverse document frequency weight for term t
- |Q| = number of query terms
Consistency measures information coherence across documents:
Consistency(K, Q) = 1 - Var(answers from random subsets of K)
This captures the principle that sufficient knowledge should provide stable answers regardless of document subset.
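The subset-variance principle can be sketched as follows; `answer_fn` is a hypothetical scoring function (not defined in this document) that maps a document subset to a numeric answer score in [0, 1]:

```python
import random
import statistics

def consistency(documents, answer_fn, n_subsets=10, subset_frac=0.7, seed=0):
    """Estimate Consistency(K, Q) as 1 - variance of answers obtained
    from random document subsets. Assumes answer_fn returns scores in
    [0, 1], so the variance term stays within range."""
    rng = random.Random(seed)
    k = max(1, int(len(documents) * subset_frac))
    answers = [answer_fn(rng.sample(documents, k)) for _ in range(n_subsets)]
    return 1.0 - statistics.pvariance(answers)
```

If every subset yields the same answer, the variance is zero and consistency is 1; unstable answers across subsets drive the score down.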
Saturation detects diminishing returns:
Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁))
Where ΔInfo represents marginal information gain from the nth crawl.
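Given a running list of per-crawl marginal gains, the saturation formula is a one-liner; this sketch assumes gains are recorded in crawl order:

```python
def saturation(info_gains):
    """Saturation(K) = 1 - (latest marginal gain / first marginal gain),
    where info_gains is the list of ΔInfo values per crawl step.
    Returns 0.0 until at least two steps are available."""
    if len(info_gains) < 2 or info_gains[0] == 0:
        return 0.0
    return 1.0 - info_gains[-1] / info_gains[0]
```

A value near 1 means recent crawls add almost nothing relative to the first, signalling diminishing returns.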
Expected information gain from uncrawled links:
ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l)
Components:
- Relevance(l, Q): textual relevance of the link's preview text to the query
- Novelty(l, K): how much new information the link promises relative to K
- Authority(l): a prior on source quality derived from the URL
Algorithm: ProgressiveCrawl(start_url, query, θ)

```text
K ← ∅
crawled ← {start_url}
pending ← extract_links(crawl(start_url))

while IS(K, Q) < θ and |crawled| < max_pages:
    candidates ← rank_by_expected_gain(pending, Q, K)
    if max(ExpectedGain(candidates)) < min_gain:
        break  // diminishing returns
    to_crawl ← top_k(candidates)
    new_docs ← parallel_crawl(to_crawl)
    K ← K ∪ new_docs
    crawled ← crawled ∪ to_crawl
    pending ← extract_new_links(new_docs) - crawled

return K
```
Crawling terminates when any of the following holds:
- IS(K, Q) ≥ θ (sufficiency reached),
- |crawled| ≥ max_pages (page budget exhausted), or
- max(ExpectedGain(candidates)) < min_gain (diminishing returns).
```text
AbstractStrategy
├── StatisticalStrategy (no LLM, no embeddings)
├── EmbeddingStrategy (with semantic similarity)
└── LLMStrategy (with language model assistance)
```
Pure statistical approach using:
- BM25-style term scoring for coverage and link relevance
- document-frequency statistics over the crawled corpus
- URL-based heuristics for authority
Advantages: fast, no API costs, works offline.
Best for: technical documentation, queries with specific terminology.
Semantic understanding through embeddings:
Mathematical Framework:
Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded)
Gap(q) = 1 - max_similarity(q, K)
LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K))
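A minimal sketch of these three quantities, assuming unit-normalized embedding vectors (so cosine similarity reduces to a dot product); the embedding model itself is out of scope here:

```python
import numpy as np

def coverage(query_vecs, doc_vecs):
    """Mean over expanded query terms of each term's best match in K."""
    sims = query_vecs @ doc_vecs.T  # cosine similarities (unit vectors)
    return float(sims.max(axis=1).mean())

def gaps(query_vecs, doc_vecs):
    """Gap(q) = 1 - best similarity of q against the knowledge base."""
    sims = query_vecs @ doc_vecs.T
    return 1.0 - sims.max(axis=1)

def link_score(link_vec, query_vecs, doc_vecs):
    """Sum of Gap(q) × relevance(l, q), discounted by the link's
    redundancy with documents already in K."""
    g = gaps(query_vecs, doc_vecs)
    relevance = query_vecs @ link_vec
    redundancy = float((doc_vecs @ link_vec).max())
    return float((g * relevance).sum() * (1.0 - redundancy))
```

Links that are similar to poorly covered query aspects but dissimilar to existing documents score highest, which is exactly the gap-filling behavior the formulas describe.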
Key Parameters:
- embedding_k_exp: exponential decay factor for the distance-to-score mapping
- embedding_coverage_radius: distance threshold for query coverage
- embedding_min_confidence_threshold: minimum relevance threshold

Advantages: semantic understanding, handles ambiguity, detects irrelevance.
Best for: research queries, conceptual topics, diverse content.
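The distance-to-score mapping itself is not spelled out above; one plausible reading of embedding_k_exp is as the decay rate in an exponential kernel (a sketch, not the reference implementation):

```python
import math

def distance_to_score(distance, k_exp=2.0):
    """Map an embedding distance to a [0, 1] relevance score with
    exponential decay; larger k_exp makes scores fall off faster.
    One plausible interpretation of embedding_k_exp."""
    return math.exp(-k_exp * distance)
```

Under this mapping a distance of zero scores 1.0, and embedding_coverage_radius would mark the distance beyond which a query term no longer counts as covered.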
Using an LLM to create evaluation data:

```python
def generate_synthetic_dataset(domain_url):
    # 1. Fully crawl the domain to obtain ground-truth knowledge
    full_knowledge = exhaustive_crawl(domain_url)

    # 2. Generate queries answerable from that knowledge
    queries = llm_generate_queries(full_knowledge)

    # 3. Create variations of each query (synonyms, sub-/super-queries)
    variations = {q: generate_variations(q) for q in queries}

    return queries, variations, full_knowledge
```
Theorem: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages.
Proof sketch: IS(K, Q) is bounded above by 1 and, by construction of the metrics, non-decreasing as documents are added; since the site is finite, the loop either reaches IS(K, Q) ≥ θ or empties the pending set after visiting every reachable page.
Under certain assumptions about link preview accuracy:
Knowledge base serialization for:
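The serialization targets are not enumerated above; a minimal sketch of one plausible JSON layout (the field names and schema here are illustrative, not the reference format):

```python
import json

def save_knowledge_base(docs, path):
    """Persist crawled documents so a crawl can be resumed or audited.
    `docs` is a list of {"url": ..., "text": ...} dicts; the schema
    is an assumption for illustration."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"documents": docs}, f, ensure_ascii=False, indent=2)

def load_knowledge_base(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)["documents"]
```

A round-trip through `save_knowledge_base`/`load_knowledge_base` lets a stopped crawl be resumed with its IS metrics recomputed from disk.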
Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675.
Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems.
```python
from math import log
from statistics import mean

class StatisticalStrategy:
    def calculate_confidence(self, state):
        coverage = self.calculate_coverage(state)
        consistency = self.calculate_consistency(state)
        saturation = self.calculate_saturation(state)
        return min(coverage, consistency, saturation)

    def calculate_coverage(self, state):
        # BM25-style term coverage, normalized to [0, 1] by the
        # maximum achievable score (self.max_possible_score)
        term_scores = []
        for term in state.query.split():
            df = state.document_frequencies.get(term, 0)
            idf = self.idf_cache.get(term, 1.0)
            term_scores.append(log(df + 1) * idf)
        return mean(term_scores) / self.max_possible_score

    def rank_links(self, state):
        scored_links = []
        for link in state.pending_links:
            relevance = self.bm25_score(link.preview_text, state.query)
            novelty = self.calculate_novelty(link, state.knowledge_base)
            authority = self.url_authority(link.href)
            score = relevance * novelty * authority
            scored_links.append((link, score))
        return sorted(scored_links, key=lambda x: x[1], reverse=True)
```
The experimental protocol comprises four stages: dataset creation, baseline comparisons, metrics collection, and statistical analysis.