multimodal/benchmark/content-extraction/README.md
This benchmark evaluates different content extraction strategies for web pages, focusing on performance, content quality, and token efficiency. It's designed to address out-of-memory issues with large web pages and improve the quality of extracted content for LLM processing.
Benchmark Results
┌─────────────────┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Strategy        │ Avg Time (ms) │ Min Time (ms) │ Max Time (ms) │ Std Dev (ms)  │
├─────────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ RawContent      │ 492.42        │ 247.29        │ 758.27        │ 212.19        │
│ CurrentMarkdown │ 494.47        │ 287.30        │ 707.46        │ 206.10        │
│ Readability     │ 622.49        │ 271.61        │ 1103.75       │ 334.54        │
│ Optimized       │ 588.73        │ 271.24        │ 1084.25       │ 318.71        │
└─────────────────┴───────────────┴───────────────┴───────────────┴───────────────┘
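The timing columns can be derived from raw per-run measurements. The following is a minimal sketch, assuming a plain array of millisecond timings. `summarizeTimings` and the sample values are hypothetical, and the population form of the standard deviation is an assumption, since this README does not say which form the benchmark uses.

```typescript
// Summary statistics for a list of per-run timings, in milliseconds.
interface TimingStats {
  avg: number;
  min: number;
  max: number;
  stdDev: number;
}

function summarizeTimings(samplesMs: number[]): TimingStats {
  if (samplesMs.length === 0) {
    throw new Error("at least one timing sample is required");
  }
  const avg = samplesMs.reduce((sum, t) => sum + t, 0) / samplesMs.length;
  // Population variance (divide by n). Assumption: the real benchmark
  // may divide by n - 1 (sample variance) instead.
  const variance =
    samplesMs.reduce((sum, t) => sum + (t - avg) * (t - avg), 0) /
    samplesMs.length;
  return {
    avg,
    min: Math.min(...samplesMs),
    max: Math.max(...samplesMs),
    stdDev: Math.sqrt(variance),
  };
}

// Hypothetical timings, not taken from the benchmark run above.
const stats = summarizeTimings([200, 400, 600]);
console.log(stats);
```

A large standard deviation relative to the average, as in the Readability and Optimized rows, means individual runs differ widely, so the average alone is a weak basis for comparing strategies.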
┌─────────────────┬───────────────┬───────────────┬───────────────┐
│ Strategy        │ Original Len  │ Extracted Len │ Token Count   │
├─────────────────┼───────────────┼───────────────┼───────────────┤
│ RawContent      │ 829,024       │ 486,823       │ 153,825       │
│ CurrentMarkdown │ 829,024       │ 127,360       │ 29,623        │
│ Readability     │ 829,024       │ 136,209       │ 35,457        │
│ Optimized       │ 829,024       │ 284,810       │ 79,581        │
└─────────────────┴───────────────┴───────────────┴───────────────┘
Compression Ratios (length relative to the original page; tokens relative to RawContent)
┌─────────────────┬───────────────┬───────────────┐
│ Strategy        │ Length Ratio  │ Token Ratio   │
├─────────────────┼───────────────┼───────────────┤
│ RawContent      │ 58.72%        │ 100.00%       │
│ CurrentMarkdown │ 15.36%        │ 19.26%        │
│ Readability     │ 16.43%        │ 23.05%        │
│ Optimized       │ 34.35%        │ 51.73%        │
└─────────────────┴───────────────┴───────────────┘
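The ratios can be reproduced from the length and token figures reported above. In this short sketch, the length ratio divides each extracted length by the original length (829,024), and the token ratio divides each token count by the RawContent token count (153,825). The type and function names are illustrative and are not taken from the benchmark code.

```typescript
interface StrategyResult {
  name: string;
  extractedLen: number;
  tokenCount: number;
}

// Figures copied from the results table.
const originalLen = 829024;
const rawContentTokens = 153825; // RawContent token count, the 100% baseline

const results: StrategyResult[] = [
  { name: "RawContent", extractedLen: 486823, tokenCount: 153825 },
  { name: "CurrentMarkdown", extractedLen: 127360, tokenCount: 29623 },
  { name: "Readability", extractedLen: 136209, tokenCount: 35457 },
  { name: "Optimized", extractedLen: 284810, tokenCount: 79581 },
];

// Format part/whole as a percentage with two decimals.
function toPercent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(2) + "%";
}

const ratios = results.map((r) => ({
  name: r.name,
  lengthRatio: toPercent(r.extractedLen, originalLen),
  tokenRatio: toPercent(r.tokenCount, rawContentTokens),
}));

console.table(ratios);
```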
Strategy Descriptions
RawContent: Extracts raw page content without any processing, serving as the baseline for comparison.
CurrentMarkdown: The current browser_get_markdown implementation, which extracts page content and converts it to Markdown.
Readability: Uses Mozilla's Readability library to extract main content while removing navigation, ads, and non-essential elements.
Optimized: Universal content extraction that uses content density analysis and multi-stage fallbacks to identify and extract the most valuable content while preserving semantic structure and optimizing for token efficiency.
When processing web content for LLMs, efficient content extraction is crucial for several reasons:
Memory constraints: Raw HTML content from modern websites can be extremely large, causing out-of-memory issues in extraction pipelines.
Token efficiency: LLMs have token limits and token processing costs. Extracting only relevant content reduces token usage and improves cost efficiency.
Content quality: Better extraction techniques preserve semantic structure (headings, lists, code blocks) while removing noise (ads, navigation, etc.), improving the quality of LLM inputs.
This benchmark compares multiple extraction strategies to find the optimal balance between these factors.
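To make the token-efficiency point concrete, here is a rough sketch of a token budget check. It assumes about 4 characters per token, a common rule of thumb rather than the tokenizer used in this benchmark, and all function names are hypothetical. Use a real tokenizer when exact counts matter.

```typescript
// Rough heuristic: about 4 characters per token. This is an assumption;
// a real tokenizer gives exact counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function fitsBudget(text: string, maxTokens: number): boolean {
  return estimateTokens(text) <= maxTokens;
}

// Trim text to the budget, cutting at the last paragraph break inside
// the budget when there is one, otherwise at the character limit.
function truncateToBudget(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  const lastBreak = cut.lastIndexOf("\n\n");
  return lastBreak > 0 ? cut.slice(0, lastBreak) : cut;
}

console.log(estimateTokens("abcdefgh"));
console.log(truncateToBudget("aaaa\n\nbbbbbbbb", 2));
```

Truncating after extraction is a last resort. Removing noise during extraction keeps more of the relevant content within the same budget.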
The benchmark evaluates four different content extraction strategies:
RawContent: Extracts the raw HTML content without processing (baseline for comparison).
CurrentMarkdown: Simulates the current browser_get_markdown implementation, which extracts content and converts it to Markdown.
Readability: Uses Mozilla's Readability library to extract the main content while removing navigation, ads, and other non-essential elements.
Optimized: An advanced implementation using content density analysis, semantic structure preservation, and multi-stage fallback mechanisms.
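The ideas behind the Optimized strategy, content density analysis plus a fallback chain, can be illustrated with a small sketch. This is not the benchmark's implementation. The regex-based block matching, the 0.5 density threshold, and every name below are assumptions chosen for brevity, and a real extractor would work on a parsed DOM.

```typescript
// Remove tags and collapse whitespace to get the visible text.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Text density: the share of a block's characters that are visible text
// rather than markup. Link-heavy blocks score low.
function textDensity(blockHtml: string): number {
  if (blockHtml.length === 0) return 0;
  return stripTags(blockHtml).length / blockHtml.length;
}

// An extraction stage returns text, or null to defer to the next stage.
type Extractor = (html: string) => string | null;

// Stage 1: keep paragraph-like blocks whose density passes the threshold.
const densityExtractor: Extractor = (html) => {
  const blocks: string[] =
    html.match(/<(p|li|h[1-6]|pre)\b[^>]*>[\s\S]*?<\/\1>/gi) ?? [];
  const kept = blocks.filter((b) => textDensity(b) >= 0.5).map(stripTags);
  return kept.length > 0 ? kept.join("\n\n") : null;
};

// Stage 2: fall back to all visible text on the page.
const plainTextExtractor: Extractor = (html) => {
  const text = stripTags(html);
  return text.length > 0 ? text : null;
};

// Run the stages in order and return the first non-null result.
function extractWithFallback(html: string, stages: Extractor[]): string {
  for (const stage of stages) {
    const result = stage(html);
    if (result !== null) return result;
  }
  return "";
}

const stages = [densityExtractor, plainTextExtractor];
const page =
  '<nav><a href="/a">A</a><a href="/b">B</a></nav>' +
  "<p>This is the main article text with enough words.</p>" +
  '<p><a href="https://example.com/a/very/long/link/target">x</a></p>';

console.log(extractWithFallback(page, stages));
console.log(extractWithFallback("<div>Only a div here</div>", stages));
```

The first call keeps only the dense paragraph and drops the navigation and the link-only paragraph. The second finds no matching blocks, so the plain-text stage handles it.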
Run the benchmark with default URLs:
npm run benchmark
Run with a custom URL:
npm run benchmark https://example.com
Save results to disk (the extra -- stops npm from consuming the flag itself):
npm run benchmark -- --save
The optimal strategy balances performance, content quality, and token efficiency according to your specific requirements.