JOURNAL.md
This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
## Feature: Configurable content source for markdown generation
### Changes Made:
- Added a `content_source: str = "cleaned_html"` parameter to the `MarkdownGenerationStrategy` class
- Updated `DefaultMarkdownGenerator` to accept and pass the content source parameter
- Renamed the `cleaned_html` parameter to `input_html` in the `generate_markdown` method
- Updated `AsyncWebCrawler.aprocess_html` to select the appropriate HTML source based on the generator's config
- Added the `preprocess_html_for_schema` import in `async_webcrawler.py`

### Implementation Details:
- The `content_source` parameter specifies which HTML input to use for markdown generation
- `aprocess_html` selects the appropriate HTML source based on that setting

### Files Modified:
- `crawl4ai/markdown_generation_strategy.py`: Added the `content_source` parameter and updated the method signature
- `crawl4ai/async_webcrawler.py`: Added HTML source selection logic and updated imports

### Examples:
- Created `docs/examples/content_source_example.py` demonstrating how to use the new parameter (a minimal usage sketch follows)
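As a rough illustration of the new parameter (not the example file's contents; the URL and the `"raw_html"` value are assumptions):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Ask the generator to build markdown from the raw page HTML instead of
    # the default cleaned HTML ("raw_html" is assumed to be an accepted value).
    md_generator = DefaultMarkdownGenerator(content_source="raw_html")
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())
```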
### Challenges:

### Why This Feature:
The content source selection feature allows users to choose which HTML content to use as input for markdown generation. This provides greater flexibility in how markdown is produced, letting users pick the HTML source that best fits their use case.
## Feature: Comprehensive stress testing framework

A stress testing framework built on `arun_many` and the dispatcher system to evaluate performance and concurrency handling, and to identify potential issues under high-volume crawling scenarios.
### Changes Made:
- Created a stress testing suite under a `benchmarking/` (or similar) directory.
- Implemented a local site generator (`SiteGenerator`) with configurable heavy HTML pages.
- Added a lightweight memory tracker (`SimpleMemoryTracker`) using platform-specific commands (avoiding a psutil dependency for this specific test).
- Integrated `CrawlerMonitor` from crawl4ai for a rich terminal UI and real-time monitoring of test progress and dispatcher activity.
- Added `run_benchmark.py` to orchestrate tests with predefined configurations.
- Added `run_all.sh` as a simple wrapper for `run_benchmark.py`.

### Implementation Details:
- Serves the generated pages with Python's `http.server`, minimizing network variance.
- Processes URLs with crawl4ai's `arun_many` method (see the sketch after this list).
- Uses a `MemoryAdaptiveDispatcher` to manage concurrency via the `max_sessions` parameter (note: the memory adaptation features require psutil, which is not used by `SimpleMemoryTracker`).
- Tracks memory with `SimpleMemoryTracker`, recording samples throughout test execution to a CSV file.
- Uses `CrawlerMonitor` (which uses the rich library) for clear terminal visualization and progress reporting directly from the dispatcher.
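A condensed, illustrative sketch of how these pieces combine; the URL list, session cap, and exact import paths are assumptions and may differ from `stress_test_sdk.py` and your crawl4ai version:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerMonitor, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def run_stress_test(urls, max_sessions=16):
    # The dispatcher caps concurrency; the monitor renders live progress with rich.
    dispatcher = MemoryAdaptiveDispatcher(
        max_session_permit=max_sessions,
        monitor=CrawlerMonitor(),
    )
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config, dispatcher=dispatcher)
        ok = sum(1 for r in results if r.success)
        print(f"{ok}/{len(urls)} pages crawled successfully")

if __name__ == "__main__":
    urls = [f"http://localhost:8000/page_{i}.html" for i in range(100)]
    asyncio.run(run_stress_test(urls))
```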
### Files Created/Updated:
- `stress_test_sdk.py`: Main stress testing implementation using `arun_many`.
- `benchmark_report.py`: (Assumed) Report generator for comparing test results.
- `run_benchmark.py`: Test runner script with predefined configurations.
- `run_all.sh`: Simple bash script wrapper for `run_benchmark.py`.
- `USAGE.md`: Comprehensive documentation on usage and interpretation (updated).

### Testing Approach:
- URLs are processed through `arun_many`, allowing the dispatcher to manage concurrency up to `max_sessions`.
- Standard test scenarios are run via the `run_benchmark.py` configurations.

### Challenges:
- Avoiding a new dependency (psutil) for this specific test script.
- Getting `run_benchmark.py` to correctly pass arguments to `stress_test_sdk.py`.

### Why This Feature:
The high-volume stress testing solution addresses critical needs for ensuring Crawl4AI's `arun_many` reliability:
- Validates dispatcher concurrency control (`max_session_permit`) and queue management.
- Measures throughput (URLs/sec) under different `max_sessions` settings.
- Helps identify potential issues in high-volume `arun_many` behavior.
- Builds confidence in the reliability of `arun_many`.

### Design Decisions:
- Used the built-in `CrawlerMonitor` for real-time feedback, leveraging its rich integration.
- Kept chunk-based processing in `stress_test_sdk.py` (when not streaming) to provide chunk-level summaries alongside the continuous monitor.
- Used `arun_many` with a `MemoryAdaptiveDispatcher` as the core mechanism for parallel execution, reflecting the intended SDK usage.
- Added `run_benchmark.py` to simplify running standard test configurations.
- Implemented `SimpleMemoryTracker` to provide basic memory insights without requiring psutil for this particular test runner.

### Future Enhancements to Consider:
- Add a variant that uses psutil to specifically stress the memory-adaptive features of the dispatcher.
- Extend `benchmark_report.py` to provide more sophisticated analysis of performance and memory trends from the generated JSON/CSV files.

### Changes Made:
- Updated `run_benchmark.py` and `stress_test_sdk.py` to use `--max-sessions` instead of the incorrect `--workers` parameter, accurately reflecting dispatcher configuration.
- Fixed `run_benchmark.py` argument handling to correctly pass all relevant custom parameters (including `--stream`, `--monitor-mode`, etc.) to `stress_test_sdk.py`.
- (`benchmark_report.py`) Applied a dark theme to benchmark reports for better readability.
- (`benchmark_report.py`) Improved visualization code to eliminate matplotlib warnings.
- Updated `run_benchmark.py` to provide clickable `file://` links to generated reports in the terminal output.
- Updated `USAGE.md` with comprehensive parameter descriptions reflecting the final script arguments.
- Updated the `run_all.sh` wrapper to correctly invoke `run_benchmark.py` with flexible arguments.

### Details of Changes:
#### Parameter Correction (--max-sessions):
- Recognized that `--workers` was used incorrectly.
- Updated `stress_test_sdk.py` to accept `--max-sessions` and configure the `MemoryAdaptiveDispatcher`'s `max_session_permit` accordingly (a sketch of this wiring follows the list).
- Updated `run_benchmark.py` argument parsing and command construction to use `--max-sessions`.
- Updated `TEST_CONFIGS` in `run_benchmark.py` to use `max_sessions`.
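A hedged sketch of how a `--max-sessions` flag can be mapped onto the dispatcher; the argparse layout and default value are assumptions, not the actual `stress_test_sdk.py` code:

```python
import argparse

from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

def build_dispatcher(argv=None):
    parser = argparse.ArgumentParser(description="Stress test arun_many")
    # --max-sessions replaces the old, misleading --workers flag.
    parser.add_argument("--max-sessions", type=int, default=16,
                        help="Upper bound on concurrent crawl sessions")
    args = parser.parse_args(argv)

    # The CLI value maps directly onto the dispatcher's max_session_permit.
    return MemoryAdaptiveDispatcher(max_session_permit=args.max_sessions)

if __name__ == "__main__":
    dispatcher = build_dispatcher()
    print(f"Dispatcher capped at {dispatcher.max_session_permit} sessions")
```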
#### Argument Handling (run_benchmark.py):
- Reworked argument handling in `run_benchmark.py`.
- Ensured custom parameters (`--stream`, `--monitor-mode`, `--port`, `--use-rate-limiter`, etc.) are correctly forwarded when calling `stress_test_sdk.py` as a subprocess (see the sketch below).
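One possible shape of that forwarding logic, shown only as a sketch (the function name and defaults are made up; the flag names come from the list above):

```python
import subprocess
import sys

def run_stress_test(max_sessions, stream=False, monitor_mode="detailed", port=8000):
    # Build the child command, forwarding every relevant flag explicitly.
    cmd = [
        sys.executable, "stress_test_sdk.py",
        "--max-sessions", str(max_sessions),
        "--monitor-mode", monitor_mode,
        "--port", str(port),
    ]
    if stream:
        cmd.append("--stream")

    # Run the stress test as a subprocess and propagate its exit code.
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    raise SystemExit(run_stress_test(max_sessions=16, stream=True))
```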
#### Dark Theme & Visualization Fixes (Assumed in benchmark_report.py):

#### Clickable Links (run_benchmark.py):
- Locates generated reports in the `benchmark_reports` directory after `benchmark_report.py` runs.
- Uses `pathlib` to generate correct `file://` URLs for terminal output (see the sketch below).
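Generating such links is straightforward with `pathlib`; the report filename below is a placeholder:

```python
from pathlib import Path

# Resolve the report to an absolute path and turn it into a file:// URL.
# Most modern terminals render this as a clickable link.
report = Path("benchmark_reports") / "benchmark_report.html"
print(f"Report available at: {report.resolve().as_uri()}")
```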
#### Documentation Improvements (USAGE.md):
- Explained `arun_many`, dispatchers, and `--max-sessions`.
- Covered both entry points (`stress_test_sdk.py`, `run_benchmark.py`).

### Files Modified:
- `stress_test_sdk.py`: Changed `--workers` to `--max-sessions`, added new arguments, used `arun_many`.
- `run_benchmark.py`: Changed argument handling, updated configs, calls `stress_test_sdk.py`.
- `run_all.sh`: Updated to call `run_benchmark.py` correctly.
- `USAGE.md`: Updated documentation extensively.
- `benchmark_report.py`: (Assumed modifications for dark theme and viz fixes).

### Testing:
- Verified that `--max-sessions` correctly limits concurrency via the `CrawlerMonitor` output.
- Verified that parameters passed to `run_benchmark.py` are forwarded to `stress_test_sdk.py`.

### Why These Changes:
These refinements correct the fundamental approach of the stress test to align with crawl4ai's actual architecture and intended usage:
- The tests now exercise the intended public API (`arun_many`, `MemoryAdaptiveDispatcher`).

### Future Enhancements to Consider:
### Changes Made:

### Details of Changes:

#### Custom Parameter Handling Fix

#### Dark Theme Implementation

#### Matplotlib Warning Fixes

#### Documentation Improvements

### Files Modified:
- `tests/memory/run_benchmark.py`: Fixed custom parameter handling
- `tests/memory/benchmark_report.py`: Added dark theme and fixed visualization warnings
- `tests/memory/run_all.sh`: Added clickable links to reports
- `tests/memory/USAGE.md`: Created comprehensive documentation

### Testing:
### Why These Changes:
These improvements address several usability issues with the stress testing system.
### Future Enhancements:
## Feature: MHTML snapshot capture of crawled pages
### Changes Made:
- Added a `capture_mhtml: bool = False` parameter to the `CrawlerRunConfig` class
- Added an `mhtml: Optional[str] = None` field to the `CrawlResult` model
- Added an `mhtml_data: Optional[str] = None` field to the `AsyncCrawlResponse` class
- Added a `capture_mhtml()` method in the `AsyncPlaywrightCrawlerStrategy` class to capture MHTML via CDP (a standalone sketch follows)

### Implementation Details:
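As a standalone sketch of the underlying technique (not the project's actual `capture_mhtml()` method), an MHTML snapshot can be requested through Playwright's CDP session roughly like this:

```python
import asyncio

from playwright.async_api import async_playwright

async def capture_mhtml(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        # Open a CDP session and ask Chromium for an MHTML snapshot of the page.
        cdp = await page.context.new_cdp_session(page)
        snapshot = await cdp.send("Page.captureSnapshot", {"format": "mhtml"})

        await browser.close()
        return snapshot["data"]

if __name__ == "__main__":
    mhtml = asyncio.run(capture_mhtml("https://example.com"))
    print(f"Captured {len(mhtml)} characters of MHTML")
```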
### Files Modified:
- `crawl4ai/models.py`: Added the `mhtml` field to `CrawlResult`
- `crawl4ai/async_configs.py`: Added the `capture_mhtml` parameter to `CrawlerRunConfig` (see the usage sketch below)
- `crawl4ai/async_crawler_strategy.py`: Implemented the MHTML capture logic
- `crawl4ai/async_webcrawler.py`: Added the mapping from `AsyncCrawlResponse.mhtml_data` to `CrawlResult.mhtml`
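From the user's perspective, enabling the capture is a single flag; a minimal sketch (the URL and output filename are placeholders):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Turn on MHTML capture for this run; the snapshot lands on result.mhtml.
    config = CrawlerRunConfig(capture_mhtml=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success and result.mhtml:
            with open("snapshot.mhtml", "w", encoding="utf-8") as f:
                f.write(result.mhtml)
            print("Saved snapshot.mhtml")

if __name__ == "__main__":
    asyncio.run(main())
```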
### Testing:
- Added `tests/20241401/test_mhtml.py` covering the new capture behavior.
### Challenges:
### Why This Feature:
The MHTML capture feature allows users to capture complete web pages, including all resources (CSS, images, etc.), in a single file. This is valuable whenever a complete, self-contained snapshot of a page is needed.
### Future Enhancements to Consider:
## Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling
### Changes Made:
- Added `capture_network_requests: bool = False` and `capture_console_messages: bool = False` parameters to the `CrawlerRunConfig` class
- Added `network_requests: Optional[List[Dict[str, Any]]] = None` and `console_messages: Optional[List[Dict[str, Any]]] = None` fields to both the `AsyncCrawlResponse` and `CrawlResult` models
- Updated `AsyncPlaywrightCrawlerStrategy._crawl_web()` to capture browser network events and console messages (a usage sketch follows this list)
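A minimal usage sketch of the new flags (the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Record every network event and console message while the page loads.
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(f"Network events captured: {len(result.network_requests or [])}")
        print(f"Console messages captured: {len(result.console_messages or [])}")

if __name__ == "__main__":
    asyncio.run(main())
```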
### Implementation Details:
- Registers Playwright page event listeners (`request`, `response`, and `requestfailed`) to record all network activity
- Registers listeners for `console` and `pageerror` events to record console messages and errors (see the sketch below)
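A simplified, standalone sketch of the listener-based approach using plain Playwright (not the `_crawl_web()` internals; the captured fields are illustrative):

```python
import asyncio

from playwright.async_api import async_playwright

async def capture(url: str):
    network_events, console_messages = [], []

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Network activity: outgoing requests, responses, and failures.
        page.on("request", lambda r: network_events.append(
            {"event": "request", "url": r.url, "method": r.method}))
        page.on("response", lambda r: network_events.append(
            {"event": "response", "url": r.url, "status": r.status}))
        page.on("requestfailed", lambda r: network_events.append(
            {"event": "requestfailed", "url": r.url, "failure": r.failure}))

        # Console output and uncaught page errors.
        page.on("console", lambda msg: console_messages.append(
            {"type": msg.type, "text": msg.text}))
        page.on("pageerror", lambda err: console_messages.append(
            {"type": "error", "text": str(err)}))

        await page.goto(url, wait_until="networkidle")
        await browser.close()

    return network_events, console_messages

if __name__ == "__main__":
    net, logs = asyncio.run(capture("https://example.com"))
    print(f"{len(net)} network events, {len(logs)} console messages")
```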
### Files Modified:
- `crawl4ai/models.py`: Added new fields to `AsyncCrawlResponse` and `CrawlResult`
- `crawl4ai/async_configs.py`: Added new configuration parameters to `CrawlerRunConfig`
- `crawl4ai/async_crawler_strategy.py`: Implemented capture logic using event listeners
- `crawl4ai/async_webcrawler.py`: Added data transfer from `AsyncCrawlResponse` to `CrawlResult`

### Documentation:
- Added `docs/md_v2/advanced/network-console-capture.md`
- Registered the new page in `mkdocs.yml`
- Updated `docs/md_v2/api/crawl-result.md`
- Added `docs/examples/network_console_capture_example.py`

### Testing:
- Added `tests/general/test_network_console_capture.py` with tests for the new capture options.
### Challenges:
### Why This Feature:
The network and console capture feature provides deep visibility into the activity of a page during crawling.
### Future Enhancements to Consider: