This post was generated with the help of ChatGPT, so take everything with a grain of salt. 🧂
Hi everyone,
I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.
One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI waits for all images to load before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.
Here’s how to enable it:
await crawler.crawl(
    url="https://example.com",
    wait_for_images=True  # Add this argument to ensure images are fully loaded
)
What this does is make the crawler wait until the page's images have finished loading before it moves on.
This single change handles the majority of lazy-loading cases you’re likely to encounter.
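To make the idea concrete, here is a minimal sketch of the kind of wait loop a flag like wait_for_images implies: poll until every image reports that it has loaded, or give up after a timeout. This is not Crawl4AI's implementation. FakePage, all_images_loaded, and wait_for_images_loaded are hypothetical names used only for illustration, and a real check would run inside the browser page.

```python
import asyncio
import time


class FakePage:
    """Hypothetical stand-in for a browser page; not part of Crawl4AI."""

    def __init__(self, load_times):
        # load_times: seconds after creation at which each image finishes loading
        self._start = time.monotonic()
        self._load_times = list(load_times)

    def all_images_loaded(self):
        elapsed = time.monotonic() - self._start
        return all(elapsed >= t for t in self._load_times)


async def wait_for_images_loaded(page, timeout=2.0, poll_interval=0.05):
    """Poll until every image is loaded; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if page.all_images_loaded():
            return True
        await asyncio.sleep(poll_interval)
    # One last check so a load that lands exactly at the deadline still counts.
    return page.all_images_loaded()
```

The timeout matters: without it, a single broken image could stall the whole crawl.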
Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable text-only mode to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling 3-4 times faster in most cases.
Here’s how to turn it on:
crawler = AsyncPlaywrightCrawlerStrategy(
    text_mode=True  # Set this to True to enable text-only crawling
)
When text_mode=True, the crawler automatically disables images, JavaScript, and other heavy resources. The viewport size remains configurable through viewport_width and viewport_height.
If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
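For a feel of what "skipping heavy resources" means in practice, here is a small sketch of the decision a text-only mode has to make for each request. The set of blocked types and the function names should_block and filter_requests are assumptions made for illustration; they are not taken from Crawl4AI's source.

```python
# Resource types a text-only crawl could skip. This set is an assumption for
# illustration, not the list Crawl4AI uses internally.
HEAVY_RESOURCE_TYPES = frozenset({"image", "script", "media", "font"})


def should_block(resource_type, text_mode):
    """Return True if a request of this type should be aborted."""
    return bool(text_mode) and resource_type in HEAVY_RESOURCE_TYPES


def filter_requests(requests, text_mode):
    """Keep only the (url, resource_type) pairs that would still be fetched."""
    return [(url, kind) for url, kind in requests if not should_block(kind, text_mode)]


if __name__ == "__main__":
    sample = [
        ("https://example.com/", "document"),
        ("https://example.com/hero.png", "image"),
        ("https://example.com/app.js", "script"),
    ]
    # With text_mode on, only the HTML document survives the filter.
    print(filter_requests(sample, text_mode=True))
```

In a Playwright-based crawler, a check like this would typically live in a request-routing handler that aborts blocked requests and lets the rest continue.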
Another useful addition is the ability to dynamically adjust the viewport size to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.
Here’s how it works: the crawler sizes the viewport to match the page's content instead of keeping a fixed size.
To enable this, use:
await crawler.crawl(
    url="https://example.com",
    adjust_viewport_to_content=True  # Dynamically adjusts the viewport
)
This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
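As a rough illustration of matching the viewport to the content, the sketch below takes measured content dimensions and returns a viewport size clamped to sensible bounds. The Size type, the fit_viewport_to_content helper, and the minimum and maximum bounds are all hypothetical; how Crawl4AI actually measures and resizes is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def fit_viewport_to_content(content, minimum=Size(800, 600), maximum=Size(1920, 20000)):
    """Return a viewport that covers the content, clamped to [minimum, maximum].

    The bounds are illustrative assumptions: the lower bound avoids tiny
    viewports on near-empty pages, the upper bound avoids huge ones on
    very long pages.
    """
    width = min(max(content.width, minimum.width), maximum.width)
    height = min(max(content.height, minimum.height), maximum.height)
    return Size(width, height)


if __name__ == "__main__":
    # A 1200x5400 page fits inside the bounds, so the viewport matches it exactly.
    print(fit_viewport_to_content(Size(1200, 5400)))
```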
Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for full-page scanning. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
Here’s an example:
await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,  # Enables scrolling
    scroll_delay=0.2  # Waits 200ms between scrolls (optional)
)
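What happens here: the crawler scrolls down the page step by step, waits scroll_delay seconds between scrolls, and checks for new content until it reaches the bottom.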
If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
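The scroll-and-check loop can be sketched without a browser. Below, FakeInfinitePage is a hypothetical page that grows whenever its bottom is reached, and the scan_full_page function here is an illustrative stand-in that only shares a name with the option above; neither is Crawl4AI code. The max_scrolls cap is an added assumption so that a truly endless feed cannot loop forever.

```python
import asyncio


class FakeInfinitePage:
    """Hypothetical page whose height grows each time the bottom is reached."""

    def __init__(self, initial_height=1000, growth=500, max_height=2500):
        self.height = initial_height
        self.scroll_y = 0
        self._growth = growth
        self._max_height = max_height

    def scroll_to(self, y):
        self.scroll_y = min(y, self.height)
        # Reaching the bottom triggers a "lazy load" that extends the page.
        if self.scroll_y >= self.height and self.height < self._max_height:
            self.height = min(self.height + self._growth, self._max_height)


async def scan_full_page(page, viewport_height=600, scroll_delay=0.2, max_scrolls=100):
    """Scroll step by step until the bottom stops moving; return the scroll count."""
    scrolls = 0
    while scrolls < max_scrolls:
        if page.scroll_y >= page.height:
            break  # at the bottom and no new content appeared
        page.scroll_to(page.scroll_y + viewport_height)
        scrolls += 1
        await asyncio.sleep(scroll_delay)  # give new content time to load
    return scrolls
```

The loop stops on its own once a scroll to the bottom no longer makes the page taller.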
By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.
I added a method called create_session for this:
session_id = await crawler.create_session()

# Use the same session for multiple crawls
await crawler.crawl(
    url="https://example.com/page1",
    session_id=session_id  # Reuse the session
)
await crawler.crawl(
    url="https://example.com/page2",
    session_id=session_id
)
This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
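To show why reuse saves work, here is a toy model of session handling: a registry that maps a session ID to an already-open context and only opens a new one when no session is given. SessionPool and its fields are hypothetical and exist only for this sketch; its create_session and crawl methods mirror the names above but are not Crawl4AI's code.

```python
import uuid


class SessionPool:
    """Hypothetical session registry; not Crawl4AI's internal implementation."""

    def __init__(self):
        self._sessions = {}
        self.contexts_created = 0

    def _new_context(self):
        # Stand-in for the expensive step of opening a browser context or tab.
        self.contexts_created += 1
        return {"id": self.contexts_created, "pages_visited": []}

    def create_session(self):
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = self._new_context()
        return session_id

    def crawl(self, url, session_id=None):
        # Reuse the stored context when a known session_id is given;
        # otherwise fall back to a fresh, throwaway context.
        context = self._sessions.get(session_id) if session_id else None
        if context is None:
            context = self._new_context()
        context["pages_visited"].append(url)
        return context
```

Crawling two pages with the same session ID leaves contexts_created at 1, while each crawl without a session ID adds one more.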
Here are a few smaller updates I’ve made:
Use light_mode=True to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
delay_before_return_html is now set to 0.1 seconds.
You can install or upgrade to version 0.4.1 like this:
pip install crawl4ai --upgrade
As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!
Enjoy the new features, and happy crawling! 🕷️