agent-skill/Scrapling-Skill/references/parsing/adaptive.md
Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.
Consider a page with a structure like this:
```html
<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>
```
To scrape the first product (the one with the p1 ID), a selector like this would be used:
```python
page.css('#p1')
```
When the website's owners implement structural changes like this:
```html
<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>
```
The selector will no longer function, and your code needs maintenance. That's where Scrapling's adaptive feature comes into play.
With Scrapling, you enable the adaptive feature and save the element's unique properties the first time you select it. The next time you select that element and it no longer exists, Scrapling searches the page for the element with the highest similarity to the saved properties and returns it.
```python
from scrapling import Selector, Fetcher

# Before the change
page = Selector(page_source, adaptive=True, url='example.com')
# or
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')

# then
element = page.css('#p1', auto_save=True)

if not element:  # One day the website changes?
    element = page.css('#p1', adaptive=True)  # Scrapling still finds it!
# the rest of your code...
```
It works with all selection methods, not just CSS/XPath selection.
This example uses the Internet Archive's Wayback Machine to demonstrate adaptive scraping across different versions of a website. A copy of StackOverflow's website from 2010 is compared against the current design to show that the adaptive feature can extract the same button using the same selector.
To extract the Questions button from the old design, a selector like #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a can be used (this specific selector was generated by Chrome).
Testing the same selector in both versions:
```python
>>> from scrapling import Fetcher

>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"

>>> Fetcher.configure(adaptive=True, adaptive_domain='stackoverflow.com')

>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css(selector, auto_save=True)[0]

>>> # Same selector, but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css(selector, adaptive=True)[0]

>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
Scrapling found the same element in the old and new designs!
```
The adaptive_domain argument is used here because Scrapling sees archive.org and stackoverflow.com as two different domains and would isolate their adaptive data. Passing adaptive_domain tells Scrapling to treat them as the same website for adaptive data storage.
In a typical scenario with the same URL for both requests, the adaptive_domain argument is not needed. The adaptive logic works the same way with both the Selector and Fetcher classes.
Note: The main reason for creating the adaptive_domain argument was to handle if the website changed its URL while changing the design/structure. In that case, it can be used to continue using the previously stored adaptive data for the new URL. Otherwise, Scrapling will consider it a new website and discard the old data.
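For example, if a website later moves to a new URL as part of a redesign, pinning adaptive_domain to the original domain keeps the previously stored adaptive data usable. A minimal sketch, with placeholder URLs:

```python
from scrapling import Fetcher

# Original run: properties are stored under the 'example.com' adaptive domain.
Fetcher.configure(adaptive=True, adaptive_domain='example.com')
page = Fetcher.get('https://example.com/products')
page.css('#p1', auto_save=True)

# Later run: the site moved to a new URL, but keeping the same adaptive_domain
# points Scrapling at the data saved above instead of treating it as a new site.
Fetcher.configure(adaptive=True, adaptive_domain='example.com')
page = Fetcher.get('https://shop.example.net/products')
element = page.css('#p1', adaptive=True)
```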
Adaptive scraping works in two phases: saving and matching. After selecting an element through any method, the library can find it the next time the website is scraped, even if the site undergoes structural/design changes.
The general logic is as follows:
In the saving phase, Scrapling saves that element's unique properties (through the methods shown below) to its configured database (SQLite by default).
Because everything about the element can be changed or removed by the website's owner(s), nothing from the element itself can serve as a unique key in the database. The storage system relies on two things:

- The website's URL/domain: with the Selector class, pass it when initializing; when using a fetcher, the domain is automatically taken from the URL.
- An identifier used to query that element's properties from the database. The identifier does not always need to be set manually (see below).

Together, they will later be used to retrieve the element's unique properties from the database.
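A minimal sketch of how those two pieces appear in code (page_source stands in for whatever HTML you already have):

```python
from scrapling import Selector

# The url ties the saved data to this website's domain...
page = Selector(page_source, adaptive=True, url='https://example.com')

# ...and the identifier is the key the properties are stored under. With css()/xpath()
# it defaults to the selector string, but it can also be set explicitly.
page.css('#p1', auto_save=True, identifier='first_product')
```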
Later, in the matching phase, when the website's structure has changed, enabling adaptive causes Scrapling to retrieve the element's saved unique properties and match every element on the page against them, calculating a similarity score for each candidate. Everything is taken into consideration in that comparison.
The element(s) with the highest similarity score to the wanted element are returned.
This comparison against the saved unique properties is not exact; it is based on how similar the values are, and it takes the order of values into account (e.g., the order in which class names are written).
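To make the scoring idea concrete, here is a toy sketch. It is not Scrapling's actual implementation; it only illustrates scoring every candidate against saved properties and keeping the best match:

```python
from difflib import SequenceMatcher

# Purely illustrative -- NOT Scrapling's internal algorithm. Each candidate element
# is scored against the saved properties and the highest-scoring one wins.
def similarity(saved: dict, candidate: dict) -> float:
    tag_score = 1.0 if saved['tag'] == candidate['tag'] else 0.0
    # SequenceMatcher compares sequences, so the order in which class names
    # appear can affect the score.
    class_score = SequenceMatcher(None, saved['classes'], candidate['classes']).ratio()
    text_score = SequenceMatcher(None, saved['text'], candidate['text']).ratio()
    return tag_score + class_score + text_score

saved = {'tag': 'article', 'classes': ['product'], 'text': 'Product 1'}
candidates = [
    {'tag': 'article', 'classes': ['product', 'new-class'], 'text': 'Product 1'},
    {'tag': 'article', 'classes': ['product', 'new-class'], 'text': 'Product 2'},
]
best = max(candidates, key=lambda c: similarity(saved, c))  # the first candidate wins
```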
The adaptive feature can be applied to any found element and is added as arguments to CSS/XPath selection methods.
First, enable the adaptive feature by passing adaptive=True to the Selector class when initializing it, or enable it on the fetcher being used.
Examples:
```python
>>> from scrapling import Selector, Fetcher
>>> page = Selector(html_doc, adaptive=True)
# OR
>>> Fetcher.adaptive = True
>>> page = Fetcher.get('https://example.com')
```
When using the Selector class, pass the URL of the website with the url argument so Scrapling can separate the properties saved for each element by domain.
If no URL is passed, the word default will be used in place of the URL field while saving the element's unique properties. This is only an issue when using the same identifier for a different website without passing the URL parameter. The save process overwrites previous data, and the adaptive feature uses only the latest saved properties.
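A sketch of that pitfall, using two hypothetical HTML strings from two different sites:

```python
from scrapling import Selector

# No url passed: both saves go under the "default" placeholder domain,
# so the second auto_save overwrites the first for this identifier.
page_a = Selector(html_from_site_a, adaptive=True)
page_a.css('#p1', auto_save=True, identifier='main_product')

page_b = Selector(html_from_site_b, adaptive=True)
page_b.css('#p1', auto_save=True, identifier='main_product')  # replaces site A's data
```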
The storage and storage_args arguments control the database connection; by default, the SQLite class provided by the library is used.
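As a sketch only: storage selects the storage class and storage_args is passed through to it, so something along these lines could point the default SQLite storage at a specific file. The 'storage_file' key is an assumption here, not a confirmed option name; check the storage class of your installed Scrapling version for the actual constructor arguments.

```python
from scrapling import Selector

# Hypothetical example: the exact storage_args keys depend on the storage class
# shipped with your Scrapling version ('storage_file' is assumed, not confirmed).
page = Selector(
    page_source,
    adaptive=True,
    url='https://example.com',
    storage_args={'storage_file': 'adaptive_elements.db'},
)
```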
There are two main ways to use the adaptive feature:
First, use the auto_save argument while selecting an element that exists on the page:
```python
element = page.css('#p1', auto_save=True)
```
When the element no longer exists, use the same selector with the adaptive argument to have the library find it:
```python
element = page.css('#p1', adaptive=True)
```
With the css/xpath methods, the identifier is set automatically to the selector string passed to the method.
Additionally, for all these methods, you can pass the identifier argument to set it yourself. This is useful in some cases, for example when saving properties with the auto_save argument under a name that will outlive the selector string, as sketched below.
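A minimal sketch (the selector itself is hypothetical): saving under a stable, meaningful identifier lets the adaptive call keep working under that name even if the selector has to change later.

```python
# Save under an explicit identifier instead of the selector string.
products = page.css('ul.listing > li.product', auto_save=True, identifier='product_cards')

# After a redesign, the same identifier can be reused to relocate the element
# adaptively, regardless of what selector string is passed here.
products = page.css('ul.listing > li.product', adaptive=True, identifier='product_cards')
```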
Elements can be manually saved, retrieved, and relocated within the adaptive feature. This allows relocating any element found by any method.
Example of getting an element by text:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
```
Save its unique properties using the save method. The identifier must be set manually (use a meaningful identifier):
```python
>>> page.save(element, 'my_special_element')
```
Later, retrieve and relocate the element inside the page with adaptive:
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, selector_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, selector_type=True).css('::text').getall()
['Tipping the Velvet']
```
Both the retrieve and relocate methods are called on the page object itself, as shown above.
To keep it as an lxml.etree element, omit the selector_type argument:
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```
If adaptive relocation does not return what you expect, a few checks help:
```python
# 1. Check whether data was saved for the identifier
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with a different identifier
products = page.css('.product', adaptive=True, identifier='old_selector')

# 3. Save again with a new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')
```
To reduce ambiguity in the first place:
```python
# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')
```
In the adaptive save process, only the unique properties of the first element in the selection results are saved. So if your selector matches multiple elements in different locations on the page, adaptive will return only that first element when you relocate it later. This does not apply to combined CSS selectors (selectors joined with commas, for example), because those are split and each sub-selector is executed on its own.
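For example, with a multi-match selector like the one below (reusing the .product class from the earlier examples), only the first match's properties end up in storage:

```python
# '.product' matches several elements, but only the first match's unique
# properties are saved under this identifier.
page.css('.product', auto_save=True, identifier='product_card')

# Relocating later returns the element most similar to that first product,
# not the whole original list of matches.
page.css('.product', adaptive=True, identifier='product_card')
```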