agent-skill/Scrapling-Skill/references/parsing/adaptive.md
Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.
Consider a page with a structure like this:
```html
<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>
```
To scrape the first product (the one with the p1 ID), a selector like this would be used:
```python
page.css('#p1')
```
When the website's owners implement structural changes like this:
```html
<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>
```
The selector will no longer function, and your code needs maintenance. That's where Scrapling's adaptive feature comes into play.
With Scrapling, you enable the adaptive feature and save the element's unique properties the first time you select it. The next time you select that element and it no longer exists, Scrapling searches the page for the element with the highest similarity to the saved properties and returns it.
```python
from scrapling import Selector, Fetcher

# Before the change
page = Selector(page_source, adaptive=True, url='example.com')
# or
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')

# then
element = page.css('#p1', auto_save=True)

if not element:  # One day the website changes?
    element = page.css('#p1', adaptive=True)  # Scrapling still finds it!
# the rest of your code...
```
It works with all selection methods, not just CSS/XPath selection.
This example uses the Internet Archive's Wayback Machine to demonstrate adaptive scraping across different versions of a website. A copy of StackOverflow's website from 2010 is compared against the current design to show that the adaptive feature can extract the same button using the same selector.
To extract the Questions button from the old design, a selector like #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a can be used (this specific selector was generated by Chrome).
Testing the same selector in both versions:
```python
>>> from scrapling import Fetcher

>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"

>>> Fetcher.configure(adaptive=True, adaptive_domain='stackoverflow.com')

>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css(selector, auto_save=True)[0]

>>> # Same selector, but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css(selector, adaptive=True)[0]

>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
Scrapling found the same element in the old and new designs!
```
The adaptive_domain argument is used here because Scrapling sees archive.org and stackoverflow.com as two different domains and would isolate their adaptive data. Passing adaptive_domain tells Scrapling to treat them as the same website for adaptive data storage.
In a typical scenario with the same URL for both requests, the adaptive_domain argument is not needed. The adaptive logic works the same way with both the Selector and Fetcher classes.
Note: The main reason for creating the adaptive_domain argument was to handle if the website changed its URL while changing the design/structure. In that case, it can be used to continue using the previously stored adaptive data for the new URL. Otherwise, Scrapling will consider it a new website and discard the old data.
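For example, if a website later moves to a new URL as part of a redesign, pinning adaptive_domain to the original domain keeps the previously stored adaptive data usable. A minimal sketch, with placeholder URLs:

```python
from scrapling import Fetcher

# Original run: properties are stored under the 'example.com' adaptive domain.
Fetcher.configure(adaptive=True, adaptive_domain='example.com')
page = Fetcher.get('https://example.com/products')
page.css('#p1', auto_save=True)

# Later run: the site moved to a new URL, but keeping the same adaptive_domain
# points Scrapling at the data saved above instead of treating it as a new site.
Fetcher.configure(adaptive=True, adaptive_domain='example.com')
page = Fetcher.get('https://shop.example.net/products')
element = page.css('#p1', adaptive=True)
```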
Adaptive scraping works in two phases: saving and matching. After selecting an element through any method, the library can find it the next time the website is scraped, even if the site undergoes structural/design changes.
The general logic is as follows:
In the saving phase, Scrapling saves that element's unique properties (through the methods shown below) to its configured database (SQLite by default).
Because everything about the element can be changed or removed by the website's owner(s), nothing from the element itself can serve as a unique key in the database. The storage system relies on two things:

- The website's URL/domain: with the Selector class, pass it when initializing; when using a fetcher, the domain is automatically taken from the URL.
- An identifier used to query that element's properties from the database. The identifier does not always need to be set manually (see below).

Together, they will later be used to retrieve the element's unique properties from the database.
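A minimal sketch of how those two pieces appear in code (page_source stands in for whatever HTML you already have):

```python
from scrapling import Selector

# The url ties the saved data to this website's domain...
page = Selector(page_source, adaptive=True, url='https://example.com')

# ...and the identifier is the key the properties are stored under. With css()/xpath()
# it defaults to the selector string, but it can also be set explicitly.
page.css('#p1', auto_save=True, identifier='first_product')
```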
Later, in the matching phase, when the website's structure has changed, enabling adaptive causes Scrapling to retrieve the element's saved unique properties and match every element on the page against them, calculating a similarity score for each candidate. Everything is taken into consideration in that comparison.
The element(s) with the highest similarity score to the wanted element are returned.
This comparison against the saved unique properties is not exact; it is based on how similar the values are, and it takes the order of values into account (e.g., the order in which class names are written).
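To make the scoring idea concrete, here is a toy sketch. It is not Scrapling's actual implementation; it only illustrates scoring every candidate against saved properties and keeping the best match:

```python
from difflib import SequenceMatcher

# Purely illustrative -- NOT Scrapling's internal algorithm. Each candidate element
# is scored against the saved properties and the highest-scoring one wins.
def similarity(saved: dict, candidate: dict) -> float:
    tag_score = 1.0 if saved['tag'] == candidate['tag'] else 0.0
    # SequenceMatcher compares sequences, so the order in which class names
    # appear can affect the score.
    class_score = SequenceMatcher(None, saved['classes'], candidate['classes']).ratio()
    text_score = SequenceMatcher(None, saved['text'], candidate['text']).ratio()
    return tag_score + class_score + text_score

saved = {'tag': 'article', 'classes': ['product'], 'text': 'Product 1'}
candidates = [
    {'tag': 'article', 'classes': ['product', 'new-class'], 'text': 'Product 1'},
    {'tag': 'article', 'classes': ['product', 'new-class'], 'text': 'Product 2'},
]
best = max(candidates, key=lambda c: similarity(saved, c))  # the first candidate wins
```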
The adaptive feature can be applied to any found element and is added as arguments to CSS/XPath selection methods.
First, enable the adaptive feature by passing adaptive=True to the Selector class when initializing it, or enable it on the fetcher being used.
Examples:
```python
>>> from scrapling import Selector, Fetcher
>>> page = Selector(html_doc, adaptive=True)
# OR
>>> Fetcher.adaptive = True
>>> page = Fetcher.get('https://example.com')
```
When using the Selector class, pass the URL of the website with the url argument so Scrapling can separate the properties saved for each element by domain.
If no URL is passed, the word default will be used in place of the URL field while saving the element's unique properties. This is only an issue when using the same identifier for a different website without passing the URL parameter. The save process overwrites previous data, and the adaptive feature uses only the latest saved properties.
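A sketch of that pitfall, using two hypothetical HTML strings from two different sites:

```python
from scrapling import Selector

# No url passed: both saves go under the "default" placeholder domain,
# so the second auto_save overwrites the first for this identifier.
page_a = Selector(html_from_site_a, adaptive=True)
page_a.css('#p1', auto_save=True, identifier='main_product')

page_b = Selector(html_from_site_b, adaptive=True)
page_b.css('#p1', auto_save=True, identifier='main_product')  # replaces site A's data
```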
The storage and storage_args arguments control the database connection; by default, the SQLite class provided by the library is used.
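As a sketch only: storage selects the storage class and storage_args is passed through to it, so something along these lines could point the default SQLite storage at a specific file. The 'storage_file' key is an assumption here, not a confirmed option name; check the storage class of your installed Scrapling version for the actual constructor arguments.

```python
from scrapling import Selector

# Hypothetical example: the exact storage_args keys depend on the storage class
# shipped with your Scrapling version ('storage_file' is assumed, not confirmed).
page = Selector(
    page_source,
    adaptive=True,
    url='https://example.com',
    storage_args={'storage_file': 'adaptive_elements.db'},
)
```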
There are two main ways to use the adaptive feature:
First, use the auto_save argument while selecting an element that exists on the page:
```python
element = page.css('#p1', auto_save=True)
```
When the element no longer exists, use the same selector with the adaptive argument to have the library find it:
```python
element = page.css('#p1', adaptive=True)
```
With the css/xpath methods, the identifier is set automatically to the selector string passed to the method.
Additionally, for all these methods, you can pass the identifier argument to set it yourself. This is useful in some cases, for example when saving properties with the auto_save argument under a name that will outlive the selector string, as sketched below.
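A minimal sketch (the selector itself is hypothetical): saving under a stable, meaningful identifier lets the adaptive call keep working under that name even if the selector has to change later.

```python
# Save under an explicit identifier instead of the selector string.
products = page.css('ul.listing > li.product', auto_save=True, identifier='product_cards')

# After a redesign, the same identifier can be reused to relocate the element
# adaptively, regardless of what selector string is passed here.
products = page.css('ul.listing > li.product', adaptive=True, identifier='product_cards')
```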
Elements can be manually saved, retrieved, and relocated within the adaptive feature. This allows relocating any element found by any method.
Example of getting an element by text:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
```
Save its unique properties using the save method. The identifier must be set manually (use a meaningful identifier):
```python
>>> page.save(element, 'my_special_element')
```
Later, retrieve and relocate the element inside the page with adaptive:
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, selector_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, selector_type=True).css('::text').getall()
['Tipping the Velvet']
```
Both the retrieve and relocate methods are called on the page object itself, as shown above.
To keep it as an lxml.etree element, omit the selector_type argument:
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```
If adaptive relocation does not return what you expect, a few checks help:
```python
# 1. Check whether data was saved for the identifier
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with a different identifier
products = page.css('.product', adaptive=True, identifier='old_selector')

# 3. Save again with a new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')
```
To reduce ambiguity in the first place:
```python
# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')
```
In the adaptive save process, only the unique properties of the first element in the selection results are saved. So if your selector matches multiple elements in different locations on the page, adaptive will return only that first element when you relocate it later. This does not apply to combined CSS selectors (selectors joined with commas, for example), because those are split and each sub-selector is executed on its own.
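For example, with a multi-match selector like the one below (reusing the .product class from the earlier examples), only the first match's properties end up in storage:

```python
# '.product' matches several elements, but only the first match's unique
# properties are saved under this identifier.
page.css('.product', auto_save=True, identifier='product_card')

# Relocating later returns the element most similar to that first product,
# not the whole original list of matches.
page.css('.product', adaptive=True, identifier='product_card')
```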