This section describes how to use a web crawler within MindsDB.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.
Before proceeding, ensure that MindsDB is installed and running. This handler does not require any connection parameters.
Here is how to initialize a web crawler:

```sql
CREATE DATABASE my_web
WITH ENGINE = 'web';
```
The `crawl_depth` parameter defines how deep the crawler should navigate through linked pages:

- `crawl_depth = 0`: crawls only the specified page.
- `crawl_depth = 1`: crawls the specified page and all pages linked on it.

There are multiple ways to limit the number of pages returned:

- The `LIMIT` clause defines the maximum number of pages returned globally.
- The `per_url_limit` parameter limits the number of pages returned for each specific URL, if more than one URL is provided.

The following example retrieves data from a single webpage:
```sql
SELECT *
FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/';
```
Returns 1 row by default.
To retrieve more pages from the same URL, specify the LIMIT:
```sql
SELECT *
FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
LIMIT 30;
```
Returns up to 30 rows.
To crawl multiple URLs at once:
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/', 'https://mindsdb.com/');
```
Returns 3 rows by default (1 row per URL).
To apply a per-URL limit:
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/')
AND per_url_limit = 2;
```
Returns 4 rows (2 rows per URL).
To crawl all pages linked within a website:
```sql
SELECT *
FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
AND crawl_depth = 1;
```
Returns 1 + x rows, where x is the number of linked webpages.
For multiple URLs with crawl depth:
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('https://docs.mindsdb.com/', 'https://dev.mysql.com/doc/')
AND crawl_depth = 1;
```
Returns 2 + x + y rows, where x and y are the number of linked pages from each URL.
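A crawl with `crawl_depth = 1` can return many rows for a link-heavy site. As a sketch, assuming the `LIMIT` clause is also honored when `crawl_depth` is set (the examples above use them separately), the two can be combined to cap the total number of crawled pages:

```sql
-- Crawl the page and its linked pages, but return at most 20 rows in total.
SELECT *
FROM my_web.crawler
WHERE url = 'https://docs.mindsdb.com/'
AND crawl_depth = 1
LIMIT 20;
```

If the combination is not supported in your MindsDB version, use `crawl_depth` alone and filter the result set afterwards.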
MindsDB accepts file uploads in the csv, xlsx, xls, sheet, json, and parquet formats. In addition, you can configure the web crawler to fetch data from PDF files accessible via URLs.
```sql
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```
The Web Handler can be configured to interact only with specific domains by using the web_crawling_allowed_sites setting in the config.json file.
This feature allows you to restrict the handler to crawl and process content only from the domains you specify, enhancing security and control over web interactions.
To configure this, list the allowed domains under the `web_crawling_allowed_sites` key in `config.json`. For example:
```json
"web_crawling_allowed_sites": [
    "https://docs.mindsdb.com",
    "https://another-allowed-site.com"
]
```