Crawl a sitemap


import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ApiLink from '@site/src/components/ApiLink';

import CheerioSource from '!!raw-loader!roa-loader!./crawl_sitemap_cheerio.ts';
import PuppeteerSource from '!!raw-loader!roa-loader!./crawl_sitemap_puppeteer.ts';
import PlaywrightSource from '!!raw-loader!roa-loader!./crawl_sitemap_playwright.ts';

A sitemap tells search engines which pages and files on a website are important, and provides additional information about them. This example builds a sitemap crawler that downloads a sitemap and crawls the URLs it lists, using the <ApiLink to="utils/class/Sitemap">Sitemap</ApiLink> utility class provided by the <ApiLink to="utils">@crawlee/utils</ApiLink> module.
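
To make the sitemap format concrete, here is a minimal sketch of pulling the page URLs out of a sitemap document. It is an illustration only and not how the `Sitemap` class is implemented: `extractSitemapUrls`, `decodeXmlEntities`, and the sample XML are hypothetical names and data made up for this sketch, and it assumes a plain `<urlset>` sitemap where every page URL sits in a `<loc>` element.

```typescript
// Hypothetical helper: decode the five predefined XML entities.
// `&amp;` is handled last so that `&amp;lt;` is not decoded twice.
function decodeXmlEntities(text: string): string {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&apos;/g, "'")
        .replace(/&amp;/g, '&');
}

// Hypothetical helper: collect the contents of every <loc> element.
// A regular expression is enough for this sketch; it does not handle
// CDATA sections, comments, or nested sitemap index files.
function extractSitemapUrls(xml: string): string[] {
    const urls: string[] = [];
    const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
    let match: RegExpExecArray | null = locPattern.exec(xml);
    while (match !== null) {
        urls.push(decodeXmlEntities(match[1]));
        match = locPattern.exec(xml);
    }
    return urls;
}

// Sample sitemap (made-up data) with two pages.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/search?q=a&amp;page=2</loc>
  </url>
</urlset>`;

// Logs the two URLs, with `&amp;` decoded to `&` in the second one.
console.log(extractSitemapUrls(sampleSitemap));
```

In the crawlers below, this step is handled by the `Sitemap` utility class, so you do not need to parse the XML yourself.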

<Tabs groupId="crawler-type"> <TabItem value="cheerio_crawler" label="Cheerio Crawler" default> <RunnableCodeBlock className="language-js" type="cheerio"> {CheerioSource} </RunnableCodeBlock> </TabItem> <TabItem value="puppeteer_crawler" label="Puppeteer Crawler">

:::tip

To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile.

:::

<RunnableCodeBlock className="language-js" type="puppeteer"> {PuppeteerSource} </RunnableCodeBlock> </TabItem> <TabItem value="playwright_crawler" label="Playwright Crawler">

:::tip

To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile.

:::

<RunnableCodeBlock className="language-js" type="playwright"> {PlaywrightSource} </RunnableCodeBlock> </TabItem> </Tabs>