Crawl all links on a website

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ApiLink from '@site/src/components/ApiLink';

import CheerioSource from '!!raw-loader!roa-loader!./crawl_all_links_cheerio.ts';
import PuppeteerSource from '!!raw-loader!roa-loader!./crawl_all_links_puppeteer.ts';
import PlaywrightSource from '!!raw-loader!roa-loader!./crawl_all_links_playwright.ts';

This example uses the enqueueLinks() method to add new links to the RequestQueue as the crawler navigates from page to page. Removing the maxRequestsPerCrawl option lifts the cap on the number of requests, so the same example can also be used to find all URLs on a domain.
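To make the queue-driven flow concrete, here is a minimal sketch in plain TypeScript. It is not Crawlee's implementation: `LinkGraph`, `crawlAllLinks`, and the `maxRequests` parameter are hypothetical names, and the "website" is an in-memory map rather than real HTTP responses. It only shows the idea of enqueueing unseen links while visiting pages, with an optional request cap.

```typescript
// Hypothetical in-memory "website": each URL maps to the links found on that page.
type LinkGraph = Record<string, string[]>;

// Visits pages in queue order, enqueueing every link that has not been seen yet.
// `maxRequests` stands in for a request cap; the default (Infinity) visits everything reachable.
function crawlAllLinks(graph: LinkGraph, startUrl: string, maxRequests: number = Infinity): string[] {
    const queue: string[] = [startUrl];
    const seen = new Set<string>([startUrl]);
    const visited: string[] = [];

    while (queue.length > 0 && visited.length < maxRequests) {
        const url = queue.shift() as string;
        visited.push(url);

        // "Enqueue links": add each newly discovered URL exactly once.
        for (const link of graph[url] ?? []) {
            if (!seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return visited;
}

const site: LinkGraph = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': [],
};

const capped = crawlAllLinks(site, 'https://example.com/', 2); // stops after two pages
const full = crawlAllLinks(site, 'https://example.com/'); // no cap: all four pages
console.log(capped.length, full.length);
```

With the cap in place the sketch stops after two pages; without it, the loop runs until the queue is empty, which mirrors what removing the request limit does in the real example.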

:::tip

If no options are given, the method only adds links that are under the same subdomain. This behavior can be controlled with the <ApiLink to="core/interface/EnqueueLinksOptions#strategy">strategy</ApiLink> option. You can find more info about this option in the Crawl relative links examples.

:::
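As a rough illustration of what such a strategy decides, the following sketch filters a page's links by comparing each target URL with the page URL. It is an assumption-laden approximation, not the library's code: `filterLinks` and the `Strategy` type are hypothetical, only three strategy names are modeled, and the exact matching rules should be checked against the linked option reference.

```typescript
// Hypothetical subset of strategy names, modeled on the values the `strategy` option accepts.
type Strategy = 'all' | 'same-hostname' | 'same-origin';

// Resolves each link against the page URL and keeps only those allowed by the strategy.
function filterLinks(pageUrl: string, links: string[], strategy: Strategy = 'same-hostname'): string[] {
    const page = new URL(pageUrl);
    return links
        .map((href) => new URL(href, page)) // relative links are resolved against the page
        .filter((target) => {
            if (strategy === 'all') return true; // follow everything
            if (strategy === 'same-origin') return target.origin === page.origin; // protocol + host + port
            return target.hostname === page.hostname; // same (sub)domain only
        })
        .map((target) => target.href);
}

const found = [
    '/intro',
    'http://docs.example.com/old',
    'https://blog.example.com/post',
    'https://other.org/',
];

const sameHostname = filterLinks('https://docs.example.com/guide', found);
const sameOrigin = filterLinks('https://docs.example.com/guide', found, 'same-origin');
const all = filterLinks('https://docs.example.com/guide', found, 'all');
console.log(sameHostname, sameOrigin, all.length);
```

In this sketch the default keeps the relative link and the `http://` link on the same subdomain, while dropping `blog.example.com` and `other.org`; switching to `'all'` keeps every link.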

<Tabs groupId="crawler-type">
<TabItem value="cheerio_crawler" label="Cheerio Crawler" default>

<RunnableCodeBlock className="language-js" type="cheerio">
	{CheerioSource}
</RunnableCodeBlock>

</TabItem>
<TabItem value="puppeteer_crawler" label="Puppeteer Crawler">

:::tip

To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome image for your Dockerfile.

:::

<RunnableCodeBlock className="language-js" type="puppeteer">
	{PuppeteerSource}
</RunnableCodeBlock>

</TabItem>
<TabItem value="playwright_crawler" label="Playwright Crawler">

:::tip

To run this example on the Apify Platform, select the apify/actor-node-playwright-chrome image for your Dockerfile.

:::

<RunnableCodeBlock className="language-js" type="playwright">
	{PlaywrightSource}
</RunnableCodeBlock>

</TabItem>
</Tabs>