website/versioned_docs/version-3.12/examples/crawl_relative_links.mdx
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import ApiLink from '@site/src/components/ApiLink';
import AllLinksSource from '!!raw-loader!roa-loader!./crawl_relative_links_all.ts'; import SameDomainSource from '!!raw-loader!roa-loader!./crawl_relative_links_same_domain.ts'; import SameHostnameSource from '!!raw-loader!roa-loader!./crawl_relative_links_same_hostname.ts';
When crawling a website, you may encounter different types of links present that you may want to crawl.
To facilitate the easy crawling of such links, we provide the enqueueLinks() method on the crawler context, which will
automatically find links and add them to the crawler's <ApiLink to="core/class/RequestQueue">RequestQueue</ApiLink>.
We provide 3 different strategies for crawling relative links:
:::note
For these examples, we are using the <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink>, however
the same method is available for both the <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">PuppeteerCrawler</ApiLink>
and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">PlaywrightCrawler</ApiLink>, and you use it
the exact same way.
:::
<Tabs groupId="enqueue_strategy"> <TabItem value="all" label="All Links">:::note Example domains
Any urls found will be matched by this strategy, even if they go off of the site you are currently crawling.
:::
<RunnableCodeBlock className="language-js" type="cheerio"> {AllLinksSource} </RunnableCodeBlock> </TabItem> <TabItem value="same_hostname" label="Same Hostname">:::note Example domains
For a url of https://example.com, enqueueLinks() will match relative urls and urls that point to the same
hostname.
This is the default strategy when calling
enqueueLinks(), so you don't have to specify it.
For instance, hyperlinks like https://example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy. But links to any subdomain like https://subdomain.example.com/some/path won't.
:::
<RunnableCodeBlock className="language-js" type="cheerio"> {SameHostnameSource} </RunnableCodeBlock> </TabItem> <TabItem value="same-subdomain" label="Same Subdomain" default>:::note Example domains
For a url of https://subdomain.example.com, enqueueLinks() will match relative urls or urls that point to the same domain name, regardless of their subdomain.
For instance, hyperlinks like https://subdomain.example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy, as well as links to other subdomains or to the naked domain, like https://other-subdomain.example.com or https://example.com will work too.
:::
<RunnableCodeBlock className="language-js" type="cheerio"> {SameDomainSource} </RunnableCodeBlock> </TabItem> </Tabs>