import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import ApiLink from '@site/src/components/ApiLink';
import Admonition from '@theme/Admonition'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock'; import ThemedImage from '@theme/ThemedImage';
import CheerioSource from '!!raw-loader!roa-loader!./quick_start_cheerio.ts'; import PlaywrightSource from '!!raw-loader!roa-loader!./quick_start_playwright.ts'; import PlaywrightHeadful from '!!raw-loader!roa-loader!./headful_playwright.ts'; import PuppeteerSource from '!!raw-loader!roa-loader!./quick_start_puppeteer.ts'; import PuppeteerHeadful from '!!raw-loader!roa-loader!./headful_puppeteer.ts';
import CheerioLog from '!!raw-loader!./quick_start_cheerio.txt';
With this short tutorial you can start scraping with Crawlee in a minute or two. To learn in depth how Crawlee works, read the Introduction, which is a comprehensive step-by-step guide for creating your first scraper.
Crawlee comes with three main crawler classes: <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink>, <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">PuppeteerCrawler</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">PlaywrightCrawler</ApiLink>. All classes share the same interface for maximum flexibility when switching between them.
CheerioCrawler is a plain HTTP crawler. It parses HTML using the Cheerio library and crawls the web using the specialized got-scraping HTTP client, which masquerades as a browser. It's very fast and efficient, but it can't handle JavaScript rendering.
PuppeteerCrawler uses a headless browser controlled by the Puppeteer library. It can control Chromium or Chrome. Puppeteer is the de facto standard in headless browser automation.
PlaywrightCrawler uses a headless browser controlled by Playwright, a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit, and other browsers. If you need a headless browser and aren't already familiar with Puppeteer, go with Playwright.
:::caution before you start
Crawlee requires Node.js 16 or later.
:::
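If you want a script to fail early on an unsupported runtime, the version requirement above can be checked up front. The following is a minimal sketch, not part of Crawlee itself; the helper name `meetsMinimumNode` and the constant `MIN_NODE_MAJOR` are hypothetical and introduced only for illustration.

```typescript
// Minimum major version, taken from the requirement stated above.
const MIN_NODE_MAJOR = 16;

// Hypothetical helper: returns true when a version string such as
// "18.17.1" or "v16.0.0" has a major component of at least `minMajor`.
function meetsMinimumNode(version: string, minMajor: number = MIN_NODE_MAJOR): boolean {
    const match = /^v?(\d+)/.exec(version);
    if (!match) return false; // not a recognizable version string
    return Number(match[1]) >= minMajor;
}

// Check the running Node.js version and report a problem without throwing.
if (!meetsMinimumNode(process.versions.node)) {
    console.error(`Node.js ${MIN_NODE_MAJOR} or later is required, found ${process.versions.node}.`);
    process.exitCode = 1;
}
```

Only the major component is compared, which is enough for a "16 or later" requirement.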
The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.
<CodeBlock language="bash">npx crawlee create my-crawler</CodeBlock>
After the installation is complete you can start the crawler like this:
<CodeBlock language="bash">cd my-crawler && npm start</CodeBlock>
You can add Crawlee to any Node.js project by running:
<Tabs groupId="quick_start">
<TabItem value="cheerio" label="CheerioCrawler" default>
<CodeBlock language="bash">npm install crawlee</CodeBlock>
</TabItem>
<TabItem value="playwright" label="PlaywrightCrawler">

:::caution
`playwright` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with npm.
:::
<CodeBlock language="bash">npm install crawlee playwright</CodeBlock> </TabItem> <TabItem value="puppeteer" label="PuppeteerCrawler">
:::caution
`puppeteer` is not bundled with Crawlee to reduce install size and allow greater flexibility. You need to explicitly install it with npm.
:::
<CodeBlock language="bash">npm install crawlee puppeteer</CodeBlock> </TabItem> </Tabs>
Run the following example to perform a recursive crawl of the Crawlee website using the selected crawler.
<Admonition type="caution" title="Don't forget about module imports"> To run the example, add a <code>"type": "module"</code> clause into your <code>package.json</code> or copy it into a file with an <code>.mjs</code> suffix. This enables <code>import</code> statements in Node.js. See <a href="https://nodejs.org/dist/latest-v16.x/docs/api/esm.html#enabling" target="_blank" rel="noreferrer">Node.js docs</a> for more information. </Admonition>

<Tabs groupId="quick_start">
<TabItem value="cheerio" label="CheerioCrawler" default>
<RunnableCodeBlock className="language-js" type="cheerio">{CheerioSource}</RunnableCodeBlock>
</TabItem>
<TabItem value="playwright" label="PlaywrightCrawler">
<RunnableCodeBlock className="language-js" type="playwright">{PlaywrightSource}</RunnableCodeBlock>
</TabItem>
<TabItem value="puppeteer" label="PuppeteerCrawler">
<RunnableCodeBlock className="language-js" type="puppeteer">{PuppeteerSource}</RunnableCodeBlock>
</TabItem>
</Tabs>

When you run the example, you will see Crawlee automating the data extraction process in your terminal.
<CodeBlock language="log">{CheerioLog}</CodeBlock>
Browsers controlled by Puppeteer and Playwright run headless (without a visible window) by default. You can switch to headful mode by adding the `headless: false` option to the crawler's constructor. This is useful during development, when you want to see what's going on in the browser.
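As a rough sketch of the option described above, a headful `PlaywrightCrawler` might be set up as follows. It assumes `crawlee` and `playwright` are installed and that the file runs as an ES module. The start URL, the `maxRequestsPerCrawl` value, and the body of the request handler are illustrative choices, not the page's own example source.

```typescript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Show the browser window instead of running headless.
    headless: false,
    // Illustrative cap so the demo run stays short (the value is an assumption).
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, page, enqueueLinks, log }) {
        // Read the page title from the live browser page.
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        // Store the result in the default dataset.
        await Dataset.pushData({ title, url: request.loadedUrl });
        // Queue links found on the page for crawling.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```

`PuppeteerCrawler` accepts the same `headless` option, so switching between the two should only require changing the class name and the installed browser library.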
When you run the example code, you'll see an automated browser blaze through the Crawlee website.
:::note
For the sake of this demonstration, we've slowed down the crawler, but rest assured, it's much faster in real-world usage.
:::
<ThemedImage alt="An image showing off Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium" sources={{ light: '/img/chrome-scrape-light.gif', dark: '/img/chrome-scrape-dark.gif', }} />
Crawlee stores data to the ./storage directory in your current working directory. The results of your crawl will be available under ./storage/datasets/default/*.json as JSON files.
{
"url": "https://crawlee.dev/",
"title": "Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee"
}
:::tip
You can override the storage directory by setting the CRAWLEE_STORAGE_DIR environment variable.
:::
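To inspect the results from a separate script, the stored JSON files can be read directly from disk. The sketch below uses only Node.js built-ins and follows the paths described above: it uses `CRAWLEE_STORAGE_DIR` when set, falls back to `./storage`, and then reads `datasets/default/*.json`. The helper names `resolveStorageDir` and `readDefaultDataset` are hypothetical, and the sketch assumes every `.json` file in that directory is a record.

```typescript
import { existsSync, readdirSync, readFileSync } from 'node:fs';
import { join, resolve } from 'node:path';

// Hypothetical helper: pick the storage directory from CRAWLEE_STORAGE_DIR,
// falling back to ./storage in the current working directory.
function resolveStorageDir(env: Record<string, string | undefined> = process.env): string {
    return resolve(env['CRAWLEE_STORAGE_DIR'] || './storage');
}

// Hypothetical helper: parse every .json file of the default dataset.
// Returns an empty array when the dataset directory does not exist yet.
function readDefaultDataset(storageDir: string = resolveStorageDir()): unknown[] {
    const datasetDir = join(storageDir, 'datasets', 'default');
    if (!existsSync(datasetDir)) return [];
    return readdirSync(datasetDir)
        .filter((name) => name.endsWith('.json'))
        .sort() // keep a stable, name-based order
        .map((name) => JSON.parse(readFileSync(join(datasetDir, name), 'utf8')));
}

// Print each stored record, for example { url, title } objects.
for (const item of readDefaultDataset()) {
    console.log(item);
}
```

Reading the files this way is convenient for quick checks after a run. Inside a crawler script, Crawlee's own `Dataset` API is the more natural choice.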
You can find more examples showcasing various features of Crawlee in the Examples section of the documentation. To better understand Crawlee and its components you should read the Introduction step-by-step guide.