Session Management - Crawlee

import ApiLink from '@site/src/components/ApiLink';

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock';

import BasicSource from '!!raw-loader!./session_management_basic.ts'; import HttpSource from '!!raw-loader!./session_management_http.ts'; import CheerioSource from '!!raw-loader!./session_management_cheerio.ts'; import JSDOMSource from '!!raw-loader!./session_management_jsdom.ts'; import PlaywrightSource from '!!raw-loader!./session_management_playwright.ts'; import PuppeteerSource from '!!raw-loader!./session_management_puppeteer.ts'; import StandaloneSource from '!!raw-loader!./session_management_standalone.ts';

<ApiLink to="core/class/SessionPool">SessionPool</ApiLink> is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.

The main benefit of using Session pool is that we can filter out blocked or non-working proxies, so our actor does not retry requests over known blocked/non-working proxies. Another benefit of using SessionPool is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having our cookies and other identifiers used only with a specific IP will reduce the chance of being blocked. The last but not least benefit is the even rotation of IP addresses - SessionPool picks the session randomly, which should prevent burning out a small pool of available IPs.

Check out the avoid blocking guide for more information about blocking.

Now let's take a look at the examples of how to use Session pool:

with <ApiLink to="basic-crawler/class/BasicCrawler">BasicCrawler</ApiLink>;
with <ApiLink to="http-crawler/class/HttpCrawler">HttpCrawler</ApiLink>;
with <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink>;
with <ApiLink to="jsdom-crawler/class/JSDOMCrawler">JSDOMCrawler</ApiLink>;
with <ApiLink to="playwright-crawler/class/PlaywrightCrawler">PlaywrightCrawler</ApiLink>;
with <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">PuppeteerCrawler</ApiLink>;
without a crawler (standalone usage to manage sessions manually).

<Tabs groupId="session_pool"> <TabItem value="basic" label="BasicCrawler"> <CodeBlock language="js"> {BasicSource} </CodeBlock> </TabItem> <TabItem value="http" label="HttpCrawler"> <CodeBlock language="js"> {HttpSource} </CodeBlock> </TabItem> <TabItem value="cheerio" label="CheerioCrawler" default> <CodeBlock language="js"> {CheerioSource} </CodeBlock> </TabItem> <TabItem value="jsdom" label="JSDOMCrawler"> <CodeBlock language="js"> {JSDOMSource} </CodeBlock> </TabItem> <TabItem value="playwright" label="PlaywrightCrawler"> <CodeBlock language="js"> {PlaywrightSource} </CodeBlock> </TabItem> <TabItem value="puppeteer" label="PuppeteerCrawler"> <CodeBlock language="js"> {PuppeteerSource} </CodeBlock> </TabItem> <TabItem value="standalone" label="Standalone"> <CodeBlock language="js"> {StandaloneSource} </CodeBlock> </TabItem> </Tabs>

These are the basics of configuring SessionPool. Please, bear in mind that a Session pool needs time to find working IPs and build up the pool, so we will probably see a lot of errors until it becomes stabilized.