Back to Crawlee

Proxy Management

website/versioned_docs/version-3.12/guides/proxy_management.mdx

3.16.010.3 KB
Original Source

import ApiLink from '@site/src/components/ApiLink';

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock';

import HttpSource from '!!raw-loader!./proxy_management_integration_http.ts'; import JSDOMSource from '!!raw-loader!./proxy_management_integration_jsdom.ts'; import CheerioSource from '!!raw-loader!./proxy_management_integration_cheerio.ts'; import PlaywrightSource from '!!raw-loader!./proxy_management_integration_playwright.ts'; import PuppeteerSource from '!!raw-loader!./proxy_management_integration_puppeteer.ts'; import SessionStandaloneSource from '!!raw-loader!./proxy_management_session_standalone.ts'; import SessionHttpSource from '!!raw-loader!./proxy_management_session_http.ts'; import SessionJSDOMSource from '!!raw-loader!./proxy_management_session_jsdom.ts'; import SessionCheerioSource from '!!raw-loader!./proxy_management_session_cheerio.ts'; import SessionPlaywrightSource from '!!raw-loader!./proxy_management_session_playwright.ts'; import SessionPuppeteerSource from '!!raw-loader!./proxy_management_session_puppeteer.ts'; import InspectionHttpSource from '!!raw-loader!./proxy_management_inspection_http.ts'; import InspectionJSDOMSource from '!!raw-loader!./proxy_management_inspection_jsdom.ts'; import InspectionCheerioSource from '!!raw-loader!./proxy_management_inspection_cheerio.ts'; import InspectionPlaywrightSource from '!!raw-loader!./proxy_management_inspection_playwright.ts'; import InspectionPuppeteerSource from '!!raw-loader!./proxy_management_inspection_puppeteer.ts';

IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in our anti IP blocking arsenal is a proxy server.

With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers.

Check out the avoid blocking guide for more information about blocking.

Quick start

If we already have proxy URLs of our own, we can start using them immediately in only a few lines of code.

javascript
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ]
});
const proxyUrl = await proxyConfiguration.newUrl();

Examples of how to use our proxy URLs with crawlers are shown below in Crawler integration section.

Proxy Configuration

All our proxy needs are managed by the <ApiLink to="core/class/ProxyConfiguration">ProxyConfiguration</ApiLink> class. We create an instance using the ProxyConfiguration <ApiLink to="core/class/ProxyConfiguration#constructor">constructor</ApiLink> function based on the provided options. See the <ApiLink to="core/interface/ProxyConfigurationOptions">ProxyConfigurationOptions</ApiLink> for all the possible constructor options.

Static proxy list

You can provide a static list of proxy URLs to the proxyUrls option. The ProxyConfiguration will then rotate through the provided proxies.

javascript
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
        null // null means no proxy is used
    ]
});

This is the simplest way to use a list of proxies. Crawlee will rotate through the list of proxies in a round-robin fashion.

Custom proxy function

The ProxyConfiguration class allows you to provide a custom function to pick a proxy URL. This is useful when you want to implement your own logic for selecting a proxy.

javascript
const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: (sessionId, { request }) => {
        if (request?.url.includes('crawlee.dev')) {
            return null; // for crawlee.dev, we don't use a proxy
        }

        return 'http://proxy-1.com'; // for all other URLs, we use this proxy
    }
});

The newUrlFunction receives two parameters - sessionId and options - and returns a string containing the proxy URL.

The sessionId parameter is always provided and allows us to differentiate between different sessions - e.g. when Crawlee recognizes your crawlers are being blocked, it will automatically create a new session with a different id.

The options parameter is an object containing a <ApiLink to="core/class/Request">Request</ApiLink>, which is the request that will be made. Note that this object is not always available, for example when we are using the newUrl function directly. Your custom function should therefore not rely on the request object being present and provide a default behavior when it is not.

Tiered proxies

You can also provide a list of proxy tiers to the ProxyConfiguration class. This is useful when you want to switch between different proxies automatically based on the blocking behavior of the website.

:::warning

Note that the tieredProxyUrls option requires ProxyConfiguration to be used from a crawler instance (see below).

Using this configuration through the newUrl calls will not yield the expected results.

:::

javascript
const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        [null], // At first, we try to connect without a proxy
        ['http://okay-proxy.com'],
        ['http://slightly-better-proxy.com', 'http://slightly-better-proxy-2.com'],
        ['http://very-good-and-expensive-proxy.com'],
    ]
});

This configuration will start with no proxy, then switch to http://okay-proxy.com if Crawlee recognizes we're getting blocked by the target website. If that proxy is also blocked, we will switch to one of the slightly-better-proxy URLs. If those are blocked, we will switch to the very-good-and-expensive-proxy.com URL.

Crawlee also periodically probes lower tier proxies to see if they are unblocked, and if they are, it will switch back to them.

Crawler integration

ProxyConfiguration integrates seamlessly into <ApiLink to="http-crawler/class/HttpCrawler">HttpCrawler</ApiLink>, <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink>, <ApiLink to="jsdom-crawler/class/JSDOMCrawler">JSDOMCrawler</ApiLink>, <ApiLink to="playwright-crawler/class/PlaywrightCrawler">PlaywrightCrawler</ApiLink> and <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">PuppeteerCrawler</ApiLink>.

<Tabs groupId="proxy_session_management"> <TabItem value="http" label="HttpCrawler"> <CodeBlock language="js"> {HttpSource} </CodeBlock> </TabItem> <TabItem value="cheerio" label="CheerioCrawler" default> <CodeBlock language="js"> {CheerioSource} </CodeBlock> </TabItem> <TabItem value="jsdom" label="JSDOMCrawler"> <CodeBlock language="js"> {JSDOMSource} </CodeBlock> </TabItem> <TabItem value="playwright" label="PlaywrightCrawler"> <CodeBlock language="js"> {PlaywrightSource} </CodeBlock> </TabItem> <TabItem value="puppeteer" label="PuppeteerCrawler"> <CodeBlock language="js"> {PuppeteerSource} </CodeBlock> </TabItem> </Tabs>

Our crawlers will now use the selected proxies for all connections.

IP Rotation and session management

<ApiLink to="core/class/ProxyConfiguration#newUrl">proxyConfiguration.newUrl()</ApiLink> allows us to pass a sessionId parameter. It will then be used to create a sessionId-proxyUrl pair, and subsequent newUrl() calls with the same sessionId will always return the same proxyUrl. This is extremely useful in scraping, because we want to create the impression of a real user. See the session management guide and <ApiLink to="core/class/SessionPool">SessionPool</ApiLink> class for more information on how keeping a real session helps us avoid blocking.

When no sessionId is provided, our proxy URLs are rotated round-robin.

<Tabs groupId="proxy_session_management"> <TabItem value="http" label="HttpCrawler"> <CodeBlock language="js"> {SessionHttpSource} </CodeBlock> </TabItem> <TabItem value="cheerio" label="CheerioCrawler" default> <CodeBlock language="js"> {SessionCheerioSource} </CodeBlock> </TabItem> <TabItem value="jsdom" label="JSDOMCrawler"> <CodeBlock language="js"> {SessionJSDOMSource} </CodeBlock> </TabItem> <TabItem value="playwright" label="PlaywrightCrawler"> <CodeBlock language="js"> {SessionPlaywrightSource} </CodeBlock> </TabItem> <TabItem value="puppeteer" label="PuppeteerCrawler"> <CodeBlock language="js"> {SessionPuppeteerSource} </CodeBlock> </TabItem> <TabItem value="standalone" label="Standalone"> <CodeBlock language="js"> {SessionStandaloneSource} </CodeBlock> </TabItem> </Tabs>

Inspecting current proxy in Crawlers

HttpCrawler, CheerioCrawler, JSDOMCrawler, PlaywrightCrawler and PuppeteerCrawler grant access to information about the currently used proxy in their requestHandler using a <ApiLink to="core/interface/ProxyInfo">proxyInfo</ApiLink> object. With the proxyInfo object, we can easily access the proxy URL.

<Tabs groupId="proxy_session_management"> <TabItem value="http" label="HttpCrawler"> <CodeBlock language="js"> {InspectionHttpSource} </CodeBlock> </TabItem> <TabItem value="cheerio" label="CheerioCrawler" default> <CodeBlock language="js"> {InspectionCheerioSource} </CodeBlock> </TabItem> <TabItem value="jsdom" label="JSDOMCrawler"> <CodeBlock language="js"> {InspectionJSDOMSource} </CodeBlock> </TabItem> <TabItem value="playwright" label="PlaywrightCrawler"> <CodeBlock language="js"> {InspectionPlaywrightSource} </CodeBlock> </TabItem> <TabItem value="puppeteer" label="PuppeteerCrawler"> <CodeBlock language="js"> {InspectionPuppeteerSource} </CodeBlock> </TabItem> </Tabs>