import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock';
import HttpSource from '!!raw-loader!./proxy_management_integration_http.ts'; import JSDOMSource from '!!raw-loader!./proxy_management_integration_jsdom.ts'; import CheerioSource from '!!raw-loader!./proxy_management_integration_cheerio.ts'; import PlaywrightSource from '!!raw-loader!./proxy_management_integration_playwright.ts'; import PuppeteerSource from '!!raw-loader!./proxy_management_integration_puppeteer.ts'; import SessionStandaloneSource from '!!raw-loader!./proxy_management_session_standalone.ts'; import SessionHttpSource from '!!raw-loader!./proxy_management_session_http.ts'; import SessionJSDOMSource from '!!raw-loader!./proxy_management_session_jsdom.ts'; import SessionCheerioSource from '!!raw-loader!./proxy_management_session_cheerio.ts'; import SessionPlaywrightSource from '!!raw-loader!./proxy_management_session_playwright.ts'; import SessionPuppeteerSource from '!!raw-loader!./proxy_management_session_puppeteer.ts'; import InspectionHttpSource from '!!raw-loader!./proxy_management_inspection_http.ts'; import InspectionJSDOMSource from '!!raw-loader!./proxy_management_inspection_jsdom.ts'; import InspectionCheerioSource from '!!raw-loader!./proxy_management_inspection_cheerio.ts'; import InspectionPlaywrightSource from '!!raw-loader!./proxy_management_inspection_playwright.ts'; import InspectionPuppeteerSource from '!!raw-loader!./proxy_management_inspection_puppeteer.ts';
IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy-to-use yet powerful tools that can work around IP blocking. The most powerful weapon in our anti-blocking arsenal is the proxy server.
With Crawlee we can use our own proxy servers or proxy servers acquired from third-party providers.
Check out the avoid blocking guide for more information about blocking.
If we already have proxy URLs of our own, we can start using them immediately in only a few lines of code.
```ts
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});

const proxyUrl = await proxyConfiguration.newUrl();
```
Examples of how to use our proxy URLs with crawlers are shown in the Crawler integration section below.
All our proxy needs are managed by the <ApiLink to="core/class/ProxyConfiguration">ProxyConfiguration</ApiLink> class. We create an instance with the ProxyConfiguration <ApiLink to="core/class/ProxyConfiguration#constructor">constructor</ApiLink>, based on the provided options. See <ApiLink to="core/interface/ProxyConfigurationOptions">ProxyConfigurationOptions</ApiLink> for all the possible constructor options.
ProxyConfiguration integrates seamlessly into <ApiLink to="http-crawler/class/HttpCrawler">HttpCrawler</ApiLink>, <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink>, <ApiLink to="jsdom-crawler/class/JSDOMCrawler">JSDOMCrawler</ApiLink>, <ApiLink to="playwright-crawler/class/PlaywrightCrawler">PlaywrightCrawler</ApiLink> and <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">PuppeteerCrawler</ApiLink>.
Our crawlers will now use the selected proxies for all connections.
<ApiLink to="core/class/ProxyConfiguration#newUrl">proxyConfiguration.newUrl()</ApiLink> allows us to pass a sessionId parameter. It will then be used to create a sessionId-proxyUrl pair, and subsequent newUrl() calls with the same sessionId will always return the same proxyUrl. This is extremely useful in scraping, because we want to create the impression of a real user. See the session management guide and <ApiLink to="core/class/SessionPool">SessionPool</ApiLink> class for more information on how keeping a real session helps us avoid blocking.
When no sessionId is provided, our proxy URLs are rotated round-robin.
HttpCrawler, CheerioCrawler, JSDOMCrawler, PlaywrightCrawler and PuppeteerCrawler grant access to information about the currently used proxy
in their requestHandler using a <ApiLink to="core/interface/ProxyInfo">proxyInfo</ApiLink> object.
With the proxyInfo object, we can easily access the proxy URL.