agent-skill/Scrapling-Skill/references/spiders/proxy-blocking.md
Scrapling's ProxyRotator manages proxy rotation across requests. It works with all session types and integrates with the spider's blocked request retry system.
The ProxyRotator class manages a list of proxies and rotates through them automatically. Pass it to any session type via the proxy_rotator parameter:
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession, ProxyRotator


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def configure_sessions(self, manager):
        rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://user:pass@proxy3:8080",
        ])
        manager.add("default", FetcherSession(proxy_rotator=rotator))

    async def parse(self, response: Response):
        # Check which proxy was used
        print(f"Proxy used: {response.meta.get('proxy')}")
        yield {"title": response.css("title::text").get("")}
```
Each request automatically gets the next proxy in the rotation. The proxy used is stored in response.meta["proxy"] so you can track which proxy fetched which page.
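Because the proxy is recorded on each response, you can aggregate per-proxy statistics inside your callbacks. A minimal sketch (the `record_proxy` helper and `proxy_stats` counter are my own names, not part of Scrapling) that tallies how many pages each proxy fetched:

```python
from collections import Counter

# Tally of pages fetched per proxy, keyed by the value the rotator
# stores in response.meta["proxy"]
proxy_stats = Counter()


def record_proxy(meta):
    # meta is the response.meta dict; "unknown" covers responses
    # where no proxy was recorded
    proxy_stats[meta.get("proxy", "unknown")] += 1
```

Calling `record_proxy(response.meta)` at the top of `parse()` would let you dump `proxy_stats` when the spider finishes to see how traffic was distributed.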
Browser sessions support both string and dict proxy formats:
```python
from scrapling.fetchers import AsyncDynamicSession, AsyncStealthySession, ProxyRotator

# String proxies work for all session types
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

# Dict proxies (Playwright format) work for browser sessions
rotator = ProxyRotator([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080"},
])


# Then inside the spider
def configure_sessions(self, manager):
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    manager.add("browser", AsyncStealthySession(proxy_rotator=rotator))
```
Important:
- Don't pass the proxy_rotator argument together with the static proxy or proxies parameters on the same session. Pick one approach when configuring the session, and override it per request later if needed.
- When a browser session uses a ProxyRotator, the fetcher automatically opens a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed.
- By default, ProxyRotator uses cyclic rotation: it iterates through the proxies sequentially, wrapping around at the end.
You can provide a custom strategy function to change this behavior, but it must match the following signature:
```python
from scrapling.core._types import ProxyType


def my_strategy(proxies: list, current_index: int) -> tuple[ProxyType, int]:
    ...
```
It receives the list of proxies and the current index, and must return the chosen proxy and the next index.
Below are some examples of custom rotation strategies you can use.
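For reference, the default cyclic behavior described above can itself be written as a strategy function. This is a sketch based only on the documented signature, not Scrapling's actual internal implementation:

```python
def cyclic_strategy(proxies, current_index):
    # Pick the proxy at the current position, then advance,
    # wrapping back to the start after the last proxy
    proxy = proxies[current_index]
    next_index = (current_index + 1) % len(proxies)
    return proxy, next_index
```

Feeding the returned index back into the next call walks the list in order: proxy1, proxy2, proxy3, proxy1, and so on.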
```python
import random

from scrapling.fetchers import ProxyRotator


def random_strategy(proxies, current_index):
    idx = random.randint(0, len(proxies) - 1)
    return proxies[idx], idx


rotator = ProxyRotator(
    ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"],
    strategy=random_strategy,
)
```
```python
import random


def weighted_strategy(proxies, current_index):
    # First proxy gets 60% of traffic, others split the rest
    weights = [60] + [40 // (len(proxies) - 1)] * (len(proxies) - 1)
    proxy = random.choices(proxies, weights=weights, k=1)[0]
    return proxy, current_index  # Index doesn't matter for weighted


rotator = ProxyRotator(proxies, strategy=weighted_strategy)
```
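Another pattern worth sketching is a "sticky" strategy that reuses each proxy for several consecutive requests before rotating, which can help with sites that tie state to an IP. Since the strategy only receives `(proxies, current_index)`, the use counter lives in a closure; `make_sticky_strategy` is a hypothetical helper, not a Scrapling API:

```python
def make_sticky_strategy(requests_per_proxy=10):
    state = {"uses": 0}  # how many times the current proxy has been handed out

    def sticky_strategy(proxies, current_index):
        proxy = proxies[current_index]
        state["uses"] += 1
        if state["uses"] >= requests_per_proxy:
            # Quota reached: reset the counter and advance to the next proxy
            state["uses"] = 0
            return proxy, (current_index + 1) % len(proxies)
        return proxy, current_index

    return sticky_strategy
```

You would then pass `strategy=make_sticky_strategy(10)` to the ProxyRotator constructor, the same way as in the examples above.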
You can override the rotator for individual requests by passing proxy= as a keyword argument:
```python
async def parse(self, response: Response):
    # This request uses the rotator's next proxy
    yield response.follow("/page1", callback=self.parse_page)

    # This request uses a specific proxy, bypassing the rotator
    yield response.follow(
        "/special-page",
        callback=self.parse_page,
        proxy="http://special-proxy:8080",
    )
```
This is useful when certain pages require a specific proxy (e.g., a geo-located proxy for region-specific content).
The spider has built-in blocked request detection and retry. By default, it considers the following HTTP status codes blocked: 401, 403, 407, 429, 444, 500, 502, 503, 504.
The retry system works like this:
1. Every response is checked with the is_blocked(response) method.
2. If it's blocked, the request is passed to the retry_blocked_request() method so you can modify it before retrying.
3. The retry is scheduled with dont_filter=True (bypassing deduplication) and lower priority, so it's not retried right away.
4. A request is retried at most max_blocked_retries times (default: 3).

Tip:
- On retry, the proxy/proxies kwargs are cleared from the request automatically, so the rotator assigns a fresh proxy.
- The max_blocked_retries attribute is different from the session retries and doesn't share its counter.

Override is_blocked() to add your own detection logic:
```python
class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    async def is_blocked(self, response: Response) -> bool:
        # Check status codes (default behavior)
        if response.status in {403, 429, 503}:
            return True
        # Check response content
        body = response.body.decode("utf-8", errors="ignore")
        if "access denied" in body.lower() or "rate limit" in body.lower():
            return True
        return False

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
Override retry_blocked_request() to modify the request before retrying. The max_blocked_retries attribute controls how many times a blocked request is retried (default: 3):
```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari']))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
In the example above, I left the blocking detection logic unchanged: the spider uses the plain requests session until a request gets blocked, then retries that request through the stealthy browser session.
Putting it all together:
```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession, ProxyRotator

cheap_proxies = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])

# A format acceptable by the browser
expensive_proxies = ProxyRotator([
    {"server": "http://residential_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://residential_proxy2:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy2:8080", "username": "user", "password": "pass"},
])


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari'], proxy_rotator=cheap_proxies))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True, proxy_rotator=expensive_proxies), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
The logic above: requests go out through cheap proxies (e.g., datacenter proxies) until they get blocked, then are retried through higher-quality proxies (residential or mobile) via the stealthy browser session.