The Scrapling MCP server exposes ten tools over the Model Context Protocol (MCP). It supports CSS-selector-based content narrowing (extracting only the relevant elements before returning results, which reduces token usage), three levels of scraping capability (plain HTTP, browser-rendered, and stealth/anti-bot bypass), persistent browser session management, and page screenshots returned as real image content blocks.
Each single-URL scraping tool returns a ResponseModel with the fields status (int), content (list of strings), and url (str); the bulk tools return a list of ResponseModel. The screenshot tool returns a list of MCP content blocks: an ImageContent (the screenshot bytes) followed by a TextContent (the post-redirect URL).
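As a rough illustration of the result shape described above, the sketch below models it as a Python dataclass. Only the three field names and their types come from the text; the class name `ScrapeResult`, the helper `parse_result`, and the sample data are hypothetical, and how a given MCP client actually surfaces the result may differ.

```python
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    """Hypothetical client-side mirror of the documented ResponseModel fields."""
    status: int          # documented as int
    content: list[str]   # documented as a list of strings
    url: str             # documented as str


def parse_result(raw: dict) -> ScrapeResult:
    # Assumes the client hands the model back as a plain dict.
    return ScrapeResult(
        status=int(raw["status"]),
        content=list(raw["content"]),
        url=str(raw["url"]),
    )


# Made-up sample payload, for illustration only.
sample = {"status": 200, "content": ["# Title", "Body text"], "url": "https://example.com/"}
result = parse_result(sample)
print(result.status, len(result.content), result.url)
```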
get -- HTTP request (single URL)

Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable for static pages with no/low bot protection.
Key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | URL to fetch |
| extraction_type | "markdown" / "html" / "text" | "markdown" | Output format |
| css_selector | str or null | null | CSS selector to narrow content (applied after main_content_only) |
| main_content_only | bool | true | Restrict to `<body>` content |
| impersonate | str | "chrome" | Browser fingerprint to impersonate |
| proxy | str or null | null | Proxy URL, e.g. "http://user:pass@host:port" |
| proxy_auth | dict or null | null | {"username": "...", "password": "..."} |
| auth | dict or null | null | HTTP basic auth, same format as proxy_auth |
| timeout | number | 30 | Seconds before timeout |
| retries | int | 3 | Retry attempts on failure |
| retry_delay | int | 1 | Seconds between retries |
| stealthy_headers | bool | true | Generate realistic browser headers and Google referer |
| http3 | bool | false | Use HTTP/3 (may conflict with impersonate) |
| follow_redirects | bool or "safe" | "safe" | Follow redirects. "safe" rejects redirects to internal/private IPs |
| max_redirects | int | 30 | Max redirects (-1 for unlimited) |
| headers | dict or null | null | Custom request headers |
| cookies | dict or null | null | Request cookies |
| params | dict or null | null | Query string parameters |
| verify | bool | true | Verify HTTPS certificates |
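
To show how the parameters above map onto an actual call, here is a minimal sketch of the JSON-RPC 2.0 `tools/call` request that MCP defines for invoking a tool. Most MCP clients build this envelope for you; the helper name `tool_call`, the URL, and the selector are placeholders chosen for illustration.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC ids only need to be unique per connection


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call(
    "get",
    {
        "url": "https://example.com/article",  # placeholder URL
        "css_selector": "article",             # hypothetical selector
        "extraction_type": "text",
        "timeout": 15,                          # seconds for `get`
    },
)
print(json.dumps(request, indent=2))
```

Parameters left out of `arguments` fall back to the defaults listed in the table.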
bulk_get -- HTTP request (multiple URLs)

Async concurrent version of get. Same parameters except url is replaced by urls (list of strings). All URLs are fetched in parallel. Returns a list of ResponseModel.
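Since the only difference from get is the `url` to `urls` swap, arguments that already work for a single URL can be reused. A small sketch, with `to_bulk_arguments` as a hypothetical helper and placeholder URLs:

```python
def to_bulk_arguments(single_args: dict, urls: list[str]) -> dict:
    """Turn arguments for a single-URL tool into arguments for its bulk variant."""
    # Drop the single `url` key and keep every other option unchanged.
    bulk_args = {key: value for key, value in single_args.items() if key != "url"}
    bulk_args["urls"] = list(urls)
    return bulk_args


get_args = {"url": "https://example.com/a", "extraction_type": "markdown", "retries": 2}
bulk_get_args = to_bulk_arguments(
    get_args,
    ["https://example.com/a", "https://example.com/b"],  # placeholder URLs
)
print(bulk_get_args)
```

The same transformation applies to the other single/bulk tool pairs, since each bulk tool is described as taking the same parameters apart from `urls`.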
fetch -- Browser fetch (single URL)

Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynamic/SPA pages with no/low bot protection.
Key parameters (beyond shared ones):
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | URL to fetch |
| extraction_type | str | "markdown" | "markdown" / "html" / "text" |
| css_selector | str or null | null | Narrow content before extraction |
| main_content_only | bool | true | Restrict to `<body>` |
| headless | bool | true | Run browser hidden (true) or visible (false) |
| proxy | str or dict or null | null | String URL or {"server": "...", "username": "...", "password": "..."} |
| timeout | number | 30000 | Timeout in milliseconds |
| wait | number | 0 | Extra wait (ms) after page load before extraction |
| wait_selector | str or null | null | CSS selector to wait for before extraction |
| wait_selector_state | str | "attached" | State for wait_selector: "attached" / "visible" / "hidden" / "detached" |
| network_idle | bool | false | Wait until no network activity for 500ms |
| disable_resources | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
| google_search | bool | true | Set a Google referer header |
| real_chrome | bool | false | Use locally installed Chrome instead of bundled Chromium |
| cdp_url | str or null | null | Connect to existing browser via CDP URL |
| extra_headers | dict or null | null | Additional request headers |
| useragent | str or null | null | Custom user-agent (auto-generated if null) |
| cookies | list or null | null | Playwright-format cookies |
| timezone_id | str or null | null | Browser timezone, e.g. "America/New_York" |
| locale | str or null | null | Browser locale, e.g. "en-GB" |
| session_id | str or null | null | Reuse a persistent session from open_session instead of creating a new browser |
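
One detail that is easy to miss: `timeout` and `wait` are in milliseconds for the browser tools, whereas `get` takes its `timeout` in seconds. The sketch below builds a fetch argument set with an explicit conversion; `seconds_to_ms`, the URL, and the selector are illustrative assumptions.

```python
def seconds_to_ms(seconds: float) -> int:
    """Convert seconds to the milliseconds the browser tools expect."""
    return int(round(seconds * 1000))


fetch_args = {
    "url": "https://example.com/app",   # placeholder URL
    "wait_selector": "#results .row",   # hypothetical selector
    "wait_selector_state": "visible",
    "network_idle": True,
    "disable_resources": True,          # skip fonts, images, media, etc.
    "timeout": seconds_to_ms(45),
    "wait": seconds_to_ms(0.5),
}
print(fetch_args["timeout"], fetch_args["wait"])
```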
bulk_fetch -- Browser fetch (multiple URLs)

Concurrent browser version of fetch. Same parameters (including session_id) except url is replaced by urls (list of strings). Each URL opens in a separate browser tab. Returns a list of ResponseModel.
stealthy_fetch -- Stealth browser fetch (single URL)

Anti-bot bypass fetcher with fingerprint spoofing. Use this for sites with Cloudflare Turnstile/Interstitial or other strong protections.
Additional parameters (beyond those in fetch):
| Parameter | Type | Default | Description |
|---|---|---|---|
| solve_cloudflare | bool | false | Automatically solve Cloudflare Turnstile/Interstitial challenges |
| hide_canvas | bool | false | Add noise to canvas operations to prevent fingerprinting |
| block_webrtc | bool | false | Force WebRTC to respect proxy settings (prevents IP leak) |
| allow_webgl | bool | true | Keep WebGL enabled (disabling is detectable by WAFs) |
| additional_args | dict or null | null | Extra Playwright context args (overrides Scrapling defaults) |
| session_id | str or null | null | Reuse a persistent stealthy session from open_session |
All parameters from fetch are also accepted.
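Because stealthy_fetch accepts everything fetch does, one way to escalate is to layer the stealth-only options on top of an existing fetch argument set. This is a sketch only: `stealthy_arguments` is a hypothetical helper, and the option names are taken from the table above.

```python
# Stealth-only option names, as listed in the table above.
STEALTH_ONLY = {"solve_cloudflare", "hide_canvas", "block_webrtc", "allow_webgl", "additional_args"}


def stealthy_arguments(fetch_args: dict, **stealth_options) -> dict:
    """Layer stealth-only options on top of arguments that already work with `fetch`."""
    unknown = set(stealth_options) - STEALTH_ONLY
    if unknown:
        raise ValueError(f"not a documented stealth option: {sorted(unknown)}")
    # Return a new dict so the original fetch arguments stay untouched.
    return {**fetch_args, **stealth_options}


base = {"url": "https://example.com/protected", "network_idle": True}  # placeholder URL
stealth_args = stealthy_arguments(base, solve_cloudflare=True, block_webrtc=True)
print(stealth_args)
```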
bulk_stealthy_fetch -- Stealth browser fetch (multiple URLs)

Concurrent stealth version. Same parameters (including session_id) as stealthy_fetch except url is replaced by urls (list of strings). Returns a list of ResponseModel.
open_session -- Create a persistent browser session

Opens a browser session that stays alive across multiple fetch calls, avoiding the overhead of launching a new browser each time. Returns a SessionCreatedModel with session_id, session_type, created_at, is_alive, and message.
Key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| session_type | "dynamic" / "stealthy" | required | Type of browser session to create |
| session_id | str or null | null | Custom ID for the session. If omitted, a random 12-char hex ID is generated. Raises if already in use |
| headless | bool | true | Run browser hidden or visible |
| max_pages | int | 5 | Max concurrent browser tabs (1-50) |
| proxy | str or dict or null | null | Proxy for all requests in this session |
| timeout | number | 30000 | Default timeout in ms |
| solve_cloudflare | bool | false | (Stealthy only) Auto-solve Cloudflare challenges |
| hide_canvas | bool | false | (Stealthy only) Canvas fingerprint noise |
| block_webrtc | bool | false | (Stealthy only) Block WebRTC IP leak |
| allow_webgl | bool | true | (Stealthy only) Keep WebGL enabled |
Plus all other browser session parameters (google_search, real_chrome, cdp_url, locale, timezone_id, useragent, extra_headers, cookies, disable_resources, network_idle, wait_selector, wait_selector_state).
A dynamic session can only be used with fetch/bulk_fetch. A stealthy session can only be used with stealthy_fetch/bulk_stealthy_fetch.
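The pairing rule above, together with the open/use/close lifecycle, can be sketched as a plan of tool calls. Nothing here talks to a server; `is_compatible`, `session_calls`, the session ID, and the URLs are hypothetical names used only for illustration. The screenshot tool is left out of the mapping because it accepts both session types.

```python
# Scraping tools each session type can be used with, per the rule above.
COMPATIBLE_TOOLS = {
    "dynamic": ("fetch", "bulk_fetch"),
    "stealthy": ("stealthy_fetch", "bulk_stealthy_fetch"),
}


def is_compatible(session_type: str, tool: str) -> bool:
    return tool in COMPATIBLE_TOOLS.get(session_type, ())


def session_calls(session_type: str, urls: list[str], session_id: str) -> list[tuple[str, dict]]:
    """Return the ordered (tool, arguments) pairs for one session lifecycle."""
    if session_type not in COMPATIBLE_TOOLS:
        raise ValueError(f"unknown session_type: {session_type!r}")
    single_url_tool = COMPATIBLE_TOOLS[session_type][0]
    calls = [("open_session", {"session_type": session_type, "session_id": session_id})]
    for url in urls:
        calls.append((single_url_tool, {"url": url, "session_id": session_id}))
    # Always finish by closing the session to free browser resources.
    calls.append(("close_session", {"session_id": session_id}))
    return calls


plan = session_calls("stealthy", ["https://example.com/1", "https://example.com/2"], "demo-session")
for tool, arguments in plan:
    print(tool, arguments)
```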
close_session -- Close a persistent browser session

Closes a session and frees its browser resources. Always close sessions when done.
| Parameter | Type | Default | Description |
|---|---|---|---|
| session_id | str | required | Session ID from open_session |
Returns a SessionClosedModel with session_id and message.
list_sessions -- List active sessions

Returns a list of SessionInfo objects, each with session_id, session_type, created_at, and is_alive.
No parameters.
screenshot -- Capture a page screenshot

Navigates to a URL inside an existing browser session and returns the screenshot as an MCP ImageContent block (the bytes the model can see directly, not a base64 string in JSON) followed by a TextContent block carrying the post-redirect URL.
Requires an open browser session. Call open_session first, then pass the session_id here. Both dynamic and stealthy sessions are accepted.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | URL to navigate to and capture |
| session_id | str | required | ID of an open browser session created with open_session |
| image_type | "png" / "jpeg" | "png" | Image format. Use "jpeg" for smaller payloads |
| full_page | bool | false | Capture the full scrollable page instead of just the viewport |
| quality | int or null | null | JPEG quality 0-100. Raises if passed with image_type="png" |
| wait | number | 0 | Extra wait (ms) after page load before capture |
| wait_selector | str or null | null | CSS selector to wait for before capture |
| wait_selector_state | str | "attached" | State for wait_selector: "attached" / "visible" / "hidden" / "detached" |
| network_idle | bool | false | Wait until no network activity for 500ms |
| timeout | number | 30000 | Timeout in milliseconds |
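
The `quality` rule is a common stumbling block, so it can help to check it before sending the call. The sketch below mirrors the documented constraints on the client side; `screenshot_arguments` is a hypothetical helper, and the exact error the server raises is not specified here.

```python
from typing import Optional


def screenshot_arguments(
    url: str,
    session_id: str,
    image_type: str = "png",
    quality: Optional[int] = None,
    full_page: bool = False,
) -> dict:
    """Build screenshot arguments, rejecting combinations the table rules out."""
    if image_type not in ("png", "jpeg"):
        raise ValueError("image_type must be 'png' or 'jpeg'")
    if quality is not None:
        if image_type == "png":
            raise ValueError("quality only applies to image_type='jpeg'")
        if not 0 <= quality <= 100:
            raise ValueError("quality must be between 0 and 100")
    args = {"url": url, "session_id": session_id, "image_type": image_type, "full_page": full_page}
    if quality is not None:
        args["quality"] = quality
    return args


shot = screenshot_arguments(
    "https://example.com/", "demo-session",  # placeholder URL and session ID
    image_type="jpeg", quality=70, full_page=True,
)
print(shot)
```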
| Scenario | Tool |
|---|---|
| Static page, no bot protection | get |
| Multiple static pages | bulk_get |
| JavaScript-rendered / SPA page | fetch |
| Multiple JS-rendered pages | bulk_fetch |
| Cloudflare or strong anti-bot protection | stealthy_fetch (with solve_cloudflare=true for Turnstile) |
| Multiple protected pages | bulk_stealthy_fetch |
| Multiple pages from the same site | open_session + fetch/stealthy_fetch with session_id |
| Need a screenshot of a page | open_session + screenshot with session_id |
Start with get (fastest, lowest resource cost). Escalate to fetch if content requires JS rendering. Escalate to stealthy_fetch only if blocked. For multiple pages from the same site, use a persistent session to avoid browser launch overhead.
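The escalation order above can be written down as a small decision function. This is only a sketch of the guidance in the text: `choose_tool` and its flags are hypothetical, and deciding whether a page "needs JS" or "is blocked" is left to the caller.

```python
def choose_tool(needs_js: bool = False, blocked: bool = False, many_urls: bool = False) -> str:
    """Pick the cheapest tool that should work, following the escalation order above."""
    if blocked:
        tool = "stealthy_fetch"   # only once a lighter tool has been blocked
    elif needs_js:
        tool = "fetch"            # content requires JavaScript rendering
    else:
        tool = "get"              # fastest, lowest resource cost
    return f"bulk_{tool}" if many_urls else tool


print(choose_tool())
print(choose_tool(needs_js=True, many_urls=True))
print(choose_tool(blocked=True))
```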
- Use css_selector to narrow results before they reach the model -- this saves significant tokens.
- main_content_only=true (default) strips nav/footer by restricting to `<body>`.
- extraction_type="markdown" (default) is best for readability. Use "text" for minimal output, "html" when structure matters.
- If css_selector matches multiple elements, all are returned in the content list.

When main_content_only=true (the default), the server automatically sanitizes scraped content to prevent prompt injection from malicious websites. It strips:

- Elements hidden with CSS (display:none, visibility:hidden, opacity:0, font-size:0, height:0, width:0)
- aria-hidden="true" elements
- `<template>` tags

Keep main_content_only=true for maximum protection.
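To make the hidden-element rule concrete, here is a deliberately simplified check for the CSS declarations listed above. It is not Scrapling's implementation, which is not shown in this document: the function name `hides_element` is hypothetical, and the sketch only does exact matching on an inline `style` string (so a value such as `0px` would not match).

```python
# (property, value) pairs taken from the list of hiding declarations above.
HIDING_DECLARATIONS = {
    ("display", "none"),
    ("visibility", "hidden"),
    ("opacity", "0"),
    ("font-size", "0"),
    ("height", "0"),
    ("width", "0"),
}


def hides_element(style: str) -> bool:
    """Return True if an inline style string contains one of the hiding declarations."""
    for declaration in style.split(";"):
        prop, separator, value = declaration.partition(":")
        if not separator:
            continue  # empty or malformed declaration
        if (prop.strip().lower(), value.strip().lower()) in HIDING_DECLARATIONS:
            return True
    return False


print(hides_element("color: red; display: none"))
print(hides_element("opacity: 0.5"))
```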
All browser-based tools (fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch) and persistent sessions (open_session) automatically block requests to ~3,500 known ad and tracker domains. This is always enabled in the MCP server to save tokens and speed up page loads. No configuration needed.
Start the server (stdio transport, used by most MCP clients):
scrapling mcp
Or with Streamable HTTP transport:
scrapling mcp --http
scrapling mcp --http --host 127.0.0.1 --port 8000
Docker alternative:
docker pull pyd4vinci/scrapling
docker run -i --rm pyd4vinci/scrapling mcp
The MCP server name when registering with a client is ScraplingServer. The command is the path to the scrapling binary and the argument is mcp.
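As an illustration of that registration, the sketch below generates a client configuration entry. The `mcpServers` layout is an assumption based on the convention several MCP clients use for their JSON config files; check your client's documentation for the exact file location and schema. The fallback path is a placeholder.

```python
import json
import shutil

# Resolve the scrapling binary on PATH; the fallback is a placeholder to edit by hand.
command = shutil.which("scrapling") or "/path/to/scrapling"

config = {
    "mcpServers": {                      # assumed top-level key; client-specific
        "ScraplingServer": {
            "command": command,
            "args": ["mcp"],
        }
    }
}
print(json.dumps(config, indent=2))
```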