Scrapling MCP Server

The Scrapling MCP server exposes ten tools over the MCP protocol. It supports CSS-selector-based content narrowing (reducing tokens by extracting only relevant elements before returning results), three levels of scraping capability (plain HTTP, browser-rendered, and stealth/anti-bot bypass), persistent browser session management, and page screenshots returned as real image content blocks.

All scraping tools return a ResponseModel with fields: status (int), content (list of strings), url (str). The screenshot tool returns a list of MCP content blocks: an ImageContent (the screenshot bytes) followed by a TextContent (the post-redirect URL).

Tools

`get` -- HTTP request (single URL)

Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable for static pages with no/low bot protection.

Key parameters:

Parameter	Type	Default	Description
`url`	str	required	URL to fetch
`extraction_type`	`"markdown"` / `"html"` / `"text"`	`"markdown"`	Output format
`css_selector`	str or null	null	CSS selector to narrow content (applied after `main_content_only`)
`main_content_only`	bool	true	Restrict to `<body>` content
`impersonate`	str	`"chrome"`	Browser fingerprint to impersonate
`proxy`	str or null	null	Proxy URL, e.g. `"http://user:pass@host:port"`
`proxy_auth`	dict or null	null	`{"username": "...", "password": "..."}`
`auth`	dict or null	null	HTTP basic auth, same format as proxy_auth
`timeout`	number	30	Seconds before timeout
`retries`	int	3	Retry attempts on failure
`retry_delay`	int	1	Seconds between retries
`stealthy_headers`	bool	true	Generate realistic browser headers and Google referer
`http3`	bool	false	Use HTTP/3 (may conflict with `impersonate`)
`follow_redirects`	bool or "safe"	"safe"	Follow redirects. "safe" rejects redirects to internal/private IPs
`max_redirects`	int	30	Max redirects (-1 for unlimited)
`headers`	dict or null	null	Custom request headers
`cookies`	dict or null	null	Request cookies
`params`	dict or null	null	Query string parameters
`verify`	bool	true	Verify HTTPS certificates

`bulk_get` -- HTTP request (multiple URLs)

Async concurrent version of get. Same parameters except url is replaced by urls (list of strings). All URLs are fetched in parallel. Returns a list of ResponseModel.

`fetch` -- Browser fetch (single URL)

Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynamic/SPA pages with no/low bot protection.

Key parameters (beyond shared ones):

Parameter	Type	Default	Description
`url`	str	required	URL to fetch
`extraction_type`	str	`"markdown"`	`"markdown"` / `"html"` / `"text"`
`css_selector`	str or null	null	Narrow content before extraction
`main_content_only`	bool	true	Restrict to `<body>`
`headless`	bool	true	Run browser hidden (true) or visible (false)
`proxy`	str or dict or null	null	String URL or `{"server": "...", "username": "...", "password": "..."}`
`timeout`	number	30000	Timeout in milliseconds
`wait`	number	0	Extra wait (ms) after page load before extraction
`wait_selector`	str or null	null	CSS selector to wait for before extraction
`wait_selector_state`	str	`"attached"`	State for wait_selector: `"attached"` / `"visible"` / `"hidden"` / `"detached"`
`network_idle`	bool	false	Wait until no network activity for 500ms
`disable_resources`	bool	false	Block fonts, images, media, stylesheets, etc. for speed
`google_search`	bool	true	Set a Google referer header
`real_chrome`	bool	false	Use locally installed Chrome instead of bundled Chromium
`cdp_url`	str or null	null	Connect to existing browser via CDP URL
`extra_headers`	dict or null	null	Additional request headers
`useragent`	str or null	null	Custom user-agent (auto-generated if null)
`cookies`	list or null	null	Playwright-format cookies
`timezone_id`	str or null	null	Browser timezone, e.g. `"America/New_York"`
`locale`	str or null	null	Browser locale, e.g. `"en-GB"`
`session_id`	str or null	null	Reuse a persistent session from `open_session` instead of creating a new browser

`bulk_fetch` -- Browser fetch (multiple URLs)

Concurrent browser version of fetch. Same parameters (including session_id) except url is replaced by urls (list of strings). Each URL opens in a separate browser tab. Returns a list of ResponseModel.

`stealthy_fetch` -- Stealth browser fetch (single URL)

Anti-bot bypass fetcher with fingerprint spoofing. Use this for sites with Cloudflare Turnstile/Interstitial or other strong protections.

Additional parameters (beyond those in fetch):

Parameter	Type	Default	Description
`solve_cloudflare`	bool	false	Automatically solve Cloudflare Turnstile/Interstitial challenges
`hide_canvas`	bool	false	Add noise to canvas operations to prevent fingerprinting
`block_webrtc`	bool	false	Force WebRTC to respect proxy settings (prevents IP leak)
`allow_webgl`	bool	true	Keep WebGL enabled (disabling is detectable by WAFs)
`additional_args`	dict or null	null	Extra Playwright context args (overrides Scrapling defaults)
`session_id`	str or null	null	Reuse a persistent stealthy session from `open_session`

All parameters from fetch are also accepted.

`bulk_stealthy_fetch` -- Stealth browser fetch (multiple URLs)

Concurrent stealth version. Same parameters (including session_id) as stealthy_fetch except url is replaced by urls (list of strings). Returns a list of ResponseModel.

`open_session` -- Create a persistent browser session

Opens a browser session that stays alive across multiple fetch calls, avoiding the overhead of launching a new browser each time. Returns a SessionCreatedModel with session_id, session_type, created_at, is_alive, and message.

Key parameters:

Parameter	Type	Default	Description
`session_type`	`"dynamic"` / `"stealthy"`	required	Type of browser session to create
`session_id`	str or null	null	Custom ID for the session. If omitted, a random 12-char hex ID is generated. Raises if already in use
`headless`	bool	true	Run browser hidden or visible
`max_pages`	int	5	Max concurrent browser tabs (1-50)
`proxy`	str or dict or null	null	Proxy for all requests in this session
`timeout`	number	30000	Default timeout in ms
`solve_cloudflare`	bool	false	(Stealthy only) Auto-solve Cloudflare challenges
`hide_canvas`	bool	false	(Stealthy only) Canvas fingerprint noise
`block_webrtc`	bool	false	(Stealthy only) Block WebRTC IP leak
`allow_webgl`	bool	true	(Stealthy only) Keep WebGL enabled

Plus all other browser session parameters (google_search, real_chrome, cdp_url, locale, timezone_id, useragent, extra_headers, cookies, disable_resources, network_idle, wait_selector, wait_selector_state).

A dynamic session can only be used with fetch/bulk_fetch. A stealthy session can only be used with stealthy_fetch/bulk_stealthy_fetch.

`close_session` -- Close a persistent browser session

Closes a session and frees its browser resources. Always close sessions when done.

Parameter	Type	Default	Description
`session_id`	str	required	Session ID from `open_session`

Returns a SessionClosedModel with session_id and message.

`list_sessions` -- List active sessions

Returns a list of SessionInfo objects, each with session_id, session_type, created_at, and is_alive.

No parameters.

`screenshot` -- Capture a page screenshot

Navigates to a URL inside an existing browser session and returns the screenshot as an MCP ImageContent block (the bytes the model can see directly, not a base64 string in JSON) followed by a TextContent block carrying the post-redirect URL.

Requires an open browser session. Call open_session first, then pass the session_id here. Both dynamic and stealthy sessions are accepted.

Parameter	Type	Default	Description
`url`	str	required	URL to navigate to and capture
`session_id`	str	required	ID of an open browser session created with `open_session`
`image_type`	`"png"` / `"jpeg"`	`"png"`	Image format. Use `"jpeg"` for smaller payloads
`full_page`	bool	false	Capture the full scrollable page instead of just the viewport
`quality`	int or null	null	JPEG quality 0-100. Raises if passed with `image_type="png"`
`wait`	number	0	Extra wait (ms) after page load before capture
`wait_selector`	str or null	null	CSS selector to wait for before capture
`wait_selector_state`	str	`"attached"`	State for `wait_selector`: `"attached"` / `"visible"` / `"hidden"` / `"detached"`
`network_idle`	bool	false	Wait until no network activity for 500ms
`timeout`	number	30000	Timeout in milliseconds

Tool selection guide

Scenario	Tool
Static page, no bot protection	`get`
Multiple static pages	`bulk_get`
JavaScript-rendered / SPA page	`fetch`
Multiple JS-rendered pages	`bulk_fetch`
Cloudflare or strong anti-bot protection	`stealthy_fetch` (with `solve_cloudflare=true` for Turnstile)
Multiple protected pages	`bulk_stealthy_fetch`
Multiple pages from the same site	`open_session` + `fetch`/`stealthy_fetch` with `session_id`
Need a screenshot of a page	`open_session` + `screenshot` with `session_id`

Start with get (fastest, lowest resource cost). Escalate to fetch if content requires JS rendering. Escalate to stealthy_fetch only if blocked. For multiple pages from the same site, use a persistent session to avoid browser launch overhead.

Content extraction tips

Use css_selector to narrow results before they reach the model -- this saves significant tokens.
main_content_only=true (default) strips nav/footer by restricting to <body>.
extraction_type="markdown" (default) is best for readability. Use "text" for minimal output, "html" when structure matters.
If a css_selector matches multiple elements, all are returned in the content list.

Prompt injection protection

When main_content_only=true (the default), the server automatically sanitizes scraped content to prevent prompt injection from malicious websites. It strips:

CSS-hidden elements (display:none, visibility:hidden, opacity:0, font-size:0, height:0, width:0)
aria-hidden="true" elements
<template> tags
HTML comments
Zero-width unicode characters

Keep main_content_only=true for maximum protection.

Ad blocking

All browser-based tools (fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch) and persistent sessions (open_session) automatically block requests to ~3,500 known ad and tracker domains. This is always enabled in the MCP server to save tokens and speed up page loads. No configuration needed.

Setup

Start the server (stdio transport, used by most MCP clients):

bash

scrapling mcp

Or with Streamable HTTP transport:

bash

scrapling mcp --http
scrapling mcp --http --host 127.0.0.1 --port 8000

Docker alternative:

bash

docker pull pyd4vinci/scrapling
docker run -i --rm scrapling mcp

The MCP server name when registering with a client is ScraplingServer. The command is the path to the scrapling binary and the argument is mcp.

Scrapling MCP Server

Scrapling MCP Server

Tools

get -- HTTP request (single URL)

bulk_get -- HTTP request (multiple URLs)

fetch -- Browser fetch (single URL)

bulk_fetch -- Browser fetch (multiple URLs)

stealthy_fetch -- Stealth browser fetch (single URL)

bulk_stealthy_fetch -- Stealth browser fetch (multiple URLs)

open_session -- Create a persistent browser session

close_session -- Close a persistent browser session

list_sessions -- List active sessions

screenshot -- Capture a page screenshot

Tool selection guide

Content extraction tips

Prompt injection protection

Ad blocking

Setup

`get` -- HTTP request (single URL)

`bulk_get` -- HTTP request (multiple URLs)

`fetch` -- Browser fetch (single URL)

`bulk_fetch` -- Browser fetch (multiple URLs)

`stealthy_fetch` -- Stealth browser fetch (single URL)

`bulk_stealthy_fetch` -- Stealth browser fetch (multiple URLs)

`open_session` -- Create a persistent browser session

`close_session` -- Close a persistent browser session

`list_sessions` -- List active sessions

`screenshot` -- Capture a page screenshot