# Scrapling MCP Server

The Scrapling MCP server exposes ten tools over the MCP protocol. It supports CSS-selector-based content narrowing (reducing tokens by extracting only relevant elements before returning results), three levels of scraping capability (plain HTTP, browser-rendered, and stealth/anti-bot bypass), persistent browser session management, and page screenshots returned as real image content blocks.

All scraping tools return a ResponseModel with fields: status (int), content (list of strings), url (str). The screenshot tool returns a list of MCP content blocks: an ImageContent (the screenshot bytes) followed by a TextContent (the post-redirect URL).
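It can be handy to mirror that result shape client-side. A minimal sketch (the dataclass itself is hypothetical; only the three field names and types come from this reference):

```python
from dataclasses import dataclass

@dataclass
class ResponseModel:
    """Hypothetical client-side mirror of the scraping tools' result shape."""
    status: int    # HTTP status code of the final response
    content: list  # extracted content, one string per matched element
    url: str       # URL after any redirects

# Example: a successful markdown extraction with a single matched element.
result = ResponseModel(status=200, content=["# Example Domain"], url="https://example.com/")
```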

## Tools

### get -- HTTP request (single URL)

Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable for static pages with no/low bot protection.

Key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | str | required | URL to fetch |
| `extraction_type` | `"markdown"` / `"html"` / `"text"` | `"markdown"` | Output format |
| `css_selector` | str or null | null | CSS selector to narrow content (applied after `main_content_only`) |
| `main_content_only` | bool | true | Restrict to `<body>` content |
| `impersonate` | str | `"chrome"` | Browser fingerprint to impersonate |
| `proxy` | str or null | null | Proxy URL, e.g. `"http://user:pass@host:port"` |
| `proxy_auth` | dict or null | null | `{"username": "...", "password": "..."}` |
| `auth` | dict or null | null | HTTP basic auth, same format as `proxy_auth` |
| `timeout` | number | 30 | Seconds before timeout |
| `retries` | int | 3 | Retry attempts on failure |
| `retry_delay` | int | 1 | Seconds between retries |
| `stealthy_headers` | bool | true | Generate realistic browser headers and a Google referer |
| `http3` | bool | false | Use HTTP/3 (may conflict with `impersonate`) |
| `follow_redirects` | bool or `"safe"` | `"safe"` | Follow redirects; `"safe"` rejects redirects to internal/private IPs |
| `max_redirects` | int | 30 | Max redirects (-1 for unlimited) |
| `headers` | dict or null | null | Custom request headers |
| `cookies` | dict or null | null | Request cookies |
| `params` | dict or null | null | Query string parameters |
| `verify` | bool | true | Verify HTTPS certificates |

### bulk_get -- HTTP request (multiple URLs)

Async concurrent version of get. Same parameters except url is replaced by urls (list of strings). All URLs are fetched in parallel. Returns a list of ResponseModel.
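For example, a minimal bulk_get argument payload might look like the dict below (the URLs are illustrative; how the payload is sent depends on your MCP client):

```python
# Arguments for one bulk_get tool call: same shape as get, but `urls` replaces `url`.
bulk_get_args = {
    "urls": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "extraction_type": "text",  # minimal output for bulk scrapes
    "timeout": 30,
}
```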

### fetch -- Browser fetch (single URL)

Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynamic/SPA pages with no/low bot protection.

Key parameters (beyond shared ones):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | str | required | URL to fetch |
| `extraction_type` | str | `"markdown"` | `"markdown"` / `"html"` / `"text"` |
| `css_selector` | str or null | null | Narrow content before extraction |
| `main_content_only` | bool | true | Restrict to `<body>` |
| `headless` | bool | true | Run browser hidden (true) or visible (false) |
| `proxy` | str or dict or null | null | String URL or `{"server": "...", "username": "...", "password": "..."}` |
| `timeout` | number | 30000 | Timeout in milliseconds |
| `wait` | number | 0 | Extra wait (ms) after page load before extraction |
| `wait_selector` | str or null | null | CSS selector to wait for before extraction |
| `wait_selector_state` | str | `"attached"` | State for `wait_selector`: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
| `network_idle` | bool | false | Wait until no network activity for 500 ms |
| `disable_resources` | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
| `google_search` | bool | true | Set a Google referer header |
| `real_chrome` | bool | false | Use locally installed Chrome instead of bundled Chromium |
| `cdp_url` | str or null | null | Connect to an existing browser via CDP URL |
| `extra_headers` | dict or null | null | Additional request headers |
| `useragent` | str or null | null | Custom user-agent (auto-generated if null) |
| `cookies` | list or null | null | Playwright-format cookies |
| `timezone_id` | str or null | null | Browser timezone, e.g. `"America/New_York"` |
| `locale` | str or null | null | Browser locale, e.g. `"en-GB"` |
| `session_id` | str or null | null | Reuse a persistent session from `open_session` instead of creating a new browser |
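As an illustration, a fetch call that waits for a dynamically rendered element before extracting might use arguments like these (the URL and selector are made up):

```python
# Arguments for one fetch tool call against a JS-rendered page.
fetch_args = {
    "url": "https://example.com/spa",
    "wait_selector": "div#results",    # wait for this element...
    "wait_selector_state": "visible",  # ...to become visible, not just attached
    "network_idle": True,              # and for network activity to settle
    "extraction_type": "markdown",
}
```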

### bulk_fetch -- Browser fetch (multiple URLs)

Concurrent browser version of fetch. Same parameters (including session_id) except url is replaced by urls (list of strings). Each URL opens in a separate browser tab. Returns a list of ResponseModel.

### stealthy_fetch -- Stealth browser fetch (single URL)

Anti-bot bypass fetcher with fingerprint spoofing. Use this for sites with Cloudflare Turnstile/Interstitial or other strong protections.

Additional parameters (beyond those in fetch):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `solve_cloudflare` | bool | false | Automatically solve Cloudflare Turnstile/Interstitial challenges |
| `hide_canvas` | bool | false | Add noise to canvas operations to prevent fingerprinting |
| `block_webrtc` | bool | false | Force WebRTC to respect proxy settings (prevents IP leak) |
| `allow_webgl` | bool | true | Keep WebGL enabled (disabling is detectable by WAFs) |
| `additional_args` | dict or null | null | Extra Playwright context args (overrides Scrapling defaults) |
| `session_id` | str or null | null | Reuse a persistent stealthy session from `open_session` |

All parameters from fetch are also accepted.
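A sketch of arguments for a Cloudflare-protected page behind a proxy (URL and proxy are placeholders):

```python
# Arguments for one stealthy_fetch tool call.
stealthy_args = {
    "url": "https://example.com/protected",
    "solve_cloudflare": True,  # auto-solve Turnstile/Interstitial challenges
    "block_webrtc": True,      # keep WebRTC from leaking the real IP past the proxy
    "proxy": "http://user:pass@host:8080",
}
```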

### bulk_stealthy_fetch -- Stealth browser fetch (multiple URLs)

Concurrent stealth version. Same parameters (including session_id) as stealthy_fetch except url is replaced by urls (list of strings). Returns a list of ResponseModel.

### open_session -- Create a persistent browser session

Opens a browser session that stays alive across multiple fetch calls, avoiding the overhead of launching a new browser each time. Returns a SessionCreatedModel with session_id, session_type, created_at, is_alive, and message.

Key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `session_type` | `"dynamic"` / `"stealthy"` | required | Type of browser session to create |
| `session_id` | str or null | null | Custom ID for the session. If omitted, a random 12-char hex ID is generated. Raises if already in use |
| `headless` | bool | true | Run browser hidden or visible |
| `max_pages` | int | 5 | Max concurrent browser tabs (1-50) |
| `proxy` | str or dict or null | null | Proxy for all requests in this session |
| `timeout` | number | 30000 | Default timeout in ms |
| `solve_cloudflare` | bool | false | (Stealthy only) Auto-solve Cloudflare challenges |
| `hide_canvas` | bool | false | (Stealthy only) Canvas fingerprint noise |
| `block_webrtc` | bool | false | (Stealthy only) Block WebRTC IP leak |
| `allow_webgl` | bool | true | (Stealthy only) Keep WebGL enabled |

Plus all other browser session parameters (google_search, real_chrome, cdp_url, locale, timezone_id, useragent, extra_headers, cookies, disable_resources, network_idle, wait_selector, wait_selector_state).

A dynamic session can only be used with fetch/bulk_fetch. A stealthy session can only be used with stealthy_fetch/bulk_stealthy_fetch.
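The whole session lifecycle can be sketched as a sequence of tool-call payloads (tool names and parameters come from this reference; the session ID and URLs are illustrative, and `calls` just lists what a client would invoke in order):

```python
# Crawl several pages of one site through a single persistent browser.
session = {"session_type": "dynamic", "session_id": "docs-crawl", "max_pages": 5}
pages = {
    "urls": ["https://example.com/1", "https://example.com/2"],
    "session_id": "docs-crawl",  # reuse the session instead of a fresh browser
}
calls = [
    ("open_session", session),
    ("bulk_fetch", pages),  # dynamic session -> fetch/bulk_fetch only
    ("close_session", {"session_id": "docs-crawl"}),  # always free resources
]
```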

### close_session -- Close a persistent browser session

Closes a session and frees its browser resources. Always close sessions when done.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `session_id` | str | required | Session ID from `open_session` |

Returns a SessionClosedModel with session_id and message.

### list_sessions -- List active sessions

Returns a list of SessionInfo objects, each with session_id, session_type, created_at, and is_alive.

No parameters.

### screenshot -- Capture a page screenshot

Navigates to a URL inside an existing browser session and returns the screenshot as an MCP ImageContent block (the bytes the model can see directly, not a base64 string in JSON) followed by a TextContent block carrying the post-redirect URL.

Requires an open browser session. Call open_session first, then pass the session_id here. Both dynamic and stealthy sessions are accepted.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | str | required | URL to navigate to and capture |
| `session_id` | str | required | ID of an open browser session created with `open_session` |
| `image_type` | `"png"` / `"jpeg"` | `"png"` | Image format. Use `"jpeg"` for smaller payloads |
| `full_page` | bool | false | Capture the full scrollable page instead of just the viewport |
| `quality` | int or null | null | JPEG quality 0-100. Raises if passed with `image_type="png"` |
| `wait` | number | 0 | Extra wait (ms) after page load before capture |
| `wait_selector` | str or null | null | CSS selector to wait for before capture |
| `wait_selector_state` | str | `"attached"` | State for `wait_selector`: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
| `network_idle` | bool | false | Wait until no network activity for 500 ms |
| `timeout` | number | 30000 | Timeout in milliseconds |
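The two-step flow might look like this as tool-call payloads (session ID and URL are illustrative):

```python
# Step 1: open a session; step 2: screenshot within it.
open_args = {"session_type": "dynamic", "session_id": "shots"}
shot_args = {
    "url": "https://example.com/",
    "session_id": "shots",   # must match the open session
    "image_type": "jpeg",    # smaller payload than png
    "quality": 70,           # only valid together with jpeg
    "full_page": True,       # capture the whole scrollable page
}
```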

## Tool selection guide

| Scenario | Tool |
| --- | --- |
| Static page, no bot protection | `get` |
| Multiple static pages | `bulk_get` |
| JavaScript-rendered / SPA page | `fetch` |
| Multiple JS-rendered pages | `bulk_fetch` |
| Cloudflare or strong anti-bot protection | `stealthy_fetch` (with `solve_cloudflare=true` for Turnstile) |
| Multiple protected pages | `bulk_stealthy_fetch` |
| Multiple pages from the same site | `open_session` + `fetch`/`stealthy_fetch` with `session_id` |
| Need a screenshot of a page | `open_session` + `screenshot` with `session_id` |

Start with get (fastest, lowest resource cost). Escalate to fetch if content requires JS rendering. Escalate to stealthy_fetch only if blocked. For multiple pages from the same site, use a persistent session to avoid browser launch overhead.
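That escalation order can be sketched as a small driver function. Everything here except the tool names and the `solve_cloudflare` parameter is an assumption: `call_tool(name, args)` stands in for whatever MCP client you use, and is assumed to return the ResponseModel as a plain dict.

```python
def scrape_with_escalation(call_tool, url):
    """Try get -> fetch -> stealthy_fetch, stopping at the first useful result."""
    attempts = [
        ("get", {}),                                     # cheapest: plain HTTP
        ("fetch", {}),                                   # needs JS rendering
        ("stealthy_fetch", {"solve_cloudflare": True}),  # blocked by anti-bot
    ]
    result = None
    for tool, extra in attempts:
        result = call_tool(tool, {"url": url, **extra})
        # Escalate on error statuses or empty content (likely a block page).
        if result["status"] == 200 and result["content"]:
            return result
    return result  # last attempt, even if it failed
```

In practice you may also want to escalate on specific status codes (403, 429, 503) rather than anything non-200.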

## Content extraction tips

- Use `css_selector` to narrow results before they reach the model; this saves significant tokens.
- `main_content_only=true` (default) strips nav/footer by restricting to `<body>`.
- `extraction_type="markdown"` (default) is best for readability. Use `"text"` for minimal output, `"html"` when structure matters.
- If a `css_selector` matches multiple elements, all are returned in the `content` list.
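For example, if a selector matched three elements, the result would carry three strings (a simulated response; joining them back together is one common way to process it):

```python
# Simulated ResponseModel (as a dict) for a selector that matched 3 elements.
response = {
    "status": 200,
    "content": ["item one", "item two", "item three"],  # one string per match
    "url": "https://example.com/list",
}
combined = "\n\n".join(response["content"])  # stitch matches back into one text
```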

## Prompt injection protection

When main_content_only=true (the default), the server automatically sanitizes scraped content to prevent prompt injection from malicious websites. It strips:

- CSS-hidden elements (`display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`, `height:0`, `width:0`)
- `aria-hidden="true"` elements
- `<template>` tags
- HTML comments
- Zero-width unicode characters

Keep main_content_only=true for maximum protection.

## Ad blocking

All browser-based tools (fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch) and persistent sessions (open_session) automatically block requests to ~3,500 known ad and tracker domains. This is always enabled in the MCP server to save tokens and speed up page loads. No configuration needed.

## Setup

Start the server (stdio transport, used by most MCP clients):

```bash
scrapling mcp
```

Or with Streamable HTTP transport:

```bash
scrapling mcp --http
scrapling mcp --http --host 127.0.0.1 --port 8000
```

Docker alternative:

```bash
docker pull pyd4vinci/scrapling
docker run -i --rm pyd4vinci/scrapling mcp
```

The MCP server name when registering with a client is `ScraplingServer`. The command is the path to the `scrapling` binary and the argument is `mcp`.
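For clients that use the common `mcpServers` JSON layout, the registration might look like this (the config-file schema belongs to the client, not Scrapling, and the absolute path is illustrative):

```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "/usr/local/bin/scrapling",
      "args": ["mcp"]
    }
  }
}
```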