docs/ai/mcp-server.md
The Scrapling MCP Server is a new feature that brings Scrapling's powerful web scraping capabilities directly to your favorite AI chatbot or AI agent. This integration allows you to scrape websites, extract data, and bypass anti-bot protections conversationally through Claude or any other interface that supports MCP.
The Scrapling MCP Server provides ten powerful tools for web scraping:
- `get`: Fast HTTP requests with browser fingerprint impersonation, generating real browser headers matching the TLS version, HTTP/3, and more!
- `bulk_get`: An async version of the above tool that allows scraping of multiple URLs at the same time!
- `fetch`: Rapidly fetch dynamic content with the Chromium/Chrome browser, with complete control over the request/browser, and more!
- `bulk_fetch`: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time!
- `stealthy_fetch`: Uses our Stealthy browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems, with complete control over the request/browser!
- `bulk_stealthy_fetch`: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!
- `screenshot`: Capture a PNG or JPEG screenshot of a page using an open browser session, returned as an image content block the model can actually see (not a base64 string blob). Supports full-page captures, JPEG quality, and the usual readiness controls (`wait`, `wait_selector`, `network_idle`).
- `open_session`: Create a persistent browser session (dynamic or stealthy) that stays open across multiple fetch calls, avoiding the overhead of launching a new browser each time.
- `close_session`: Close a persistent browser session and free its resources.
- `list_sessions`: List all active browser sessions with their details.

Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that lets you select specific elements to pass to the AI, saving a lot of time and tokens!
The way other servers work is that they extract the content, then pass it all to the AI to extract the fields you want. This causes the AI to consume far more tokens than needed (from irrelevant content). Scrapling solves this problem by allowing you to pass a CSS selector to narrow down the content you want before passing it to the AI, which makes the whole process much faster and more efficient.
If you don't know how to write/use CSS selectors, don't worry. You can tell the AI in the prompt to write selectors to match possible fields for you and watch it try different combinations until it finds the right one, as we will show in the examples section.
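To make that concrete, here is a rough Python sketch of the same idea using Scrapling's own fetcher API directly (the URL and selector are made up for illustration; the MCP server does the equivalent internally when you pass a CSS selector):

```python
from scrapling.fetchers import Fetcher

# Fetch the page once, then keep only the elements that match the selector,
# so the AI never has to read the navigation, footer, ads, etc.
page = Fetcher.get("https://shop.example.com")  # hypothetical URL
titles = [element.text for element in page.css(".product-title")]  # hypothetical selector
print(titles)  # only this small list would be passed to the model
```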
Install Scrapling with MCP Support, then double-check that the browser dependencies are installed.
```bash
# Install Scrapling with MCP server dependencies
pip install "scrapling[ai]"

# Install browser dependencies
scrapling install
```
Or use the Docker image directly from the Docker registry:
```bash
docker pull pyd4vinci/scrapling
```
Or pull it from the GitHub Container Registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
Here we will explain how to add the Scrapling MCP Server to Claude Desktop and Claude Code, but the same logic applies to any other chatbot that supports MCP. For Claude Desktop, add the following entry to the `mcpServers` object in your configuration file:

```json
"ScraplingServer": {
  "command": "scrapling",
  "args": [
    "mcp"
  ]
}
```
If that's the first MCP server you're adding, set the content of the file to this:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "scrapling",
      "args": [
        "mcp"
      ]
    }
  }
}
```
As per the official article, this action either creates a new configuration file if none exists or opens your existing configuration. The file is located at:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

To ensure it's working, use the full path to the `scrapling` executable. Open the terminal and execute the following command:

- macOS/Linux: `which scrapling`
- Windows: `where scrapling`

For me, on my Mac, it returned `/Users/<MyUsername>/.venv/bin/scrapling`, so the config I ended up using is:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "/Users/<MyUsername>/.venv/bin/scrapling",
      "args": [
        "mcp"
      ]
    }
  }
}
```
If you are using the Docker image, then it would be something like:

```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm", "pyd4vinci/scrapling", "mcp"
      ]
    }
  }
}
```
The same logic applies to Cursor, WindSurf, and others.
Adding the server to Claude Code is much simpler. If you have Claude Code installed, open the terminal and execute the following command:
```bash
claude mcp add ScraplingServer "/Users/<MyUsername>/.venv/bin/scrapling" mcp
```
Same as above, to get Scrapling's executable path, open the terminal and execute the following command:

- macOS/Linux: `which scrapling`
- Windows: `where scrapling`

See the main article from Anthropic on how to add MCP servers to Claude Code for further details.
Then, after you've added the server, you need to completely quit and restart the app you used above. In Claude Desktop, you should see an MCP server indicator in the bottom-right corner of the chat input, or find ScraplingServer in the Search and tools dropdown in the chat input box.
As of version 0.3.6, we have added the ability to make the MCP server use the 'Streamable HTTP' transport mode instead of the traditional 'stdio' transport.
So instead of using the following command (the 'stdio' one):
```bash
scrapling mcp
```
Use the following to enable 'Streamable HTTP' transport mode:
```bash
scrapling mcp --http
```
In this mode, the server listens on host '0.0.0.0' and port 8000 by default; both can be configured as shown below:
```bash
scrapling mcp --http --host '127.0.0.1' --port 8000
```
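Once the server is running in HTTP mode, any MCP client that supports Streamable HTTP can connect to it. As a quick smoke test, here is a minimal sketch using the official MCP Python SDK (`pip install mcp`); it assumes the default host/port above and the `/mcp` path that FastMCP-based servers typically expose, so adjust both if you changed them:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main():
    # Connect to the locally running `scrapling mcp --http` server.
    async with streamablehttp_client("http://127.0.0.1:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # should list get, fetch, etc.


asyncio.run(main())
```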
Now we will show you some examples of prompts we used while testing the MCP server, but you are probably more creative and better at prompt engineering than we are :)
We will gradually go from simple prompts to more complex ones. We will use Claude Desktop for the examples, but the same logic applies to the rest, of course.
### Basic Web Scraping
Extract the main content from a webpage as Markdown:
Scrape the main content from https://example.com and convert it to markdown format.
Claude will use the `get` tool to fetch the page and return clean, readable content. If the request fails, the tool retries every second for up to 3 attempts, unless you instruct it otherwise. If it still fails to retrieve content for any reason, such as anti-bot protection or a dynamic website, Claude will automatically try the other tools. If Claude doesn't do that automatically for some reason, you can add that instruction to the prompt.
A more optimized version of the same prompt would be:
Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
This tells Claude which tool to use, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times it will assume browsers are better suited for the website for no apparent reason. As a rule of thumb, you should always tell Claude which tool to use to save time and money and get consistent results.
### Targeted Data Extraction
Extract specific elements using CSS selectors:
Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
The server will extract only the elements matching your selector and return them as a structured list. Notice that I told it to retry up to 5 times in case the website has connection issues, but the default settings should be fine for most cases.
### E-commerce Data Collection
Here is another, slightly more complex prompt:
Extract product information from these e-commerce URLs using bulk browser fetches:
- https://shop1.com/product-a
- https://shop2.com/product-b
- https://shop3.com/product-c
Get the product names, prices, and descriptions from each page.
Claude will use bulk_fetch to concurrently scrape all URLs, then analyze the extracted data.
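For intuition, this is roughly what the bulk tools buy you: the pages are fetched concurrently rather than one after another. The sketch below approximates that with plain Scrapling HTTP fetches run in threads (the URLs are the hypothetical ones from the prompt, and this is not how the server's browser tab pool actually works), but the time saved is the same idea:

```python
import asyncio

from scrapling.fetchers import Fetcher

URLS = [
    "https://shop1.com/product-a",  # hypothetical URLs from the prompt above
    "https://shop2.com/product-b",
    "https://shop3.com/product-c",
]


async def main():
    # Run the blocking fetches in threads so they overlap instead of queueing.
    pages = await asyncio.gather(*(asyncio.to_thread(Fetcher.get, url) for url in URLS))
    for url, page in zip(URLS, pages):
        title = page.css_first("title")
        print(url, title.text if title else "")


asyncio.run(main())
```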
### More advanced workflow
Let's say I want to get all the action games currently on the first page of the PlayStation Store. I can use the following prompt to do that:
Extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse
Note that I instructed it to use a bulk request for all the URLs it collects. If I don't mention that, it sometimes works as intended, but at other times it makes a separate request to each URL, which takes significantly longer. This prompt takes approximately one minute to complete.
However, because I wasn't specific enough, it actually used `stealthy_fetch` here and `bulk_stealthy_fetch` in the second step, which unnecessarily consumed a large number of tokens. A better prompt would be:
Use normal requests to extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse
And if you know how to write CSS selectors, you can give Claude the selectors for the elements you want, and it will complete the task almost immediately.
Use normal requests to extract the URLs of all games on the page below, then perform a bulk request to them and return a list of all action games.
The selector for games in the first page is `[href*="/concept/"]` and the selector for the genre in the second request is `[data-qa="gameInfo#releaseInformation#genre-value"]`.
URL: https://store.playstation.com/en-us/pages/browse
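For reference, here is roughly what that two-step workflow looks like if you run it directly with Scrapling in Python, using the same selectors from the prompt above (a rough sketch only; the PlayStation Store markup can change, and the MCP tools handle retries and content conversion for you):

```python
from scrapling.fetchers import Fetcher

BASE = "https://store.playstation.com"

# Step 1: collect the game URLs from the browse page.
listing = Fetcher.get(f"{BASE}/en-us/pages/browse")
hrefs = {link.attrib["href"] for link in listing.css('[href*="/concept/"]')}

# Step 2: visit each game page and read its genre.
for href in sorted(hrefs):
    game = Fetcher.get(href if href.startswith("http") else BASE + href)
    genres = [g.text for g in game.css('[data-qa="gameInfo#releaseInformation#genre-value"]')]
    if any("action" in genre.lower() for genre in genres):
        print(href, genres)
```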
### Get data from a website with Cloudflare protection
If you think the website you are targeting has Cloudflare protection, tell Claude instead of letting it discover it on its own.
What's the price of this product? Be cautious, as it utilizes Cloudflare's Turnstile protection. Make the browser visible while you work.
https://ao.com/product/oo101uk-ninja-woodfire-outdoor-pizza-oven-brown-99357-685.aspx
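Under the hood, the `stealthy_fetch` tool wraps Scrapling's stealthy browser, so the library-level equivalent of that prompt looks roughly like this (a sketch; exact keyword arguments may differ between Scrapling versions, and the price selector is a made-up placeholder):

```python
from scrapling.fetchers import StealthyFetcher

# headless=False keeps the browser window visible, as the prompt above requests.
page = StealthyFetcher.fetch(
    "https://ao.com/product/oo101uk-ninja-woodfire-outdoor-pizza-oven-brown-99357-685.aspx",
    headless=False,
)
price = page.css_first('[class*="price"]')  # hypothetical selector for illustration
print(price.text if price else "Price element not found")
```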
### Long workflow
You can, for example, use a prompt like this:
Extract all product URLs for the following category, then return the prices and details for the first 3 products.
https://www.arnotts.ie/furniture/bedroom/bed-frames/
But a better prompt would be:
Go to the following category URL and extract all product URLs using the CSS selector "a". Then, fetch the first 3 product pages in parallel and extract each product's price and details.
Keep the output in markdown format to reduce irrelevant content.
Category URL:
https://www.arnotts.ie/furniture/bedroom/bed-frames/
### Using Persistent Sessions
When scraping multiple pages from the same site, use a persistent browser session to avoid the overhead of launching a new browser for each request:
Open a stealthy browser session with a maximum pool of 5 pages, then use it to scrape the main details in bulk from the first 5 product pages on https://shop.example.com. Close the session when you're done.
Claude will use `open_session` to create a persistent browser, pass the `session_id` to the `bulk_stealthy_fetch` call so all pages open at the same time, and then call `close_session` at the end. This is significantly faster than launching a new browser for each page.
!!! danger
    When using persistent sessions, always remember to close the session after you finish, or it will stay open!
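If you script the server yourself instead of prompting Claude (for example over the Streamable HTTP transport shown earlier), the same order applies: open, fetch with the session ID, then close. Below is a rough, hypothetical sketch of that sequence using the official MCP Python SDK; the argument names are illustrative assumptions, so check the tool schemas the server exposes for the exact parameters:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main():
    async with streamablehttp_client("http://127.0.0.1:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # NOTE: the argument names below are assumptions, not the server's exact schema.
            await session.call_tool("open_session", {"session_id": "shop", "stealthy": True})
            await session.call_tool(
                "bulk_stealthy_fetch",
                {
                    "urls": ["https://shop.example.com/p/1", "https://shop.example.com/p/2"],
                    "session_id": "shop",
                },
            )
            # Always close the session, exactly as the warning above says.
            await session.call_tool("close_session", {"session_id": "shop"})


asyncio.run(main())
```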
### Using a Persistent Session on a long flow

Another long test example that makes Claude think:
Use Scrapling MCP to do the following in this order:
1. Open a stealthy browser session with headless mode off.
2. Go to this page and collect the number of stars: https://github.com/D4Vinci/Scrapling
3. From the README, get the URL that shows the number of downloads and go to it.
4. Get the number of downloads and the top 3 countries from the graph.
5. Prepare a report with the results.
6. Close the browser.
And so on, you get the idea. Your creativity is the key here.
Here is some technical advice for you.
- `get`: Fast, simple websites
- `fetch`: Sites with JavaScript/dynamic content
- `stealthy_fetch`: Protected sites, Cloudflare, anti-bot systems
- `network_idle` for SPAs
- `wait_selector` for specific elements
- `main_content_only=true` to avoid navigation/ads
- `extraction_type` for your use case

The MCP server automatically sanitizes scraped content when `main_content_only` is enabled (the default). This strips hidden content that malicious websites could use to inject instructions into the AI's context:
- Elements hidden with `display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`, `height:0`, or `width:0`
- Elements marked with `aria-hidden="true"`
- `<template>` elements
- HTML comments (`<!-- ... -->`)

This protection runs automatically on all MCP tool responses. Keep `main_content_only=true` (the default) for maximum protection.
- `open_session` to create a persistent browser session when scraping multiple pages
- Pass the `session_id` to `fetch` or `stealthy_fetch` calls to reuse the same browser
- `close_session` when done to free resources
- `list_sessions` to check which sessions are still active
- A `session_id` from a dynamic session can only be used with `fetch`/`bulk_fetch`, and a stealthy session can only be used with `stealthy_fetch`/`bulk_stealthy_fetch`
- Pass a `session_id` to `open_session` to give sessions meaningful names (e.g. "search", "checkout") instead of the random hex default. `open_session` raises if the chosen ID is already in use, so you can detect collisions up front
- `screenshot` only works through an existing browser session, so call `open_session` first (either dynamic or stealthy works)
- Screenshots are returned as an `ImageContent` block, not a base64 string in JSON, so the model sees the page directly
- `full_page=True` when you need everything below the fold; the default captures only the visible viewport
- `image_type="jpeg"` with a `quality` value (0-100) for smaller payloads when pixel-perfect color isn't needed
- The `wait`, `wait_selector`, `network_idle`, and `timeout` controls used by `fetch` are available here too

⚠️ Important Guidelines:
- Check `https://website.com/robots.txt` to see the site's scraping rules

Built with ❤️ by the Scrapling team. Happy scraping!