docs/en/tools/web-scraping/scrapegraphscrapetool.mdx
ScrapegraphScrapeToolThe ScrapegraphScrapeTool is designed to leverage Scrapegraph AI's SmartScraper API to intelligently extract content from websites. This tool provides advanced web scraping capabilities with AI-powered content extraction, making it ideal for targeted data collection and content analysis tasks. Unlike traditional web scrapers, it can understand the context and structure of web pages to extract the most relevant information based on natural language prompts.
To use this tool, you need to install the Scrapegraph Python client:
uv add scrapegraph-py
You'll also need to set up your Scrapegraph API key as an environment variable:
export SCRAPEGRAPH_API_KEY="your_api_key"
You can obtain an API key from Scrapegraph AI.
To effectively use the ScrapegraphScrapeTool, follow these steps:
The following example demonstrates how to use the ScrapegraphScrapeTool to extract content from a website:
from crewai import Agent, Task, Crew
from crewai_tools import ScrapegraphScrapeTool
# Initialize the tool
scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")
# Define an agent that uses the tool
web_scraper_agent = Agent(
role="Web Scraper",
goal="Extract specific information from websites",
backstory="An expert in web scraping who can extract targeted content from web pages.",
tools=[scrape_tool],
verbose=True,
)
# Example task to extract product information from an e-commerce site
scrape_task = Task(
description="Extract product names, prices, and descriptions from the featured products section of example.com.",
expected_output="A structured list of product information including names, prices, and descriptions.",
agent=web_scraper_agent,
)
# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()
You can also initialize the tool with predefined parameters:
# Initialize the tool with predefined parameters
scrape_tool = ScrapegraphScrapeTool(
website_url="https://www.example.com",
user_prompt="Extract all product prices and descriptions",
api_key="your_api_key"
)
The ScrapegraphScrapeTool accepts the following parameters during initialization:
SCRAPEGRAPH_API_KEY environment variable.False.When using the ScrapegraphScrapeTool with an agent, the agent will need to provide the following parameters (unless they were specified during initialization):
The tool will return the extracted content based on the provided prompt.
# Example of using the tool with an agent
web_scraper_agent = Agent(
role="Web Scraper",
goal="Extract specific information from websites",
backstory="An expert in web scraping who can extract targeted content from web pages.",
tools=[scrape_tool],
verbose=True,
)
# Create a task for the agent to extract specific content
extract_task = Task(
description="Extract the main heading and summary from example.com",
expected_output="The main heading and summary from the website",
agent=web_scraper_agent,
)
# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[extract_task])
result = crew.kickoff()
The ScrapegraphScrapeTool may raise the following exceptions:
It's recommended to instruct agents to handle potential errors gracefully:
# Create a task that includes error handling instructions
robust_extract_task = Task(
description="""
Extract the main heading from example.com.
Be aware that you might encounter errors such as:
- Invalid URL format
- Missing API key
- Rate limit exceeded
- Network or API errors
If you encounter any errors, provide a clear explanation of what went wrong
and suggest possible solutions.
""",
expected_output="Either the extracted heading or a clear error explanation",
agent=web_scraper_agent,
)
The Scrapegraph API has rate limits that vary based on your subscription plan. Consider the following best practices:
The ScrapegraphScrapeTool uses the Scrapegraph Python client to interact with the SmartScraper API:
class ScrapegraphScrapeTool(BaseTool):
"""
A tool that uses Scrapegraph AI to intelligently scrape website content.
"""
# Implementation details...
def _run(self, **kwargs: Any) -> Any:
website_url = kwargs.get("website_url", self.website_url)
user_prompt = (
kwargs.get("user_prompt", self.user_prompt)
or "Extract the main content of the webpage"
)
if not website_url:
raise ValueError("website_url is required")
# Validate URL format
self._validate_url(website_url)
try:
# Make the SmartScraper request
response = self._client.smartscraper(
website_url=website_url,
user_prompt=user_prompt,
)
return response
# Error handling...
The ScrapegraphScrapeTool provides a powerful way to extract content from websites using AI-powered understanding of web page structure. By enabling agents to target specific information using natural language prompts, it makes web scraping tasks more efficient and focused. This tool is particularly useful for data extraction, content monitoring, and research tasks where specific information needs to be extracted from web pages.