Web Scraper API with Custom Model Support

docs/examples/website-to-api/README.md

A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!

Features

  • AI-Powered Scraping: Provide a URL and a plain-English query to extract structured data
  • Beautiful Frontend: Modern minimalist black-and-white interface with smooth UX
  • Custom Model Support: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
  • Model Management: Save, list, and manage multiple model configurations via web interface
  • Dual Scraping Approaches: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
  • API Request History: Automatic saving and display of all API requests with cURL commands
  • Schema Caching: Intelligent caching of generated schemas for faster subsequent requests
  • Duplicate Prevention: Avoids saving duplicate requests (same URL + query)
  • RESTful API: Easy-to-use HTTP endpoints for all operations

Quick Start

1. Install Dependencies

```bash
pip install -r requirements.txt
```

2. Start the API Server

```bash
python app.py
```

The server will start on http://localhost:8000 with a beautiful web interface!

3. Using the Web Interface

Once the server is running, open http://localhost:8000 in your browser to access the web interface.

Pages:

  • Scrape Data: Enter URLs and queries to extract structured data
  • Models: Manage your AI model configurations (add, list, delete)
  • API Requests: View history of all scraping requests with cURL commands

Features:

  • Minimalist Design: Clean black-and-white theme inspired by modern web apps
  • Real-time Results: See extracted data in formatted JSON
  • Copy to Clipboard: Easy copying of results
  • Toast Notifications: User-friendly feedback
  • Dual Scraping Modes: Choose between Schema-based and LLM-based approaches

Model Management

Adding Models via Web Interface

  1. Go to the Models page
  2. Enter your model details:
    • Provider: LLM provider (e.g., gemini/gemini-2.5-flash, openai/gpt-4o)
    • API Token: Your API key for the provider
  3. Click "Add Model"

API Usage for Model Management

Save a Model Configuration

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```

List Saved Models

```bash
curl -X GET "http://localhost:8000/models"
```

Delete a Model Configuration

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```

Scraping Approaches

1. Schema-based Scraping (Faster)

  • Generates CSS selectors for targeted extraction
  • Caches schemas for repeated requests
  • Faster execution for structured websites

2. LLM-based Scraping (More Flexible)

  • Direct LLM extraction without schema generation
  • More flexible for complex or dynamic content
  • Better for unstructured data extraction

Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

  • Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-pro
  • OpenAI: openai/gpt-4, openai/gpt-3.5-turbo
  • Anthropic: anthropic/claude-3-opus, anthropic/claude-3-sonnet
  • And more...

API Endpoints

Core Endpoints

  • POST /scrape - Schema-based scraping
  • POST /scrape-with-llm - LLM-based scraping
  • GET /schemas - List cached schemas
  • POST /clear-cache - Clear schema cache
  • GET /health - Health check

Model Management Endpoints

  • GET /models - List saved model configurations
  • POST /models - Save a new model configuration
  • DELETE /models/{model_name} - Delete a model configuration

API Request History

  • GET /saved-requests - List all saved API requests
  • DELETE /saved-requests/{request_id} - Delete a saved request

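The endpoints above can be exercised from plain Python with only the standard library. The sketch below separates building the request from sending it; `API_BASE` and the payload fields mirror the defaults and examples in this README, while the function names themselves are illustrative assumptions:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default server address from Quick Start


def build_scrape_request(url: str, query: str, model_name: str,
                         use_llm: bool = False) -> urllib.request.Request:
    """Prepare a POST to /scrape or /scrape-with-llm without sending it."""
    endpoint = "/scrape-with-llm" if use_llm else "/scrape"
    payload = {"url": url, "query": query, "model_name": model_name}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url: str, query: str, model_name: str, use_llm: bool = False) -> dict:
    """Send a scrape job to a running server and return the decoded JSON."""
    req = build_scrape_request(url, query, model_name, use_llm)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server from the Quick Start running, `scrape("https://example.com", "Extract the product name, price, and description", "my-custom-model")` sends the same request as the cURL examples.
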
Request/Response Examples

Scrape Request

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```

Scrape Response

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```

Model Configuration Request

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```

Testing

Run the test script to verify the model management functionality:

```bash
python test_models.py
```

File Structure

```
parse_example/
├── api_server.py          # FastAPI server with all endpoints
├── web_scraper_lib.py     # Core scraping library
├── test_models.py         # Test script for model management
├── requirements.txt       # Dependencies
├── static/               # Frontend files
│   ├── index.html        # Main HTML interface
│   ├── styles.css        # CSS styles (minimalist theme)
│   └── script.js         # JavaScript functionality
├── schemas/              # Cached schemas
├── models/               # Saved model configurations
├── saved_requests/       # API request history
└── README.md            # This file
```

Advanced Usage

Using the Library Directly

```python
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )


asyncio.run(main())
```

Schema Caching

The system automatically caches generated schemas based on URL and query combinations:

  • First request: Generates schema using AI
  • Subsequent requests: Uses cached schema for faster extraction
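
The exact cache-key scheme used by the library isn't documented here, but hashing the URL/query pair is one plausible implementation; treat the names below (`SCHEMA_DIR`, `schema_cache_key`) as illustrative assumptions:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # matches the schemas/ directory in the file structure


def schema_cache_key(url: str, query: str) -> str:
    """Derive a stable, filesystem-safe key from the (url, query) pair."""
    return hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()[:16]


def load_cached_schema(url: str, query: str):
    """Return the cached schema for this combination, or None on a miss."""
    path = SCHEMA_DIR / f"{schema_cache_key(url, query)}.json"
    return json.loads(path.read_text()) if path.exists() else None
```
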

API Request History

All API requests are automatically saved with:

  • Request details (URL, query, model used)
  • Response data
  • Timestamp
  • cURL command for re-execution
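
Regenerating a replayable cURL command from a saved request is simple string assembly; a sketch (the saved-request field layout is an assumption based on the request examples above):

```python
import json
import shlex


def to_curl(endpoint: str, payload: dict) -> str:
    """Rebuild a copy-pasteable cURL command for a saved POST request."""
    return (
        f"curl -X POST {shlex.quote(endpoint)} "
        "-H 'Content-Type: application/json' "
        f"-d {shlex.quote(json.dumps(payload))}"
    )
```
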

Duplicate Prevention

The system prevents saving duplicate requests:

  • Same URL + query combinations are not saved multiple times
  • Returns existing request ID for duplicates
  • Keeps the API request history clean
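
The dedup logic can be sketched as an index keyed by a fingerprint of the URL/query pair (the real store persists to `saved_requests/`; this in-memory version is an illustrative assumption):

```python
import hashlib


def request_fingerprint(url: str, query: str) -> str:
    """Identical URL + query pairs map to the same fingerprint."""
    return hashlib.sha256(f"{url}\x00{query}".encode("utf-8")).hexdigest()


class RequestHistory:
    """In-memory sketch of the duplicate-prevention behavior."""

    def __init__(self):
        self._by_fingerprint: dict[str, str] = {}

    def save(self, request_id: str, url: str, query: str) -> str:
        """Save a request unless an identical one exists; return its ID."""
        fp = request_fingerprint(url, query)
        if fp in self._by_fingerprint:
            return self._by_fingerprint[fp]  # existing ID, nothing new saved
        self._by_fingerprint[fp] = request_id
        return request_id
```
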

Error Handling

The API provides detailed error messages for common issues:

  • Invalid URLs
  • Missing model configurations
  • API key errors
  • Network timeouts
  • Parsing errors
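
On the client side, these failures surface as HTTP or network errors; a defensive handler might look like the sketch below (the `detail` field follows FastAPI's usual error shape, which is an assumption about this server):

```python
import json
import urllib.error


def error_response(exc: Exception) -> dict:
    """Map common client-side failures to an API-style error payload."""
    if isinstance(exc, urllib.error.HTTPError):
        # FastAPI typically wraps error messages in a "detail" field (assumption)
        try:
            detail = json.loads(exc.read().decode("utf-8")).get("detail", str(exc))
        except (ValueError, OSError):
            detail = str(exc)
        return {"success": False, "error": detail}
    if isinstance(exc, urllib.error.URLError):
        return {"success": False, "error": f"network error: {exc.reason}"}
    return {"success": False, "error": str(exc)}
```

Wrapping `urllib.request.urlopen(...)` in `try/except` and passing the caught exception to `error_response` yields a uniform failure payload for all five error classes listed above.
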