Web Scraper API with Custom Model Support

docs/examples/website-to-api/README.md

A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!

Features

  • AI-Powered Scraping: Provide a URL and a plain-English query to extract structured data
  • Beautiful Frontend: Modern minimalist black-and-white interface with smooth UX
  • Custom Model Support: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
  • Model Management: Save, list, and manage multiple model configurations via web interface
  • Dual Scraping Approaches: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
  • API Request History: Automatic saving and display of all API requests with cURL commands
  • Schema Caching: Intelligent caching of generated schemas for faster subsequent requests
  • Duplicate Prevention: Avoids saving duplicate requests (same URL + query)
  • RESTful API: Easy-to-use HTTP endpoints for all operations

Quick Start

1. Install Dependencies

```bash
pip install -r requirements.txt
```

2. Start the API Server

```bash
python app.py
```

The server will start on http://localhost:8000 with a beautiful web interface!

3. Using the Web Interface

Once the server is running, open http://localhost:8000 in your browser to access the web interface.

Pages:

  • Scrape Data: Enter URLs and queries to extract structured data
  • Models: Manage your AI model configurations (add, list, delete)
  • API Requests: View history of all scraping requests with cURL commands

Features:

  • Minimalist Design: Clean black-and-white theme inspired by modern web apps
  • Real-time Results: See extracted data in formatted JSON
  • Copy to Clipboard: Easy copying of results
  • Toast Notifications: User-friendly feedback
  • Dual Scraping Modes: Choose between Schema-based and LLM-based approaches

Model Management

Adding Models via Web Interface

  1. Go to the Models page
  2. Enter your model details:
    • Provider: LLM provider (e.g., gemini/gemini-2.5-flash, openai/gpt-4o)
    • API Token: Your API key for the provider
  3. Click "Add Model"

API Usage for Model Management

Save a Model Configuration

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```

List Saved Models

```bash
curl -X GET "http://localhost:8000/models"
```

Delete a Model Configuration

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```

Scraping Approaches

1. Schema-based Scraping (Faster)

  • Generates CSS selectors for targeted extraction
  • Caches schemas for repeated requests
  • Faster execution for structured websites

2. LLM-based Scraping (More Flexible)

  • Direct LLM extraction without schema generation
  • More flexible for complex or dynamic content
  • Better for unstructured data extraction

Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

  • Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-pro
  • OpenAI: openai/gpt-4, openai/gpt-3.5-turbo
  • Anthropic: anthropic/claude-3-opus, anthropic/claude-3-sonnet
  • And more...

API Endpoints

Core Endpoints

  • POST /scrape - Schema-based scraping
  • POST /scrape-with-llm - LLM-based scraping
  • GET /schemas - List cached schemas
  • POST /clear-cache - Clear schema cache
  • GET /health - Health check

Model Management Endpoints

  • GET /models - List saved model configurations
  • POST /models - Save a new model configuration
  • DELETE /models/{model_name} - Delete a model configuration

API Request History

  • GET /saved-requests - List all saved API requests
  • DELETE /saved-requests/{request_id} - Delete a saved request

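The endpoints above can be exercised from plain Python with only the standard library. The sketch below separates building the request from sending it; `API_BASE` and the payload fields mirror the defaults and examples in this README, while the function names themselves are illustrative assumptions:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default server address from Quick Start


def build_scrape_request(url: str, query: str, model_name: str,
                         use_llm: bool = False) -> urllib.request.Request:
    """Prepare a POST to /scrape or /scrape-with-llm without sending it."""
    endpoint = "/scrape-with-llm" if use_llm else "/scrape"
    payload = {"url": url, "query": query, "model_name": model_name}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url: str, query: str, model_name: str, use_llm: bool = False) -> dict:
    """Send a scrape job to a running server and return the decoded JSON."""
    req = build_scrape_request(url, query, model_name, use_llm)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server from the Quick Start running, `scrape("https://example.com", "Extract the product name, price, and description", "my-custom-model")` sends the same request as the cURL examples.
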
Request/Response Examples

Scrape Request

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```

Scrape Response

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```

Model Configuration Request

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```

Testing

Run the test script to verify the model management functionality:

```bash
python test_models.py
```

File Structure

```
parse_example/
├── api_server.py          # FastAPI server with all endpoints
├── web_scraper_lib.py     # Core scraping library
├── test_models.py         # Test script for model management
├── requirements.txt       # Dependencies
├── static/               # Frontend files
│   ├── index.html        # Main HTML interface
│   ├── styles.css        # CSS styles (minimalist theme)
│   └── script.js         # JavaScript functionality
├── schemas/              # Cached schemas
├── models/               # Saved model configurations
├── saved_requests/       # API request history
└── README.md            # This file
```

Advanced Usage

Using the Library Directly

```python
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )


asyncio.run(main())
```

Schema Caching

The system automatically caches generated schemas based on URL and query combinations:

  • First request: Generates schema using AI
  • Subsequent requests: Uses cached schema for faster extraction
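
The exact cache-key scheme used by the library isn't documented here, but hashing the URL/query pair is one plausible implementation; treat the names below (`SCHEMA_DIR`, `schema_cache_key`) as illustrative assumptions:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # matches the schemas/ directory in the file structure


def schema_cache_key(url: str, query: str) -> str:
    """Derive a stable, filesystem-safe key from the (url, query) pair."""
    return hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()[:16]


def load_cached_schema(url: str, query: str):
    """Return the cached schema for this combination, or None on a miss."""
    path = SCHEMA_DIR / f"{schema_cache_key(url, query)}.json"
    return json.loads(path.read_text()) if path.exists() else None
```
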

API Request History

All API requests are automatically saved with:

  • Request details (URL, query, model used)
  • Response data
  • Timestamp
  • cURL command for re-execution
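
Regenerating a replayable cURL command from a saved request is simple string assembly; a sketch (the saved-request field layout is an assumption based on the request examples above):

```python
import json
import shlex


def to_curl(endpoint: str, payload: dict) -> str:
    """Rebuild a copy-pasteable cURL command for a saved POST request."""
    return (
        f"curl -X POST {shlex.quote(endpoint)} "
        "-H 'Content-Type: application/json' "
        f"-d {shlex.quote(json.dumps(payload))}"
    )
```
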

Duplicate Prevention

The system prevents saving duplicate requests:

  • Same URL + query combinations are not saved multiple times
  • Returns existing request ID for duplicates
  • Keeps the API request history clean
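
The dedup logic can be sketched as an index keyed by a fingerprint of the URL/query pair (the real store persists to `saved_requests/`; this in-memory version is an illustrative assumption):

```python
import hashlib


def request_fingerprint(url: str, query: str) -> str:
    """Identical URL + query pairs map to the same fingerprint."""
    return hashlib.sha256(f"{url}\x00{query}".encode("utf-8")).hexdigest()


class RequestHistory:
    """In-memory sketch of the duplicate-prevention behavior."""

    def __init__(self):
        self._by_fingerprint: dict[str, str] = {}

    def save(self, request_id: str, url: str, query: str) -> str:
        """Save a request unless an identical one exists; return its ID."""
        fp = request_fingerprint(url, query)
        if fp in self._by_fingerprint:
            return self._by_fingerprint[fp]  # existing ID, nothing new saved
        self._by_fingerprint[fp] = request_id
        return request_id
```
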

Error Handling

The API provides detailed error messages for common issues:

  • Invalid URLs
  • Missing model configurations
  • API key errors
  • Network timeouts
  • Parsing errors
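
On the client side, these failures surface as HTTP or network errors; a defensive handler might look like the sketch below (the `detail` field follows FastAPI's usual error shape, which is an assumption about this server):

```python
import json
import urllib.error


def error_response(exc: Exception) -> dict:
    """Map common client-side failures to an API-style error payload."""
    if isinstance(exc, urllib.error.HTTPError):
        # FastAPI typically wraps error messages in a "detail" field (assumption)
        try:
            detail = json.loads(exc.read().decode("utf-8")).get("detail", str(exc))
        except (ValueError, OSError):
            detail = str(exc)
        return {"success": False, "error": detail}
    if isinstance(exc, urllib.error.URLError):
        return {"success": False, "error": f"network error: {exc.reason}"}
    return {"success": False, "error": str(exc)}
```

Wrapping `urllib.request.urlopen(...)` in `try/except` and passing the caught exception to `error_response` yields a uniform failure payload for all five error classes listed above.
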