docs/examples/website-to-api/README.md
A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!
```bash
pip install -r requirements.txt
python app.py
```
The server starts on http://localhost:8000. Open that URL in your browser to access the web interface.
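The interface is served by the API server itself from the `static/` directory. As a rough sketch only (the actual wiring lives in `api_server.py` and may differ), a FastAPI app typically mounts static files like this:

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# API routes are registered first, then the static frontend is mounted
# so that index.html, styles.css and script.js are served from static/
app.mount("/", StaticFiles(directory="static", html=True), name="static")
```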
To use a custom model, save a model configuration with a provider string (e.g. `gemini/gemini-2.5-flash`, `openai/gpt-4o`) and your API token:

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```
List saved model configurations:

```bash
curl -X GET "http://localhost:8000/models"
```

Delete a saved model configuration:

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```
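The same endpoints can be called from Python. The snippet below is a minimal sketch using the `requests` library against the endpoints shown above; the `my-gemini` name is only an illustration:

```python
import requests

BASE_URL = "http://localhost:8000"

# Save a new model configuration (same payload as the curl example above)
resp = requests.post(f"{BASE_URL}/models", json={
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here",
})
print(resp.json())

# List all saved model configurations
print(requests.get(f"{BASE_URL}/models").json())

# Delete a configuration by name
requests.delete(f"{BASE_URL}/models/my-gemini")
```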
The API supports any LLM provider that crawl4ai supports, including:
- `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
- `openai/gpt-4`, `openai/gpt-3.5-turbo`
- `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`

Available endpoints:

- `POST /scrape` - Schema-based scraping
- `POST /scrape-with-llm` - LLM-based scraping
- `GET /schemas` - List cached schemas
- `POST /clear-cache` - Clear schema cache
- `GET /health` - Health check
- `GET /models` - List saved model configurations
- `POST /models` - Save a new model configuration
- `DELETE /models/{model_name}` - Delete a model configuration
- `GET /saved-requests` - List all saved API requests
- `DELETE /saved-requests/{request_id}` - Delete a saved request

Example scrape request body (a Python client example follows the response below):

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```
Example response:

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```
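As a quick end-to-end check, the request/response pair above can be exercised from Python. This is a minimal sketch using the `requests` library, assuming the server is running locally and a model named `my-custom-model` has been saved:

```python
import requests

# Call the schema-based scraping endpoint with the request body shown above
resp = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://example.com",
        "query": "Extract the product name, price, and description",
        "model_name": "my-custom-model",
    },
    timeout=120,  # crawling plus LLM extraction can take a while
)
resp.raise_for_status()

data = resp.json()
if data.get("success"):
    print(data["extracted_data"])
```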
Model configuration format:

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```
Run the test script to verify the model management functionality:
```bash
python test_models.py
```
Project structure:

```
parse_example/
├── api_server.py       # FastAPI server with all endpoints
├── web_scraper_lib.py  # Core scraping library
├── test_models.py      # Test script for model management
├── requirements.txt    # Dependencies
├── static/             # Frontend files
│   ├── index.html      # Main HTML interface
│   ├── styles.css      # CSS styles (minimalist theme)
│   └── script.js       # JavaScript functionality
├── schemas/            # Cached schemas
├── models/             # Saved model configurations
├── saved_requests/     # API request history
└── README.md           # This file
```
The core library can also be used directly from Python:

```python
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )


asyncio.run(main())
```
The system automatically caches generated schemas keyed by the URL and query combination, so repeated requests reuse the cached schema instead of regenerating it; a sketch of the idea follows.
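The actual cache lives in the `schemas/` directory and is implemented in `web_scraper_lib.py`; the snippet below is only an illustrative sketch of keying cached schemas by a hash of the URL and query (the hashing scheme and file names are assumptions, not the project's real implementation):

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")


def schema_cache_path(url: str, query: str) -> Path:
    # Key the cache on the URL + query combination (illustrative scheme)
    key = hashlib.sha256(f"{url}|{query}".encode("utf-8")).hexdigest()[:16]
    return SCHEMA_DIR / f"{key}.json"


def load_cached_schema(url: str, query: str):
    path = schema_cache_path(url, query)
    if path.exists():
        return json.loads(path.read_text())
    return None  # cache miss: generate a new schema and save it here
```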
All API requests are automatically saved to the `saved_requests/` directory, and duplicate requests are skipped rather than saved twice; the sketch below illustrates one way this can work.
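How requests are actually recorded and de-duplicated is defined in `api_server.py`; this is only a sketch under the assumption that identical payloads are detected by hashing their canonical JSON form:

```python
import hashlib
import json
import time
from pathlib import Path

REQUESTS_DIR = Path("saved_requests")


def save_request(payload: dict) -> bool:
    """Persist a request payload once; skip it if an identical one already exists."""
    REQUESTS_DIR.mkdir(exist_ok=True)
    # Identical payloads hash to the same file name, so they are saved only once
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    path = REQUESTS_DIR / f"{digest}.json"
    if path.exists():
        return False  # duplicate request, nothing new is saved
    record = {"timestamp": time.time(), "request": payload}
    path.write_text(json.dumps(record, indent=2))
    return True
```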
The API provides detailed error messages for common issues.
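The exact error format is defined in `api_server.py`; as a rough illustration only, a FastAPI endpoint usually reports such problems by raising `HTTPException`, which the client receives as a JSON body with a `detail` field:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.post("/scrape")
async def scrape(request: dict):
    # Illustrative validation only; the real endpoint lives in api_server.py
    if not str(request.get("url", "")).startswith(("http://", "https://")):
        # The client receives: {"detail": "Invalid URL"} with status 400
        raise HTTPException(status_code=400, detail="Invalid URL")
    ...
```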