# Firecrawl Integration for RAGFlow
This integration adds Firecrawl's web scraping capabilities to RAGFlow, enabling users to import web content directly into their RAG workflows. It implements the requirements from Firecrawl Issue #2167 to add Firecrawl as a data source option in RAGFlow.
## Project Structure

```
intergrations/firecrawl/
├── __init__.py              # Package initialization
├── firecrawl_connector.py   # API communication with Firecrawl
├── firecrawl_config.py      # Configuration management
├── firecrawl_processor.py   # Content processing for RAGFlow
├── firecrawl_ui.py          # UI components for RAGFlow
├── ragflow_integration.py   # Main integration class
├── example_usage.py         # Usage examples
├── requirements.txt         # Python dependencies
├── README.md                # This file
└── INSTALLATION.md          # Installation guide
```
## Getting Started

1. **Get a Firecrawl API key:** Sign up at https://firecrawl.dev and copy your API key (keys start with `fc-`).
2. **Configure in RAGFlow:** Add Firecrawl as a data source and enter your API key.
3. **Test the connection:** Call `test_connection()` to verify that the integration can reach the Firecrawl API.
## Configuration Options

| Option | Description | Default | Required |
|---|---|---|---|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
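As an illustration of how these options fit together, here is a minimal, hypothetical sketch of merging user settings with the defaults above; the actual logic in `firecrawl_config.py` may differ.

```python
# Defaults mirror the configuration table above.
DEFAULTS = {
    "api_url": "https://api.firecrawl.dev",
    "max_retries": 3,
    "timeout": 30,
    "rate_limit_delay": 1.0,
}

def build_config(user_config: dict) -> dict:
    """Merge user settings over the defaults; api_key is required.

    Hypothetical helper for illustration only.
    """
    if not user_config.get("api_key", "").startswith("fc-"):
        raise ValueError("api_key is required and should start with 'fc-'")
    return {**DEFAULTS, **user_config}
```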
## API Reference

### `ragflow_integration.py`

Main integration class for Firecrawl with RAGFlow.

- `scrape_and_import(urls, formats, extract_options)` - Scrape URLs and convert to RAGFlow documents
- `crawl_and_import(start_url, limit, scrape_options)` - Crawl a website and convert to RAGFlow documents
- `test_connection()` - Test the connection to the Firecrawl API
- `validate_config(config_dict)` - Validate configuration settings

### `firecrawl_connector.py`

Handles communication with the Firecrawl API.

- `scrape_url(url, formats, extract_options)` - Scrape a single URL
- `start_crawl(url, limit, scrape_options)` - Start a crawl job
- `get_crawl_status(job_id)` - Get crawl job status
- `batch_scrape(urls, formats)` - Scrape multiple URLs concurrently

### `firecrawl_processor.py`

Processes Firecrawl output for RAGFlow integration.

- `process_content(content)` - Process scraped content into RAGFlow document format
- `process_batch(contents)` - Process multiple scraped contents
- `chunk_content(document, chunk_size, chunk_overlap)` - Chunk document content for RAG processing

## Testing

The integration includes comprehensive testing:
```bash
# Run the test suite
cd intergrations/firecrawl
python3 -c "
import sys
sys.path.append('.')
from ragflow_integration import create_firecrawl_integration

# Test configuration
config = {
    'api_key': 'fc-test-key-123',
    'api_url': 'https://api.firecrawl.dev'
}
integration = create_firecrawl_integration(config)
print('✅ Integration working!')
"
```
## Error Handling

The integration includes robust error handling for network timeouts, rate limiting, and failed API requests (see the `max_retries`, `timeout`, and `rate_limit_delay` options above).
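A simple sketch of how `max_retries` and `rate_limit_delay` could drive retry logic; the actual behavior in `firecrawl_connector.py` (e.g. backoff strategy, which errors are retried) may differ.

```python
import time

def with_retries(fn, max_retries: int = 3, rate_limit_delay: float = 1.0):
    """Call fn(), retrying up to max_retries times with a fixed delay.

    Hypothetical helper for illustration only.
    """
    last_exc = None
    for _attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # a real connector would retry selectively
            last_exc = exc
            time.sleep(rate_limit_delay)
    raise last_exc
```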
## License

This integration is licensed under the same license as RAGFlow (Apache 2.0).

## Acknowledgments

This integration was developed as part of the Firecrawl bounty program to bridge the gap between web content and RAG applications, making it easier for developers to build AI applications that leverage real-time web data.

**Ready for RAGFlow Integration!** This integration enables RAGFlow users to easily import web content into their knowledge retrieval systems, expanding the ecosystem for both Firecrawl and RAGFlow.