Connect websites to automatically crawl and sync web pages into your Supermemory knowledge base. The web crawler respects robots.txt rules, includes SSRF protection, and automatically recrawls sites on a schedule.
<Warning>
The web crawler connector requires a **Scale Plan** or **Enterprise Plan**.
</Warning>

Create a connection to start crawling from a given start URL:

<Tabs>
<Tab title="TypeScript">
```typescript
import Supermemory from 'supermemory';

const client = new Supermemory({
apiKey: process.env.SUPERMEMORY_API_KEY!
});
const connection = await client.connections.create('web-crawler', {
redirectUrl: 'https://yourapp.com/callback',
containerTags: ['user-123', 'website-sync'],
documentLimit: 5000,
metadata: {
startUrl: 'https://docs.example.com'
}
});
// Web crawler doesn't require OAuth - connection is ready immediately
console.log('Connection ID:', connection.id);
console.log('Connection created:', connection.createdAt);
// Note: connection.authLink is undefined for web-crawler
```
</Tab>
<Tab title="Python">
```python
import os

from supermemory import Supermemory

client = Supermemory(api_key=os.environ.get("SUPERMEMORY_API_KEY"))
connection = client.connections.create(
'web-crawler',
redirect_url='https://yourapp.com/callback',
container_tags=['user-123', 'website-sync'],
document_limit=5000,
metadata={
'startUrl': 'https://docs.example.com'
}
)
# Web crawler doesn't require OAuth - connection is ready immediately
print(f'Connection ID: {connection.id}')
print(f'Connection created: {connection.created_at}')
# Note: connection.auth_link is None for web-crawler
```
</Tab>
<Tab title="cURL">
```bash
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "redirectUrl": "https://yourapp.com/callback",
    "containerTags": ["user-123", "website-sync"],
    "documentLimit": 5000,
    "metadata": {"startUrl": "https://docs.example.com"}
  }'

# Response: {
# "id": "conn_wc123",
# "redirectsTo": "https://yourapp.com/callback",
# "authLink": null,
# "expiresIn": null
# }
```
</Tab>
</Tabs>
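Each connection tracks a single start URL in its metadata. To sync several sites, one reasonable pattern is a connection per site, reusing the same `create` call. The loop below is an illustrative sketch; the `site-` tag naming is our own convention, not part of the API:

```typescript
// Hypothetical multi-site setup: one web-crawler connection per start URL.
const sites = ['https://docs.example.com', 'https://blog.example.com'];

for (const startUrl of sites) {
  const conn = await client.connections.create('web-crawler', {
    containerTags: ['user-123', `site-${new URL(startUrl).hostname}`],
    metadata: { startUrl },
  });
  console.log(`Crawling ${startUrl} via connection ${conn.id}`);
}
```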
Unlike other connectors, the web crawler doesn't require OAuth authentication. The connection is established immediately upon creation, and crawling begins automatically.
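Because crawling begins right away, you can poll `listDocuments` (shown in the next section) until pages start landing. A minimal sketch, assuming the `status` field shown in the responses below and an arbitrary 30-second poll interval:

```typescript
// Poll until at least one crawled page has finished processing.
async function waitForFirstPages() {
  for (;;) {
    const documents = await client.connections.listDocuments('web-crawler', {
      containerTags: ['user-123', 'website-sync'],
    });
    if (documents.some((doc) => doc.status === 'done')) return documents;
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // wait 30s
  }
}
```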
Once the connection exists, you can inspect it and list the pages that have been synced:

<Tabs>
<Tab title="TypeScript">
```typescript
// Inspect the connection object returned by create()
console.log('Start URL:', connection.metadata?.startUrl);
console.log('Connection created:', connection.createdAt);
// List synced web pages
const documents = await client.connections.listDocuments('web-crawler', {
containerTags: ['user-123', 'website-sync']
});
console.log(`Synced ${documents.length} web pages`);
```
</Tab>
<Tab title="Python">
```python
print(f'Start URL: {connection.metadata.get("startUrl")}')
print(f'Connection created: {connection.created_at}')
# List synced web pages
documents = client.connections.list_documents(
'web-crawler',
container_tags=['user-123', 'website-sync']
)
print(f'Synced {len(documents)} web pages')
```
</Tab>
<Tab title="cURL">
```bash
# A web crawler connection looks like:
# {
# "id": "conn_wc123",
# "provider": "web-crawler",
# "createdAt": "2024-01-15T10:00:00Z",
# "documentLimit": 5000,
# "metadata": {"startUrl": "https://docs.example.com", ...}
# }
# List synced documents
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/documents" \
-H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
-H "Content-Type: application/json" \
-d '{"containerTags": ["user-123", "website-sync"]}'
# Response: Array of document objects
# [
# {"title": "Home Page", "type": "webpage", "status": "done", "url": "https://docs.example.com"},
# {"title": "Getting Started", "type": "webpage", "status": "done", "url": "https://docs.example.com/getting-started"}
# ]
```
</Tab>
</Tabs>
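Since each document carries its source `url` (see the response above), you can slice the synced set by site section. A small sketch, using the example paths from this page:

```typescript
// Check which pages under a given path have been crawled.
const documents = await client.connections.listDocuments('web-crawler', {
  containerTags: ['user-123', 'website-sync'],
});

const guides = documents.filter((doc) =>
  doc.url?.startsWith('https://docs.example.com/getting-started'),
);
console.log(`${guides.length} getting-started pages synced`);
```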
The web crawler only processes valid public URLs.
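As an illustration of what "valid public URL" typically means, the sketch below applies the kind of checks an SSRF-safe crawler performs before fetching. It is not Supermemory's actual implementation:

```typescript
// Illustrative URL validation: HTTP(S) only, and the resolved address
// must not fall in a private or link-local range.
import { lookup } from 'node:dns/promises';
import { BlockList } from 'node:net';

const privateRanges = new BlockList();
privateRanges.addSubnet('10.0.0.0', 8);
privateRanges.addSubnet('172.16.0.0', 12);
privateRanges.addSubnet('192.168.0.0', 16);
privateRanges.addSubnet('127.0.0.0', 8);
privateRanges.addSubnet('169.254.0.0', 16); // link-local / cloud metadata

async function isCrawlableUrl(raw: string): Promise<boolean> {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  try {
    // Resolve before fetching; IPv4 only to keep the sketch short.
    const { address } = await lookup(url.hostname, { family: 4 });
    return !privateRanges.check(address);
  } catch {
    return false; // hostname did not resolve
  }
}
```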
The web crawler uses scheduled recrawling rather than real-time webhooks: sites are revisited on a schedule, and changed pages are picked up on the next crawl.

To find existing web crawler connections, list your connections and filter by provider:

<Tabs>
<Tab title="TypeScript">
```typescript
const connections = await client.connections.list({
  containerTags: ['user-123']
});

const webCrawlerConnections = connections.filter(
conn => conn.provider === 'web-crawler'
);
webCrawlerConnections.forEach(conn => {
console.log(`Start URL: ${conn.metadata?.startUrl}`);
console.log(`Connection ID: ${conn.id}`);
console.log(`Created: ${conn.createdAt}`);
});
```
</Tab>
<Tab title="Python">
```python
connections = client.connections.list(container_tags=['user-123'])

web_crawler_connections = [
conn for conn in connections if conn.provider == 'web-crawler'
]
for conn in web_crawler_connections:
print(f'Start URL: {conn.metadata.get("startUrl")}')
print(f'Connection ID: {conn.id}')
print(f'Created: {conn.created_at}')
```
</Tab>
<Tab title="cURL">
```bash
# List connections
curl -X POST "https://api.supermemory.ai/v3/connections/list" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'

# Response: [
# {
# "id": "conn_wc123",
# "provider": "web-crawler",
# "createdAt": "2024-01-15T10:30:00.000Z",
# "documentLimit": 5000,
# "metadata": {"startUrl": "https://docs.example.com", ...}
# }
# ]
```
</Tab>
</Tabs>
Remove a web crawler connection when no longer needed:
<Tabs>
<Tab title="TypeScript">
```typescript
// Delete by connection ID
const result = await client.connections.delete('connection_id_123');
console.log('Deleted connection:', result.id);

// Delete by provider and container tags
const providerResult = await client.connections.deleteByProvider('web-crawler', {
containerTags: ['user-123']
});
console.log('Deleted web crawler connection for user');
```
</Tab>
<Tab title="Python">
```python
# Delete by connection ID
result = client.connections.delete('connection_id_123')
print(f'Deleted connection: {result.id}')

# Delete by provider and container tags
provider_result = client.connections.delete_by_provider(
'web-crawler',
container_tags=['user-123']
)
print('Deleted web crawler connection for user')
```
</Tab>
<Tab title="cURL">
```bash
# Delete by provider and container tags
curl -X DELETE "https://api.supermemory.ai/v3/connections/web-crawler" \
-H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
-H "Content-Type: application/json" \
-d '{"containerTags": ["user-123"]}'
```
</Tab>
</Tabs>
Control which web pages get synced using the settings API:
<Tabs>
<Tab title="TypeScript">
```typescript
// Configure intelligent filtering for web content
await client.settings.update({
  shouldLLMFilter: true,
  includeItems: {
    urlPatterns: ['*docs*', '*documentation*', '*guide*'],
    titlePatterns: ['*Getting Started*', '*API Reference*', '*Tutorial*']
  },
  excludeItems: {
    urlPatterns: ['*admin*', '*private*', '*test*'],
    titlePatterns: ['*Draft*', '*Archive*', '*Old*']
  },
  filterPrompt: "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
});
```
</Tab>
<Tab title="Python">
```python
# Configure intelligent filtering for web content
client.settings.update(
    should_llm_filter=True,
    include_items={
        'urlPatterns': ['*docs*', '*documentation*', '*guide*'],
        'titlePatterns': ['*Getting Started*', '*API Reference*', '*Tutorial*']
    },
    exclude_items={
        'urlPatterns': ['*admin*', '*private*', '*test*'],
        'titlePatterns': ['*Draft*', '*Archive*', '*Old*']
    },
    filter_prompt="Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
)
```
</Tab>
<Tab title="cURL">
```bash
# Configure intelligent filtering for web content
curl -X PATCH "https://api.supermemory.ai/v3/settings" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "shouldLLMFilter": true,
    "includeItems": {
      "urlPatterns": ["*docs*", "*documentation*", "*guide*"],
      "titlePatterns": ["*Getting Started*", "*API Reference*", "*Tutorial*"]
    },
    "excludeItems": {
      "urlPatterns": ["*admin*", "*private*", "*test*"],
      "titlePatterns": ["*Draft*", "*Archive*", "*Old*"]
    },
    "filterPrompt": "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
  }'
```
</Tab>
</Tabs>

Supermemory includes built-in protection against Server-Side Request Forgery (SSRF) attacks: every URL is validated before it is crawled.
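Concretely, checks like the `isCrawlableUrl` sketch shown earlier would accept or reject URLs along these lines (illustrative values):

```typescript
// Illustrative outcomes of SSRF-style URL validation.
await isCrawlableUrl('https://docs.example.com');      // true  - public site
await isCrawlableUrl('http://192.168.1.1/admin');      // false - private range
await isCrawlableUrl('http://169.254.169.254/latest'); // false - cloud metadata endpoint
await isCrawlableUrl('ftp://example.com/file');        // false - unsupported scheme
```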