# @lobechat/web-crawler
LobeHub's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
@lobechat/web-crawler is a core component of LobeHub responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
Web page structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can contribute by adding URL rules like the following:
```typescript
// Example: handling specific websites
const urlRules = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',
    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',
    // Optional: specify crawling implementations; supports 'naive', 'jina', 'search1api', and 'browserless'
    impls: ['naive', 'jina', 'search1api', 'browserless'],
    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable the Readability algorithm to filter out distracting elements
      enableReadability: true,
      // Whether to convert the result to plain text
      pureText: false,
    },
  },
];
```
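To illustrate how such a rule could be applied, here is a minimal sketch (not the module's actual implementation): `urlPattern` is treated as a regular expression, and capture groups referenced as `$1`, `$2`, ... in `urlTransform` are substituted from the match. The `UrlRule` interface and `transformUrl` helper below are hypothetical names for illustration only.

```typescript
// Hypothetical sketch of URL-rule matching; not the module's real code.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
}

function transformUrl(url: string, rules: UrlRule[]): string {
  for (const rule of rules) {
    const match = url.match(new RegExp(`^${rule.urlPattern}$`));
    if (match && rule.urlTransform) {
      // Substitute $1, $2, ... with the corresponding capture groups
      return rule.urlTransform.replace(/\$(\d+)/g, (_, i) => match[Number(i)] ?? '');
    }
  }
  return url; // no rule matched: leave the URL unchanged
}

const rules: UrlRule[] = [
  {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
  },
];

console.log(transformUrl('https://example.com/articles/how-to-crawl', rules));
// → https://example.com/print/how-to-crawl
```

A rule with no `urlTransform` would simply leave the URL as-is while still selecting `impls` and `filterOptions` for it.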
This is an internal LobeHub module (`"private": true`), designed specifically for LobeHub and not published as a standalone package.