examples/projects/web-scraping/README.md
This project demonstrates how to build a real-time web scraper using Pathway, a powerful data processing framework. The implementation fetches and processes news articles from websites, making it possible to continuously monitor and analyze web content.
This project consists of two main Python files:
- `scraping_python.py`: contains the core web scraping functionality using the `newspaper4k` and `news-please` libraries
- `scraping_pathway.py`: implements a Pathway connector that integrates the scraper with Pathway's data processing pipeline

Install the dependencies with:

pip install -r requirements.txt
`scraping_python.py` provides the core scraping functionality:

- `newspaper4k` builds a site map and discovers article URLs
- `news-please` fetches and parses article content

The main function, `scrape_articles()`, is a generator that yields article data at a configurable refresh interval.
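The generator pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `discover_article_urls()` and `fetch_article()` are hypothetical placeholders standing in for the `newspaper4k` site-map discovery and `news-please` fetching used in `scraping_python.py`.

```python
import time
from typing import Iterator


def discover_article_urls(site: str) -> list[str]:
    # Placeholder for newspaper4k site-map discovery (hypothetical).
    return [site + "/article-1", site + "/article-2"]


def fetch_article(url: str) -> dict:
    # Placeholder for news-please fetching and parsing (hypothetical).
    return {"url": url, "title": "stub title", "text": "stub text"}


def scrape_articles(website_urls: list[str], refresh_interval: float) -> Iterator[dict]:
    """Yield article dicts, then sleep and re-check the sites in a loop."""
    seen: set[str] = set()
    while True:
        for site in website_urls:
            for article_url in discover_article_urls(site):
                if article_url in seen:
                    continue  # skip articles already yielded in an earlier cycle
                seen.add(article_url)
                yield fetch_article(article_url)
        time.sleep(refresh_interval)  # wait before the next scraping cycle
```

Because the function is a generator, the Pathway connector can pull articles one at a time as they are discovered, rather than waiting for a full scraping cycle to finish.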
`scraping_pathway.py` integrates the scraper with Pathway. It defines a `NewsScraperSubject` class that inherits from Pathway's `ConnectorSubject` and feeds the scraped articles into the pipeline.

To run with Docker, build the image:
docker build . -t scraper
Run the container:
docker run -t scraper
Optionally, you can mount a volume to persist the scraped articles (a JSONL file in this case):
docker run -v $(pwd):/app scraper
To run the scraper locally (without Docker):
python scraping_pathway.py
In scraping_pathway.py, you can configure:
- `website_urls`: the list of websites to scrape
- `refresh_interval`: time between scraping cycles (in seconds)

The scraper produces a JSONL file containing the scraped articles with their content and optional metadata.
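The JSONL output (one JSON object per line) can be consumed with the standard library alone. The snippet below is a sketch; the exact field names (`url`, `title`, `text`) are assumptions and may differ from what the scraper actually emits.

```python
import json


def load_articles(path: str) -> list[dict]:
    """Read a JSONL file produced by the scraper: one JSON article per line."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines (e.g. a trailing newline at end of file).
        return [json.loads(line) for line in f if line.strip()]
```

This makes it easy to post-process scraped articles in a separate script, independently of the running Pathway pipeline.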