docs/examples/data_connectors/WebPageDemo.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/WebPageDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Demonstrates our web page reader.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index llama-index-readers-web
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os
# NOTE: the html_to_text=True option requires html2text to be installed
documents = SimpleWebPageReader(html_to_text=True).load_data(
["http://paulgraham.com/worked.html"]
)
documents[0]
index = SummaryIndex.from_documents(documents)
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
Spider is the fastest crawler. It converts any website into pure HTML, markdown, metadata, or text while enabling you to crawl with custom actions using AI.
Spider also offers high-performance proxies to prevent detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.
Prerequisites: you need a Spider API key to use this loader. You can get one at spider.cloud.
# Scrape single URL
from llama_index.readers.web import SpiderWebReader
spider_reader = SpiderWebReader(
api_key="YOUR_API_KEY", # Get one at https://spider.cloud
mode="scrape",
# params={} # Optional parameters see more on https://spider.cloud/docs/api
)
documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)
Crawl a domain, following all deeper subpages
# Crawl domain with deeper crawling following subpages
from llama_index.readers.web import SpiderWebReader
spider_reader = SpiderWebReader(
api_key="YOUR_API_KEY",
mode="crawl",
# params={} # Optional parameters see more on https://spider.cloud/docs/api
)
documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)
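The commented-out params argument above accepts Spider API options. Below is a minimal sketch; the parameter names limit and return_format are assumptions taken from the Spider API docs, so verify them at https://spider.cloud/docs/api before relying on them.
from llama_index.readers.web import SpiderWebReader

spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",
    mode="crawl",
    params={
        "limit": 5,  # assumed parameter: cap the number of pages crawled
        "return_format": "markdown",  # assumed parameter: return markdown instead of raw HTML
    },
)
documents = spider_reader.load_data(url="https://spider.cloud")
print(len(documents))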
For guides and documentation, visit the Spider docs.
Browserbase is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.
Set your Browserbase credentials as environment variables (BROWSERBASE_API_KEY, BROWSERBASE_PROJECT_ID).
%pip install browserbase
from llama_index.readers.web import BrowserbaseWebReader
reader = BrowserbaseWebReader()
docs = reader.load_data(
urls=[
"https://example.com",
],
# Text mode: set text_content=True to return text instead of page HTML
text_content=False,
)
Firecrawl is an API that turns entire websites into clean, LLM-accessible markdown.
Using Firecrawl to gather an entire website
%pip install firecrawl-py
from llama_index.readers.web import FireCrawlWebReader
# Using Firecrawl to crawl a website
firecrawl_reader = FireCrawlWebReader(
api_key="<your_api_key>", # Replace with your actual API key from https://www.firecrawl.dev/
mode="crawl", # "crawl" gathers a page and its accessible subpages; "scrape" fetches a single page
params={"additional": "parameters"}, # Optional additional parameters
)
# Load documents by crawling, starting from the given URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/")
index = SummaryIndex.from_documents(documents)
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
Using firecrawl for a single page
# Initialize the FireCrawlWebReader with your API key and desired mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader
firecrawl_reader = FireCrawlWebReader(
api_key="<your_api_key>", # Replace with your actual API key from https://www.firecrawl.dev/
mode="scrape", # "scrape" fetches a single page; "crawl" also follows subpages
params={"additional": "parameters"}, # Optional additional parameters
)
# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/worked.html")
index = SummaryIndex.from_documents(documents)
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
Using FireCrawl's extract mode to extract structured data from URLs
# Initialize the FireCrawlWebReader with your API key and extract mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader
firecrawl_reader = FireCrawlWebReader(
api_key="<your_api_key>", # Replace with your actual API key from https://www.firecrawl.dev/
mode="extract", # Use extract mode to extract structured data
params={
"prompt": "Extract the title, author, and main points from this essay",
# Required prompt parameter for extract mode
},
)
# Load documents by providing a list of URLs to extract data from
documents = firecrawl_reader.load_data(
urls=[
"https://www.paulgraham.com",
"https://www.paulgraham.com/worked.html",
]
)
index = SummaryIndex.from_documents(documents)
# Query the extracted structured data
query_engine = index.as_query_engine()
response = query_engine.query("What are the main points from these essays?")
display(Markdown(f"<b>{response}</b>"))
Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping need, such as scraping a single page or crawling an entire site.
For more information about Hyperbrowser and its key features, please visit the Hyperbrowser website or check out the Hyperbrowser docs.
You can set your API key via the HYPERBROWSER_API_KEY environment variable, or pass it to the HyperbrowserWebReader constructor.
%pip install hyperbrowser
from llama_index.readers.web import HyperbrowserWebReader
reader = HyperbrowserWebReader(api_key="your_api_key_here")
docs = reader.load_data(
urls=["https://example.com"],
operation="scrape",
)
docs
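The description above also mentions crawling an entire site. A minimal sketch, assuming the reader accepts operation="crawl" in addition to "scrape" (check the Hyperbrowser docs to confirm):
from llama_index.readers.web import HyperbrowserWebReader

reader = HyperbrowserWebReader(api_key="your_api_key_here")
# Assumed: "crawl" follows links from the start URL instead of scraping only that page
docs = reader.load_data(
    urls=["https://example.com"],
    operation="crawl",
)
docs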
from llama_index.readers.web import TrafilaturaWebReader
documents = TrafilaturaWebReader().load_data(
["http://paulgraham.com/worked.html"]
)
index = SummaryIndex.from_documents(documents)
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
from llama_index.core import SummaryIndex
from llama_index.readers.web import RssReader
documents = RssReader().load_data(
["https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"]
)
index = SummaryIndex.from_documents(documents)
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What happened in the news today?")
ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It allows for extracting web page data into accessible LLM markdown or text. Install ScrapFly Python SDK using pip:
pip install scrapfly-sdk
Here is a basic usage example of the ScrapflyReader:
from llama_index.readers.web import ScrapflyReader
# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
ignore_scrape_failures=True, # Ignore unprocessable web pages and log their exceptions
)
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
urls=["https://web-scraping.dev/products"]
)
The ScrapflyReader also allows passing a ScrapeConfig object for customizing the scrape request. See the documentation for the full feature details and their API params: https://scrapfly.io/docs/scrape-api/getting-started
from llama_index.readers.web import ScrapflyReader
# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
ignore_scrape_failures=True, # Ignore unprocessable web pages and log their exceptions
)
scrapfly_scrape_config = {
"asp": True, # Bypass scraping blocking and antibot solutions, like Cloudflare
"render_js": True, # Enable JavaScript rendering with a cloud headless browser
"proxy_pool": "public_residential_pool", # Select a proxy pool (datacenter or residnetial)
"country": "us", # Select a proxy location
"auto_scroll": True, # Auto scroll the page
"js": "", # Execute custom JavaScript code by the headless browser
}
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
urls=["https://web-scraping.dev/products"],
scrape_config=scrapfly_scrape_config, # Pass the scrape config
scrape_format="markdown", # The scrape result format, either `markdown`(default) or `text`
)
ZyteWebReader allows a user to access the content of web pages in different modes ("article", "html-text", "html"). It lets the user change settings such as browser rendering and JavaScript, since many sites require these options to be enabled before their relevant content can be accessed. All supported options can be found here: https://docs.zyte.com/zyte-api/usage/reference.html
To install dependencies:
pip install zyte-api
To get your Zyte API key, please visit: https://docs.zyte.com/zyte-api/get-started.html
from llama_index.readers.web import ZyteWebReader
# Required to run it in notebook
# import nest_asyncio
# nest_asyncio.apply()
# Initiate ZyteWebReader with your Zyte API key
zyte_reader = ZyteWebReader(
api_key="your ZYTE API key here",
mode="article", # or "html-text" or "html"
)
urls = [
"https://www.zyte.com/blog/web-scraping-apis/",
"https://www.zyte.com/blog/system-integrators-extract-big-data/",
]
documents = zyte_reader.load_data(
urls=urls,
)
print(len(documents[0].text))
Browser rendering and JavaScript can be enabled by passing the corresponding parameters during initialization.
zyte_dw_params = {
"browserHtml": True, # Enable browser rendering
"javascript": True, # Enable JavaScript
}
# Initiate ZyteWebReader with your Zyte API key and use default "article" mode
zyte_reader = ZyteWebReader(
api_key="your ZYTE API key here",
download_kwargs=zyte_dw_params,
)
# Load documents from URLs
documents = zyte_reader.load_data(
urls=urls,
)
len(documents[0].text)
Set "continue_on_failure" to False if you'd like to stop when any request fails.
zyte_reader = ZyteWebReader(
api_key="your ZYTE API key here",
mode="html-text",
download_kwargs=zyte_dw_params,
continue_on_failure=False,
)
# Load documents from URLs
documents = zyte_reader.load_data(
urls=urls,
)
len(documents[0].text)
In the default ("article") mode only the article text is extracted, while in "html-text" mode the full text of the webpage is extracted, so the text is significantly longer.
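To see the difference directly, here is a minimal sketch that loads the same urls list defined above in both modes and compares the text lengths (both modes are taken from the ZyteWebReader options shown earlier):
from llama_index.readers.web import ZyteWebReader

# Compare extraction modes on the same URLs
article_reader = ZyteWebReader(api_key="your ZYTE API key here", mode="article")
html_text_reader = ZyteWebReader(api_key="your ZYTE API key here", mode="html-text")

article_docs = article_reader.load_data(urls=urls)
html_text_docs = html_text_reader.load_data(urls=urls)

print(len(article_docs[0].text), len(html_text_docs[0].text))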
Use AgentQL to scrape structured data from a website.
from llama_index.readers.web import AgentQLWebReader
from llama_index.core import VectorStoreIndex
from IPython.display import Markdown, display
# Using AgentQL to crawl a website
agentql_reader = AgentQLWebReader(
api_key="YOUR_API_KEY", # Replace with your actual API key from https://dev.agentql.com
params={
"is_scroll_to_bottom_enabled": True
}, # Optional additional parameters
)
# Load documents from a single page URL
document = agentql_reader.load_data(
url="https://www.ycombinator.com/companies?batch=W25",
query="{ company[] { name location description industry_category link(a link to the company's detail on Ycombinator)} }",
)
index = VectorStoreIndex.from_documents(document)
query_engine = index.as_query_engine()
response = query_engine.query(
"Find companies that are working on web agent, list their names, locations and link"
)
display(Markdown(f"<b>{response}</b>"))
OxylabsWebReader allows a user to scrape any website with different parameters while bypassing most anti-bot tools. Check out the Oxylabs documentation for the full list of parameters.
Claim free API credentials by creating an Oxylabs account here.
from llama_index.readers.web import OxylabsWebReader
reader = OxylabsWebReader(
username="OXYLABS_USERNAME", password="OXYLABS_PASSWORD"
)
documents = reader.load_data(
[
"https://sandbox.oxylabs.io/products/1",
"https://sandbox.oxylabs.io/products/2",
]
)
print(documents[0].text)
Another example with parameters for selecting the geolocation, user agent type, JavaScript rendering, headers, and cookies.
documents = reader.load_data(
[
"https://sandbox.oxylabs.io/products/3",
],
{
"geo_location": "Berlin, Germany",
"render": "html",
"user_agent_type": "mobile",
"context": [
{"key": "force_headers", "value": True},
{"key": "force_cookies", "value": True},
{
"key": "headers",
"value": {
"Content-Type": "text/html",
"Custom-Header-Name": "custom header content",
},
},
{
"key": "cookies",
"value": [
{"key": "NID", "value": "1234567890"},
{"key": "1P JAR", "value": "0987654321"},
],
},
{"key": "http_method", "value": "get"},
{"key": "follow_redirects", "value": True},
{"key": "successful_status_codes", "value": [808, 909]},
],
},
)
ZenRows is a powerful web scraping API that provides advanced features for bypassing anti-bot measures and extracting data from modern websites.
Prerequisites: You need to have a ZenRows API key to use this reader. You can get one at zenrows.com.
# Basic web scraping with ZenRows
from llama_index.readers.web import ZenRowsWebReader
zenrows_reader = ZenRowsWebReader(
api_key="YOUR_API_KEY", # Get one at https://app.zenrows.com/register
response_type="markdown",
)
# Scrape a single URL
documents = zenrows_reader.load_data(["https://httpbin.io/html"])
print(documents[0].text[:500]) # Print first 500 characters
# Advanced scraping with anti-bot bypass
zenrows_advanced = ZenRowsWebReader(
api_key="YOUR_API_KEY",
js_render=True, # Enable JavaScript rendering
premium_proxy=True, # Use residential proxies
proxy_country="us", # Optional: specify country
)
documents = zenrows_advanced.load_data(
["https://www.scrapingcourse.com/antibot-challenge"]
)
print(f"Scraped {len(documents[0].text)} characters with advanced features")
# Integration with LlamaIndex - scraping multiple pages
zenrows_reader = ZenRowsWebReader(
api_key="YOUR_API_KEY", js_render=True, response_type="markdown"
)
# Scrape multiple URLs
urls = ["https://example.com/", "https://httpbin.io/html"]
documents = zenrows_reader.load_data(urls)
# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What content was found on these pages?")
display(Markdown(f"<b>{response}</b>"))
For more advanced features like custom headers, CSS data extraction, screenshot capabilities, and detailed configuration options, visit the ZenRows documentation.
Olostep is a reliable and cost-effective web scraping API built for scale. It bypasses bot detection, delivers results in seconds, and can process millions of requests.
The API returns clean data from any website in various formats, including Markdown, HTML, and structured JSON.
Sign up here and get 1000 credits for free.
# Scraping content in Markdown
from llama_index.readers.web import OlostepWebReader
from llama_index.core import SummaryIndex
# Initialize the reader in scrape mode
reader = OlostepWebReader(api_key="YOUR_OLOSTEP_API_KEY", mode="scrape")
# Load data from a URL
documents = reader.load_data(url="https://www.olostep.com/")
# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize in 100 words")
print(response)
# Running Google Searches
from llama_index.readers.web import OlostepWebReader
from llama_index.core import SummaryIndex
# Initialize the reader in search mode
reader = OlostepWebReader(api_key="YOUR_OLOSTEP_API_KEY", mode="search")
# Load data using a search query
documents = reader.load_data(query="What are the latest advancements in AI?")
# You can also pass additional parameters, for example, to specify the country for the search
documents_with_params = reader.load_data(
query="What are the latest advancements in AI?", params={"country": "US"}
)
# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("List me the headlines")
print(response)
Scrapy is a popular web crawling framework for Python. The ScrapyWebReader allows you to leverage Scrapy's powerful crawling capabilities to extract data from websites. It can be used in two ways: by passing a Spider class directly, or by pointing it at an existing Scrapy project, as shown below.
from scrapy.spiders import Spider
from llama_index.readers.web import ScrapyWebReader
class SampleSpider(Spider):
    name = "sample_spider"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        ...
reader = ScrapyWebReader()
docs = reader.load_data(SampleSpider)
Downloading a Sample Scrapy Project
!git clone https://github.com/scrapy/quotesbot.git
Using the Scrapy project with the spider named "toscrape-css"
from llama_index.readers.web import ScrapyWebReader
reader = ScrapyWebReader(project_path="./quotesbot")
docs = reader.load_data("toscrape-css")
Some keys from the scraped items can be stored as metadata in the Document object. You can specify which keys to include as metadata using the metadata_keys parameter. If you want to keep those keys in both the document content and the metadata, set the keep_keys parameter to True.
reader = ScrapyWebReader(
project_path="./quotesbot",
metadata_keys=["author", "tags"],
keep_keys=True,
)
docs = reader.load_data("toscrape-css")