<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/WebPageDemo.ipynb" target="_parent"></a>

Web Page Reader

This notebook demonstrates our web page readers.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex πŸ¦™.

python
%pip install llama-index llama-index-readers-web
python
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

Using SimpleWebPageReader

python
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os
python
# NOTE: the html_to_text=True option requires html2text to be installed
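
If html2text isn't already installed in your environment, you can add it; it is only needed when html_to_text=True:

python
%pip install html2text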
python
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
python
documents[0]
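
To peek at what was loaded, you can check the document's text length and metadata (a small optional check):

python
# Inspect the first loaded document
print(len(documents[0].text))
print(documents[0].metadata)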
python
index = SummaryIndex.from_documents(documents)
python
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
python
display(Markdown(f"<b>{response}</b>"))

Using Spider Reader πŸ•·

Spider is a high-performance crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.

Spider also offers high-performance proxies to prevent detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.

Prerequisites: you need a Spider API key to use this loader. You can get one at spider.cloud.

python
# Scrape single URL
from llama_index.readers.web import SpiderWebReader

spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",  # Get one at https://spider.cloud
    mode="scrape",
    # params={} # Optional parameters see more on https://spider.cloud/docs/api
)

documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)
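
The commented-out params argument accepts options from the Spider API. As an illustrative sketch only (the return_format name is an assumption based on the Spider docs; verify the exact options at https://spider.cloud/docs/api):

python
# Hypothetical example: ask Spider to return markdown instead of raw HTML
spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",
    mode="scrape",
    params={"return_format": "markdown"},  # parameter name per Spider API docs; verify before use
)

documents = spider_reader.load_data(url="https://spider.cloud")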

Crawl a domain, following all deeper subpages

python
# Crawl domain with deeper crawling following subpages
from llama_index.readers.web import SpiderWebReader

spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",
    mode="crawl",
    # params={} # Optional parameters see more on https://spider.cloud/docs/api
)

documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)

For guides and documentation, visit Spider.

Using Browserbase Reader πŸ…±οΈ

Browserbase is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and CAPTCHA solving.

Installation and Setup

  • Get an API key and Project ID from browserbase.com and set them as environment variables (BROWSERBASE_API_KEY, BROWSERBASE_PROJECT_ID).
  • Install the Browserbase SDK:
python
%pip install browserbase
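
If you prefer not to export the variables in your shell, you can set them from Python before constructing the reader (the values below are placeholders):

python
import os

# Placeholder values; use the API key and Project ID from your Browserbase dashboard
os.environ["BROWSERBASE_API_KEY"] = "your-browserbase-api-key"
os.environ["BROWSERBASE_PROJECT_ID"] = "your-browserbase-project-id"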
python
from llama_index.readers.web import BrowserbaseWebReader
python
reader = BrowserbaseWebReader()
docs = reader.load_data(
    urls=[
        "https://example.com",
    ],
    # text_content=True would return plain page text; False keeps the page HTML
    text_content=False,
)

Using FireCrawl Reader πŸ”₯

Firecrawl is an API that turns entire websites into clean, LLM-accessible markdown.

Using Firecrawl to gather an entire website

python
%pip install firecrawl-py
python
from llama_index.readers.web import FireCrawlWebReader
python
# Using Firecrawl to crawl an entire website
firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="crawl",  # "crawl" follows links across the site; use "scrape" for a single page
    params={"additional": "parameters"},  # Optional additional parameters
)

# Load documents starting from the root URL; crawl mode gathers the subpages it finds
documents = firecrawl_reader.load_data(url="http://paulgraham.com/")
python
index = SummaryIndex.from_documents(documents)
python
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
python
display(Markdown(f"<b>{response}</b>"))

Using Firecrawl for a single page

python
# Initialize the FireCrawlWebReader with your API key and desired mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # Choose between "crawl" and "scrape" for single page scraping
    params={"additional": "parameters"},  # Optional additional parameters
)

# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/worked.html")
python
index = SummaryIndex.from_documents(documents)
python
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
python
display(Markdown(f"<b>{response}</b>"))

Using FireCrawl's extract mode to extract structured data from URLs

python
# Initialize the FireCrawlWebReader with your API key and extract mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="extract",  # Use extract mode to extract structured data
    params={
        "prompt": "Extract the title, author, and main points from this essay",
        # Required prompt parameter for extract mode
    },
)

# Load documents by providing a list of URLs to extract data from
documents = firecrawl_reader.load_data(
    urls=[
        "https://www.paulgraham.com",
        "https://www.paulgraham.com/worked.html",
    ]
)
python
index = SummaryIndex.from_documents(documents)
python
# Query the extracted structured data
query_engine = index.as_query_engine()
response = query_engine.query("What are the main points from these essays?")

display(Markdown(f"<b>{response}</b>"))

Using Hyperbrowser Reader ⚑

Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping need, such as scraping a single page or crawling an entire site.

Key Features:

  • Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
  • Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
  • Powerful APIs - Easy to use APIs for scraping/crawling any site, and much more
  • Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies

For more information, visit the Hyperbrowser website, or see the Hyperbrowser docs.

Installation and Setup

  • Head to Hyperbrowser to sign up and generate an API key. Once you've done this, set the HYPERBROWSER_API_KEY environment variable, or pass the key directly to the HyperbrowserWebReader constructor.
  • Install the Hyperbrowser SDK:
python
%pip install hyperbrowser
python
from llama_index.readers.web import HyperbrowserWebReader

reader = HyperbrowserWebReader(api_key="your_api_key_here")
docs = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
)
docs
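
Alternatively, as noted above, you can set the HYPERBROWSER_API_KEY environment variable instead of passing the key to the constructor (a minimal sketch with a placeholder value):

python
import os

# Placeholder value; as described above, the reader can pick up HYPERBROWSER_API_KEY
# from the environment when no api_key is passed explicitly
os.environ["HYPERBROWSER_API_KEY"] = "your_api_key_here"

reader = HyperbrowserWebReader()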

Using TrafilaturaWebReader

python
from llama_index.readers.web import TrafilaturaWebReader
python
documents = TrafilaturaWebReader().load_data(
    ["http://paulgraham.com/worked.html"]
)
python
index = SummaryIndex.from_documents(documents)
python
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
python
display(Markdown(f"<b>{response}</b>"))

Using RssReader

python
from llama_index.core import SummaryIndex
from llama_index.readers.web import RssReader

documents = RssReader().load_data(
    ["https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"]
)

index = SummaryIndex.from_documents(documents)

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What happened in the news today?")
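
As in the earlier sections, the response can be rendered inline (this reuses the Markdown display helper imported above):

python
display(Markdown(f"<b>{response}</b>"))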

Using ScrapFly

ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data into LLM-accessible markdown or text. Install the ScrapFly Python SDK using pip:

shell
pip install scrapfly-sdk

Here is basic usage of the ScrapflyReader:

python
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

The ScrapflyReader also allows passing a ScrapeConfig object to customize the scrape request. See the documentation for full feature details and API params: https://scrapfly.io/docs/scrape-api/getting-started

python
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residnetial)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown`(default) or `text`
)
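
As with the other readers, the loaded documents can be indexed and queried (this reuses the SummaryIndex and Markdown display helpers imported earlier):

python
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What products are listed on the page?")

display(Markdown(f"<b>{response}</b>"))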

Using ZyteWebReader

ZyteWebReader allows a user to access the content of a webpage in different modes ("article", "html-text", "html"). It lets the user change settings such as browser rendering and JavaScript, since many sites require these options to be enabled before their relevant content can be accessed. All supported options can be found here: https://docs.zyte.com/zyte-api/usage/reference.html

To install dependencies:

shell
pip install zyte-api

To get your Zyte API key, please visit: https://docs.zyte.com/zyte-api/get-started.html

python
from llama_index.readers.web import ZyteWebReader

# Required to run it in notebook
# import nest_asyncio
# nest_asyncio.apply()


# Initiate ZyteWebReader with your Zyte API key
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    mode="article",  # or "html-text" or "html"
)

urls = [
    "https://www.zyte.com/blog/web-scraping-apis/",
    "https://www.zyte.com/blog/system-integrators-extract-big-data/",
]

documents = zyte_reader.load_data(
    urls=urls,
)

print(len(documents[0].text))

Browser rendering and JavaScript can be enabled by passing the corresponding parameters during initialization.

python
zyte_dw_params = {
    "browserHtml": True,  # Enable browser rendering
    "javascript": True,  # Enable JavaScript
}

# Initiate ZyteWebReader with your Zyte API key and use default "article" mode
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    download_kwargs=zyte_dw_params,
)

# Load documents from URLs
documents = zyte_reader.load_data(
    urls=urls,
)
python
len(documents[0].text)

Set "continue_on_failure" to False if you'd like to stop when any request fails.

python
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    mode="html-text",
    download_kwargs=zyte_dw_params,
    continue_on_failure=False,
)

# Load documents from URLs
documents = zyte_reader.load_data(
    urls=urls,
)
python
len(documents[0].text)

In the default "article" mode only the article text is extracted, while in "html-text" mode the full text of the webpage is extracted, so the text is significantly longer.
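
If you want to compare the two modes directly, here is a rough sketch that reuses the reader settings and URLs from above:

python
# Compare the amount of text extracted by "article" vs. "html-text" mode
article_docs = ZyteWebReader(
    api_key="your ZYTE API key here", mode="article"
).load_data(urls=urls)
html_text_docs = ZyteWebReader(
    api_key="your ZYTE API key here", mode="html-text"
).load_data(urls=urls)

print(len(article_docs[0].text), len(html_text_docs[0].text))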

Using AgentQLWebReader 🐠

Use AgentQL to scrape structured data from a website.

python
from llama_index.readers.web import AgentQLWebReader
from llama_index.core import VectorStoreIndex
from IPython.display import Markdown, display
python
# Using AgentQL to crawl a website
agentql_reader = AgentQLWebReader(
    api_key="YOUR_API_KEY",  # Replace with your actual API key from https://dev.agentql.com
    params={
        "is_scroll_to_bottom_enabled": True
    },  # Optional additional parameters
)

# Load documents from a single page URL
document = agentql_reader.load_data(
    url="https://www.ycombinator.com/companies?batch=W25",
    query="{ company[] { name location description industry_category link(a link to the company's detail on Ycombinator)} }",
)
python
index = VectorStoreIndex.from_documents(document)
query_engine = index.as_query_engine()
response = query_engine.query(
    "Find companies that are working on web agent, list their names, locations and link"
)

display(Markdown(f"<b>{response}</b>"))

Using OxylabsWebReader

OxylabsWebReader lets a user scrape any website with different parameters while bypassing most anti-bot tools. Check the Oxylabs documentation for the full list of parameters.

Claim free API credentials by creating an Oxylabs account here.

python
from llama_index.readers.web import OxylabsWebReader


reader = OxylabsWebReader(
    username="OXYLABS_USERNAME", password="OXYLABS_PASSWORD"
)

documents = reader.load_data(
    [
        "https://sandbox.oxylabs.io/products/1",
        "https://sandbox.oxylabs.io/products/2",
    ]
)

print(documents[0].text)

Another example with parameters for selecting the geolocation, user agent type, JavaScript rendering, headers, and cookies.

python
documents = reader.load_data(
    [
        "https://sandbox.oxylabs.io/products/3",
    ],
    {
        "geo_location": "Berlin, Germany",
        "render": "html",
        "user_agent_type": "mobile",
        "context": [
            {"key": "force_headers", "value": True},
            {"key": "force_cookies", "value": True},
            {
                "key": "headers",
                "value": {
                    "Content-Type": "text/html",
                    "Custom-Header-Name": "custom header content",
                },
            },
            {
                "key": "cookies",
                "value": [
                    {"key": "NID", "value": "1234567890"},
                    {"key": "1P JAR", "value": "0987654321"},
                ],
            },
            {"key": "http_method", "value": "get"},
            {"key": "follow_redirects", "value": True},
            {"key": "successful_status_codes", "value": [808, 909]},
        ],
    },
)
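
As before, the returned documents can be inspected directly, for example:

python
# Print the first 500 characters of the rendered page
print(documents[0].text[:500])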

Using ZenRows Web Reader 🌐

ZenRows is a powerful web scraping API that provides advanced features for bypassing anti-bot measures and extracting data from modern websites.

Key Features:

  • JavaScript Rendering: Handle SPAs and dynamic content with headless browser rendering
  • Premium Proxies: Bypass anti-bot protection with 55M+ residential IPs from 190+ countries
  • Session Management: Maintain the same IP across multiple requests
  • Advanced Data Extraction: Use CSS selectors or automatic parsing to extract specific data
  • Multiple Output Formats: Get results in HTML, Markdown, Text, or PDF format
  • Geolocation Support: Use proxies from specific countries for geo-restricted content

Prerequisites: You need to have a ZenRows API key to use this reader. You can get one at zenrows.com.

python
# Basic web scraping with ZenRows
from llama_index.readers.web import ZenRowsWebReader

zenrows_reader = ZenRowsWebReader(
    api_key="YOUR_API_KEY",  # Get one at https://app.zenrows.com/register
    response_type="markdown",
)

# Scrape a single URL
documents = zenrows_reader.load_data(["https://httpbin.io/html"])
print(documents[0].text[:500])  # Print first 500 characters
python
# Advanced scraping with anti-bot bypass
zenrows_advanced = ZenRowsWebReader(
    api_key="YOUR_API_KEY",
    js_render=True,  # Enable JavaScript rendering
    premium_proxy=True,  # Use residential proxies
    proxy_country="us",  # Optional: specify country
)

documents = zenrows_advanced.load_data(
    ["https://www.scrapingcourse.com/antibot-challenge"]
)
print(f"Scraped {len(documents[0].text)} characters with advanced features")
python
# Integration with LlamaIndex - scraping multiple pages
zenrows_reader = ZenRowsWebReader(
    api_key="YOUR_API_KEY", js_render=True, response_type="markdown"
)

# Scrape multiple URLs
urls = ["https://example.com/", "https://httpbin.io/html"]

documents = zenrows_reader.load_data(urls)

# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What content was found on these pages?")

display(Markdown(f"<b>{response}</b>"))

For more advanced features like custom headers, CSS data extraction, screenshot capabilities, and detailed configuration options, visit the ZenRows documentation.

Using Olostep Web Reader 🧒

Olostep is a reliable and cost-effective web scraping API built for scale. It bypasses bot detection, delivers results in seconds, and can process millions of requests.

The API returns clean data from any website in various formats, including Markdown, HTML, and structured JSON.

Sign up here and get 1000 credits for free.

python
# Scraping content in Markdown

from llama_index.readers.web import OlostepWebReader
from llama_index.core import SummaryIndex

# Initialize the reader in scrape mode
reader = OlostepWebReader(api_key="YOUR_OLOSTEP_API_KEY", mode="scrape")

# Load data from a URL
documents = reader.load_data(url="https://www.olostep.com/")

# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize in 100 words")

print(response)
python
# Running Google Searches

from llama_index.readers.web import OlostepWebReader
from llama_index.core import SummaryIndex

# Initialize the reader in search mode
reader = OlostepWebReader(api_key="YOUR_OLOSTEP_API_KEY", mode="search")

# Load data using a search query
documents = reader.load_data(query="What are the latest advancements in AI?")

# You can also pass additional parameters, for example, to specify the country for the search
documents_with_params = reader.load_data(
    query="What are the latest advancements in AI?", params={"country": "US"}
)

# Create index and query
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("List me the headlines")

print(response)

Using Scrapy Web Reader πŸ•ΈοΈ

Scrapy is a popular web crawling framework for Python. The ScrapyWebReader allows you to leverage Scrapy's powerful crawling capabilities to extract data from websites. It can be used in two ways:

  1. By providing a Scrapy spider class.
  2. By providing the path to a Scrapy project.

1. Using with Scrapy Spider Class

python
from scrapy.spiders import Spider
from llama_index.readers.web import ScrapyWebReader


class SampleSpider(Spider):
    name = "sample_spider"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Extract and yield items here; scraped items are turned into documents by the reader
        ...


reader = ScrapyWebReader()
docs = reader.load_data(SampleSpider)

2. Using with Scrapy Project Path

Downloading a Sample Scrapy Project

python
!git clone https://github.com/scrapy/quotesbot.git

Using the Scrapy project with the spider named "toscrape-css":

python
from llama_index.readers.web import ScrapyWebReader

reader = ScrapyWebReader(project_path="./quotesbot")
docs = reader.load_data("toscrape-css")

Metadata

Some keys from the scraped items can be stored as metadata in the Document object. You can specify which keys to include as metadata using the metadata_keys parameter. If you want to keep the keys in both the content and as metadata, you can set the keep_keys parameter to True.

python
reader = ScrapyWebReader(
    project_path="./quotesbot",
    metadata_keys=["author", "tags"],
    keep_keys=True,
)
docs = reader.load_data("toscrape-css")