pgml-cms/blog/korvus-firecrawl-rag-in-a-single-query.md
Silas Marvin
August 8, 2024
We’re excited to share a quick guide on how to use Korvus’ single-query RAG along with Firecrawl to quickly stand up a retrieval augmented generation system with data from any website.
You’ll learn how to crawl a website with Firecrawl, process and index the results with Korvus, and answer questions over them with a single RAG query.
Firecrawl is a nifty web scraper that turns websites into clean, structured markdown data, perfect for creating a knowledge base for RAG applications.
Korvus is the Python, JavaScript, Rust or C SDK for PostgresML. It handles the heavy lifting of document processing, vector search, and response generation in a single query.
PostgresML is an in-database ML/AI engine built by the ML engineers at Instacart. It lets you train, test and deploy models right inside Postgres. With Korvus, you can get all the efficiencies of in-database machine learning without SQL or database management.
These three tools are all you’ll need to deploy a flexible and powerful RAG stack grounded in web data. Since your data is stored right where you're performing inference, you won’t need a vector database or an additional framework like LlamaIndex or LangChain to tie everything together. Mo’ microservices = more problems.
Let’s dive in!
To follow along you will need to set both the FIRECRAWL_API_KEY and KORVUS_DATABASE_URL env variables.
Sign up at firecrawl.dev to get your FIRECRAWL_API_KEY.
The easiest way to get your KORVUS_DATABASE_URL is by signing up at postgresml.org, but you can also host Postgres with the pgml and pgvector extensions yourself.
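Before running anything, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is our own illustration and not part of Korvus or Firecrawl.

```python
import os

# The two variables this guide relies on.
REQUIRED_VARS = ("FIRECRAWL_API_KEY", "KORVUS_DATABASE_URL")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example usage (hypothetical): fail early with a readable message.
# missing = missing_env_vars()
# if missing:
#     raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment in as a mapping keeps the check easy to exercise without touching the real process environment.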
First, let's break down the initial setup and imports:
from korvus import Collection, Pipeline
from firecrawl import FirecrawlApp
import os
import time
import asyncio
from rich import print
# Initialize the FirecrawlApp with your API key
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
Here we're importing korvus, firecrawl, and some other convenient libraries, and initializing the FirecrawlApp with an API key stored in an environment variable. This setup allows us to use Firecrawl for web scraping.
Next, we define our Pipeline and Collection:
pipeline = Pipeline(
    "v0",
    {
        "markdown": {
            "splitter": {"model": "markdown"},
            "semantic_search": {
                "model": "mixedbread-ai/mxbai-embed-large-v1",
            },
        },
    },
)
collection = Collection("fire-crawl-demo-v0")


# Add our Pipeline to our Collection
async def add_pipeline():
    await collection.add_pipeline(pipeline)
This Pipeline configuration tells Korvus how to process our documents. It specifies that we'll be working with markdown content, using a markdown-specific splitter, and the mixedbread-ai/mxbai-embed-large-v1 model for semantic search embeddings.
See the Korvus guide to constructing Pipelines for more information on Collections and Pipelines.
The crawl() function demonstrates how to use Firecrawl to scrape a website:
def crawl():
    crawl_url = "https://postgresml.org/blog"
    params = {
        "crawlerOptions": {
            "excludes": [],
            "includes": ["blog/*"],
            "limit": 250,
        },
        "pageOptions": {"onlyMainContent": True},
    }
    job = firecrawl.crawl_url(crawl_url, params=params, wait_until_done=False)
    # Poll every 5 seconds until the job is no longer active
    while True:
        print("Scraping...")
        status = firecrawl.check_crawl_status(job["jobId"])
        if status["status"] != "active":
            break
        time.sleep(5)
    return status
This function initiates a crawl of the PostgresML blog, focusing on blog posts and limiting the crawl to 250 pages. It then periodically checks the status of the crawl job until it's complete.
Alternatively, instead of sleeping, we could set the wait_until_done parameter to True, and the crawl_url method would block until the data is ready.
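If you keep the polling approach, you may want an upper bound on how long it runs. Here is a minimal sketch of a generic polling helper with a timeout. The name `poll_until_done` and its parameters are our own and not part of the Firecrawl SDK; it only assumes, as the loop above does, that the status dict has a "status" key that reads "active" while the crawl is running.

```python
import time


def poll_until_done(check_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call check_status() until it stops reporting "active" or the timeout elapses.

    check_status is any zero-argument callable returning a dict with a "status" key,
    for example: lambda: firecrawl.check_crawl_status(job["jobId"])
    """
    deadline = time.monotonic() + timeout
    while True:
        status = check_status()
        if status["status"] != "active":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still active after {timeout} seconds")
        sleep(interval)
```

Taking the status check as a callable keeps the helper independent of any particular client, so it can be tried out with a stub before pointing it at a real crawl job.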
After crawling the website, we need to process and index the data for efficient searching. This is done in the main() function:
async def main():
    # Add our Pipeline to our Collection
    await add_pipeline()

    # Crawl the website
    results = crawl()

    # Construct our documents to upsert
    documents = [
        {"id": data["metadata"]["sourceURL"], "markdown": data["markdown"]}
        for data in results["data"]
    ]

    # Upsert our documents
    await collection.upsert_documents(documents)
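The main() function does the following:
1. Adds the previously defined Pipeline to our Collection.
2. Crawls the website using the crawl() function.
3. Constructs a list of documents from the crawl results, using each page's source URL as the id and its markdown as the content.
4. Upserts the documents into the Collection, where Korvus processes them according to our Pipeline.
With our data indexed, we can now perform RAG: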
async def do_rag(user_query):
    results = await collection.rag(
        {
            "CONTEXT": {
                "vector_search": {
                    "query": {
                        "fields": {
                            "markdown": {
                                "query": user_query,
                                "parameters": {
                                    "prompt": "Represent this sentence for searching relevant passages: "
                                },
                            }
                        },
                    },
                    "document": {"keys": ["id"]},
                    "rerank": {
                        "model": "mixedbread-ai/mxbai-rerank-base-v1",
                        "query": user_query,
                        "num_documents_to_rerank": 100,
                    },
                    "limit": 5,
                },
                "aggregate": {"join": "\n\n\n"},
            },
            "chat": {
                "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a question and answering bot. Answer the users question given the context succinctly.",
                    },
                    {
                        "role": "user",
                        "content": f"Given the context\n<context>\n:{{CONTEXT}}\n</context>\nAnswer the question: {user_query}",
                    },
                ],
                "max_tokens": 256,
            },
        },
        pipeline,
    )
    return results
This function combines vector search, reranking, and text generation to provide context-aware answers to user queries. It uses the Meta-Llama-3.1-405B-Instruct model for text generation.
This query can be broken down into 4 steps:
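1. Perform vector search to find the best matches for user_query, keeping 100 candidates for reranking
2. Rerank the results of the vector search using the mixedbread-ai/mxbai-rerank-base-v1 cross-encoder and limit the results to 5
3. Join the reranked results with \n\n\n and substitute them in place of the {CONTEXT} placeholder in the messages (written as {{CONTEXT}} in the code because braces are escaped inside an f-string)
4. Perform text generation with meta-llama/Meta-Llama-3.1-405B-Instruct
This is a complex query and there are more options and parameters to be tuned. See the Korvus guide to RAG for more information on the rag method.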
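To make the join-and-substitute step concrete, here is a plain-Python sketch of the same idea. This is only an illustration of the behavior described above, not how Korvus implements it (Korvus does this inside the database); the function name `fill_context` and the sample chunks are hypothetical.

```python
def fill_context(messages, chunks, separator="\n\n\n", placeholder="{CONTEXT}"):
    """Join chunks with the separator and substitute them for the placeholder."""
    context = separator.join(chunks)
    # Build new message dicts so the original templates are left untouched.
    return [
        {**message, "content": message["content"].replace(placeholder, context)}
        for message in messages
    ]


user_query = "What is Korvus?"
messages = [
    {
        "role": "user",
        # {{CONTEXT}} is an escaped brace pair, so the resulting string contains {CONTEXT}.
        "content": f"Given the context\n<context>\n:{{CONTEXT}}\n</context>\nAnswer the question: {user_query}",
    }
]
filled = fill_context(messages, ["chunk one", "chunk two"])
print(filled[0]["content"])
```

The printed prompt shows the two sample chunks separated by blank lines inside the context tags, which is the shape of the message the chat model ends up seeing.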
To tie everything together, we use an interactive loop in our main() function:
async def main():
    # ... (previous code for setup and indexing)

    # Now we can search
    while True:
        user_query = input("\n\nquery > ")
        if user_query == "q":
            break
        results = await do_rag(user_query)
        print(results)


asyncio.run(main())
This loop allows users to input queries and receive RAG-powered responses based on the crawled and indexed content from the PostgresML blog.
We've demonstrated how to create a powerful RAG system using Firecrawl and Korvus – but it’s just a small example of the simplicity of doing RAG in-database, with fewer microservices.
It’s faster, cheaper and easier to manage than the common approach to RAG (Vector DB + frameworks + moving your data to the models). But don’t take our word for it. Try out Firecrawl and Korvus on PostgresML, and see the performance benefits yourself. And as always, let us know what you think.