docs/examples/query_engine/SQLJoinQueryEngine.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/query_engine/SQLJoinQueryEngine.ipynb" target="_parent"></a>
In this tutorial, we show you how to use our SQLJoinQueryEngine.
This query engine allows you to combine insights from your structured tables with insights from your unstructured data. It first decides whether to query your structured tables. If it does, it can then infer a corresponding query to the vector store in order to fetch the relevant documents.
NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, sandboxing, etc.
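One such precaution, for a file-backed SQLite database, is opening it in read-only mode via SQLAlchemy's URI support. A minimal sketch follows; the file path is illustrative, and this does not apply to the in-memory database used later in this notebook.
# Sketch: open a file-backed SQLite database read-only (hypothetical path)
from sqlalchemy import create_engine

read_only_engine = create_engine("sqlite:///file:city_stats.db?mode=ro&uri=true")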
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-readers-wikipedia
%pip install llama-index-llms-openai
!pip install llama-index
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
# This results in nested event-loops when we start an event-loop to make async queries.
# This is normally not allowed; we use nest_asyncio to allow it for convenience.
import nest_asyncio
nest_asyncio.apply()
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
This includes a ServiceContext object, containing abstractions such as the LLM and chunk size, as well as a StorageContext object, containing our vector store abstractions.
# # define pinecone index
# import pinecone
# import os
# api_key = os.environ['PINECONE_API_KEY']
# pinecone.init(api_key=api_key, environment="us-west1-gcp")
# # dimensions are for text-embedding-ada-002
# # pinecone.create_index("quickstart", dimension=1536, metric="euclidean", pod_type="p1")
# pinecone_index = pinecone.Index("quickstart")
# # OPTIONAL: delete all
# pinecone_index.delete(deleteAll=True)
Here we introduce a toy city_stats table to serve as our structured data source.
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
    column,
)
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
# print tables
metadata_obj.tables.keys()
We introduce some test data into the city_stats table.
from sqlalchemy import insert
rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)

with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
    print(cursor.fetchall())
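As an aside, the same sanity check can be written with the select() construct (already imported above) rather than a raw driver-level SQL string; a minimal sketch:
# Sketch: the equivalent check via the SQLAlchemy expression language
with engine.connect() as connection:
    print(connection.execute(select(city_stats_table)).fetchall())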
We first show how to convert a Document into a set of Nodes and insert them into a DocumentStore.
# install wikipedia python package
!pip install wikipedia
from llama_index.readers.wikipedia import WikipediaReader
cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = WikipediaReader().load_data(pages=cities)
from llama_index.core import SQLDatabase
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
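As a quick sanity check, you can confirm which tables the SQLDatabase wrapper exposes to the LLM; get_usable_table_names is the accessor in recent llama-index versions:
# Should print only the tables passed via include_tables, i.e. city_stats
print(sql_database.get_usable_table_names())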
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
# Insert documents into vector index
# Each document has metadata of the city attached
vector_indices = {}
vector_query_engines = {}
for city, wiki_doc in zip(cities, wiki_docs):
    vector_index = VectorStoreIndex.from_documents([wiki_doc])
    # modify default llm to be gpt-3.5 for quick/cheap queries
    query_engine = vector_index.as_query_engine(
        similarity_top_k=2, llm=OpenAI(model="gpt-3.5-turbo")
    )
    vector_indices[city] = vector_index
    vector_query_engines[city] = query_engine
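Before wiring these engines into tools, it can help to exercise one of them directly; a small usage sketch (the question is illustrative):
# Sketch: query a single city's vector index directly
response = vector_query_engines["Toronto"].query(
    "What neighborhoods is Toronto known for?"
)
print(str(response))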
from llama_index.core.query_engine import NLSQLTableQueryEngine
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)
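The text-to-SQL engine can likewise be tested on its own; a minimal sketch (the question is illustrative):
# Sketch: a question answerable purely from the city_stats table
response = sql_query_engine.query("Which city has the highest population?")
print(str(response))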
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
query_engine_tools = []
for city in cities:
    query_engine = vector_query_engines[city]
    query_engine_tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=city, description=f"Provides information about {city}"
        ),
    )
    query_engine_tools.append(query_engine_tool)

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools, llm=OpenAI(model="gpt-3.5-turbo")
)
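As with the other engines, the sub-question engine can be tried in isolation before joining it with SQL; a sketch with an illustrative question:
# Sketch: the engine decomposes this into per-city sub-questions
response = s_engine.query("Compare the arts and culture of Berlin and Toronto")
print(str(response))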
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
from llama_index.core.query_engine import RetrieverQueryEngine
# vector_store_info = VectorStoreInfo(
#     content_info="articles about different cities",
#     metadata_info=[
#         MetadataInfo(
#             name="title",
#             type="str",
#             description="The name of the city",
#         ),
#     ],
# )
# vector_auto_retriever = VectorIndexAutoRetriever(
#     vector_index, vector_store_info=vector_store_info, llm=OpenAI(model="gpt-4")
# )
# retriever_query_engine = RetrieverQueryEngine.from_args(
#     vector_auto_retriever, llm=OpenAI(model="gpt-3.5-turbo")
# )
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query"
        " over the city_stats table, which contains the population and"
        " country of each city"
    ),
)
s_engine_tool = QueryEngineTool.from_defaults(
    query_engine=s_engine,
    description=(
        "Useful for answering semantic questions about different cities"
    ),
)
from llama_index.core.query_engine import SQLJoinQueryEngine
query_engine = SQLJoinQueryEngine(
    sql_tool, s_engine_tool, llm=OpenAI(model="gpt-4")
)
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest"
    " population"
)
print(str(response))
response = query_engine.query(
    "Compare and contrast the demographics of Berlin and Toronto"
)
print(str(response))
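To see how the join engine arrived at its answer, you can inspect the response object; a hedged sketch (exact metadata keys vary across llama-index versions):
# Sketch: inspect metadata and retrieved sources on the response
print(response.metadata)
for node_with_score in response.source_nodes:
    print(node_with_score.node.get_content()[:200])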