LangChain is a software development framework designed to simplify the creation of applications using large language models (LLMs). Chains in LangChain go beyond a single LLM call: they are sequences of calls (to an LLM or to another utility), automating the execution of a series of calls and actions. To make it easier to scale LangChain execution to large datasets, we have integrated LangChain with the distributed machine learning library SynapseML. This integration makes it easy to use the Apache Spark distributed computing framework to process millions of rows of data with the LangChain framework.
This tutorial shows how to apply LangChain at scale for paper summarization and organization. We start with a table of arXiv links and apply the LangChain transformer to automatically extract the corresponding paper title, authors, summary, and some related works.
The key prerequisites for this quickstart are a working Azure OpenAI resource and an Apache Spark cluster with SynapseML installed. We suggest creating a Synapse workspace, but Azure Databricks, HDInsight, Spark on Kubernetes, or even a Python environment with the pyspark package will also work.
The next step is to add this code to your Spark cluster. You can either create a notebook in your Spark platform and copy the code into it to run the demo, or download the notebook and import it into Synapse Analytics.
%pip install openai==0.28.1 langchain==0.0.331 pdf2image pdfminer.six unstructured==0.10.24 pytesseract numpy==1.22.4 nltk==3.8.1
import os, openai, langchain, uuid
from langchain.llms import AzureOpenAI, OpenAI
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.chains import TransformChain, LLMChain, SimpleSequentialChain
from langchain.document_loaders import OnlinePDFLoader
from langchain.tools.bing_search.tool import BingSearchRun, BingSearchAPIWrapper
from langchain.prompts import PromptTemplate
from synapse.ml.services.langchain import LangchainTransformer
from synapse.ml.core.platform import running_on_synapse, find_secret
Next, edit the cell in the notebook to point to your service. In particular, set the model_name, deployment_name, openai_api_base, and openai_api_key variables to match those of your OpenAI service. Feel free to replace find_secret with your key as follows:
openai_api_key = "99sj2w82o...."
bing_subscription_key = "..."
Note that you also need to set up a Bing Search resource to obtain your Bing Search subscription key.
openai_api_key = find_secret(
secret_name="openai-api-key-2", keyvault="mmlspark-build-keys"
)
openai_api_base = "https://synapseml-openai-2.openai.azure.com/"
openai_api_version = "2022-12-01"
openai_api_type = "azure"
deployment_name = "gpt-35-turbo"
bing_search_url = "https://api.bing.microsoft.com/v7.0/search"
bing_subscription_key = find_secret(
secret_name="bing-search-key", keyvault="mmlspark-build-keys"
)
os.environ["BING_SUBSCRIPTION_KEY"] = bing_subscription_key
os.environ["BING_SEARCH_URL"] = bing_search_url
os.environ["OPENAI_API_TYPE"] = openai_api_type
os.environ["OPENAI_API_VERSION"] = openai_api_version
os.environ["OPENAI_API_BASE"] = openai_api_base
os.environ["OPENAI_API_KEY"] = openai_api_key
llm = AzureOpenAI(
deployment_name=deployment_name,
model_name=deployment_name,
temperature=0.1,
verbose=True,
)
We start by demonstrating basic usage with a simple chain that creates definitions for input words:
copy_prompt = PromptTemplate(
input_variables=["technology"],
template="Define the following word: {technology}",
)
chain = LLMChain(llm=llm, prompt=copy_prompt)
transformer = (
LangchainTransformer()
.setInputCol("technology")
.setOutputCol("definition")
.setChain(chain)
.setSubscriptionKey(openai_api_key)
.setUrl(openai_api_base)
)
# construct a test DataFrame
df = spark.createDataFrame(
[(0, "docker"), (1, "spark"), (2, "python")], ["label", "technology"]
)
display(transformer.transform(df))
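Before scaling up, it can help to see what the transformer does for each row. The sketch below is a plain-Python stand-in that runs offline, with no Spark and no model call: `fake_llm` is a hypothetical placeholder for the Azure OpenAI completion, and the template filling mirrors what the PromptTemplate does per input value.

```python
# Minimal sketch of the per-row work: fill the prompt template, then call
# the model. fake_llm is a hypothetical stand-in for the real Azure OpenAI
# call, used here only so the sketch runs offline.
template = "Define the following word: {technology}"

def fake_llm(prompt: str) -> str:
    # Echo a canned "definition" instead of calling a model
    word = prompt.split(": ")[-1]
    return f"A definition of {word} would appear here."

rows = ["docker", "spark", "python"]
definitions = [fake_llm(template.format(technology=row)) for row in rows]
for row, definition in zip(rows, definitions):
    print(row, "->", definition)
```

The real transformer does the same template-fill-then-call work for every row of the DataFrame, distributed across the Spark cluster.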
LangChain Transformers can be saved and loaded. Note that LangChain serialization only works for chains that don't have memory.
temp_dir = "tmp"
if not os.path.exists(temp_dir):
os.mkdir(temp_dir)
path = os.path.join(temp_dir, "langchainTransformer")
transformer.save(path)
loaded = LangchainTransformer.load(path)
display(loaded.transform(df))
We now construct a sequential chain for extracting structured information from an arXiv link. In particular, we ask LangChain to extract the title, author information, and a summary of the paper content. After that, we use a web search tool to find the recent papers written by the first author.
To summarize, our sequential chain contains the following steps:

1. **Transform Chain**: extract the paper content from the arXiv link
2. **LLMChain**: summarize the paper and extract the title and authors
3. **Transform Chain**: convert that output into a prompt for the web search agent
4. **Agent**: use Bing search to find the first author's recent papers
def paper_content_extraction(inputs: dict) -> dict:
    arxiv_link = inputs["arxiv_link"]
    loader = OnlinePDFLoader(arxiv_link)
    pages = loader.load_and_split()
    # keep only the first two pages, which contain the title, authors, and abstract
    return {"paper_content": pages[0].page_content + pages[1].page_content}
def prompt_generation(inputs: dict) -> dict:
output = inputs["Output"]
prompt = (
        "find the paper title, author, summary in the paper description below, output them. After that, use web search to find 3 recent papers of the first author in the author section below (the first author is the first name, separated by comma) and list the paper titles in bullet points: <Paper Description Start>\n"
+ output
+ "<Paper Description End>."
)
return {"prompt": prompt}
paper_content_extraction_chain = TransformChain(
input_variables=["arxiv_link"],
output_variables=["paper_content"],
transform=paper_content_extraction,
verbose=False,
)
paper_summarizer_template = """You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, and extract authors and paper title from the paper content.
Here is the paper content:
{paper_content}
Output:
paper title, authors and summary.
"""
prompt = PromptTemplate(
input_variables=["paper_content"], template=paper_summarizer_template
)
summarize_chain = LLMChain(llm=llm, prompt=prompt, verbose=False)
prompt_generation_chain = TransformChain(
input_variables=["Output"],
output_variables=["prompt"],
transform=prompt_generation,
verbose=False,
)
bing = BingSearchAPIWrapper(k=3)
tools = [BingSearchRun(api_wrapper=bing)]
web_search_agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)
sequential_chain = SimpleSequentialChain(
chains=[
paper_content_extraction_chain,
summarize_chain,
prompt_generation_chain,
web_search_agent,
]
)
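The four-step composition above can be sketched without Spark, PDFs, or an LLM. Everything below (`run_sequence` and the stand-in step functions) is illustrative and not part of the LangChain API; it only shows how each step's output dictionary feeds the next step's input, which is what SimpleSequentialChain automates.

```python
# Hedged sketch of dict-in/dict-out step composition, mirroring what
# SimpleSequentialChain does. The stand-in functions are hypothetical
# placeholders for the real chains above.
def extract(inputs: dict) -> dict:
    # Stand-in for paper_content_extraction (no PDF download here)
    return {"paper_content": "Title: Prompt Tuning. Authors: Liu, Zhang."}

def summarize(inputs: dict) -> dict:
    # Stand-in for the LLM summarization chain
    return {"Output": "Summary of: " + inputs["paper_content"]}

def make_prompt(inputs: dict) -> dict:
    # Mirrors prompt_generation above
    return {"prompt": "find the paper title, author, summary in: " + inputs["Output"]}

def run_sequence(steps, data: dict) -> dict:
    # Each step consumes the previous step's output dictionary
    for step in steps:
        data = step(data)
    return data

result = run_sequence(
    [extract, summarize, make_prompt],
    {"arxiv_link": "https://example.org/paper.pdf"},
)
print(result["prompt"])
```

In the real chain, the final prompt is then handed to the web search agent, which adds the recent-papers lookup on top of the summary.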
We can now use our chain at scale with the LangchainTransformer:
paper_df = spark.createDataFrame(
[
(0, "https://arxiv.org/pdf/2107.13586.pdf"),
(1, "https://arxiv.org/pdf/2101.00190.pdf"),
(2, "https://arxiv.org/pdf/2103.10385.pdf"),
(3, "https://arxiv.org/pdf/2110.07602.pdf"),
],
["label", "arxiv_link"],
)
# construct a LangChain transformer using the sequential chain defined above
paper_info_extractor = (
LangchainTransformer()
.setInputCol("arxiv_link")
.setOutputCol("paper_info")
.setChain(sequential_chain)
.setSubscriptionKey(openai_api_key)
.setUrl(openai_api_base)
)
# extract paper information from the arXiv links; the output includes the
# paper title, authors, a brief summary, and recent papers published by the first author
display(paper_info_extractor.transform(paper_df))