docs/examples/response_synthesizers/long_context_test.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/agent/openai_retrieval_benchmark.ipynb" target="_parent"></a>
In this section we stress-test long context recall capabilities of GPT-4 and Claude v2. This is inspired by Greg Kamradt's tweet.
Similarly, we analyze the "needle in a haystack" recall capabilities of long-context LLms. We show an incremental extension by 1) adding Claude, and 2) testing recall where context exceeds context window, triggering response synthesis strategies.
We use a fixed document - the 2021 Uber 10-K, which contains ~290k tokens.
%pip install llama-index-llms-openai
%pip install llama-index-llms-anthropic
import nest_asyncio
nest_asyncio.apply()
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.core.evaluation import CorrectnessEvaluator
We load the Uber 10-k
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
## load data
uber_docs0 = SimpleDirectoryReader(
input_files=["./data/10k/uber_2021.pdf"]
).load_data()
uber_doc = Document(text="\n\n".join([d.get_content() for d in uber_docs0]))
We print the number of tokens below. Note that this overflows the context window of existing LLMs, requiring response synthesis strategies.
# count the number of tokens
from llama_index.core.utils import globals_helper
num_tokens = len(globals_helper.tokenizer(uber_doc.get_content()))
print(f"NUM TOKENS: {num_tokens}")
Here we insert a single sentence of context that we're going to "hide" within the overall document at different positions.
context_str = "Jerry's favorite snack is Hot Cheetos."
query_str = "What is Jerry's favorite snack?"
def augment_doc(doc_str, context, position):
"""Augment doc with additional context at a given position."""
doc_str1 = doc_str[:position]
doc_str2 = doc_str[position:]
return f"{doc_str1}...\n\n{context}\n\n...{doc_str2}"
test_str = augment_doc(
uber_doc.get_content(), context_str, int(0.5 * len(uber_doc.get_content()))
)
The experiment loop is the following:
SummaryIndex, get the corresponding query engine.CorrectnessEvaluatorasync def run_experiments(
doc, position_percentiles, context_str, query, llm, response_mode="compact"
):
eval_llm = OpenAI(model="gpt-4-1106-preview")
correctness_evaluator = CorrectnessEvaluator(llm=eval_llm)
eval_scores = {}
for idx, position_percentile in enumerate(position_percentiles):
print(f"Position percentile: {position_percentile}")
position_idx = int(position_percentile * len(uber_doc.get_content()))
new_doc_str = augment_doc(
uber_doc.get_content(), context_str, position_idx
)
new_doc = Document(text=new_doc_str)
index = SummaryIndex.from_documents(
[new_doc],
)
query_engine = index.as_query_engine(
response_mode=response_mode, llm=llm
)
print(f"Query: {query}")
# uncomment for async
# response = await query_engine.aquery(query)
response = query_engine.query(query)
print(f"Response: {str(response)}")
eval_result = correctness_evaluator.evaluate(
query=query, response=str(response), reference=context_str
)
eval_score = eval_result.score
print(f"Eval score: {eval_score}")
eval_scores[position_percentile] = eval_score
return eval_scores
position_percentiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
llm = OpenAI(model="gpt-4-1106-preview")
eval_scores_gpt4 = await run_experiments(
[uber_doc],
position_percentiles,
context_str,
query_str,
llm,
response_mode="compact",
)
llm = OpenAI(model="gpt-4-1106-preview")
eval_scores_gpt4_ts = await run_experiments(
[uber_doc],
position_percentiles,
context_str,
query_str,
llm,
response_mode="tree_summarize",
)
llm = Anthropic(model="claude-2")
eval_scores_anthropic = await run_experiments(
[uber_doc], position_percentiles, context_str, query_str, llm
)
# NOTE: incomplete, running into timeout errors
llm = Anthropic(model="claude-2")
eval_scores_anthropic = await run_experiments(
[uber_doc],
position_percentiles,
context_str,
query_str,
llm,
response_mode="tree_summarize",
)