docs/examples/finetuning/embeddings/finetune_embedding.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"></a>
In this notebook, we show users how to finetune their own embedding models.
We go through three main sections:
generate_qa_embedding_pairs function makes this easy)SentenceTransformersFinetuneEngine)First, we create the corpus of text chunks by leveraging LlamaIndex to load some financial PDFs, and parsing/chunking into plain text chunks.
%pip install datasets
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
%pip install llama-index-readers-file
%pip install llama-index-embeddings-huggingface
%pip install "transformers[torch]"
import json
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode
Download Data
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
VAL_FILES = ["./data/10k/uber_2021.pdf"]
TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"
def load_corpus(files, verbose=False):
if verbose:
print(f"Loading files {files}")
reader = SimpleDirectoryReader(input_files=files)
docs = reader.load_data()
if verbose:
print(f"Loaded {len(docs)} docs")
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)
if verbose:
print(f"Parsed {len(nodes)} nodes")
return nodes
We do a very naive train/val split by having the Lyft corpus as the train dataset, and the Uber corpus as the val dataset.
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)
Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.
Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
import os
OPENAI_API_KEY = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from llama_index.llms.openai import OpenAI
train_dataset = generate_qa_embedding_pairs(
llm=OpenAI(model="gpt-3.5-turbo"),
nodes=train_nodes,
output_path="train_dataset.json",
)
val_dataset = generate_qa_embedding_pairs(
llm=OpenAI(model="gpt-3.5-turbo"),
nodes=val_nodes,
output_path="val_dataset.json",
)
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")
from llama_index.finetuning import SentenceTransformersFinetuneEngine
finetune_engine = SentenceTransformersFinetuneEngine(
train_dataset,
model_id="BAAI/bge-small-en",
model_output_path="test_model",
val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
embed_model
In this section, we evaluate 3 different embedding models:
BAAI/bge-small-en, andWe consider 2 evaluation approaches:
InformationRetrievalEvaluator from sentence_transformersWe show that finetuning on synthetic (LLM-generated) dataset significantly improve upon an opensource embedding model.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd
Option 1: We use a simple hit rate metric for evaluation:
This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
def evaluate(
dataset,
embed_model,
top_k=5,
verbose=False,
):
corpus = dataset.corpus
queries = dataset.queries
relevant_docs = dataset.relevant_docs
nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
index = VectorStoreIndex(
nodes, embed_model=embed_model, show_progress=True
)
retriever = index.as_retriever(similarity_top_k=top_k)
eval_results = []
for query_id, query in tqdm(queries.items()):
retrieved_nodes = retriever.retrieve(query)
retrieved_ids = [node.node.node_id for node in retrieved_nodes]
expected_id = relevant_docs[query_id][0]
is_hit = expected_id in retrieved_ids # assume 1 relevant doc
eval_result = {
"is_hit": is_hit,
"retrieved": retrieved_ids,
"expected": expected_id,
"query": query_id,
}
eval_results.append(eval_result)
return eval_results
Option 2: We use the InformationRetrievalEvaluator from sentence_transformers.
This provides a more comprehensive suite of metrics, but we can only run it against the sentencetransformers compatible models (open source and our finetuned model, not the OpenAI embedding model).
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path
def evaluate_st(
dataset,
model_id,
name,
):
corpus = dataset.corpus
queries = dataset.queries
relevant_docs = dataset.relevant_docs
evaluator = InformationRetrievalEvaluator(
queries, corpus, relevant_docs, name=name
)
model = SentenceTransformer(model_id)
output_path = "results/"
Path(output_path).mkdir(exist_ok=True, parents=True)
return evaluator(model, output_path=output_path)
Note: this might take a few minutes to run since we have to embed the corpus and queries
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)
df_ada = pd.DataFrame(ada_val_results)
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)
df_bge = pd.DataFrame(bge_val_results)
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)
df_finetuned = pd.DataFrame(val_results_finetuned)
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned
evaluate_st(val_dataset, "test_model", name="finetuned")
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"
We can see that fine-tuning our small open-source embedding model drastically improve its retrieval quality (even approaching the quality of the proprietary OpenAI embedding)!
df_all = pd.concat([df_ada, df_bge, df_finetuned])
df_all.groupby("model").mean("is_hit")
df_st_bge = pd.read_csv(
"results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
"results/Information-Retrieval_evaluation_finetuned_results.csv"
)
We can see that embedding finetuning improves metrics consistently across the suite of eval metrics
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all