docs/examples/llm/cleanlab.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/cleanlab.ipynb" target="_parent"></a>
Cleanlab’s Trustworthy Language Model (TLM) scores the trustworthiness of every LLM response in real time, using state-of-the-art uncertainty estimates for LLMs. Trust scoring is crucial for applications where unchecked hallucinations and other LLM errors are show-stoppers.
This page demonstrates how to use TLM in place of your own LLM, both to generate responses and to score their trustworthiness. That’s not the only way to use TLM, though.
To add trust scoring to your existing, unmodified RAG application, see this Trustworthy RAG tutorial instead.
Beyond RAG applications, you can score the trustworthiness of responses already generated by any LLM via TLM.get_trustworthiness_score().
Learn more in the Cleanlab documentation.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-cleanlab
%pip install llama-index
from llama_index.llms.cleanlab import CleanlabTLM
# Set your API key in the environment or pass it to the LLM directly.
# Get a free API key from: https://cleanlab.ai/
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"
llm = CleanlabTLM(api_key="your_api_key")
resp = llm.complete("Who is Paul Graham?")
print(resp)
You also get the trustworthiness score of the above response in additional_kwargs. TLM automatically computes this score for every <prompt, response> pair.
print(resp.additional_kwargs)
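Once the score is available in additional_kwargs, applications typically compare it against a threshold before trusting the answer. The sketch below assumes the score is stored under a "trustworthiness_score" key (as in this integration's output); the helper name and the 0.8 threshold are illustrative choices, not part of the Cleanlab API.

```python
def is_trustworthy(additional_kwargs: dict, threshold: float = 0.8) -> bool:
    """Return True if the response's trustworthiness score clears `threshold`."""
    score = additional_kwargs.get("trustworthiness_score")
    if score is None:
        raise KeyError("no trustworthiness score attached to this response")
    return score >= threshold


# Example with mocked response payloads (no API call needed):
print(is_trustworthy({"trustworthiness_score": 0.93}))  # True  (score >= 0.8)
print(is_trustworthy({"trustworthiness_score": 0.42}))  # False (score < 0.8)
```

In a real application you would call `is_trustworthy(resp.additional_kwargs)` on the response object returned by `llm.complete(...)`.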
A high score indicates that the LLM's response can be trusted. Let's look at another example.
resp = llm.complete(
"What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
print(resp)
print(resp.additional_kwargs)
A low score indicates that the LLM's response shouldn't be trusted.
From these two straightforward examples, we can observe that LLM responses with high scores are direct, accurate, and appropriately detailed.
On the other hand, responses with low trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.
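This observation suggests a simple guardrail: only surface the LLM's answer when its trust score clears a threshold, and fall back to a safe message otherwise. Below is a minimal sketch of that pattern; the 0.7 threshold and the fallback wording are illustrative assumptions, not Cleanlab recommendations.

```python
FALLBACK = "I'm not confident in my answer; please verify with another source."


def gated_answer(response_text: str, score: float, threshold: float = 0.7) -> str:
    """Return the LLM's answer only when its trustworthiness score clears the threshold."""
    if score >= threshold:
        return response_text
    return FALLBACK


# High-score response passes through unchanged:
print(gated_answer("Paul Graham is a programmer, essayist, and co-founder of Y Combinator.", 0.95))
# Low-score response is replaced by the fallback:
print(gated_answer("The first commercial truck engine had 6 horsepower.", 0.3))
```

The same pattern extends naturally to other fallbacks, such as escalating to a human reviewer or retrieving additional context and retrying.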
Cleanlab’s TLM does not natively support streaming the response together with its trustworthiness score. However, an alternative approach is available to achieve low-latency, streaming responses in your application.
Detailed information about the approach, along with example code, is available here.
TLM supports several configuration options, which are passed as a dictionary to the CleanlabTLM object during initialization.
More details about these options can be found in Cleanlab's API documentation, and a few use cases for them are explored in this notebook.
Let's consider an example where the application requires the gpt-4 model and at most 128 output tokens.
options = {
"model": "gpt-4",
"max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
resp = llm.complete("Who is Paul Graham?")
print(resp)
To understand why TLM estimated low trustworthiness for the earlier horsepower-related question, specify the "explanation" flag in the options when initializing the TLM.
options = {
"log": ["explanation"],
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
resp = llm.complete(
"What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
print(resp)
print(resp.additional_kwargs["explanation"])
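Score and explanation together are useful for triage: low-scoring responses can be flagged for human review along with TLM's reason. The helper below is a hypothetical sketch (function and key names are illustrative; it assumes the score lives under "trustworthiness_score" and the explanation under "explanation" in additional_kwargs, as shown above).

```python
def triage(text: str, kwargs: dict, threshold: float = 0.7) -> dict:
    """Package a response, its trust score, and (when low-scoring) TLM's explanation."""
    score = kwargs.get("trustworthiness_score")
    record = {"text": text, "score": score, "needs_review": False, "reason": None}
    if score is not None and score < threshold:
        record["needs_review"] = True
        # Fall back to a generic reason if no explanation was logged.
        record["reason"] = kwargs.get("explanation", "score below threshold")
    return record


# Mocked low-trust response (no API call needed):
low = triage(
    "The first commercial truck engine had 6 horsepower.",
    {"trustworthiness_score": 0.2, "explanation": "Sources disagree on this figure."},
)
print(low["needs_review"], "-", low["reason"])
```

In practice you would call `triage(str(resp), resp.additional_kwargs)` and route flagged records to whatever review queue or logging your application uses.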