Cleanlab Trustworthy Language Model

docs/examples/llm/cleanlab.ipynb


<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/cleanlab.ipynb" target="_parent">Open this notebook in Colab</a>

Cleanlab’s Trustworthy Language Model scores the trustworthiness of every LLM response in real-time, using state-of-the-art uncertainty estimates for LLMs. Trust scoring is crucial for applications where unchecked hallucinations and other LLM errors are a show-stopper.

This page demonstrates how to use TLM in place of your own LLM, to both generate responses and score their trustworthiness. That’s not the only way to use TLM though. To add trust scoring to your existing unmodified RAG application, you can instead see this Trustworthy RAG tutorial. Beyond RAG applications, you can score the trustworthiness of responses already generated from any LLM via TLM.get_trustworthiness_score().

Learn more in the Cleanlab documentation.

Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-cleanlab
python
%pip install llama-index
python
from llama_index.llms.cleanlab import CleanlabTLM
python
# Set the API key as an environment variable or pass it directly to the LLM.
# Get a free API key from: https://cleanlab.ai/
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"

llm = CleanlabTLM(api_key="your_api_key")
python
resp = llm.complete("Who is Paul Graham?")
python
print(resp)

You also get the trustworthiness score of the above response in additional_kwargs. TLM automatically computes this score for every <prompt, response> pair.

python
print(resp.additional_kwargs)

A high score indicates that the LLM's response can be trusted. Let's look at another example.

python
resp = llm.complete(
    "What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
python
print(resp)
python
print(resp.additional_kwargs)

A low score indicates that the LLM's response shouldn't be trusted.

From these two straightforward examples, we can observe that responses with high trustworthiness scores are direct, accurate, and appropriately detailed.

On the other hand, responses with low trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.

Streaming

Cleanlab’s TLM does not natively support streaming the response together with its trustworthiness score. However, an alternative approach is available for achieving low-latency, streaming responses in your application.

Detailed information about the approach, along with example code, is available here.

Advanced use of TLM

TLM can be configured with the following options:

  • model: underlying LLM to use
  • max_tokens: maximum number of tokens to generate in the response
  • num_candidate_responses: number of alternative candidate responses internally generated by TLM
  • num_consistency_samples: amount of internal sampling to evaluate LLM-response-consistency
  • use_self_reflection: whether the LLM is asked to self-reflect upon the response it generated and self-evaluate this response
  • log: specify additional metadata to return. include “explanation” here to get explanations of why a response is scored with low trustworthiness

These configurations are passed as a dictionary to the CleanlabTLM object during initialization.

More details about these options can be found in Cleanlab's API documentation, and a few use cases for them are explored in this notebook.
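For instance, a configuration that trades extra latency for more reliable trust scoring might raise the sampling-related options. The values below are illustrative assumptions only; consult Cleanlab's API documentation for supported models and valid ranges.

```python
# Illustrative values only; see Cleanlab's API docs for valid options.
high_quality_options = {
    "model": "gpt-4",               # underlying LLM to use
    "max_tokens": 256,              # cap on response length
    "num_candidate_responses": 4,   # more internal candidates to compare
    "num_consistency_samples": 8,   # more consistency sampling
    "use_self_reflection": True,    # ask the LLM to self-evaluate
}
# Passed as a dictionary at initialization:
# llm = CleanlabTLM(api_key="your_api_key", options=high_quality_options)
```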

Let's consider an example where the application requires the gpt-4 model with responses capped at 128 output tokens.

python
options = {
    "model": "gpt-4",
    "max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)
python
resp = llm.complete("Who is Paul Graham?")
python
print(resp)

To understand why TLM estimated low trustworthiness for the earlier horsepower-related question, specify the "explanation" flag when initializing the TLM.

python
options = {
    "log": ["explanation"],
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

resp = llm.complete(
    "What was the horsepower of the first automobile engine used in a commercial truck in the United States?"
)
python
print(resp)
python
print(resp.additional_kwargs["explanation"])