Trustworthy RAG with the Trustworthy Language Model

This tutorial demonstrates how to use Cleanlab's Trustworthy Language Model (TLM) in any RAG system to score the trustworthiness of answers and automatically catch incorrect or hallucinated responses in real time.

Today's RAG and Agent applications often produce unreliable responses because they depend on LLMs, which are fundamentally unreliable. Cleanlab's Trustworthy Language Model scores the trustworthiness of every LLM response in real time, using state-of-the-art uncertainty estimates for LLMs. Cleanlab works effectively regardless of your RAG architecture or your retrieval and indexing processes.

To diagnose when RAG answers cannot be trusted, this tutorial shows how to replace your LLM with Cleanlab's TLM, which both generates responses and scores their trustworthiness. Alternatively, you can use Cleanlab only to score responses from your unmodified RAG system and to run other real-time Evals; see our Evaluation tutorial.

Setup

RAG is all about connecting LLMs to data to better inform their answers. This tutorial uses NVIDIA's Q1 FY2024 earnings report as an example dataset. Use the following commands to download the earnings report and store it in a directory named data/.

python

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/

Now let's install the required dependencies.

python

%pip install llama-index-llms-cleanlab llama-index llama-index-embeddings-huggingface

We then initialize Cleanlab's TLM as a CleanlabTLM object with default settings.

python

from llama_index.llms.cleanlab import CleanlabTLM

# set api key in env or in llm
# get free API key from: https://cleanlab.ai/
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"

llm = CleanlabTLM(api_key="your_api_key")

Note: If you encounter a ValidationError during the above import, please upgrade your Python version to >= 3.11.

You can achieve better results by adjusting the TLM configurations outlined in this advanced TLM tutorial.

For example, if your application requires OpenAI's GPT-4 model and you want to limit output to 256 tokens, you can configure both using the options argument:

python

options = {
    "model": "gpt-4",
    "max_tokens": 256,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

Let's start by asking the LLM a simple question.

python

response = llm.complete("What is NVIDIA's ticker symbol?")
print(response)

TLM not only provides a response but also includes a trustworthiness score that indicates how confident TLM is that the response is accurate. You can access this score from the response itself.

python

response.additional_kwargs
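As a minimal sketch of how this score might be read defensively in downstream code, the helper below pulls it out of a dictionary shaped like response.additional_kwargs. The helper name read_trust_score and the example score are illustrative assumptions, not part of the Cleanlab or LlamaIndex APIs; only the "trustworthiness_score" key comes from this tutorial.

```python
from typing import Any, Mapping, Optional


def read_trust_score(additional_kwargs: Mapping[str, Any]) -> Optional[float]:
    """Return the trustworthiness score as a float, or None if it is missing.

    Hypothetical helper: it only assumes the score is stored under the
    "trustworthiness_score" key, as shown in this tutorial.
    """
    raw = additional_kwargs.get("trustworthiness_score")
    if raw is None:
        return None
    return float(raw)


# Stand-in for response.additional_kwargs; the value is made up for illustration.
example_kwargs = {"trustworthiness_score": 0.93}

score = read_trust_score(example_kwargs)
print("no score available" if score is None else f"trustworthiness score: {score:.2f}")
```

Returning None rather than raising keeps the calling code working if a response from another LLM carries no score.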

Build a RAG pipeline with TLM

Now let's integrate TLM into a RAG pipeline.

python

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Settings.llm = llm

Specify Embedding Model

RAG uses an embedding model to match queries against document chunks to retrieve the most relevant data. Here we opt for a no-cost, local embedding model from Hugging Face. You can use any other embedding model by referring to this LlamaIndex guide.
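To make the matching step concrete, here is a toy sketch of similarity-based retrieval using cosine similarity over hand-written vectors. The three-dimensional vectors, the chunk names, and the top_k helper are all made up for illustration; a real embedding model produces much longer vectors, and this is not a description of LlamaIndex's internals.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings" for three document chunks and one query.
toy_chunks = {
    "revenue": [0.9, 0.1, 0.0],
    "gaming": [0.1, 0.9, 0.1],
    "dividends": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(top_k(toy_query, toy_chunks, k=2))
```

The chunks returned by a step like this become the context that is passed to the LLM along with the user's query.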

python

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

Load Data and Create Index + Query Engine

Let's create an index from the documents stored in the data directory. The system can index multiple files within the same folder, although for this tutorial, we'll use just one document. We stick with the default index from LlamaIndex for this tutorial.

python

documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file
    doc.excluded_llm_metadata_keys.append("file_path")
index = VectorStoreIndex.from_documents(documents)

The generated index is used to power a query engine over the data.

python

query_engine = index.as_query_engine()

Note that TLM is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.

In addition, you can just use TLM's trustworthiness score in an existing custom-built RAG pipeline (using any other LLM generator, streaming or not).

To achieve this, you'd need to capture the full prompt sent to the LLM (including system instructions, retrieved context, user query, etc.) and the response it returned. TLM requires both to predict trustworthiness.

Details about this approach and example code are available here.
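As a rough sketch of the first half of that approach, the function below assembles one prompt string from the three pieces mentioned above. The function name build_scoring_prompt and the template layout are hypothetical; in practice the prompt you score should be exactly the one your own pipeline sent to its LLM, and the scoring call itself is not shown here.

```python
from typing import List


def build_scoring_prompt(
    system_instructions: str,
    context_chunks: List[str],
    user_query: str,
) -> str:
    """Join system instructions, retrieved context, and the query into one prompt.

    Hypothetical template for illustration only: mirror whatever format your
    existing RAG pipeline actually uses.
    """
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        f"{system_instructions.strip()}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query.strip()}"
    )


# Placeholder inputs; the context line is quoted from the excerpts later in this tutorial.
prompt = build_scoring_prompt(
    system_instructions="Answer using only the provided context.",
    context_chunks=[
        "GAAP earnings per diluted share for the quarter were $0.82, ...",
    ],
    user_query="What was the GAAP earnings per diluted share for the quarter?",
)
print(prompt)
```

The resulting prompt, paired with the response your existing LLM produced, is what you would then pass to TLM for scoring.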

Extract Trustworthiness Score from LLM response

As we saw earlier, Cleanlab's TLM also provides the trustworthiness_score in addition to the text, in its response to the prompt.

To get this score out when TLM is used in a RAG pipeline, LlamaIndex provides an instrumentation tool that lets us observe the events running behind the scenes in RAG.

We can use this tooling to extract trustworthiness_score from the LLM's response.

Let's define a simple event handler that stores this score for every request sent to the LLM. You can refer to LlamaIndex's documentation for more details on instrumentation.

python

from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent


class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        """Class name."""
        return "GetTrustworthinessScore"

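    def handle(self, event: BaseEvent, **kwargs) -> None:
        # Only completion-end events carry the LLM response and its metadata.
        if isinstance(event, LLMCompletionEndEvent):
            # TLM attaches the score to the response's additional_kwargs.
            self.trustworthiness_score = event.response.additional_kwargs[
                "trustworthiness_score"
            ]
            self.events.append(event)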


# Root dispatcher
root_dispatcher = get_dispatcher()

# Register event handler
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)

For each query, we can fetch this score from event_handler.trustworthiness_score, which holds the score of the most recent LLM completion. Let's see it in action.

Answering queries with our RAG system

Let's try out our RAG pipeline based on TLM. Here we pose questions with differing levels of complexity.

python

# Optional: define a `display_response` helper function.
def display_response(response):
    """Print the response text and its trustworthiness score from our TLM-based RAG pipeline."""
    response_str = response.response
    # The event handler holds the score of the most recent LLM completion.
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

Easy Questions

We first pose straightforward questions whose answers appear directly in the provided data, within a few lines of text.

python

response = query_engine.query(
    "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)

python

response = query_engine.query(
    "What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)

python

response = query_engine.query(
    "What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)

TLM returns high trustworthiness scores for these responses, indicating high confidence they are accurate. After doing a quick fact-check (reviewing the original earnings report), we can confirm that TLM indeed accurately answered these questions. In case you're curious, here are relevant excerpts from the data context for these questions:

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, ...

GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.

Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, ...

Questions without Available Context

Now let's see how TLM responds to queries that cannot be answered using the provided data.

python

response = query_engine.query(
    "What factors as per the report were responsible for the decline in NVIDIA's proviz revenue?"
)
display_response(response)

The lower TLM trustworthiness score indicates a bit more uncertainty about the response, which aligns with the lack of information available. Let's try some more questions.

python

response = query_engine.query(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)

python

response = query_engine.query(
    "How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)

We observe that TLM demonstrates the ability to recognize the limitations of the available information. It refrains from generating speculative responses or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior showcases an understanding of the boundaries of the context and prioritizes accuracy over conjecture.

Challenging Questions

Let's see how our RAG system responds to harder questions, some of which may be misleading.

python

response = query_engine.query(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in terms of $?"
)
display_response(response)

python

response = query_engine.query(
    "This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)

python

response = query_engine.query(
    "How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced in NVIDIA's Q1 FY2024 financial results?",
)
display_response(response)

python

response = query_engine.query(
    "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?",
)
display_response(response)

TLM automatically alerts us that these answers are unreliable by assigning them low trustworthiness scores. With TLM in your RAG system, a low score is your cue to treat the answer with caution. Here are the correct answers to the questions above:

NVIDIA's revenue increased by $1.14 billion this quarter compared to last quarter.

Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies.

There is not a specific total count of RTX GPUs mentioned.

Projected annual revenue if this growth rate is maintained for the next four quarters: approximately $26.34 billion.
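The two numeric answers above can be checked with a few lines of arithmetic. In this sketch only the $7.19 billion revenue figure appears in the excerpts quoted earlier. The previous-quarter revenue ($6.05 billion), the Data Center revenue ($4.28 billion), and the 18% quarter-over-quarter growth rate are assumed from the earnings report and should be verified against the file in data/.

```python
# All amounts are in billions of US dollars.
q1_revenue = 7.19             # quoted in the report excerpt above
prev_quarter_revenue = 6.05   # assumption: verify against the report
dc_revenue = 4.28             # assumption: Data Center revenue for Q1 FY2024
dc_qoq_growth = 0.18          # assumption: Data Center quarter-over-quarter growth

# Question 1: change in total revenue versus the previous quarter.
revenue_change = q1_revenue - prev_quarter_revenue

# Question 4: compound the growth rate over the next four quarters and sum them.
projected_quarters = [dc_revenue * (1 + dc_qoq_growth) ** n for n in range(1, 5)]
projected_annual = sum(projected_quarters)

print(f"Revenue change vs. last quarter: {revenue_change:+.2f}")
print(f"Projected Data Center revenue over the next four quarters: {projected_annual:.2f}")
```

Under these assumptions the change is an increase of about 1.14 and the four-quarter projection is about 26.34, matching the answers listed above.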

With TLM, you can easily increase trust in any RAG system!
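One way to act on the scores, sketched below, is to flag any answer whose score falls under a cutoff before showing it to the user. The guard_answer function, the GuardedAnswer class, the warning text, and the 0.7 threshold are all hypothetical choices for illustration; an appropriate threshold depends on your application and should be tuned on your own data.

```python
from dataclasses import dataclass

# Hypothetical warning text appended to low-scoring answers.
FALLBACK_MESSAGE = "This answer may be unreliable; please verify it against the source."


@dataclass
class GuardedAnswer:
    text: str      # text to show the user, possibly with a warning appended
    score: float   # trustworthiness score the decision was based on
    flagged: bool  # True when the score fell below the threshold


def guard_answer(text: str, score: float, threshold: float = 0.7) -> GuardedAnswer:
    """Append a warning to answers whose score is below the threshold."""
    flagged = score < threshold
    shown = f"{text}\n\n[Low trust] {FALLBACK_MESSAGE}" if flagged else text
    return GuardedAnswer(text=shown, score=score, flagged=flagged)


# Made-up answers and scores for illustration.
for answer, answer_score in [("Revenue was $7.19 billion.", 0.95), ("Revenue fell sharply.", 0.31)]:
    guarded = guard_answer(answer, answer_score)
    print(f"flagged={guarded.flagged} | {guarded.text!r}")
```

In the pipeline from this tutorial, the inputs would be response.response and event_handler.trustworthiness_score. Other policies, such as routing flagged answers to human review, fit the same shape.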

Read TLM's performance benchmarks to learn about the effectiveness of the trustworthiness scoring.

Rather than replacing your LLM with Cleanlab's (as done in this tutorial), you can alternatively use Cleanlab only to detect incorrect responses from your existing unmodified RAG system; check out our real-time Evaluation tutorial.