Back to Lightrag

Evaluation Result Reproduce

docs/Reproduce.md

1.4.158.3 KB
Original Source

Evaluation Result Reproduce

Dataset

The dataset used in LightRAG can be downloaded from TommyChien/UltraDomain.

Generate Query

LightRAG uses the following prompt to generate high-level queries, with the corresponding code in examples/generate_query.py.

Prompt

Given the following description of a dataset:

{description}

Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.

Output the results in the following structure:
- User 1: [user description]
    - Task 1: [task description]
        - Question 1:
        - Question 2:
        - Question 3:
        - Question 4:
        - Question 5:
    - Task 2: [task description]
        ...
    - Task 5: [task description]
- User 2: [user description]
    ...
- User 5: [user description]
    ...

Batch Eval

To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in reproduce/batch_eval.py.

Prompt

---Role---
You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
---Goal---
You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.

- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
- **Empowerment**: How well does the answer help the reader understand and make informed judgments about the topic?

For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner based on these three categories.

Here is the question:
{query}

Here are the two answers:

**Answer 1:**
{answer1}

**Answer 2:**
{answer2}

Evaluate both answers using the three criteria listed above and provide detailed explanations for each criterion.

Output your evaluation in the following JSON format:

{{
    "Comprehensiveness": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Provide explanation here]"
    }},
    "Empowerment": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Provide explanation here]"
    }},
    "Overall Winner": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Summarize why this answer is the overall winner based on the three criteria]"
    }}
}}

Overall Performance Table

AgricultureCSLegalMix
NaiveRAGLightRAGNaiveRAGLightRAGNaiveRAGLightRAGNaiveRAGLightRAG
Comprehensiveness32.4%67.6%38.4%61.6%16.4%83.6%38.8%61.2%
Diversity23.6%76.4%38.0%62.0%13.6%86.4%32.4%67.6%
Empowerment32.4%67.6%38.8%61.2%16.4%83.6%42.8%57.2%
Overall32.4%67.6%38.8%61.2%15.2%84.8%40.0%60.0%
RQ-RAGLightRAGRQ-RAGLightRAGRQ-RAGLightRAGRQ-RAGLightRAG
Comprehensiveness31.6%68.4%38.8%61.2%15.2%84.8%39.2%60.8%
Diversity29.2%70.8%39.2%60.8%11.6%88.4%30.8%69.2%
Empowerment31.6%68.4%36.4%63.6%15.2%84.8%42.4%57.6%
Overall32.4%67.6%38.0%62.0%14.4%85.6%40.0%60.0%
HyDELightRAGHyDELightRAGHyDELightRAGHyDELightRAG
Comprehensiveness26.0%74.0%41.6%58.4%26.8%73.2%40.4%59.6%
Diversity24.0%76.0%38.8%61.2%20.0%80.0%32.4%67.6%
Empowerment25.2%74.8%40.8%59.2%26.0%74.0%46.0%54.0%
Overall24.8%75.2%41.6%58.4%26.4%73.6%42.4%57.6%
GraphRAGLightRAGGraphRAGLightRAGGraphRAGLightRAGGraphRAGLightRAG
Comprehensiveness45.6%54.4%48.4%51.6%48.4%51.6%50.4%49.6%
Diversity22.8%77.2%40.8%59.2%26.4%73.6%36.0%64.0%
Empowerment41.2%58.8%45.2%54.8%43.6%56.4%50.8%49.2%
Overall45.2%54.8%48.0%52.0%47.2%52.8%50.4%49.6%

Reproduce

All the code can be found in the ./reproduce directory.

Step-0 Extract Unique Contexts

First, extract unique contexts from the datasets.

Code

python
def extract_unique_contexts(input_directory, output_directory):

    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")

Step-1 Insert Contexts

Insert the extracted contexts into the LightRAG system.

Code

python
def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")

Step-2 Generate Queries

Extract tokens from the first and second half of each context, then combine them as dataset descriptions to generate queries.

Code

python
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def get_summary(context, tot_tokens=2000):
    tokens = tokenizer.tokenize(context)
    half_tokens = tot_tokens // 2

    start_tokens = tokens[1000:1000 + half_tokens]
    end_tokens = tokens[-(1000 + half_tokens):1000]

    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)

    return summary

Step-3 Query

Extract and query LightRAG with the queries generated in Step-2.

Code

python
def extract_queries(file_path):
    with open(file_path, 'r') as f:
        data = f.read()

    data = data.replace('**', '')

    queries = re.findall(r'- Question \d+: (.+)', data)

    return queries