Copyright (c) 2024 Microsoft Corporation. - Graphrag

python

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

Index Migration (v2 to v3)

This notebook is used to maintain data model parity with older indexes for version 3.0 of GraphRAG. If you have a pre-3.0 index and need to migrate without re-running the entire pipeline, you can use this notebook to only update the pieces necessary for alignment. If you have a pre-2.0 index, please run the v2 migration notebook first!

NOTE: We recommend regenerating your settings.yaml with the latest version of GraphRAG using graphrag init, then copying your LLM settings into it before running this notebook. This keeps your config aligned with the latest version for the migration. The config changes from v2 to v3 are significant in places!

WARNING: This will overwrite your parquet files. Make a backup before running the cells below!
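One way to take that backup is sketched below. It is a minimal example, not part of the original notebook: the helper name backup_parquet_files and the backup folder name are hypothetical, and it assumes your parquet files sit directly in a local output folder (file-based storage).

```python
import shutil
from pathlib import Path


def backup_parquet_files(output_dir: Path, backup_dir: Path) -> list[Path]:
    """Copy every *.parquet file directly under output_dir into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for parquet_file in sorted(output_dir.glob("*.parquet")):
        target = backup_dir / parquet_file.name
        shutil.copy2(parquet_file, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied


# Example call (hypothetical paths; adjust to your project layout):
# backup_parquet_files(
#     Path("<your project directory>") / "output",
#     Path("<your project directory>") / "output_backup_v2",
# )
```

If your index lives in blob or other remote storage, this local copy does not apply; use that storage's own snapshot or copy tooling instead.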

python

# This is the directory that has your settings.yaml
PROJECT_DIRECTORY = "<your project directory>"

python

from pathlib import Path

from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag_common.config import load_config
from graphrag_storage.storage_factory import create_storage

config = load_config(GraphRagConfig, config_path=Path(PROJECT_DIRECTORY))
storage = create_storage(config.output_storage)

python

def remove_columns(df, columns):
    """Remove columns from a DataFrame, suppressing errors."""
    df.drop(labels=columns, axis=1, errors="ignore", inplace=True)

python

from graphrag_storage.tables.parquet_table_provider import ParquetTableProvider

# Create table provider from storage
table_provider = ParquetTableProvider(storage)

text_units = await table_provider.read_dataframe("text_units")

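# Replace the list-valued document_ids column with a single document_id,
# keeping the first id in each list.
text_units["document_id"] = text_units["document_ids"].apply(lambda ids: ids[0])
# Drop the old column now that document_id carries the value.
remove_columns(text_units, ["document_ids"])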

await table_provider.write_dataframe("text_units", text_units)
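
To see what the cell above does to the table, here is a small sketch on a toy DataFrame. The ids and values are invented for illustration, and it uses plain pandas rather than the table provider, so it can be run without touching your index.

```python
import pandas as pd

# Toy stand-in for the v2 text_units table (values are invented for illustration).
text_units = pd.DataFrame({
    "id": ["tu-1", "tu-2"],
    "document_ids": [["doc-a"], ["doc-b", "doc-c"]],
})

# Same transformation as the migration cell: keep only the first document id.
text_units["document_id"] = text_units["document_ids"].apply(lambda ids: ids[0])
text_units = text_units.drop(columns=["document_ids"], errors="ignore")

print(text_units.to_dict(orient="records"))
```

Note that a text unit listing several documents keeps only the first one, and a row with an empty list would raise an IndexError, so it may be worth checking your data for either case before writing the table back.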

Update settings.yaml

If you have left the default settings for your vector store schema, you may need to set explicit values that map each embedding type to a vector schema name. If you have already customized your vector store schema, this step may not be necessary.

Old default index names:

default-text_unit-text
default-entity-description
default-community-full_content

(If you left all of the defaults, check your output/lancedb folder to confirm the names above.)
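
A quick way to check that folder from Python is sketched below. The helper name list_lancedb_index_names is hypothetical, and the sketch assumes the local LanceDB layout in which each table is stored as a "<name>.lance" directory; it only reads directory names and does not open the database.

```python
from pathlib import Path


def list_lancedb_index_names(db_dir: Path) -> list[str]:
    """List index (table) names in a local LanceDB folder.

    Assumption: each table is stored as a "<name>.lance" directory.
    """
    return sorted(p.stem for p in db_dir.glob("*.lance") if p.is_dir())


# Example call (hypothetical path; adjust to your project layout):
# print(list_lancedb_index_names(Path("<your project directory>") / "output" / "lancedb"))
```

If the names printed differ from the old defaults listed above, use the names you actually see as the index_name values in your settings.yaml.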

New v3 default index names:

text_unit_text
entity_description
community_full_content

Therefore, with a v2 index you need to explicitly set the old index names so that GraphRAG connects to the existing indexes correctly. We no longer support the "prefix" setting; instead, set an explicit index_name for each embedding.

NOTE: We are also setting the vector_size for each index below, under the assumption that you are using a prior default embedding model with 1536 dimensions. Our new default of text-embedding-3-large has 3072 dimensions, and that value is used if vector_size is unset. If your setup is more complicated, you may want to configure this manually.

Here is an example of the new vector store config block that you may need in your settings.yaml:

yaml

vector_store:
  type: lancedb
  db_uri: output/lancedb
  index_schema:
    text_unit_text:
      index_name: default-text_unit-text
      vector_size: 1536
    entity_description:
      index_name: default-entity-description
      vector_size: 1536
    community_full_content:
      index_name: default-community-full_content
      vector_size: 1536