Outputs

The default pipeline produces a series of output tables that align with the conceptual knowledge model. This page describes the detailed output table schemas. By default we write these tables out as parquet files on disk.
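Because the tables are plain parquet files, they can be loaded into data frames for inspection. The following is a minimal sketch, not the library's own loader. It assumes pandas with a parquet engine is installed and that each table is saved as `<table name>.parquet` inside a single output directory; the file layout and the helper names `table_path` and `load_table` are assumptions for illustration.

```python
from pathlib import Path

# Table names as listed on this page.
TABLE_NAMES = [
    "communities",
    "community_reports",
    "covariates",
    "documents",
    "entities",
    "relationships",
    "text_units",
]


def table_path(output_dir, name):
    """Build the path to one table. The `<name>.parquet` layout is an assumption."""
    if name not in TABLE_NAMES:
        raise ValueError(f"unknown table: {name}")
    return Path(output_dir) / f"{name}.parquet"


def load_table(output_dir, name):
    """Load one output table into a pandas DataFrame."""
    import pandas as pd  # imported lazily so table_path works without pandas

    return pd.read_parquet(table_path(output_dir, name))
```

For example, `load_table("output", "entities")` would return the entities table if the files live in a directory named `output`. Remember that covariates is optional and may not exist.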

Shared fields

All tables have two identifier fields:

| name | type | description |
| --- | --- | --- |
| id | str | Generated UUID, assuring global uniqueness. |
| human_readable_id | int | This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually. |
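To make the relationship between the two identifiers concrete, here is a small sketch that resolves short IDs cited in a summary back to full UUIDs. The rows and the helper `index_by_short_id` are hypothetical; only the `id` and `human_readable_id` field names come from the table above.

```python
def index_by_short_id(rows):
    """Map human_readable_id -> id for the rows of a single table."""
    lookup = {}
    for row in rows:
        short_id = row["human_readable_id"]
        if short_id in lookup:
            # Short IDs are only unique within one table and one run.
            raise ValueError(f"duplicate human_readable_id: {short_id}")
        lookup[short_id] = row["id"]
    return lookup


# Hypothetical rows; real ids are generated UUIDs.
entities = [
    {"id": "6f1c2e0a-0000-4000-8000-000000000001", "human_readable_id": 0},
    {"id": "6f1c2e0a-0000-4000-8000-000000000002", "human_readable_id": 1},
]

index = index_by_short_id(entities)
cited_short_ids = [1, 0]  # e.g. short IDs printed as citations in a summary
resolved = [index[short_id] for short_id in cited_short_ids]
```

Since the short ID is created per-run, a lookup like this should only be used with tables from the same run.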

communities

This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.

| name | type | description |
| --- | --- | --- |
| community | int | Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment. |
| parent | int | Parent community ID. |
| children | int[] | List of child community IDs. |
| level | int | Depth of the community in the hierarchy. |
| title | str | Friendly name of the community. |
| entity_ids | str[] | List of entities that are members of the community. |
| relationship_ids | str[] | List of relationships that are wholly within the community (source and target are both in the community). |
| text_unit_ids | str[] | List of text units represented within the community. |
| period | str | Date of ingest, used for incremental update merges. ISO 8601. |
| size | int | Size of the community (entity count), used for incremental update merges. |
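The parent and children fields are enough to walk the hierarchy. The sketch below collects the ancestors of a community by following parent links. It assumes that top-level communities carry a sentinel parent value of `-1` and that the top level is level 0; neither is stated on this page, so adjust `root_parent` to match your data. The sample rows are hypothetical.

```python
def ancestors(communities, community_id, root_parent=-1):
    """Return the chain of parent community IDs, nearest parent first."""
    by_id = {row["community"]: row for row in communities}
    path = []
    current = by_id[community_id]["parent"]
    # Stop at the sentinel, or at an ID that is missing from the table.
    while current != root_parent and current in by_id:
        path.append(current)
        current = by_id[current]["parent"]
    return path


# Hypothetical hierarchy: 0 splits into 1 and 2, and 1 splits into 3.
communities = [
    {"community": 0, "parent": -1, "children": [1, 2], "level": 0},
    {"community": 1, "parent": 0, "children": [3], "level": 1},
    {"community": 2, "parent": 0, "children": [], "level": 1},
    {"community": 3, "parent": 1, "children": [], "level": 2},
]

chain = ancestors(communities, 3)
```

Under these assumptions, the number of ancestors of a community equals its level, which makes a convenient sanity check.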

community_reports

This is the list of summarized reports for each community.

| name | type | description |
| --- | --- | --- |
| community | int | Short ID of the community this report applies to. |
| parent | int | Parent community ID. |
| children | int[] | List of child community IDs. |
| level | int | Level of the community this report applies to. |
| title | str | LM-generated title for the report. |
| summary | str | LM-generated summary of the report. |
| full_content | str | LM-generated full report. |
| rank | float | LM-derived relevance ranking of the report based on member entity salience. |
| rating_explanation | str | LM-derived explanation of the rank. |
| findings | dict | LM-derived list of the top 5-10 insights from the community. Contains summary and explanation values. |
| full_content_json | json | Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users. |
| period | str | Date of ingest, used for incremental update merges. ISO 8601. |
| size | int | Size of the community (entity count), used for incremental update merges. |
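As an illustration of how the level and rank columns can be combined, the sketch below picks the highest-ranked reports at one level of the hierarchy. It assumes that a higher rank means a more relevant report and that findings is stored as a list of dicts with `summary` and `explanation` keys; both are assumptions, and the rows are hypothetical.

```python
def top_reports(reports, level, limit=3):
    """Return up to `limit` reports at `level`, highest rank first."""
    at_level = [report for report in reports if report["level"] == level]
    return sorted(at_level, key=lambda report: report["rank"], reverse=True)[:limit]


def finding_summaries(report):
    """Collect the summary line of each finding in a report."""
    return [finding["summary"] for finding in report["findings"]]


# Hypothetical rows with only the columns used here.
reports = [
    {"community": 0, "level": 0, "rank": 6.5, "findings": []},
    {"community": 1, "level": 1, "rank": 8.0,
     "findings": [{"summary": "Finding A", "explanation": "Details for A"}]},
    {"community": 2, "level": 1, "rank": 4.0, "findings": []},
]

best = top_reports(reports, level=1, limit=1)
```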

covariates

(Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.

| name | type | description |
| --- | --- | --- |
| covariate_type | str | This is always "claim" with our default covariates. |
| type | str | Nature of the claim type. |
| description | str | LM-generated description of the behavior. |
| subject_id | str | Name of the source entity (that is performing the claimed behavior). |
| object_id | str | Name of the target entity (that the claimed behavior is performed on). |
| status | str | LM-derived assessment of the correctness of the claim. One of [TRUE, FALSE, SUSPECTED]. |
| start_date | str | LM-derived start of the claimed activity. ISO 8601. |
| end_date | str | LM-derived end of the claimed activity. ISO 8601. |
| source_text | str | Short string of text containing the claimed behavior. |
| text_unit_id | str | ID of the text unit the claim text was extracted from. |
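A common way to use this table is to group claims by the entity they are about. The sketch below counts claim statuses per subject. The rows and the helper name are hypothetical, and the claim `type` values shown are invented for illustration; only the column names and the three status values come from the table above.

```python
from collections import Counter


def status_counts_per_subject(covariates):
    """Count claims per subject entity, broken down by status."""
    counts = {}
    for claim in covariates:
        counts.setdefault(claim["subject_id"], Counter())[claim["status"]] += 1
    return counts


# Hypothetical rows with only the columns used here.
covariates = [
    {"covariate_type": "claim", "type": "EXAMPLE_TYPE_A",
     "subject_id": "ACME", "object_id": "BOB", "status": "SUSPECTED"},
    {"covariate_type": "claim", "type": "EXAMPLE_TYPE_A",
     "subject_id": "ACME", "object_id": "CAROL", "status": "TRUE"},
    {"covariate_type": "claim", "type": "EXAMPLE_TYPE_B",
     "subject_id": "BOB", "object_id": "ACME", "status": "SUSPECTED"},
]

counts = status_counts_per_subject(covariates)
```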

documents

List of document content after import.

| name | type | description |
| --- | --- | --- |
| title | str | Filename, unless otherwise configured during CSV import. |
| text | str | Full text of the document. |
| text_unit_ids | str[] | List of text units (chunks) that were parsed from the document. |
| metadata | dict | If specified during CSV import, this is a dict of metadata for the document. |

entities

List of all entities found in the data by the LM.

| name | type | description |
| --- | --- | --- |
| title | str | Name of the entity. |
| type | str | Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used. |
| description | str | Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions. |
| text_unit_ids | str[] | List of the text units containing the entity. |
| frequency | int | Count of text units the entity was found within. |
| degree | int | Node degree (connectedness) in the graph. |
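The text_unit_ids and frequency columns can be understood as an aggregation over individual entity mentions. The sketch below illustrates that reading; it is not the pipeline's implementation. The `(title, text_unit_id)` mention pairs and the helper name are hypothetical, and the LM-derived description summary is left out because it cannot be reproduced without a model.

```python
def aggregate_mentions(mentions):
    """Merge (title, text_unit_id) mentions into one row per entity."""
    units_by_title = {}
    for title, text_unit_id in mentions:
        units = units_by_title.setdefault(title, [])
        # Count each text unit once, even if the entity appears in it repeatedly.
        if text_unit_id not in units:
            units.append(text_unit_id)
    return [
        {"title": title, "text_unit_ids": units, "frequency": len(units)}
        for title, units in units_by_title.items()
    ]


# Hypothetical mentions: ACME appears in two text units, twice in the first.
mentions = [("ACME", "t1"), ("ACME", "t1"), ("ACME", "t2"), ("BOB", "t2")]
rows = aggregate_mentions(mentions)
```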

relationships

List of all entity-to-entity relationships found in the data by the LM. This is also the edge list for the graph.

| name | type | description |
| --- | --- | --- |
| source | str | Name of the source entity. |
| target | str | Name of the target entity. |
| description | str | LM-derived description of the relationship. Also see note for entity descriptions. |
| weight | float | Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance. |
| combined_degree | int | Sum of source and target node degrees. |
| text_unit_ids | str[] | List of text units the relationship was found within. |
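The weight and combined_degree columns follow from simple arithmetic over relationship instances. The sketch below shows one way to derive them, under stated assumptions: each instance is a dict with a hypothetical `strength` field, instances are merged per `(source, target)` pair with direction preserved, and node degree counts the merged edges touching an entity. The actual pipeline may differ in these details.

```python
from collections import Counter, defaultdict


def build_edges(instances):
    """Merge relationship instances into one edge row per (source, target) pair."""
    weights = defaultdict(float)
    text_units = defaultdict(list)
    for inst in instances:
        key = (inst["source"], inst["target"])
        weights[key] += inst["strength"]  # weight = sum of per-instance strength
        if inst["text_unit_id"] not in text_units[key]:
            text_units[key].append(inst["text_unit_id"])

    # Node degree: number of merged edges touching each entity.
    degree = Counter()
    for source, target in weights:
        degree[source] += 1
        degree[target] += 1

    return [
        {
            "source": source,
            "target": target,
            "weight": weights[(source, target)],
            "combined_degree": degree[source] + degree[target],
            "text_unit_ids": text_units[(source, target)],
        }
        for source, target in weights
    ]


# Hypothetical instances: ACME -> BOB is seen in two text units.
instances = [
    {"source": "ACME", "target": "BOB", "strength": 2.0, "text_unit_id": "t1"},
    {"source": "ACME", "target": "BOB", "strength": 3.0, "text_unit_id": "t2"},
    {"source": "ACME", "target": "CAROL", "strength": 1.0, "text_unit_id": "t1"},
]
edges = build_edges(instances)
```

Here ACME has degree 2 while BOB and CAROL each have degree 1, so both edges get a combined_degree of 3, and the ACME to BOB edge gets a weight of 5.0.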

text_units

List of all text chunks parsed from the input documents.

| name | type | description |
| --- | --- | --- |
| text | str | Raw full text of the chunk. |
| n_tokens | int | Number of tokens in the chunk. This should normally match the chunk_size config parameter, except for the last chunk which is often shorter. |
| document_id | str | ID of the document the chunk came from. |
| entity_ids | str[] | List of entities found in the text unit. |
| relationships_ids | str[] | List of relationships found in the text unit. |
| covariate_ids | str[] | Optional list of covariates found in the text unit. |
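The note on n_tokens can be seen in a simplified chunking sketch. It splits an already tokenized document into fixed-size chunks with no overlap between chunks, which is a simplifying assumption; the tokenizer and the helper name `chunk_tokens` are also hypothetical.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# A hypothetical document of 10 tokens, chunked with chunk_size=4.
tokens = list(range(10))
chunks = chunk_tokens(tokens, 4)
n_tokens = [len(chunk) for chunk in chunks]  # [4, 4, 2]
```

Every chunk except the last has exactly chunk_size tokens, and the last chunk holds the remainder.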