metadata-models/docs/entities/query.md
The query entity represents SQL queries (or queries in other languages) that have been executed against one or more data assets such as datasets, tables, or views. Query entities capture both manually created queries and queries discovered through automated crawling of query logs from data platforms like BigQuery, Snowflake, Redshift, and others.
Queries are powerful building blocks for understanding data lineage, usage patterns, and relationships between datasets. When DataHub ingests query logs from data warehouses, it automatically creates query entities that capture the SQL statements, the datasets they reference, execution statistics, and usage patterns over time.
Query entities are identified by a single piece of information: a unique identifier (id) that serves as the key for the query entity. This identifier is typically generated as a hash of the normalized query text, ensuring that identical queries are deduplicated and treated as the same entity. An example of a query identifier is urn:li:query:3b8d7b8c7e4e8b4e3c2e1a5c6d7e8f9a. The identifier is an opaque string that can be generated through various means.
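The hash-based identification idea can be sketched in a few lines of Python. The normalization below (whitespace collapse plus lowercasing) is illustrative only; DataHub's actual fingerprinting is far more thorough (dialect-aware parsing, literal handling):

```python
import hashlib
import re


def normalize_sql(sql: str) -> str:
    # Collapse whitespace and lowercase so that formatting differences do
    # not produce distinct fingerprints. This is a simplified stand-in for
    # DataHub's real query canonicalization.
    return re.sub(r"\s+", " ", sql).strip().lower()


def make_query_urn(sql: str) -> str:
    # Hash the normalized text so identical queries collapse to one URN.
    digest = hashlib.sha256(normalize_sql(sql).encode("utf-8")).hexdigest()
    return f"urn:li:query:{digest[:32]}"


urn_a = make_query_urn("SELECT *   FROM orders")
urn_b = make_query_urn("select * from orders")
print(urn_a == urn_b)  # True: formatting differences do not create new entities
```

Because the URN is derived from the statement text rather than assigned sequentially, any producer that sees the same logical query arrives at the same entity key.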
The queryProperties aspect contains the core information about a query: the query statement itself, an optional name and description, audit stamps, and the query source (MANUAL for user-entered queries via the UI, or SYSTEM for queries discovered by crawlers).

The following code snippet shows how to create a query entity with properties.
<details>
<summary>Python SDK: Create a query with properties</summary>

{{ inline /metadata-ingestion/examples/library/query_create.py show_path_as_comment }}

</details>
You can also update specific properties of an existing query:
<details>
<summary>Python SDK: Update query properties</summary>

{{ inline /metadata-ingestion/examples/library/query_update_properties.py show_path_as_comment }}

</details>
The querySubjects aspect captures the data assets that are referenced by a query. These are the datasets, tables, views, or other entities that the query reads from or writes to.
In single-asset queries (e.g., SELECT * FROM table), the subjects will contain a single table reference. In multi-asset queries (e.g., joins across multiple tables), the subjects may contain multiple table references.
{{ inline /metadata-ingestion/examples/library/query_add_subjects.py show_path_as_comment }}
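To illustrate how subjects relate to the query text, here is a deliberately naive extraction sketch. Real ingestion relies on a proper SQL parser (DataHub's lineage extraction is built on sqlglot); the regex below handles only simple, unquoted identifiers and is for illustration only:

```python
import re


def extract_subjects(sql: str) -> list[str]:
    # Naive: grab identifiers that follow FROM or JOIN keywords. A real
    # parser also handles subqueries, CTEs, quoting, and dialect quirks.
    pattern = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)
    seen: list[str] = []
    for name in pattern.findall(sql):
        if name.lower() not in (s.lower() for s in seen):
            seen.append(name)
    return seen


print(extract_subjects("SELECT * FROM sales"))
# -> ['sales']  (single-asset query: one subject)
print(extract_subjects(
    "SELECT o.id FROM orders o JOIN customers c ON o.cid = c.id"))
# -> ['orders', 'customers']  (multi-asset query: multiple subjects)
```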
The queryUsageStatistics aspect is a timeseries aspect that tracks execution statistics and usage patterns for queries over time, such as how often a query ran in a given window and which users ran it.
This aspect is typically populated automatically by ingestion connectors that process query logs from data platforms. The timeseries nature allows for tracking trends and patterns in query usage over time.
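As a rough sketch of how a connector might bucket raw query-log events into timeseries points (the event tuples and day-granularity bucketing here are assumptions for illustration, not the connector's actual schema):

```python
from collections import Counter
from datetime import datetime, timezone


def day_bucket(ts_millis: int) -> int:
    # Truncate an event timestamp to the start of its UTC day, mirroring
    # the bucketed shape of timeseries usage aspects.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    start = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(start.timestamp() * 1000)


# Hypothetical query-log events: (query_urn, timestamp_millis, user)
events = [
    ("urn:li:query:abc", 1_700_000_000_000, "alice"),
    ("urn:li:query:abc", 1_700_003_600_000, "bob"),    # same UTC day
    ("urn:li:query:abc", 1_700_090_000_000, "alice"),  # next UTC day
]

# Count executions per (query, day-bucket) pair.
run_counts = Counter((urn, day_bucket(ts)) for urn, ts, _ in events)
for (urn, bucket), count in sorted(run_counts.items()):
    print(urn, bucket, count)
```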
Like other DataHub entities, queries can have Tags or Terms attached to them. Tags are informal labels for categorizing queries (e.g., "production", "experimental", "deprecated"), while Terms are formal vocabulary from a business glossary (e.g., "Customer Data", "Financial Reporting").
Tags are added to queries using the globalTags aspect.
{{ inline /metadata-ingestion/examples/library/query_add_tag.py show_path_as_comment }}
Terms are added using the glossaryTerms aspect.
{{ inline /metadata-ingestion/examples/library/query_add_term.py show_path_as_comment }}
Ownership is associated with a query using the ownership aspect. Owners can be of different types such as TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, etc. Ownership helps identify who is responsible for maintaining and understanding specific queries.
{{ inline /metadata-ingestion/examples/library/query_add_owner.py show_path_as_comment }}
The dataPlatformInstance aspect allows you to specify which specific instance of a data platform the query is associated with. This is useful when you have multiple instances of the same platform (e.g., multiple Snowflake accounts or BigQuery projects).
Queries have a fundamental relationship with dataset entities through the querySubjects aspect. Each subject in a query references a dataset URN, creating a bidirectional relationship: you can navigate from a dataset to the queries that reference it, and from a query to its subject datasets.
This relationship is crucial for understanding dataset usage and query-based lineage.
Queries play a central role in DataHub's lineage capabilities:
Query-based Lineage: When DataHub processes SQL queries (either from query logs or manually provided), it performs SQL parsing to extract column-level lineage information. This lineage is then attached to datasets, showing how data flows from source columns to destination columns through SQL transformations.
Fine-grained Lineage: Queries can be referenced in fine-grained lineage edges on datasets, providing the SQL context for how specific columns are derived. The query URN is stored in the query field of fine-grained lineage information.
Origin Tracking: Queries can have an origin field pointing to the entity they came from (e.g., a View or Stored Procedure), creating a traceable chain from the query execution back to its source definition.
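As an illustration of the fine-grained case, here is a lineage edge carrying a query reference, written as a plain dict. The field names mirror DataHub's FineGrainedLineage model, and all URNs are hypothetical placeholders:

```python
# Hypothetical dataset and upstream URNs for illustration.
dataset = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders_daily,PROD)"
upstream = "urn:li:dataset:(urn:li:dataPlatform:snowflake,raw.orders,PROD)"

edge = {
    "upstreamType": "FIELD_SET",
    "upstreams": [f"urn:li:schemaField:({upstream},amount)"],
    "downstreamType": "FIELD",
    "downstreams": [f"urn:li:schemaField:({dataset},daily_total)"],
    # The query URN supplies the SQL context for this column derivation.
    "query": "urn:li:query:3b8d7b8c7e4e8b4e3c2e1a5c6d7e8f9a",
}
print(edge["query"])
```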
Several DataHub ingestion connectors automatically discover and create query entities:
Redshift, for example, processes the STL_QUERY and SVL_QUERY system tables.

These connectors typically parse the logged SQL, extract the datasets each query references, and emit query entities along with usage statistics.
Queries can be created, updated, and deleted through the DataHub GraphQL API (for example, via the createQuery, updateQuery, and deleteQuery mutations).
These mutations are available through the GraphQL endpoint and are used by the DataHub UI for manual query management.
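A request to this endpoint can be assembled as ordinary JSON. The mutation and input field names below follow DataHub's GraphQL conventions but should be verified against your server's schema (e.g. via GraphiQL introspection) before use; the URNs and values are hypothetical:

```python
import json

# GraphQL mutation text; exact input type and field names are assumptions.
mutation = """
mutation createQuery($input: CreateQueryInput!) {
  createQuery(input: $input) { urn }
}
"""

variables = {
    "input": {
        "properties": {
            "name": "Daily revenue rollup",
            "description": "Aggregates order totals per day",
            "statement": {
                "value": "SELECT order_date, SUM(total) FROM orders GROUP BY 1",
                "language": "SQL",
            },
        },
        "subjects": [
            {"datasetUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.orders,PROD)"}
        ],
    }
}

# This payload would be POSTed to the /api/graphql endpoint.
payload = json.dumps({"query": mutation, "variables": variables})
print(len(payload) > 0)
```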
Query entities contribute to dataset usage analytics. When query usage statistics are ingested, they feed into the usage metrics of the datasets that each query references.
Queries are automatically deduplicated based on their normalized query text, so repeated executions of the same statement, regardless of formatting, map to a single query entity.
This deduplication is essential for managing the volume of queries in large-scale deployments.
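A toy aggregation shows the effect on volume. The normalization here (whitespace collapse plus lowercasing) is a stand-in for DataHub's real canonicalization, and the log rows are hypothetical:

```python
import hashlib
import re


def fingerprint(sql: str) -> str:
    # Simplified normalization: collapse whitespace and lowercase.
    canonical = re.sub(r"\s+", " ", sql).strip().lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical query-log rows: (statement, observed_at_millis)
log = [
    ("SELECT * FROM t", 100),
    ("select *  from t", 250),   # same logical query, different formatting
    ("SELECT id FROM t", 300),
]

# Fold duplicate statements into one entity with a run count and
# a lastObserved timestamp.
entities: dict[str, dict] = {}
for sql, ts in log:
    entry = entities.setdefault(fingerprint(sql), {"runs": 0, "lastObserved": 0})
    entry["runs"] += 1
    entry["lastObserved"] = max(entry["lastObserved"], ts)

print(len(entities))  # 2 distinct query entities from 3 log rows
```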
When processing queries that involve temporary tables, the SQL parsing aggregator maintains session context so that lineage can be resolved through the temporary tables to their true upstream sources.
This ensures that query lineage reflects the actual data dependencies rather than intermediate temporary structures.
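The resolution step can be sketched as a small graph walk; the table names and lineage map below are hypothetical:

```python
def resolve_upstreams(target: str,
                      lineage: dict[str, set[str]],
                      temp_tables: set[str]) -> set[str]:
    # Walk upstream edges; whenever an upstream is a session-scoped temp
    # table, substitute that temp table's own (recursively resolved) sources.
    resolved: set[str] = set()
    for upstream in lineage.get(target, set()):
        if upstream in temp_tables:
            resolved |= resolve_upstreams(upstream, lineage, temp_tables)
        else:
            resolved.add(upstream)
    return resolved


# Session: CREATE TEMP TABLE staging AS SELECT ... FROM raw_events;
#          INSERT INTO daily_summary SELECT ... FROM staging JOIN dims ...;
lineage = {
    "staging": {"raw_events"},
    "daily_summary": {"staging", "dims"},
}
upstreams = resolve_upstreams("daily_summary", lineage, {"staging"})
print(sorted(upstreams))  # ['dims', 'raw_events'] -- the temp table is skipped
```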
Very large query statements (e.g., generated queries with thousands of lines) may be truncated or rejected to maintain system performance. The exact limits depend on the backend configuration and the storage layer.
Currently, query entities primarily support SQL as the query language. While there is an UNKNOWN language option, DataHub's SQL parsing and lineage extraction capabilities are specifically designed for SQL dialects. Other query languages (e.g., Cypher, SPARQL, or proprietary query languages) can be stored but will not benefit from automatic lineage extraction.
Queries can have two sources:
- MANUAL: Queries created by users through the DataHub UI or API
- SYSTEM: Queries discovered automatically by ingestion connectors

This distinction helps differentiate between user-curated queries (which might be documented and named) and the potentially large volume of queries automatically discovered from query logs.