The dbt module ingests metadata from dbt into DataHub. It is intended for production ingestion workflows, and module-specific capabilities are documented below.

DataHub can ingest Query entities from the `meta.queries` field in your dbt models. This allows you to document "blessed" or commonly used query patterns directly in dbt and surface them in DataHub's Queries tab for easy discovery and reuse by your team.
Config Options:
```yaml
source:
  type: dbt
  config:
    manifest_path: target/manifest.json
    # Control Query entity emission (default: YES)
    entities_enabled:
      queries: "NO" # or "YES" (default), "ONLY"
    # Limit queries per model (default: 100; set 0 for unlimited)
    max_queries_per_model: 100
```
:::note Integration with Warehouse Query Ingestion
If you're also using warehouse query ingestion (e.g., Snowflake usage, BigQuery audit logs), dbt-emitted queries will coexist with warehouse-discovered queries in the Queries tab. They're differentiated by source: dbt queries have `source: MANUAL`, while warehouse queries typically have `source: SYSTEM`.
:::
The `meta.queries` field is defined in your dbt model's properties file (e.g., `schema.yml`, `models.yml`, or any `.yml` file in your dbt project). When you run `dbt docs generate` or `dbt compile`, this metadata is included in the `manifest.json` file, which DataHub then ingests.

Add queries to your model's `meta` field in your dbt properties file:
```yaml
# models/schema.yml or models/customers.yml
version: 2
models:
  - name: customers
    description: "Customer dimension table"
    meta:
      queries:
        - name: "Active customers (30d)"
          description: "Customers active in the last 30 days"
          sql: |
            SELECT *
            FROM {{ ref('customers') }}
            WHERE active = true
              AND last_seen > CURRENT_DATE - INTERVAL '30 days'
          tags: ["production", "analytics"]
          terms: ["CustomerData", "Engagement"]
        - name: "Revenue by customer"
          description: "Total revenue aggregated by customer"
          sql: |
            SELECT
              customer_id,
              SUM(amount) as total_revenue
            FROM {{ ref('customers') }}
            GROUP BY customer_id
          tags: ["finance", "reporting"]
```
Then generate your dbt artifacts:
```shell
dbt docs generate
# This creates/updates target/manifest.json with the meta.queries data
```
Finally, run DataHub ingestion:
```shell
datahub ingest -c your_dbt_recipe.yml
# DataHub reads manifest.json and creates Query entities
```
Each query in the `queries` list supports the following fields:

| Field | Required | Type | Description |
|---|---|---|---|
| `name` | ✅ Yes | string | Unique name for the query |
| `sql` | ✅ Yes | string | SQL statement for the query |
| `description` | ❌ No | string | Human-readable description |
| `tags` | ❌ No | list of strings | Tags for categorization (stored in `customProperties`) |
| `terms` | ❌ No | list of strings | Glossary terms for classification (stored in `customProperties`) |
In summary:

1. You define `queries` in the `meta` field of your dbt model properties.
2. When you run `dbt docs generate`, the `meta.queries` data is included in `manifest.json`.
3. DataHub ingests the `meta.queries` field from the manifest.
4. Each entry in `meta.queries` becomes a Query entity in DataHub:
   - URN format: `urn:li:query:{dbt_unique_id}_{sanitized_query_name}` (e.g., `urn:li:query:model.my_project.customers_Active_customers_30d_`). Query names are sanitized for URN generation (`[^a-zA-Z0-9_\-\.]+` → `_`).
   - Queries are attributed to the `dbt_executor` actor.
   - Timestamps use the manifest's `generated_at` for reproducibility, falling back to the current time if unavailable.
   - Tags and terms are stored in `customProperties` (see Known Limitations below).

The following scenarios are handled during ingestion:

| Scenario | Behavior |
|---|---|
| `meta.queries` not a list | Skipped with WARNING log |
| Query missing `name` or `sql` | Skipped; all validation errors shown in log and `queries_failed_list` |
| Duplicate query names | Duplicate skipped, first definition wins (WARNING) |
| Invalid `description` (not a string) | Field ignored with WARNING log |
| Invalid `tags`/`terms` (not a list) | Field ignored with WARNING log |
| Empty values in `tags`/`terms` list | Filtered out automatically |
| Manifest timestamp unparseable | Falls back to current time with WARNING; tracked in `query_timestamps_fallback_used` |
| Queries on ephemeral model | Skipped with WARNING (ephemeral models don't exist in the target platform) |
| Exceeds `max_queries_per_model` | Only first N processed (configurable, default 100); WARNING logged |
All validation errors are logged at WARNING level and tracked in the ingestion report.
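The rules in the table above can be summarized in a short sketch. This is an illustration of the documented behavior, not DataHub's actual implementation; the function name and return shape are invented.

```python
# Illustrative sketch of the documented validation rules (not DataHub code).
def validate_queries(raw, max_queries_per_model=100):
    """Return (accepted, failed) lists, mirroring the behavior table."""
    if not isinstance(raw, list):
        return [], ["meta.queries is not a list"]  # whole block skipped
    limit = max_queries_per_model or len(raw)  # 0 means unlimited
    accepted, failed, seen = [], [], set()
    for q in raw[:limit]:  # entries beyond the limit are dropped
        name, sql = q.get("name"), q.get("sql")
        if not name or not sql:
            failed.append(f"missing name or sql: {q!r}")
            continue
        if name in seen:
            failed.append(f"duplicate name (first definition wins): {name}")
            continue
        seen.add(name)
        for field in ("tags", "terms"):
            value = q.get(field)
            # Non-list tags/terms are ignored; empty values are filtered out
            q[field] = [v for v in value if v] if isinstance(value, list) else []
        accepted.append(q)
    return accepted, failed

ok, bad = validate_queries([
    {"name": "a", "sql": "SELECT 1", "tags": ["x", ""]},
    {"name": "a", "sql": "SELECT 2"},   # duplicate: skipped
    {"sql": "SELECT 3"},                # missing name: skipped
])
print([q["name"] for q in ok], len(bad))  # ['a'] 2
```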
After ingestion, queries appear in the Queries tab of the model's target dataset, attributed to the `dbt_executor` actor.

The `meta.queries` feature works alongside other dbt metadata capabilities in DataHub:
| Feature | Purpose | Config Key |
|---|---|---|
| `meta.queries` | Define Query entities for discovery | `entities_enabled.queries` |
| `meta_mapping` | Map dbt meta fields to DataHub tags/terms/owners | `meta_mapping` |
| `column_meta_mapping` | Map column-level meta to DataHub aspects | `column_meta_mapping` |
| `owner_extraction_pattern` | Extract owners from meta fields | `owner_extraction_pattern` |
| `tag_prefix` | Prefix for auto-generated tags | `tag_prefix` |
Example combining features:
```yaml
source:
  type: dbt
  config:
    manifest_path: target/manifest.json
    catalog_path: target/catalog.json
    target_platform: snowflake

    # Enable query entities from meta.queries
    entities_enabled:
      queries: "YES"
    max_queries_per_model: 100

    # Map other meta fields to DataHub aspects
    meta_mapping:
      business_owner:
        match: ".*"
        operation: "add_owner"
        config:
          owner_type: user
      data_tier:
        match: ".*"
        operation: "add_tag"
    # Enable processing of the meta_mapping rules above
    enable_meta_mapping: true
```
**Tags/Terms in Custom Properties:** Query entities don't currently support native GlobalTags or GlossaryTerms aspects. Tags and terms are stored as comma-separated strings in `customProperties`, so they behave as plain string properties rather than first-class tags or terms.
**No SQL Validation Against Model:** The `sql` field in `meta.queries` is not validated against the model it's defined on. You could define `sql: "SELECT * FROM products"` under the `customers` model; DataHub trusts that users define meaningful queries. Consider documenting your team's conventions for query definitions.
**URN Collision on Similar Names:** Query names are sanitized for URN generation. Names like "Revenue (USD)" and "Revenue [USD]" both become `Revenue_USD_`, causing a collision (the second one is skipped with a warning). Use distinct, alphanumeric query names to avoid this.
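The collision can be reproduced with the sanitization pattern quoted earlier (`[^a-zA-Z0-9_\-\.]+` → `_`). The snippet below is a standalone illustration of that pattern, not DataHub's code:

```python
import re

def sanitize(name: str) -> str:
    # Replace each run of disallowed characters with a single underscore,
    # matching the documented sanitization pattern.
    return re.sub(r"[^a-zA-Z0-9_\-\.]+", "_", name)

print(sanitize("Revenue (USD)"))  # Revenue_USD_
print(sanitize("Revenue [USD]"))  # Revenue_USD_  <- same URN, collision!
```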
**Ephemeral Models Not Supported:** Queries defined on ephemeral models (`materialized: ephemeral`) are skipped because ephemeral models don't exist as physical tables in the target platform. Queries are linked to target-platform datasets, so there's no dataset to link them to.
:::tip Choosing Between meta.queries and meta_mapping
Use `meta.queries` to publish reusable SQL as standalone Query entities for discovery; use `meta_mapping` to translate other `meta` fields into tags, terms, and owners on the dataset itself.
:::
The artifacts used by this source are the files in your dbt project's `target/` directory, including `manifest.json`, `catalog.json`, `sources.json`, and `run_results.json`.
Recommended workflow for dbt build and DataHub ingestion:
```shell
dbt source snapshot-freshness
dbt build
# dbt docs generate overwrites run_results.json, so preserve the build results
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json
# Run datahub ingestion, pointing at the files in the target/ directory
```
The necessary artifact files will then appear in the target/ directory of your dbt project.
We also have guides on handling more complex dbt orchestration techniques and multi-project setups below.
:::note Entity is in manifest but missing from catalog
This warning usually appears when the `catalog.json` file was not generated by a `dbt docs generate` command.
Most other dbt commands generate a partial catalog file, which may impact the completeness of the metadata ingested into DataHub.
Following the above workflow should ensure that the catalog file is generated correctly.
:::