metadata-ingestion/docs/sources/postgres/postgres_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
DataHub can extract table-level lineage from your PostgreSQL query history using the pg_stat_statements extension. This feature analyzes executed SQL queries to automatically discover upstream and downstream dataset dependencies.
INSERT...SELECT, CREATE TABLE AS SELECT, CREATE VIEW, and other DML/DDL statementsEnable query-based lineage in your DataHub recipe:
source:
type: postgres
config:
host_port: "localhost:5432"
database: "your_database"
username: "datahub_user"
password: "your_password"
# Enable query-based lineage extraction
include_query_lineage: true
# Optional: Configure lineage extraction
max_queries_to_extract: 1000 # Default: 1000
min_query_calls: 10 # Only extract queries executed ≥10 times
# Optional: Exclude specific query patterns
query_exclude_patterns:
- "%pg_catalog%" # Exclude system catalog queries
- "%temp_%" # Exclude temporary table queries
- "%staging%" # Exclude staging queries
# Optional: Enable usage statistics
include_usage_statistics: true
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
include_query_lineage | boolean | false | Enable query-based lineage extraction from pg_stat_statements |
max_queries_to_extract | integer | 1000 | Maximum number of queries to extract. Queries are prioritized by execution time and frequency. |
min_query_calls | integer | 1 | Minimum number of times a query must be executed to be included in lineage analysis. Higher values focus on frequently-used queries. |
query_exclude_patterns | list[string] | [] | SQL LIKE patterns to exclude queries. Patterns are case-insensitive. Example: "%pg_catalog%" excludes all queries containing pg_catalog. |
include_usage_statistics | boolean | false | Generate dataset usage metrics from query history. Requires include_query_lineage: true. Shows unique user counts, query frequencies, and column access patterns in the DataHub UI. |
INSERT INTO target_table (col1, col2)
SELECT col1, col2 FROM source_table;
Lineage: source_table → target_table
CREATE TABLE new_table AS
SELECT a.col1, b.col2
FROM table_a a
JOIN table_b b ON a.id = b.id;
Lineage: table_a, table_b → new_table
CREATE VIEW customer_summary AS
SELECT c.customer_id, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
Lineage: customers, orders → customer_summary
WITH monthly_revenue AS (
SELECT customer_id, SUM(amount) as revenue
FROM transactions
WHERE date >= '2024-01-01'
GROUP BY customer_id
)
INSERT INTO customer_metrics (customer_id, total_revenue)
SELECT mr.customer_id, mr.revenue
FROM monthly_revenue mr
JOIN customers c ON mr.customer_id = c.id
WHERE c.active = true;
Lineage: transactions, customers → customer_metrics
After running ingestion, verify that lineage was extracted:
1. Check ingestion logs
Look for messages like:
INFO - Prerequisites check: Prerequisites met
INFO - Extracted 850 queries from pg_stat_statements in 1.23 seconds
INFO - Processed 850 queries for lineage extraction (12 failed) in 4.56 seconds
2. Query pg_stat_statements directly
-- Check if queries are being tracked
SELECT COUNT(*) FROM pg_stat_statements;
-- View most frequently executed queries
SELECT
calls,
total_exec_time / 1000 as total_seconds,
query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
3. Verify in DataHub UI
Navigate to a dataset in DataHub and check the "Lineage" tab. You should see upstream and downstream dependencies derived from query history.
Memory usage:
pg_stat_statements.max based on your available memoryQuery overhead:
Ingestion performance:
max_queries_to_extract to limit extraction timeReset pg_stat_statements periodically
-- Reset statistics (caution: clears all tracked queries)
SELECT pg_stat_statements_reset();
This prevents unbounded memory growth and focuses on recent query patterns.
Use meaningful filters
query_exclude_patternsmin_query_calls to focus on production workloadsMonitor memory usage
-- Check current query count
SELECT COUNT(*) as query_count FROM pg_stat_statements;
-- Check memory usage
SELECT pg_size_pretty(
pg_database_size('your_database')
);
Schedule regular ingestion
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Historical data only
pg_stat_statements_reset()Dynamic SQL
SELECT * FROM users WHERE id = $1 (value not captured)Complex transformations
No column-level lineage
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
Error message:
ERROR - Insufficient permissions. Grant pg_read_all_stats role: GRANT pg_read_all_stats TO <user>;
Solution:
-- Grant the required role
GRANT pg_read_all_stats TO datahub_user;
-- Or verify current permissions
SELECT
pg_has_role(current_user, 'pg_read_all_stats', 'MEMBER') as has_stats_role,
usesuper as is_superuser
FROM pg_user
WHERE usename = current_user;
Error message:
ERROR - PostgreSQL version 12.0 detected. Query-based lineage requires PostgreSQL 13+
due to column name changes in pg_stat_statements (total_time -> total_exec_time).
Please upgrade to PostgreSQL 13 or later.
Solution:
Upgrade your PostgreSQL installation to version 13 or later. PostgreSQL 13 was released in September 2020 and introduced breaking changes to the pg_stat_statements view column names.
Check your current version:
SELECT version();
-- Or
SHOW server_version;
Upgrade path:
pg_upgrade for in-place upgradesError message:
ERROR - pg_stat_statements extension not installed. Install with: CREATE EXTENSION pg_stat_statements;
Solution:
shared_preload_libraries includes pg_stat_statements in postgresql.confCREATE EXTENSION pg_stat_statements;Possible causes:
No queries in pg_stat_statements
Queries excluded by filters
min_query_calls - lower this value to include less-frequent queriesquery_exclude_patterns - ensure you're not excluding too broadlyEmpty pg_stat_statements
-- Check if queries are being tracked
SELECT COUNT(*) FROM pg_stat_statements;
-- If 0, reset and run some test queries
SELECT pg_stat_statements_reset();
-- Run sample queries
SELECT * FROM your_table LIMIT 10;
-- Verify queries were tracked
SELECT COUNT(*) FROM pg_stat_statements;
By default, PostgreSQL truncates query text to 1024 characters. Increase this limit in postgresql.conf:
# Increase max query length tracked (requires restart)
track_activity_query_size = 4096