metadata-ingestion/docs/sources/mssql/mssql_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
Extracts lineage and usage statistics by analyzing SQL queries:
Enable query-based lineage in your DataHub recipe:
source:
type: mssql
config:
host_port: localhost:1433
database: YourDatabase
username: datahub_user
password: your_password
# Enable query-based lineage extraction
include_query_lineage: true
# Maximum number of queries to extract (default: 1000, max: 10000)
max_queries_to_extract: 1000
# Minimum query execution count to include (default: 1)
# Higher values reduce noise from rarely-executed queries
min_query_calls: 5
# Exclude system and temporary queries (comprehensive list)
query_exclude_patterns:
- "%sys.%" # System tables
- "%tempdb.%" # Temp database
- "%INFORMATION_SCHEMA%" # Metadata views
- "%msdb.%" # SQL Agent database
- "%master.%" # Master database
- "%model.%" # Model database
- "%#%" # Temporary tables
- "%sp_reset_connection%" # JDBC connection resets
- "%SET ANSI_%" # Connection setup
- "%SELECT @@%" # Driver metadata queries
- "%ReportServer%" # SSRS system tables
# Enable usage statistics (requires graph connection)
# include_usage_statistics: true
sink:
# Your sink config
| Option | Type | Default | Description |
|---|---|---|---|
include_query_lineage | boolean | false | Enable query-based lineage extraction |
max_queries_to_extract | integer | 1000 | Maximum queries to analyze (range: 1-10000) |
min_query_calls | integer | 1 | Minimum execution count to include query |
query_exclude_patterns | list[string] | [] | SQL LIKE patterns to exclude queries (max 100 patterns) |
include_usage_statistics | boolean | false | Extract usage statistics (requires include_query_lineage: true) |
DataHub automatically selects the best available method:
Query Store (Preferred) - SQL Server 2016+
DMV Fallback - All supported versions
sys.dm_exec_cached_plans and related DMVsThe source will automatically detect and use the appropriate method based on your SQL Server version and configuration.
max_queries_to_extract: 1000 and min_query_calls: 5Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Enable debug logging to see detailed information about query extraction and parsing:
datahub ingest -c recipe.yml --debug
Look for these key log messages:
INFO: Extracted X queries from query_store - Confirms queries were fetchedINFO: Processed X queries for lineage extraction (X failed) - Shows parsing resultsINFO: Generated X lineage workunits from queries - Confirms lineage generationDEBUG: Query extraction completed in X.XX seconds - Performance metricsSolution: Follow the Query Store setup instructions above (see "Enable Query Store" section).
Solution: Follow the permission grant instructions above (see "Permission Requirements" section).
Solution:
Possible causes:
No queries in history:
All queries filtered by exclude patterns:
query_exclude_patterns configurationQueries below min_query_calls threshold:
min_query_calls are excludedmin_query_calls value or execute queries more frequentlyQuery Store query capture mode:
NONE or CUSTOM with restrictive filtersAUTO or ALL:
ALTER DATABASE [YourDatabase]
SET QUERY_STORE (QUERY_CAPTURE_MODE = AUTO);
Query Store data not yet captured:
SELECT COUNT(*) AS query_count
FROM sys.query_store_query;
-- Should return > 0
Symptoms:
Possible causes:
SQL parsing errors:
SqlUnderstandingError or UnsupportedStatementTypeErrorTables filtered by patterns:
table_pattern deny listtable_pattern and schema_pattern configurationTables not yet ingested:
Database name mismatch:
database: null to ingest all databases, or adjust database configDebug steps:
# Run with debug logging
datahub ingest -c recipe.yml --debug 2>&1 | grep -i "lineage\|parsing\|query"
# Look for these patterns:
# - "Unable to parse query" - indicates SQL parsing issues
# - "Table X not found" - indicates missing base tables
# - "Filtered table" - indicates pattern matching issues
Error message:
Login failed for user 'datahub_user'
Cannot open database "YourDatabase" requested by the login
Solutions:
Login failure:
-- Verify user exists and has correct password
SELECT name, type_desc, is_disabled
FROM sys.server_principals
WHERE name = 'datahub_user';
Database access denied:
-- Grant database access
USE [YourDatabase];
CREATE USER [datahub_user] FOR LOGIN [datahub_user];
GRANT CONNECT TO [datahub_user];
Connection timeout:
source:
config:
uri_args:
connect_timeout: 30
timeout: 30
Symptoms:
#temp_table as actual tablesSolution:
MSSQL temp tables (starting with #) are automatically filtered by default. If you see temp tables in lineage:
Verify temp table patterns are configured:
source:
config:
temporary_tables_pattern:
- ".*#.*" # Built-in default pattern
Add custom temp table patterns:
source:
config:
temporary_tables_pattern:
- ".*#.*" # Standard SQL Server temp tables
- ".*\.temp_.*" # Custom naming pattern
- ".*\.staging_.*" # ETL staging tables
Symptoms:
max_queries_to_extract queries processedUnderstanding the Problem:
When you hit the limit, only the top N queries by execution time are extracted. This means:
Performance Tuning Solutions:
Reduce query limit (if experiencing slowness):
source:
config:
max_queries_to_extract: 500 # Reduced from default 1000
Increase query limit (if you need more coverage and can afford the time):
source:
config:
max_queries_to_extract: 5000 # Increased from default 1000
Filter out noise with exclude patterns (see comprehensive list in Configuration section above)
Focus on frequently-executed queries:
source:
config:
min_query_calls: 10 # Only extract queries executed 10+ times
Check Query Store performance:
-- Check Query Store size and query count
SELECT
actual_state_desc,
readonly_reason,
current_storage_size_mb,
max_storage_size_mb,
query_count = (SELECT COUNT(*) FROM sys.query_store_query)
FROM sys.database_query_store_options;
Recommendations:
max_queries_to_extract: 1000 (default)Symptoms:
Cause: DMVs only contain queries currently in the plan cache. The cache clears on:
Solutions:
Enable Query Store (recommended) - See "Enable Query Store" section above
Execute representative queries before ingestion:
Check plan cache size:
SELECT
size_in_bytes/1024/1024 AS cache_size_mb,
name,
type
FROM sys.dm_os_memory_clerks
WHERE type = 'CACHESTORE_SQLCP';
Error messages:
ValidationError: max_queries_to_extract must be positive
ValidationError: query_exclude_patterns cannot exceed 100 patterns
Valid ranges:
max_queries_to_extract: 1 to 10000min_query_calls: 0 to any positive integerquery_exclude_patterns: Maximum 100 patterns, each up to 500 charactersError message:
RuntimeError: Failed to initialize SQL aggregator for query-based lineage,
but include_query_lineage: true was explicitly enabled
Possible causes:
include_usage_statistics: trueSolution:
include_usage_statistics, ensure your sink is configured with a DataHub GMS connectionFollow these steps to confirm everything is configured correctly:
Verify permissions:
-- Should return at least one row with VIEW SERVER STATE
SELECT * FROM fn_my_permissions(NULL, 'SERVER')
WHERE permission_name = 'VIEW SERVER STATE';
Verify Query Store has data (see "Verify Query Store Status" section above)
Test Query Store access:
-- Test Query Store access
SELECT TOP 1 query_id, query_sql_text
FROM sys.query_store_query_text;
-- Test DMV access
SELECT TOP 1 sql_handle
FROM sys.dm_exec_query_stats;
Run ingestion with debug logging:
datahub ingest -c recipe.yml --debug 2>&1 | tee ingestion.log
Check for success indicators in logs:
grep -i "extracted.*queries" ingestion.log
grep -i "processed.*queries" ingestion.log
grep -i "generated.*lineage workunits" ingestion.log
Verify in DataHub UI:
If you've tried the above steps and still experiencing issues:
Collect diagnostic information:
# Run ingestion with debug logging
datahub ingest -c recipe.yml --debug > ingestion.log 2>&1
# Check SQL Server version
# Check Query Store status
# Check user permissions
Check DataHub GitHub issues:
Ask for help:
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
If you encounter permission errors:
The DataHub source will automatically handle fallback between methods and provide detailed error messages with specific permission requirements if issues occur.