docs/lineage/sql_parsing.md
Many data platforms are built on top of SQL, which means deeply understanding SQL queries is critical for understanding column-level lineage, usage, and more.
DataHub's SQL parser is built on top of sqlglot and adds a number of additional features to improve the accuracy of SQL parsing.
In our benchmarks, the DataHub SQL parser generates lineage with 97-99% accuracy and outperforms other SQL parsers by a wide margin.
We've published a blog post on some of the technical details of the parser: Extracting Column Lineage from SQL Queries.
If you're using a tool that DataHub already integrates with, check the documentation for that specific integration. Most of our integrations, including Snowflake, BigQuery, Redshift, dbt, Looker, PowerBI, and Airflow, use the SQL parser to generate column-level lineage and usage statistics.
If you're using a different database system for which we don't support column-level lineage out of the box, but you do have a database query log available, the SQL queries connector can generate column-level lineage and table/column usage statistics from the query log.
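As a rough sketch, a recipe for the SQL queries connector might look like the following. This is an illustrative fragment, not a complete recipe: the platform name and file path are placeholders, and the exact config fields are documented in the connector's own docs.

```yaml
# Hypothetical recipe sketch for the SQL queries connector.
# Field names follow the connector docs; values are placeholders.
source:
  type: sql-queries
  config:
    platform: snowflake          # the system the logged queries ran against
    query_file: ./queries.json   # query log file to parse
```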
Our SDK provides a DataHubGraph.parse_sql_lineage() method for programmatically parsing SQL queries.
The resulting object contains a sql_parsing_result.debug_info.confidence_score field, which is a 0-1 value indicating the confidence of the parser.
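A minimal usage sketch, assuming a running DataHub instance: the server URL, platform, and query below are placeholders, and the exact shape of the result object is described in the SDK reference.

```python
# Sketch of programmatic SQL parsing via the DataHub SDK.
# Server URL, platform, and query are placeholder values.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

result = graph.parse_sql_lineage(
    "SELECT id, name FROM my_db.my_schema.users",
    platform="snowflake",
)

print(result.in_tables)                    # upstream table URNs
print(result.debug_info.confidence_score)  # parser confidence, 0-1
```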
There are also a number of utilities in the datahub.sql_parsing module. The SqlParsingAggregator is particularly useful, as it can also resolve lineage across temp tables and table renames/swaps.
Note that these utilities are not officially part of the DataHub SDK and hence do not have the same level of stability and support as the rest of the SDK.
The parser supports:

- Table-level lineage for `SELECT`, `CREATE`, `INSERT`, `UPDATE`, `DELETE`, and `MERGE` statements.
- Column-level lineage for `SELECT` (including `SELECT INTO`), `CREATE VIEW`, `CREATE TABLE AS SELECT` (CTAS), `INSERT`, and `UPDATE` statements.
- `UNION ALL` constructs - lineage is merged across the clauses of the `UNION`.
- `SELECT *` and similar expressions, which are automatically expanded using the table schemas registered in DataHub. This includes support for platform instances.
- Case-insensitive systems - this requires that `convert_urns_to_lowercase` is enabled when the corresponding table schemas were ingested into DataHub.
- Sharded tables in BigQuery - e.g. `proj.dataset.table_20230616` will be normalized to `proj.dataset.table_yyyymmdd`. This matches the behavior of our BigQuery ingestion connector, and hence will result in lineage linking up correctly.

Known limitations:

- UDFs - we generate lineage pointing at the columns that are inputs to the UDF, but cannot understand the UDF itself.
- `json_extract` and similar functions.
- `UNNEST` - we do a best-effort job, but cannot reliably generate column-level lineage in the presence of `UNNEST` constructs.
- Star expansion over conditional expressions, e.g. `SELECT IF (main.id is not null, main, extras).* FROM my_schema.main_users main FULL JOIN my_schema.external_users extras USING (id)` in BigQuery.
- Column renaming in `INSERT INTO` statements, e.g. `INSERT INTO (col1_new, col2_new) SELECT col1_old, col2_old FROM ...`. We only support `INSERT INTO` statements that either (1) don't specify a column list, or (2) specify a column list that matches the columns in the `SELECT` clause.
- `MERGE INTO` statements - we don't generate column-level lineage for these.
- In BigQuery, referencing the `_partitiontime` and `_partitiondate` pseudo-columns with a table name prefix, e.g. `my_table._partitiontime`, fails. However, unqualified references like `_partitiontime` and `_partitiondate` will be fine.
- Columns referenced only in `WHERE`, `GROUP BY`, `ORDER BY`, `JOIN`, `HAVING`, or `PARTITION BY` clauses are not considered part of lineage. For example, `SELECT col1, col2 FROM upstream_table WHERE col3 = 3` will not generate any lineage related to `col3`.
- Queries like `SELECT * FROM identifier('my_db.my_schema.my_table')` (e.g. in Snowflake), since the `identifier` function is resolved at SQL runtime.
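The sharded-table normalization mentioned above (e.g. `proj.dataset.table_20230616` becoming `proj.dataset.table_yyyymmdd`) can be sketched with a regex. This is a simplified illustration of the idea, not DataHub's actual implementation, which handles more shard-naming patterns.

```python
import re

# Simplified sketch: collapse a trailing 8-digit date shard suffix onto a
# single logical table name. Not DataHub's actual implementation.
_SHARD_SUFFIX = re.compile(r"^(?P<base>.+_)(?P<shard>\d{8})$")

def normalize_sharded_table(table_name: str) -> str:
    """Replace a yyyymmdd shard suffix with a literal placeholder."""
    match = _SHARD_SUFFIX.match(table_name)
    if not match:
        return table_name  # not a sharded table; leave unchanged
    return match.group("base") + "yyyymmdd"
```

With this sketch, `normalize_sharded_table("proj.dataset.table_20230616")` yields `"proj.dataset.table_yyyymmdd"`, so lineage from every daily shard links up to one logical table.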