metadata-ingestion/docs/sources/glue/README.md
Glue is a data platform used to store and query analytical or operational data. Learn more in the official Glue documentation.
The DataHub integration for Glue covers core metadata entities such as datasets/tables/views, schema fields, and containers. It also captures table- and column-level lineage and supports stateful deletion detection.
:::tip
If you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See here for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.
:::
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| "glue" | Data Platform | |
| Glue Database | Container | Subtype Database |
| Glue Table | Dataset | Subtype Table |
| Glue Job | Data Flow | |
| Glue Job Transform | Data Job | |
| Glue Job Data source | Dataset | |
| Glue Job Data sink | Dataset | |
To capture lineage across Glue jobs and databases, one requirement must be met; otherwise the AWS API cannot report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option is available in the "Script" tab). Custom scripts that lack the proper annotations will not have lineage reported.
DataHub extracts upstream lineage for Glue job nodes that read from JDBC databases. Two node styles are supported:
Glue Studio's visual editor stores connection references as connection_options.connectionName. DataHub calls the GetConnection API to resolve the connection and determine the platform and database.
Supported connection types:
| Glue ConnectionType | DataHub Platform |
|---|---|
| JDBC | Parsed from JDBC URL (see below) |
| POSTGRESQL | postgres |
| MYSQL | mysql |
| REDSHIFT | redshift |
| ORACLE | oracle |
| SQLSERVER | mssql |
The table is read from connection_options.dbtable. If dbtable is absent, DataHub falls back to parsing connection_options.query (see SQL Query Lineage below).
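As an illustration of the mapping above, here is a minimal sketch of how a Glue connection type could be turned into a DataHub platform name. The function name `resolve_platform` and both dictionaries are hypothetical and only mirror the tables in this document; this is not DataHub's actual implementation, and it assumes the JDBC URL has already been obtained from the connection.

```python
from typing import Optional

# Native Glue connection types, copied from the table above.
CONNECTION_TYPE_TO_PLATFORM = {
    "POSTGRESQL": "postgres",
    "MYSQL": "mysql",
    "REDSHIFT": "redshift",
    "ORACLE": "oracle",
    "SQLSERVER": "mssql",
}

# JDBC protocols, copied from the script-style table in this document.
JDBC_PROTOCOL_TO_PLATFORM = {
    "postgresql": "postgres",
    "mysql": "mysql",
    "mariadb": "mysql",
    "redshift": "redshift",
    "oracle": "oracle",
    "sqlserver": "mssql",
}


def resolve_platform(connection_type: str, jdbc_url: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: map a Glue ConnectionType to a platform name.

    For the generic JDBC type, the platform is taken from the protocol
    segment of the URL (the part right after "jdbc:").
    Returns None when the platform cannot be determined.
    """
    ctype = connection_type.upper()
    if ctype == "JDBC":
        if not jdbc_url or not jdbc_url.startswith("jdbc:"):
            return None
        protocol = jdbc_url[len("jdbc:"):].split(":", 1)[0].lower()
        return JDBC_PROTOCOL_TO_PLATFORM.get(protocol)
    return CONNECTION_TYPE_TO_PLATFORM.get(ctype)
```

For example, `resolve_platform("JDBC", "jdbc:mysql://myhost:3306/shop")` yields `"mysql"`, while an unlisted connection type yields `None`.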
Script-style nodes set connection_type to the database protocol and pass the JDBC URL inline via connection_options.url. Supported protocols:
| connection_type | DataHub Platform | Default schema |
|---|---|---|
| postgresql | postgres | public |
| mysql | mysql | — |
| mariadb | mysql | — |
| redshift | redshift | public |
| oracle | oracle | — |
| sqlserver | mssql | dbo |
Example job script args that DataHub can parse:
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://myhost:5432/mydb",
        "dbtable": "public.orders",
        # or: "query": "SELECT * FROM public.orders WHERE region = 'US'"
    },
)
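To make the script-style case concrete, the sketch below shows one way the platform and database could be derived from arguments like those in the example. The helper `parse_script_node` is hypothetical, not DataHub's code, and it assumes the database name is the path component of the JDBC URL. URLs that carry the database elsewhere (for example in a `;databaseName=` property or an Oracle thin-style URL) are deliberately left unresolved here.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

# Protocol -> platform, copied from the table above.
PROTOCOL_TO_PLATFORM = {
    "postgresql": "postgres",
    "mysql": "mysql",
    "mariadb": "mysql",
    "redshift": "redshift",
    "oracle": "oracle",
    "sqlserver": "mssql",
}


def parse_script_node(
    connection_type: str, connection_options: Dict[str, str]
) -> Optional[Tuple[str, Optional[str]]]:
    """Hypothetical helper: return (platform, database) for a script-style node.

    Returns None if connection_type is not a supported protocol.
    The database is None when it cannot be read from the URL path.
    """
    platform = PROTOCOL_TO_PLATFORM.get(connection_type.lower())
    if platform is None:
        return None

    database = None
    url = connection_options.get("url", "")
    if url.startswith("jdbc:"):
        # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
        parsed = urlparse(url[len("jdbc:"):])
        # Only trust the path when the URL has a host part ("//host/...").
        if parsed.netloc:
            database = parsed.path.lstrip("/") or None
    return platform, database
```

With the example arguments above, this sketch returns `("postgres", "mydb")`.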
Given a dbtable value and the resolved (platform, database):
- dbtable = "schema.table" → database.schema.table
- dbtable = "table" (no schema) → database.<default_schema>.table if the platform has a default schema, otherwise database.table

When dbtable is absent and connection_options.query is set, DataHub uses sqlglot to extract table references from the SQL string.
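For the dbtable case, the naming rules can be sketched as a small function. The name `dataset_name` and the `DEFAULT_SCHEMA` dictionary are hypothetical; the defaults are taken from the "Default schema" column in this document, and the sketch only covers the two dbtable shapes described here.

```python
from typing import Dict

# Default schemas per platform, from the table in this document.
# Platforms without a default schema are simply absent.
DEFAULT_SCHEMA: Dict[str, str] = {
    "postgres": "public",
    "redshift": "public",
    "mssql": "dbo",
}


def dataset_name(platform: str, database: str, dbtable: str) -> str:
    """Hypothetical helper: build the dataset name for a dbtable value."""
    if "." in dbtable:
        # "schema.table" is used as given, prefixed with the database.
        return f"{database}.{dbtable}"
    default_schema = DEFAULT_SCHEMA.get(platform)
    if default_schema:
        return f"{database}.{default_schema}.{dbtable}"
    # No default schema for this platform: two-part name.
    return f"{database}.{dbtable}"
```

For instance, `dataset_name("postgres", "mydb", "orders")` gives `mydb.public.orders`, whereas `dataset_name("mysql", "shop", "orders")` gives `shop.orders`.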
Supported: single-table queries, JOINs, CTEs, and subqueries. All referenced tables are emitted as upstream datasets.
-- All three tables become upstream lineage inputs
SELECT o.id, c.name, p.price
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
Not supported: Queries that fail to parse, or queries with no table references (e.g. SELECT 1). These produce a warning and the node is skipped.
Note: Query-based lineage reflects the tables referenced in the SQL at ingestion time. Dynamic SQL, parameterized queries, or queries built at runtime cannot be statically analyzed.