The dlt module ingests pipeline metadata from dlt (data load tool) into DataHub. It reads dlt's local state directory, so no live connection to dlt or the destination is required for basic metadata extraction. If the dlt Python package is installed, the connector uses the SDK for richer metadata; otherwise it falls back to parsing the YAML state files itself.
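The connector covers pipelines (identified by pipeline_name), their tables (for example orders__items), and, as an opt-in, run history from _dlt_loads.

dlt writes pipeline state to a local directory after each pipeline.run() call: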
~/.dlt/pipelines/
  <pipeline_name>/
    schemas/
      <schema_name>.schema.yaml   # Table definitions with columns and types
    state.json                    # Destination type, dataset name, pipeline state
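The layout above can be walked with the standard library alone. The following is a minimal sketch, not the connector's actual implementation: the function name `discover_pipelines` and the shape of its return value are hypothetical, and it only lists schema files and loads state.json without interpreting their contents.

```python
import json
from pathlib import Path


def discover_pipelines(pipelines_dir):
    """Return {pipeline_name: {"schemas": [file names], "state": dict or None}}.

    Hypothetical helper for illustration; it mirrors the directory layout
    described above and makes no assumption about the keys in state.json.
    """
    root = Path(pipelines_dir).expanduser()
    found = {}
    if not root.is_dir():
        return found

    for pipeline_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Schema files live under <pipeline_name>/schemas/.
        schemas_dir = pipeline_dir / "schemas"
        schema_files = []
        if schemas_dir.is_dir():
            schema_files = sorted(f.name for f in schemas_dir.glob("*.schema.yaml"))

        # state.json sits next to the schemas/ directory.
        state_path = pipeline_dir / "state.json"
        state = None
        if state_path.is_file():
            state = json.loads(state_path.read_text(encoding="utf-8"))

        found[pipeline_dir.name] = {"schemas": schema_files, "state": state}
    return found
```

Calling `discover_pipelines("~/.dlt/pipelines")` on a machine where dlt has run should return one entry per pipeline directory; an empty result usually means the path is wrong or not mounted.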
pipelines_dir must be accessible from where DataHub ingestion runs.

dlt's default location works out of the box:
pipelines_dir: "~/.dlt/pipelines"
dlt runs in one job and DataHub ingestion runs in another. Both must use the same path or shared storage:
pipelines_dir: "/data/dlt-pipelines"
Many dlt users already set a PIPELINES_DIR environment variable:
pipelines_dir: "${PIPELINES_DIR:-~/.dlt/pipelines}"
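The `${PIPELINES_DIR:-~/.dlt/pipelines}` form follows shell-style defaulting: use the variable when it is set and non-empty, otherwise use the fallback. A rough Python equivalent is sketched below to make that rule concrete. The helper name `resolve_pipelines_dir` is hypothetical, and the tilde expansion shown here is an assumption about what a consumer of the value would do, not a statement about how the connector resolves it.

```python
import os
from pathlib import Path

DEFAULT_PIPELINES_DIR = "~/.dlt/pipelines"


def resolve_pipelines_dir(env=None):
    """Mimic ${PIPELINES_DIR:-~/.dlt/pipelines} and expand a leading ~."""
    env = os.environ if env is None else env
    # ":-" falls back when the variable is unset OR set to an empty string.
    raw = env.get("PIPELINES_DIR") or DEFAULT_PIPELINES_DIR
    return Path(raw).expanduser()
```

Passing a plain dict as `env` makes it easy to check both branches without touching the real environment.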
dlt runs in one pod and DataHub in another. Mount the same PersistentVolumeClaim in both pods:
pipelines_dir: "/mnt/dlt-pipelines"
The connector reads local files only — no network permissions are needed for basic metadata extraction.
| Feature | Requirement |
|---|---|
| Pipeline metadata (DataFlow, DataJob, lineage) | Filesystem read access to pipelines_dir |
| Run history (_dlt_loads) | dlt package installed + destination credentials in ~/.dlt/secrets.toml |
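Before running ingestion, the two rows of the table can be checked with a small preflight script. This is a sketch under stated assumptions: `check_requirements` is a hypothetical helper, and it only confirms that a secrets file exists, not that it contains valid credentials for the destination.

```python
import importlib.util
import os
from pathlib import Path


def check_requirements(pipelines_dir, secrets_path="~/.dlt/secrets.toml"):
    """Report which features from the table above look usable on this machine."""
    root = Path(pipelines_dir).expanduser()

    # Pipeline metadata: the directory must exist and be readable/traversable.
    can_read = root.is_dir() and os.access(root, os.R_OK | os.X_OK)

    # Run history: the dlt package must be importable...
    dlt_installed = importlib.util.find_spec("dlt") is not None
    # ...and a secrets file must be present (its contents are not validated).
    has_secrets = Path(secrets_path).expanduser().is_file()

    return {
        "pipeline_metadata": can_read,
        "run_history": dlt_installed and has_secrets,
    }
```

A result of `{"pipeline_metadata": True, "run_history": False}` would suggest that basic extraction can proceed while run history needs either the dlt package or the secrets file.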