The informatica module ingests metadata from Informatica Cloud (IDMC) into DataHub. It extracts projects, folders, Mapping Tasks, and Taskflows, and resolves table-level lineage from the Mapping each Task references. Standalone Mappings (ones without a Mapping Task) and Mapplets are not emitted.

:::tip Quick Start

Use `informatica_recipe.yml` as a template, then run `datahub ingest -c informatica_recipe.yml`.

:::


- Mapping Tasks as DataFlows with one `transform` DataJob each; Taskflows as DataFlows with one `orchestrate` DataJob that chains the MTs in step order
- The `orchestrate` DataJob anchors the end of the chain
- `createdBy`/`updatedBy`

| IDMC concept | DataHub entity | Subtype |
|---|---|---|
| Project | Container | Project |
| Folder | Container | Folder |
| Taskflow | DataFlow and one orchestrate DataJob | Taskflow / Taskflow Orchestration |
| Mapping Task | DataFlow and one transform DataJob | Mapping Task / Task Logic |
| Mapping | not emitted — see notes | — |
| Mapplet | not emitted — see notes | — |
| Source/target | Dataset (upstream/downstream lineage) | — |

Mapping Tasks are the runnable, schedulable units in IDMC, so the connector emits
them as first-class entities. Each MT's inner `transform` DataJob carries the
`dataJobInputOutput` aspect with the source/target tables resolved from the
Mapping it references, so cross-source lineage lands on the thing users
actually schedule and operate.

Mappings without a Mapping Task are not emitted (they're not runnable on
their own). Mapplets are not emitted either, because they're internal sub-mappings
included in other mappings. The referenced Mapping's friendly name, v2 id,
and v3 GUID are still surfaced as `customProperties.mappingName` /
`mappingId` / `mappingV3Id` on every MT so you can cross-reference back to
IDMC without leaving DataHub.

The Taskflow step order is resolved from the v3 Export API (`.TASKFLOW.xml`),
parsed from the IDMC `taskflowModel` `<eventContainer>` / `<service>` /
`<link>` graph. All Taskflow GUIDs for a single ingestion run are submitted
as one export job for efficiency.
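
As a rough illustration of resolving step order from a link graph, the sketch below parses a simplified, hypothetical XML document and sorts its nodes topologically. The element and attribute names used here (`service` with `id`, `link` with `from` and `to`) are assumptions made for the example and are not the real `.TASKFLOW.xml` schema; the connector's actual parser may work differently.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical, heavily simplified stand-in for an exported taskflow model.
SAMPLE = """
<taskflowModel>
  <eventContainer id="start"/>
  <service id="mt_load_orders"/>
  <service id="mt_load_customers"/>
  <link from="start" to="mt_load_customers"/>
  <link from="mt_load_customers" to="mt_load_orders"/>
</taskflowModel>
"""


def resolve_step_order(xml_text: str) -> list[str]:
    """Return service ids in execution order, following the link edges."""
    root = ET.fromstring(xml_text)
    service_ids = {el.get("id") for el in root.iter("service")}

    sorter = TopologicalSorter()
    for link in root.iter("link"):
        # The `to` node depends on the `from` node, so `from` sorts first.
        sorter.add(link.get("to"), link.get("from"))

    # Keep only service steps; the start event is not a runnable step.
    return [node for node in sorter.static_order() if node in service_ids]
```

Calling `resolve_step_order(SAMPLE)` yields the two service ids with `mt_load_customers` before `mt_load_orders`, regardless of the order in which the `<service>` elements appear in the document.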

Rather than emitting a separate DataJob per step, the connector collapses
step references into the MT they run and chains the MT `transform` DataJobs
directly via `dataJobInputOutput.inputDatajobs`. A single `orchestrate`
DataJob is emitted per Taskflow and anchored at the end of the chain:
`inputDatajobs = [last MT]`, and `outputDatasets` mirrors the last MT's outputs.
The resulting Taskflow lineage reads cleanly end to end:

`input_dataset → MT1.transform → MT2.transform → … → MTn.transform → orchestrate → output_dataset`
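
The chaining described above can be sketched with plain dictionaries. This is a minimal illustration, not the connector's code: the function name `chain_taskflow` and the input keys `name`, `inputs`, and `outputs` are hypothetical, and real emission would use DataJob URNs rather than bare names.

```python
def chain_taskflow(mt_jobs: list[dict]) -> list[dict]:
    """Chain MT transform jobs in step order and append one orchestrate job.

    Each input dict is assumed (hypothetically) to have `name`, `inputs`,
    and `outputs` keys, already sorted in Taskflow step order.
    """
    if not mt_jobs:
        return []

    chained = []
    previous = None
    for job in mt_jobs:
        chained.append(
            {
                "name": job["name"],
                "inputDatasets": list(job["inputs"]),
                "outputDatasets": list(job["outputs"]),
                # Each MT depends on the MT that ran before it.
                "inputDatajobs": [previous] if previous else [],
            }
        )
        previous = job["name"]

    last = chained[-1]
    chained.append(
        {
            "name": "orchestrate",
            "inputDatasets": [],
            # The orchestrate job mirrors the last MT's outputs.
            "outputDatasets": list(last["outputDatasets"]),
            "inputDatajobs": [last["name"]],
        }
    )
    return chained
```

For two Mapping Tasks, the result has three entries: the first MT with no upstream job, the second MT pointing at the first, and the `orchestrate` entry pointing at the second MT and repeating its output datasets.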

Non-data steps (command / decision / notification / …) don't participate in
the chain but are summarized in `customProperties.stepSummary` on the
`orchestrate` DataJob for auditing.

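
One way such a summary could be built is by counting non-data step types. The sketch below is only an assumption about the shape of the value: the step type label `mapping_task` and the `kind=count` string format are hypothetical, and the connector's actual `stepSummary` format may differ.

```python
from collections import Counter

# Hypothetical label for steps that run a Mapping Task (data steps).
DATA_STEP_TYPES = {"mapping_task"}


def summarize_non_data_steps(step_types: list[str]) -> str:
    """Count steps that are not data steps and render them as one string."""
    counts = Counter(t for t in step_types if t not in DATA_STEP_TYPES)
    # Sort by step type so the summary is stable across runs.
    return ", ".join(f"{kind}={n}" for kind, n in sorted(counts.items()))
```

A stable, sorted string keeps the property value from changing between ingestion runs when the Taskflow itself has not changed.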
| Capability | IDMC privilege | Notes |
|---|---|---|
| Authenticate | Any active IDMC user | Uses the v2 login endpoint |
| List projects, folders, taskflows | Asset - read (or the Observer role) | Needed for all container/flow emission |
| List mappings / mapping tasks | Asset - read | Mapping Tasks are optional and are skipped with a warning if the API returns 403 |
| Extract table-level lineage | Asset - export | Submits v3 export jobs; skip by setting `extract_lineage: false` |
| List connections | Connection - read | Needed for lineage to resolve to dataset URNs |

Set `login_url` to your IDMC pod's regional URL, not the API runtime URL; the connector discovers the runtime URL from the login response:

| Region | login_url |
|---|---|
| US | https://dm-us.informaticacloud.com |
| US2 | https://dm2-us.informaticacloud.com |
| EMEA | https://dm-em.informaticacloud.com |
| APAC | https://dm-ap.informaticacloud.com |
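
If you generate recipes programmatically, the region table can be captured as a small lookup. This is a convenience sketch only; the helper name `login_url_for` is hypothetical and not part of the connector, and the URLs are copied from the table above.

```python
# Region to login URL, copied from the table above.
LOGIN_URLS = {
    "US": "https://dm-us.informaticacloud.com",
    "US2": "https://dm2-us.informaticacloud.com",
    "EMEA": "https://dm-em.informaticacloud.com",
    "APAC": "https://dm-ap.informaticacloud.com",
}


def login_url_for(region: str) -> str:
    """Return the login_url for a region name, case-insensitively."""
    key = region.strip().upper()
    try:
        return LOGIN_URLS[key]
    except KeyError:
        known = ", ".join(sorted(LOGIN_URLS))
        raise ValueError(f"Unknown region {region!r}; expected one of: {known}") from None
```

Failing loudly on an unknown region is deliberate here: a wrong `login_url` would otherwise surface later as an authentication error that is harder to diagnose.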