README

metadata-ingestion/docs/sources/flink/README.md

Overview

Apache Flink is a distributed stream and batch processing framework. Learn more in the official Flink documentation.

The DataHub integration for Flink extracts job metadata, operator topology, and dataset lineage by connecting to the Flink JobManager REST API and optionally the SQL Gateway. It resolves table references to their actual platforms (Kafka, Postgres, Iceberg, etc.) via catalog introspection, and tracks job execution history as DataProcessInstances. Stateful ingestion is supported for stale entity removal.
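As a rough illustration of the first step, the sketch below reads the job list from the JobManager REST API. It is not the connector's actual code. `/jobs/overview` and the `jid`, `name`, and `state` fields are part of Flink's REST API; the helper names `fetch_job_overview` and `summarize_jobs` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_job_overview(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/jobs/overview from the JobManager and parse the JSON body."""
    url = f"{base_url.rstrip('/')}/jobs/overview"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_jobs(overview: dict) -> list:
    """Keep only the fields needed to identify each job (hypothetical helper)."""
    return [
        {"id": job["jid"], "name": job["name"], "state": job["state"]}
        for job in overview.get("jobs", [])
    ]
```

Calling `summarize_jobs(fetch_job_overview("http://localhost:8081"))` would list the jobs on a JobManager at that address; the host and port here are an assumption about a local setup.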

Concept Mapping

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Flink Job | DataFlow | One DataFlow per Flink job |
| Flink Operator | DataJob | Granularity depends on `operator_granularity` |
| Job Execution | DataProcessInstance | When `include_run_history` is enabled |
| Kafka Topic | Dataset | Resolved via lineage (DataStream or SQL/Table API) |
| JDBC Table | Dataset | Resolved via SQL Gateway catalog introspection |
| Iceberg Table | Dataset | Resolved via SQL Gateway or `catalog_platform_map` config |
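
To make the job and operator rows concrete, here is a minimal sketch of how a job detail response (`GET /jobs/<jid>`, which lists the job's `vertices`) could be turned into DataHub URN strings. The URN shapes follow DataHub's `dataFlow` and `dataJob` conventions. The `flink` platform id, the use of the job name as flow id, the use of the vertex id as job id, and the `PROD` cluster are assumptions for illustration; the connector may choose different identifiers.

```python
def flow_urn(job_name: str, cluster: str = "PROD") -> str:
    """Build a DataFlow URN; platform id and flow id choice are assumed."""
    return f"urn:li:dataFlow:(flink,{job_name},{cluster})"


def job_urns(job_detail: dict, cluster: str = "PROD") -> list:
    """Build one DataJob URN per vertex (operator) in a job detail payload."""
    flow = flow_urn(job_detail["name"], cluster)
    return [
        f"urn:li:dataJob:({flow},{vertex['id']})"
        for vertex in job_detail.get("vertices", [])
    ]
```

With a payload such as `{"name": "orders-enrichment", "vertices": [{"id": "v1"}]}`, `job_urns` yields a single DataJob URN nested under the job's DataFlow URN, mirroring the one-DataFlow-per-job mapping in the table.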