README - Datahub — ContextQMD

Overview

Apache Flink is a distributed stream and batch processing framework. Learn more in the official Flink documentation.

The DataHub integration for Flink extracts job metadata, operator topology, and dataset lineage by connecting to the Flink JobManager REST API and, optionally, the Flink SQL Gateway. It resolves table references to their actual platforms (Kafka, Postgres, Iceberg, etc.) via catalog introspection, and tracks job execution history as DataProcessInstances. Stateful ingestion is supported, so entities that no longer exist in Flink can be removed as stale.
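To make the REST-based extraction concrete, here is a minimal sketch of reading operator topology from the JobManager. It assumes the standard Flink REST endpoints `/jobs/overview` and `/jobs/<jobid>/plan`; the helper names (`fetch_json`, `operator_edges`) and the base URL are hypothetical, and this is not necessarily how the integration itself issues requests.

```python
import json
from urllib.request import urlopen


def fetch_json(base_url, path, timeout=10.0):
    """GET a JSON document from the JobManager REST API (hypothetical helper)."""
    with urlopen(f"{base_url.rstrip('/')}{path}", timeout=timeout) as resp:
        return json.load(resp)


def operator_edges(plan_response):
    """Return (upstream_id, downstream_id) pairs from a /jobs/<jobid>/plan response.

    Each node in the plan lists its upstream nodes under "inputs".
    """
    nodes = plan_response.get("plan", {}).get("nodes", [])
    edges = []
    for node in nodes:
        for inp in node.get("inputs", []):
            edges.append((inp["id"], node["id"]))
    return edges


# Illustrative plan payload with two operators (values are made up).
sample_plan = {
    "plan": {
        "jid": "abc123",
        "name": "orders_enrichment",
        "nodes": [
            {"id": "1", "description": "Source: orders"},
            {"id": "2", "description": "Sink: enriched", "inputs": [{"id": "1"}]},
        ],
    }
}
edges = operator_edges(sample_plan)
```

Against a live cluster, one might call `fetch_json("http://localhost:8081", "/jobs/overview")` to list jobs, then fetch each job's plan and pass it to `operator_edges`; the host and port here are assumptions.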

Concept Mapping

Source Concept	DataHub Concept	Notes
Flink Job	DataFlow	One DataFlow per Flink job
Flink Operator	DataJob	Granularity depends on `operator_granularity`
Job Execution	DataProcessInstance	When `include_run_history` is enabled
Kafka Topic	Dataset	Resolved via lineage (DataStream or SQL/Table API)
JDBC Table	Dataset	Resolved via SQL Gateway catalog introspection
Iceberg Table	Dataset	Resolved via SQL Gateway or `catalog_platform_map` config
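
The mapping above can be illustrated with DataHub URN strings. This is a sketch only: the URN layouts follow DataHub's general conventions for DataFlow, DataJob, and Dataset entities, but the identifiers the integration actually emits are not stated here, and the flat `catalog -> platform` shape used for `catalog_platform_map` is an assumed simplification. All function names are hypothetical.

```python
def dataflow_urn(flow_id, cluster="prod"):
    """One DataFlow per Flink job (the flow_id choice is an assumption)."""
    return f"urn:li:dataFlow:(flink,{flow_id},{cluster})"


def datajob_urn(flow_urn, job_id):
    """One DataJob per Flink operator, nested under its DataFlow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def dataset_urn(platform, name, env="PROD"):
    """Dataset URN for a resolved Kafka topic, JDBC table, or Iceberg table."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def resolve_platform(table_ref, catalog_platform_map):
    """Look up the platform for a `catalog.db.table` reference.

    Assumes a flat mapping of catalog name to platform name; returns None
    when the catalog is not listed.
    """
    catalog = table_ref.split(".", 1)[0]
    return catalog_platform_map.get(catalog)


flow = dataflow_urn("orders_enrichment")
job = datajob_urn(flow, "source_kafka_orders")
topic = dataset_urn("kafka", "orders")
platform = resolve_platform("ice_catalog.db.events", {"ice_catalog": "iceberg"})
```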