metadata-ingestion/docs/sources/flink/README.md
Apache Flink is a distributed stream and batch processing framework. Learn more in the official Flink documentation.
The DataHub integration for Flink extracts job metadata, operator topology, and dataset lineage by connecting to the Flink JobManager REST API and optionally the SQL Gateway. It resolves table references to their actual platforms (Kafka, Postgres, Iceberg, etc.) via catalog introspection, and tracks job execution history as DataProcessInstances. Stateful ingestion is supported for stale entity removal.
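
To make the REST-based extraction concrete, here is a minimal sketch of reading job and operator metadata from a JobManager. It is not the connector's actual code. It assumes the standard Flink REST endpoints `/jobs/overview` and `/jobs/<jid>/plan` and their usual field names (`jid`, `name`, `state`, `plan.nodes[].inputs[].id`); verify these against your Flink version. The helper names `summarize_jobs`, `operator_edges`, and `fetch_json` are hypothetical.

```python
from __future__ import annotations

import json
from urllib.request import urlopen


def summarize_jobs(overview: dict) -> list[dict]:
    """Reduce a /jobs/overview payload to id, name, and state per job."""
    return [
        {"id": job["jid"], "name": job["name"], "state": job["state"]}
        for job in overview.get("jobs", [])
    ]


def operator_edges(plan_response: dict) -> list[tuple[str, str]]:
    """Return (upstream_id, downstream_id) pairs from a /jobs/<jid>/plan payload."""
    edges = []
    for node in plan_response.get("plan", {}).get("nodes", []):
        # Source operators have no "inputs" key, so default to an empty list.
        for upstream in node.get("inputs", []):
            edges.append((upstream["id"], node["id"]))
    return edges


def fetch_json(base_url: str, path: str, timeout: float = 10.0) -> dict:
    """GET a JSON document from the JobManager REST API (assumes no auth)."""
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Against a reachable JobManager (the address is an assumption), `summarize_jobs(fetch_json("http://localhost:8081", "/jobs/overview"))` would list the jobs that become DataFlows, and `operator_edges` gives the operator topology for one job.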
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Flink Job | DataFlow | One DataFlow per Flink job |
| Flink Operator | DataJob | Granularity depends on `operator_granularity` |
| Job Execution | DataProcessInstance | When `include_run_history` is enabled |
| Kafka Topic | Dataset | Resolved via lineage (DataStream or SQL/Table API) |
| JDBC Table | Dataset | Resolved via SQL Gateway catalog introspection |
| Iceberg Table | Dataset | Resolved via SQL Gateway or `catalog_platform_map` config |
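
The mapping above can be illustrated with the general DataHub URN formats for each entity type. This is a sketch under stated assumptions: the URN shapes are the standard DataHub ones, but using the Flink job name as the flow ID and the operator ID as the job ID is a placeholder choice, not necessarily what this source emits. The function names are hypothetical.

```python
def dataflow_urn(flow_id: str, cluster: str = "PROD") -> str:
    """URN for a DataFlow; a Flink job maps to one DataFlow."""
    return f"urn:li:dataFlow:(flink,{flow_id},{cluster})"


def datajob_urn(flow_urn: str, job_id: str) -> str:
    """URN for a DataJob nested under its parent DataFlow; a Flink operator maps here."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """URN for a Dataset on its resolved platform (kafka, postgres, iceberg, ...)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Assumed identifiers for illustration only.
flow = dataflow_urn("orders-enrichment")
job = datajob_urn(flow, "source-1")
topic = dataset_urn("kafka", "orders")
```

Lineage is then expressed between Dataset URNs and the DataJob URN: a Kafka topic read by an operator becomes an input dataset of that DataJob.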