docs/metadata-ingestion-security.md
DataHub supports three ways to ingest metadata. They differ primarily in where credentials are stored and what network access is required.
| Approach | Credentials | Runs From | Network Path | Firewall/IP Allowlist Adjustments |
|---|---|---|---|---|
| UI Ingestion | Encrypted and stored in DataHub | DataHub's infrastructure | DataHub infrastructure connects to your data sources | Required for sources behind firewalls or with IP allowlists |
| CLI Ingestion | Local files/environment variables | Wherever you execute it (personal machine, CI/CD, scheduler such as Airflow or cron) | CLI connects to your data sources, then sends metadata to DataHub | Depends on where the CLI runs and whether those machines already have connectivity |
| Remote Executor | Your infrastructure's secret management (AWS Secrets Manager, Kubernetes Secrets, etc.) | Deployed in your infrastructure (Kubernetes, ECS, etc.) | Executor connects to your data sources, then sends metadata to DataHub (outbound only) | Depends on where the executor runs and whether those machines already have connectivity |
```mermaid
graph LR
    UI["UI Ingestion<br/>runs in DataHub infrastructure"]
    CLI["CLI Ingestion<br/>runs where you execute it"]
    RE["Remote Executor<br/>runs in your infrastructure"]
    DH[DataHub]
    SRC[Data Sources]
    UI -.->|managed by DataHub| DH
    DH -->|connects to| SRC
    CLI -->|sends metadata to| DH
    CLI -->|extracts from| SRC
    RE -->|sends metadata to| DH
    RE -->|extracts from| SRC
```
## Credential Storage

**UI Ingestion:** Credentials are encrypted and stored in the DataHub database. DataHub manages the encryption and uses the stored credentials when connecting to your data sources.
**CLI Ingestion:** Credentials live in your infrastructure and are referenced from recipe files. Best practice: always pass credentials through environment variables rather than hardcoding them in recipe files. You can also integrate with local secret managers.
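For example, a recipe can reference credentials through environment variable expansion so the file itself contains no secrets. A minimal sketch, assuming a MySQL source (the host and variable names here are illustrative):

```yaml
# recipe.yml -- no secrets in the file itself; values are
# expanded from environment variables at ingestion time.
source:
  type: mysql
  config:
    host_port: "prod-mysql.internal:3306" # hypothetical host
    username: "${MYSQL_USERNAME}"
    password: "${MYSQL_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: "${DATAHUB_GMS_URL}"
    token: "${DATAHUB_TOKEN}"
```

With the variables exported (for example, by your CI/CD system or a local secret manager), the run is simply `datahub ingest -c recipe.yml`.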
**Remote Executor:** Integrates with enterprise secret management systems in your infrastructure, such as AWS Secrets Manager and Kubernetes Secrets, so credentials remain in your environment.
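As one illustration, a Kubernetes deployment could inject a source credential from a Secret at runtime so it never appears in the executor's configuration. A hedged sketch (the container, image, Secret, and key names are all hypothetical):

```yaml
# Pod spec fragment: the credential is read from a Kubernetes
# Secret and exposed to the executor process as an env var.
containers:
  - name: remote-executor                    # hypothetical container name
    image: example.com/remote-executor:latest # placeholder image
    env:
      - name: SNOWFLAKE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: datahub-source-creds       # hypothetical Secret
            key: snowflake-password
```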
## Network Connectivity

**UI Ingestion:** DataHub's infrastructure connects directly to your data sources. This requires configuring your sources to allow DataHub access; the exact steps are source-dependent, but for sources behind firewalls or with IP allowlists it typically means adding DataHub's IP addresses to the allowlist.
**CLI Ingestion:** The CLI runs wherever you execute it (personal machine, CI/CD, cloud instance, or a scheduler such as Airflow or cron). It first connects to your data sources to extract metadata, then sends that metadata to DataHub. Network requirements depend entirely on where the CLI runs and whether that machine already has connectivity to both your sources and DataHub.
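For scheduled runs, the CLI is usually wrapped in whatever scheduler you already operate. A sketch using a Kubernetes CronJob (the schedule, names, and mounted recipe are illustrative; the image tag should match your DataHub version):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-metadata-ingestion     # hypothetical name
spec:
  schedule: "0 2 * * *"              # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: datahub-ingest
              image: acryldata/datahub-ingestion:head  # official ingestion image
              command: ["datahub", "ingest", "-c", "/config/recipe.yml"]
              envFrom:
                - secretRef:
                    name: mysql-credentials   # hypothetical Secret holding the env vars
              volumeMounts:
                - name: recipe
                  mountPath: /config
          volumes:
            - name: recipe
              configMap:
                name: ingestion-recipe        # hypothetical ConfigMap containing recipe.yml
```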
**Remote Executor:** Deployed as software in your infrastructure (Kubernetes, ECS, etc.) with access to both your data sources and DataHub. Like the CLI, it connects to sources first, then sends metadata to DataHub.

Key advantage: it only makes outbound connections. You don't need to open inbound firewall ports or configure VPN access for external systems; the executor runs entirely within your network perimeter.
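Because the executor only dials out, you can encode that posture directly in your network controls. For instance, in Kubernetes a NetworkPolicy could deny all inbound traffic to the executor while leaving outbound open; a sketch (names and labels are hypothetical):

```yaml
# Deny all ingress to the executor pods; allow all egress
# (to your data sources and to DataHub). Mirrors the
# "outbound only" model described above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executor-egress-only
spec:
  podSelector:
    matchLabels:
      app: remote-executor   # hypothetical pod label
  policyTypes:
    - Ingress                # listed with no rules => all inbound denied
    - Egress
  egress:
    - {}                     # empty rule => all outbound allowed
```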
## Choosing an Approach

Most organizations mix all three approaches based on their specific needs. Here are common patterns:
- **UI Ingestion** advantages: simplest for both scheduling and scale; DataHub handles all infrastructure and orchestration.
- **CLI Ingestion** scheduling: requires an external scheduler (Airflow, cron, Kubernetes CronJob, etc.).
- **Remote Executor** advantages: runs in your infrastructure with your security controls, requires only outbound connectivity, and integrates with your existing secret management.
The choice often depends on:

- where credentials may be stored (inside DataHub vs. in your own infrastructure),
- network topology (firewalls, IP allowlists, and whether inbound access from DataHub is acceptable), and
- scheduling needs (managed by DataHub vs. an existing scheduler such as Airflow or cron).
Note: These are guidelines, not strict rules. The best choice varies by organization and even by individual data source within an organization.