import FeatureAvailability from '@site/src/components/FeatureAvailability';
DataHub helps you discover and understand your organization's data by automatically collecting information about your data sources. This process, called metadata ingestion, allows DataHub to automatically pull in assets such as tables, schemas, dashboards, and lineage from the systems you connect.
This makes it simple to connect to popular platforms like Snowflake, BigQuery, dbt, and more, schedule automatic updates, and manage credentials securely.
To manage metadata ingestion in DataHub, you need appropriate permissions.
:::note Ask DataHub for Ingestion (Public Beta - DataHub Cloud)
Ask DataHub (Public Beta) is available within the ingestion creation and troubleshooting workflow for DataHub Cloud deployments. Get AI-powered assistance with configuration, filtering, troubleshooting, and best practices—right in your workflow.
:::
Users can be granted the following privileges for full administrative access to all ingestion sources:

- **Manage Metadata Ingestion** - Provides complete access to create, edit, run, and delete all ingestion sources
- **Manage Secrets** - Allows creation and management of encrypted credentials used in ingestion configurations

These privileges can be granted in two ways:

1. Grant the **Manage Metadata Ingestion** and **Manage Secrets** platform privileges to specific users or groups
2. For more granular control, administrators can create Custom Policies that apply specifically to Ingestion Sources, allowing different users to have different levels of access
Prerequisites:

- The `VIEW_INGESTION_SOURCE_PRIVILEGES_ENABLED` feature flag must be enabled

:::caution Important
Once this feature flag is enabled, any policies that apply to "All" resource types will now include Ingestion Sources, including the default read-only policies. This will make the Ingestion tab visible and potentially actionable depending on the applied privileges. Implement this with care if you have view-only policies that should not expose the Data Sources page.
:::
Once you have the appropriate privileges, navigate to the Ingestion tab in DataHub.
On this page, you'll see a list of active Ingestion Sources. An Ingestion Source represents a configured connection to an external data system from which DataHub extracts metadata.
If you're just getting started, you won't have any sources configured. The following sections will guide you through creating your first ingestion source.
Begin by clicking + Create source to start the ingestion source creation process.
Next, select the type of data source you want to connect. DataHub provides pre-built templates for popular platforms, including Snowflake, BigQuery, dbt, and many others.
Select the template that matches your data source. If your specific platform isn't listed, you can choose Custom to configure a source manually, though this requires more technical knowledge.
After selecting your data source template, you'll configure how DataHub connects to and extracts metadata from your source.
Ask DataHub (Public Beta - Cloud only) provides contextual assistance throughout the ingestion configuration process.
Name and Owners: First, provide a descriptive name for your ingestion source that will help you and your team identify it later. You can also assign Users and/or Groups as owners of this ingestion source. By default, you (the creator) will be assigned as an owner, but you can add additional owners or change this at any time after creation.
Connection Information: Next, you'll configure the connection details using a user-friendly form. The exact fields vary by platform, but typically include the host and port of your source system, authentication credentials (or references to stored secrets), and the database, project, or account to scan.
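As a point of reference, the same connection details map onto the fields of a YAML recipe. This is a minimal sketch for a hypothetical MySQL source; the host and database names are placeholders, not values from a real deployment:

```yaml
source:
  type: mysql
  config:
    # Placeholder connection details; substitute your own host and database
    host_port: "mysql.example.internal:3306"
    database: "analytics"
    # Reference stored secrets rather than inlining credentials
    username: "${MYSQL_USERNAME}"
    password: "${MYSQL_PASSWORD}"
```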
Asset Filters: Configure what metadata to extract. Most templates let you define allow and deny patterns to include or exclude specific databases, schemas, or tables.
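In recipe terms, these filters correspond to allow/deny regex patterns. A sketch, assuming a MySQL-style source that supports `schema_pattern` and `table_pattern`:

```yaml
source:
  type: mysql
  config:
    # Only ingest schemas whose names start with "prod_"
    schema_pattern:
      allow:
        - "^prod_.*"
    # Skip temporary and staging tables
    table_pattern:
      deny:
        - ".*_tmp$"
        - ".*_staging$"
```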
Ingestion Settings: Configure ingestion behavior including profiling, stale metadata handling, and other operational settings. The defaults represent best practices for most use cases.
:::note Ask DataHub for Configuration Help (Public Beta)
Ask DataHub (Public Beta) can help you understand the behavior and options of each configuration setting. Get tailored recommendations for your data source and use case.
:::
Ask DataHub (Public Beta - Cloud only) helps you understand configuration options and provides tailored recommendations for your data source.
For production environments, sensitive information like passwords and API keys should be stored securely using DataHub's Secrets functionality.
To create a secret:

1. Navigate to the Secrets tab and click Create new secret
2. Provide a descriptive name (e.g. `BIGQUERY_PRIVATE_KEY`) and the value to encrypt

Once created, secrets can be referenced in your ingestion configuration forms using the dropdown menus provided for credential fields.
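Secrets can also be referenced in YAML recipes using `${SECRET_NAME}` syntax, which DataHub resolves at ingestion time. A sketch, assuming the BigQuery template's `credential` section and the example secret name above:

```yaml
source:
  type: bigquery
  config:
    credential:
      # Resolved from the encrypted secret at runtime; never stored in plaintext
      private_key: "${BIGQUERY_PRIVATE_KEY}"
```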
:::caution Security Note
Users with the Manage Secrets privilege can retrieve plaintext secret values through DataHub's GraphQL API. Ensure secrets are only accessible to trusted administrators.
:::
Before proceeding, it's important to verify that DataHub can successfully connect to your data source. Most ingestion source forms include a Test Connection button that validates network connectivity, authentication credentials, and the permissions required to read metadata.
If the connection test fails, review your configuration and ensure that the host and port are correct, the credentials are valid, and the network path between DataHub's executor and your source is open (firewalls, allowlists, VPNs).
For users who need additional control, DataHub provides advanced configuration options accessible in the Advanced Settings section:
Configure how often DataHub should sync metadata from your source. You can enable or disable scheduled execution using the toggle (recommended: enabled). This ensures your metadata stays up-to-date without manual intervention.
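Schedules use standard cron syntax (minute, hour, day of month, month, day of week). The sketch below mirrors the shape used by the GraphQL API later in this guide; the interval shown is just an example:

```yaml
# Run ingestion daily at 2:00 AM UTC
schedule:
  interval: "0 2 * * *"
  timezone: "UTC"
```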
If you prefer to run ingestion manually or on an ad-hoc basis, you can skip the scheduling step entirely.
Review your configuration to ensure all settings are correct. When you're ready, you have two options: save the source and run it immediately, or save it without running so you can execute it later, either manually or on its schedule.
Once you're happy with your configurations, click your preferred save option to finalize your source.
Once you've created your Ingestion Source, you can run it by clicking the 'Play' button. Shortly after, you should see the 'Last Status' column of the ingestion source change to Running, indicating that DataHub has successfully queued the ingestion job.
When ingestion completes successfully, the status will show as Success in green.
The Run History tab shows you a complete history of all your ingestion runs. Here you can see each run's status, start time, and duration, and drill into individual runs to inspect their logs.
This makes it easy to track your ingestion performance and troubleshoot any issues over time.
After successful ingestion, you can view detailed information about what was extracted, including counts of the assets ingested and links to browse them in DataHub.
If an ingestion run is taking too long or appears to be stuck, you can cancel it by clicking the 'Stop' button on the running job.
This is useful when encountering issues like a run that hangs on an unresponsive source, or a misconfigured filter that pulls in far more metadata than intended.
When an ingestion run fails, you'll see a failed status indicator in your sources list.
Common causes of ingestion failures include invalid or expired credentials, network connectivity problems, insufficient permissions on the source system, and misconfigured recipes or filters.
To diagnose ingestion failures, click on a run history status (Failed, Aborted) value to view and download comprehensive ingestion run logs.
The logs provide detailed information about which stage of the pipeline failed, the underlying error messages and stack traces, and any warnings raised during extraction.
If your DataHub instance has Metadata Service Authentication enabled, you'll need to provide a Personal Access Token in your configuration.
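When deploying a recipe yourself (for example via the CLI), the token goes into the sink configuration. A sketch, assuming a `datahub-rest` sink and a hypothetical `DATAHUB_PAT` secret holding the token:

```yaml
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    # Personal Access Token, referenced from a secret or environment variable
    token: "${DATAHUB_PAT}"
```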
While the UI-based forms handle most common ingestion scenarios, advanced users may need direct access to YAML configuration for capabilities not exposed in the forms, such as transformers, fine-grained source options, or recipes managed in version control.
For these advanced use cases, DataHub supports direct YAML recipe configuration. For detailed information about YAML-based configuration, including syntax and examples, see the Recipe Overview Guide.
You can deploy recipes using the CLI as mentioned in the CLI documentation for uploading ingestion recipes.
```shell
datahub ingest deploy --name "My Test Ingestion Source" --schedule "5 * * * *" --time-zone "UTC" -c recipe.yaml
```
Create ingestion sources programmatically with DataHub's GraphQL API via the createIngestionSource mutation.
```graphql
mutation {
  createIngestionSource(
    input: {
      name: "My Test Ingestion Source"
      type: "mysql"
      description: "My ingestion source description"
      schedule: { interval: "*/5 * * * *", timezone: "UTC" }
      config: {
        recipe: "{\"source\":{\"type\":\"mysql\",\"config\":{\"include_tables\":true,\"database\":null,\"password\":\"${MYSQL_PASSWORD}\",\"profiling\":{\"enabled\":false},\"host_port\":null,\"include_views\":true,\"username\":\"${MYSQL_USERNAME}\"}},\"pipeline_name\":\"urn:li:dataHubIngestionSource:f38bd060-4ea8-459c-8f24-a773286a2927\"}"
        version: "0.8.18"
        executorId: "mytestexecutor"
      }
    }
  )
}
```
Note: the recipe must be passed as a JSON string with its double quotes escaped when embedded in GraphQL.
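For readability, the escaped recipe string above is equivalent to this YAML:

```yaml
source:
  type: mysql
  config:
    host_port: null
    database: null
    username: "${MYSQL_USERNAME}"
    password: "${MYSQL_PASSWORD}"
    include_tables: true
    include_views: true
    profiling:
      enabled: false
pipeline_name: "urn:li:dataHubIngestionSource:f38bd060-4ea8-459c-8f24-a773286a2927"
```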
If you're running DataHub using datahub docker quickstart and experiencing connection failures, this may be due to network configuration issues. The ingestion executor might be unable to reach DataHub's backend services.
Try updating your ingestion configuration to use the Docker internal DNS name:
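For example, in a quickstart deployment, point the sink at the metadata service container instead of `localhost`. A sketch, assuming the default quickstart container name and port:

```yaml
sink:
  type: datahub-rest
  config:
    # Docker internal DNS name for DataHub's metadata service in quickstart
    server: "http://datahub-gms:8080"
```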
If your ingestion source shows a dash mark (-) status and never changes to 'Running', this could mean the job was never picked up for execution, for example because the executor (actions container) is down or cannot reach DataHub.
If clicking "Play" doesn't resolve the issue, DataHub Core users should diagnose their actions container:

1. Run `docker ps` to verify the actions container is up
2. Run `docker logs <container-id>` to inspect its logs for errors

Consider using CLI-based ingestion when the UI doesn't meet your needs, for example for source types without a UI template or for recipes managed in version control.