docker/README.md
You need to install Docker, and Docker Compose if you are on Linux (on Windows and Mac, Compose is included with Docker Desktop).
Make sure to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 2 CPUs, 8 GB RAM, 2 GB swap.
If you prefer not to use Docker Desktop (which requires a license for commercial use), you can opt for free and open-source alternatives such as Podman Desktop or Rancher Desktop. To configure them, you can add the following aliases to your ~/.bashrc file:
```shell
# podman
alias docker=podman
alias docker-compose="podman compose"

# Rancher (or nerdctl)
alias docker=nerdctl
alias docker-compose="nerdctl compose"
```
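One caveat worth knowing: Bash only expands aliases in interactive shells, so scripts that invoke `docker` will not see these aliases unless expansion is enabled explicitly. A minimal sketch of the behavior, using a hypothetical `echo` stand-in so it runs without any container engine installed:

```shell
# Aliases are not expanded in non-interactive shells by default;
# scripts must opt in with expand_aliases. The echo stand-in below
# substitutes for the real `alias docker=podman`.
shopt -s expand_aliases
alias docker='echo podman'
# From here on, `docker run hello` is rewritten to `echo podman run hello`:
docker run hello   # prints: podman run hello
```

In practice this means CI scripts or Makefiles should call `podman`/`nerdctl` directly (or use symlinks) rather than relying on shell aliases.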
The easiest way to bring up and test DataHub is using the DataHub Docker images, which are continuously deployed to Docker Hub with every commit to the repository.
You can easily download and run all these images and their dependencies with our quick start guide.
DataHub Docker Images:
Do not use the latest or debug tags for any of the images; those are unsupported and present only for legacy reasons. Please use head or version-specific tags such as v0.8.40. For production we recommend version-specific tags, not head.
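One way to enforce this guidance is to build image references through a small helper that rejects the unsupported tags. A sketch (the `pinned_image` function and the `acryldata/datahub-gms` example are illustrative):

```shell
# Hypothetical helper: build a version-pinned image reference so the
# unsupported `latest` and `debug` tags never sneak into manifests.
pinned_image() {
  local image="$1" version="$2"
  case "$version" in
    latest|debug)
      echo "refusing unsupported tag: $version" >&2
      return 1 ;;
  esac
  echo "${image}:${version}"
}

pinned_image acryldata/datahub-gms v0.8.40   # -> acryldata/datahub-gms:v0.8.40
```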
Do not use acryldata/acryl-datahub-actions, as it is deprecated and no longer used. datahub-ingestion and datahub-actions are available as full, slim, and locked variants (there are no Alpine variants). datahub-ingestion is based on Ubuntu 24.04. datahub-actions is built on Wolfi (cgr.dev/chainguard/wolfi-base; override the base with the Docker build arg WOLFI_BASE_IMAGE if needed).
| Variant | Image size | Use case |
|---|---|---|
| full (default) | Largest | All connectors, maximum compatibility |
| slim | Medium | Common connectors, good balance |
| locked | Medium | Air-gapped environments (PyPI disabled) |
```shell
acryldata/datahub-ingestion:v0.x.y         # full (default)
acryldata/datahub-ingestion:v0.x.y-slim    # slim
acryldata/datahub-ingestion:v0.x.y-locked  # locked
```
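If you select the variant in a deployment script, the tag suffix can be derived mechanically. A sketch (`v0.x.y` mirrors the placeholder pattern above; substitute a real release):

```shell
# Derive the full image reference from a variant choice.
variant="slim"                     # one of: full | slim | locked
case "$variant" in
  full)   suffix=""        ;;      # the default tag carries no suffix
  slim)   suffix="-slim"   ;;
  locked) suffix="-locked" ;;
esac
echo "acryldata/datahub-ingestion:v0.x.y${suffix}"
```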
| Feature | full | slim | locked |
|---|---|---|---|
| Core CLI & REST/Kafka | Yes | Yes | Yes |
| S3 / GCS / Azure Blob | Yes | Yes | Yes |
| Snowflake | Yes | Yes | - |
| BigQuery | Yes | Yes | - |
| Redshift | Yes | Yes | - |
| MySQL / PostgreSQL | Yes | Yes | - |
| ClickHouse | Yes | Yes | - |
| dbt | Yes | Yes | - |
| Looker / LookML | Yes | Yes | - |
| Tableau / PowerBI | Yes | Yes | - |
| Superset | Yes | Yes | - |
| Glue | Yes | Yes | - |
| Spark lineage (JRE) | Yes | - | - |
| Oracle client | Yes | - | - |
| Runtime pip install | Yes | Yes | - |
| Feature | full | slim | locked |
|---|---|---|---|
| Core actions | Yes | Yes | Yes |
| Kafka / Executor | Yes | Yes | Yes |
| Slack / Teams | Yes | Yes | Yes |
| Tag / Term / Doc propagation | Yes | Yes | Yes |
| Snowflake tag propagation | Yes | Yes | Yes |
| Bundled CLI venvs | Yes | Yes | Yes |
| Runtime pip install | Yes | Yes | - |
| Image Variant | Tested in CI | Notes |
|---|---|---|
| full | Yes | Smoke tests run on every PR |
| slim | Yes | Smoke tests run on every PR |
| locked | Build only | |
- full: Use when you need maximum connector coverage or aren't sure what you'll need
- slim: Recommended for most production deployments with standard cloud data stacks
- locked: Required for air-gapped environments where runtime package installation is prohibited

Dependencies:
SQL and search index setup are performed by the system update job (`datahub-upgrade` with `-u SystemUpdate`) when backing services are healthy; no separate setup containers are required.
If you want to test ingesting some data once DataHub is up, use the `./docker/ingestion/ingestion.sh` script or `datahub docker ingest-sample-data`. See the quickstart guide for more details.
See Using Docker Images During Development.
We use GitHub Actions to build and continuously deploy our images. There should be no need to do this manually; a successful release on GitHub will automatically publish the images.
This is not our recommended development flow and most developers should be following the Using Docker Images During Development guide.
To build the full images (that we are going to publish), you need to run the following:
```shell
COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose -p datahub build
```
This is because we rely on BuildKit for multi-stage builds. It also does not hurt to set DATAHUB_VERSION to something unique.
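For example, the version can be derived from the current git commit so that local builds are never confused with published releases. A sketch (the exact value is arbitrary, as long as it is clearly not a real release):

```shell
# Tag local builds with a unique placeholder version; the git short
# hash works well, with a "local" fallback outside a git checkout.
DATAHUB_VERSION="v0.0.0-$(git rev-parse --short HEAD 2>/dev/null || echo local)"
export DATAHUB_VERSION COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1
echo "DATAHUB_VERSION=${DATAHUB_VERSION}"
```

With these exported, the `docker compose -p datahub build` invocation picks up the unique version.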
As the open source project grows, community members may want to contribute additions to the Docker images. Not every contribution can be accepted: some changes are not useful to all community members, and they can increase build times, add dependencies, and introduce potential security vulnerabilities. In those cases, this section can point to community-hosted Dockerfiles that build on top of the images published by the DataHub core team, along with links to the container registries where the resulting images are maintained.
This document describes the nuke task system for cleaning up DataHub Docker containers and volumes.
The nuke tasks provide a way to completely remove DataHub containers and volumes for different project namespaces. This is useful for:
Every quickstart configuration automatically gets a nuke task for targeted cleanup:
- `quickstartNuke` - Removes containers and volumes for the default project namespace (`datahub`)
- `quickstartDebugNuke` - Removes containers and volumes for the debug configuration (`datahub`)
- `quickstartCypressNuke` - Removes containers and volumes for the cypress configuration (`dh-cypress`)
- `quickstartDebugMinNuke` - Removes containers and volumes for the debug-min configuration (`datahub`)
- `quickstartDebugConsumersNuke` - Removes containers and volumes for the debug-consumers configuration (`datahub`)
- `quickstartPgNuke` - Removes containers and volumes for the postgres configuration (`datahub`)
- `quickstartPgDebugNuke` - Removes containers and volumes for the debug-postgres configuration (`datahub`)
- `quickstartSlimNuke` - Removes containers and volumes for the backend configuration (`datahub`)
- `quickstartSparkNuke` - Removes containers and volumes for the spark configuration (`datahub`)
- `quickstartStorageNuke` - Removes containers and volumes for the storage configuration (`datahub`)
- `quickstartBackendDebugNuke` - Removes containers and volumes for the backend-debug configuration (`datahub`)

Project namespaces:

- Default (`datahub`): Most configurations use this, so their nuke tasks will clean up containers in the same namespace
- Cypress (`dh-cypress`): The cypress configuration uses its own namespace for isolation

```shell
# Remove containers and volumes for specific configurations
./gradlew quickstartDebugNuke    # For debug configuration
./gradlew quickstartCypressNuke  # For cypress configuration
./gradlew quickstartDebugMinNuke # For debug-min configuration
./gradlew quickstartPgNuke       # For postgres configuration

# For general cleanup of all containers
./gradlew quickstartDown
```
Use specific nuke tasks when:
Use quickstartDown when:
- `removeVolumes = true` for relevant configurations
- `ComposeDownForced` operations
- `projectName` settings in `quickstart_configs`

The nuke tasks are automatically generated based on the `quickstart_configs` in `docker/build.gradle`. To add a new nuke task:
Add a configuration to `quickstart_configs`:

```groovy
'quickstartCustom': [
    profile: 'debug',
    modules: [...],
    // Optional: custom project name for isolation
    additionalConfig: [
        projectName: 'dh-custom'
    ]
]
```

The task `quickstartCustomNuke` will be created automatically.
- Tasks are generated from `quickstart_configs` and named `{configName}Nuke`
- Each task honors the configuration's `projectName` setting
- Cleanup is performed via `ComposeDownForced` operations
- If volumes survive cleanup, check whether `preserveVolumes` is set to true in the configuration and verify the `removeVolumes` setting is properly applied
- Start environments: `./gradlew quickstartDebug`, `./gradlew quickstartCypress`
- Stop everything: `./gradlew quickstartDown`
- Reload: `./gradlew debugReload`, `./gradlew cypressReload`

```shell
# Clean up specific debug environment
./gradlew quickstartDebugNuke

# Start fresh
./gradlew quickstartDebug

# Clean up cypress environment
./gradlew quickstartCypressNuke

# Start fresh cypress environment
./gradlew quickstartCypress

# Clean up only cypress (leaving main environment intact)
./gradlew quickstartCypressNuke

# Clean up only debug environment (leaving cypress intact)
./gradlew quickstartDebugNuke

# Clean up only postgres environment (leaving others intact)
./gradlew quickstartPgNuke
```