docker/README.md
You need to install Docker, and Docker Compose if you are on Linux (on Windows and Mac, Compose is included with Docker Desktop).
Make sure to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 2 CPUs, 8 GB RAM, 2 GB swap.
If you prefer not to use Docker Desktop (which requires a license for commercial use), you can opt for free and open-source alternatives such as Podman Desktop or Rancher Desktop. To configure them, you can add the following aliases to your ~/.bashrc file:
```shell
# podman
alias docker=podman
alias docker-compose="podman compose"

# Rancher (or nerdctl)
alias docker=nerdctl
alias docker-compose="nerdctl compose"
```
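One caveat worth knowing: Bash only expands aliases in interactive shells, so scripts that invoke `docker` will not see these aliases unless expansion is enabled explicitly. A minimal sketch of the behavior, using a hypothetical `echo` stand-in so it runs without any container engine installed:

```shell
# Aliases are not expanded in non-interactive shells by default;
# scripts must opt in with expand_aliases. The echo stand-in below
# substitutes for the real `alias docker=podman`.
shopt -s expand_aliases
alias docker='echo podman'
# From here on, `docker run hello` is rewritten to `echo podman run hello`:
docker run hello   # prints: podman run hello
```

In practice this means CI scripts or Makefiles should call `podman`/`nerdctl` directly (or use symlinks) rather than relying on shell aliases.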
The easiest way to bring up and test DataHub is using the DataHub Docker images, which are continuously deployed to Docker Hub with every commit to the repository.
You can easily download and run all these images and their dependencies with our quick start guide.
DataHub Docker Images:
Do not use the latest or debug tags for any of the images; those are unsupported and present only for legacy reasons. Please use head or version-specific tags such as v0.8.40. For production we recommend version-specific tags, not head.
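One way to enforce this guidance is to build image references through a small helper that rejects the unsupported tags. A sketch (the `pinned_image` function and the `acryldata/datahub-gms` example are illustrative):

```shell
# Hypothetical helper: build a version-pinned image reference so the
# unsupported `latest` and `debug` tags never sneak into manifests.
pinned_image() {
  local image="$1" version="$2"
  case "$version" in
    latest|debug)
      echo "refusing unsupported tag: $version" >&2
      return 1 ;;
  esac
  echo "${image}:${version}"
}

pinned_image acryldata/datahub-gms v0.8.40   # -> acryldata/datahub-gms:v0.8.40
```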
Do not use acryldata/acryl-datahub-actions, as it is deprecated and no longer used. datahub-ingestion and datahub-actions are available as full, slim, and locked variants (there are no Alpine variants). datahub-ingestion is based on Ubuntu 24.04. datahub-actions is built on Wolfi (cgr.dev/chainguard/wolfi-base; override the base with the Docker build arg WOLFI_BASE_IMAGE if needed).
| Variant | Image size | Use case |
|---|---|---|
| full (default) | Largest | All connectors, maximum compatibility |
| slim | Medium | Common connectors, good balance |
| locked | Medium | Air-gapped environments (PyPI disabled) |
```shell
acryldata/datahub-ingestion:v0.x.y         # full (default)
acryldata/datahub-ingestion:v0.x.y-slim    # slim
acryldata/datahub-ingestion:v0.x.y-locked  # locked
```
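If you select the variant in a deployment script, the tag suffix can be derived mechanically. A sketch (`v0.x.y` mirrors the placeholder pattern above; substitute a real release):

```shell
# Derive the full image reference from a variant choice.
variant="slim"                     # one of: full | slim | locked
case "$variant" in
  full)   suffix=""        ;;      # the default tag carries no suffix
  slim)   suffix="-slim"   ;;
  locked) suffix="-locked" ;;
esac
echo "acryldata/datahub-ingestion:v0.x.y${suffix}"
```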
| Feature | full | slim | locked |
|---|---|---|---|
| Core CLI & REST/Kafka | Yes | Yes | Yes |
| S3 / GCS / Azure Blob | Yes | Yes | Yes |
| Snowflake | Yes | Yes | - |
| BigQuery | Yes | Yes | - |
| Redshift | Yes | Yes | - |
| MySQL / PostgreSQL | Yes | Yes | - |
| ClickHouse | Yes | Yes | - |
| dbt | Yes | Yes | - |
| Looker / LookML | Yes | Yes | - |
| Tableau / PowerBI | Yes | Yes | - |
| Superset | Yes | Yes | - |
| Glue | Yes | Yes | - |
| Spark lineage (JRE) | Yes | - | - |
| Oracle client | Yes | - | - |
| Runtime pip install | Yes | Yes | - |
| Feature | full | slim | locked |
|---|---|---|---|
| Core actions | Yes | Yes | Yes |
| Kafka / Executor | Yes | Yes | Yes |
| Slack / Teams | Yes | Yes | Yes |
| Tag / Term / Doc propagation | Yes | Yes | Yes |
| Snowflake tag propagation | Yes | Yes | Yes |
| Bundled CLI venvs | Yes | Yes | Yes |
| Runtime pip install | Yes | Yes | - |
| Image Variant | Tested in CI | Notes |
|---|---|---|
| full | Yes | Smoke tests run on every PR |
| slim | Yes | Smoke tests run on every PR |
| locked | Build only | |
- full: Use when you need maximum connector coverage or aren't sure what you'll need
- slim: Recommended for most production deployments with standard cloud data stacks
- locked: Required for air-gapped environments where runtime package installation is prohibited

Dependencies:
SQL and search index setup are performed by the system update job (`datahub-upgrade` with `-u SystemUpdate`) when backing services are healthy; no separate setup containers are required.
If you want to test ingesting some data once DataHub is up, use the `./docker/ingestion/ingestion.sh` script or `datahub docker ingest-sample-data`. See the quickstart guide for more details.
See Using Docker Images During Development.
We use GitHub Actions to build and continuously deploy our images. There should be no need to do this manually; a successful release on GitHub will automatically publish the images.
This is not our recommended development flow and most developers should be following the Using Docker Images During Development guide.
To build the full images (that we are going to publish), you need to run the following:
```shell
COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose -p datahub build
```
This is because we rely on BuildKit for multi-stage builds. It also does not hurt to set DATAHUB_VERSION to something unique.
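For example, the version can be derived from the current git commit so that local builds are never confused with published releases. A sketch (the exact value is arbitrary, as long as it is clearly not a real release):

```shell
# Tag local builds with a unique placeholder version; the git short
# hash works well, with a "local" fallback outside a git checkout.
DATAHUB_VERSION="v0.0.0-$(git rev-parse --short HEAD 2>/dev/null || echo local)"
export DATAHUB_VERSION COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1
echo "DATAHUB_VERSION=${DATAHUB_VERSION}"
```

With these exported, the `docker compose -p datahub build` invocation picks up the unique version.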
As the open source project grows, community members may want to contribute additions to the Docker images. Not every contribution can be accepted: some changes are not useful to all community members, and they can increase build times, add dependencies, and introduce potential security vulnerabilities. In those cases, this section can point to community-hosted Dockerfiles that build on top of the images published by the DataHub core team, along with links to the container registries where the resulting images are maintained.
This document describes the nuke task system for cleaning up DataHub Docker containers and volumes.
The nuke tasks provide a way to completely remove DataHub containers and volumes for different project namespaces. This is useful for:
Every quickstart configuration automatically gets a nuke task for targeted cleanup:
- `quickstartNuke` - Removes containers and volumes for the default project namespace (`datahub`)
- `quickstartDebugNuke` - Removes containers and volumes for the debug configuration (`datahub`)
- `quickstartCypressNuke` - Removes containers and volumes for the cypress configuration (`dh-cypress`)
- `quickstartDebugMinNuke` - Removes containers and volumes for the debug-min configuration (`datahub`)
- `quickstartDebugConsumersNuke` - Removes containers and volumes for the debug-consumers configuration (`datahub`)
- `quickstartPgNuke` - Removes containers and volumes for the postgres configuration (`datahub`)
- `quickstartPgDebugNuke` - Removes containers and volumes for the debug-postgres configuration (`datahub`)
- `quickstartSlimNuke` - Removes containers and volumes for the backend configuration (`datahub`)
- `quickstartSparkNuke` - Removes containers and volumes for the spark configuration (`datahub`)
- `quickstartStorageNuke` - Removes containers and volumes for the storage configuration (`datahub`)
- `quickstartBackendDebugNuke` - Removes containers and volumes for the backend-debug configuration (`datahub`)

Project namespaces:

- Default (`datahub`): Most configurations use this, so their nuke tasks will clean up containers in the same namespace
- Cypress (`dh-cypress`): The cypress configuration uses its own namespace for isolation

```shell
# Remove containers and volumes for specific configurations
./gradlew quickstartDebugNuke    # For debug configuration
./gradlew quickstartCypressNuke  # For cypress configuration
./gradlew quickstartDebugMinNuke # For debug-min configuration
./gradlew quickstartPgNuke       # For postgres configuration

# For general cleanup of all containers
./gradlew quickstartDown
```
Use specific nuke tasks when:
Use quickstartDown when:
- `removeVolumes = true` for relevant configurations
- `ComposeDownForced` operations
- `projectName` settings in `quickstart_configs`

The nuke tasks are automatically generated based on the `quickstart_configs` in `docker/build.gradle`. To add a new nuke task:
Add a configuration to `quickstart_configs`:

```groovy
'quickstartCustom': [
    profile: 'debug',
    modules: [...],
    // Optional: custom project name for isolation
    additionalConfig: [
        projectName: 'dh-custom'
    ]
]
```

The task `quickstartCustomNuke` will be created automatically.
- Tasks are generated from `quickstart_configs` and named `{configName}Nuke`
- Each task honors the configuration's `projectName` setting
- Cleanup is performed via `ComposeDownForced` operations
- If volumes survive cleanup, check whether `preserveVolumes` is set to true in the configuration and verify the `removeVolumes` setting is properly applied
- Start environments: `./gradlew quickstartDebug`, `./gradlew quickstartCypress`
- Stop everything: `./gradlew quickstartDown`
- Reload: `./gradlew debugReload`, `./gradlew cypressReload`

```shell
# Clean up specific debug environment
./gradlew quickstartDebugNuke

# Start fresh
./gradlew quickstartDebug

# Clean up cypress environment
./gradlew quickstartCypressNuke

# Start fresh cypress environment
./gradlew quickstartCypress

# Clean up only cypress (leaving main environment intact)
./gradlew quickstartCypressNuke

# Clean up only debug environment (leaving cypress intact)
./gradlew quickstartDebugNuke

# Clean up only postgres environment (leaving others intact)
./gradlew quickstartPgNuke
```