import FeatureAvailability from '@site/src/components/FeatureAvailability';
This guide will walk you through the process of setting up Remote Executors in your environment.

A Remote Executor Pool provides a way to organize and manage your Remote Executors in DataHub.

Before deploying a Remote Executor, ensure you have the following:

- **DataHub Cloud**: Your DataHub Cloud URL (`https://<your-company>.acryl.io/gms`). NOTE: you MUST include the trailing `/gms` when configuring the executor.
- **Deployment Environment**: An environment capable of running the executor container, such as AWS ECS or a Kubernetes cluster.
- **Network Connectivity**: See the requirements below.
The Remote Executor requires outbound HTTPS (port 443) connectivity only — no inbound connectivity is needed. Ensure the following endpoints are reachable from your deployment environment:
- `https://<your-company>.acryl.io/*` — DataHub GMS API
- `https://sqs.*.amazonaws.com/*` — AWS SQS, used for remote execution task dispatch
- `https://pypi.org` (or an alternate internal mirror) — to download pip packages required by ingestion sources
- `docker.datahub.com` — DataHub container registry access
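As a quick sanity check of these requirements, a short shell loop can probe outbound HTTPS reachability from the deployment host. This is only a sketch: `your-company.acryl.io` is a placeholder for your actual DataHub Cloud hostname, and a plain `curl` probe does not exercise SQS or registry authentication.

```shell
# Probe outbound HTTPS (port 443) to each required endpoint.
# Replace your-company.acryl.io with your actual DataHub Cloud hostname.
for host in your-company.acryl.io pypi.org docker.datahub.com; do
  if curl -sS --connect-timeout 5 -o /dev/null "https://$host" 2>/dev/null; then
    echo "$host reachable"
  else
    echo "$host NOT reachable"
  fi
done
```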
Complete the following steps to create a new Executor Pool from the DataHub Cloud UI:
Once you have created an Executor Pool in DataHub Cloud, you are now ready to deploy an Executor within your environment.
:::note
Work with the DataHub team to receive deployment templates specific to your environment (Helm charts, CloudFormation, or Terraform) for deploying Remote Executors in this Pool.
:::
To access the private DataHub Cloud ECR registry, you'll need to securely share your AWS account ID with your DataHub Cloud representative.
This step is required to grant your AWS account access to pull the Remote Executor container image.
The DataHub Team will provide a CloudFormation Template that you can run to provision an ECS cluster with a single remote ingestion task. It will also provision an AWS role for the task which grants the permissions necessary to read and delete from the private queue created for you, along with reading the secrets you've specified. At minimum, the template requires the following parameters:

- **ExecutorPoolId**: The ID of the Executor Pool you created above
- **VPCID** and **SubnetID**: The VPC and subnet to deploy the task into
- **DataHubBaseUrl**: Your DataHub Cloud URL (`https://<your-company>.acryl.io/gms`)
- **DataHubAccessToken**: A DataHub Access Token for the executor

Optional parameters:

- **Secrets**: `SECRET_NAME=SECRET_ARN` pairs (up to 10); separate multiple secrets with commas, e.g. `SECRET_NAME_1=SECRET_ARN_1,SECRET_NAME_2=SECRET_ARN_2`.
- **Environment Variables**: `ENV_VAR_NAME=ENV_VAR_VALUE` pairs (up to 10); separate multiple variables with commas, e.g. `ENV_VAR_NAME_1=ENV_VAR_VALUE_1,ENV_VAR_NAME_2=ENV_VAR_VALUE_2`.

:::note
Configuring Secrets enables you to manage ingestion sources from the DataHub UI without storing credentials inside DataHub. Once defined, secrets can be referenced by name inside your DataHub Ingestion Source configurations using the usual convention: `${SECRET_NAME}`.
:::
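To illustrate the convention, here is a minimal sketch of the substitution the executor performs at ingestion time. The `MY_SOURCE_SECRET` name, the secret value, and the `sed` command are illustrative only, not the executor's actual implementation:

```shell
# A recipe line referencing a secret by name:
RECIPE_LINE='password: "${MY_SOURCE_SECRET}"'
# At runtime the executor replaces the placeholder with the secret's value
# (simulated here with sed; the real resolution happens inside the executor):
SECRET_VALUE='s3cr3t'
echo "$RECIPE_LINE" | sed "s/\${MY_SOURCE_SECRET}/$SECRET_VALUE/"
# prints: password: "s3cr3t"
```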
**Deploy Stack**

```shell
# Using AWS CLI
aws --region us-east-1 cloudformation create-stack \
  --stack-name datahub-remote-executor \
  --template-body file://datahub-executor.ecs.template.yaml \
  --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=ExecutorPoolId,ParameterValue="remote" \
    ParameterKey=VPCID,ParameterValue="<your-vpc>" \
    ParameterKey=SubnetID,ParameterValue="<your-subnet>" \
    ParameterKey=DataHubBaseUrl,ParameterValue="https://<your-company>.acryl.io/gms" \
    ParameterKey=DataHubAccessToken,ParameterValue="<your-remote-executor-access-token>"
```
Or use the CloudFormation Console
**Configure Secrets (Optional)**

```shell
# Create a secret in AWS Secrets Manager
aws secretsmanager create-secret \
  --name my-source-secret \
  --secret-string '{"username":"user","password":"pass"}'
```
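When wiring the secret into the CloudFormation stack's secrets parameter (the exact parameter key comes from the template the DataHub team provides), each entry follows the `SECRET_NAME=SECRET_ARN` format described above, joined with commas. A sketch with hypothetical ARNs:

```shell
# Hypothetical ARNs for illustration; use the ARNs returned by create-secret.
SECRET_1="MY_SOURCE_SECRET=arn:aws:secretsmanager:us-east-1:123456789012:secret:my-source-secret-AbCdEf"
SECRET_2="WAREHOUSE_SECRET=arn:aws:secretsmanager:us-east-1:123456789012:secret:warehouse-secret-GhIjKl"
# Join multiple NAME=ARN pairs with commas into a single parameter value:
echo "${SECRET_1},${SECRET_2}"
```

The names on the left (`MY_SOURCE_SECRET`, `WAREHOUSE_SECRET`) are what you reference from recipes as `${MY_SOURCE_SECRET}` and `${WAREHOUSE_SECRET}`.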
To update your Remote Executor deployment (e.g., to deploy a new container version or modify configuration), you'll need to update your existing CloudFormation Stack. This process involves re-deploying the CloudFormation template with your updated parameters while preserving your existing resources.
1. **Access CloudFormation**: Open the CloudFormation Console and select your existing Remote Executor stack.
2. **Update Template**: Replace the current template with the updated template provided by the DataHub team.
3. **Configure Parameters**:
   - **ImageTag**: Specify a new version if upgrading
   - **DatahubGmsURL**: Verify your DataHub URL is correct
4. **Review and Deploy**: Review the changes and apply the update.
:::note
The update process will maintain your existing resources (e.g., secrets, IAM roles) while deploying the new configuration. Monitor the stack events to track the update progress.
:::
The datahub-executor-worker Helm chart provides a streamlined way to deploy Remote Executors on any Kubernetes cluster, including Amazon EKS and Google GKE.
To access the private DataHub Cloud container registry, you'll need to work with your DataHub Cloud representative to set up the necessary permissions:
- For AWS EKS: Provide the IAM principal that will pull from the ECR repository
Create the required secrets in your Kubernetes cluster:
```shell
# Create DataHub PAT secret (required)
# Generate token from Settings > Access Tokens in DataHub UI
kubectl create secret generic datahub-access-token-secret \
  --from-literal=datahub-access-token-secret-key=<DATAHUB-ACCESS-TOKEN>

# Create source credentials (optional)
kubectl create secret generic datahub-secret-store \
  --from-literal=REDSHIFT_PASSWORD=password \
  --from-literal=SNOWFLAKE_PASSWORD=password
```
Add the DataHub Cloud Helm repository and install the chart:
```shell
# Add Helm repository
helm repo add acryl https://executor-helm.acryl.io
helm repo update

# Install the chart
helm install \
  --set global.datahub.executor.pool_id="remote" \
  --set global.datahub.gms.url="https://<your-company>.acryl.io/gms" \
  acryl-executor-worker acryl/datahub-executor-worker
```
Required parameters:
- `global.datahub.executor.pool_id`: Your Executor Pool ID
- `global.datahub.gms.url`: Your DataHub Cloud URL (must include `/gms`)

Starting from DataHub Cloud v0.3.8.2, you can manage secrets using Kubernetes Secret CRDs. This enables runtime secret updates without executor restarts.
Create a Kubernetes secret:
```yaml
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: datahub-secret-store
data:
  REDSHIFT_PASSWORD: <base64-encoded-password>
  SNOWFLAKE_PASSWORD: <base64-encoded-password>
```
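Values under `data` must be base64-encoded. A quick way to produce and verify them (the example password below is illustrative):

```shell
# -n avoids encoding a trailing newline into the secret value
echo -n 'my-redshift-password' | base64
# Round-trip to verify the encoding:
echo -n 'my-redshift-password' | base64 | base64 --decode
```

Alternatively, Kubernetes accepts plain-text values under `stringData` and encodes them for you.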
Mount the secret in your values.yaml:
```yaml
extraVolumes:
  - name: datahub-secret-store
    secret:
      secretName: datahub-secret-store
extraVolumeMounts:
  - mountPath: /mnt/secrets
    name: datahub-secret-store
```
:::note
Secret Configuration:

- Secrets are mounted at `/mnt/secrets` by default (override with `DATAHUB_EXECUTOR_FILE_SECRET_BASEDIR`)
- Secret value length is capped (override with `DATAHUB_EXECUTOR_FILE_SECRET_MAXLEN`)
- Reference secrets in recipes using the `${SECRET_NAME}` syntax
:::
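A minimal sketch of how file-based secrets resolve, using a temp directory to stand in for `/mnt/secrets` (the file-per-secret layout shown reflects how Kubernetes projects Secret keys into a volume; the password value is illustrative):

```shell
# Stand-in for the mount path /mnt/secrets
BASEDIR="$(mktemp -d)"
# Kubernetes projects each Secret key as a file named after the key:
printf 'hunter2' > "$BASEDIR/REDSHIFT_PASSWORD"
# Resolving ${REDSHIFT_PASSWORD} amounts to reading that file's contents:
cat "$BASEDIR/REDSHIFT_PASSWORD"
```

Because the secret is read from the file at execution time, updating the Kubernetes Secret updates the value without restarting the executor.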
Example ingestion recipe using mounted secrets:
```yaml
source:
  type: redshift
  config:
    host_port: "<redshift-host:port>"
    username: connector_test
    password: "${REDSHIFT_PASSWORD}"
    # ... other configuration ...
```
For additional configuration options, refer to the values.yaml file in the Helm chart repository.
To update your Kubernetes deployment (e.g., to deploy a new image version or modify configuration), you'll need to upgrade your existing Helm release, passing any new parameters while reusing your existing values.
```shell
# Update Helm repository
helm repo update acryl

# Upgrade your existing Helm release
# See https://helm.sh/docs/helm/helm_upgrade/ for more options
# Add --set <key>="<value>" lines for any new options you need to set
helm upgrade \
  --reuse-values \
  --set <key>="<value>" \
  acryl-executor-worker acryl/datahub-executor-worker
```
For configuration options, refer to the values.yaml file in the Helm chart repository.
Once you have successfully deployed the Executor in your environment, DataHub will automatically begin reporting Executor Status in the UI.

After you have created an Executor Pool and deployed the Executor within your environment, you are now ready to configure an Ingestion Source to run in that Pool.
:::note New Ingestion Sources will automatically use your designated Default Pool if you have assigned one. You can override this assignment when creating or editing an Ingestion Source at any time. :::
Executors use a weight-based queuing system to manage resource allocation efficiently:
The following environment variables can be configured to manage memory-intensive ingestion tasks, prevent resource contention, and ensure stable execution of resource-demanding processes:
- `DATAHUB_EXECUTOR_INGESTION_MAX_WORKERS` (default: 4) - Maximum concurrent Ingestion tasks
- `DATAHUB_EXECUTOR_MONITORS_MAX_WORKERS` (default: 10) - Maximum concurrent Observe monitoring tasks
- `EXECUTOR_TASK_MEMORY_LIMIT` - Memory limit per task in kilobytes, configured per Ingestion Source under Extra Environment Variables. This setting helps prevent the executor's master process from being OOM-killed and protects against memory-leaking ingestion tasks. Example configuration:

  ```json
  { "EXECUTOR_TASK_MEMORY_LIMIT": "128000000" }
  ```

- `EXECUTOR_TASK_WEIGHT` - Task weight for resource allocation, configured per Ingestion Source under Extra Environment Variables. By default, each task is assigned a weight of 1/MAX_THREADS (e.g., 0.25 with 4 threads). The total weight of concurrent tasks cannot exceed 1.0. Example configuration for a resource-intensive task:

  ```json
  { "EXECUTOR_TASK_WEIGHT": "1.0" }
  ```
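To make the units and weight math above concrete (the values here are illustrative, not recommendations):

```shell
# EXECUTOR_TASK_MEMORY_LIMIT is expressed in kilobytes; a 4 GiB cap would be:
echo $((4 * 1024 * 1024))   # prints 4194304 (kilobytes)

# With the default of 4 ingestion workers, each task defaults to weight
# 1/4 = 0.25, so up to 4 such tasks run concurrently (total weight <= 1.0):
MAX_WORKERS=4
awk "BEGIN { w = 1 / $MAX_WORKERS; print w, int(1.0 / w) }"   # prints 0.25 4
```

A task configured with `EXECUTOR_TASK_WEIGHT` of `1.0` therefore occupies the entire allowance and runs alone.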
Connection Failed
Secret Access Failed
Container Failed to Start
**Do AWS Secrets Manager secrets automatically update in the executor?**
No. Secrets are wired into the executor container at deployment time. The ECS Task needs to be restarted when secrets change.
**How can I verify successful deployment?**
For ECS deployments, check the AWS Console: open the executor task's logs and look for the line:

```
Starting datahub executor worker
```

This indicates a successful connection to DataHub Cloud.