contrib/metadata-model-extensions/datahub-demo-dataset-governance-validator/README.md
A demo custom validator that shows how to ensure all datasets in DataHub have required governance metadata (ownership, tags, and domain) before they can be created or updated. This is intended as an example and starting point for building your own validators.
This validator implements DataHub's AspectPayloadValidator interface to enforce governance requirements: every dataset must have ownership, tags, and a domain.
The validator operates at the batch level, validating multiple Metadata Change Proposals (MCPs) together for optimal performance.
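The batch-level check can be pictured with a small Python sketch. This is a simplified model, not the Java implementation: each MCP is reduced to a hypothetical `(entity_urn, aspect_name)` pair, and the optional `existing_aspects` argument is an assumption about how already-stored aspects might be taken into account.

```python
from collections import defaultdict

# Required governance aspects, as named in this README.
REQUIRED_ASPECTS = {"ownership", "globalTags", "domains"}


def validate_batch(proposals, existing_aspects=None):
    """Return {urn: sorted missing aspect names} for dataset URNs in the batch.

    proposals: iterable of (entity_urn, aspect_name) pairs, a stand-in for MCPs.
    existing_aspects: optional {urn: set of aspect names} already stored (assumption).
    """
    existing_aspects = existing_aspects or {}

    # One pass over the batch: collect the aspects proposed per dataset URN.
    seen = defaultdict(set)
    for urn, aspect_name in proposals:
        if urn.startswith("urn:li:dataset:"):
            seen[urn].add(aspect_name)

    # One check per dataset instead of one check per proposal.
    failures = {}
    for urn, aspects in seen.items():
        present = aspects | existing_aspects.get(urn, set())
        missing = REQUIRED_ASPECTS - present
        if missing:
            failures[urn] = sorted(missing)
    return failures
```

Grouping by URN first means a dataset whose ownership, tags, and domain arrive in separate proposals within the same batch is evaluated as a whole.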
Build a custom DataHub GMS image with the validator included:
./gradlew build
This creates build/libs/datahub-dataset-governance-validator-1.0.0.jar
Create Dockerfile:
# This example starts from the current stable DataHub image; any GMS image can be used instead.
FROM acryldata/datahub-gms:v1.2.0
# Create plugins directory
RUN mkdir -p /etc/datahub/plugins/models/dataset-governance-validator/1.0.0
# Copy validator JAR and configuration
COPY build/libs/datahub-dataset-governance-validator-1.0.0.jar \
/etc/datahub/plugins/models/dataset-governance-validator/1.0.0/
COPY src/main/resources/entity-registry.yml \
/etc/datahub/plugins/models/dataset-governance-validator/1.0.0/
# DataHub will auto-discover plugins in this directory
# Build custom image (replace 'your-registry' with your Docker registry)
docker build -t your-registry/datahub-gms-with-validator:v1.2.0 .
# Push to registry
docker push your-registry/datahub-gms-with-validator:v1.2.0
services:
datahub-gms:
image: your-registry/datahub-gms-with-validator:v1.2.0 # Use your custom image
# ... rest of configuration
For development or testing environments, you can mount the validator as a volume:
./gradlew deployPlugin
This creates the plugin at ~/.datahub/plugins/models/dataset-governance-validator/1.0.0/
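Before mounting the directory, it can help to confirm that it contains what GMS needs. The following is a minimal sketch, assuming the plugin version directory should hold at least one JAR and an `entity-registry.yml` (as in the Dockerfile steps above); `check_plugin_dir` is a hypothetical helper, not part of DataHub.

```python
from pathlib import Path


def check_plugin_dir(plugin_dir):
    """Return a list of problems found in a plugin version directory (empty if none)."""
    directory = Path(plugin_dir).expanduser()
    if not directory.is_dir():
        return [f"directory does not exist: {directory}"]

    problems = []
    # The validator JAR is expected directly inside the version directory.
    if not any(directory.glob("*.jar")):
        problems.append("no JAR file found")
    # The plugin configuration is expected next to the JAR.
    if not (directory / "entity-registry.yml").is_file():
        problems.append("entity-registry.yml is missing")
    return problems
```

For example, `check_plugin_dir("~/.datahub/plugins/models/dataset-governance-validator/1.0.0")` returns an empty list when both files are present.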
services:
datahub-gms:
image: acryldata/datahub-gms:v1.2.0
volumes:
- ~/.datahub/plugins/models:/etc/datahub/plugins/models
# ... rest of configuration
A ConfigMap is also a valid strategy. Note, however, that the validator JAR and its configuration must fit within the maximum ConfigMap size (1 MiB).
# Create ConfigMap with validator JAR and config
kubectl create configmap dataset-governance-validator \
--from-file=datahub-dataset-governance-validator-1.0.0.jar=build/libs/datahub-dataset-governance-validator-1.0.0.jar \
--from-file=entity-registry.yml=src/main/resources/entity-registry.yml
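Whether the files fit can be checked before running the command above. This is a rough sketch that sums the raw file sizes and compares them with the 1 MiB limit. The function names are hypothetical, and the raw byte sum is only an approximation of what the API server accepts, so leaving some headroom is prudent.

```python
import os

CONFIGMAP_LIMIT_BYTES = 1024 * 1024  # 1 MiB, the limit cited above


def configmap_payload_size(paths):
    """Total size in bytes of the files that would go into the ConfigMap."""
    return sum(os.path.getsize(path) for path in paths)


def fits_in_configmap(paths, limit=CONFIGMAP_LIMIT_BYTES):
    """True if the combined file size does not exceed the limit."""
    return configmap_payload_size(paths) <= limit


if __name__ == "__main__":
    files = [
        "build/libs/datahub-dataset-governance-validator-1.0.0.jar",
        "src/main/resources/entity-registry.yml",
    ]
    print(configmap_payload_size(files), fits_in_configmap(files))
```

If the check fails, the custom image or init container approaches avoid the size limit.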
apiVersion: apps/v1
kind: Deployment
metadata:
name: datahub-gms
spec:
template:
spec:
containers:
- name: datahub-gms
image: acryldata/datahub-gms:v1.2.0
volumeMounts:
- name: validator-plugin
mountPath: /etc/datahub/plugins/models/dataset-governance-validator/1.0.0
volumes:
- name: validator-plugin
configMap:
name: dataset-governance-validator
apiVersion: apps/v1
kind: Deployment
metadata:
name: datahub-gms
spec:
template:
spec:
initContainers:
- name: install-validator
image: your-registry/validator-installer:latest
command: ["cp", "-r", "/validator/", "/shared/plugins/"]
volumeMounts:
- name: plugins-volume
mountPath: /shared/plugins
containers:
- name: datahub-gms
image: acryldata/datahub-gms:v1.2.0
volumeMounts:
- name: plugins-volume
mountPath: /etc/datahub/plugins/models
volumes:
- name: plugins-volume
emptyDir: {}
The validator is configured via src/main/resources/entity-registry.yml:
id: "dataset-governance-validator"
entities:
- name: dataset
category: core
aspects:
- name: schemaMetadata
plugins:
aspectPayloadValidators:
- className: "com.linkedin.metadata.aspect.plugins.validation.DatasetGovernanceValidator"
supportedOperations: ["CREATE", "UPSERT"]
supportedEntityAspectNames:
- entityName: "dataset"
aspectName: "*"
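One way to read this configuration is as a filter that decides whether the validator runs for a given change. The sketch below is a mental model only, not DataHub's actual matching code. It assumes `"*"` acts as a wildcard for both `entityName` and `aspectName`, and `validator_applies` is a hypothetical name.

```python
# Mirrors the aspectPayloadValidators entry above as a plain dict.
VALIDATOR_CONFIG = {
    "supportedOperations": ["CREATE", "UPSERT"],
    "supportedEntityAspectNames": [
        {"entityName": "dataset", "aspectName": "*"},
    ],
}


def validator_applies(config, operation, entity_name, aspect_name):
    """Return True if the config selects this operation/entity/aspect combination."""
    # The operation must be listed explicitly.
    if operation not in config["supportedOperations"]:
        return False
    # Any one matching entity/aspect entry is enough ("*" is assumed to be a wildcard).
    for entry in config["supportedEntityAspectNames"]:
        entity_ok = entry["entityName"] in ("*", entity_name)
        aspect_ok = entry["aspectName"] in ("*", aspect_name)
        if entity_ok and aspect_ok:
            return True
    return False
```

Under this reading, an `UPSERT` of any dataset aspect triggers the validator, while an operation not listed in `supportedOperations` (for example `PATCH`) does not.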
To modify validation requirements, edit DatasetGovernanceValidator.java:
// Change required aspects
private static final Set<String> REQUIRED_ASPECTS = Set.of(
Constants.OWNERSHIP_ASPECT_NAME, // "ownership"
Constants.GLOBAL_TAGS_ASPECT_NAME, // "globalTags"
Constants.DOMAINS_ASPECT_NAME, // "domains"
// Add more as needed:
// Constants.GLOSSARY_TERMS_ASPECT_NAME // "glossaryTerms"
);
Monitor GMS logs during startup:
# Docker
docker logs datahub-gms | grep -i "validator\|plugin"
# Kubernetes
kubectl logs deployment/datahub-gms | grep -i "validator\|plugin"
Expected output:
Enabled 1 plugins. [com.linkedin.metadata.aspect.plugins.validation.DatasetGovernanceValidator]
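For automated checks (for example in CI), the same log line can be matched programmatically. This is a small sketch based only on the expected output shown above. The assumption that multiple plugins appear comma-separated inside the brackets is a guess, and `find_enabled_plugins` is a hypothetical helper.

```python
import re

# Matches lines shaped like: Enabled 1 plugins. [fully.qualified.ClassName]
ENABLED_RE = re.compile(r"Enabled (\d+) plugins\. \[(.*?)\]")


def find_enabled_plugins(log_text):
    """Return the plugin class names reported as enabled in the given log text."""
    names = []
    for match in ENABLED_RE.finditer(log_text):
        # Assumption: several plugins would be listed comma-separated.
        names.extend(n.strip() for n in match.group(2).split(",") if n.strip())
    return names
```

The log text can be captured from `docker logs` or `kubectl logs` and passed in as a string; the validator is loaded if its class name appears in the returned list.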
Run the Python test script:
# Set your DataHub token
export DATAHUB_TOKEN="your-token-here"
# Run the test script
python test_validator.py
The script will test both scenarios:
Plugin Not Loading
Validation Not Triggering
supportedOperations includes the operation being performed
Key Benefits of Batch Validation:
DatasetGovernanceValidator.java - Main validator implementation
entity-registry.yml - Plugin configuration
build.gradle - Build configuration and deployment tasks
test_validator.py - Python test script for validation testing
LOCAL_DEVELOPER_STEP_BY_STEP_GUIDE.md - Detailed development setup
Dockerfile - Custom GMS image definition