docs/advanced/monitoring.md
Monitoring DataHub's system components is essential for maintaining operational excellence, troubleshooting performance issues, and ensuring system reliability. This comprehensive guide covers how to implement observability in DataHub through tracing and metrics, and how to extract valuable insights from your running instances.
Effective monitoring enables you to:
DataHub's observability strategy consists of two complementary approaches:
Metrics Collection
- Purpose: Aggregate statistical data about system behavior over time
- Technology: Transitioning from DropWizard/JMX to Micrometer
- Current State: DropWizard metrics exposed via JMX, collected by Prometheus
- Future Direction: Native Micrometer integration for Spring-based metrics
- Compatibility: Prometheus-compatible format with support for other metrics backends
Key Metrics Categories:
Distributed Tracing
- Purpose: Track individual requests as they flow through multiple services and components
- Technology: OpenTelemetry-based instrumentation
Key Benefits:
DataHub provides comprehensive instrumentation for its GraphQL API through Micrometer metrics, enabling detailed performance monitoring and debugging capabilities. The instrumentation system offers flexible configuration options to balance between observability depth and performance overhead.
Traditional GraphQL monitoring only tells you "the search query is slow" but not why. Without path-level instrumentation, you're blind to which specific fields are causing performance bottlenecks in complex nested queries.
Consider this GraphQL query:
query getSearchResults {
  search(input: { query: "sales data" }) {
    searchResults {
      entity {
        ... on Dataset {
          name
          owner {
            # Path: /search/searchResults/entity/owner
            corpUser {
              displayName
            }
          }
          lineage {
            # Path: /search/searchResults/entity/lineage
            upstreamCount
            downstreamCount
            upstreamEntities {
              urn
              name
            }
          }
          schemaMetadata {
            # Path: /search/searchResults/entity/schemaMetadata
            fields {
              fieldPath
              description
            }
          }
        }
      }
    }
  }
}
With path-level metrics, you discover:
- `/search/searchResults/entity/owner` - 50ms (fast, well-cached)
- `/search/searchResults/entity/lineage` - 2500ms (SLOW! hitting graph database)
- `/search/searchResults/entity/schemaMetadata` - 150ms (acceptable)

Without path metrics: "Search query takes 3 seconds"
With path metrics: "Lineage resolution is the bottleneck"
Instead of guessing, you know exactly which resolver needs optimization. Maybe lineage needs better caching or pagination.
Identify expensive patterns like:
# These paths consistently slow:
/*/lineage/upstreamEntities/*
/*/siblings/*/platform
# Action: Add field-level caching or lazy loading
Different clients request different fields; path instrumentation shows which parts of the schema each workload actually exercises.
Spot resolver patterns that indicate N+1 problems:
/users/0/permissions - 10ms
/users/1/permissions - 10ms
/users/2/permissions - 10ms
... (100 more times)
Start targeted to minimize overhead:
# Focus on known slow operations
fieldLevelOperations: "searchAcrossEntities,getDataset"
# Target expensive resolver paths
fieldLevelPaths: "/**/lineage/**,/**/relationships/**,/**/privileges"
The GraphQL instrumentation is implemented through GraphQLTimingInstrumentation, which extends GraphQL Java's instrumentation framework. It provides:
Metric: graphql.request.duration
- `operation`: Operation name (e.g., "getSearchResultsForMultiple")
- `operation.type`: Query, mutation, or subscription
- `success`: true/false based on error presence
- `field.filtering`: Filtering mode applied (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, BY_BOTH)

Metric: graphql.request.errors

- `operation`: Operation name
- `operation.type`: Query, mutation, or subscription

Metric: graphql.field.duration

- `parent.type`: GraphQL parent type (e.g., "Dataset", "User")
- `field`: Field name being resolved
- `operation`: Operation name context
- `success`: true/false
- `path`: Field path (optional, controlled by fieldLevelPathEnabled)

Metric: graphql.field.errors

Metric: graphql.fields.instrumented

- `operation`: Operation name
- `filtering.mode`: Active filtering mode

graphQL:
  metrics:
    # Master switch for all GraphQL metrics
    enabled: ${GRAPHQL_METRICS_ENABLED:true}
    # Enable field-level resolver metrics
    fieldLevelEnabled: ${GRAPHQL_METRICS_FIELD_LEVEL_ENABLED:false}
Field-level metrics can add significant overhead for complex queries. DataHub provides multiple strategies to control which fields are instrumented:
Target specific GraphQL operations known to be slow or critical:
fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineageStructure"
Use path patterns to instrument specific parts of your schema:
fieldLevelPaths: "/search/results/**,/user/*/permissions,/**/lineage/*"
Path Pattern Syntax:
- `/user` - Exact match for the user field
- `/user/*` - Direct children of user (e.g., /user/name, /user/email)
- `/user/**` - User field and all descendants at any depth
- `/*/comments/*` - Comments field under any parent

When both operation and path filters are configured, only fields matching BOTH criteria are instrumented:
# Only instrument search results within specific operations
fieldLevelOperations: "searchAcrossEntities"
fieldLevelPaths: "/searchResults/**"
# Include field paths as metric tags (WARNING: high cardinality risk)
fieldLevelPathEnabled: false
# Include metrics for trivial property access
trivialDataFetchersEnabled: false
The instrumentation automatically determines the most efficient filtering mode (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, or BY_BOTH) based on which filters are configured.
Field-level instrumentation overhead varies by:
Start Conservative: Begin with field-level metrics disabled
fieldLevelEnabled: false
Target Known Issues: Enable selectively for problematic operations
fieldLevelEnabled: true
fieldLevelOperations: "slowSearchQuery,complexLineageQuery"
Use Path Patterns Wisely: Focus on expensive resolver paths
fieldLevelPaths: "/search/**,/**/lineage/**"
Avoid Path Tags in Production: High cardinality risk
fieldLevelPathEnabled: false # Keep this false
Monitor Instrumentation Overhead: Track the graphql.fields.instrumented metric
Example configuration for debugging, with everything enabled (including path tags):

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "" # All operations
    fieldLevelPathEnabled: true # Include paths for debugging
    trivialDataFetchersEnabled: true
Example configuration targeting specific operations and paths while keeping cardinality low:

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineage"
    fieldLevelPaths: "/search/results/*,/lineage/upstream/**,/lineage/downstream/**"
    fieldLevelPathEnabled: false
    trivialDataFetchersEnabled: false
Example configuration with field-level metrics disabled:

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: false # Only request-level metrics
When investigating GraphQL performance issues:
fieldLevelOperations: "problematicQuery"
fieldLevelPathEnabled: true # Temporary only!
The GraphQL metrics integrate seamlessly with DataHub's monitoring infrastructure:
Metrics are exposed at the standard Spring Boot Actuator endpoint `/actuator/prometheus`.

Example Prometheus queries:
# Average request duration by operation
rate(graphql_request_duration_seconds_sum[5m])
/ rate(graphql_request_duration_seconds_count[5m])
# Field resolver p99 latency
histogram_quantile(0.99,
rate(graphql_field_duration_seconds_bucket[5m])
)
# Error rate by operation
rate(graphql_request_errors_total[5m])
DataHub provides comprehensive instrumentation for Kafka message consumption through Micrometer metrics, enabling real-time monitoring of message queue latency and consumer performance. This instrumentation is critical for maintaining data freshness SLAs and identifying processing bottlenecks across DataHub's event-driven architecture.
Traditional Kafka lag monitoring only tells you "we're behind by 10,000 messages." Without queue time metrics, you can't answer critical questions like "are we meeting our 5-minute data freshness SLA?" or "which consumer groups are experiencing delays?"
Consider these scenarios:
Variable Production Rate:
Burst Traffic Patterns:
Consumer Group Performance:
Kafka queue time instrumentation is implemented across all DataHub consumers:
Each consumer automatically records queue time metrics using the message's embedded timestamp.
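As a rough sketch of how such a measurement can be taken with Micrometer (the class and wiring below are illustrative, not DataHub's exact code), a consumer compares the record's embedded timestamp with the current time and records the difference on a timer tagged with topic and consumer group:

```java
// Illustrative sketch: record how long a message sat in Kafka before being consumed,
// using the record's embedded timestamp and a Micrometer Timer.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.time.Duration;

public class QueueTimeRecorder {
  private final MeterRegistry registry;

  public QueueTimeRecorder(MeterRegistry registry) {
    this.registry = registry;
  }

  public void record(ConsumerRecord<?, ?> record, String consumerGroup) {
    long queueTimeMs = System.currentTimeMillis() - record.timestamp();
    Timer.builder("kafka.message.queue.time")
        .tag("topic", record.topic())
        .tag("consumer.group", consumerGroup)
        .publishPercentiles(0.5, 0.95, 0.99, 0.999) // matches the configured percentiles
        .register(registry)
        .record(Duration.ofMillis(Math.max(queueTimeMs, 0)));
  }
}
```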
Metric: kafka.message.queue.time
The timer automatically tracks:
Default Configuration:
kafka:
  consumer:
    metrics:
      # Percentiles to calculate
      percentiles: "0.5,0.95,0.99,0.999"
      # Service Level Objective buckets (seconds)
      slo: "300,1800,3600,10800,21600,43200" # 5m,30m,1h,3h,6h,12h
      # Maximum expected queue time
      maxExpectedValue: 86400 # 24 hours (seconds)
SLA Compliance Monitoring:
# Percentage of messages processed within 5-minute SLA
sum(rate(kafka_message_queue_time_seconds_bucket{le="300"}[5m])) by (topic)
/ sum(rate(kafka_message_queue_time_seconds_count[5m])) by (topic) * 100
Consumer Group Comparison:
# P99 queue time by consumer group
histogram_quantile(0.99,
sum by (consumer_group, le) (
rate(kafka_message_queue_time_seconds_bucket[5m])
)
)
Metric Cardinality:
The instrumentation is designed for low cardinality:
- Tags are limited to `topic` and `consumer.group`

Overhead Assessment:
The new Micrometer-based queue time metrics coexist with the legacy DropWizard kafkaLag histogram:
- Legacy: `kafkaLag` histogram via JMX
- New: `kafka.message.queue.time` timer via Micrometer

The new metrics provide:
DataHub provides comprehensive instrumentation for measuring the latency from initial request submission to post-MCL (Metadata Change Log) hook execution. This metric is crucial for understanding the end-to-end processing time of metadata changes, including both the time spent in Kafka queues and the time taken to process through the system to the final hooks.
Traditional metrics only show individual component performance. Request hook latency provides the complete picture of how long it takes for a metadata change to be fully processed through DataHub's pipeline:
This end-to-end view is essential for:
Hook latency metrics are configured separately from Kafka consumer metrics to allow fine-tuning based on your specific requirements:
datahub:
  metrics:
    # Measures the time from request to post-MCL hook execution
    hookLatency:
      # Percentiles to calculate for latency distribution
      percentiles: "0.5,0.95,0.99,0.999"
      # Service Level Objective buckets (seconds)
      # These define the latency targets you want to track
      slo: "300,1800,3600,10800,21600,43200" # 5m, 30m, 1h, 3h, 6h, 12h
      # Maximum expected latency (seconds)
      # Values above this are considered outliers
      maxExpectedValue: 86400 # 24 hours
Metric: datahub.request.hook.queue.time
- `hook`: Name of the MCL hook being executed (e.g., "IngestionSchedulerHook", "SiblingsHook")

SLA Compliance by Hook:
Monitor which hooks are meeting their latency SLAs:
# Percentage of requests processed within 5-minute SLA per hook
sum(rate(datahub_request_hook_queue_time_seconds_bucket{le="300"}[5m])) by (hook)
/ sum(rate(datahub_request_hook_queue_time_seconds_count[5m])) by (hook) * 100
Hook Performance Comparison:
Identify which hooks have the highest latency:
# P99 latency by hook
histogram_quantile(0.99,
sum by (hook, le) (
rate(datahub_request_hook_queue_time_seconds_bucket[5m])
)
)
Latency Trends:
Track how hook latency changes over time:
# Average hook latency trend
avg by (hook) (
rate(datahub_request_hook_queue_time_seconds_sum[5m])
/ rate(datahub_request_hook_queue_time_seconds_count[5m])
)
The hook latency metric leverages the trace ID embedded in the system metadata of each request to determine when the request was originally submitted.
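A rough sketch of the idea, assuming the original submission timestamp can be recovered from the message's system metadata (the class and method names are hypothetical, not DataHub's actual implementation):

```java
// Rough sketch: measure end-to-end latency by comparing the original request timestamp
// carried with the message to the time the post-MCL hook executes, tagged by hook name.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class HookLatencyRecorder {
  private final MeterRegistry registry;

  public HookLatencyRecorder(MeterRegistry registry) {
    this.registry = registry;
  }

  public void recordHookLatency(String hookName, long requestTimestampMillis) {
    long latencyMs = System.currentTimeMillis() - requestTimestampMillis;
    Timer.builder("datahub.request.hook.queue.time")
        .tag("hook", hookName) // e.g. "SiblingsHook"
        .register(registry)
        .record(Duration.ofMillis(Math.max(latencyMs, 0)));
  }
}
```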
While Kafka queue time metrics (kafka.message.queue.time) measure the time messages spend in Kafka topics, request hook
latency metrics provide the complete picture:
Together, these metrics help identify where delays occur:
Emitted on all aspect writes (REST, GraphQL, MCP) to track sizes and detect oversized aspects.
Metrics:
- `aspectSizeValidation.prePatch.sizeDistribution` - Size distribution of existing aspects (tags: aspectName, sizeBucket)
- `aspectSizeValidation.postPatch.sizeDistribution` - Size distribution of aspects being written (tags: aspectName, sizeBucket)
- `aspectSizeValidation.prePatch.oversized` - Oversized aspects found in database (tags: aspectName, remediation)
- `aspectSizeValidation.postPatch.oversized` - Oversized aspects rejected during writes (tags: aspectName, remediation)
- `aspectSizeValidation.prePatch.warning` - Aspects approaching limit in database (tags: aspectName)
- `aspectSizeValidation.postPatch.warning` - Aspects approaching limit during writes (tags: aspectName)

Configuration:
See Aspect Size Validation for details.
datahub:
  validation:
    aspectSize:
      metrics:
        sizeBuckets: [1048576, 5242880, 10485760, 15728640]
Default buckets (1MB, 5MB, 10MB, 15MB) create ranges: 0-1MB, 1MB-5MB, 5MB-10MB, 10MB-15MB, 15MB+
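As an illustration of how a write's size could be mapped to one of these buckets and recorded (a sketch only; the exact meter type DataHub uses internally is not shown here, and the helper class is hypothetical):

```java
// Illustrative sketch: map an aspect's serialized size to a configured bucket label
// and count it under the post-patch size-distribution metric.
import io.micrometer.core.instrument.MeterRegistry;

public class AspectSizeMetrics {
  // Default thresholds: 1MB, 5MB, 10MB, 15MB
  private static final long[] BUCKETS = {1048576L, 5242880L, 10485760L, 15728640L};

  public static void recordPostPatchSize(MeterRegistry registry, String aspectName, long sizeBytes) {
    registry.counter(
        "aspectSizeValidation.postPatch.sizeDistribution",
        "aspectName", aspectName,
        "sizeBucket", bucketLabel(sizeBytes))
        .increment();
  }

  private static String bucketLabel(long sizeBytes) {
    if (sizeBytes < BUCKETS[0]) return "0-1MB";
    if (sizeBytes < BUCKETS[1]) return "1MB-5MB";
    if (sizeBytes < BUCKETS[2]) return "5MB-10MB";
    if (sizeBytes < BUCKETS[3]) return "10MB-15MB";
    return "15MB+";
  }
}
```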
Micrometer provides automatic instrumentation for cache implementations, offering deep insights into cache performance and efficiency. This instrumentation is crucial for DataHub, where caching significantly impacts query performance and system load.
When caches are registered with Micrometer, comprehensive metrics are automatically collected without code changes:
- `cache.size` (Gauge) - Current number of entries in the cache
- `cache.gets` (Counter) - Cache access attempts, tagged with:
  - `result=hit` - Successful cache hits
  - `result=miss` - Cache misses requiring backend fetch
- `cache.puts` (Counter) - Number of entries added to cache
- `cache.evictions` (Counter) - Number of entries evicted
- `cache.eviction.weight` (Counter) - Total weight of evicted entries (for size-based eviction)

Calculate key performance indicators using Prometheus queries:
# Cache hit rate (should be >80% for hot caches)
sum(rate(cache_gets_total{result="hit"}[5m])) by (cache) /
sum(rate(cache_gets_total[5m])) by (cache)
# Cache miss rate
1 - (cache_hit_rate)
# Eviction rate (indicates cache pressure)
rate(cache_evictions_total[5m])
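As a minimal sketch of how a cache can be bound to the meter registry so these metrics appear automatically (assuming a Caffeine cache; the "entityClient" cache name is illustrative):

```java
// Minimal sketch: bind a Caffeine cache to Micrometer so cache.* metrics
// (size, gets, puts, evictions) are emitted for it automatically.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.cache.CaffeineCacheMetrics;

public class CacheMetricsExample {
  public static Cache<String, Object> buildInstrumentedCache(MeterRegistry registry) {
    Cache<String, Object> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .recordStats() // required for hit/miss statistics
        .build();
    // Registers the cache under the name "entityClient" in the meter registry
    CaffeineCacheMetrics.monitor(registry, cache, "entityClient");
    return cache;
  }
}
```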
DataHub uses multiple cache layers, each automatically instrumented:
cache.client.entityClient:
  enabled: true
  maxBytes: 104857600 # 100MB
  entityAspectTTLSeconds:
    corpuser:
      corpUserInfo: 20 # Short TTL for frequently changing data
      corpUserKey: 300 # Longer TTL for stable data
    structuredProperty:
      propertyDefinition: 300
      structuredPropertyKey: 86400 # 1 day for very stable data
cache.client.usageClient:
  enabled: true
  maxBytes: 52428800 # 50MB
  defaultTTLSeconds: 86400 # 1 day
  # Caches expensive usage calculations
cache.search.lineage:
  ttlSeconds: 86400 # 1 day
Hit Rate by Cache Type
# Alert if hit rate drops below 70%
cache_hit_rate < 0.7
Memory Pressure
# High eviction rate relative to puts
rate(cache_evictions_total[5m]) / rate(cache_puts_total[5m]) > 0.1
Micrometer automatically instruments Java ThreadPoolExecutor instances, providing crucial visibility into concurrency
bottlenecks and resource utilization. For DataHub's concurrent operations, this monitoring is essential for maintaining
performance under load.
- `executor.pool.size` (Gauge) - Current number of threads in pool
- `executor.pool.core` (Gauge) - Core (minimum) pool size
- `executor.pool.max` (Gauge) - Maximum allowed pool size
- `executor.active` (Gauge) - Threads actively executing tasks
- `executor.queued` (Gauge) - Tasks waiting in queue
- `executor.queue.remaining` (Gauge) - Available queue capacity
- `executor.completed` (Counter) - Total completed tasks
- `executor.seconds` (Timer) - Task execution time distribution
- `executor.rejected` (Counter) - Tasks rejected due to saturation

graphQL.concurrency:
  separateThreadPool: true
  corePoolSize: 20 # Base threads
  maxPoolSize: 200 # Scale under load
  keepAlive: 60 # Seconds before idle thread removal
# Handles complex GraphQL query resolution
entityClient.restli:
  get:
    batchConcurrency: 2 # Parallel batch processors
    batchQueueSize: 500 # Task buffer
    batchThreadKeepAlive: 60
  ingest:
    batchConcurrency: 2
    batchQueueSize: 500
timeseriesAspectService.query:
  concurrency: 10 # Parallel query threads
  queueSize: 500 # Buffered queries
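These metrics are emitted once a pool is bound to the meter registry. A minimal sketch using Micrometer's ExecutorServiceMetrics (the pool parameters and the "graphql-query" name are illustrative, not DataHub's exact values):

```java
// Minimal sketch: wrap a ThreadPoolExecutor with Micrometer's ExecutorServiceMetrics
// so pool-size, queue, and task-timing metrics are collected automatically.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorMetricsExample {
  public static ExecutorService buildInstrumentedExecutor(MeterRegistry registry) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        20, 200,                        // core and max pool size
        60, TimeUnit.SECONDS,           // keep-alive for idle threads
        new LinkedBlockingQueue<>(500)  // bounded task queue
    );
    // Returns a monitored ExecutorService; submit tasks through the returned instance
    return ExecutorServiceMetrics.monitor(registry, executor, "graphql-query", Tags.empty());
  }
}
```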
# Thread pool utilization (>0.8 indicates pressure)
executor_active / executor_pool_size > 0.8
# Queue filling up (>0.7 indicates backpressure)
executor_queued / (executor_queued + executor_queue_remaining) > 0.7
# Task rejections (should be zero)
rate(executor_rejected_total[1m]) > 0
# Thread starvation (all threads busy for extended period)
avg_over_time(executor_active[5m]) >= executor_pool_core
# Average task execution time
rate(executor_seconds_sum[5m]) / rate(executor_seconds_count[5m])
# Task throughput by executor
rate(executor_completed_total[5m])
| Symptom | Metric Pattern | Solution |
|---|---|---|
| High latency | executor_queued rising | Increase pool size |
| Rejections | executor_rejected > 0 | Increase queue size or pool max |
| Memory pressure | Many idle threads | Reduce keepAlive time |
| CPU waste | Low executor_active | Reduce core pool size |
Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which are units of work containing context about the work being done as well as the time taken to finish it. By looking at a trace, we can more easily identify performance bottlenecks.
We enable tracing by using the OpenTelemetry Java instrumentation library. This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture telemetry from popular libraries.
Using the agent we are able to
- Create custom spans with the `@WithSpan` annotation

You can enable the agent by setting env variable ENABLE_OTEL to true for GMS and MAE/MCE consumers. In our
example docker-compose, we export traces to a local Jaeger
instance by setting env variable OTEL_TRACES_EXPORTER to jaeger
and OTEL_EXPORTER_JAEGER_ENDPOINT to http://jaeger-all-in-one:14250, but you can easily change this behavior by
setting the correct env variables. Refer to
this doc for
all configs.
Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added
the @WithSpan annotation in various places to make the trace more readable. You should start to see traces in the
tracing collector of choice. Our example docker-compose deploys
an instance of Jaeger with port 16686. The traces should be available at http://localhost:16686.
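As a sketch of what such custom instrumentation looks like (the class and method names here are hypothetical, not DataHub's actual code):

```java
// Illustrative only: annotate a method so the OpenTelemetry agent records it as a span.
// By default the span is named after the class and method.
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class EntityLookupService {

  @WithSpan // recorded as "EntityLookupService.lookupEntity" in the trace
  public String lookupEntity(@SpanAttribute("urn") String urn) {
    // ... fetch the entity; time spent here shows up as a span in Jaeger
    return urn;
  }
}
```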
We recommend using either `grpc` or `http/protobuf`, configured using `OTEL_EXPORTER_OTLP_PROTOCOL`. Avoid using `http`, as it will not work as expected due to the size of the generated spans.
DataHub is transitioning to Micrometer as its primary metrics framework, representing a significant upgrade in observability capabilities. Micrometer is a vendor-neutral application metrics facade that provides a simple, consistent API for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in.
Native Spring Integration
As DataHub uses Spring Boot, Micrometer provides seamless integration with:
Multi-Backend Support
Unlike the legacy DropWizard approach that primarily targets JMX, Micrometer natively supports:
Dimensional Metrics
Micrometer embraces modern dimensional metrics, where each measurement carries labels/tags that can be filtered and aggregated flexibly in the metrics backend.
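For example, a single dimensional counter with tags can replace a whole family of flat, hierarchical metric names (the metric and tag names below are illustrative):

```java
// Illustrative: one dimensional metric with tags instead of many flat JMX names
// such as "requests.search.dataset.success".
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class DimensionalMetricsExample {
  public static void recordRequest(MeterRegistry registry,
                                   String endpoint, String entityType, boolean success) {
    Counter.builder("datahub.requests")
        .tag("endpoint", endpoint)       // e.g. "search"
        .tag("entity.type", entityType)  // e.g. "dataset"
        .tag("success", Boolean.toString(success))
        .register(registry)
        .increment();
  }
}
```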
DataHub is undertaking a strategic transition from DropWizard metrics (exposed via JMX) to Micrometer, a modern vendor-neutral metrics facade. This transition aims to provide better cloud-native monitoring capabilities while maintaining backward compatibility for existing monitoring infrastructure.
What We Have Now:
Limitations:
What We're Building:
Key Decisions and Rationale:
Dual Registry Approach
Decision: Run both systems in parallel with tag-based routing
Rationale:
Prometheus as Primary Target
Decision: Focus on Prometheus for new metrics
Rationale:
Observation API Adoption
Decision: Promote Observation API for new instrumentation
Rationale: a single instrumentation point can emit metrics and traces together, keeping telemetry consistent across signals.
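A minimal sketch of what Observation-based instrumentation looks like (the observation name and tag are hypothetical):

```java
// Illustrative: the Micrometer Observation API records a timer metric and, when a
// tracing handler is registered, a span from the same instrumentation point.
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

public class ObservationExample {
  public static String resolveEntity(ObservationRegistry registry, String urn) {
    return Observation.createNotStarted("datahub.entity.resolve", registry)
        .lowCardinalityKeyValue("entity.type", "dataset")
        .observe(() -> {
          // ... do the actual work; duration and outcome are recorded automatically
          return urn;
        });
  }
}
```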
Once fully adopted, Micrometer will transform DataHub's observability from a collection of separate tools into a unified platform. This means developers can focus on building features while getting comprehensive telemetry "for free."
Intelligent and Adaptive Monitoring
Developer and Operator Experience
We originally decided to use Dropwizard Metrics to export custom metrics to JMX,
and then use Prometheus-JMX exporter to export all JMX metrics to
Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use
their tool of choice. You can enable the agent by setting env variable ENABLE_PROMETHEUS to true for GMS and MAE/MCE
consumers. Refer to this example docker-compose for setting the
variables.
In our example docker-compose, we have configured Prometheus to scrape port 4318 of each container, which the JMX exporter uses to expose metrics. We also configured Grafana to read from Prometheus and provide useful dashboards. By default, we provide two dashboards: a JVM dashboard and a DataHub dashboard.
In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub dashboard, you can find charts to monitor each endpoint and the Kafka topics. Using the example implementation, go to http://localhost:3001 to find the Grafana dashboards! (Username: admin, PW: admin)
To make it easy to track various metrics within the code base, we created the MetricUtils class. This util class creates a central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. You can run the following to create a counter and increment it.
metricUtils.counter(this.getClass(), "metricName").increment();
You can run the following to time a block of code.
try (Timer.Context ignored = metricUtils.timer(this.getClass(), "timerName").time()) {
    ...block of code
}
We provide some example configuration for enabling monitoring in this directory. Take a look at the docker-compose files, which add the necessary env variables to existing containers and spawn new containers (Jaeger, Prometheus, Grafana).
You can include the above docker-compose file using the -f <<path-to-compose-file>> option when running docker-compose commands.
For instance,
docker-compose \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
pull && \
docker-compose -p datahub \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
up
We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For
instance, MONITORING=true ./docker/quickstart.sh will add the correct env variables to start collecting traces and
metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.
To monitor the health of your DataHub service, the /admin endpoint can be used.