docs/advanced/monitoring.md
Monitoring DataHub's system components is essential for maintaining operational excellence, troubleshooting performance issues, and ensuring system reliability. This comprehensive guide covers how to implement observability in DataHub through tracing and metrics, and how to extract valuable insights from your running instances.
Effective monitoring enables you to:
DataHub's observability strategy consists of two complementary approaches:
Metrics Collection
- Purpose: Aggregate statistical data about system behavior over time
- Technology: Transitioning from DropWizard/JMX to Micrometer
- Current State: DropWizard metrics exposed via JMX, collected by Prometheus
- Future Direction: Native Micrometer integration for Spring-based metrics
- Compatibility: Prometheus-compatible format with support for other metrics backends
Key Metrics Categories:
Distributed Tracing
- Purpose: Track individual requests as they flow through multiple services and components
- Technology: OpenTelemetry-based instrumentation
Key Benefits:
DataHub provides comprehensive instrumentation for its GraphQL API through Micrometer metrics, enabling detailed performance monitoring and debugging capabilities. The instrumentation system offers flexible configuration options to balance between observability depth and performance overhead.
Traditional GraphQL monitoring only tells you "the search query is slow" but not why. Without path-level instrumentation, you're blind to which specific fields are causing performance bottlenecks in complex nested queries.
Consider this GraphQL query:
query getSearchResults {
  search(input: { query: "sales data" }) {
    searchResults {
      entity {
        ... on Dataset {
          name
          owner {
            # Path: /search/searchResults/entity/owner
            corpUser {
              displayName
            }
          }
          lineage {
            # Path: /search/searchResults/entity/lineage
            upstreamCount
            downstreamCount
            upstreamEntities {
              urn
              name
            }
          }
          schemaMetadata {
            # Path: /search/searchResults/entity/schemaMetadata
            fields {
              fieldPath
              description
            }
          }
        }
      }
    }
  }
}
With path-level metrics, you discover:
- `/search/searchResults/entity/owner` - 50ms (fast, well-cached)
- `/search/searchResults/entity/lineage` - 2500ms (SLOW! hitting graph database)
- `/search/searchResults/entity/schemaMetadata` - 150ms (acceptable)

Without path metrics: "Search query takes 3 seconds"
With path metrics: "Lineage resolution is the bottleneck"
Instead of guessing, you know exactly which resolver needs optimization. Maybe lineage needs better caching or pagination.
Identify expensive patterns like:
# These paths consistently slow:
/*/lineage/upstreamEntities/*
/*/siblings/*/platform
# Action: Add field-level caching or lazy loading
Different clients request different fields; path instrumentation shows which parts of the schema each workload actually exercises.
Spot resolver patterns that indicate N+1 problems:
/users/0/permissions - 10ms
/users/1/permissions - 10ms
/users/2/permissions - 10ms
... (100 more times)
Start targeted to minimize overhead:
# Focus on known slow operations
fieldLevelOperations: "searchAcrossEntities,getDataset"
# Target expensive resolver paths
fieldLevelPaths: "/**/lineage/**,/**/relationships/**,/**/privileges"
The GraphQL instrumentation is implemented through GraphQLTimingInstrumentation, which extends GraphQL Java's instrumentation framework. It provides:
Metric: graphql.request.duration
- `operation`: Operation name (e.g., "getSearchResultsForMultiple")
- `operation.type`: Query, mutation, or subscription
- `success`: true/false based on error presence
- `field.filtering`: Filtering mode applied (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, BY_BOTH)

Metric: graphql.request.errors

- `operation`: Operation name
- `operation.type`: Query, mutation, or subscription

Metric: graphql.field.duration

- `parent.type`: GraphQL parent type (e.g., "Dataset", "User")
- `field`: Field name being resolved
- `operation`: Operation name context
- `success`: true/false
- `path`: Field path (optional, controlled by fieldLevelPathEnabled)

Metric: graphql.field.errors

Metric: graphql.fields.instrumented

- `operation`: Operation name
- `filtering.mode`: Active filtering mode

graphQL:
  metrics:
    # Master switch for all GraphQL metrics
    enabled: ${GRAPHQL_METRICS_ENABLED:true}
    # Enable field-level resolver metrics
    fieldLevelEnabled: ${GRAPHQL_METRICS_FIELD_LEVEL_ENABLED:false}
Field-level metrics can add significant overhead for complex queries. DataHub provides multiple strategies to control which fields are instrumented:
Target specific GraphQL operations known to be slow or critical:
fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineageStructure"
Use path patterns to instrument specific parts of your schema:
fieldLevelPaths: "/search/results/**,/user/*/permissions,/**/lineage/*"
Path Pattern Syntax:
- `/user` - Exact match for the user field
- `/user/*` - Direct children of user (e.g., /user/name, /user/email)
- `/user/**` - User field and all descendants at any depth
- `/*/comments/*` - Comments field under any parent

When both operation and path filters are configured, only fields matching BOTH criteria are instrumented:
# Only instrument search results within specific operations
fieldLevelOperations: "searchAcrossEntities"
fieldLevelPaths: "/searchResults/**"
# Include field paths as metric tags (WARNING: high cardinality risk)
fieldLevelPathEnabled: false
# Include metrics for trivial property access
trivialDataFetchersEnabled: false
The instrumentation automatically determines the most efficient filtering mode (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, or BY_BOTH) based on which filters are configured.
Field-level instrumentation overhead varies by:
Start Conservative: Begin with field-level metrics disabled
fieldLevelEnabled: false
Target Known Issues: Enable selectively for problematic operations
fieldLevelEnabled: true
fieldLevelOperations: "slowSearchQuery,complexLineageQuery"
Use Path Patterns Wisely: Focus on expensive resolver paths
fieldLevelPaths: "/search/**,/**/lineage/**"
Avoid Path Tags in Production: High cardinality risk
fieldLevelPathEnabled: false # Keep this false
Monitor Instrumentation Overhead: Track the graphql.fields.instrumented metric
Example configuration for debugging, with everything enabled (including path tags):

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "" # All operations
    fieldLevelPathEnabled: true # Include paths for debugging
    trivialDataFetchersEnabled: true
Example configuration targeting specific operations and paths while keeping cardinality low:

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineage"
    fieldLevelPaths: "/search/results/*,/lineage/upstream/**,/lineage/downstream/**"
    fieldLevelPathEnabled: false
    trivialDataFetchersEnabled: false
Example configuration with field-level metrics disabled:

graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: false # Only request-level metrics
When investigating GraphQL performance issues:
fieldLevelOperations: "problematicQuery"
fieldLevelPathEnabled: true # Temporary only!
The GraphQL metrics integrate seamlessly with DataHub's monitoring infrastructure:
Metrics are exposed at the standard Spring Boot Actuator endpoint `/actuator/prometheus`.

Example Prometheus queries:
# Average request duration by operation
rate(graphql_request_duration_seconds_sum[5m])
/ rate(graphql_request_duration_seconds_count[5m])
# Field resolver p99 latency
histogram_quantile(0.99,
rate(graphql_field_duration_seconds_bucket[5m])
)
# Error rate by operation
rate(graphql_request_errors_total[5m])
DataHub provides comprehensive instrumentation for Kafka message consumption through Micrometer metrics, enabling real-time monitoring of message queue latency and consumer performance. This instrumentation is critical for maintaining data freshness SLAs and identifying processing bottlenecks across DataHub's event-driven architecture.
Traditional Kafka lag monitoring only tells you "we're behind by 10,000 messages." Without queue time metrics, you can't answer critical questions like "are we meeting our 5-minute data freshness SLA?" or "which consumer groups are experiencing delays?"
Consider these scenarios:
Variable Production Rate:
Burst Traffic Patterns:
Consumer Group Performance:
Kafka queue time instrumentation is implemented across all DataHub consumers:
Each consumer automatically records queue time metrics using the message's embedded timestamp.
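As a rough sketch of how such a measurement can be taken with Micrometer (the class and wiring below are illustrative, not DataHub's exact code), a consumer compares the record's embedded timestamp with the current time and records the difference on a timer tagged with topic and consumer group:

```java
// Illustrative sketch: record how long a message sat in Kafka before being consumed,
// using the record's embedded timestamp and a Micrometer Timer.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.time.Duration;

public class QueueTimeRecorder {
  private final MeterRegistry registry;

  public QueueTimeRecorder(MeterRegistry registry) {
    this.registry = registry;
  }

  public void record(ConsumerRecord<?, ?> record, String consumerGroup) {
    long queueTimeMs = System.currentTimeMillis() - record.timestamp();
    Timer.builder("kafka.message.queue.time")
        .tag("topic", record.topic())
        .tag("consumer.group", consumerGroup)
        .publishPercentiles(0.5, 0.95, 0.99, 0.999) // matches the configured percentiles
        .register(registry)
        .record(Duration.ofMillis(Math.max(queueTimeMs, 0)));
  }
}
```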
Metric: kafka.message.queue.time
The timer automatically tracks:
Default Configuration:
kafka:
  consumer:
    metrics:
      # Percentiles to calculate
      percentiles: "0.5,0.95,0.99,0.999"
      # Service Level Objective buckets (seconds)
      slo: "300,1800,3600,10800,21600,43200" # 5m,30m,1h,3h,6h,12h
      # Maximum expected queue time
      maxExpectedValue: 86400 # 24 hours (seconds)
SLA Compliance Monitoring:
# Percentage of messages processed within 5-minute SLA
sum(rate(kafka_message_queue_time_seconds_bucket{le="300"}[5m])) by (topic)
/ sum(rate(kafka_message_queue_time_seconds_count[5m])) by (topic) * 100
Consumer Group Comparison:
# P99 queue time by consumer group
histogram_quantile(0.99,
sum by (consumer_group, le) (
rate(kafka_message_queue_time_seconds_bucket[5m])
)
)
Metric Cardinality:
The instrumentation is designed for low cardinality:
- Tags are limited to `topic` and `consumer.group`

Overhead Assessment:
The new Micrometer-based queue time metrics coexist with the legacy DropWizard kafkaLag histogram:
- Legacy: `kafkaLag` histogram via JMX
- New: `kafka.message.queue.time` timer via Micrometer

The new metrics provide:
DataHub provides comprehensive instrumentation for measuring the latency from initial request submission to post-MCL (Metadata Change Log) hook execution. This metric is crucial for understanding the end-to-end processing time of metadata changes, including both the time spent in Kafka queues and the time taken to process through the system to the final hooks.
Traditional metrics only show individual component performance. Request hook latency provides the complete picture of how long it takes for a metadata change to be fully processed through DataHub's pipeline:
This end-to-end view is essential for:
Hook latency metrics are configured separately from Kafka consumer metrics to allow fine-tuning based on your specific requirements:
datahub:
  metrics:
    # Measures the time from request to post-MCL hook execution
    hookLatency:
      # Percentiles to calculate for latency distribution
      percentiles: "0.5,0.95,0.99,0.999"
      # Service Level Objective buckets (seconds)
      # These define the latency targets you want to track
      slo: "300,1800,3600,10800,21600,43200" # 5m, 30m, 1h, 3h, 6h, 12h
      # Maximum expected latency (seconds)
      # Values above this are considered outliers
      maxExpectedValue: 86400 # 24 hours
Metric: datahub.request.hook.queue.time
- `hook`: Name of the MCL hook being executed (e.g., "IngestionSchedulerHook", "SiblingsHook")

SLA Compliance by Hook:
Monitor which hooks are meeting their latency SLAs:
# Percentage of requests processed within 5-minute SLA per hook
sum(rate(datahub_request_hook_queue_time_seconds_bucket{le="300"}[5m])) by (hook)
/ sum(rate(datahub_request_hook_queue_time_seconds_count[5m])) by (hook) * 100
Hook Performance Comparison:
Identify which hooks have the highest latency:
# P99 latency by hook
histogram_quantile(0.99,
sum by (hook, le) (
rate(datahub_request_hook_queue_time_seconds_bucket[5m])
)
)
Latency Trends:
Track how hook latency changes over time:
# Average hook latency trend
avg by (hook) (
rate(datahub_request_hook_queue_time_seconds_sum[5m])
/ rate(datahub_request_hook_queue_time_seconds_count[5m])
)
The hook latency metric leverages the trace ID embedded in the system metadata of each request to determine when the request was originally submitted.
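A rough sketch of the idea, assuming the original submission timestamp can be recovered from the message's system metadata (the class and method names are hypothetical, not DataHub's actual implementation):

```java
// Rough sketch: measure end-to-end latency by comparing the original request timestamp
// carried with the message to the time the post-MCL hook executes, tagged by hook name.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class HookLatencyRecorder {
  private final MeterRegistry registry;

  public HookLatencyRecorder(MeterRegistry registry) {
    this.registry = registry;
  }

  public void recordHookLatency(String hookName, long requestTimestampMillis) {
    long latencyMs = System.currentTimeMillis() - requestTimestampMillis;
    Timer.builder("datahub.request.hook.queue.time")
        .tag("hook", hookName) // e.g. "SiblingsHook"
        .register(registry)
        .record(Duration.ofMillis(Math.max(latencyMs, 0)));
  }
}
```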
While Kafka queue time metrics (kafka.message.queue.time) measure the time messages spend in Kafka topics, request hook
latency metrics provide the complete picture:
Together, these metrics help identify where delays occur:
Emitted on all aspect writes (REST, GraphQL, MCP) to track sizes and detect oversized aspects.
Metrics:
- `aspectSizeValidation.prePatch.sizeDistribution` - Size distribution of existing aspects (tags: aspectName, sizeBucket)
- `aspectSizeValidation.postPatch.sizeDistribution` - Size distribution of aspects being written (tags: aspectName, sizeBucket)
- `aspectSizeValidation.prePatch.oversized` - Oversized aspects found in database (tags: aspectName, remediation)
- `aspectSizeValidation.postPatch.oversized` - Oversized aspects rejected during writes (tags: aspectName, remediation)
- `aspectSizeValidation.prePatch.warning` - Aspects approaching limit in database (tags: aspectName)
- `aspectSizeValidation.postPatch.warning` - Aspects approaching limit during writes (tags: aspectName)

Configuration:
See Aspect Size Validation for details.
datahub:
  validation:
    aspectSize:
      metrics:
        sizeBuckets: [1048576, 5242880, 10485760, 15728640]
Default buckets (1MB, 5MB, 10MB, 15MB) create ranges: 0-1MB, 1MB-5MB, 5MB-10MB, 10MB-15MB, 15MB+
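As an illustration of how a write's size could be mapped to one of these buckets and recorded (a sketch only; the exact meter type DataHub uses internally is not shown here, and the helper class is hypothetical):

```java
// Illustrative sketch: map an aspect's serialized size to a configured bucket label
// and count it under the post-patch size-distribution metric.
import io.micrometer.core.instrument.MeterRegistry;

public class AspectSizeMetrics {
  // Default thresholds: 1MB, 5MB, 10MB, 15MB
  private static final long[] BUCKETS = {1048576L, 5242880L, 10485760L, 15728640L};

  public static void recordPostPatchSize(MeterRegistry registry, String aspectName, long sizeBytes) {
    registry.counter(
        "aspectSizeValidation.postPatch.sizeDistribution",
        "aspectName", aspectName,
        "sizeBucket", bucketLabel(sizeBytes))
        .increment();
  }

  private static String bucketLabel(long sizeBytes) {
    if (sizeBytes < BUCKETS[0]) return "0-1MB";
    if (sizeBytes < BUCKETS[1]) return "1MB-5MB";
    if (sizeBytes < BUCKETS[2]) return "5MB-10MB";
    if (sizeBytes < BUCKETS[3]) return "10MB-15MB";
    return "15MB+";
  }
}
```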
Micrometer provides automatic instrumentation for cache implementations, offering deep insights into cache performance and efficiency. This instrumentation is crucial for DataHub, where caching significantly impacts query performance and system load.
When caches are registered with Micrometer, comprehensive metrics are automatically collected without code changes:
- `cache.size` (Gauge) - Current number of entries in the cache
- `cache.gets` (Counter) - Cache access attempts, tagged with:
  - `result=hit` - Successful cache hits
  - `result=miss` - Cache misses requiring backend fetch
- `cache.puts` (Counter) - Number of entries added to cache
- `cache.evictions` (Counter) - Number of entries evicted
- `cache.eviction.weight` (Counter) - Total weight of evicted entries (for size-based eviction)

Calculate key performance indicators using Prometheus queries:
# Cache hit rate (should be >80% for hot caches)
sum(rate(cache_gets_total{result="hit"}[5m])) by (cache) /
sum(rate(cache_gets_total[5m])) by (cache)
# Cache miss rate
1 - (cache_hit_rate)
# Eviction rate (indicates cache pressure)
rate(cache_evictions_total[5m])
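As a minimal sketch of how a cache can be bound to the meter registry so these metrics appear automatically (assuming a Caffeine cache; the "entityClient" cache name is illustrative):

```java
// Minimal sketch: bind a Caffeine cache to Micrometer so cache.* metrics
// (size, gets, puts, evictions) are emitted for it automatically.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.cache.CaffeineCacheMetrics;

public class CacheMetricsExample {
  public static Cache<String, Object> buildInstrumentedCache(MeterRegistry registry) {
    Cache<String, Object> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .recordStats() // required for hit/miss statistics
        .build();
    // Registers the cache under the name "entityClient" in the meter registry
    CaffeineCacheMetrics.monitor(registry, cache, "entityClient");
    return cache;
  }
}
```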
DataHub uses multiple cache layers, each automatically instrumented:
cache.client.entityClient:
  enabled: true
  maxBytes: 104857600 # 100MB
  entityAspectTTLSeconds:
    corpuser:
      corpUserInfo: 20 # Short TTL for frequently changing data
      corpUserKey: 300 # Longer TTL for stable data
    structuredProperty:
      propertyDefinition: 300
      structuredPropertyKey: 86400 # 1 day for very stable data
cache.client.usageClient:
  enabled: true
  maxBytes: 52428800 # 50MB
  defaultTTLSeconds: 86400 # 1 day
  # Caches expensive usage calculations
cache.search.lineage:
  ttlSeconds: 86400 # 1 day
Hit Rate by Cache Type
# Alert if hit rate drops below 70%
cache_hit_rate < 0.7
Memory Pressure
# High eviction rate relative to puts
rate(cache_evictions_total[5m]) / rate(cache_puts_total[5m]) > 0.1
Micrometer automatically instruments Java ThreadPoolExecutor instances, providing crucial visibility into concurrency
bottlenecks and resource utilization. For DataHub's concurrent operations, this monitoring is essential for maintaining
performance under load.
- `executor.pool.size` (Gauge) - Current number of threads in pool
- `executor.pool.core` (Gauge) - Core (minimum) pool size
- `executor.pool.max` (Gauge) - Maximum allowed pool size
- `executor.active` (Gauge) - Threads actively executing tasks
- `executor.queued` (Gauge) - Tasks waiting in queue
- `executor.queue.remaining` (Gauge) - Available queue capacity
- `executor.completed` (Counter) - Total completed tasks
- `executor.seconds` (Timer) - Task execution time distribution
- `executor.rejected` (Counter) - Tasks rejected due to saturation

graphQL.concurrency:
  separateThreadPool: true
  corePoolSize: 20 # Base threads
  maxPoolSize: 200 # Scale under load
  keepAlive: 60 # Seconds before idle thread removal
# Handles complex GraphQL query resolution
entityClient.restli:
  get:
    batchConcurrency: 2 # Parallel batch processors
    batchQueueSize: 500 # Task buffer
    batchThreadKeepAlive: 60
  ingest:
    batchConcurrency: 2
    batchQueueSize: 500
timeseriesAspectService.query:
  concurrency: 10 # Parallel query threads
  queueSize: 500 # Buffered queries
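These metrics are emitted once a pool is bound to the meter registry. A minimal sketch using Micrometer's ExecutorServiceMetrics (the pool parameters and the "graphql-query" name are illustrative, not DataHub's exact values):

```java
// Minimal sketch: wrap a ThreadPoolExecutor with Micrometer's ExecutorServiceMetrics
// so pool-size, queue, and task-timing metrics are collected automatically.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorMetricsExample {
  public static ExecutorService buildInstrumentedExecutor(MeterRegistry registry) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        20, 200,                        // core and max pool size
        60, TimeUnit.SECONDS,           // keep-alive for idle threads
        new LinkedBlockingQueue<>(500)  // bounded task queue
    );
    // Returns a monitored ExecutorService; submit tasks through the returned instance
    return ExecutorServiceMetrics.monitor(registry, executor, "graphql-query", Tags.empty());
  }
}
```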
# Thread pool utilization (>0.8 indicates pressure)
executor_active / executor_pool_size > 0.8
# Queue filling up (>0.7 indicates backpressure)
executor_queued / (executor_queued + executor_queue_remaining) > 0.7
# Task rejections (should be zero)
rate(executor_rejected_total[1m]) > 0
# Thread starvation (all threads busy for extended period)
avg_over_time(executor_active[5m]) >= executor_pool_core
# Average task execution time
rate(executor_seconds_sum[5m]) / rate(executor_seconds_count[5m])
# Task throughput by executor
rate(executor_completed_total[5m])
| Symptom | Metric Pattern | Solution |
|---|---|---|
| High latency | executor_queued rising | Increase pool size |
| Rejections | executor_rejected > 0 | Increase queue size or pool max |
| Memory pressure | Many idle threads | Reduce keepAlive time |
| CPU waste | Low executor_active | Reduce core pool size |
Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which are units of work containing context about the work being done as well as the time taken to finish it. By looking at a trace, we can more easily identify performance bottlenecks.
We enable tracing by using the OpenTelemetry Java instrumentation library. This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture telemetry from popular libraries.
Using the agent we are able to
- Create custom spans with the `@WithSpan` annotation

You can enable the agent by setting env variable ENABLE_OTEL to true for GMS and MAE/MCE consumers. In our
example docker-compose, we export traces to a local Jaeger
instance by setting env variable OTEL_TRACES_EXPORTER to jaeger
and OTEL_EXPORTER_JAEGER_ENDPOINT to http://jaeger-all-in-one:14250, but you can easily change this behavior by
setting the correct env variables. Refer to
this doc for
all configs.
Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added
the @WithSpan annotation in various places to make the trace more readable. You should start to see traces in the
tracing collector of choice. Our example docker-compose deploys
an instance of Jaeger with port 16686. The traces should be available at http://localhost:16686.
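As a sketch of what such custom instrumentation looks like (the class and method names here are hypothetical, not DataHub's actual code):

```java
// Illustrative only: annotate a method so the OpenTelemetry agent records it as a span.
// By default the span is named after the class and method.
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class EntityLookupService {

  @WithSpan // recorded as "EntityLookupService.lookupEntity" in the trace
  public String lookupEntity(@SpanAttribute("urn") String urn) {
    // ... fetch the entity; time spent here shows up as a span in Jaeger
    return urn;
  }
}
```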
We recommend using either `grpc` or `http/protobuf`, configured using `OTEL_EXPORTER_OTLP_PROTOCOL`. Avoid using `http`, as it will not work as expected due to the size of the generated spans.
DataHub is transitioning to Micrometer as its primary metrics framework, representing a significant upgrade in observability capabilities. Micrometer is a vendor-neutral application metrics facade that provides a simple, consistent API for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in.
Native Spring Integration
As DataHub uses Spring Boot, Micrometer provides seamless integration with:
Multi-Backend Support
Unlike the legacy DropWizard approach that primarily targets JMX, Micrometer natively supports:
Dimensional Metrics
Micrometer embraces modern dimensional metrics, where each measurement carries labels/tags that can be filtered and aggregated flexibly in the metrics backend.
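For example, a single dimensional counter with tags can replace a whole family of flat, hierarchical metric names (the metric and tag names below are illustrative):

```java
// Illustrative: one dimensional metric with tags instead of many flat JMX names
// such as "requests.search.dataset.success".
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class DimensionalMetricsExample {
  public static void recordRequest(MeterRegistry registry,
                                   String endpoint, String entityType, boolean success) {
    Counter.builder("datahub.requests")
        .tag("endpoint", endpoint)       // e.g. "search"
        .tag("entity.type", entityType)  // e.g. "dataset"
        .tag("success", Boolean.toString(success))
        .register(registry)
        .increment();
  }
}
```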
DataHub is undertaking a strategic transition from DropWizard metrics (exposed via JMX) to Micrometer, a modern vendor-neutral metrics facade. This transition aims to provide better cloud-native monitoring capabilities while maintaining backward compatibility for existing monitoring infrastructure.
What We Have Now:
Limitations:
What We're Building:
Key Decisions and Rationale:
Dual Registry Approach
Decision: Run both systems in parallel with tag-based routing
Rationale:
Prometheus as Primary Target
Decision: Focus on Prometheus for new metrics
Rationale:
Observation API Adoption
Decision: Promote Observation API for new instrumentation
Rationale: a single instrumentation point can emit metrics and traces together, keeping telemetry consistent across signals.
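A minimal sketch of what Observation-based instrumentation looks like (the observation name and tag are hypothetical):

```java
// Illustrative: the Micrometer Observation API records a timer metric and, when a
// tracing handler is registered, a span from the same instrumentation point.
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

public class ObservationExample {
  public static String resolveEntity(ObservationRegistry registry, String urn) {
    return Observation.createNotStarted("datahub.entity.resolve", registry)
        .lowCardinalityKeyValue("entity.type", "dataset")
        .observe(() -> {
          // ... do the actual work; duration and outcome are recorded automatically
          return urn;
        });
  }
}
```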
Once fully adopted, Micrometer will transform DataHub's observability from a collection of separate tools into a unified platform. This means developers can focus on building features while getting comprehensive telemetry "for free."
Intelligent and Adaptive Monitoring
Developer and Operator Experience
We originally decided to use Dropwizard Metrics to export custom metrics to JMX,
and then use Prometheus-JMX exporter to export all JMX metrics to
Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use
their tool of choice. You can enable the agent by setting env variable ENABLE_PROMETHEUS to true for GMS and MAE/MCE
consumers. Refer to this example docker-compose for setting the
variables.
In our example docker-compose, we have configured Prometheus to scrape port 4318 of each container, which the JMX exporter uses to expose metrics. We also configured Grafana to read from Prometheus and provide useful dashboards. By default, we provide two dashboards: a JVM dashboard and a DataHub dashboard.
In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub dashboard, you can find charts to monitor each endpoint and the Kafka topics. Using the example implementation, go to http://localhost:3001 to find the Grafana dashboards! (Username: admin, PW: admin)
To make it easy to track various metrics within the code base, we created the MetricUtils class. This util class creates a central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. You can run the following to create a counter and increment it.
metricUtils.counter(this.getClass(), "metricName").increment();
You can run the following to time a block of code.
try (Timer.Context ignored = metricUtils.timer(this.getClass(), "timerName").time()) {
    ...block of code
}
We provide some example configuration for enabling monitoring in this directory. Take a look at the docker-compose files, which add the necessary env variables to existing containers and spawn new containers (Jaeger, Prometheus, Grafana).
You can include the above docker-compose file using the -f <<path-to-compose-file>> option when running docker-compose commands.
For instance,
docker-compose \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
pull && \
docker-compose -p datahub \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
up
We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For
instance, MONITORING=true ./docker/quickstart.sh will add the correct env variables to start collecting traces and
metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.
To monitor the health of your DataHub service, the /admin endpoint can be used.