Datahub Gc Post

metadata-ingestion/docs/sources/datahubgc/datahub-gc_post.md

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Index Cleanup

Truncates old data from the Elasticsearch indices that DataHub uses for time-series aspects.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    truncate_indices: true
    truncate_index_older_than_days: 30
    truncation_watch_until: 10000
    truncation_sleep_between_seconds: 30
```

Features
  • Truncates old Elasticsearch indices for the following timeseries aspects:
    • DatasetOperations
    • DatasetUsageStatistics
    • ChartUsageStatistics
    • DashboardUsageStatistics
    • QueryUsageStatistics
    • Timeseries Aspects
  • Monitors truncation progress
  • Implements safe deletion with monitoring thresholds
  • Supports gradual truncation with sleep intervals
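
The relationship between the truncation settings can be sketched as follows. This is a minimal illustration, not the source's implementation: the function names are hypothetical, and it assumes that `truncate_index_older_than_days` defines a timestamp cutoff and that `truncation_watch_until` is a document-count threshold below which monitoring stops.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable


def truncation_cutoff_millis(older_than_days: int, now: datetime) -> int:
    """Epoch-millisecond timestamp before which documents count as old."""
    cutoff = now - timedelta(days=older_than_days)
    return int(cutoff.timestamp() * 1000)


def wait_for_truncation(
    remaining_docs: Callable[[], int],  # hypothetical probe for the number of docs still to delete
    watch_until: int,
    sleep_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the remaining count is at or below the threshold; return the number of polls that slept."""
    polls = 0
    while remaining_docs() > watch_until:
        sleep(sleep_seconds)
        polls += 1
    return polls


# Usage with the values from the configuration above.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
cutoff = truncation_cutoff_millis(30, now)

# Simulated progress readings; sleeping is stubbed out so the example runs instantly.
counts = iter([50000, 20000, 8000])
polls = wait_for_truncation(lambda: next(counts), 10000, 30, sleep=lambda s: None)
```

Injecting the `sleep` callable keeps the loop testable without actually waiting between polls.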

Expired Token Cleanup

Manages access tokens in DataHub to maintain security and prevent token accumulation.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    cleanup_expired_tokens: true
```

Features
  • Automatically identifies and revokes expired access tokens
  • Processes tokens in batches for efficiency
  • Maintains system security by removing outdated credentials
  • Reports number of tokens revoked
  • Uses GraphQL API for token management
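
The batching behaviour listed above might look roughly like the sketch below. The token field names (`id`, `expiresAt`) and the helper name are assumptions made for illustration; the actual GraphQL schema and revocation calls are not shown here.

```python
from typing import Dict, Iterator, List, Optional


def expired_token_batches(
    tokens: List[Dict[str, Optional[int]]],
    now_ms: int,
    batch_size: int,
) -> Iterator[List[str]]:
    """Yield IDs of expired tokens in batches of at most batch_size."""
    # Assumption: a token with no expiry (None) never expires.
    expired = [
        str(t["id"])
        for t in tokens
        if t["expiresAt"] is not None and t["expiresAt"] < now_ms
    ]
    for start in range(0, len(expired), batch_size):
        yield expired[start:start + batch_size]


# Usage: three of the five tokens have expired relative to now_ms=1000.
tokens = [
    {"id": "a", "expiresAt": 100},
    {"id": "b", "expiresAt": 999},
    {"id": "c", "expiresAt": 5000},
    {"id": "d", "expiresAt": 1},
    {"id": "e", "expiresAt": None},
]
batches = list(expired_token_batches(tokens, now_ms=1000, batch_size=2))
revoked_count = sum(len(b) for b in batches)
```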

Data Process Cleanup

Manages the lifecycle of data processes, jobs, and their instances (DPIs) within DataHub.

Features
  • Cleans up Data Process Instances (DPIs) based on age and count
  • Can remove empty DataJobs and DataFlows
  • Supports both soft and hard deletion
  • Uses parallel processing for efficient cleanup
  • Maintains configurable retention policies

Configuration

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      enabled: true
      retention_days: 10
      keep_last_n: 5
      delete_empty_data_jobs: false
      delete_empty_data_flows: false
      hard_delete_entities: false
      batch_size: 500
      max_workers: 10
      delay: 0.25
```

Limitations
  • At most 9000 DPIs are processed per job, for performance reasons
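
How `keep_last_n` and `retention_days` interact is not spelled out in this section. The sketch below shows one plausible reading, in which a DPI is removed if it falls outside the newest `keep_last_n` or is older than `retention_days`. Both the combination rule and the function name are assumptions and should be checked against the source's configuration reference.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def dpis_to_delete(
    dpis: List[Tuple[str, datetime]],  # (urn, created) pairs for a single job
    keep_last_n: int,
    retention_days: int,
    now: datetime,
) -> List[str]:
    """Return URNs to remove, newest first, under the assumed OR rule."""
    newest_first = sorted(dpis, key=lambda d: d[1], reverse=True)
    cutoff = now - timedelta(days=retention_days)
    doomed = []
    for rank, (urn, created) in enumerate(newest_first):
        # Assumed rule: remove if beyond the newest N or older than the cutoff.
        if rank >= keep_last_n or created < cutoff:
            doomed.append(urn)
    return doomed


# Usage: four DPIs aged 1, 2, 3 and 20 days.
now = datetime(2024, 2, 1, tzinfo=timezone.utc)
dpis = [
    ("urn:a", now - timedelta(days=1)),
    ("urn:b", now - timedelta(days=2)),
    ("urn:c", now - timedelta(days=3)),
    ("urn:d", now - timedelta(days=20)),
]
by_age_only = dpis_to_delete(dpis, keep_last_n=5, retention_days=10, now=now)
by_count_too = dpis_to_delete(dpis, keep_last_n=2, retention_days=10, now=now)
```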

Execution Request Cleanup

Manages DataHub execution request records to prevent accumulation of historical execution data.

Features
  • Maintains execution history per ingestion source
  • Preserves minimum number of recent requests
  • Removes old requests beyond retention period
  • Special handling for running/pending requests
  • Automatic cleanup of corrupted records

Configuration

```yaml
source:
  type: datahub-gc
  config:
    execution_request_cleanup:
      enabled: true
      keep_history_min_count: 10
      keep_history_max_count: 1000
      keep_history_max_days: 30
      batch_read_size: 100
      runtime_limit_seconds: 3600
      max_read_errors: 10
```
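
The three `keep_history_*` settings suggest a per-source decision rule along the lines of the sketch below. It is illustrative only: the function name, the status strings, and the order in which the checks are applied are assumptions, not taken from the source.

```python
def should_delete_request(
    rank: int,        # 0 = newest request for a given ingestion source
    age_days: float,
    status: str,
    min_count: int = 10,
    max_count: int = 1000,
    max_days: int = 30,
) -> bool:
    """Decide whether one execution request is eligible for removal."""
    if rank < min_count:
        return False  # always preserve the most recent requests
    if rank >= max_count:
        return True   # history beyond the maximum count is dropped
    if status in ("RUNNING", "PENDING"):
        return False  # assumed special handling for in-flight requests
    return age_days > max_days


# Usage with the defaults from the configuration above.
decisions = [
    should_delete_request(0, 100, "SUCCESS"),
    should_delete_request(1000, 1, "SUCCESS"),
    should_delete_request(50, 40, "RUNNING"),
    should_delete_request(50, 40, "SUCCESS"),
    should_delete_request(50, 5, "SUCCESS"),
]
```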

Soft-Deleted Entities Cleanup

Manages the permanent removal of soft-deleted entities after a retention period.

Features
  • Permanently removes soft-deleted entities after retention period
  • Handles entity references cleanup
  • Special handling for query entities
  • Supports filtering by entity type, platform, or environment
  • Concurrent processing with safety limits

Configuration

```yaml
source:
  type: datahub-gc
  config:
    soft_deleted_entities_cleanup:
      enabled: true
      retention_days: 10
      batch_size: 500
      max_workers: 10
      delay: 0.25
      entity_types: null # Optional list of entity types to clean
      platform: null # Optional platform filter
      env: null # Optional environment filter
      query: null # Optional custom query filter
      limit_entities_delete: 25000
      futures_max_at_time: 1000
      runtime_limit_seconds: 7200
```

Performance Considerations
  • Concurrent processing using thread pools
  • Configurable batch sizes for optimal performance
  • Rate limiting through configurable delays
  • Maximum limits on concurrent operations
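
The points above (thread pools, batch sizes, delays, and hard limits) can be combined as in the following minimal sketch. The helper name and the `delete_fn` callback are hypothetical, and the sketch assumes that `delay` applies between batches and that `limit_entities_delete` caps the total number of entities removed in a run.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def delete_in_batches(
    urns: List[str],
    delete_fn: Callable[[str], None],  # hypothetical per-entity delete call
    batch_size: int,
    max_workers: int,
    delay: float,
    limit: int,
) -> int:
    """Delete up to `limit` entities concurrently, pausing between batches."""
    capped = urns[:limit]  # never exceed the overall deletion limit
    deleted = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(capped), batch_size):
            batch = capped[start:start + batch_size]
            # Consuming the map iterator waits for every deletion in the batch.
            deleted += sum(1 for _ in pool.map(delete_fn, batch))
            time.sleep(delay)  # simple rate limiting between batches
    return deleted


# Usage: ten candidates, but the limit allows only seven deletions.
seen: List[str] = []
candidates = [f"urn:li:dataset:{i}" for i in range(10)]
deleted_count = delete_in_batches(
    candidates, seen.append, batch_size=3, max_workers=2, delay=0, limit=7
)
```

Waiting for each batch to finish before sleeping keeps the number of in-flight operations bounded by the batch size, which is one simple way to honour a cap on concurrent work.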

Reporting

Each cleanup task maintains detailed reports including:

  • Number of entities processed
  • Number of entities removed
  • Errors encountered
  • Sample of affected entities
  • Runtime statistics
  • Task-specific metrics

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.