Incidents represent data quality issues, operational problems, or any other type of issue that affects data assets in DataHub. They provide a structured way to track, manage, and resolve problems across datasets, dashboards, charts, data flows, data jobs, and schema fields. Incidents help teams maintain data reliability by documenting problems, assigning responsibility, tracking resolution progress, and maintaining an audit trail of data quality events.
Incidents are uniquely identified by a generated UUID string. Unlike most other DataHub entities that derive their identity from external systems, incidents are created within DataHub and assigned a unique identifier at creation time.
The URN structure for an incident is:
urn:li:incident:<uuid>
Example:
urn:li:incident:a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d
The UUID is automatically generated by the system when an incident is raised, ensuring global uniqueness across all incidents in the DataHub instance.
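As a minimal sketch of the URN format above, the snippet below builds and parses incident URNs using only the Python standard library. The helper names `make_incident_urn` and `parse_incident_id` are hypothetical and are not part of the DataHub SDK; in practice DataHub generates the UUID itself when an incident is raised.

```python
import uuid
from typing import Optional

INCIDENT_URN_PREFIX = "urn:li:incident:"


def make_incident_urn(incident_id: Optional[str] = None) -> str:
    """Build an incident URN, generating a random UUID when no id is given."""
    if incident_id is None:
        incident_id = str(uuid.uuid4())
    return f"{INCIDENT_URN_PREFIX}{incident_id}"


def parse_incident_id(urn: str) -> str:
    """Return the id portion of an incident URN, or raise ValueError."""
    if not urn.startswith(INCIDENT_URN_PREFIX):
        raise ValueError(f"Not an incident URN: {urn}")
    incident_id = urn[len(INCIDENT_URN_PREFIX):]
    if not incident_id:
        raise ValueError("Incident URN has an empty id")
    return incident_id


# Round-trip the example URN from the text above.
example_urn = make_incident_urn("a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d")
example_id = parse_incident_id(example_urn)
print(example_urn)
print(example_id)
```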
Incidents can be categorized by type to help teams understand the nature of the problem. DataHub supports several predefined incident types as well as custom types:
For custom incident types, a customType string describes the incident category.
Incidents follow a lifecycle from creation through resolution, tracked through status and stage fields:
The top-level state indicates whether an incident is active or resolved:
Incidents can be assigned to specific stages that represent where they are in the resolution process:
The status also includes a message field for providing context about the current state and a lastUpdated timestamp tracking when the status was last modified.
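To make the shape of the status concrete, here is a small standard-library sketch of a status record with a state, an optional stage, a message, and a last-updated timestamp. The `IncidentStatus` dataclass and `update_status` helper are illustrative and are not the SDK's classes. The stage value `FIXED` and the epoch-millisecond timestamp are assumptions to check against your DataHub version.

```python
import time
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IncidentStatus:
    state: str                # "ACTIVE" or "RESOLVED"
    stage: Optional[str]      # resolution stage, if one has been assigned
    message: Optional[str]    # context about the current state
    last_updated_ms: int      # epoch milliseconds of the last modification


def update_status(
    status: IncidentStatus,
    *,
    state: Optional[str] = None,
    stage: Optional[str] = None,
    message: Optional[str] = None,
    now_ms: Optional[int] = None,
) -> IncidentStatus:
    """Return a copy with the given fields changed and the timestamp refreshed."""
    return replace(
        status,
        state=state if state is not None else status.state,
        stage=stage if stage is not None else status.stage,
        message=message if message is not None else status.message,
        last_updated_ms=now_ms if now_ms is not None else int(time.time() * 1000),
    )


opened = IncidentStatus(
    state="ACTIVE", stage=None, message="Row counts dropped", last_updated_ms=1_000
)
# "FIXED" is an assumed stage name used only for illustration.
closed = update_status(
    opened, state="RESOLVED", stage="FIXED", message="Backfill complete", now_ms=2_000
)
print(closed)
```

The record is frozen so that each status change produces a new value, which makes it easy to keep a history of transitions alongside the current status.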
Incidents can be assigned a priority to help teams triage and focus on the most critical issues:
The priority field is stored as an integer (0-3) in the data model, allowing for programmatic sorting and filtering.
Incidents can be assigned to one or more users or groups responsible for investigating and resolving the issue. Each assignee includes:
Multiple assignees can collaborate on resolving a single incident, making it easy to involve cross-functional teams.
A key feature of incidents is the ability to link them to one or more affected data assets. The entities field contains an array of URNs referencing the assets impacted by the incident. Supported entity types include:
This linkage allows users to see all incidents affecting a particular asset and understand the scope of an incident across multiple assets.
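The two views described above (all assets for an incident, and all incidents for an asset) are inverses of each other. The following sketch builds the second from the first with plain dictionaries. The sample URNs are made up, with shortened placeholder ids instead of real UUIDs, and `incidents_by_asset` is a hypothetical helper rather than a DataHub API.

```python
from collections import defaultdict
from typing import Dict, List

ORDERS = "urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)"
CUSTOMERS = "urn:li:dataset:(urn:li:dataPlatform:hive,db.customers,PROD)"

# Hypothetical sample data: incident URN -> URNs from its entities field.
incident_entities: Dict[str, List[str]] = {
    "urn:li:incident:inc-1": [ORDERS, CUSTOMERS],
    "urn:li:incident:inc-2": [ORDERS],
}


def incidents_by_asset(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert incident -> entities into asset -> incidents."""
    index: Dict[str, List[str]] = defaultdict(list)
    for incident_urn, entity_urns in mapping.items():
        for entity_urn in entity_urns:
            index[entity_urn].append(incident_urn)
    return dict(index)


asset_index = incidents_by_asset(incident_entities)
for asset_urn, incident_urns in asset_index.items():
    print(asset_urn, "->", incident_urns)
```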
The source field tracks how the incident was created:
For incidents raised by an assertion, the sourceUrn field contains the URN of the assertion that triggered the incident.
This distinction helps teams understand which incidents require manual investigation versus those generated by automated monitoring.
Incidents maintain detailed temporal information:
This temporal data helps teams understand incident timelines and calculate mean time to detection (MTTD) and mean time to resolution (MTTR).
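As a rough illustration of those two metrics, the sketch below computes them from per-incident timestamps. The field names (`started_ms`, `created_ms`, `resolved_ms`) are hypothetical stand-ins for whatever timestamps you export from DataHub. The definitions used here, detection as raised minus started and resolution as resolved minus raised, are one common convention and an assumption, not something the data model prescribes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class IncidentTimes:
    started_ms: int              # when the underlying problem began
    created_ms: int              # when the incident was raised
    resolved_ms: Optional[int]   # None while the incident is still active


def mean_minutes(durations_ms: List[int]) -> Optional[float]:
    """Average a list of millisecond durations and convert to minutes."""
    if not durations_ms:
        return None
    return mean(durations_ms) / 60_000


def mttd_minutes(incidents: List[IncidentTimes]) -> Optional[float]:
    """Mean time to detection: raised time minus start time."""
    return mean_minutes([i.created_ms - i.started_ms for i in incidents])


def mttr_minutes(incidents: List[IncidentTimes]) -> Optional[float]:
    """Mean time to resolution, over resolved incidents only."""
    return mean_minutes(
        [i.resolved_ms - i.created_ms for i in incidents if i.resolved_ms is not None]
    )


sample = [
    IncidentTimes(started_ms=0, created_ms=600_000, resolved_ms=4_200_000),
    IncidentTimes(started_ms=0, created_ms=1_800_000, resolved_ms=3_600_000),
    IncidentTimes(started_ms=0, created_ms=1_200_000, resolved_ms=None),
]
mttd = mttd_minutes(sample)
mttr = mttr_minutes(sample)
print(mttd, mttr)
```

Active incidents are excluded from the resolution average because they have no resolution time yet; including them would require choosing a cutoff such as the current time.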
Like other DataHub entities, incidents can be tagged using the globalTags aspect. Tags help categorize and filter incidents, making it easier to find related issues or analyze incident patterns by category.
The following example demonstrates creating a new incident and associating it with a dataset that has a data quality issue.
<details>
<summary>Python SDK: Create a basic incident</summary>

{{ inline /metadata-ingestion/examples/library/incident_create.py show_path_as_comment }}

</details>
As incidents progress through their lifecycle, you'll need to update their status to reflect the current state and stage.
<details>
<summary>Python SDK: Update incident status and stage</summary>

{{ inline /metadata-ingestion/examples/library/incident_update_status.py show_path_as_comment }}

</details>
Tags can be added to incidents to categorize them by team, system, severity, or any other organizational dimension.
<details>
<summary>Python SDK: Add a tag to an incident</summary>

{{ inline /metadata-ingestion/examples/library/incident_add_tag.py show_path_as_comment }}

</details>
After creating incidents, you can retrieve them using the DataHub REST API to integrate with external monitoring or ticketing systems.
<details>
<summary>Query incident using REST API</summary>

{{ inline /metadata-ingestion/examples/library/incident_query_rest_api.py show_path_as_comment }}

</details>
Incidents are tightly integrated with DataHub's assertion framework. When assertions (data quality checks) fail and are configured to raise incidents, they automatically create incident entities. These incidents:
- Reference the triggering assertion in the sourceUrn field

This integration provides automatic incident creation for monitored data quality checks.
DataHub entities that can have incidents (datasets, dashboards, charts, dataFlows, dataJobs, schemaFields) include an incidentsSummary aspect. This aspect provides:
This summary appears in the UI on asset pages, giving users immediate visibility into data quality issues.
The DataHub GraphQL API provides several operations for working with incidents:
These operations are used by the DataHub UI and can be called directly by external applications.
Incident operations respect DataHub's authorization model. Users must have the EDIT_ENTITY_INCIDENTS privilege on an entity to:
This ensures that only users with appropriate permissions can manage incidents for sensitive data assets.
Incidents factor into the overall health status of DataHub entities. Assets with active CRITICAL or HIGH priority incidents may be marked as unhealthy in the UI, helping users quickly identify problematic data assets.
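The rule of thumb above can be sketched as a simple predicate. The text only says such assets "may be" marked unhealthy, so this is an approximation for your own reporting and not DataHub's actual health computation. `IncidentSummary` and `is_unhealthy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IncidentSummary:
    state: str      # "ACTIVE" or "RESOLVED"
    priority: str   # "CRITICAL", "HIGH", "MEDIUM", or "LOW"


UNHEALTHY_PRIORITIES = {"CRITICAL", "HIGH"}


def is_unhealthy(incidents: Iterable[IncidentSummary]) -> bool:
    """True if any active incident has CRITICAL or HIGH priority."""
    return any(
        incident.state == "ACTIVE" and incident.priority in UNHEALTHY_PRIORITIES
        for incident in incidents
    )


asset_incidents = [
    IncidentSummary(state="RESOLVED", priority="CRITICAL"),
    IncidentSummary(state="ACTIVE", priority="LOW"),
]
print(is_unhealthy(asset_incidents))
```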
While the data model supports incidents affecting multiple entities (via the entities array), some GraphQL resolvers currently have limitations when working with multi-entity incidents. Specifically, the UpdateIncidentStatusResolver currently only checks authorization against the first entity in the array. This is noted in the code as a TODO for future enhancement.
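To illustrate why this matters, the sketch below contrasts a first-entity check with a check across every linked entity. It is not the resolver's code: the functions, the `has_privilege` callback, and the sample grants are all hypothetical, and the stricter variant is just one possible policy.

```python
from typing import Callable, Sequence

PrivilegeCheck = Callable[[str, str], bool]  # (user, entity_urn) -> allowed?


def can_edit_first_entity_only(
    user: str, entity_urns: Sequence[str], has_privilege: PrivilegeCheck
) -> bool:
    """Mirror the limitation described above: only the first entity is checked."""
    return bool(entity_urns) and has_privilege(user, entity_urns[0])


def can_edit_all_entities(
    user: str, entity_urns: Sequence[str], has_privilege: PrivilegeCheck
) -> bool:
    """Stricter alternative: require the privilege on every linked entity."""
    return bool(entity_urns) and all(has_privilege(user, urn) for urn in entity_urns)


# Hypothetical grants: alice may edit incidents on "orders" but not "payments".
grants = {("alice", "urn:li:dataset:orders")}


def sample_check(user: str, entity_urn: str) -> bool:
    return (user, entity_urn) in grants


linked = ["urn:li:dataset:orders", "urn:li:dataset:payments"]
first_only = can_edit_first_entity_only("alice", linked, sample_check)
all_entities = can_edit_all_entities("alice", linked, sample_check)
print(first_only, all_entities)
```

With multi-entity incidents, the order of the entities array therefore affects who can update the status, which is worth keeping in mind when linking assets with different access policies.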
When creating incidents, it's recommended to:
The priority field is stored as an integer (0-3) rather than as an enum in the PDL model. This was noted in the schema comments as a potential area for future improvement. The GraphQL layer provides an enum interface (CRITICAL, HIGH, MEDIUM, LOW) that maps to these integer values, but the underlying storage uses integers.
When working with the low-level SDK, use the integer values:
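A minimal sketch of that mapping, assuming the integers follow the order in which the GraphQL enum values are listed (0 for CRITICAL through 3 for LOW). The `IncidentPriority` class below is illustrative and not an SDK type; confirm the mapping against the schema for your DataHub version.

```python
from enum import IntEnum


class IncidentPriority(IntEnum):
    # Assumed mapping: lower integer means more urgent.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def to_stored_value(name: str) -> int:
    """Convert a GraphQL-style priority name to the stored integer."""
    return IncidentPriority[name.upper()].value


def to_display_name(value: int) -> str:
    """Convert a stored integer back to its priority name."""
    return IncidentPriority(value).name


# Integers sort naturally, so the most urgent incidents come first.
stored = [3, 0, 2, 1]
ordered_names = [to_display_name(v) for v in sorted(stored)]
print(ordered_names)
```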
Incidents created automatically by assertion failures cannot have their source field changed to MANUAL, and vice versa. The source field is set at creation time and reflects the origin of the incident. This distinction is important for reporting and analytics, as it helps teams understand the effectiveness of automated monitoring versus manual incident reporting.
While there is no explicit length limit on the status message field in the schema, UI components may truncate very long messages. It's recommended to keep status messages concise (under 500 characters) and use the incident description field for longer explanations.
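One way to follow this recommendation is to shorten messages on the client before sending them. The helper below is a hypothetical sketch; the 500-character limit comes from the guideline above and is not enforced by the schema.

```python
MAX_STATUS_MESSAGE_CHARS = 500


def shorten_status_message(text: str, limit: int = MAX_STATUS_MESSAGE_CHARS) -> str:
    """Return text unchanged if shorter than limit, else truncate with an ellipsis."""
    if len(text) < limit:
        return text
    # Reserve room for "..." and keep the result strictly under the limit.
    return text[: limit - 4].rstrip() + "..."


short_message = shorten_status_message("Backfill in progress")
long_message = shorten_status_message("x" * 1000)
print(len(short_message), len(long_message))
```

Longer context that does not fit can go in the incident description, as suggested above.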
Incidents are not automatically deleted when their affected entities are removed. This preserves the historical record of data quality issues even after assets are deprecated or deleted. However, this can lead to orphaned incidents that reference non-existent entities. It's recommended to implement cleanup processes for incidents linked to deleted assets if this becomes an issue in your organization.
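A cleanup process could start by identifying candidates. The sketch below flags incidents whose linked entities have all been removed, given data you have already exported. `find_orphaned_incidents` and the sample URNs are hypothetical, and fetching incidents or checking entity existence through DataHub's APIs is outside the scope of this example.

```python
from typing import Dict, List, Set


def find_orphaned_incidents(
    incident_entities: Dict[str, List[str]], existing_urns: Set[str]
) -> List[str]:
    """Return incident URNs for which none of the linked entities still exist."""
    orphaned = []
    for incident_urn, entity_urns in incident_entities.items():
        if entity_urns and all(urn not in existing_urns for urn in entity_urns):
            orphaned.append(incident_urn)
    return orphaned


# Hypothetical export: incident URN -> URNs from its entities field.
exported = {
    "urn:li:incident:inc-a": ["urn:li:dataset:orders", "urn:li:dataset:customers"],
    "urn:li:incident:inc-b": ["urn:li:dataset:legacy_orders"],
}
still_present = {"urn:li:dataset:orders"}

orphans = find_orphaned_incidents(exported, still_present)
print(orphans)
```

Incidents that still reference at least one existing entity are kept, since they remain relevant to that asset. Review flagged incidents before deleting them if the historical record matters to your organization.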