Incident

Incidents represent data quality issues, operational problems, or any other type of issue that affects data assets in DataHub. They provide a structured way to track, manage, and resolve problems across datasets, dashboards, charts, data flows, data jobs, and schema fields. Incidents help teams maintain data reliability by documenting problems, assigning responsibility, tracking resolution progress, and maintaining an audit trail of data quality events.

Identity

Incidents are uniquely identified by a generated UUID string. Unlike most other DataHub entities, which derive their identity from external systems, incidents are created within DataHub and assigned a unique identifier at creation time.

The URN structure for an incident is:

urn:li:incident:<uuid>

Example:

urn:li:incident:a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d

The UUID is automatically generated by the system when an incident is raised, ensuring global uniqueness across all incidents in the DataHub instance.
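
The URN shape above can be illustrated with a short sketch. DataHub assigns the identifier itself, so you would not normally build one by hand. The helper names `make_incident_urn` and `parse_incident_urn` are hypothetical, and the use of a random (version 4) UUID is an assumption made only to show the format.

```python
import uuid

INCIDENT_URN_PREFIX = "urn:li:incident:"


def make_incident_urn() -> str:
    """Build an incident-shaped URN from a random UUID (illustrative only)."""
    return f"{INCIDENT_URN_PREFIX}{uuid.uuid4()}"


def parse_incident_urn(urn: str) -> str:
    """Return the identifier portion of an incident URN, or raise ValueError."""
    if not urn.startswith(INCIDENT_URN_PREFIX):
        raise ValueError(f"not an incident URN: {urn!r}")
    incident_id = urn[len(INCIDENT_URN_PREFIX):]
    if not incident_id:
        raise ValueError("incident URN has an empty identifier")
    return incident_id


example_urn = "urn:li:incident:a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d"
print(parse_incident_urn(example_urn))
print(make_incident_urn())
```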

Important Capabilities

Incident Types

Incidents can be categorized by type to help teams understand the nature of the problem. DataHub supports several predefined incident types as well as custom types:

FRESHNESS: Triggered when data is not updated within expected time windows. Often raised by freshness assertions that detect stale data.
VOLUME: Raised when data volume falls outside expected ranges (too much or too little data). Typically generated by volume assertions.
FIELD: Indicates issues with specific field values, such as null values, invalid formats, or values outside acceptable ranges. Associated with field-level assertions.
SQL: Triggered by SQL-based assertions that validate data using custom queries.
DATA_SCHEMA: Raised when schema changes are detected, such as column additions, removals, or type changes.
OPERATIONAL: General operational incidents such as pipeline failures, permission issues, or system errors.
CUSTOM: User-defined incident types for organization-specific problems. When using CUSTOM type, you must provide a customType string to describe the incident category.
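
The rule that CUSTOM incidents need a customType string can be checked before an incident is submitted. The following is a minimal sketch in plain Python. The function `check_incident_type` is hypothetical and is not part of the DataHub SDK.

```python
from typing import Optional

# Type names taken from the list above.
INCIDENT_TYPES = {
    "FRESHNESS", "VOLUME", "FIELD", "SQL",
    "DATA_SCHEMA", "OPERATIONAL", "CUSTOM",
}


def check_incident_type(incident_type: str, custom_type: Optional[str] = None) -> None:
    """Raise ValueError if the type is unknown or CUSTOM lacks a customType."""
    if incident_type not in INCIDENT_TYPES:
        raise ValueError(f"unknown incident type: {incident_type!r}")
    if incident_type == "CUSTOM" and not custom_type:
        raise ValueError("CUSTOM incidents must provide a customType string")


check_incident_type("FRESHNESS")
check_incident_type("CUSTOM", custom_type="billing-mismatch")
```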

Incident Status and Lifecycle

Incidents follow a lifecycle from creation through resolution, tracked through status and stage fields:

Status State

The top-level state indicates whether an incident is active or resolved:

ACTIVE: The incident is ongoing and requires attention or action.
RESOLVED: The incident has been addressed and is no longer active.

Lifecycle Stages

Incidents can be assigned to specific stages that represent where they are in the resolution process:

TRIAGE: The impact and priority of the incident are being actively assessed. This is typically the first stage for newly reported incidents.
INVESTIGATION: The root cause of the incident is being investigated by the assigned team.
WORK_IN_PROGRESS: The incident is in the remediation stage, with active work happening to resolve the issue.
FIXED: The incident has been resolved through corrective action (completed remediation).
NO_ACTION_REQUIRED: The incident is resolved with no action required, for example if it was a false positive, expected behavior, or resolved itself.

The status also includes a message field for providing context about the current state and a lastUpdated audit stamp recording when the status was last modified.
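
As a rough illustration, the status can be pictured as a small record holding a state, an optional stage, a message, and a last-updated time. The class `IncidentStatusSketch` is hypothetical, and lastUpdated is reduced to an epoch-millisecond integer. The pairing of stages with states (FIXED and NO_ACTION_REQUIRED with RESOLVED, the rest with ACTIVE) is an assumption inferred from the stage descriptions, not a rule this page says DataHub enforces.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Assumed pairing of stages to states, inferred from the descriptions above.
ACTIVE_STAGES = {"TRIAGE", "INVESTIGATION", "WORK_IN_PROGRESS"}
RESOLVED_STAGES = {"FIXED", "NO_ACTION_REQUIRED"}


@dataclass
class IncidentStatusSketch:
    state: str                      # "ACTIVE" or "RESOLVED"
    stage: Optional[str] = None     # optional lifecycle stage
    message: Optional[str] = None   # context about the current state
    last_updated_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def __post_init__(self) -> None:
        if self.state not in {"ACTIVE", "RESOLVED"}:
            raise ValueError(f"unknown state: {self.state!r}")
        if self.stage is None:
            return
        allowed = ACTIVE_STAGES if self.state == "ACTIVE" else RESOLVED_STAGES
        if self.stage not in allowed:
            raise ValueError(f"stage {self.stage!r} does not fit state {self.state!r}")


status = IncidentStatusSketch(state="ACTIVE", stage="TRIAGE", message="Assessing impact")
print(status.state, status.stage)
```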

Priority Levels

Incidents can be assigned a priority to help teams triage and focus on the most critical issues:

CRITICAL (priority 0): Severe issues requiring immediate attention that significantly impact business operations or data quality.
HIGH (priority 1): Important issues that should be addressed promptly but are not immediately blocking.
MEDIUM (priority 2): Moderate issues that should be addressed in the normal course of work.
LOW (priority 3): Minor issues that can be addressed when time permits.

The priority field is stored as an integer (0-3) in the data model, allowing for programmatic sorting and filtering.
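
Because lower integers mean higher urgency, an ordinary ascending sort puts the most critical incidents first. A small sketch with made-up incident records (the titles and dictionary layout are illustrative only):

```python
# Made-up records; only the integer priority convention comes from the text above.
incidents = [
    {"title": "Late partition", "priority": 2},
    {"title": "Pipeline down", "priority": 0},
    {"title": "Typo in description", "priority": 3},
    {"title": "Row count dropped", "priority": 1},
]

# Ascending sort: 0 (CRITICAL) first, 3 (LOW) last.
by_urgency = sorted(incidents, key=lambda incident: incident["priority"])

# Keep only CRITICAL (0) and HIGH (1).
urgent_titles = [i["title"] for i in by_urgency if i["priority"] <= 1]
print(urgent_titles)  # ['Pipeline down', 'Row count dropped']
```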

Assignees

Incidents can be assigned to one or more users or groups responsible for investigating and resolving the issue. Each assignee includes:

actor: The URN of the user (corpUser) or group (corpGroup) assigned to the incident.
assignedAt: An audit stamp capturing who made the assignment and when it occurred.

Multiple assignees can collaborate on resolving a single incident, making it easy to involve cross-functional teams.
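
An assignee entry can be sketched as a dictionary with the two fields described above. The helper `make_assignee` is hypothetical. The URN prefixes `urn:li:corpuser:` and `urn:li:corpGroup:` and the `time`/`actor` layout of the audit stamp are assumptions about DataHub's conventions that you should confirm against your version.

```python
import time
from typing import Any, Dict, Optional

# Assumed URN prefixes for users and groups.
ASSIGNABLE_PREFIXES = ("urn:li:corpuser:", "urn:li:corpGroup:")


def make_assignee(actor_urn: str, assigned_by_urn: str,
                  now_ms: Optional[int] = None) -> Dict[str, Any]:
    """Build an assignee record: who is assigned, plus who assigned them and when."""
    if not actor_urn.startswith(ASSIGNABLE_PREFIXES):
        raise ValueError(f"assignee must be a user or group URN: {actor_urn!r}")
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "actor": actor_urn,
        "assignedAt": {"time": now_ms, "actor": assigned_by_urn},
    }


assignees = [
    make_assignee("urn:li:corpuser:jdoe", "urn:li:corpuser:admin"),
    make_assignee("urn:li:corpGroup:data-platform", "urn:li:corpuser:admin"),
]
print([a["actor"] for a in assignees])
```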

Affected Entities

A key feature of incidents is the ability to link them to one or more affected data assets. The entities field contains an array of URNs referencing the assets impacted by the incident. Supported entity types include:

dataset: Tables, views, streams, or other data collections
chart: Data visualizations
dashboard: Dashboard pages containing multiple charts
dataFlow: Pipelines or workflows
dataJob: Individual tasks or jobs within a pipeline
schemaField: Specific fields/columns within a dataset

This linkage allows users to see all incidents affecting a particular asset and understand the scope of an incident across multiple assets.
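
The two views described here (the assets an incident touches, and the incidents touching an asset) are inverses of each other. The sketch below builds the second from the first in plain Python. The function names and the sample URNs are illustrative, and the entity type is read from the third colon-separated URN segment, which is an assumption about URN layout.

```python
from collections import defaultdict
from typing import Dict, List

SUPPORTED_ENTITY_TYPES = {
    "dataset", "chart", "dashboard", "dataFlow", "dataJob", "schemaField",
}


def entity_type(urn: str) -> str:
    """Return the entity type segment of a URN such as urn:li:dataset:(...)."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"malformed URN: {urn!r}")
    return parts[2]


def incidents_by_asset(incident_entities: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert {incident URN: [asset URNs]} into {asset URN: [incident URNs]}."""
    index: Dict[str, List[str]] = defaultdict(list)
    for incident_urn, asset_urns in incident_entities.items():
        for asset_urn in asset_urns:
            if entity_type(asset_urn) not in SUPPORTED_ENTITY_TYPES:
                raise ValueError(f"unsupported entity type: {asset_urn!r}")
            index[asset_urn].append(incident_urn)
    return dict(index)


orders = "urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)"
sales = "urn:li:dashboard:(looker,sales)"
index = incidents_by_asset({
    "urn:li:incident:one": [orders, sales],
    "urn:li:incident:two": [orders],
})
print(index[orders])
```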

Incident Source

The source field tracks how the incident was created:

MANUAL: The incident was manually created by a user through the UI or API.
ASSERTION_FAILURE: The incident was automatically raised by a failed assertion. In this case, the sourceUrn field contains the URN of the assertion that triggered the incident.

This distinction helps teams understand which incidents require manual investigation versus those generated by automated monitoring.

Temporal Tracking

Incidents maintain detailed temporal information:

startedAt: The time when the incident actually began (may be earlier than when it was reported).
created: An audit stamp tracking who created the incident and when it was first reported.
lastUpdated: An audit stamp on the status tracking the most recent status change.

This temporal data helps teams understand incident timelines and calculate mean time to detection (MTTD) and mean time to resolution (MTTR).
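
A worked example of the two metrics, under stated assumptions: timestamps are epoch milliseconds, time to detection is measured from startedAt to created, and time to resolution is measured from created to the status lastUpdated time of a resolved incident. Teams define these metrics differently (some measure resolution from startedAt), and the field names and sample values below are made up for illustration.

```python
from statistics import mean
from typing import List

# Made-up resolved incidents; all values are epoch milliseconds.
resolved_incidents = [
    {"started_at_ms": 1_700_000_000_000, "created_ms": 1_700_000_600_000,
     "resolved_ms": 1_700_004_200_000},
    {"started_at_ms": 1_700_010_000_000, "created_ms": 1_700_011_800_000,
     "resolved_ms": 1_700_019_000_000},
]


def mean_minutes(deltas_ms: List[int]) -> float:
    """Average a list of millisecond durations and convert to minutes."""
    return mean(deltas_ms) / 60_000


# Detection delay: reported time minus actual start time.
mttd = mean_minutes([i["created_ms"] - i["started_at_ms"] for i in resolved_incidents])
# Resolution time: resolved time minus reported time.
mttr = mean_minutes([i["resolved_ms"] - i["created_ms"] for i in resolved_incidents])

print(f"MTTD: {mttd} min, MTTR: {mttr} min")  # MTTD: 20.0 min, MTTR: 90.0 min
```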

Code Examples

Create an Incident

The following example demonstrates creating a new incident and associating it with a dataset that has a data quality issue.

<details> <summary>Python SDK: Create a basic incident</summary>

python

{{ inline /metadata-ingestion/examples/library/incident_create.py show_path_as_comment }}

</details>

Update Incident Status

As incidents progress through their lifecycle, you'll need to update their status to reflect the current state and stage.

<details> <summary>Python SDK: Update incident status and stage</summary>

python

{{ inline /metadata-ingestion/examples/library/incident_update_status.py show_path_as_comment }}

</details>

Add Tags to an Incident

Tags can be added to incidents to categorize them by team, system, severity, or any other organizational dimension.

<details> <summary>Python SDK: Add a tag to an incident</summary>

python

{{ inline /metadata-ingestion/examples/library/incident_add_tag.py show_path_as_comment }}

</details>

Query Incident via REST API

After creating incidents, you can retrieve them using the DataHub REST API to integrate with external monitoring or ticketing systems.

<details> <summary>Query incident using REST API</summary>

python

{{ inline /metadata-ingestion/examples/library/incident_query_rest_api.py show_path_as_comment }}

</details>

Integration Points

Relationship with Assertions

Incidents are tightly integrated with DataHub's assertion framework. When assertions (data quality checks) fail and are configured to raise incidents, they automatically create incident entities. These incidents:

Reference the assertion that triggered them via the sourceUrn field
Inherit the type from the assertion (FRESHNESS, VOLUME, FIELD, SQL, DATA_SCHEMA)
Link to the assets being monitored by the assertion
Can be configured at the assertion level to control whether failures generate incidents

This integration provides automatic incident creation for monitored data quality checks.

Incidents Summary on Assets

DataHub entities that can have incidents (datasets, dashboards, charts, dataFlows, dataJobs, schemaFields) include an incidentsSummary aspect. This aspect provides:

A count of active incidents affecting the entity
A count of resolved incidents
The priority breakdown of active incidents
Quick access to incident details without querying the incident entities directly

This summary appears in the UI on asset pages, giving users immediate visibility into data quality issues.
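
To make the contents of the summary concrete, the sketch below derives the same kinds of figures from a list of incident records. It is not the schema of the incidentsSummary aspect: the function `summarize_incidents` and the output keys are hypothetical, chosen only to mirror the items listed above.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_incidents(incidents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count active and resolved incidents and break active ones down by priority."""
    active = [i for i in incidents if i["state"] == "ACTIVE"]
    resolved = [i for i in incidents if i["state"] == "RESOLVED"]
    return {
        "activeCount": len(active),
        "resolvedCount": len(resolved),
        # Keys are the integer priorities (0 = CRITICAL ... 3 = LOW).
        "activeByPriority": dict(Counter(i["priority"] for i in active)),
    }


summary = summarize_incidents([
    {"state": "ACTIVE", "priority": 0},
    {"state": "ACTIVE", "priority": 1},
    {"state": "ACTIVE", "priority": 1},
    {"state": "RESOLVED", "priority": 2},
])
print(summary)
```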

GraphQL Operations

The DataHub GraphQL API provides several operations for working with incidents:

raiseIncident: Creates a new incident with specified type, priority, status, and affected entities
updateIncident: Updates incident properties including title, description, status, priority, assignees, and affected entities
updateIncidentStatus: Specifically updates the status state and stage of an incident
entityIncidents: Queries all incidents affecting a particular entity

These operations are used by the DataHub UI and can be called directly by external applications.
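
For external callers, a raiseIncident call comes down to posting a GraphQL document plus variables. The sketch below only builds that request body and does not send it. The input type name `RaiseIncidentInput` and the field names `resourceUrn`, `type`, `title`, `description`, and `priority` are assumptions recalled from the public schema, so verify them against the GraphQL schema of your DataHub version. The helper `build_raise_incident_request` and the dataset URN are illustrative.

```python
import json
from typing import Any, Dict

# Assumed mutation shape; check the input type name against your schema.
RAISE_INCIDENT_MUTATION = """
mutation raiseIncident($input: RaiseIncidentInput!) {
  raiseIncident(input: $input)
}
""".strip()


def build_raise_incident_request(resource_urn: str, title: str, description: str,
                                 incident_type: str = "OPERATIONAL",
                                 priority: str = "HIGH") -> Dict[str, Any]:
    """Return a JSON-serializable GraphQL request body for raising an incident."""
    return {
        "query": RAISE_INCIDENT_MUTATION,
        "variables": {
            "input": {
                "type": incident_type,
                "title": title,
                "description": description,
                "resourceUrn": resource_urn,  # assumed field name
                "priority": priority,         # GraphQL enum name, not the stored integer
            }
        },
    }


body = build_raise_incident_request(
    resource_urn="urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)",
    title="Orders table is stale",
    description="No new partitions since the nightly load failed.",
)
print(json.dumps(body, indent=2))
```

The resulting body would be sent as an authenticated POST to the instance's GraphQL endpoint using whichever HTTP client you prefer.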

Authorization

Incident operations respect DataHub's authorization model. Users must have the EDIT_ENTITY_INCIDENTS privilege on an entity to:

Create incidents affecting that entity
Update incidents linked to that entity
Change the status of incidents affecting that entity

This ensures that only users with appropriate permissions can manage incidents for sensitive data assets.

Health Status

Incidents factor into the overall health status of DataHub entities. Assets with active CRITICAL or HIGH priority incidents may be marked as unhealthy in the UI, helping users quickly identify problematic data assets.

Notable Exceptions

Single vs. Multiple Affected Entities

While the data model supports incidents affecting multiple entities (via the entities array), some GraphQL resolvers have limitations when working with multi-entity incidents. Specifically, the UpdateIncidentStatusResolver currently checks authorization against only the first entity in the array. This is noted in the code as a TODO for future enhancement.

When creating incidents, it's recommended to:

Use multiple entities when they're all affected by the same root cause (e.g., all downstream datasets affected by an upstream data quality issue)
Be aware that users need appropriate permissions on all affected entities to update the incident
Consider the UI implications of multi-entity incidents when displaying incident details

Priority Field Type

The priority field is stored as an integer (0-3) rather than as an enum in the PDL model. This was noted in the schema comments as a potential area for future improvement. The GraphQL layer provides an enum interface (CRITICAL, HIGH, MEDIUM, LOW) that maps to these integer values, but the underlying storage uses integers.

When working with the low-level SDK, use the integer values:

0 = CRITICAL
1 = HIGH
2 = MEDIUM
3 = LOW
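
When code has to move between the GraphQL enum names and the stored integers, a small `IntEnum` keeps the mapping in one place. This is a local convenience sketch, not a class shipped with the DataHub SDK, and the helper names are hypothetical.

```python
from enum import IntEnum


class IncidentPriority(IntEnum):
    """Local mirror of the name-to-integer mapping listed above."""
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def to_stored_value(name: str) -> int:
    """Convert an enum name such as 'HIGH' to the stored integer."""
    return int(IncidentPriority[name])


def to_enum_name(value: int) -> str:
    """Convert a stored integer such as 1 back to its enum name."""
    return IncidentPriority(value).name


print(to_stored_value("HIGH"))  # 1
print(to_enum_name(0))          # CRITICAL
```

An unknown name raises `KeyError` and an out-of-range integer raises `ValueError`, which surfaces bad input early instead of storing an unexpected value.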

Automatic vs. Manual Incidents

Incidents created automatically by assertion failures cannot have their source field changed to MANUAL, and vice versa. The source field is set at creation time and reflects the origin of the incident. This distinction is important for reporting and analytics, as it helps teams understand the effectiveness of automated monitoring versus manual incident reporting.

Status Message Length

While there is no explicit length limit on the status message field in the schema, UI components may truncate very long messages. It's recommended to keep status messages concise (under 500 characters) and use the incident description field for longer explanations.
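
One way to follow this recommendation is to split long text before submitting it, keeping a short status message and moving the remainder to the description. The helper `split_status_message` is hypothetical. The 500-character limit is the guideline from the paragraph above, not a constraint enforced by the schema.

```python
from typing import Tuple

STATUS_MESSAGE_LIMIT = 500  # guideline from the text above, not a schema limit


def split_status_message(text: str, limit: int = STATUS_MESSAGE_LIMIT) -> Tuple[str, str]:
    """Return (message, overflow) where message is shorter than `limit` characters.

    The split prefers the last space before the limit so words stay intact.
    """
    text = text.strip()
    if len(text) < limit:
        return text, ""
    cut = text.rfind(" ", 0, limit)
    if cut <= 0:
        cut = limit - 1  # no usable space: hard cut just under the limit
    return text[:cut].rstrip(), text[cut:].lstrip()


message, overflow = split_status_message("Backfill in progress. " * 40)
print(len(message), len(overflow))
```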

Incident Retention

Incidents are not automatically deleted when their affected entities are removed. This preserves the historical record of data quality issues even after assets are deprecated or deleted. However, this can lead to orphaned incidents that reference non-existent entities. It's recommended to implement cleanup processes for incidents linked to deleted assets if this becomes an issue in your organization.
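
A cleanup process of this kind starts by finding incidents whose affected entities have all disappeared. The sketch below shows only that selection step. The function `find_orphaned_incidents` and the `entity_exists` callback are hypothetical. In practice the callback would wrap whatever existence check your DataHub client exposes, and what to do with the results (archive, resolve, or delete) is left to your own policy.

```python
from typing import Callable, Dict, List


def find_orphaned_incidents(incident_entities: Dict[str, List[str]],
                            entity_exists: Callable[[str], bool]) -> List[str]:
    """Return incident URNs for which none of the affected entities still exist."""
    return [
        incident_urn
        for incident_urn, asset_urns in incident_entities.items()
        if asset_urns and not any(entity_exists(urn) for urn in asset_urns)
    ]


# Stand-in for a real existence check: a set of URNs known to be live.
live_assets = {"urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)"}

orphans = find_orphaned_incidents(
    {
        "urn:li:incident:one": ["urn:li:dataset:(urn:li:dataPlatform:hive,db.orders,PROD)"],
        "urn:li:incident:two": ["urn:li:dataset:(urn:li:dataPlatform:hive,db.legacy,PROD)"],
    },
    entity_exists=lambda urn: urn in live_assets,
)
print(orphans)  # ['urn:li:incident:two']
```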