metadata-models/docs/entities/assertion.md
The assertion entity represents a data quality rule that can be applied to one or more datasets. Assertions are the foundation of DataHub's data quality framework, enabling organizations to define, monitor, and enforce expectations about their data. They encompass various types of checks including field-level validation, volume monitoring, freshness tracking, schema validation, and custom SQL-based rules.
Assertions can originate from multiple sources: they can be defined natively within DataHub, ingested from external data quality tools (such as Great Expectations, dbt tests, or Snowflake Data Quality), or inferred by ML-based systems. Each assertion tracks its evaluation history over time, maintaining a complete audit trail of passes, failures, and errors.
An Assertion is uniquely identified by an assertionId, which is a globally unique identifier that remains constant across runs of the assertion. The URN format is:
```
urn:li:assertion:<assertionId>
```
The assertionId is typically a generated GUID that uniquely identifies the assertion definition. For example:
```
urn:li:assertion:432475190cc846f2894b5b3aa4d55af2
```
The logic for generating stable assertion IDs differs based on the source of the assertion. The key requirement is that the same assertion definition should always produce the same assertionId, enabling DataHub to track the assertion's history over time even as it is re-evaluated.
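As a minimal sketch of what "same definition, same ID" means, the following hashes a canonical serialization of an assertion definition to derive a deterministic ID. The field names and hashing scheme here are illustrative assumptions, not DataHub's actual key-generation logic.

```python
import hashlib
import json

def stable_assertion_id(definition: dict) -> str:
    """Derive a deterministic ID by hashing a canonical (sorted-key)
    serialization of the assertion definition. Illustrative only."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Hypothetical definition of a field assertion on a Snowflake table.
definition = {
    "entity": "urn:li:dataset:(urn:li:dataPlatform:snowflake,purchases,PROD)",
    "type": "FIELD",
    "field": "quantity",
    "operator": "GREATER_THAN",
    "value": 0,
}

assertion_id = stable_assertion_id(definition)
assertion_urn = f"urn:li:assertion:{assertion_id}"
# Re-serializing the same definition always yields the same URN,
# so repeated evaluations attach to one assertion history.
```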
DataHub supports several types of assertions, each designed to validate different aspects of data quality:
Field assertions validate individual columns or fields within a dataset. They come in two sub-types:
- Field Values Assertions: Validate that each value in a column meets certain criteria, for example that values are not null or match a regular expression.
- Field Metric Assertions: Validate aggregated statistics about a column, for example that the null proportion stays below a threshold or that the unique count matches an expected value.
{{ inline /metadata-ingestion/examples/library/assertion_create_field_uniqueness.py show_path_as_comment }}
Volume assertions monitor the amount of data in a dataset, with sub-types covering both absolute row counts and changes in row count over time.
Volume assertions are critical for detecting data pipeline failures, incomplete loads, or unexpected data growth.
<details>
<summary>Python SDK: Create a row count volume assertion</summary>

{{ inline /metadata-ingestion/examples/library/assertion_create_volume_rows.py show_path_as_comment }}

</details>
Freshness assertions ensure data is updated within expected time windows. Two types are supported: dataset change assertions, which monitor updates to the dataset itself, and data job run assertions, which monitor runs of the job that produces it.
Freshness assertions define a schedule that specifies when updates should occur (e.g., daily by 9 AM, every 4 hours) and what tolerance is acceptable.
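The schedule-plus-tolerance check described above can be sketched in a few lines of plain Python. This is a conceptual illustration, not the SDK's evaluation logic; the function name and parameters are invented for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_fresh(last_updated: datetime, max_staleness: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Pass if the most recent change falls within the allowed staleness window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= max_staleness

# Tolerance: the table should change at least every 4 hours.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(hours=3), timedelta(hours=4), now=now))  # True
print(is_fresh(now - timedelta(hours=5), timedelta(hours=4), now=now))  # False
```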
<details>
<summary>Python SDK: Create a dataset change freshness assertion</summary>

{{ inline /metadata-ingestion/examples/library/assertion_create_freshness.py show_path_as_comment }}

</details>
Schema assertions validate that a dataset's structure matches expectations, verifying that the expected columns are present with the expected data types.
Schema assertions are valuable for detecting breaking changes in upstream data sources.
<details>
<summary>Python SDK: Create a schema assertion</summary>

{{ inline /metadata-ingestion/examples/library/assertion_create_schema.py show_path_as_comment }}

</details>
SQL assertions allow custom validation logic using arbitrary SQL queries. Two types are supported: metric assertions, which compare a query's numeric result against a fixed expectation, and metric change assertions, which compare the result against prior evaluations.
SQL assertions provide maximum flexibility for complex validation scenarios that don't fit other assertion types, such as cross-table referential integrity checks or business rule validation.
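To make the cross-table referential integrity case concrete, the sketch below expresses it as a SQL metric, "count of orphaned rows," asserted to be EQUAL_TO 0. The table schemas and query are invented for illustration; only the pattern (run a query, compare the single numeric result against an operator and value) reflects how SQL metric assertions work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);  -- customer 3 missing
""")

# Referential integrity expressed as a SQL metric: orphaned order count.
metric_sql = """
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
"""
actual = conn.execute(metric_sql).fetchone()[0]
passed = actual == 0  # operator EQUAL_TO, expected value 0
print(actual, passed)  # 1 False -- one orphaned row, so the assertion fails
```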
<details>
<summary>Python SDK: Create a SQL metric assertion</summary>

{{ inline /metadata-ingestion/examples/library/assertion_create_sql_metric.py show_path_as_comment }}

</details>
Custom assertions provide an extension point for assertion types not directly modeled in DataHub. They're useful when an external tool's checks don't map cleanly onto the built-in field, volume, freshness, schema, or SQL types.
The assertionInfo aspect includes an AssertionSource that identifies the origin of the assertion: native (defined within DataHub), external (ingested from an external data quality tool), or inferred (produced by an ML-based system).
External assertions should have a corresponding dataPlatformInstance aspect that identifies the specific platform instance they originated from.
Assertion evaluations are tracked using the assertionRunEvent timeseries aspect. Each evaluation creates a new event recording the run timestamp, the run status, and the result (success, failure, or error).
Run events enable tracking assertion health over time, identifying trends, and debugging failures.
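A simple way to picture this is summarizing a run history into a pass rate. The event shape below loosely mirrors assertionRunEvent (timestamp plus a result type); the exact dictionary keys are assumptions for the example, not the aspect's schema.

```python
from collections import Counter

# Hypothetical run history for one assertion: SUCCESS / FAILURE / ERROR.
run_events = [
    {"timestampMillis": 1704186000000, "resultType": "SUCCESS"},
    {"timestampMillis": 1704272400000, "resultType": "SUCCESS"},
    {"timestampMillis": 1704358800000, "resultType": "FAILURE"},
    {"timestampMillis": 1704445200000, "resultType": "ERROR"},
]

counts = Counter(e["resultType"] for e in run_events)
# Errors mean the check could not run, so exclude them from the pass rate.
completed = counts["SUCCESS"] + counts["FAILURE"]
pass_rate = counts["SUCCESS"] / completed if completed else None
print(dict(counts), pass_rate)
```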
The assertionActions aspect defines automated responses to assertion outcomes. Common actions include raising an incident when the assertion fails and resolving it when the assertion subsequently passes.
Like other DataHub entities, assertions support standard metadata capabilities such as tags, ownership, and status.
{{ inline /metadata-ingestion/examples/library/assertion_add_tag.py show_path_as_comment }}
Assertions use a standard set of operators for comparisons:
- Numeric: BETWEEN, LESS_THAN, LESS_THAN_OR_EQUAL_TO, GREATER_THAN, GREATER_THAN_OR_EQUAL_TO, EQUAL_TO, NOT_EQUAL_TO
- String: CONTAIN, START_WITH, END_WITH, REGEX_MATCH, IN, NOT_IN
- Boolean: IS_TRUE, IS_FALSE, NULL, NOT_NULL
- Native: _NATIVE_ for platform-specific operators
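The following sketch evaluates a handful of these operators against an observed value, with parameter names mirroring AssertionStdParameters (value, minValue, maxValue). It is an illustration of the operator semantics, not DataHub's evaluation engine, and the exact match behavior of REGEX_MATCH is an assumption here.

```python
import re

def evaluate(operator: str, actual, value=None,
             min_value=None, max_value=None) -> bool:
    """Apply a standard assertion operator to an observed value."""
    if operator == "BETWEEN":
        return min_value <= actual <= max_value
    if operator == "GREATER_THAN":
        return actual > value
    if operator == "EQUAL_TO":
        return actual == value
    if operator == "REGEX_MATCH":
        return re.fullmatch(value, actual) is not None
    if operator == "NOT_NULL":
        return actual is not None
    if operator == "IN":
        return actual in value
    raise ValueError(f"unsupported operator: {operator}")

print(evaluate("BETWEEN", 42, min_value=0, max_value=100))     # True
print(evaluate("REGEX_MATCH", "a1b2", value=r"[a-z0-9]+"))     # True
print(evaluate("IN", "EU", value={"US", "CA"}))                # False
```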
Parameters are provided via AssertionStdParameters:
- value: Single value for most operators
- minValue, maxValue: Range endpoints for BETWEEN

Each parameter declares a type: NUMBER, STRING, or SET.

Field and volume assertions can apply aggregation functions before evaluation:
- Statistical: MEAN, MEDIAN, STDDEV, MIN, MAX, SUM
- Count-based: ROW_COUNT, COLUMN_COUNT, UNIQUE_COUNT, NULL_COUNT
- Proportional: UNIQUE_PROPORTION, NULL_PROPORTION
- Identity: IDENTITY (no aggregation), COLUMNS (all columns)
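To show what "aggregate before evaluation" means, this sketch computes a few count-based and proportional aggregations over a column in plain Python. The proportion definitions (divided by total row count) are assumptions for illustration; the operator would then be applied to the aggregated value.

```python
# A column sample with two nulls and two distinct non-null values.
column = ["a", "b", None, "a", None]

row_count = len(column)
null_count = sum(1 for v in column if v is None)
unique_count = len({v for v in column if v is not None})

# Proportional aggregations, here defined relative to total row count.
null_proportion = null_count / row_count
unique_proportion = unique_count / row_count

print(row_count, null_count, unique_count, null_proportion)  # 5 2 2 0.4
```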
Assertions are linked to the datasets they validate through the Asserts relationship, which points from the assertion to its target dataset.
Datasets maintain a reverse relationship, showing all assertions that validate them. This enables users to understand the quality checks applied to any dataset.
Freshness assertions can target data jobs (pipelines) to ensure they execute on schedule. When a FreshnessAssertionInfo has type=DATA_JOB_RUN, the entity field references a dataJob URN rather than a dataset.
External assertions maintain a relationship to their source platform through the dataPlatformInstance aspect. This enables assertions to be filtered by platform and traced back to the tool that produced them.
Assertions are fully accessible via DataHub's GraphQL API. Key GraphQL types:
- Assertion: The main assertion entity
- AssertionInfo: Assertion definition and type
- AssertionRunEvent: Evaluation results
- AssertionSource: Origin metadata

DataHub's dbt integration automatically converts dbt tests into assertions.
The Great Expectations integration maps expectations to the corresponding assertion types.
Each expectation suite becomes a collection of assertions in DataHub.
Snowflake DMF (Data Metric Functions) rules are ingested as assertions.
The DATASET assertion type is a legacy format that predates the more specific field, volume, freshness, and schema assertion types. It uses DatasetAssertionInfo with a generic structure. New integrations should use the more specific assertion types (FIELD, VOLUME, FRESHNESS, DATA_SCHEMA, SQL) as they provide better type safety and UI rendering.
While assertions track pass/fail status, DataHub also supports more detailed metrics through the AssertionResult object:
- actualAggValue: The actual value observed (for numeric assertions)
- externalUrl: Link to detailed results in the source system
- nativeResults: Platform-specific result details

This enables richer debugging and understanding of why assertions fail.
DataHub tracks when assertions run through assertionRunEvent timeseries data, but does not directly schedule assertion evaluations. Scheduling is handled by external systems, such as the source data platform or an orchestration tool.
DataHub provides monitoring and alerting based on the assertion run events, regardless of the scheduling mechanism.
DataHub has two related concepts: assertions and test results. Test results are lightweight pass/fail indicators without the full expressiveness of assertions. Use assertions for production data quality monitoring and test results for simple ingestion-time validation.