docs/assertions/open-assertions-spec.md
DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal YAML-based format, and then compile this into artifacts that can be registered or directly executed by 3rd party Data Quality tools like Snowflake DMFs, dbt tests, Great Expectations or DataHub Cloud natively.
Ultimately, our goal is to provide an framework-agnostic, highly-portable format for defining Data Quality checks, making it seamless to swap out the underlying assertion engine without service disruption for end consumers of the results of these data quality checks in catalogging tools like DataHub.
Currently, the DataHub Open Assertions Specification supports the following integrations:
And is looking for contributions to build out support for the following integrations:
Below, we'll look at how to define assertions in YAML, and then provide an usage overview for each support integration.
The following assertion types are currently supported by the DataHub YAML Assertion spec:
Each assertion type aims to validate a different aspect of structured table (e.g. on a data warehouse or data lake), from structure to size to column integrity to custom metrics.
In this section, we'll go over examples of defining each.
Freshness Assertions allow you to verify that your data was updated within the expected timeframe. Below you'll find examples of defining different types of freshness assertions via YAML.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: freshness
lookback_interval: "6 hours"
last_modified_field: updated_at
schedule:
type: interval
interval: "6 hours" # Run every 6 hours
This assertion checks that the purchase_events table in the test_db.public schema was updated within the last 6 hours
by issuing a Query to the table which validates determines whether an update was made using the updated_at column in the past 6 hours.
To use this check, we must specify the field that contains the last modified timestamp of a given row.
The lookback_interval field is used to specify the "lookback window" for the assertion, whereas the schedule field is used to specify how often the assertion should be run.
This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
to detect stale data as soon as it becomes "stale" by inspecting it more frequently.
Currently, the only supported sourceType for Freshness Assertions is LAST_MODIFIED_FIELD. In the future,
we may support additional source types, such as HIGH_WATERMARK, along with data source-specific types such as
AUDIT_LOG and INFORMATION_SCHEMA.
Volume Assertions allow you to verify that the number of records in your dataset meets your expectations. Below you'll find examples of defining different types of volume assertions via YAML.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: volume
metric: "row_count"
condition:
type: between
min: 1000
max: 10000
# filters: "event_type = 'purchase'" Optionally add filters.
schedule:
type: on_table_change # Run when new data is added to the table.
This assertion checks that the purchase_events table in the test_db.public schema has between 1000 and 10000 records.
Using the condition field, you can specify the type of comparison to be made, and the min and max fields to specify the range of values to compare against.
Using the filters field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the schedule field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
The only metric currently supported is row_count.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: volume
metric: "row_count"
condition:
type: less_than_or_equal_to
value: 1000
# filters: "event_type = 'purchase'" Optionally add filters.
schedule:
type: on_table_change # Run when new data is added to the table.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: volume
metric: "row_count"
condition:
type: greater_than_or_equal_to
value: 1000
# filters: "event_type = 'purchase'" Optionally add filters.
schedule:
type: on_table_change # Run when new data is added to the table.
The full set of supported volume assertion conditions include:
equal_tonot_equal_togreater_thangreater_than_or_equal_toless_thanless_than_or_equal_tobetweenColumn Assertions allow you to verify that the values in a column meet your expectations. Below you'll find examples of defining different types of column assertions via YAML.
The specification currently supports 2 types of Column Assertions:
We'll go over examples of each below.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: amount
condition:
type: between
min: 0
max: 10
exclude_nulls: True
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
# failure_threshold:
# type: count
# value: 10
schedule:
type: on_table_change
This assertion checks that all values for the amount column in the purchase_events table in the test_db.public schema have values between 0 and 10.
Using the field field, you can specify the column to be asserted on, and using the condition field, you can specify the type of comparison to be made,
and the min and max fields to specify the range of values to compare against.
Using the schedule field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
Using the filters field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the exclude_nulls field, you can specify whether to exclude NULL values from the assertion, meaning that
NULL will simply be ignored if encountered, as opposed to failing the check.
Using the failure_threshold, we can set a threshold for the number of rows that can fail the assertion before the assertion is considered failed.
The validate a VARCHAR / STRING column that should contain one of a set of values:
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: product_id
condition:
type: in
value:
- "product_1"
- "product_2"
- "product_3"
exclude_nulls: False
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
# failure_threshold:
# type: count
# value: 10
schedule:
type: on_table_change
The validate a string column contains valid email addresses:
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: email_address
condition:
type: matches_regex
value: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}"
exclude_nulls: False
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
# failure_threshold:
# type: count
# value: 10
schedule:
type: on_table_change
The full set of supported field value conditions include:
innot_inis_nullis_not_nullequal_tonot_equal_togreater_than # Numeric Onlygreater_than_or_equal_to # Numeric Onlyless_than # Numeric Onlyless_than_or_equal_to # Numeric Onlybetween # Numeric Onlymatches_regex # String Onlynot_empty # String Onlylength_greater_than # String Onlylength_less_than # String Onlylength_between # String Onlyversion: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: col_date
metric: null_count
condition:
type: equal_to
value: 0
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
schedule:
type: on_table_change
This assertion ensures that the col_date column in the purchase_events table in the test_db.public schema has no NULL values.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: id
metric: unique_percentage
condition:
type: equal_to
value: 100
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
schedule:
type: on_table_change
This assertion ensures that the id column in the purchase_events table in the test_db.public schema
has no duplicates, by checking that the unique percentage is 100%.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: field
field: name
metric: empty_percentage
condition:
type: equal_to
value: 0
# filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
schedule:
type: on_table_change
This assertion ensures that the name column in the purchase_events table in the test_db.public schema is never empty, by checking that the empty percentage is 0%.
The full set of supported field metrics include:
null_countnull_percentageunique_countunique_percentageempty_countempty_percentageminmaxmeanmedianstddevnegative_countnegative_percentagezero_countzero_percentageThe full set of supported field metric conditions include:
equal_tonot_equal_togreater_thangreater_than_or_equal_toless_thanless_than_or_equal_tobetweenCustom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations. The only condition is that the SQL query must return a single value, which will be compared against the expected value. Below you'll find examples of defining different types of custom SQL assertions via YAML.
SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types, and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINS) or any other SQL-based data quality check.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: sql
statement: |
SELECT COUNT(*)
FROM test_db.public.purchase_events AS pe
LEFT JOIN test_db.public.products AS p
ON pe.product_id = p.id
WHERE p.id IS NULL
condition:
type: equal_to
value: 0
schedule:
type: interval
interval: "6 hours" # Run every 6 hours
This assertion checks that the purchase_events table in the test_db.public schema has no rows where the product_id column does not have a corresponding id in the products table.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: sql
statement: |
SELECT COUNT(*) FROM test_db.public.purchase_events
- (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
condition:
type: equal_to
value: 0
schedule:
type: interval
interval: "6 hours" # Run every 6 hours
This assertion checks that the number of rows in the purchase_events exactly matches the number of rows in an upstream purchase_events_raw table
by subtracting the row count of the raw table from the row count of the processed table.
The full set of supported custom SQL assertion conditions include:
equal_tonot_equal_togreater_thangreater_than_or_equal_toless_thanless_than_or_equal_tobetweenSchema Assertions allow you to define custom SQL queries to verify your data meets your expectations. Below you'll find examples of defining different types of custom SQL assertions via YAML.
The specification currently supports 2 types of Schema Assertions:
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: schema
condition:
type: exact_match
columns:
- name: id
type: INTEGER
- name: product_id
type: STRING
- name: amount
type: DECIMAL
- name: updated_at
type: TIMESTAMP
schedule:
type: interval
interval: "6 hours" # Run every 6 hours
This assertion checks that the purchase_events table in the test_db.public schema has the exact schema as specified, with the exact column names and data types.
version: 1
assertions:
- entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
type: schema
condition:
type: contains
columns:
- name: id
type: integer
- name: product_id
type: string
- name: amount
type: number
schedule:
type: interval
interval: "6 hours" # Run every 6 hours
This assertion checks that the purchase_events table in the test_db.public schema contains all of the columns specified in the expected schema, with the exact column names and data types.
The actual schema can also contain additional columns not specified in the expected schema.
The following high-level data types are currently supported by the Schema Assertion spec: