Data Quality Monitoring (DQM) is a Feast module that helps users validate their data against a user-curated set of rules. Validation can be applied during:
Its goal is to address several complex data problems, namely:
To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset. How exactly profile equivalency should be measured is up to the user.
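The idea of comparing a tested dataset against a reference dataset's profile can be illustrated with a small, library-free sketch. It is not Feast's or Great Expectations' implementation: the `build_profile` and `validate` helpers, the `tolerance` parameter, and the choice of "per-column value range" as the equivalence measure are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
Profile = Dict[str, Tuple[float, float]]  # column name -> (allowed min, allowed max)


def build_profile(reference: List[Row], tolerance: float = 0.1) -> Profile:
    """Derive a profile from the reference dataset (hypothetical helper).

    The profile is a per-column value range, widened on both sides by
    `tolerance` times the observed span.
    """
    profile: Profile = {}
    for column in reference[0]:
        values = [row[column] for row in reference]
        low, high = min(values), max(values)
        margin = (high - low) * tolerance
        profile[column] = (low - margin, high + margin)
    return profile


def validate(tested: List[Row], profile: Profile) -> List[str]:
    """Return one message per column whose tested values leave the profiled range."""
    failures = []
    for column, (low, high) in profile.items():
        values = [row[column] for row in tested]
        if min(values) < low or max(values) > high:
            failures.append(f"{column}: values outside [{low:.2f}, {high:.2f}]")
    return failures


reference = [{"conv_rate": 0.2}, {"conv_rate": 0.6}]
good = [{"conv_rate": 0.3}, {"conv_rate": 0.55}]
drifted = [{"conv_rate": 0.3}, {"conv_rate": 0.95}]

profile = build_profile(reference)
print(validate(good, profile))     # empty list: the tested data matches the profile
print(validate(drifted, profile))  # one message reporting conv_rate out of range
```

A different equivalence measure (means, null ratios, value sets) would only change what `build_profile` records and what `validate` compares, which is the sense in which the measure is left to the user.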
The validation process consists of the following steps:
Feast with Great Expectations support can be installed via:
pip install 'feast[ge]'
Currently, Feast supports only Great Expectations' ExpectationSuite as a dataset's profile. Hence, the user needs to define a function (a profiler) that receives a dataset and returns an ExpectationSuite.
Great Expectations supports automatic profiling as well as manually specifying expectations:
from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite
from feast.dqm.profilers.ge_profiler import ge_profiler

# The ge_profiler decorator marks this function as a profiler Feast can use.
@ge_profiler
def automatic_profiler(dataset: Dataset) -> ExpectationSuite:
    from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

    # Let Great Expectations infer expectations from the dataset, skipping conv_rate.
    return UserConfigurableProfiler(
        profile_dataset=dataset,
        ignored_columns=['conv_rate'],
        value_set_threshold='few'
    ).build_suite()
However, in our experience the capabilities of the automatic profiler are quite limited, so we recommend crafting your own expectations:
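@ge_profiler
def manual_profiler(dataset: Dataset) -> ExpectationSuite:
    # Expect the maximum value of "column" to lie between 1 and 2.
    dataset.expect_column_max_to_be_between("column", 1, 2)
    # Collect every expectation declared on the dataset into a suite.
    return dataset.get_expectation_suite()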
During retrieval of historical features, validation_reference can be passed as a parameter to the .to_df(validation_reference=...) or .to_arrow(validation_reference=...) methods of RetrievalJob.
If the parameter is provided, Feast runs validation once the dataset is materialized. If validation succeeds, the materialized dataset is returned.
Otherwise, a feast.dqm.errors.ValidationFailed exception is raised. It contains the details of every expectation that did not pass.
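from feast import FeatureStore

fs = FeatureStore(".")

job = fs.get_historical_features(...)

# Build the reference from a saved dataset and the profiler defined above,
# then validate the retrieved features against it.
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=manual_profiler)
)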
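Since a failed validation surfaces as an exception, callers will usually want to catch it. The following is a minimal sketch of that pattern, not an official recipe. The `to_df_or_none` helper is hypothetical. The `validation_report` attribute name is an assumption that should be checked against your Feast version, and the code falls back to the exception itself if the attribute is absent. The stand-in exception class exists only so the sketch can run without Feast installed.

```python
from typing import Any, Optional

try:
    from feast.dqm.errors import ValidationFailed
except ImportError:
    # Stand-in so the sketch can run without Feast installed.
    class ValidationFailed(Exception):
        pass


def to_df_or_none(job: Any, reference: Any) -> Optional[Any]:
    """Return the validated dataframe, or None if validation fails (hypothetical helper)."""
    try:
        return job.to_df(validation_reference=reference)
    except ValidationFailed as exc:
        # Assumption: the failed expectations are exposed as `validation_report`.
        # Fall back to the exception itself if that attribute does not exist.
        report = getattr(exc, "validation_report", exc)
        print(f"Validation failed: {report}")
        return None
```

With the objects from the example above, this would be called as `to_df_or_none(job, reference)`, where `reference` is the value produced by `.as_reference(profiler=manual_profiler)`. Whether to return `None`, re-raise, or fall back to an older dataset is a design choice that depends on how the training pipeline should react to bad data.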