# ADR-0011: Data Quality Monitoring

## Status

Accepted
## Context

Data quality issues can significantly impact ML model performance, and a range of complex data problems needed to be addressed. Feast needed a mechanism to validate data at retrieval time so that these issues are caught before they affect model training or serving.
## Decision

Introduce a Data Quality Monitoring (DQM) module that validates datasets against user-curated rules, initially targeting historical retrieval (training dataset generation).

The validation process combines a reference dataset (a previously saved dataset that is known to be good) with a profiler: a user-defined function that inspects the reference dataset and produces a validation profile, against which newly retrieved datasets are checked.
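For illustration, a reference dataset could be persisted with `FeatureStore.create_saved_dataset`. This is a minimal sketch in which the entity dataframe, feature reference, dataset name, and storage path are all hypothetical:

```python
import pandas as pd

from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(".")

# Hypothetical entity dataframe; real column names depend on your feature repo.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2021-04-12", "2021-04-12"]),
    }
)

# Retrieve a batch of historical features known to be good...
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:acc_rate"],  # hypothetical feature reference
)

# ...and persist it so it can later serve as a validation reference.
store.create_saved_dataset(
    from_=job,
    name="my_reference_dataset",
    storage=SavedDatasetFileStorage(path="reference.parquet"),
)
```

Once saved, the dataset can be loaded by name with `get_saved_dataset`, as shown in the retrieval example further below.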
The initial implementation uses Great Expectations as the validation engine. A profiler is a plain function, decorated with `@ge_profiler`, that declares expectations against the reference dataset and returns the resulting expectation suite:

```python
from feast.dqm.profilers.ge_profiler import ge_profiler
from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import Dataset


@ge_profiler
def my_profiler(dataset: Dataset) -> ExpectationSuite:
    # Constrain the maximum of "column" to the range [1, 2].
    dataset.expect_column_max_to_be_between("column", 1, 2)
    # Require that "important_feature" contains no null values.
    dataset.expect_column_values_to_not_be_null("important_feature")
    return dataset.get_expectation_suite()
```
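The `@ge_profiler` decorator wraps the function in a profiler object that Feast applies to the reference dataset at validation time, so the expectation suite is derived from the reference data rather than hard-coded into the pipeline.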
Validation is triggered during historical feature retrieval by passing a `validation_reference` argument when the job is materialized:

```python
from feast import FeatureStore

store = FeatureStore(".")
job = store.get_historical_features(...)

# Validate the retrieved dataframe against the saved reference dataset,
# profiled with my_profiler (defined above).
df = job.to_df(
    validation_reference=store
    .get_saved_dataset("my_reference_dataset")
    .as_reference(profiler=my_profiler)
)
```
If validation fails, a `ValidationFailed` exception is raised with details for every expectation that did not pass. If validation succeeds, the materialized dataset is returned as usual.
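A minimal sketch of handling a failed validation, assuming `ValidationFailed` is importable from `feast.dqm.errors` and exposes the report via a `validation_report` attribute (the attribute name is an assumption here):

```python
from feast import FeatureStore
from feast.dqm.errors import ValidationFailed

store = FeatureStore(".")
job = store.get_historical_features(...)

reference = (
    store.get_saved_dataset("my_reference_dataset")
    .as_reference(profiler=my_profiler)  # my_profiler as defined above
)

try:
    df = job.to_df(validation_reference=reference)
except ValidationFailed as exc:
    # Each entry describes one expectation that did not pass.
    # (The validation_report attribute is an assumption here.)
    print(exc.validation_report)
```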
## Consequences

- Validation runs only at retrieval time, when the job is materialized (`.to_df()` or `.to_arrow()`), not during ingestion.
- Great Expectations becomes an optional dependency, installed via the `feast[ge]` extra.
- The implementation lives in `sdk/python/feast/dqm/` and `sdk/python/feast/saved_dataset.py`.