Simple Constraints - Examples and Usage

python/examples/basic/Constraints_Suite.ipynb

This is a whylogs v1 example. For the analogous feature in v0, please refer to this example.

In this example, we'll define a number of simple constraints and show how to use them. For the basics of building your own set of constraints, see the example Data Validation with Metric Constraints.

The constraints are listed according to the metric namespace used when defining them. For each category, we import helper functions for simple and popular constraints from `whylogs.core.constraints.factories`. Each helper function has a brief explanation in its docstring. After importing the helper functions, we'll show a simple example of how to build the constraints from them and visualize the results as a report with the visualization module.

Note: The constraints shown here are still experimental and subject to further changes. Stay tuned for upgrades!

Completeness Constraints

| constraint | parameters | semantic | metric |
|---|---|---|---|
| no_missing_values | column name | Checks that there are no missing values in the column. | Counts |
| null_values_below_number | column name, number | Number of null values must be below given number. | Counts |
| null_percentage_below_number | column name, number | Percentage of null values must be below given number. | Counts |

Consistency Constraints

| constraint | parameters | semantic | metric |
|---|---|---|---|
| greater_than_number | column name, number | Minimum value of given column must be above defined number. | Distribution |
| smaller_than_number | column name, number | Maximum value of given column must be below defined number. | Distribution |
| is_in_range | column name, lower, upper | Checks that all of column's values are in defined range (inclusive). | Distribution |
| is_non_negative | column name | Checks if a column is non-negative. | Distribution |
| n_most_common_items_in_set | column name, n, reference set | Checks if the top n most common items appear in the reference set. | Frequent Items |
| frequent_strings_in_reference_set | column name, reference set | Checks if a set of variables appear in the frequent strings for a string column. | Frequent Items |
| count_below_number | column name, number | Checks if the number of elements in a column is below given number. | Counts |
| distinct_number_in_range | column name, lower, upper | Checks if number of distinct categories is between lower and upper values (inclusive). | Cardinality |
| column_is_nullable_integral | column name | Checks if column contains only records of specific datatype. | Types |
| column_is_nullable_boolean | column name | Checks if column contains only records of specific datatype. | Types |
| column_is_nullable_fractional | column name | Checks if column contains only records of specific datatype. | Types |
| column_is_nullable_object | column name | Checks if column contains only records of specific datatype. | Types |
| column_is_nullable_string | column name | Checks if column contains only records of specific datatype. | Types |

Condition Count Constraints

Please refer to the example Metric Constraints with Condition Count Metrics for examples on how to use these constraints.

| constraint | parameters | semantic | metric |
|---|---|---|---|
| condition_meets | column name, condition_name | Fails if the condition is not met for at least one value. | Condition Count |
| condition_never_meets | column name, condition_name | Fails if the condition is met at least once. | Condition Count |
| condition_count_below | column name, condition_name, max_count | Fails if the condition is met more than max_count times. | Condition Count |
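
As a rough illustration of the semantics in the table above, the following plain-Python sketch counts how many values satisfy a condition and derives the three outcomes from that count. It does not use whylogs; the `condition_counts` helper, the `emails` column and the `"@" in v` condition are hypothetical, and the exact comparisons whylogs applies internally are an assumption here.

```python
from typing import Callable, Iterable, Tuple


def condition_counts(values: Iterable, predicate: Callable[[object], bool]) -> Tuple[int, int]:
    """Return (matches, total) for a predicate applied to a column's values."""
    values = list(values)
    matches = sum(1 for v in values if predicate(v))
    return matches, len(values)


# Hypothetical column and condition, for illustration only.
emails = ["a@example.com", "b@example.com", "not-an-email"]
matches, total = condition_counts(emails, lambda v: "@" in v)

# Assumed semantics, following the table above.
condition_meets = matches == total      # passes only if every value meets the condition
condition_never_meets = matches == 0    # passes only if no value meets the condition
condition_count_below = matches <= 2    # passes if at most max_count=2 values meet it

print(condition_meets, condition_never_meets, condition_count_below)
```

Here two of the three values meet the condition, so only the `condition_count_below` check with `max_count=2` passes.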

Statistics Constraints

| constraint | parameters | semantic | metric |
|---|---|---|---|
| mean_between_range | column name, lower, upper | Mean must be within the range defined by lower and upper bounds. | Distribution |
| stddev_between_range | column name, lower, upper | Standard deviation must be within the range defined by lower and upper bounds. | Distribution |
| quantile_between_range | column name, quantile, lower, upper | Q-th quantile value must be within the range defined by lower and upper bounds. | Distribution |

Installing whylogs and importing modules <a class="anchor" id="pre"></a>

If you haven't already, install whylogs:

python
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[viz]'

Then, let's import the helper functions needed to define the constraints:

python
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import (
    greater_than_number,
    is_in_range,
    is_non_negative,
    mean_between_range,
    smaller_than_number,
    stddev_between_range,
    quantile_between_range
)

Examples - Distribution Metrics Constraints

python
import whylogs as why
import pandas as pd
data = {
    "animal": ["cat", "hawk", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, 4, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

results = why.log(pd.DataFrame(data))
profile_view = results.view()

python
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="weight", number=0.14))
builder.add_constraint(mean_between_range(column_name="weight", lower=2, upper=3))
builder.add_constraint(smaller_than_number(column_name="weight", number=20.5))
builder.add_constraint(stddev_between_range(column_name="weight", lower=1, upper=3))
builder.add_constraint(quantile_between_range(column_name="weight", quantile=0.5, lower=1.5, upper=2.0))
builder.add_constraint(is_in_range(column_name="weight", lower=1.1, upper=3.2))
builder.add_constraint(is_in_range(column_name="legs", lower=0, upper=6))
builder.add_constraint(is_non_negative(column_name="legs"))

# The "animal" column has no distribution metrics.
# This constraint passes if skip_missing=True and fails otherwise.
builder.add_constraint(
    quantile_between_range(
        column_name="animal",
        quantile=0.5,
        lower=1.5,
        upper=2.0,
        skip_missing=False,
    )
)

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
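
To anticipate which of the constraints above should pass, the same checks can be approximated in plain Python with the `statistics` module. This is only a sketch: whylogs evaluates constraints against the profile's sketch-based metrics rather than the raw data, so values on larger datasets may differ slightly, and the inclusive bounds and sample standard deviation used here are assumptions.

```python
import statistics

weight = [4.3, 1.8, 1.3, 4.1, 5.5e-6]
legs = [4, 2, 0, 4, 6]

# Each entry mirrors one constraint added to the builder above.
outcomes = {
    "weight greater than 0.14": min(weight) > 0.14,
    "weight mean in [2, 3]": 2 <= statistics.mean(weight) <= 3,
    "weight smaller than 20.5": max(weight) < 20.5,
    "weight stddev in [1, 3]": 1 <= statistics.stdev(weight) <= 3,
    "weight median in [1.5, 2.0]": 1.5 <= statistics.median(weight) <= 2.0,
    "weight in range [1.1, 3.2]": 1.1 <= min(weight) and max(weight) <= 3.2,
    "legs in range [0, 6]": 0 <= min(legs) and max(legs) <= 6,
    "legs non-negative": min(legs) >= 0,
}

for name, passed in outcomes.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

The mosquito's weight (5.5e-6) is what breaks the `greater_than_number` and `is_in_range` checks on `weight`; the remaining checks pass on this data.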

Frequent Items/Frequent Strings Constraints <a class="anchor" id="frequent"></a>

python
from whylogs.core.constraints.factories import n_most_common_items_in_set, frequent_strings_in_reference_set

Examples - Frequent Items/Frequent Strings Constraints

python
import whylogs as why
import pandas as pd
data = {
    "animal": ["cat", "snake", "snake", "cat", "mosquito"],
    "legs": [0, 1, 2, 3, 4],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

results = why.log(pd.DataFrame(data))
profile_view = results.view()

python
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
reference_set = {"cat", "snake"}
builder.add_constraint(frequent_strings_in_reference_set(column_name="animal", reference_set=reference_set))
builder.add_constraint(n_most_common_items_in_set(column_name="animal", n=2, reference_set=reference_set))

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
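
The idea behind `n_most_common_items_in_set` can be sketched in plain Python with `collections.Counter`. This is an illustration under the assumption that the constraint passes when the top n items form a subset of the reference set; `top_n_in_set` is a hypothetical helper, and whylogs itself works from an approximate frequent-items sketch rather than exact counts.

```python
from collections import Counter

animal = ["cat", "snake", "snake", "cat", "mosquito"]
reference_set = {"cat", "snake"}


def top_n_in_set(values, n, reference):
    """True if the n most common values all belong to the reference set."""
    top_items = {item for item, _ in Counter(values).most_common(n)}
    return top_items <= reference


top_2_ok = top_n_in_set(animal, 2, reference_set)
top_3_ok = top_n_in_set(animal, 3, reference_set)
print(top_2_ok, top_3_ok)
```

With `n=2` the most common items are "cat" and "snake", both in the reference set. With `n=3`, "mosquito" enters the top items and the check no longer holds.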

Counters Constraints <a class="anchor" id="counts"></a>

python
from whylogs.core.constraints.factories import no_missing_values, count_below_number, null_percentage_below_number, null_values_below_number

Examples - Counters Constraints

python
import whylogs as why
import pandas as pd
data = {
    "animal": ["cat", "snake", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, None, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

results = why.log(pd.DataFrame(data))
profile_view = results.view()

python
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(count_below_number(column_name="legs", number=10))
builder.add_constraint(null_percentage_below_number(column_name="legs", number=0.05))
builder.add_constraint(null_values_below_number(column_name="legs", number=1))
builder.add_constraint(no_missing_values(column_name="legs"))
builder.add_constraint(no_missing_values(column_name="animal"))

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
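
The arithmetic behind the counters constraints on `legs` can be reproduced by hand. The sketch below is plain Python, not whylogs, and makes two assumptions: the total count includes null values, and the threshold of `null_percentage_below_number` is a fraction between 0 and 1 (as the value 0.05 above suggests).

```python
legs = [4, 2, 0, None, 6]

total = len(legs)                          # all records, nulls included (assumed)
null_count = sum(v is None for v in legs)  # number of missing values
null_ratio = null_count / total            # fraction of missing values

checks = {
    "count_below_number(number=10)": total < 10,
    "null_percentage_below_number(number=0.05)": null_ratio < 0.05,
    "null_values_below_number(number=1)": null_count < 1,
    "no_missing_values": null_count == 0,
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

The single `None` gives a null ratio of 0.2, so every null-related check on `legs` fails, while the count check passes. The `animal` column has no missing values, so `no_missing_values` is expected to pass for it.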

Cardinality Constraints <a class="anchor" id="card"></a>

python
from whylogs.core.constraints.factories import distinct_number_in_range

Examples - Cardinality Constraints

python
import whylogs as why
import pandas as pd
data = {
    "animal": ["cat", "snake", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, None, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

results = why.log(pd.DataFrame(data))
profile_view = results.view()

python
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(distinct_number_in_range(column_name="animal", lower=3, upper=6))

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
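
For a column this small, the expected result of `distinct_number_in_range` can be checked with an exact distinct count. This is a plain-Python sketch; it assumes whylogs estimates cardinality from a sketch, so on large columns the estimate may deviate from the exact count used here. The `distinct_in_range` helper is hypothetical.

```python
animal = ["cat", "snake", "snake", "cat", "mosquito"]


def distinct_in_range(values, lower, upper):
    """True if the number of distinct values lies in [lower, upper]."""
    return lower <= len(set(values)) <= upper


distinct = len(set(animal))
in_3_to_6 = distinct_in_range(animal, lower=3, upper=6)
in_4_to_6 = distinct_in_range(animal, lower=4, upper=6)
print(distinct, in_3_to_6, in_4_to_6)
```

There are three distinct animals, so the inclusive range 3 to 6 is satisfied, while raising the lower bound to 4 would make the check fail.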

Types Metrics <a class="anchor" id="types"></a>

Examples - Types Metrics

python
import whylogs as why
import pandas as pd

data = {
    "animal": ["cat", "snake", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, None, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
    "flies": [False, False, "False", False, True],
    "obj": [{"a":1}, None, {"a":1}, {"a":1}, {"a":1}]
}
df = pd.DataFrame(data)
results = why.log(df)
profile_view = results.view()

Check Nullable Types

python

from whylogs.core.constraints.factories import ( 
    column_is_nullable_integral,
    column_is_nullable_boolean, 
    column_is_nullable_fractional,
    column_is_nullable_object,
    column_is_nullable_string,
)
from whylogs.core.constraints import ConstraintsBuilder


builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(column_is_nullable_string(column_name="animal"))
builder.add_constraint(column_is_nullable_integral(column_name="legs"))
builder.add_constraint(column_is_nullable_fractional(column_name="weight"))
builder.add_constraint(column_is_nullable_boolean(column_name="flies"))
builder.add_constraint(column_is_nullable_object(column_name="obj"))

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

Each constraint above passes if all non-null values in the column are of the given type. Null values are accepted.

Note that the constraint on legs failed. whylogs relies on pandas dtypes when they are available, and because the column contains a None, pandas stores it as a floating-point column. The column is therefore considered fractional, even though the remaining values were originally integers.

Combined Constraints <a class="anchor" id="comb"></a>

Examples - Combined Metrics

To create a constraint that checks for a non-nullable type, we combine two separate constraints:

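  • the column is of the nullable datatype (for example, `column_is_nullable_string`)
  • the number of null values is below 1 (`null_values_below_number` with `number=1`)

python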
import whylogs as why
import pandas as pd

data = {
    "animal": ["cat", "snake", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, None, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
    "flies": [False, False, "False", False, True],
    "obj": [{"a":1}, None, {"a":1}, {"a":1}, {"a":1}]
}
df = pd.DataFrame(data)
results = why.log(df)
profile_view = results.view()

Check Non-nullable Types

python
from whylogs.core.constraints.factories import ( 
    column_is_nullable_integral,
    column_is_nullable_boolean, 
    column_is_nullable_fractional,
    column_is_nullable_object,
    column_is_nullable_string,
    null_values_below_number,
)
from whylogs.core.constraints import ConstraintsBuilder


builder = ConstraintsBuilder(dataset_profile_view=profile_view)
# Together, these two constraints check for a non-nullable string column
builder.add_constraint(column_is_nullable_string(column_name="animal"))
builder.add_constraint(null_values_below_number(column_name="animal", number=1))

# Together, these two constraints check for a non-nullable integral column
builder.add_constraint(column_is_nullable_integral(column_name="legs"))
builder.add_constraint(null_values_below_number(column_name="legs", number=1))

# Together, these two constraints check for a non-nullable fractional column
builder.add_constraint(column_is_nullable_fractional(column_name="weight"))
builder.add_constraint(null_values_below_number(column_name="weight", number=1))

# Together, these two constraints check for a non-nullable boolean column
builder.add_constraint(column_is_nullable_boolean(column_name="flies"))
builder.add_constraint(null_values_below_number(column_name="flies", number=1))

# Together, these two constraints check for a non-nullable object column
builder.add_constraint(column_is_nullable_object(column_name="obj"))
builder.add_constraint(null_values_below_number(column_name="obj", number=1))

constraints = builder.build()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
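
The logic of combining a nullable-type check with a null-count check can be sketched in plain Python. The helpers below are hypothetical and use Python's own type checks on raw values, which is only an approximation of how whylogs infers types (for instance, it does not reproduce the pandas dtype effect on columns that contain None).

```python
def is_nullable_of(values, expected_type):
    """True if every non-null value is exactly of expected_type."""
    return all(type(v) is expected_type for v in values if v is not None)


def has_no_nulls(values):
    """True if the column contains no None values."""
    return all(v is not None for v in values)


def is_non_nullable_of(values, expected_type):
    """Combine both checks: correct type and no missing values."""
    return is_nullable_of(values, expected_type) and has_no_nulls(values)


animal = ["cat", "snake", "snake", "cat", "mosquito"]
flies = [False, False, "False", False, True]
obj = [{"a": 1}, None, {"a": 1}, {"a": 1}, {"a": 1}]

results = {
    "animal non-nullable string": is_non_nullable_of(animal, str),
    "flies nullable boolean": is_nullable_of(flies, bool),
    "obj nullable object": is_nullable_of(obj, dict),
    "obj non-nullable object": is_non_nullable_of(obj, dict),
}
print(results)
```

In this sketch, `obj` satisfies the nullable check but not the non-nullable one because of its single None, and `flies` fails the type check because one entry is the string "False" rather than a boolean.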