Creating Metric Constraints on Condition Count Metrics

python/examples/advanced/Metric_Constraints_with_Condition_Count_Metrics.ipynb

whylogs profiles contain summarized information about our data. Profiling is therefore a lossy process: once we have the profiles, we no longer have access to the complete dataset.

This makes some types of constraints impossible to create from the standard metrics alone. For example, suppose you need to check every row of a column to make sure that no text matches a credit card number or an email address. Or maybe you want to ensure that there are no even numbers in a certain column. How do we do that if we don't have access to the complete data?

The answer is that you need to define a Condition Count Metric to be tracked before logging your data. This metric counts the number of times the values of a given column meet a user-defined condition. When the profile is generated, you'll have that information available to check against the constraints you create.
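
To make the idea concrete, here is a minimal pure-Python sketch of what such a metric keeps track of: a running total plus one counter per named condition. It is an illustration only and not whylogs' implementation; the class name `ToyConditionCount` and its `track` method are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


class ToyConditionCount:
    """Hypothetical stand-in for a condition count metric (not the whylogs class)."""

    def __init__(self, conditions: Dict[str, Callable[[Any], bool]]) -> None:
        self.conditions = conditions
        self.total = 0  # number of values seen
        self.matches = {name: 0 for name in conditions}  # values meeting each condition

    def track(self, values: Iterable[Any]) -> None:
        for value in values:
            self.total += 1
            for name, relation in self.conditions.items():
                if relation(value):
                    self.matches[name] += 1


# Count even numbers without keeping the raw values around.
metric = ToyConditionCount({"is_even": lambda x: x % 2 == 0})
metric.track([1, 2, 3, 4, 5])
print(metric.total, metric.matches)  # 5 {'is_even': 2}
```

Only the counters survive after tracking, which is why the condition has to be defined before the data is logged.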

In this example, you'll learn how to:

  • Define additional Condition Count Metrics
  • Define actions to be triggered whenever those conditions are met during the logging process.
  • Use the Condition Count Metrics to create constraints against said conditions

For more information on Condition Count Metrics, see this example and the documentation for Data Validation.

Installing whylogs

python
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Context

Let's assume we have a DataFrame for which we wish to log standard metrics through whylogs' default logging process. But additionally, we want specific information on two columns:

  • url: Regex pattern validation: the values in this column should always start with https://www.mydomain.com/profile
  • subscription_date: Date Format validation: the values in this column should be a string with a date format of %Y-%m-%d

In addition, we consider these cases to be critical, so we wish to take certain actions whenever a condition fails. In this example we will:

  • Send an alert in Slack whenever subscription_date fails the condition
  • Send an alert in Slack and pull a symbolic Andon Cord whenever url is not from the domain we expect

Let's first create a simple DataFrame to demonstrate:

python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charles"],
    "age": [31, 0, 25],
    "url": ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"],
    "subscription_date": ["2021-12-28", "2019-29-11", "04/08/2021"],
}

df = pd.DataFrame(data)

In this case, url and subscription_date each have 2 values out of 3 that are not what we expect.
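
As a quick sanity check of that claim, the sketch below tests the same sample values with only the standard library. The helper `is_ymd` and the variable names are made up for this illustration and are separate from the relations used in the rest of the example.

```python
import re
from datetime import datetime

urls = ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"]
dates = ["2021-12-28", "2019-29-11", "04/08/2021"]


def is_ymd(value: str) -> bool:
    """Return True if value parses as a %Y-%m-%d date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


url_pattern = re.compile(r"^https://www\.mydomain\.com/profile")

# Collect the values that violate each expectation.
bad_urls = [u for u in urls if not url_pattern.match(u)]
bad_dates = [d for d in dates if not is_ymd(d)]

print(bad_urls)
print(bad_dates)
```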

Defining the Relations

Let's first define the relations that will actually check whether a value passes our constraint. For the date format validation, we'll use the datetime module in a user-defined function. As for the regex pattern matching, we will use whylogs' Predicates along with regular expressions, which allow us to build simple relations intuitively.

python
import datetime
from typing import Any
from whylogs.core.relations import Predicate


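def date_format(x: Any) -> bool:
    """Return True if x is a string in %Y-%m-%d format."""
    expected_format = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, expected_format)
        return True
    except (ValueError, TypeError):
        # ValueError: wrong format; TypeError: x is not a string (e.g. None)
        return False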

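# matches() accepts a regular expression; a raw string avoids invalid escape
# sequences, and the dots are escaped so they only match a literal "."
matches_domain_url = Predicate().matches(r"^https://www\.mydomain\.com/profile")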
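
When writing patterns like this, keep in mind that an unescaped `.` matches any character, so a pattern containing a bare `www.mydomain.com` also accepts look-alike hosts. The sketch below uses Python's `re` module directly, on the assumption that the predicate follows the same regex semantics; the look-alike URL is made up for illustration.

```python
import re

loose = re.compile(r"^https://www.mydomain.com/profile")     # dots match any character
strict = re.compile(r"^https://www\.mydomain\.com/profile")  # dots match only "."

lookalike = "https://wwwXmydomain.com/profile/123"  # made-up URL for illustration
genuine = "https://www.mydomain.com/profile/123"

loose_accepts_lookalike = bool(loose.match(lookalike))
strict_accepts_lookalike = bool(strict.match(lookalike))

print(loose_accepts_lookalike, strict_accepts_lookalike)            # True False
print(bool(loose.match(genuine)), bool(strict.match(genuine)))      # True True
```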

Defining the Actions

Next, we need to define the actions that will be triggered whenever the conditions fail.

We will define two placeholder functions that, in a real scenario, would execute the defined actions.

python
from typing import Any

def pull_andon_cord(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Pulling andon cord....")
    # Do something here to respond to the constraint violation
    return

def send_slack_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Sending slack alert....")
    # Do something here to respond to the constraint violation
    return
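
Since an action is just a callable that receives the validator name, the condition name and the offending value, it can also record failures instead of printing them, which can be handy in tests. The sketch below is a hypothetical variant: `collect_failure`, `failures`, `is_non_negative` and the manual loop standing in for the logging process are all assumptions made for illustration.

```python
from typing import Any, List, Tuple

# Every failure is stored as (validator_name, condition_name, value).
failures: List[Tuple[str, str, Any]] = []


def collect_failure(validator_name: str, condition_name: str, value: Any) -> None:
    """Hypothetical action with the same three-argument signature as above."""
    failures.append((validator_name, condition_name, value))


def is_non_negative(x: Any) -> bool:
    return x >= 0


# Stand-in for the logging process: the action fires only when the relation fails.
for value in [3, -1, 7, -5]:
    if not is_non_negative(value):
        collect_failure("toy_validator", "non_negative", value)

print(failures)
```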

Conditions = Relations + Actions

Conditions are defined by the combination of a relation and a set of actions. Now that we have both relations and actions, we can create our sets of conditions. In this example each set contains a single condition, but a set could hold several. We also add a third set, without any actions, that counts zeros in an integer column.

python
from whylogs.core.metrics.condition_count_metric import Condition

has_date_format = {
    "Y-m-d format": Condition(date_format, actions=[send_slack_alert]),
}

regex_conditions = {
    "url_matches_domain": Condition(matches_domain_url, actions=[pull_andon_cord, send_slack_alert]),
}

ints_conditions = {
    "integer_zeros": Condition(Predicate().equals(0)),
}

Passing the conditions to the Logger

Now, we need to make the logger aware of our Conditions. This can be done by creating a custom schema object that will be passed to why.log().

To create the schema object, we will use the Declarative Schema, which is an auxiliary class that will enable us to create a schema in a simple way.

In this case, we want our schema to start with the default behavior (standard metrics for the default datatypes). Then, we want to add condition count metrics based on the conditions we defined earlier, binding each set to the name of the column it applies to. We can do so by calling the schema's add_resolver_spec method:

python
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.schema import DeclarativeSchema

schema = DeclarativeSchema(STANDARD_RESOLVER)

schema.add_resolver_spec(column_name="subscription_date", metrics=[ConditionCountMetricSpec(has_date_format)])
schema.add_resolver_spec(column_name="url", metrics=[ConditionCountMetricSpec(regex_conditions)])
schema.add_resolver_spec(column_name="age", metrics=[ConditionCountMetricSpec(ints_conditions)])

Now, let's pass the schema to why.log() and start logging our data:

python
import whylogs as why
profile_view = why.log(df, schema=schema).profile().view()

You can see that during the logging process, our actions were triggered whenever the condition failed. We can see the name of the failed condition and the specific value that triggered it.

We see the actions were triggered, but we also expect the Condition Count Metrics to be generated. Let's see if this is the case:

python
profile_view.to_pandas()

At the far right of our summary dataframe, you can find the Condition Count Metrics: the Y-m-d format condition was met only once out of a total of 3. The same is true for url_matches_domain. Note that for columns where the condition was not defined, a NaN is displayed.

Creating Metric Constraints based on Condition Count Metrics

So far, we have created Condition Count Metrics for each of the desired conditions. During the logging process, the set of actions defined for each condition was triggered whenever the condition failed to be met.

Now, we wish to create Metric Constraints on top of the Condition Count Metrics, so we can generate a Constraints Report. This can be done with the helper constraints condition_meets, condition_never_meets and condition_count_below. You only need to specify the column name and the name of the condition you want to check (plus a maximum count for condition_count_below):

python
from whylogs.core.constraints.factories import condition_meets, condition_never_meets, condition_count_below
from whylogs.core.constraints import ConstraintsBuilder

builder = ConstraintsBuilder(dataset_profile_view=profile_view)

builder.add_constraint(condition_meets(column_name="subscription_date", condition_name="Y-m-d format"))
builder.add_constraint(condition_never_meets(column_name="url", condition_name="url_matches_domain"))
builder.add_constraint(condition_count_below(column_name="age", condition_name="integer_zeros", max_count=1))

constraints = builder.build()
constraints.generate_constraints_report()

The condition_meets constraint will fail if the condition is not met for every value, that is, if it fails at least once. In other words, it fails if condition_count/condition_name is smaller than condition_count/total.

The condition_never_meets constraint will fail if the condition is met at least once. In other words, it fails if condition_count/condition_name is greater than 0.

The condition_count_below constraint will fail if the said condition is met more than a specified number of times.
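
The three rules can be restated as plain functions over a summary of counts: pass only if the count equals the total, pass only if the count is zero, and pass only if the count does not exceed the maximum. This is a sketch under stated assumptions rather than the whylogs implementation: the `toy_*` function names and the dictionary layout are hypothetical, and the inclusive threshold is inferred from the wording "more than a specified number of times". The counts are those of the sample DataFrame.

```python
from typing import Dict


def toy_condition_meets(summary: Dict[str, int], name: str) -> bool:
    # Passes only if every tracked value met the condition.
    return summary[name] == summary["total"]


def toy_condition_never_meets(summary: Dict[str, int], name: str) -> bool:
    # Passes only if no tracked value met the condition.
    return summary[name] == 0


def toy_condition_count_below(summary: Dict[str, int], name: str, max_count: int) -> bool:
    # Passes if the condition was met at most max_count times (assumed inclusive).
    return summary[name] <= max_count


date_summary = {"total": 3, "Y-m-d format": 1}
url_summary = {"total": 3, "url_matches_domain": 1}
age_summary = {"total": 3, "integer_zeros": 1}

results = {
    "subscription_date meets Y-m-d format": toy_condition_meets(date_summary, "Y-m-d format"),
    "url never meets url_matches_domain": toy_condition_never_meets(url_summary, "url_matches_domain"),
    "age integer_zeros below 1": toy_condition_count_below(age_summary, "integer_zeros", max_count=1),
}
for constraint_name, passed in results.items():
    print(constraint_name, "PASS" if passed else "FAIL")
```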

Visualizing the Report

You can visualize the Constraints Report as usual by calling NotebookProfileVisualizer's constraints_report:

python
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

By hovering on the status, you can view the number of times the condition failed, and the total number of times the condition was checked.