python/examples/advanced/Condition_Count_Metrics.ipynb
🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
By default, whylogs tracks several metrics, such as type counts, distribution metrics, cardinality and frequent items. Those are general metrics that are useful for a lot of use cases, but often we need metrics tailored for our application.
Condition count metrics give you the flexibility to define your own customized metrics. The results are returned as counters: the number of times the condition was met for a given column. With them, you can define conditions such as regex matching for strings, equalities or inequalities for numerical features, or even your own function to check for any given condition.
In this example, we will cover how to define conditions for regex matching, numerical relations, logical combinations of relations, and custom functions.
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Let's import all the dependencies for this example upfront:
import pandas as pd
from typing import Any
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Not, Predicate
from whylogs.core.schema import DeclarativeSchema
Suppose we have textual columns in our data in which we want to make sure certain elements are present / not present.
For example, for privacy and security issues, we might be interested in tracking the number of times a credit card number appears on a given column, or if we have sensitive email information in another column.
With whylogs, we can define metrics that count the number of times a certain regex pattern is matched for a given column.
Let's create a simple dataframe.
In this scenario, the emails column should contain only a valid email, nothing else. As for the transcriptions column, we want to make sure any existing credit card numbers were properly masked or removed.
data = {
"emails": ["my email is [email protected]","[email protected]","[email protected]","not an email"],
"transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)
The conditions are defined through whylogs' Condition object. There are several ways of assembling a condition. In the following example, we will define two different regex patterns, one for each column. Since we can define multiple conditions for a single column, we'll assemble the conditions into dictionaries, where the key is the condition name. Each dictionary will later be attached to the relevant column.
emails_conditions = {
"containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}
transcriptions_conditions = {
"containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}
whylogs must be aware of those conditions while profiling the data. We can do that by creating a Standard Schema, and then simply adding the conditions to the schema with add_resolver_spec. That way, we can pass our enhanced schema when calling why.log() later.
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
Note: The regex expressions are for demonstration purposes only. These expressions are not general: there will be emails and credit card numbers that the expressions fail to match.
Now, we only need to pass our schema when logging our data. Let's also take a look at the metrics, to make sure everything was tracked correctly:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
Let's check the numbers:
For the emails feature, only one occurrence was counted for containsEmail. That is expected, because the only valid row is the third one ("[email protected]"). The others either don't contain an email, contain an invalid email, or have extra text that is not part of an email (note we're using fullmatch as the predicate for the email condition).
For the transcriptions column, we also have only one match. That is also expected, since only the first row matches the given pattern; the others either don't have a credit card number or have it properly "hidden". Note that in this case we want to check for the pattern inside a broader text, so we prepend .* to the pattern; otherwise the text would have to start with the pattern (whylogs' Predicate.matches uses Python's re.compile().match() under the hood).
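To see why the leading .* matters, here's a quick check with Python's re module alone (independent of whylogs):

```python
import re

text = "Bob's credit card number is 4000000000000"

# match() anchors at the start of the string, so the bare pattern fails here...
bare = re.compile("4[0-9]{12}(?:[0-9]{3})?")
print(bool(bare.match(text)))  # False

# ...while a leading ".*" lets the pattern occur anywhere in the text
anywhere = re.compile(".*4[0-9]{12}(?:[0-9]{3})?")
print(bool(anywhere.match(text)))  # True
```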
The available relations for regex matching are the ones used in this example:
- matches
- fullmatch

For this section, let's create integer and float columns:
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
As before, we will create our set of conditions for each column and pass both to our schema:
ints_conditions = {
"equals42": Condition(Predicate().equals(42)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
floats_conditions = {
"equals42.2": Condition(Predicate().equals(42.2)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])
Let's log and check the metrics:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]
We can simply check the original data to verify that the metrics are correct.
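As a sanity check, we can reproduce the expected counts with plain Python (no whylogs involved), using the same sample values:

```python
ints = [1, 12, 42, 4]
floats = [1.2, 12.3, 42.2, 4.8]

# lessthan5, morethan40, equals42 for the ints column
print(sum(x < 5 for x in ints), sum(x > 40 for x in ints), sum(x == 42 for x in ints))  # 2 1 1

# lessthan5, morethan40, equals42.2 for the floats column
print(sum(x < 5 for x in floats), sum(x > 40 for x in floats), sum(x == 42.2 for x in floats))  # 2 1 1
```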
We used equals, less_than and greater_than in this example, but here's the complete list of available relations:
- equals - equal to
- less_than - less than
- less_or_equals - less than or equal to
- greater_than - greater than
- greater_or_equals - greater than or equal to
- not_equal - not equal to

You can also combine relations with logical operators such as AND, OR and NOT.
Let's stick with the numerical features to show how you can combine relations to assemble conditions such as:
conditions = {
"between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
"outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
"not_42": Condition(Not(Predicate().equals(42))), # could also use X.not_equal(42) or X.not_.equals(42)
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]
Available logical operators are:
- and_
- or_
- not_
- Not

Note that and_, or_, and not_ are methods called on a Predicate and passed another Predicate, while Not is a function that takes a single Predicate argument.
Even though we showed these operators with numerical features, this also works with regex matching conditions shown previously.
If none of the previously shown conditions suit your use case, you are free to define your own custom function to create your own metrics.
Let's see a simple example: suppose we want to check if a certain number is even.
We can define an even predicate function as simply as:
def even(x: Any) -> bool:
return x % 2 == 0
And then we proceed as usual, defining our condition and adding it to the schema:
We only have to pass the function to Predicate().is_() inside a Condition object, like below:
conditions = {
"isEven": Condition(Predicate().is_(even)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isEven', 'condition_count/total']]
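Again, we can sanity-check the expected counts by applying even directly to the sample columns:

```python
def even(x) -> bool:
    return x % 2 == 0

ints_column = [1, 12, 42, 4]
floats_column = [1.2, 12.3, 42.2, 4.8]

# 12, 42, and 4 are even; none of the floats is exactly divisible by 2
print(sum(even(x) for x in ints_column))    # 3
print(sum(even(x) for x in floats_column))  # 0
```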
For user-defined functions, the sky's the limit for what you can do.
Let's think of another simple scenario for NLP. Suppose our model assumes text to be preprocessed in a certain way. Maybe it was trained on text that contains only lowercase letters, no digits, and no stopwords.
Let's check these conditions for the data below:
data = {
"transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
"ints": [0,1,2,3],
}
df = pd.DataFrame(data=data)
Once again, let's define our function:
def preprocessed(x: Any) -> bool:
stopwords = ["i", "me", "myself"]
if not isinstance(x, str):
return False
# should have only lowercase letters and space (no digits)
if not all(c.islower() or c.isspace() for c in x):
return False
# should not contain any words in our stopwords list
if any(c in stopwords for c in x.split()):
return False
return True
Since this is an example, our stopwords list is only a placeholder for the real thing.
The rest is the same as before:
conditions = {
"isPreprocessed": Condition(Predicate().is_(preprocessed)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed', 'condition_count/total']]
For the transcriptions feature, we can see that only the third row is properly preprocessed ("am alice and am xx years old"). The first one contained uppercase characters, the second contained a stopword, and the last one contained digits. The ints column has no condition count metrics attached in this schema, so isPreprocessed is not reported for it (and even if it were attached, preprocessed returns False for non-string values).
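We can reproduce this count by calling preprocessed directly on the sample rows:

```python
def preprocessed(x) -> bool:
    stopwords = ["i", "me", "myself"]
    if not isinstance(x, str):
        return False
    # should have only lowercase letters and spaces (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words in our stopwords list
    if any(c in stopwords for c in x.split()):
        return False
    return True

rows = [
    "I AM BOB AND I LIKE TO SCREAM",  # uppercase characters
    "i am bob",                       # contains the stopword "i"
    "am alice and am xx years old",   # passes all checks
    "am bob and am 42 years old",     # contains digits
]
print([preprocessed(r) for r in rows])  # [False, False, True, False]
```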
You can combine this example with other whylogs' features to cover even more scenarios.
Here's a pointer for a possible use case: you can create metric constraints on top of condition count metrics to validate your data (for example, asserting that containsCreditCard should always be 0). Check the Metric Constraints with Condition Count Metrics example!

Here are the complete code snippets - just to make it easier to copy/paste!
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
data = {
"emails": ["my email is [email protected]","[email protected]","[email protected]","not an email"],
"transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)
emails_conditions = {
"containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}
transcriptions_conditions = {
"containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
ints_conditions = {
"equals42": Condition(Predicate().equals(42)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
floats_conditions = {
"equals42.2": Condition(Predicate().equals(42.2)),
"lessthan5": Condition(Predicate().less_than(5)),
"morethan40": Condition(Predicate().greater_than(40)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]
import pandas as pd
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.relations import Not
data = {
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
conditions = {
"between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
"outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
"not_42": Condition(Not(Predicate().equals(42))), # could also use X.not_equal(42) or X.not_.equals(42)
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]
import pandas as pd
from typing import Any
import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
def even(x: Any) -> bool:
return x % 2 == 0
def preprocessed(x: Any) -> bool:
stopwords = ["i", "me", "myself"]
if not isinstance(x, str):
return False
# should have only lowercase letters and space (no digits)
if not all(c.islower() or c.isspace() for c in x):
return False
# should not contain any words in our stopwords list
if any(c in stopwords for c in x.split()):
return False
return True
data = {
"transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
"ints_column": [1,12,42,4],
"floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)
transcriptions_conditions = {
"isPreprocessed": Condition(Predicate().is_(preprocessed)),
}
numerical_conditions = {
"isEven": Condition(Predicate().is_(even)),
}
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed','condition_count/isEven', 'condition_count/total']]