
String Tracking - Unicode Range and String Length

python/examples/advanced/String_Tracking.ipynb

🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

By default, columns of type str are tracked with the following metrics when logged with whylogs:

  • Counts
  • Types
  • Frequent Items/Frequent Strings
  • Cardinality

In this example, we'll see how you can track further metrics for string columns. We will do that by counting, for each string record, the number of characters that fall in a given unicode range, and then generating distribution metrics, such as mean, stddev, and quantile values, based on these counts. We'll then apply the same approach to the overall string length.

In this example, we're interested in tracking two specific ranges of characters:

  • ASCII Digits (unicode range 48-57)
  • Lowercase Latin alphabet (unicode range 97-122)

For more info on the unicode list of characters, check this Wikipedia article.
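These ranges refer to Unicode code points, which Python exposes through the built-in ord(). As a quick sanity check (the in_range helper below is just an illustration, not part of whylogs):

```python
# Unicode code points of the range boundaries, via Python's built-in ord()
assert ord("0") == 48 and ord("9") == 57   # ASCII digits
assert ord("a") == 97 and ord("z") == 122  # lowercase Latin letters

def in_range(ch: str, lo: int, hi: int) -> bool:
    """Return True if the character's code point falls within [lo, hi]."""
    return lo <= ord(ch) <= hi

print(in_range("7", 48, 57))   # True
print(in_range("x", 97, 122))  # True
print(in_range("!", 48, 57))   # False
```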

Installing whylogs

If you haven't already, install whylogs:

python
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Creating the Data

Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create 3 columns:

  • onlyDigits: Column of strings that contain only digit characters
  • onlyAlpha: Column of strings that contain only latin letters (no digits)
  • mixed: Column of strings that contain digits, letters, and other types of characters, like punctuation and symbols

python
import whylogs as why
import pandas as pd
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": ["[email protected]","ADK-1171","Copacabana 272 - Rio de Janeiro","21º C Friday - Sao Paulo, Brasil","18127819ASW"]
}
df = pd.DataFrame(data)
df.head()
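Before configuring whylogs, here is a conceptual preview of what a unicode-range metric summarizes (plain Python for illustration only, not whylogs internals): for each string, count the characters whose code points fall inside a range, then aggregate those counts.

```python
# Illustration only: per-string counts of characters in the digit range (48-57)
data = {"onlyDigits": ["12", "83", "1", "992", "7"]}

def count_in_range(s: str, lo: int, hi: int) -> int:
    """Count characters of s (lowercased) whose code points lie in [lo, hi]."""
    return sum(lo <= ord(ch) <= hi for ch in s.lower())

digit_counts = [count_in_range(s, 48, 57) for s in data["onlyDigits"]]
print(digit_counts)                           # [2, 2, 1, 3, 1]
print(sum(digit_counts) / len(digit_counts))  # 1.8 - the kind of mean whylogs tracks
```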

Configuring the Metrics in the DatasetSchema

whylogs uses Resolvers in order to define the set of metrics tracked for a column name or data type. In this case, we'll create a custom Resolver to apply the UnicodeRangeMetric to all of the columns.

If you're interested in seeing how you can add or remove different metrics according to the column type or column name, please refer to this example on Schema Configuration.

python
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics.unicode_range import UnicodeRangeMetric
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from typing import Dict
from whylogs.core.metrics import Metric, MetricConfig

class UnicodeResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        # Attach a UnicodeRangeMetric to every column, regardless of name or type
        return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}

Resolvers are passed to whylogs through a DatasetSchema, so we'll have to create a custom Schema as well.

We'll just have to:

  • Pass the UnicodeResolver created previously
  • Since we're interested in changing the default character ranges, we'll also pass a Metric Configuration with the desired ranges
python
config = MetricConfig(unicode_ranges={"digits": (48, 57), "alpha": (97, 122)})
schema = DatasetSchema(resolvers=UnicodeResolver(), default_configs=config)

If a default MetricConfig is not passed, whylogs falls back to its default unicode ranges, which track categories such as emoticons, control characters, and extended Latin.

We can now log the dataframe and pass our schema when calling log:

python
prof_results = why.log(df, schema=schema)
prof = prof_results.profile()

Unicode Range and String Length Metrics

Let's take a look at the Profile View:

python
profile_view_df = prof.view().to_pandas()
profile_view_df

You can see there are a lot of different metrics for each of the original dataframe's columns. The unicode_range metric contains additional submetrics. In this case, we have metrics for:

  • digits: distribution metrics for characters inside the unicode's digit range
  • alpha: distribution metrics for characters inside the unicode's lowercase letters range
  • UNKNOWN: distribution metrics for characters that fall outside all of the predefined ranges (digits and alpha)
  • string_length: distribution metrics for overall string length

For each of these submetrics, we have metric components such as:

  • mean: the calculated mean for the column
  • stddev: the calculated standard deviation for the column
  • n: the total number of records for the column
  • max, min: maximum and minimum values for the column
  • q_xx: the xx-th quantile value of the data’s distribution
  • median: the median for the column

For instance, let's check the mean for alpha:

python
profile_view_df['unicode_range/alpha:distribution/mean']

The values above show a mean of 0 for onlyDigits - which is expected, since this column contains no letters, only digits. We also see a mean of 5 for onlyAlpha, which coincides with the mean string length for that column, since it contains only letter characters. For mixed, the mean is 12.8, and we can indeed see that this column has a higher count of letter characters than the previous columns.
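Since the dataframe is small, we can reproduce the onlyAlpha value by hand with plain Python (an illustrative check, not how whylogs computes it):

```python
# Count lowercase Latin letters (code points 97-122) per string, then average
names = ["Alice", "Bob", "Chelsea", "Danny", "Eddie"]
alpha_counts = [sum(97 <= ord(ch) <= 122 for ch in s.lower()) for s in names]
print(alpha_counts)                           # [5, 3, 7, 5, 5]
print(sum(alpha_counts) / len(alpha_counts))  # 5.0, matching the profile's mean
```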

You might notice that, even though we defined the range for only lowercase letters, uppercase characters also are included when calculating the metrics. That happens because the strings are all lowercased during preprocessing before tracking the strings.

Let's now check the UNKNOWN namespace:

python
profile_view_df['unicode_range/UNKNOWN:distribution/mean']

Since onlyDigits and onlyAlpha contain only digits and letters, respectively, there are no characters outside the defined ranges, yielding means of 0. In the mixed column, however, this value is non-zero, since there are characters such as ., -, º, and whitespace that don't fall in any of the predefined ranges.
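We can mimic the UNKNOWN counting with a small helper (again, an illustrative sketch rather than whylogs' actual implementation): a character lands in UNKNOWN when it fits none of the configured ranges.

```python
def unknown_count(s: str, ranges) -> int:
    """Count characters of s (lowercased) outside every configured range."""
    return sum(
        not any(lo <= ord(ch) <= hi for lo, hi in ranges)
        for ch in s.lower()
    )

ranges = [(48, 57), (97, 122)]  # digits and lowercase Latin letters
print(unknown_count("992", ranges))       # 0 - all digits
print(unknown_count("Chelsea", ranges))   # 0 - all letters after lowercasing
print(unknown_count("ADK-1171", ranges))  # 1 - only '-' is outside both ranges
```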

The last namespace, string_length, contains metrics for the strings' length:

python
profile_view_df['unicode_range/string_length:distribution/min']

The string_length submetric doesn't take into account any particular range. It contains aggregate metrics for the overall string length in each column. In this case, we're seeing the minimum value for the 3 columns: 1 for onlyDigits, 3 for onlyAlpha, and 8 for mixed. Since the dataframe used here is very small, we can easily check the original data and verify that these metrics are indeed correct.
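Using the literal strings from the dataframe above, we can confirm those minimums directly:

```python
# Minimum string length per column, computed directly from the raw data
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": ["[email protected]", "ADK-1171", "Copacabana 272 - Rio de Janeiro",
              "21º C Friday - Sao Paulo, Brasil", "18127819ASW"],
}
for col, values in data.items():
    print(col, min(len(s) for s in values))
# onlyDigits 1
# onlyAlpha 3
# mixed 8 ("ADK-1171")
```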

Conclusion

Feel free to define your own ranges of interest and combine the UnicodeRange metrics with other standard metrics as you see fit!

The resulting profiles can be:

  • merged together
  • stored locally or in the cloud (e.g., AWS S3)

or used for other purposes, such as:

  • Setting constraints for data quality validation
  • Visualizing and comparing profiles
  • Sending profiles to monitoring and observability platforms

Be sure to check the other examples at whylogs' Documentation!