
String Tracking - Unicode Range and String Length

python/examples/advanced/String_Tracking.ipynb

🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

By default, columns of type str are tracked with the following metrics when logged with whylogs:

  • Counts
  • Types
  • Frequent Items/Frequent Strings
  • Cardinality

In this example, we'll see how you can track further metrics for string columns. We will do that by counting, for each string record, the number of characters that fall in a given unicode range, and then generating distribution metrics, such as mean, stddev, and quantile values, based on these counts. We'll then apply the same approach to the overall string length.

In this example, we're interested in tracking two specific ranges of characters:

  • ASCII Digits (unicode range 48-57)
  • Lowercase Latin alphabet (unicode range 97-122)

For more info on the unicode list of characters, check this Wikipedia article.
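These ranges refer to Unicode code points, which Python exposes through the built-in ord(). As a quick sanity check (the in_range helper below is just an illustration, not part of whylogs):

```python
# Unicode code points of the range boundaries, via Python's built-in ord()
assert ord("0") == 48 and ord("9") == 57   # ASCII digits
assert ord("a") == 97 and ord("z") == 122  # lowercase Latin letters

def in_range(ch: str, lo: int, hi: int) -> bool:
    """Return True if the character's code point falls within [lo, hi]."""
    return lo <= ord(ch) <= hi

print(in_range("7", 48, 57))   # True
print(in_range("x", 97, 122))  # True
print(in_range("!", 48, 57))   # False
```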

Installing whylogs

If you haven't already, install whylogs:

python
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Creating the Data

Let's create a simple dataframe to demonstrate. To better visualize how the metrics work, we'll create 3 columns:

  • onlyDigits: Column of strings that contain only digit characters
  • onlyAlpha: Column of strings that contain only latin letters (no digits)
  • mixed: Column of strings that contain digits, letters, and other types of characters, like punctuation and symbols

python
import whylogs as why
import pandas as pd
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": ["[email protected]","ADK-1171","Copacabana 272 - Rio de Janeiro","21º C Friday - Sao Paulo, Brasil","18127819ASW"]
}
df = pd.DataFrame(data)
df.head()
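Before configuring whylogs, here is a conceptual preview of what a unicode-range metric summarizes (plain Python for illustration only, not whylogs internals): for each string, count the characters whose code points fall inside a range, then aggregate those counts.

```python
# Illustration only: per-string counts of characters in the digit range (48-57)
data = {"onlyDigits": ["12", "83", "1", "992", "7"]}

def count_in_range(s: str, lo: int, hi: int) -> int:
    """Count characters of s (lowercased) whose code points lie in [lo, hi]."""
    return sum(lo <= ord(ch) <= hi for ch in s.lower())

digit_counts = [count_in_range(s, 48, 57) for s in data["onlyDigits"]]
print(digit_counts)                           # [2, 2, 1, 3, 1]
print(sum(digit_counts) / len(digit_counts))  # 1.8 - the kind of mean whylogs tracks
```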

Configuring the Metrics in the DatasetSchema

whylogs uses Resolvers in order to define the set of metrics tracked for a column name or data type. In this case, we'll create a custom Resolver to apply the UnicodeRangeMetric to all of the columns.

If you're interested in seeing how you can add or remove different metrics according to the column type or column name, please refer to this example on Schema Configuration.

python
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics.unicode_range import UnicodeRangeMetric
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from typing import Dict
from whylogs.core.metrics import Metric, MetricConfig

class UnicodeResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        # Attach a UnicodeRangeMetric to every column, regardless of name or type
        return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}

Resolvers are passed to whylogs through a DatasetSchema, so we'll have to create a custom Schema as well.

We'll just have to:

  • Pass the UnicodeResolver created previously
  • Since we're interested in changing the default character ranges, we'll also pass a Metric Configuration with the desired ranges
python
config = MetricConfig(unicode_ranges={"digits": (48, 57), "alpha": (97, 122)})
schema = DatasetSchema(resolvers=UnicodeResolver(), default_configs=config)

If a default MetricConfig is not passed, whylogs falls back to its default unicode ranges, which track categories such as emoticons, control characters, and extended Latin.

We can now log the dataframe and pass our schema when calling log:

python
prof_results = why.log(df, schema=schema)
prof = prof_results.profile()

Unicode Range and String Length Metrics

Let's take a look at the Profile View:

python
profile_view_df = prof.view().to_pandas()
profile_view_df

You can see there are a lot of different metrics for each of the original dataframe's columns. The unicode_range metric contains additional submetrics. In this case, we have metrics for:

  • digits: distribution metrics for characters inside the unicode's digit range
  • alpha: distribution metrics for characters inside the unicode's lowercase letters range
  • UNKNOWN: distribution metrics for characters that fall outside all of the predefined ranges (digits and alpha)
  • string_length: distribution metrics for overall string length

For each of these submetrics, we have metric components such as:

  • mean: the calculated mean for the column
  • stddev: the calculated standard deviation for the column
  • n: the total number of records for the column
  • max, min: maximum and minimum values for the column
  • q_xx: the xx-th quantile value of the data’s distribution
  • median: the median for the column

For instance, let's check the mean for alpha:

python
profile_view_df['unicode_range/alpha:distribution/mean']

The values above show a mean of 0 for onlyDigits - which is expected, since this column contains no letters, only digits. We also see a mean of 5 for onlyAlpha, which coincides with the mean string length for that column, since it contains only letter characters. For mixed, the mean is 12.8, and we can indeed see that this column has a higher count of letter characters than the previous columns.
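Since the dataframe is small, we can reproduce the onlyAlpha value by hand with plain Python (an illustrative check, not how whylogs computes it):

```python
# Count lowercase Latin letters (code points 97-122) per string, then average
names = ["Alice", "Bob", "Chelsea", "Danny", "Eddie"]
alpha_counts = [sum(97 <= ord(ch) <= 122 for ch in s.lower()) for s in names]
print(alpha_counts)                           # [5, 3, 7, 5, 5]
print(sum(alpha_counts) / len(alpha_counts))  # 5.0, matching the profile's mean
```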

You might notice that, even though we defined the range for only lowercase letters, uppercase characters also are included when calculating the metrics. That happens because the strings are all lowercased during preprocessing before tracking the strings.

Let's now check the UNKNOWN namespace:

python
profile_view_df['unicode_range/UNKNOWN:distribution/mean']

Since onlyDigits and onlyAlpha contain only digits and letters, respectively, there are no characters outside the defined ranges, yielding means of 0. In the mixed column, however, this value is non-zero, since there are characters such as ., -, º, and whitespace that don't fall in any of the predefined ranges.
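We can mimic the UNKNOWN counting with a small helper (again, an illustrative sketch rather than whylogs' actual implementation): a character lands in UNKNOWN when it fits none of the configured ranges.

```python
def unknown_count(s: str, ranges) -> int:
    """Count characters of s (lowercased) outside every configured range."""
    return sum(
        not any(lo <= ord(ch) <= hi for lo, hi in ranges)
        for ch in s.lower()
    )

ranges = [(48, 57), (97, 122)]  # digits and lowercase Latin letters
print(unknown_count("992", ranges))       # 0 - all digits
print(unknown_count("Chelsea", ranges))   # 0 - all letters after lowercasing
print(unknown_count("ADK-1171", ranges))  # 1 - only '-' is outside both ranges
```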

The last namespace, string_length, contains metrics for the strings' length:

python
profile_view_df['unicode_range/string_length:distribution/min']

The string_length submetric doesn't take into account any particular range. It contains aggregate metrics for the overall string length in each column. In this case, we're seeing the minimum value for the 3 columns: 1 for onlyDigits, 3 for onlyAlpha, and 8 for mixed. Since the dataframe used here is very small, we can easily check the original data and verify that these metrics are indeed correct.
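Using the literal strings from the dataframe above, we can confirm those minimums directly:

```python
# Minimum string length per column, computed directly from the raw data
data = {
    "onlyDigits": ["12", "83", "1", "992", "7"],
    "onlyAlpha": ["Alice", "Bob", "Chelsea", "Danny", "Eddie"],
    "mixed": ["[email protected]", "ADK-1171", "Copacabana 272 - Rio de Janeiro",
              "21º C Friday - Sao Paulo, Brasil", "18127819ASW"],
}
for col, values in data.items():
    print(col, min(len(s) for s in values))
# onlyDigits 1
# onlyAlpha 3
# mixed 8 ("ADK-1171")
```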

Conclusion

Feel free to define your own ranges of interest and combine the UnicodeRange metrics with other standard metrics as you see fit!

The resulting profiles can be:

  • merged together
  • stored locally or in the cloud (e.g., AWS S3)

or used for other purposes, such as:

  • Setting constraints for data quality validation
  • Visualizing and comparing profiles
  • Sending profiles to monitoring and observability platforms

Be sure to check the other examples at whylogs' Documentation!