examples/credit-risk-end-to-end/01_Credit_Risk_Data_Prep.ipynb
Predicting credit risk is an important task for financial institutions. If a bank can accurately determine the probability that a borrower will pay back a future loan, then it can make better decisions on loan terms and approvals. Getting credit risk right is critical to offering good financial services, and getting credit risk wrong could mean going out of business.
AI models play a central role in modern credit risk assessment systems. In this example, we develop a credit risk model to predict whether a future loan will be good or bad, given some context data (presumably supplied from the loan application). We use the modeling process to demonstrate how Feast can facilitate serving data for both training and inference use cases.
In this notebook, we prepare the data.
The following code assumes that you have read the example README.md file, and that you have set up an environment where the code can be run. Please make sure you have addressed the prerequisites.
# Import Python libraries
import os
import warnings
import datetime as dt
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
# suppress warning messages for example flow (don't run if you want to see warnings)
warnings.filterwarnings('ignore')
# Seed for reproducibility
SEED = 142
The data we will use to train the model is from the OpenML dataset credit-g, obtained from a 1994 German study. More details on the data can be found in the DESCR attribute and details map (see below).
data = fetch_openml(name="credit-g", version=1, parser='auto')
print(data.DESCR)
print("Original data url: ".ljust(20), data.details["original_data_url"])
print("Paper url: ".ljust(20), data.details["paper_url"])
Let's inspect the data to see high level details like data types and size. We also want to make sure there are no glaring issues (like a large number of null values).
df = data.frame
df.info()
We see that there are 21 columns, each with 1000 non-null values. The first 20 columns are contextual fields with dtype category or int64, while the last column, class, is the target variable we wish to predict.
From the description (above), the class tells us whether a loan to a customer was "good" or "bad". We anticipate that patterns in the contextual data, and their relationship to the class outcomes, can inform loan classification. In the following notebooks, we will build a loan classification model that encodes these patterns and relationships in its weights, so that, given a new loan application (context data), the model can predict whether the loan (if approved) will turn out good or bad.
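Before moving on, it's also worth checking how the two outcome classes are distributed, since a heavily imbalanced target would affect modeling choices later. A minimal check with standard pandas:
# Check the balance of the target variable
df["class"].value_counts()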
At this point, it's worth noting that Feast was developed primarily to work with production data. Feast requires datasets to have entities (in our case, IDs) and timestamps, which it uses in joins. Feast can support joining data on multiple entities (like primary keys in SQL), as well as separate "created" and "event" timestamps. However, in this example, we'll keep things simple.
In a real loan application scenario, the application fields (in a database) would be associated with a timestamp, while the actual loan outcome (label) would be determined much later and recorded separately with a different timestamp.
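To make the entity/timestamp requirement concrete, here is a sketch of the kind of Feast definitions the later notebooks build, assuming a recent Feast release; the entity and source names here are illustrative only:
# Illustrative sketch only; the actual definitions are created in the next notebook
from feast import Entity, FileSource

loan_entity = Entity(name="ID", join_keys=["ID"])  # the ID column acts as the join key
loan_source = FileSource(
    path="Feature_Store/data/data_a.parquet",
    timestamp_field="application_timestamp",  # event timestamp used for point-in-time joins
)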
In order to demonstrate Feast capabilities, such as point-in-time joins, we will mock IDs and timestamps for this data. For IDs, we will use the original dataframe index values. For the timestamps, we will generate random values between "Sun Sep 24 12:00:00 2023" and "Mon Oct 9 12:00:00 2023".
# Make index into "ID" column
df = df.reset_index(names=["ID"])
# Add mock timestamps
time_format = "%a %b %d %H:%M:%S %Y"
date = dt.datetime.strptime("Mon Oct 9 12:00:00 2023", time_format)
end = int(date.timestamp())
start = int((date - dt.timedelta(days=15)).timestamp())  # 'Sun Sep 24 12:00:00 2023'
def make_tstamp(epoch_seconds):
    # Convert epoch seconds to a ctime-style string, e.g. 'Sun Sep 24 12:00:00 2023'
    return dt.datetime.fromtimestamp(epoch_seconds).ctime()
# (seed set for reproducibility)
np.random.seed(SEED)
df["application_timestamp"] = pd.to_datetime([
make_tstamp(d) for d in np.random.randint(start, end, len(df))
])
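As an aside, a more direct alternative would be to let pandas parse the epoch seconds itself. Note this is only roughly equivalent: pd.to_datetime treats epoch seconds as UTC, while the ctime round-trip above uses local time. It is shown commented out, since re-running it would redraw the random values:
# Roughly equivalent alternative (not run here):
# df["application_timestamp"] = pd.to_datetime(
#     np.random.randint(start, end, len(df)), unit="s"
# )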
Verify that the newly created "ID" and "application_timestamp" fields were added to the data as expected.
# Check data (first few records, transposed for readability)
df.head(3).T
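We can also confirm that the generated timestamps fall within the intended 15-day window; a quick check using standard pandas:
# Confirm the timestamps span the expected range
df["application_timestamp"].agg(["min", "max"])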
We'll also generate counterpart IDs and timestamps for the label data. In a real-life scenario, the label data would arrive separately, and later than, the loan application data. To mimic this, let's create a labels dataset with an "outcome_timestamp" column lagged 30 to 90 days after the application timestamp.
# Add (lagged) label timestamps (30 to 90 days)
def lag_delta(data, seed):
    # Random lag of 30-89 whole days plus 0-23 hours for each record
    np.random.seed(seed)
    delta_days = np.random.randint(30, 90, len(data))
    delta_hours = np.random.randint(0, 24, len(data))
    return np.array([
        dt.timedelta(days=int(d), hours=int(h))
        for d, h in zip(delta_days, delta_hours)
    ])
labels = df[["ID", "class"]]
labels["outcome_timestamp"] = pd.to_datetime(df.application_timestamp + lag_delta(df, SEED))
# Check labels
labels.head(3)
You can verify that the outcome timestamp has a difference of 30 to 90 days from the "application_timestamp" (above).
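If you'd rather check this programmatically, a small sanity check (not part of the original flow) is:
# Sanity-check the applied lag
lag = labels["outcome_timestamp"] - df["application_timestamp"]
print(lag.min(), lag.max())  # expect values between 30 and 90 days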
Now that we have our data prepared, let's save it to local parquet files in the data directory (parquet is one of the file formats supported by Feast).
One more step we will add is splitting the context data column-wise and saving it in two files. This step is contrived (we don't usually split data when we don't need to), but it will allow us to demonstrate later how Feast can easily join datasets, a common need in data science projects.
# Create the data directory if it doesn't exist
os.makedirs("Feature_Store/data", exist_ok=True)
# Split columns and save context data
a_cols = [
'ID', 'checking_status', 'duration', 'credit_history', 'purpose',
'credit_amount', 'savings_status', 'employment', 'application_timestamp',
'installment_commitment', 'personal_status', 'other_parties',
]
b_cols = [
'ID', 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
'foreign_worker', 'application_timestamp'
]
df[a_cols].to_parquet("Feature_Store/data/data_a.parquet", engine="pyarrow")
df[b_cols].to_parquet("Feature_Store/data/data_b.parquet", engine="pyarrow")
# Save label data
labels.to_parquet("Feature_Store/data/labels.parquet", engine="pyarrow")
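As a final (optional) sanity check, we can read one of the files back to verify that the parquet round-trip preserved the schema:
# Read back a saved file and confirm its schema
pd.read_parquet("Feature_Store/data/data_a.parquet").info()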
We have saved the following files to the Feature_Store/data directory:
- data_a.parquet (training data, "a" columns)
- data_b.parquet (training data, "b" columns)
- labels.parquet (label outcomes)

With the feature data prepared, we are ready to set up and deploy the feature store.
Continue with the 02_Deploying_the_Feature_Store.ipynb notebook.