examples/01_prepare_data/data_transform.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
In real-world datasets, users may have different types of interactions with items. In addition, the same type of interaction (e.g., clicking an item on a website, viewing a movie, etc.) may appear more than once in a user's history. Given that this is a typical problem in practical recommendation system design, this notebook shares data transformation techniques that can be used for different scenarios.
Specifically, the discussion in this notebook applies only to collaborative filtering algorithms.
import sys
import numpy as np
import pandas as pd
print(f"System version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
Two dummy datasets are created to illustrate the ideas in the notebook.
In the "explicit feedback" scenario, interactions between users and items are numerical/ordinal ratings or binary preferences such as like or dislike. These types of interactions are termed explicit feedback.
The following creates dummy data for the explicit-rating type of feedback.
data1 = pd.DataFrame({
"UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
"ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
"Rating": [4, 4, 3, 3, 3, 4, 5, 4, 5, 5, 5, 5, 5, 5, 4],
"Timestamp": [
'2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
'2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
'2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
]
})
data1
Often there are no explicit ratings or preferences given by users; that is, the interactions are implicit. For example, a user may purchase something on a website, click an item on a mobile app, or order food from a restaurant. This information may reflect users' preferences towards the items in an implicit manner.
A dataset is created as follows to illustrate the implicit feedback scenario.
data2 = pd.DataFrame({
"UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
"ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
"Type": [
'click', 'click', 'click', 'click', 'purchase',
'click', 'purchase', 'add', 'purchase', 'purchase',
'click', 'click', 'add', 'purchase', 'click'
],
"Timestamp": [
'2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
'2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
'2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
]
})
data2
Many collaborative filtering algorithms are built on a user-item sparse matrix. This requires that the input data for building the recommender should contain unique user-item pairs.
For explicit feedback datasets, this can be done simply by deduplicating the repeated user-item-rating records.
data1 = data1.drop_duplicates()
data1
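As a sketch of why unique pairs matter, the deduplicated records can be pivoted into the user-item matrix that collaborative filtering algorithms operate on. Note that a full-row `drop_duplicates()` still keeps pairs whose timestamps differ, so this sketch first keeps only the most recent rating per pair (the variable names `data1_unique` and `rating_matrix` are illustrative):

```python
import pandas as pd

# The explicit-feedback dummy data from above.
data1 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
    "Rating": [4, 4, 3, 3, 3, 4, 5, 4, 5, 5, 5, 5, 5, 5, 4],
    "Timestamp": [
        '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
        '2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
        '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
    ]
})

# Keep only the most recent rating for each user-item pair.
data1_unique = (
    data1.sort_values('Timestamp')
         .drop_duplicates(subset=['UserId', 'ItemId'], keep='last')
)

# Pivot into the user-item matrix; NaN marks unobserved pairs.
rating_matrix = data1_unique.pivot(index='UserId', columns='ItemId', values='Rating')
print(rating_matrix)
```

The resulting matrix is sparse in general: each user has rated only a subset of the items, and the NaN entries are what a recommender tries to fill in.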
In implicit feedback use cases, there are several ways to perform the deduplication, depending on the requirements of the actual business use case.
Usually, the data is aggregated by user-item pair to generate scores that represent preferences (in some algorithms, such as SAR, the score is called an affinity score; for simplicity, the scores are hereafter termed affinity).
It is worth mentioning that in this case the affinity scores differ from the ratings in the explicit dataset in terms of value distribution. This is usually treated as an ordinal regression problem, which has been studied in Koren's paper. The algorithm used for training a recommender should therefore be chosen carefully to account for the distribution of the affinity scores rather than discrete integer values.
The simplest technique is to count the interactions between each user and item to produce affinity scores. The following aggregates counts of user-item interactions in data2 regardless of the interaction type.
data2_count = data2.groupby(['UserId', 'ItemId']).agg({'Timestamp': 'count'}).reset_index()
data2_count.columns = ['UserId', 'ItemId', 'Affinity']
data2_count
It can be useful to treat the different interaction types as weights in the count aggregation. For example, assume the weights of the three different types, "click", "add", and "purchase", are 1, 2, and 3, respectively. A weighted count can then be computed as follows.
# Add column of weights
data2_w = data2.copy()
conditions = [
data2_w['Type'] == 'click',
data2_w['Type'] == 'add',
data2_w['Type'] == 'purchase'
]
choices = [1, 2, 3]
# Use a numeric default so the column stays numeric (a string default
# such as 'black' would coerce the whole column to strings).
data2_w['Weight'] = np.select(conditions, choices, default=0)
# Do count with weight.
data2_wcount = data2_w.groupby(['UserId', 'ItemId'])['Weight'].sum().reset_index()
data2_wcount.columns = ['UserId', 'ItemId', 'Affinity']
data2_wcount
In many scenarios, time dependency plays a critical role in preparing a dataset for building a collaborative filtering model that captures drift in user interests over time. A common technique for achieving a time-dependent count is to add a time decay factor to the counting. This technique is used in SAR. The formula for the affinity score of each user-item pair is
$$a_{ij}=\sum_k w_k \left(\frac{1}{2}\right)^{\frac{t_0-t_k}{T}} $$
where $a_{ij}$ is the affinity score, $w_k$ is the interaction weight, $t_0$ is a reference time, $t_k$ is the timestamp for the $k$-th interaction, and $T$ is a hyperparameter that controls the speed of decay.
The following shows how SAR applies time decay in aggregating counts for the implicit feedback scenario.
In this case, we use 5 days as the half-life parameter $T$, and the latest time in the dataset as the reference time $t_0$.
T = 5
t_ref = pd.to_datetime(data2_w['Timestamp']).max()
# Calculate the weighted count with time decay.
data2_w['Timedecay'] = data2_w['Weight'] * np.power(
    0.5, (t_ref - pd.to_datetime(data2_w['Timestamp'])).dt.days / T
)
data2_w
Affinity scores of user-item pairs can then be calculated by summing the 'Timedecay' column values.
data2_wt = data2_w.groupby(['UserId', 'ItemId'])['Timedecay'].sum().reset_index()
data2_wt.columns = ['UserId', 'ItemId', 'Affinity']
data2_wt
The above aggregations rest on the assumption that user-item interactions can be interpreted as preferences through factors like the number of interactions, weights, time decay, etc. Sometimes these assumptions are biased, and only the existence of an interaction matters. In that case, the original dataset with implicit interaction records can be binarized into one that contains only 1 or 0, indicating whether or not a user has interacted with an item.
For example, the following generates data that contains existing interactions between users and items.
data2_b = data2[['UserId', 'ItemId']].copy()
data2_b['Feedback'] = 1
data2_b = data2_b.drop_duplicates()
data2_b
"Negative sampling" is a technique that samples negative feedback. Similar to the aggregation techniques, negative feedback can be defined differently in different scenarios. In this case, for example, we can regard the items that a user has not interacted with as items that the user does not like. This may be a strong assumption in many use cases, but it is reasonable for building a model when user-item interactions are sparse.
The following shows that, on top of data2_b, another 2 negative samples are generated and tagged with "0" in the "Feedback" column.
users = data2['UserId'].unique()
items = data2['ItemId'].unique()
interaction_lst = []
for user in users:
for item in items:
interaction_lst.append([user, item, 0])
data_all = pd.DataFrame(data=interaction_lst, columns=["UserId", "ItemId", "FeedbackAll"])
data_all
data2_ns = pd.merge(data_all, data2_b, on=['UserId', 'ItemId'], how='outer').fillna(0).drop('FeedbackAll', axis=1)
data2_ns
Also note that negative sampling may affect the count-based aggregation scheme as well. That is, the count then starts from 0 instead of 1, where 0 means there is no interaction between the user and the item.
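This variant can be sketched by combining the count aggregation above with the full user-item grid, so that pairs without interactions receive an explicit count of 0 (the names `all_pairs` and `data2_count_ns` are illustrative):

```python
import pandas as pd

# The implicit-feedback dummy data from above
# (only the columns needed for counting).
data2 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
})

# Count interactions per observed user-item pair.
counts = data2.groupby(['UserId', 'ItemId']).size().reset_index(name='Affinity')

# Build the full grid of all user-item combinations.
all_pairs = pd.MultiIndex.from_product(
    [data2['UserId'].unique(), data2['ItemId'].unique()],
    names=['UserId', 'ItemId']
).to_frame(index=False)

# Left-join so pairs without interactions get an affinity of 0.
data2_count_ns = all_pairs.merge(counts, on=['UserId', 'ItemId'], how='left')
data2_count_ns['Affinity'] = data2_count_ns['Affinity'].fillna(0).astype(int)
print(data2_count_ns)
```

With 3 users and 3 items the result has 9 rows; the two pairs that never appear in data2 carry an affinity of 0 rather than being absent from the table.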