
<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

Data transformation (collaborative filtering)

In real-world datasets, users may have different types of interactions with items, and the same type of interaction (e.g., clicking an item on a website, viewing a movie, etc.) may appear more than once in a user's history. Because this is a typical problem in practical recommendation system design, this notebook shares data transformation techniques that can be used for different scenarios.

Specifically, the discussion in this notebook applies only to collaborative filtering algorithms.

0 Global settings

python
import sys
import numpy as np
import pandas as pd

print(f"System version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

1 Data creation

Two dummy datasets are created to illustrate the ideas in the notebook.

1.1 Explicit feedback

In the "explicit feedback" scenario, interactions between users and items are numerical/ordinal ratings or binary preferences such as like or dislike. These types of interactions are termed explicit feedback.

The following shows dummy data for the explicit-rating type of feedback. In the data,

  • There are 3 users whose IDs are 1, 2, 3.
  • There are 3 items whose IDs are 1, 2, 3.
  • Items are rated by users only once, so even when users interact with items at different timestamps, the ratings stay the same. This is seen in use cases such as movie recommendation, where users' ratings do not change dramatically over a short period of time.
  • Timestamps of when the ratings are given are also recorded.
python
data1 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
    "Rating": [4, 4, 3, 3, 3, 4, 5, 4, 5, 5, 5, 5, 5, 5, 4],
    "Timestamp": [
        '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
        '2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
        '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
    ]
})
python
data1

1.2 Implicit feedback

Many times there are no explicit ratings or preferences given by users; that is, the interactions are implicit. For example, a user may purchase something on a website, click an item in a mobile app, or order food from a restaurant. This information may reflect users' preferences towards the items in an implicit manner.

A dataset is created below to illustrate the implicit feedback scenario.

In the data,

  • There are 3 users whose IDs are 1, 2, 3.
  • There are 3 items whose IDs are 1, 2, 3.
  • There are no ratings or other explicit feedback given by the users; instead, each interaction has a type. In this dummy dataset, for illustration purposes, there are three types of interaction between users and items, that is, click, add and purchase, meaning "click on the item", "add the item into the cart" and "purchase the item", respectively.
  • Sometimes other contextual or associative information is available for an interaction, e.g., the time spent visiting a site before clicking. For simplicity, only the type of interaction is considered in this notebook.
  • The timestamp of each interaction is also given.
python
data2 = pd.DataFrame({
    "UserId": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "ItemId": [1, 1, 2, 2, 2, 1, 2, 1, 2, 3, 3, 3, 3, 3, 1],
    "Type": [
        'click', 'click', 'click', 'click', 'purchase',
        'click', 'purchase', 'add', 'purchase', 'purchase',
        'click', 'click', 'add', 'purchase', 'click'
    ],
    "Timestamp": [
        '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
        '2000-01-01', '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03',
        '2000-01-01', '2000-01-03', '2000-01-03', '2000-01-03', '2000-01-04'
    ]
})
python
data2

2 Data transformation

Many collaborative filtering algorithms are built on a user-item sparse matrix. This requires that the input data for building the recommender should contain unique user-item pairs.

For explicit feedback datasets, this can be done simply by deduplicating the repeated user-item-rating tuples.

python
data1 = data1.drop_duplicates()
python
data1
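Note that drop_duplicates only works when the repeated tuples are identical. If a user re-rates an item with a different value (a scenario not present in data1, shown here with hypothetical data), one common choice is to keep only the most recent rating per user-item pair:

```python
import pandas as pd

# Hypothetical variant in which user 1 re-rates item 1 on a later date.
ratings = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ItemId": [1, 1, 2],
    "Rating": [4, 5, 3],
    "Timestamp": ["2000-01-01", "2000-01-05", "2000-01-02"],
})

# Sort by timestamp, then keep the last (most recent) rating per pair.
latest = (
    ratings.sort_values("Timestamp")
    .drop_duplicates(subset=["UserId", "ItemId"], keep="last")
    .reset_index(drop=True)
)
```

This is a sketch of one reasonable policy, not the notebook's method; other policies (e.g., averaging the ratings) may fit other use cases.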

In implicit feedback use cases, there are several ways to perform the deduplication, depending on the requirements of the actual business use case.

2.1 Data aggregation

Usually, data is aggregated by user-item pair to generate scores that represent preferences (in some algorithms like SAR, the score is called an affinity score; for simplicity, the scores are hereafter termed affinities).

It is worth mentioning that in this case the affinity scores differ from the ratings in the explicit dataset in terms of value distribution. This is usually framed as an ordinal regression problem, which has been studied in Koren's paper. The algorithm used for training the recommender should therefore be chosen carefully to account for the distribution of the affinity scores rather than discrete integer values.

2.1.1 Count

The simplest technique is to count the interactions between each user and item to produce affinity scores. The following shows the aggregation of counts of user-item interactions in data2, regardless of the interaction type.

python
data2_count = data2.groupby(['UserId', 'ItemId']).agg({'Timestamp': 'count'}).reset_index()
data2_count.columns = ['UserId', 'ItemId', 'Affinity']
python
data2_count
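An equivalent and slightly more idiomatic way to get the same counts is groupby(...).size(), which counts rows per group without needing a reference column such as Timestamp. A minimal sketch on re-created dummy data:

```python
import pandas as pd

# Small stand-in for data2 (only the grouping columns are needed).
data2 = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ItemId": [1, 1, 2],
})

# size() counts all rows per (UserId, ItemId) group.
counts = data2.groupby(["UserId", "ItemId"]).size().reset_index(name="Affinity")
```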

2.1.2 Weighted count

It can be useful to use the types of interactions as weights in the count aggregation. For example, assume the weights of the three different types, "click", "add", and "purchase", are 1, 2, and 3, respectively. A weighted count can then be computed as follows.

python
# Add column of weights
data2_w = data2.copy()

conditions = [
    data2_w['Type'] == 'click',
    data2_w['Type'] == 'add',
    data2_w['Type'] == 'purchase'
]

choices = [1, 2, 3]

# Default of 0 covers any unrecognized interaction type and keeps the column numeric.
data2_w['Weight'] = np.select(conditions, choices, default=0)
python
# Do count with weight.
data2_wcount = data2_w.groupby(['UserId', 'ItemId'])['Weight'].sum().reset_index()
data2_wcount.columns = ['UserId', 'ItemId', 'Affinity']
python
data2_wcount
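A more compact alternative to np.select is to map the interaction types through a weight dictionary with Series.map. This sketch uses re-created dummy data and the same assumed weights (1, 2, 3):

```python
import pandas as pd

# Small stand-in for data2.
data2 = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ItemId": [1, 1, 2],
    "Type": ["click", "purchase", "add"],
})

# Map each interaction type to its weight, then sum per user-item pair.
weights = {"click": 1, "add": 2, "purchase": 3}
data2["Weight"] = data2["Type"].map(weights)

wcount = data2.groupby(["UserId", "ItemId"])["Weight"].sum().reset_index(name="Affinity")
```

With map, adding a new interaction type only requires a new dictionary entry rather than a new condition.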

2.1.3 Time dependent count

In many scenarios, time dependency plays a critical role in preparing a dataset for building a collaborative filtering model that captures the drift of user interests over time. One common technique for achieving a time-dependent count is to add a time-decay factor to the counting. This technique is used in SAR. The formula for the affinity score of each user-item pair is

$$a_{ij}=\sum_k w_k \left(\frac{1}{2}\right)^{\frac{t_0-t_k}{T}} $$

where $a_{ij}$ is the affinity score, $w_k$ is the interaction weight, $t_0$ is a reference time, $t_k$ is the timestamp for the $k$-th interaction, and $T$ is a hyperparameter that controls the speed of decay.

The following shows how SAR applies time decay in aggregating counts for the implicit feedback scenario.

In this case, we use 5 days as the half-life parameter and the latest time in the dataset as the reference time.

python
T = 5

t_ref = pd.to_datetime(data2_w['Timestamp']).max()
python
# Calculate the weighted count with time decay.

data2_w['Timedecay'] = data2_w.apply(
    lambda x: x['Weight'] * np.power(0.5, (t_ref - pd.to_datetime(x['Timestamp'])).days / T), 
    axis=1
)
python
data2_w

Affinity scores of user-item pairs can then be calculated by summing the 'Timedecay' column values.

python
data2_wt = data2_w.groupby(['UserId', 'ItemId'])['Timedecay'].sum().reset_index()
data2_wt.columns = ['UserId', 'ItemId', 'Affinity']
python
data2_wt
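The row-wise apply above can also be written as a vectorized column operation, which is typically faster on larger datasets. A self-contained sketch with re-created dummy data (same half-life T = 5 as above):

```python
import numpy as np
import pandas as pd

T = 5  # half-life in days, as in the notebook

# Small stand-in for data2_w with precomputed weights.
df = pd.DataFrame({
    "UserId": [1, 1],
    "ItemId": [1, 1],
    "Weight": [1, 3],
    "Timestamp": ["2000-01-01", "2000-01-06"],
})

ts = pd.to_datetime(df["Timestamp"])
t_ref = ts.max()

# Vectorized decay: 0.5 ** (days_since / T) computed on the whole column at once.
df["Timedecay"] = df["Weight"] * np.power(0.5, (t_ref - ts).dt.days / T)

affinity = df.groupby(["UserId", "ItemId"])["Timedecay"].sum().reset_index(name="Affinity")
```

Here the older interaction (5 days before the reference, exactly one half-life) contributes half its weight, and the interaction at the reference time contributes its full weight.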

2.2 Negative sampling

The aggregation above is based on the assumption that user-item interactions can be interpreted as preferences through factors like the number of interactions, weights, time decay, etc. Sometimes these assumptions are biased, and only the interactions themselves matter. In that case, the original dataset of implicit interaction records can be binarized into one that contains only 1 or 0, indicating whether or not a user has interacted with an item.

For example, the following generates data that contains existing interactions between users and items.

python
data2_b = data2[['UserId', 'ItemId']].copy()
data2_b['Feedback'] = 1
data2_b = data2_b.drop_duplicates()
python
data2_b

"Negative sampling" is a technique that samples negative feedback. Similar to the aggregation techniques, negative feedback can be defined differently in different scenarios. In this case, for example, we can regard the items that a user has not interacted with as items the user does not like. This may be a strong assumption in many use cases, but it is a reasonable way to build a model when there are not many interactions between users and items.

The following shows that, on top of data2_b, another 2 negative samples are generated, tagged with "0" in the "Feedback" column.

python
users = data2['UserId'].unique()
items = data2['ItemId'].unique()
python
interaction_lst = []
for user in users:
    for item in items:
        interaction_lst.append([user, item, 0])

data_all = pd.DataFrame(data=interaction_lst, columns=["UserId", "ItemId", "FeedbackAll"])
python
data_all
python
data2_ns = pd.merge(data_all, data2_b, on=['UserId', 'ItemId'], how='outer').fillna(0).drop('FeedbackAll', axis=1)
python
data2_ns

Also note that negative sampling can also be combined with the count-based aggregation schemes. That is, the count may start from 0 instead of 1, where 0 means there is no interaction between the user and item.
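A minimal sketch of this combination (with re-created dummy data, not the notebook's variables): build the full user-item grid, join the observed counts onto it, and fill unobserved pairs with 0.

```python
import pandas as pd

# Observed interactions (a small stand-in for data2).
observed = pd.DataFrame({
    "UserId": [1, 1, 2],
    "ItemId": [1, 2, 1],
})

# Count interactions per observed user-item pair.
counts = observed.groupby(["UserId", "ItemId"]).size().reset_index(name="Affinity")

# Full Cartesian grid of all users x all items.
grid = pd.MultiIndex.from_product(
    [observed["UserId"].unique(), observed["ItemId"].unique()],
    names=["UserId", "ItemId"],
).to_frame(index=False)

# Left-join the counts onto the grid; unobserved pairs get a count of 0.
full = grid.merge(counts, on=["UserId", "ItemId"], how="left").fillna({"Affinity": 0})
full["Affinity"] = full["Affinity"].astype(int)
```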

References

  1. X. He et al., Neural Collaborative Filtering, WWW 2017.
  2. Y. Hu et al., Collaborative filtering for implicit feedback datasets, ICDM 2008.
  3. Simple Algorithm for Recommendation (SAR). See notebook sar_deep_dive.ipynb.
  4. Y. Koren and J. Sill, OrdRec: an ordinal model for predicting personalized item rating distributions, RecSys 2011.