examples/01_prepare_data/data_split.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
Data splitting is one of the most vital tasks in evaluating recommendation systems. The splitting strategy strongly shapes the evaluation protocol, so practitioners should always choose it with care.
The code below shows how to apply different splitting strategies for specific scenarios.
import sys
import pyspark
import pandas as pd
from datetime import datetime, timedelta
from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.download_utils import maybe_download
from recommenders.datasets.python_splitters import (
python_random_split,
python_chrono_split,
python_stratified_split
)
from recommenders.datasets.spark_splitters import spark_random_split
print(f"System version: {sys.version}")
print(f"Pyspark version: {pyspark.__version__}")
DATA_URL = "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"
DATA_PATH = "ml-100k.data"
COL_USER = "UserId"
COL_ITEM = "MovieId"
COL_RATING = "Rating"
COL_PREDICTION = "Rating"
COL_TIMESTAMP = "Timestamp"
For illustration purposes, the examples below use the MovieLens-100K dataset.
filepath = maybe_download(DATA_URL, DATA_PATH)
data = pd.read_csv(filepath, sep="\t", names=[COL_USER, COL_ITEM, COL_RATING, COL_TIMESTAMP])
A glimpse at the data
data.head()
A little more...
data.describe()
And, more...
print(
"Total number of ratings are\t{}".format(data.shape[0]),
"Total number of users are\t{}".format(data[COL_USER].nunique()),
"Total number of items are\t{}".format(data[COL_ITEM].nunique()),
sep="\n"
)
Original timestamps are converted to ISO format.
data[COL_TIMESTAMP] = pd.to_datetime(data[COL_TIMESTAMP], unit="s").dt.strftime("%Y-%m-%d %H:%M:%S")
data.head()
An experimentation protocol is usually set up to support a sound evaluation of a specific recommendation scenario. The splitters below cover the most common cases.
Random split simply takes in a data set and outputs the splits of the data, given the split ratios.
data_train, data_test = python_random_split(data, ratio=0.7)
data_train.shape[0], data_test.shape[0]
Sometimes a multi-split is needed.
data_train, data_validate, data_test = python_random_split(data, ratio=[0.6, 0.2, 0.2])
data_train.shape[0], data_validate.shape[0], data_test.shape[0]
Ratios can be integers as well.
data_train, data_validate, data_test = python_random_split(data, ratio=[3, 1, 1])
Integer ratios are normalized internally, so this produces the same splits as the [0.6, 0.2, 0.2] example above.
data_train.shape[0], data_validate.shape[0], data_test.shape[0]
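The normalization behind integer ratios is simple arithmetic; each entry is divided by the total, which is essentially what the splitter does internally, so `[3, 1, 1]` reduces to the fractions used earlier:

```python
ratio = [3, 1, 1]
normalized = [r / sum(ratio) for r in ratio]
print(normalized)  # [0.6, 0.2, 0.2]
```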
The chronological splitting method takes in a dataset and splits it on timestamp.
A chronological split can be done by "user" or by "item". For example, if it is by "user" and the splitting ratio is 0.7, the earliest 70% of each user's ratings go into one split and the remaining 30% into the other. It is worth noting that a chronological split is not random, because the assignment depends on the timestamps.
data_train, data_test = python_chrono_split(
data, ratio=0.7, filter_by="user",
col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)
Take a look at the results for one particular user:
data_train[data_train[COL_USER] == 1].tail(10)
data_test[data_test[COL_USER] == 1].head(10)
All timestamps in the training data precede those in the test data.
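This property can be verified directly. The sketch below uses a small hypothetical frame already split per user (toy values, plain pandas, with the same column names as above) and checks that each user's latest training timestamp does not exceed their earliest test timestamp:

```python
import pandas as pd

# Toy interactions already split roughly 70/30 per user by time (illustrative values).
train = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2],
    "Timestamp": ["1998-01-01", "1998-01-02", "1998-01-03", "1998-02-01", "1998-02-02"],
})
test = pd.DataFrame({
    "UserId": [1, 1, 2],
    "Timestamp": ["1998-01-04", "1998-01-05", "1998-02-03"],
})

# Per user, the latest training interaction must not be later than the
# earliest test interaction (ISO strings compare chronologically).
last_train = train.groupby("UserId")["Timestamp"].max()
first_test = test.groupby("UserId")["Timestamp"].min()
chrono_ok = bool((last_train <= first_test).all())
print(chrono_ok)  # True
```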
A minimum-rating filter can be applied to the data before it is split with the chronological splitter. The reason is that, especially for a multi-way split, each user (or item) needs a sufficient number of ratings for the split to be meaningful.
For example, the following applies the split only to users that have at least 10 ratings.
data_train, data_test = python_chrono_split(
data, filter_by="user", min_rating=10, ratio=0.7,
col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)
The row counts of the resulting splits may not sum to that of the original data, because users with fewer than 10 ratings are filtered out before splitting.
data_train.shape[0] + data_test.shape[0], data.shape[0]
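The gap can be accounted for directly by counting the rows that belong to users below the threshold. A minimal sketch with toy data (plain pandas, hypothetical counts):

```python
import pandas as pd

# User 1 has 12 ratings, user 2 only 3 (below a min-rating threshold of 10).
ratings = pd.DataFrame({
    "UserId": [1] * 12 + [2] * 3,
    "Rating": [4] * 15,
})

min_rating = 10
# Per-row count of how many ratings that row's user has in total.
counts = ratings["UserId"].map(ratings["UserId"].value_counts())
dropped = int((counts < min_rating).sum())
print(dropped)  # 3 rows would be removed by the filter
```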
The stratified splitting method takes in a dataset and splits it by either user or item. The split is stratified so that the same set of users (or items) appears in both the training and test sets.
As with the chronological splitter, filter_by and min_rating also apply to the stratified splitter.
The following example shows the split of the sample data with a ratio of 0.7, and for each user there should be at least 10 ratings.
data_train, data_test = python_stratified_split(
data, filter_by="user", min_rating=10, ratio=0.7,
col_user=COL_USER, col_item=COL_ITEM
)
data_train.shape[0] + data_test.shape[0], data.shape[0]
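Conceptually, a stratified split amounts to sampling a fixed fraction of each user's rows. The sketch below is not the library implementation, just a minimal pandas equivalent on toy data, and it confirms the key property that the same user set appears on both sides:

```python
import pandas as pd

ratings = pd.DataFrame({
    "UserId":  [1, 1, 1, 1, 2, 2, 2, 2],
    "MovieId": [10, 11, 12, 13, 10, 11, 12, 13],
    "Rating":  [4, 3, 5, 2, 1, 5, 4, 3],
})

# Sample 75% of each user's ratings for training; the rest form the test set.
train = ratings.groupby("UserId", group_keys=False).sample(frac=0.75, random_state=42)
test = ratings.drop(train.index)

# Stratification keeps the same set of users in both splits.
same_users = set(train["UserId"]) == set(test["UserId"])
print(same_users)  # True
```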
A Spark DataFrame is used for scalable splitting. This allows the splitting operation to be performed on a large dataset distributed across a Spark cluster.
For example, the following illustrates a random split on a Spark DataFrame. For simplicity, the same MovieLens data, which is in a Pandas DataFrame, is loaded into a Spark DataFrame and used for splitting.
spark = start_or_get_spark()
data_spark = spark.read.csv(filepath, sep="\t")
data_spark_train, data_spark_test = spark_random_split(data_spark, ratio=0.7)
Note that Spark's random split is not guaranteed to be deterministic. This can be an issue when the data is relatively small and a precise, reproducible split is needed.
data_spark_train.count(), data_spark_test.count()
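When exact reproducibility matters, the single-node pandas splitters with a fixed seed are the usual workaround. The toy check below (plain pandas, independent of the data above) shows two seeded draws selecting identical rows; Spark's `randomSplit`, by contrast, can vary across runs because row assignment also depends on how the data is partitioned:

```python
import pandas as pd

frame = pd.DataFrame({"value": range(100)})

# Two 70% samples drawn with the same seed select exactly the same rows.
first = frame.sample(frac=0.7, random_state=42)
second = frame.sample(frac=0.7, random_state=42)
deterministic = first.index.equals(second.index)
print(deterministic)  # True
```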
spark.stop()