examples/03_evaluate/evaluation.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
Offline evaluation is pivotal for assessing the quality of a recommender before it goes into production. Evaluation metrics are usually chosen based on the actual application scenario of the recommendation system. It is therefore important for data scientists and AI developers who build recommendation systems to understand how each evaluation metric is calculated and what it is for.
This notebook takes a deep dive into several commonly used evaluation metrics and illustrates how they are used in practice. Only offline evaluation metrics are covered.
Most of the functions used in the notebook can be found in the recommenders directory.
import sys
import pandas as pd
import pyspark
import sklearn
from sklearn.preprocessing import minmax_scale
from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.evaluation.spark_evaluation import SparkRankingEvaluation, SparkRatingEvaluation
from recommenders.evaluation.python_evaluation import auc, logloss
print(f"System version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"PySpark version: {pyspark.__version__}")
print(f"Scikit Learn version: {sklearn.__version__}")
Note that to run Spark code successfully with the Jupyter kernel, the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON must point to Python executables of the desired version. Detailed information can be found in the setup instruction document SETUP.md.
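One common way to satisfy this requirement from inside a notebook is sketched below. It assumes that the interpreter running the notebook is also the one Spark should use (a reasonable guess for local mode, but an assumption nonetheless) and that the variables are set before any Spark session is started.

```python
import os
import sys

# Assumption: the interpreter running this notebook is the one Spark
# should use for both the driver and the workers (local mode).
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(f"PYSPARK_PYTHON = {os.environ['PYSPARK_PYTHON']}")
```

On a cluster the worker interpreter path usually differs from the driver's, so SETUP.md remains the reference for anything beyond a local run.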
COL_USER = "UserId"
COL_ITEM = "MovieId"
COL_RATING = "Rating"
COL_PREDICTION = "Rating"
HEADER = {
    "col_user": COL_USER,
    "col_item": COL_ITEM,
    "col_rating": COL_RATING,
    "col_prediction": COL_PREDICTION,
}
For illustration purposes, a dummy dataset is created to demonstrate how different evaluation metrics work.
The data has a schema that is frequently found in recommendation problems: each row in the dataset is a (user, item, rating) tuple, where "rating" can be an ordinal rating score (e.g., discrete integers of 1, 2, 3, etc.) or a numerical float that quantitatively indicates the preference of the user towards that item.
For simplicity, the rating column in the dummy dataset used in this example represents ordinal ratings.
df_true = pd.DataFrame(
    {
        COL_USER: [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        COL_ITEM: [1, 2, 3, 1, 4, 5, 6, 7, 2, 5, 6, 8, 9, 10, 11, 12, 13, 14],
        COL_RATING: [5, 4, 3, 5, 5, 3, 3, 1, 5, 5, 5, 4, 4, 3, 3, 3, 2, 1],
    }
)
df_pred = pd.DataFrame(
    {
        COL_USER: [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        COL_ITEM: [3, 10, 12, 10, 3, 5, 11, 13, 4, 10, 7, 13, 1, 3, 5, 2, 11, 14],
        COL_PREDICTION: [14, 13, 12, 14, 13, 12, 11, 10, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5],
    }
)
Take a look at ratings of the user with ID "1" in the dummy dataset.
df_true[df_true[COL_USER] == 1]
df_pred[df_pred[COL_USER] == 1]
The Spark framework is sometimes used to compute metrics on datasets that are too large to fit into memory. In our example, Spark DataFrames can be created from the pandas dummy dataset.
spark = start_or_get_spark("EvaluationTesting", "local")
dfs_true = spark.createDataFrame(df_true)
dfs_pred = spark.createDataFrame(df_pred)
dfs_true.filter(dfs_true[COL_USER] == 1).show()
dfs_pred.filter(dfs_pred[COL_USER] == 1).show()
Rating metrics are similar to the regression metrics used for evaluating a regression model that predicts numerical values given input observations. In the context of a recommendation system, rating metrics evaluate how accurately a recommender predicts the ratings that users may give to items. Therefore, the metrics are calculated on exactly the (user, item) pairs that exist in both the ground-truth dataset and the prediction dataset, and averaged over those pairs.
Rating metrics are effective in measuring the model accuracy. However, in some cases, the rating metrics are limited if
A few notes about the interface of the Rating evaluator class:
In our examples below, to calculate rating metrics for input data frames in Spark, a Spark object, SparkRatingEvaluation is initialized. The input data schemas for the ground-truth dataset and the prediction dataset are
| Column | Data type | Description |
|---|---|---|
| COL_USER | <int> | User ID |
| COL_ITEM | <int> | Item ID |
| COL_RATING | <float> | Rating or numerical value of user preference. |

| Column | Data type | Description |
|---|---|---|
| COL_USER | <int> | User ID |
| COL_ITEM | <int> | Item ID |
| COL_RATING | <float> | Predicted rating or numerical value of user preference. |
spark_rate_eval = SparkRatingEvaluation(dfs_true, dfs_pred, **HEADER)
RMSE is for evaluating the accuracy of prediction on ratings. RMSE is the most widely used metric to evaluate a recommendation algorithm that predicts missing ratings. The benefit is that RMSE is easy to explain and calculate.
print(f"The RMSE is {spark_rate_eval.rmse()}")
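As a rough cross-check of what RMSE measures, the sketch below recomputes it with pandas on a tiny hand-made dataset. The column names (`user`, `item`, `rating`, `prediction`) and all values are hypothetical and are not the notebook's data; the point is only that pairs present in both frames are the ones that contribute.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth and predictions (not the notebook's data).
truth = pd.DataFrame(
    {"user": [1, 1, 2], "item": [10, 11, 10], "rating": [5.0, 3.0, 4.0]}
)
preds = pd.DataFrame(
    {"user": [1, 2, 2], "item": [10, 10, 12], "prediction": [4.0, 2.0, 5.0]}
)

# Keep only the (user, item) pairs that appear in both frames.
joined = truth.merge(preds, on=["user", "item"], how="inner")

# Square root of the mean squared difference between rating and prediction.
errors = joined["rating"] - joined["prediction"]
rmse = float(np.sqrt((errors ** 2).mean()))
print(f"Pairs evaluated: {len(joined)}, RMSE: {rmse:.4f}")
```

Here only (1, 10) and (2, 10) survive the join, so the unmatched rows have no influence on the score.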
R2 is also called the "coefficient of determination" in some contexts. It evaluates how well a regression model performs, measured as the proportion of the total variation in the observed results that the model explains.
print(f"The R2 is {spark_rate_eval.rsquared()}")
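The usual textbook definition of R2 can be written in a few lines of NumPy. This is a minimal sketch on hypothetical values, not necessarily the exact computation performed inside `SparkRatingEvaluation`.

```python
import numpy as np

# Hypothetical observed ratings and predictions.
y_true = np.array([5.0, 4.0, 3.0, 1.0])
y_pred = np.array([4.5, 4.0, 2.0, 2.0])

# Residual sum of squares: variation the model fails to explain.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: variation of the observations around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(f"R2: {r2:.4f}")
```

A model that always predicts the mean of the observations scores 0, and a model worse than that scores below 0, which is why the range has no lower bound.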
MAE evaluates the accuracy of prediction. It is the average absolute difference between the ground truth and the prediction, expressed on the same scale as the ratings. Compared to RMSE, MAE is easier to interpret.
print(f"The MAE is {spark_rate_eval.mae()}")
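To make the contrast with RMSE concrete, the sketch below compares the two metrics on two hypothetical error vectors that have the same total absolute error but distribute it differently.

```python
import numpy as np

# Hypothetical prediction errors (rating minus prediction) for two models.
errors_even = np.array([2.0, 2.0, 2.0, 2.0])   # every prediction is off by 2
errors_spiky = np.array([0.0, 0.0, 0.0, 8.0])  # one prediction is off by 8

def mae(errors):
    """Mean absolute error."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error."""
    return float(np.sqrt(np.mean(errors ** 2)))

for name, err in [("even", errors_even), ("spiky", errors_spiky)]:
    print(f"{name}: MAE={mae(err):.2f}, RMSE={rmse(err):.2f}")
```

Both vectors give an MAE of 2, which reads directly as "off by two rating points on average", whereas RMSE doubles for the second vector because squaring emphasizes the single large error.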
Explained variance measures the proportion of the variance in the observed ratings that is accounted for by the model's predictions.
print(f"The explained variance is {spark_rate_eval.exp_var()}")
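The sketch below uses the common definition, one minus the variance of the residuals divided by the variance of the observations, on hypothetical values. Whether `exp_var()` follows exactly this formula is an assumption here. The example also shows how the metric differs from R2 when predictions carry a constant bias.

```python
import numpy as np

# Hypothetical observations; every prediction is exactly 1 too low.
y_true = np.array([5.0, 4.0, 3.0, 1.0])
y_pred = y_true - 1.0

residuals = y_true - y_pred

# Explained variance ignores a constant offset in the residuals.
explained_variance = 1.0 - np.var(residuals) / np.var(y_true)

# R2 penalizes the same offset through the residual sum of squares.
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"Explained variance: {explained_variance:.4f}, R2: {r2:.4f}")
```

Under this definition a systematically biased model can still reach an explained variance of 1, so the metric is best read alongside RMSE or MAE.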
| Metric | Range | Selection criteria | Limitation | Reference |
|---|---|---|---|---|
| RMSE | $\geq 0$ | The smaller the better. | May be biased, and less explainable than MAE. | link |
| R2 | $\leq 1$ | The closer to $1$ the better. | Depends on variable distributions. | link |
| MAE | $\geq 0$ | The smaller the better. | Depends on variable scale. | link |
| Explained variance | $\leq 1$ | The closer to $1$ the better. | Depends on variable distributions. | link |
"Beyond-accuracy evaluation" was proposed to evaluate how relevant recommendations are for users. In this case, a recommendation system is treated as a ranking system. Given a definition of relevancy, the recommendation system outputs a list of recommended items for each user, ordered by relevance. The evaluation takes as inputs the ground-truth data (the actual items that users interacted with, e.g., liked or purchased) and the recommendation data, and calculates ranking evaluation metrics.
Ranking metrics are often used when hit and/or ranking of the items are considered:
A few notes about the interface of the Ranking evaluator class:
Relevancy of recommendation can be measured in different ways:
By ranking - In this case, relevant items in the recommendations are defined as the top ranked items, i.e., top k items, which are taken from the list of the recommended items that is ordered by the predicted ratings (or other numerical scores that indicate preference of a user to an item).
By timestamp - Relevant items are defined as the most recently viewed k items, which are obtained from the recommended items ranked by timestamps.
By rating - Relevant items are defined as items with ratings (or other numerical scores that indicate preference of a user to an item) that are above a given threshold.
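Two of these relevancy definitions, by ranking and by rating, can be sketched with plain pandas. The frame, its column names, `k` and `threshold` below are all hypothetical; this illustrates the idea and is not the library's implementation.

```python
import pandas as pd

# Hypothetical recommendation scores for two users.
recs = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 2],
        "item": [10, 11, 12, 10, 13, 14],
        "score": [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
    }
)
k = 2
threshold = 0.75

# By ranking: keep the k highest-scored items of each user.
top_k = (
    recs.sort_values(["user", "score"], ascending=[True, False])
    .groupby("user")
    .head(k)
)

# By rating: keep every item whose score exceeds the threshold.
above_threshold = recs[recs["score"] > threshold]

print(top_k["item"].tolist())
print(above_threshold["item"].tolist())
```

The ranking rule always returns up to k items per user, while the threshold rule returns a variable number, which changes the denominator of metrics such as precision.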
Similarly, a ranking metric object can be initialized as below. The input data schema is
| Column | Data type | Description |
|---|---|---|
| COL_USER | <int> | User ID |
| COL_ITEM | <int> | Item ID |
| COL_RATING | <float> | Rating or numerical value of user preference. |
| COL_TIMESTAMP | <string> | Timestamps. |

| Column | Data type | Description |
|---|---|---|
| COL_USER | <int> | User ID |
| COL_ITEM | <int> | Item ID |
| COL_RATING | <float> | Predicted rating or numerical value of user preference. |
| COL_TIMESTAMP | <string> | Timestamps. |
In this case, in addition to the input datasets, there are also other arguments used for calculating the ranking metrics:
| Argument | Data type | Description |
|---|---|---|
| k | <int> | Number of items recommended to each user. |
| relevancy_method | <string> | Method that extracts relevant items from the recommendation list. |
For example, the following code initializes a ranking metric object that calculates the metrics.
spark_rank_eval = SparkRankingEvaluation(dfs_true, dfs_pred, k=3, relevancy_method="top_k", **HEADER)
A few ranking metrics can then be calculated.
Precision@k is a metric that evaluates how many items in the recommendation list are relevant (hit) in the ground-truth data. For each user the precision score is normalized by k and then the overall precision scores are averaged by the total number of users.
Note that the value of precision@k depends on the choice of k, so scores are only comparable when they are computed with the same k.
print(f"The precision at k is {spark_rank_eval.precision_at_k()}")
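The definition above can be reproduced by hand. The sketch below transcribes, for each user in the dummy data, the three highest-scored predicted items and the set of items in the ground truth, and assumes that every ground-truth item counts as relevant. The helper `precision_at_k` is written for this illustration and is not part of the library.

```python
# Top-3 predicted items per user, ordered by predicted score
# (transcribed from the dummy prediction data).
recommended = {1: [3, 10, 12], 2: [10, 3, 5], 3: [4, 10, 7]}
# Items each user has in the dummy ground-truth data.
relevant = {
    1: {1, 2, 3},
    2: {1, 4, 5, 6, 7},
    3: {2, 5, 6, 8, 9, 10, 11, 12, 13, 14},
}
k = 3

def precision_at_k(recs, truth, k):
    """Fraction of the top-k recommended items found in the ground truth."""
    hits = sum(1 for item in recs[:k] if item in truth)
    return hits / k

per_user_precision = {
    user: precision_at_k(recommended[user], relevant[user], k)
    for user in recommended
}
mean_precision = sum(per_user_precision.values()) / len(per_user_precision)
print(per_user_precision, mean_precision)
```

Each user has exactly one hit among three recommendations, so every per-user score, and therefore the average, is one third.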
Recall@k is a metric that evaluates how many relevant items in the ground-truth data are in the recommendation list. For each user the recall score is normalized by the total number of ground-truth items and then the overall recall scores are averaged by the total number of users.
print(f"The recall at k is {spark_rank_eval.recall_at_k()}")
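Recall uses the same hits as precision but a different denominator. The sketch below works on the same hand-transcribed lists from the dummy data, again assuming that all ground-truth items are relevant; `recall_at_k` is a helper written for this illustration.

```python
# Top-3 predicted items per user and ground-truth item sets
# (transcribed from the dummy data).
recommended = {1: [3, 10, 12], 2: [10, 3, 5], 3: [4, 10, 7]}
relevant = {
    1: {1, 2, 3},
    2: {1, 4, 5, 6, 7},
    3: {2, 5, 6, 8, 9, 10, 11, 12, 13, 14},
}
k = 3

def recall_at_k(recs, truth, k):
    """Fraction of the ground-truth items that appear in the top-k list."""
    hits = sum(1 for item in recs[:k] if item in truth)
    return hits / len(truth)

per_user_recall = {
    user: recall_at_k(recommended[user], relevant[user], k)
    for user in recommended
}
mean_recall = sum(per_user_recall.values()) / len(per_user_recall)
print(per_user_recall, mean_recall)
```

Every user still has a single hit, but users with longer ground-truth lists score lower, so heavy users pull the average down.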
NDCG is a metric that evaluates how well the recommender performs in recommending ranked items to users. Therefore both hit of relevant items and correctness in ranking of these items matter to the NDCG evaluation. The total NDCG score is normalized by the total number of users.
print(f"The NDCG at k is {spark_rank_eval.ndcg_at_k()}")
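The effect of position can be seen in a minimal NDCG sketch with binary relevance, where a hit at rank r contributes 1 / log2(r + 1). The two recommendation lists below are hypothetical reorderings of the same items, and `ndcg_at_k` is written for this illustration and may differ in detail from the Spark implementation.

```python
import math

def ndcg_at_k(recs, truth, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gives log2(2)
        for rank, item in enumerate(recs[:k])
        if item in truth
    )
    # Ideal ordering: as many relevant items as possible at the top.
    ideal_hits = min(k, len(truth))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

truth = {1, 2, 3}
hit_first = ndcg_at_k([3, 10, 12], truth, k=3)  # relevant item at position 1
hit_last = ndcg_at_k([10, 12, 3], truth, k=3)   # same item at position 3
print(f"hit first: {hit_first:.4f}, hit last: {hit_last:.4f}")
```

Both lists contain the same single hit, so precision and recall are identical, yet NDCG halves when the hit moves from the first to the third position.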
MAP is a metric that evaluates the average precision for each user in the dataset. It also penalizes incorrect ranking of the recommended items. The overall MAP score is normalized by the total number of users.
print(f"The MAP at k is {spark_rank_eval.map_at_k()}")
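Average precision for one user sums the precision measured at each rank where a hit occurs. The sketch below divides that sum by min(k, number of relevant items), which is an assumption: some implementations divide by the total number of relevant items instead, and the values differ accordingly. The data and the helper name are hypothetical.

```python
def average_precision_at_k(recs, truth, k):
    """Sum of precision values at each hit, normalized by min(k, |truth|)."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recs[:k], start=1):
        if item in truth:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    denominator = min(k, len(truth))
    return precision_sum / denominator if denominator else 0.0

truth = {1, 2, 3}
ap_early = average_precision_at_k([1, 9, 2], truth, k=3)  # hits at ranks 1 and 3
ap_late = average_precision_at_k([9, 1, 2], truth, k=3)   # hits at ranks 2 and 3

# MAP is the mean of the per-user average precision values.
map_score = (ap_early + ap_late) / 2
print(f"AP early: {ap_early:.4f}, AP late: {ap_late:.4f}, MAP: {map_score:.4f}")
```

Both lists contain two hits, but the list that places a hit first scores higher, which is how MAP rewards correct ordering.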
ROC, as well as AUC, is a well-known metric for evaluating binary classification problems. It applies similarly to recommendation algorithms with binary ratings, where the "hit" accuracy on the relevant items is used for measuring the recommender's performance.
To demonstrate the evaluation method, the original test data is manipulated so that the ratings in the ground truth become binary scores, while the ones in the prediction are scaled to the range 0 to 1.
# Convert the original rating to 0 and 1.
df_true_bin = df_true.copy()
df_true_bin[COL_RATING] = df_true_bin[COL_RATING].apply(lambda x: 1 if x > 3 else 0)
df_true_bin
# Convert the predicted ratings into a [0, 1] scale.
df_pred_bin = df_pred.copy()
df_pred_bin[COL_PREDICTION] = minmax_scale(df_pred_bin[COL_PREDICTION].astype(float))
df_pred_bin
# Calculate the AUC metric
auc_score = auc(
    df_true_bin,
    df_pred_bin,
    col_user=COL_USER,
    col_item=COL_ITEM,
    col_rating=COL_RATING,
    col_prediction=COL_PREDICTION,
)
print(f"The auc score is {auc_score}")
It is worth mentioning that some literature describes variants of the original AUC metric that consider, for example, the number of recommended items (k) or the grouping of users (computing AUC for each user group and averaging across groups). These variants suit different scenarios, and choosing an appropriate one depends on the context of the use case.
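For intuition, the plain (non-variant) AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all pairs on hypothetical labels and scores; `pairwise_auc` is a helper written for this illustration, not a library function.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical binary labels and predicted scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
auc_value = pairwise_auc(labels, scores)
print(f"AUC: {auc_value}")
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75. The quadratic loop is only meant for small illustrations.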
Logistic loss (sometimes simply called logloss, or cross-entropy loss) is another useful metric for evaluating the hit accuracy. It is defined as the negative log-likelihood of the true labels given the predictions of a classifier.
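Written out, the negative log-likelihood for binary labels is the mean of -[y log(p) + (1 - y) log(1 - p)]. The NumPy sketch below applies it to hypothetical labels and probabilities; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical labels and predicted probabilities of label 1.
y_true = [1, 0, 1]
p_pred = [0.9, 0.1, 0.1]  # the last prediction is confidently wrong

per_sample = [binary_logloss([y], [p]) for y, p in zip(y_true, p_pred)]
loss = binary_logloss(y_true, p_pred)
print(per_sample, loss)
```

The two correct predictions each contribute about 0.105, while the single confidently wrong one contributes about 2.303 and dominates the average.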
# Calculate the logloss metric
logloss_score = logloss(
    df_true_bin,
    df_pred_bin,
    col_user=COL_USER,
    col_item=COL_ITEM,
    col_rating=COL_RATING,
    col_prediction=COL_PREDICTION,
)
print(f"The logloss score is {logloss_score}")
It is worth noting that logloss may be sensitive to the class balance of datasets, as it penalizes heavily classifiers that are confident about incorrect classifications. To demonstrate, the ground truth data set for testing is manipulated purposely to unbalance the binary labels. For example, the following binarizes the original rating data by using a lower threshold, i.e., 2, to create more positive feedback from the user.
df_true_bin_pos = df_true.copy()
df_true_bin_pos[COL_RATING] = df_true_bin_pos[COL_RATING].apply(lambda x: 1 if x > 2 else 0)
df_true_bin_pos
By using a threshold of 2, the labels in the ground-truth data are no longer balanced, and the ratio of 1 over 0 is
one_zero_ratio = df_true_bin_pos[COL_RATING].sum() / (df_true_bin_pos.shape[0] - df_true_bin_pos[COL_RATING].sum())
print(f"The ratio between label 1 and label 0 is {one_zero_ratio}")
Another prediction dataset is also created, where the probabilities for label 1 and label 0 are fixed. Without loss of generality, the probability of predicting 1 is 0.6. The dataset is purposely created so that the precision is 100% given an assumed cut-off of 0.5.
prob_true = 0.6
df_pred_bin_pos = df_true_bin_pos.copy()
df_pred_bin_pos[COL_PREDICTION] = df_pred_bin_pos[COL_PREDICTION].apply(lambda x: prob_true if x==1 else 1-prob_true)
df_pred_bin_pos
Then the logloss is calculated as follows.
# Calculate the logloss metric
logloss_score_pos = logloss(
    df_true_bin_pos,
    df_pred_bin_pos,
    col_user=COL_USER,
    col_item=COL_ITEM,
    col_rating=COL_RATING,
    col_prediction=COL_PREDICTION,
)
# Print the score computed for the imbalanced labels, not the earlier one.
print(f"The logloss score is {logloss_score_pos}")
For comparison, a similar process is used with a threshold value of 3 to create a more balanced dataset. Another prediction dataset is also created by using the balanced dataset. Again, the probabilities of predicting label 1 and label 0 are fixed as 0.6 and 0.4, respectively. NOTE, same as above, in this case, the prediction also gives us a 100% precision. The only difference is the proportion of binary labels.
prob_true = 0.6
df_pred_bin_balanced = df_true_bin.copy()
df_pred_bin_balanced[COL_PREDICTION] = df_pred_bin_balanced[COL_PREDICTION].apply(lambda x: prob_true if x==1 else 1-prob_true)
df_pred_bin_balanced
The ratio of label 1 and label 0 is
one_zero_ratio = df_true_bin[COL_RATING].sum() / (df_true_bin.shape[0] - df_true_bin[COL_RATING].sum())
print(f"The ratio between label 1 and label 0 is {one_zero_ratio}")
It is perfectly balanced.
Applying the logloss function to the balanced data gives the result shown below.
# Calculate the logloss metric
logloss_score = logloss(
    df_true_bin,
    df_pred_bin_balanced,
    col_user=COL_USER,
    col_item=COL_ITEM,
    col_rating=COL_RATING,
    col_prediction=COL_PREDICTION,
)
print(f"The logloss score is {logloss_score}")
By definition, a logloss score closer to 0 indicates better predictions, so this score can be compared directly with the one obtained above, where the binary labels are more biased.
| Metric | Range | Selection criteria | Limitation | Reference |
|---|---|---|---|---|
| Precision | $\geq 0$ and $\leq 1$ | The closer to $1$ the better. | Only for hits in recommendations. | link |
| Recall | $\geq 0$ and $\leq 1$ | The closer to $1$ the better. | Only for hits in the ground truth. | link |
| NDCG | $\geq 0$ and $\leq 1$ | The closer to $1$ the better. | Does not penalize bad or missing items, and does not handle several equally good items well. | link |
| MAP | $\geq 0$ and $\leq 1$ | The closer to $1$ the better. | Depends on variable distributions. | link |
| AUC | $\geq 0$ and $\leq 1$ | The closer to $1$ the better. $0.5$ indicates an uninformative classifier. | Depends on the number of recommended items (k). | link |
| Logloss | $0$ to $\infty$ | The closer to $0$ the better. | Logloss can be sensitive to imbalanced datasets. | link |
# cleanup spark instance
spark.stop()