xDeepFM: the eXtreme Deep Factorization Machine

This notebook gives a quick example of how to train an xDeepFM model. xDeepFM [1] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. Because it learns feature interactions directly, it can substantially reduce manual feature engineering effort. To summarize, xDeepFM has the following key properties:

It contains a component, named CIN, that learns feature interactions explicitly and at the vector-wise level;
It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
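
To make "explicit" and "vector-wise" concrete, here is a minimal pure-Python sketch of a single CIN layer following the formulation in [1]: each output feature map is a weighted sum of element-wise products between the previous layer's feature maps and the original field embeddings. The names cin_layer and sum_pool are hypothetical and chosen for illustration; this is not the library's implementation, which runs on TensorFlow tensors.

```python
def cin_layer(x_prev, x0, weights):
    """Compute one CIN layer (illustrative sketch, not the library code).

    x_prev:  H_prev x D list of feature maps from the previous layer
    x0:      m x D list of field embeddings (the layer-0 input)
    weights: H_next x H_prev x m list of interaction weights
    returns: H_next x D list of new feature maps
    """
    dim = len(x0[0])
    out = []
    for w_h in weights:  # one weight matrix per output feature map
        row = []
        for d in range(dim):  # the same weights are shared across all D dimensions
            total = 0.0
            for i, x_prev_i in enumerate(x_prev):
                for j, x0_j in enumerate(x0):
                    # Product of two embedding vectors at dimension d,
                    # weighted per (i, j) pair: a vector-wise interaction.
                    total += w_h[i][j] * x_prev_i[d] * x0_j[d]
            row.append(total)
        out.append(row)
    return out


def sum_pool(feature_maps):
    """Sum each feature map over the embedding dimension."""
    return [sum(row) for row in feature_maps]


# Two fields with 2-dimensional embeddings; the first layer uses x0 as x_prev.
x0 = [[1.0, 2.0], [3.0, 4.0]]
# One output map that keeps only the (field 1, field 1) and (field 2, field 2) pairs.
weights = [[[1.0, 0.0], [0.0, 1.0]]]
x1 = cin_layer(x0, x0, weights)
print(x1)            # [[10.0, 20.0]]
print(sum_pool(x1))  # [30.0]
```

Stacking such layers raises the interaction order by one per layer, and the pooled outputs of every layer are concatenated before the final prediction.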
The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters such as use_Linear_part, use_FM_part, use_CIN_part, and use_DNN_part. For example, enabling only use_Linear_part and use_FM_part yields a classical FM model.
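
As a small illustration of this configurability, the sketch below collects two flag combinations into presets. Only the four flag names come from the text; the COMPONENT_PRESETS dictionary and the component_flags helper are hypothetical. The "xdeepfm" preset mirrors the flags this notebook passes later, and leaving use_FM_part off there is an assumption.

```python
# Hypothetical presets; only the four flag names are taken from the notebook.
COMPONENT_PRESETS = {
    # Linear + FM only: the classical FM model described above.
    "fm": {
        "use_Linear_part": True,
        "use_FM_part": True,
        "use_CIN_part": False,
        "use_DNN_part": False,
    },
    # Linear + CIN + DNN: the combination used later in this notebook
    # (use_FM_part=False is an assumption, not stated in the text).
    "xdeepfm": {
        "use_Linear_part": True,
        "use_FM_part": False,
        "use_CIN_part": True,
        "use_DNN_part": True,
    },
}


def component_flags(preset):
    """Return a copy of the flag dictionary for the named preset."""
    return dict(COMPONENT_PRESETS[preset])


print(component_flags("fm"))
```

The returned dictionary could then be unpacked into the keyword arguments of prepare_hparams, for example prepare_hparams(yaml_file, **component_flags("fm")).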

In this notebook, we test xDeepFM on the Criteo dataset.

0. Global Settings and Imports

python

import os
import sys
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams
from recommenders.models.deeprec.models.xDeepFM import XDeepFMModel
from recommenders.models.deeprec.io.iterator import FFMTextIterator
from recommenders.utils.notebook_utils import store_metadata

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")

Parameters

python

EPOCHS = 10
BATCH_SIZE = 4096
RANDOM_SEED = 42  # Set this to None for non-deterministic result

xDeepFM uses the FFM format as data input: <label> <field_id>:<feature_id>:<feature_value>
Each line represents one instance. <label> is a binary value, with 1 meaning a positive instance and 0 meaning a negative instance. Features are divided into fields. For example, a user's gender is a field with three possible values: male, female, and unknown. Occupation can be another field, with many more possible values than the gender field. Both field indices and feature indices start from 1.
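
To illustrate the format, the following sketch parses one FFM line into a label and a list of (field_id, feature_id, feature_value) triples. The parse_ffm_line function and the sample line are hypothetical and exist only for illustration; the notebook itself reads data through FFMTextIterator.

```python
def parse_ffm_line(line):
    """Split one FFM-formatted line into (label, features).

    features is a list of (field_id, feature_id, feature_value) tuples.
    """
    tokens = line.split()
    label = int(tokens[0])
    features = []
    for token in tokens[1:]:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return label, features


# Hypothetical sample: a positive instance with one feature in each of two fields.
label, features = parse_ffm_line("1 1:3:1.0 2:17:0.5")
print(label)     # 1
print(features)  # [(1, 3, 1.0), (2, 17, 0.5)]
```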

python

tmpdir = TemporaryDirectory()
data_path = tmpdir.name
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
output_file = os.path.join(data_path, r'output.txt')
train_file = os.path.join(data_path, r'cretio_tiny_train')
valid_file = os.path.join(data_path, r'cretio_tiny_valid')
test_file = os.path.join(data_path, r'cretio_tiny_test')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')

2. Criteo data

Now let's try xDeepFM on a real-world dataset: a small sample of the Criteo dataset. The Criteo dataset is a well-known industry benchmark for developing CTR prediction models, and research papers frequently use it for evaluation.

The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.

python

print('Demo with Criteo dataset')
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=2300000, 
                          FIELD_COUNT=39, 
                          cross_l2=0.01, 
                          embed_l2=0.01, 
                          layer_l2=0.01,
                          learning_rate=0.002, 
                          batch_size=BATCH_SIZE, 
                          epochs=EPOCHS, 
                          cross_layer_sizes=[20, 10], 
                          init_value=0.1, 
                          layer_sizes=[20,20],
                          use_Linear_part=True, 
                          use_CIN_part=True, 
                          use_DNN_part=True)
print(hparams)

python

model = XDeepFMModel(hparams, FFMTextIterator, seed=RANDOM_SEED)

python

# check the predictive performance before the model is trained
print(model.run_eval(test_file))

python

%%time
model.fit(train_file, valid_file)

python

# check the predictive performance after the model is trained
result = model.run_eval(test_file)
print(result)
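
The evaluation result includes the auc and logloss metrics. As a reminder of what these numbers measure, here is a pure-Python sketch of how the two metrics are commonly defined. The log_loss and auc functions and the sample data are hypothetical; this is not the evaluation code the library runs.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def auc(labels, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted click probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.4, 0.6]
print(auc(y_true, y_prob))  # 0.75
print(log_loss(y_true, y_prob))
```

An untrained model typically scores an AUC near 0.5, which is why comparing the values before and after model.fit is informative.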

python

# Record results for tests - ignore this cell
store_metadata("auc", result["auc"])
store_metadata("logloss", result["logloss"])

python

# Cleanup
tmpdir.cleanup()

Reference

[1] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018.