.. _dataset:

Dataset module
##############

Recommendation datasets and related utilities

Recommendation datasets
***********************

Amazon Reviews
==============

`Amazon Reviews dataset <https://snap.stanford.edu/data/web-Amazon.html>`_ consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review.

:Citation:

    J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text",
    RecSys, 2013.

.. automodule:: recommenders.datasets.amazon_reviews
    :members:

CORD-19
=======

`COVID-19 Open Research Dataset (CORD-19) <https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/>`_ is a full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.

:Citation:

    Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D.,
    Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. "Cord-19: The COVID-19 Open Research Dataset.", 2020.

.. automodule:: recommenders.datasets.covid_utils
    :members:

Criteo
======

`Criteo dataset <https://www.kaggle.com/c/criteo-display-ad-challenge/overview>`_, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback for millions of display ads. Every ad has 40 attributes. The first attribute is the label, where 1 means the ad was clicked and 0 means it was not. The remaining attributes consist of 13 integer columns and 26 categorical columns.
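The 40-attribute layout can be sketched in plain Python. This is only an illustration, not the loader provided by this module: the column names (``label``, ``int00`` to ``int12``, ``cat00`` to ``cat25``), the tab separator, and the use of empty fields for missing values are assumptions made for the example, and the sample row is synthetic.

```python
# Hypothetical column names for the 40 attributes:
# 1 label + 13 integer columns + 26 categorical columns.
LABEL_COL = "label"
INT_COLS = [f"int{i:02d}" for i in range(13)]
CAT_COLS = [f"cat{i:02d}" for i in range(26)]
HEADER = [LABEL_COL] + INT_COLS + CAT_COLS


def parse_row(line, sep="\t"):
    """Parse one delimited row into a dict keyed by HEADER.

    Assumes empty fields mark missing values; they are mapped to None.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(HEADER):
        raise ValueError(f"expected {len(HEADER)} fields, got {len(fields)}")
    row = {}
    for name, value in zip(HEADER, fields):
        if name == LABEL_COL:
            row[name] = int(value)  # 1 = clicked, 0 = not clicked
        elif name in INT_COLS:
            row[name] = int(value) if value else None
        else:
            row[name] = value or None
    return row


# Synthetic row for illustration only (not real Criteo data).
sample = "\t".join(["1"] + [str(i) for i in range(13)] + [f"c{i}" for i in range(26)])
row = parse_row(sample)
print(len(HEADER), row["label"], row["int03"], row["cat25"])
```

Keeping the label, integer, and categorical column names in separate lists makes it easy to route each group to different preprocessing later.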

.. automodule:: recommenders.datasets.criteo
    :members:

MIND
====

`MIcrosoft News Dataset (MIND) <https://msnews.github.io/>`_ is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content, including title, abstract, body, category, and entities. Each impression log contains the click events, the non-clicked events, and the user's historical news click behaviors before that impression. To protect user privacy, each user was de-linked from the production system and securely hashed into an anonymized ID.
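The structure of an impression log can be illustrated with a small parser. This is a sketch under stated assumptions rather than the code in this module: it assumes a tab-separated line with five fields (impression ID, user ID, time, click history, impressions), where each impression entry is a news ID followed by ``-1`` (clicked) or ``-0`` (not clicked). The ``Impression`` type and ``parse_behavior_line`` function are hypothetical names, and the sample line is synthetic.

```python
from typing import List, NamedTuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]      # news clicked before this impression
    clicked: List[str]      # news clicked in this impression
    non_clicked: List[str]  # news shown but not clicked


def parse_behavior_line(line: str) -> Impression:
    """Split one assumed behaviors line into its five tab-separated fields."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, non_clicked = [], []
    for item in impressions.split():
        # Each entry is assumed to look like "<news_id>-<0 or 1>".
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else non_clicked).append(news_id)
    return Impression(impression_id, user_id, time, history.split(), clicked, non_clicked)


# Synthetic line for illustration only.
sample = "1\tU100\t11/15/2019 8:55:22 AM\tN1 N2 N3\tN4-1 N5-0 N6-0"
imp = parse_behavior_line(sample)
print(imp.clicked, imp.non_clicked, imp.history)
```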

:Citation:

    Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu
    and Ming Zhou, "MIND: A Large-scale Dataset for News Recommendation", ACL, 2020.

.. automodule:: recommenders.datasets.mind
    :members:

MovieLens
=========

The `MovieLens datasets <https://grouplens.org/datasets/movielens/>`_, first released in 1998, describe people's expressed preferences for movies. These preferences take the form of <user, item, rating, timestamp> tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time.

The datasets come in several sizes:

* MovieLens 100k: 100,000 ratings from 1000 users on 1700 movies.
* MovieLens 1M: 1 million ratings from 6000 users on 4000 movies.
* MovieLens 10M: 10 million ratings from 72000 users on 10000 movies.
* MovieLens 20M: 20 million ratings from 138000 users on 27000 movies.
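The <user, item, rating, timestamp> tuple form can be sketched with the standard library alone. This is an illustrative example, not the loader in this module: it assumes a tab-separated file with the four fields in that order and a timestamp expressed in Unix seconds. ``Rating`` and ``read_ratings`` are hypothetical names, and the sample rows are synthetic.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterator, NamedTuple


class Rating(NamedTuple):
    user: int
    item: int
    rating: float
    timestamp: datetime


def read_ratings(handle, delimiter="\t") -> Iterator[Rating]:
    """Yield one Rating per row of an assumed user/item/rating/timestamp file."""
    for user, item, rating, ts in csv.reader(handle, delimiter=delimiter):
        yield Rating(
            int(user),
            int(item),
            float(rating),
            # Assumes the timestamp is in Unix seconds.
            datetime.fromtimestamp(int(ts), tz=timezone.utc),
        )


# Synthetic rows for illustration only.
sample = io.StringIO("1\t10\t4\t881250949\n2\t10\t3.5\t891717742\n")
ratings = list(read_ratings(sample))
mean_rating = sum(r.rating for r in ratings) / len(ratings)
print(ratings[0].user, ratings[0].item, mean_rating)
```

In practice, the ``recommenders.datasets.movielens`` module documented in this section handles downloading and loading, so hand parsing like this is mainly useful for understanding the data shape.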

:Citation:

    F. M. Harper and J. A. Konstan. "The MovieLens Datasets: History and Context".
    ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19,
    DOI=http://dx.doi.org/10.1145/2827872, 2015.

.. automodule:: recommenders.datasets.movielens
    :members:

Download utilities
******************

.. automodule:: recommenders.datasets.download_utils
    :members:

Pandas dataframe utilities
**************************

.. automodule:: recommenders.datasets.pandas_df_utils
    :members:

Splitter utilities
******************

Python splitters
================

.. automodule:: recommenders.datasets.python_splitters
    :members:

PySpark splitters
=================

.. automodule:: recommenders.datasets.spark_splitters
    :members:

Other splitter utilities
========================

.. automodule:: recommenders.datasets.split_utils
    :members:

Sparse utilities
****************

.. automodule:: recommenders.datasets.sparse
    :members:

Knowledge graph utilities
*************************

.. automodule:: recommenders.datasets.wikidata
    :members: