
*Copyright (c) Recommenders contributors.*

*Licensed under the MIT License.*

Data manipulation

This notebook walks through the steps needed to generate DKN's input dataset from the MAG COVID-19 raw dataset.

python
import os 
import codecs
import pickle
import time 
from datetime import datetime  
import random
import numpy as np
import math

from utils.task_helper import *
from utils.general import *
from utils.data_helper import *

First, let's generate the paper data. For DKN, each paper is represented in the following format:

[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]

where w and e are the indices of the word and entity sequences of this paper. Words and entities are aligned position by position. As a quick example, consider a paper whose title is:
One Health approach in the South East Asia region: opportunities and challenges

Then the title word values could be 101,56,23,14,1,69,256,887,365,32,11,567 and the title entity values could be 10,10,0,0,0,45,45,45,0,0,0,0. The first two entity values are both 10, indicating that these two words correspond to the same entity. Word and entity values are hashed to indices from 1 to n and from 1 to m respectively, where n and m are the numbers of distinct words and entities; 0 marks words with no linked entity.
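As a minimal illustration of how one such line can be serialized (the paper id and indices below are hypothetical, and the exact delimiters used in paper_feature.txt may differ):

python
# Hypothetical paper id and the aligned index sequences from the example above.
paper_id = 'P123'
word_ids = [101, 56, 23, 14, 1, 69, 256, 887, 365, 32, 11, 567]
entity_ids = [10, 10, 0, 0, 0, 45, 45, 45, 0, 0, 0, 0]

# '[Newsid] [w1,w2,...] [e1,e2,...]' with comma-separated index lists.
line = '{0} {1} {2}'.format(
    paper_id,
    ','.join(map(str, word_ids)),
    ','.join(map(str, entity_ids)),
)
print(line)  # P123 101,56,...,567 10,10,...,0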

python
InFile_dir = 'data_folder/raw'
OutFile_dir = 'data_folder/my'
create_dir(OutFile_dir)

Path_PaperTitleAbs_bySentence = os.path.join(InFile_dir, 'PaperTitleAbs_bySentence.txt')
Path_PaperFeature = os.path.join(OutFile_dir, 'paper_feature.txt')

max_word_size_per_paper = 15 

Step 1 is to hash the words and entities.

For simplicity, in this tutorial we only use the paper title to represent the content of a paper. You can certainly use more content, such as the paper abstract and body.

Each feature length is fixed at k (max_word_size_per_paper): if a document contains more than k words, we truncate it to k words; if it contains fewer, we pad the sequence with 0 at the end.
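A minimal sketch of this hashing, truncation, and padding logic (the real work is done by gen_paper_content below; the helper here is purely illustrative):

python
def hash_and_pad(tokens, token2idx, k):
    # Assign each unseen token the next free index, starting from 1 (0 is reserved for padding).
    ids = []
    for t in tokens[:k]:  # truncate to at most k tokens
        if t not in token2idx:
            token2idx[t] = len(token2idx) + 1
        ids.append(token2idx[t])
    return ids + [0] * (k - len(ids))  # pad with 0 up to length k

demo_word2idx = {}
title = 'One Health approach in the South East Asia region opportunities and challenges'
print(hash_and_pad(title.lower().split(), demo_word2idx, max_word_size_per_paper))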

python
word2idx = {}
entity2idx = {}
relation2idx = {}
word2idx, entity2idx = gen_paper_content(
    Path_PaperTitleAbs_bySentence, Path_PaperFeature, word2idx, entity2idx, field=["Title"], doc_len=max_word_size_per_paper
)

Step 2 is to generate the data of the knowledge graph, as a set of triples:

head, tail, relation

python
word2idx_filename = os.path.join(OutFile_dir, 'word2idx.pkl')
entity2idx_filename = os.path.join(OutFile_dir, 'entity2idx.pkl')

Path_RelatedFieldOfStudy = os.path.join(InFile_dir, 'RelatedFieldOfStudy.txt')
OutFile_dir_KG = os.path.join(OutFile_dir, 'KG')
create_dir(OutFile_dir_KG)

gen_knowledge_relations(Path_RelatedFieldOfStudy, OutFile_dir_KG, entity2idx, relation2idx) 

The data files will be written to the folder OutFile_dir_KG.
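Each line in those files is a triple of integer indices. As a hedged sketch of reading such a triple file back (the exact file names and delimiter inside OutFile_dir_KG are assumptions):

python
def load_triples(path, sep='\t'):
    # Read (head, tail, relation) index triples, one per line.
    triples = []
    with open(path, 'r') as rd:
        for line in rd:
            head, tail, relation = line.strip().split(sep)
            triples.append((int(head), int(tail), int(relation)))
    return triples

# e.g.: triples = load_triples(os.path.join(OutFile_dir_KG, 'triples.txt'))  # hypothetical file name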

To train word embeddings, we need a collection of sentences:

python
Path_SentenceCollection = os.path.join(OutFile_dir, 'sentence.txt')
gen_sentence_collection(
    Path_PaperTitleAbs_bySentence,
    Path_SentenceCollection,
    word2idx
)

## save the id mapper
with open(word2idx_filename, 'wb') as f:
    pickle.dump(word2idx, f)
dump_dict_as_txt(word2idx, os.path.join(OutFile_dir, 'word2id.tsv'))
with open(entity2idx_filename, 'wb') as f:
    pickle.dump(entity2idx, f)
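This sentence collection is what a later step of the tutorial trains word embeddings on. As a hedged sketch of what that training can look like with gensim (version >= 4 API; the hyperparameters are illustrative, and the tutorial's own training step may differ):

python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Train skip-gram embeddings on the sentence collection, one sentence per line.
w2v = Word2Vec(
    sentences=LineSentence(Path_SentenceCollection),
    vector_size=32,  # embedding dimension
    window=5,
    min_count=1,
    sg=1,            # 1 = skip-gram
    workers=4,
)
# w2v.wv[token] then returns the embedding vector of a token.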

Next we generate the user-related files. Our first task is user-to-paper recommendation. For each user, we collect all of the papers they have cited and arrange them in chronological order. The recommendation task can then be formulated as: given a user's citation history, predict which paper they will cite in the future.
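To make the formulation concrete, here is a minimal sketch (with hypothetical paper ids) of how a chronologically ordered citation list yields (history, next-cited-paper) instances:

python
# Hypothetical chronologically ordered citations of one author.
cited_papers = ['P1', 'P5', 'P9', 'P12']

# Each later citation becomes a prediction target given the earlier ones.
instances = [(cited_papers[:i], cited_papers[i]) for i in range(1, len(cited_papers))]
print(instances)  # [(['P1'], 'P5'), (['P1', 'P5'], 'P9'), (['P1', 'P5', 'P9'], 'P12')]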

python

_t0 = time.time()

Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_Author2ReferencePapers = os.path.join(OutFile_dir, 'Author2ReferencePapers.tsv')

author2paper_list = load_author_paperlist(Path_PaperAuthorAffiliations)
paper2date = load_paper_date(Path_Papers)
paper2reference_list = load_paper_reference(Path_PaperReference)

author2reference_list = get_author_reference_list(author2paper_list, paper2reference_list, paper2date)

output_author2reference_list(
    author2reference_list,
    Path_Author2ReferencePapers
)

OutFile_dir_DKN = os.path.join(OutFile_dir, 'DKN-training-folder')
create_dir(OutFile_dir_DKN)

DKN takes several more files as inputs:

  • training / validation / test files: each line in these files represents one instance. The impressionid is used to evaluate performance within an impression session, so it matters only during evaluation; you can set it to 0 for training data. The format is:

[label] [userid] [CandidateNews]%[impressionid]

e.g., 1 train_U1 N1%0

  • user history file: each line in this file represents a user's citation history. You need to set the his_size parameter in the config file, which is the maximum length of click history to use per user. If a user's click history is longer than his_size, we automatically keep only the last his_size items; if it is shorter, we automatically pad it with 0. The format is (a parsing sketch for both formats follows this list):

[Userid] [newsid1,newsid2...]

e.g., train_U1 N1,N2
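As a minimal parsing sketch for the two line formats above (the his_size handling mirrors the description; whether padding goes at the front or the back of the history is an assumption):

python
def parse_instance(line):
    # '[label] [userid] [CandidateNews]%[impressionid]', e.g. '1 train_U1 N1%0'
    label, user_id, rest = line.strip().split(' ')
    candidate, impression_id = rest.split('%')
    return int(label), user_id, candidate, impression_id

def parse_history(line, his_size=50):
    # '[Userid] [newsid1,newsid2...]', e.g. 'train_U1 N1,N2'
    user_id, items = line.strip().split(' ')
    history = items.split(',')[-his_size:]                 # keep only the last his_size items
    history = ['0'] * (his_size - len(history)) + history  # pad with 0 (front-padding assumed)
    return user_id, history

print(parse_instance('1 train_U1 N1%0'))
print(parse_history('train_U1 N1,N2', his_size=5))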

DKN treats recommendation as a binary classification problem. We sample negative instances according to item popularity:

python
gen_experiment_splits(
    Path_Author2ReferencePapers,
    OutFile_dir_DKN,
    Path_PaperFeature,
    item_ratio=0.1,
    tag='small',
    process_num=2
)

_t1 = time.time()
print('Time elapsed for user data: {0:.1f}s'.format(_t1 - _t0))
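Popularity-based sampling means an item is drawn as a negative with probability proportional to how frequently it occurs in the data. A hedged sketch of the idea (gen_experiment_splits implements its own version internally):

python
from collections import Counter

def sample_negatives(positive_items, all_interactions, n_neg, seed=42):
    # Sampling probability of each item is proportional to its global popularity.
    rng = np.random.default_rng(seed)
    popularity = Counter(all_interactions)
    items = list(popularity)
    probs = np.array([popularity[i] for i in items], dtype=float)
    probs /= probs.sum()
    negatives = []
    while len(negatives) < n_neg:
        cand = str(rng.choice(items, p=probs))
        if cand not in positive_items:  # never sample an observed positive
            negatives.append(cand)
    return negatives

print(sample_negatives({'P1'}, ['P1', 'P2', 'P2', 'P3', 'P3', 'P3'], n_neg=2))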

Prepare item2item recommendation dataset

Our second recommendation scenario is item-to-item recommendation. Given a paper, we recommend a list of related papers for users to cite. Here we use a supervised learning approach to train this model. Each instance is a tuple <paper_a, paper_b, label>, where label = 1 means the pair is highly related and label = 0 otherwise. The positive labels are constructed in the following three ways (a sketch of the co-citation counting idea follows the list):

  1. Paper A and B overlap a lot in their reference lists;
  2. Paper A and B are co-cited by many other papers;
  3. Paper A and B are published within 12 months of each other by the same first author.
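A minimal sketch of the co-citation counting idea (two papers are co-cited whenever a third paper references both); the actual pair counts are produced by gen_paper_cocitation below:

python
from itertools import combinations
from collections import Counter

def count_cocitations(paper2refs):
    # Every unordered pair in a citing paper's reference list is co-cited once by that paper.
    cocited = Counter()
    for refs in paper2refs.values():
        for a, b in combinations(sorted(set(refs)), 2):
            cocited[(a, b)] += 1
    return cocited

print(count_cocitations({'P1': ['A', 'B'], 'P2': ['A', 'B', 'C']}))
# Counter({('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1})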
python
OutFile_dir_item2item = r'data_folder/my/item2item'
create_dir(OutFile_dir_item2item)
item_set = load_has_feature_items(Path_PaperFeature)


Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
pair2CocitedCnt, pair2CoReferenceCnt = gen_paper_cocitation(Path_PaperReference)

Path_paper_pair_cocitation = os.path.join(OutFile_dir_item2item, 'paper_pair_cocitation_cnt.csv')
Path_paper_pair_coreference = os.path.join(OutFile_dir_item2item, 'paper_pair_coreference_cnt.csv')

with open(Path_paper_pair_cocitation, 'w') as wt:
    for p, v in pair2CocitedCnt.items():
        if p[0] in item_set and p[1] in item_set:
            wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))

with open(Path_paper_pair_coreference, 'w') as wt:
    for p, v in pair2CoReferenceCnt.items():
        if p[0] in item_set and p[1] in item_set:
            wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))
            
            
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
paper2date = load_paper_date(Path_Papers)
author2paper_list, paper2author_set = load_paper_author_relation(Path_PaperAuthorAffiliations)
Path_FirstAuthorPaperPair = os.path.join(OutFile_dir_item2item, 'paper_pair_cofirstauthor.csv')
first_author_pairs = gen_paper_pairs_from_same_author(
    author2paper_list, paper2author_set, paper2date, Path_FirstAuthorPaperPair, item_set
)

Now let's split the instances into training and validation sets and conduct negative sampling:

python
split_train_valid_file(
    [Path_paper_pair_cocitation, Path_FirstAuthorPaperPair, Path_paper_pair_coreference],
    OutFile_dir_DKN
)
gen_negative_instances(
    item_set,
    os.path.join(OutFile_dir_DKN, 'item2item_train.txt'),
    os.path.join(OutFile_dir_DKN, 'item2item_train_instances.txt'),
    9
)
gen_negative_instances(
    item_set,
    os.path.join(OutFile_dir_DKN, 'item2item_valid.txt'),
    os.path.join(OutFile_dir_DKN, 'item2item_valid_instances.txt'),
    9
)

Generating the full dataset takes much longer; feel free to let it run in the background...

python
gen_experiment_splits(
    Path_Author2ReferencePapers,
    OutFile_dir_DKN,
    Path_PaperFeature,
    item_ratio=1.0,
    tag='full',
    process_num=8
)