examples/07_tutorials/KDD2020-tutorial/step1_data_preparation.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
This notebook provides the necessary steps to generate DKN's input dataset from the MAG COVID-19 raw dataset.
import os
import codecs
import pickle
import time
from datetime import datetime
import random
import numpy as np
import math
from utils.task_helper import *
from utils.general import *
from utils.data_helper import *
First let's generate data for papers. For DKN, the paper data format is:
[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]
where w and e are the indices in the word and entity sequences of this paper.
Words and entities are aligned. As a quick example, take a paper whose title is:
One Health approach in the South East Asia region: opportunities and challenges
Then the title word values could be
101,56,23,14,1,69,256,887,365,32,11,567
and the title entity values could be:
10,10,0,0,0,45,45,45,0,0,0,0
The first two values of the entity sequence are both 10, indicating that these two words correspond to the same entity; a value of 0 means the word is not linked to any entity. Word and entity values are hashed from 1 to n and 1 to m respectively, where n/m is the number of distinct words/entities.
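To make the alignment concrete, here is a minimal sketch that hashes a toy title into aligned word and entity index sequences (the linker output, dictionary, and helper below are hypothetical and independent of the tutorial's utils):
# Minimal sketch (hypothetical): hash a title into aligned word and entity index sequences.
title = "One Health approach in the South East Asia region opportunities and challenges"
# Assume an entity linker marked "One Health" as one entity and "South East Asia" as another.
linked_entities = {("One", "Health"): 10, ("South", "East", "Asia"): 45}

word2idx_demo = {}

def hash_id(token, mapper):
    # Assign the next free index (starting at 1) to an unseen token.
    if token not in mapper:
        mapper[token] = len(mapper) + 1
    return mapper[token]

words = title.split()
word_ids = [hash_id(w, word2idx_demo) for w in words]

entity_ids = [0] * len(words)  # 0 = word not linked to any entity
for span, ent_id in linked_entities.items():
    for i in range(len(words) - len(span) + 1):
        if tuple(words[i:i + len(span)]) == span:
            for j in range(len(span)):
                entity_ids[i + j] = ent_id

print(word_ids)    # one index per title word (actual indices in the tutorial will differ)
print(entity_ids)  # [10, 10, 0, 0, 0, 45, 45, 45, 0, 0, 0, 0]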
InFile_dir = 'data_folder/raw'
OutFile_dir = 'data_folder/my'
create_dir(OutFile_dir)
Path_PaperTitleAbs_bySentence = os.path.join(InFile_dir, 'PaperTitleAbs_bySentence.txt')
Path_PaperFeature = os.path.join(OutFile_dir, 'paper_feature.txt')
max_word_size_per_paper = 15
Step 1 is to hash the words and entities.
For simplicity, in this tutorial we only use the paper title to represent the content of a paper. You can certainly use more content, such as the paper abstract and body.
Each feature is fixed at length k (max_word_size_per_paper): if a document has more than k words, we truncate it to the first k words; if it has fewer than k words, we pad it with 0s at the end.
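As a minimal sketch of this truncate-or-pad behavior (hypothetical helper, independent of gen_paper_content):
# Hypothetical illustration of truncating/padding an index sequence to length k.
def fix_length(ids, k):
    # Truncate to the first k ids, or pad with 0 at the end up to length k.
    return ids[:k] if len(ids) >= k else ids + [0] * (k - len(ids))

print(fix_length([5, 8, 2], 5))              # [5, 8, 2, 0, 0]
print(fix_length([5, 8, 2, 9, 1, 7, 4], 5))  # [5, 8, 2, 9, 1]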
word2idx = {}
entity2idx = {}
relation2idx = {}
word2idx, entity2idx = gen_paper_content(
Path_PaperTitleAbs_bySentence, Path_PaperFeature, word2idx, entity2idx, field=["Title"], doc_len=max_word_size_per_paper
)
Step 2 is to generate the knowledge graph data, in terms of a set of triples:
head, tail, relation
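Conceptually, each raw triple of entity and relation names is mapped to integer ids; a minimal, hypothetical sketch of that mapping (gen_knowledge_relations below does the real work against the shared entity2idx):
# Hypothetical sketch: map a raw (head, tail, relation) string triple to integer ids.
def to_triple_ids(head, tail, relation, ent_map, rel_map):
    # Look up (or assign) ids for the two entities and the relation.
    for key, mapper in ((head, ent_map), (tail, ent_map), (relation, rel_map)):
        if key not in mapper:
            mapper[key] = len(mapper) + 1
    return ent_map[head], ent_map[tail], rel_map[relation]

print(to_triple_ids("covid-19", "coronavirus", "related_to", {}, {}))  # (1, 2, 1)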
word2idx_filename = os.path.join(OutFile_dir, 'word2idx.pkl')
entity2idx_filename = os.path.join(OutFile_dir, 'entity2idx.pkl')
Path_RelatedFieldOfStudy = os.path.join(InFile_dir, 'RelatedFieldOfStudy.txt')
OutFile_dir_KG = os.path.join(OutFile_dir, 'KG')
create_dir(OutFile_dir_KG)
gen_knowledge_relations(Path_RelatedFieldOfStudy, OutFile_dir_KG, entity2idx, relation2idx)
The data files will be written to the folder OutFile_dir_KG.
To train word embeddings, we need a collection of sentences:
Path_SentenceCollection = os.path.join(OutFile_dir, 'sentence.txt')
gen_sentence_collection(
Path_PaperTitleAbs_bySentence,
Path_SentenceCollection,
word2idx
)
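These sentences can later be fed to a word-embedding trainer; as a minimal sketch, assuming gensim (>= 4.0) is available and with purely illustrative hyperparameters:
# Hypothetical sketch: train word embeddings on the generated sentence collection.
# Assumes gensim >= 4.0 is installed; hyperparameters are illustrative only.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

w2v = Word2Vec(
    sentences=LineSentence(Path_SentenceCollection),  # one sentence per line
    vector_size=32,   # embedding dimension
    window=5,
    min_count=1,
    workers=4,
)
w2v.save(os.path.join(OutFile_dir, 'word2vec.model'))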
## save the id mappers
with open(word2idx_filename, 'wb') as f:
pickle.dump(word2idx, f)
dump_dict_as_txt(word2idx, os.path.join(OutFile_dir, 'word2id.tsv'))
with open(entity2idx_filename, 'wb') as f:
pickle.dump(entity2idx, f)
Next we generate user-related files. Our first task is user-to-paper recommendation. For each user, we collect all of his/her cited papers and arrange them in chronological order. The recommendation task can then be formulated as: given a user's citation history, predict which papers he/she will cite in the future.
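A minimal sketch of how such a chronological citation history can be assembled from toy data (the tutorial's load_* and get_author_reference_list helpers below do the real work over the MAG files):
# Hypothetical sketch: build each author's cited-paper history, ordered by citing-paper date.
author2papers = {"A1": ["P1", "P2"]}                   # author -> papers he/she wrote
paper2refs = {"P1": ["R1", "R2"], "P2": ["R3"]}        # paper -> papers it cites
paper2date_demo = {"P1": "2020-03-01", "P2": "2020-05-10"}  # citing paper -> publication date

author2history = {}
for author, papers in author2papers.items():
    events = []
    for p in sorted(papers, key=lambda p: paper2date_demo[p]):  # chronological order
        events.extend(paper2refs.get(p, []))
    author2history[author] = events

print(author2history)  # {'A1': ['R1', 'R2', 'R3']}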
_t0 = time.time()
Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_Author2ReferencePapers = os.path.join(OutFile_dir, 'Author2ReferencePapers.tsv')
author2paper_list = load_author_paperlist(Path_PaperAuthorAffiliations)
paper2date = load_paper_date(Path_Papers)
paper2reference_list = load_paper_reference(Path_PaperReference)
author2reference_list = get_author_reference_list(author2paper_list, paper2reference_list, paper2date)
output_author2reference_list(
author2reference_list,
Path_Author2ReferencePapers
)
OutFile_dir_DKN = os.path.join(OutFile_dir, 'DKN-training-folder')
create_dir(OutFile_dir_DKN)
Now we generate the training and validation files in DKN's input format. Each line of the instance file is:
[label] [userid] [CandidateNews]%[impressionid]
e.g., 1 train_U1 N1%0
and each line of the user history file is:
[Userid] [newsid1,newsid2...]
e.g., train_U1 N1,N2
DKN treats recommendation as a binary classification problem. We sample negative instances according to item popularity:
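For intuition, a minimal sketch of popularity-proportional negative sampling with a hypothetical helper (gen_experiment_splits handles the actual sampling internally):
# Hypothetical sketch: sample negatives with probability proportional to item popularity.
import random
from collections import Counter

def sample_negatives(all_citations, positive_set, num_neg):
    # Draw num_neg items, weighted by how often each item is cited overall,
    # skipping items the user has already cited (the positives).
    popularity = Counter(all_citations)
    items = list(popularity)
    weights = [popularity[i] for i in items]
    negatives = set()
    while len(negatives) < num_neg:
        cand = random.choices(items, weights=weights, k=1)[0]
        if cand not in positive_set:
            negatives.add(cand)
    return list(negatives)

print(sample_negatives(["P1", "P1", "P2", "P3", "P3", "P3"], {"P1"}, 2))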
gen_experiment_splits(
Path_Author2ReferencePapers,
OutFile_dir_DKN,
Path_PaperFeature,
item_ratio=0.1,
tag='small',
process_num=2
)
_t1 = time.time()
print('time elapsed for user data: {0:.1f}s'.format(_t1 - _t0))
Our second recommendation scenario is item-to-item recommendation. Given a paper, we recommend a list of related papers for users to cite. Here we use a supervised learning approach to train this model. Each instance is a tuple of <paper_a, paper_b, label>. Label = 1 means the pair is highly related; otherwise the label is 0. The positive labels are constructed in the following three ways, as the code below shows: co-citation (the two papers are cited together by another paper), co-reference (the two papers cite a common paper), and sharing the same first author.
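For intuition, here is a minimal sketch of co-citation counting over a toy citation map (hypothetical; gen_paper_cocitation below performs the real computation over PaperReferences.txt):
# Hypothetical sketch: count how often two papers are cited together by the same citing paper.
from collections import Counter
from itertools import combinations

citing2refs = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
}

cocited = Counter()
for refs in citing2refs.values():
    for a, b in combinations(sorted(set(refs)), 2):
        cocited[(a, b)] += 1

print(cocited)  # ('A', 'B') co-cited twice; ('A', 'C') and ('B', 'C') once each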
OutFile_dir_item2item = r'data_folder/my/item2item'
create_dir(OutFile_dir_item2item)
Path_PaperFeature
item_set = load_has_feature_items(Path_PaperFeature)
Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
pair2CocitedCnt, pair2CoReferenceCnt = gen_paper_cocitation(Path_PaperReference)
Path_paper_pair_cocitation = os.path.join(OutFile_dir_item2item, 'paper_pair_cocitation_cnt.csv')
Path_paper_pair_coreference = os.path.join(OutFile_dir_item2item, 'paper_pair_coreference_cnt.csv')
with open(Path_paper_pair_cocitation, 'w') as wt:
for p, v in pair2CocitedCnt.items():
if p[0] in item_set and p[1] in item_set:
wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))
with open(Path_paper_pair_coreference, 'w') as wt:
for p, v in pair2CoReferenceCnt.items():
if p[0] in item_set and p[1] in item_set:
wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
paper2date = load_paper_date(Path_Papers)
author2paper_list, paper2author_set = load_paper_author_relation(Path_PaperAuthorAffiliations)
Path_FirstAuthorPaperPair = os.path.join(OutFile_dir_item2item, 'paper_pair_cofirstauthor.csv')
first_author_pairs = gen_paper_pairs_from_same_author(
author2paper_list, paper2author_set, paper2date, Path_FirstAuthorPaperPair, item_set
)
Now let's split the instances into training and validation sets, and conduct negative sampling:
split_train_valid_file(
[Path_paper_pair_cocitation, Path_FirstAuthorPaperPair, Path_paper_pair_coreference],
OutFile_dir_DKN
)
gen_negative_instances(
item_set,
os.path.join(OutFile_dir_DKN, 'item2item_train.txt'),
os.path.join(OutFile_dir_DKN, 'item2item_train_instances.txt'),
9
)
gen_negative_instances(
item_set,
os.path.join(OutFile_dir_DKN, 'item2item_valid.txt'),
os.path.join(OutFile_dir_DKN, 'item2item_valid_instances.txt'),
9
)
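The last positional argument (9) is presumably the number of negatives drawn per positive pair; a minimal, hypothetical sketch of such pair-level uniform sampling (not the helper's actual implementation):
# Hypothetical sketch of pair-level negative sampling: for each positive (a, b) pair,
# draw neg_num papers uniformly from the item set that are not known positives of a.
import random

def negatives_for_pair(a, positives_of_a, item_pool, neg_num=9):
    candidates = [p for p in item_pool if p != a and p not in positives_of_a]
    return random.sample(candidates, min(neg_num, len(candidates)))

print(negatives_for_pair("P1", {"P2"}, ["P1", "P2", "P3", "P4", "P5"], neg_num=2))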
Generating the full dataset takes longer; feel free to let it run in the background...
gen_experiment_splits(
Path_Author2ReferencePapers,
OutFile_dir_DKN,
Path_PaperFeature,
item_ratio=1.0,
tag='full',
process_num=8
)