examples/07_tutorials/KDD2020-tutorial/step2_pretraining-embeddings.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
This notebook trains word embeddings and entity embeddings for DKN initializations.
import os
import pickle
import time

import numpy as np
from gensim.models import Word2Vec

from utils.general import *
from utils.task_helper import *
class MySentenceCollection:
    """Restartable sentence iterator: each __iter__ call reopens the file,
    yielding one tokenized (space-separated) sentence per line."""
    def __init__(self, filename):
        self.filename = filename
        self.rd = None

    def __iter__(self):
        self.rd = open(self.filename, 'r', encoding='utf-8', newline='\r\n')
        return self

    def __next__(self):
        line = self.rd.readline()
        if line:
            return list(line.strip('\r\n').split(' '))
        else:
            self.rd.close()
            raise StopIteration
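gensim needs a corpus it can scan more than once (once to build the vocabulary, then once per training epoch), which is why the class reopens the file in `__iter__` instead of being a plain generator. A quick sketch of that behavior, assuming the sentence file produced by the earlier data-preparation step is already in place:

# Each iter() call reopens the file, so iteration restarts from the beginning.
sentences = MySentenceCollection('data_folder/my/sentence.txt')
print(next(iter(sentences)))  # first tokenized sentence
print(next(iter(sentences)))  # same sentence again: the corpus is restartable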
InFile_dir = 'data_folder/my'
OutFile_dir = 'data_folder/my/pretrained-embeddings'
OutFile_dir_KG = 'data_folder/my/KG'
OutFile_dir_DKN = 'data_folder/my/DKN-training-folder'
Word2vec [4] can learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. We use the word2vec implementation in Gensim [5] to generate the word embeddings.
def train_word2vec(Path_sentences, OutFile_dir):
    OutFile_word2vec = os.path.join(OutFile_dir, r'word2vec.model')
    OutFile_word2vec_txt = os.path.join(OutFile_dir, r'word2vec.txt')
    create_dir(OutFile_dir)
    print('start to train word embedding...', end=' ')
    my_sentences = MySentenceCollection(Path_sentences)
    # Use more epochs for better accuracy. Note: gensim>=4 renames size -> vector_size and iter -> epochs.
    model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=8, iter=10)
    model.save(OutFile_word2vec)
    model.wv.save_word2vec_format(OutFile_word2vec_txt, binary=False)
    print('\tdone . ')
Path_sentences = os.path.join(InFile_dir, 'sentence.txt')
t0 = time.time()
train_word2vec(Path_sentences, OutFile_dir)
t1 = time.time()
print('time elapses: {0:.1f}s'.format(t1 - t0))
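As a quick sanity check, the saved model can be reloaded and queried for nearest neighbors. A small sketch; the query token 'news' is an arbitrary example and may not be in your corpus vocabulary:

model = Word2Vec.load(os.path.join(OutFile_dir, 'word2vec.model'))
print(model.wv.most_similar('news', topn=5))  # words closest to 'news' in the learned embedding space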
We leverage a graph embedding model to encode entities into embedding vectors.
We use an open-source implementation of TransE (https://github.com/thunlp/Fast-TransX) for generating knowledge graph embeddings:
!bash ./run_transE.sh
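TransE [2] represents each relation as a translation in the embedding space: for a true triple (h, r, t), training pushes h + r close to t. A minimal numpy sketch of that scoring idea (illustration only, not the Fast-TransX code invoked above):

def transe_score(h, r, t):
    # Dissimilarity of a triple: lower means (h, r, t) is more plausible.
    return np.linalg.norm(h + r - t, ord=1)  # L1 norm; TransE also admits L2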
DKN takes into account both an entity's own embedding and its context embedding, where the context embedding of an entity is the average of the embeddings of its immediate neighbors in the knowledge graph [1].
##### Build context embeddings
EMBEDDING_LENGTH = 32
entity_file = os.path.join(OutFile_dir_KG, 'entity2vec.vec')
context_file = os.path.join(OutFile_dir_KG, 'context2vec.vec')
kg_file = os.path.join(OutFile_dir_KG, 'train2id.txt')
gen_context_embedding(entity_file, context_file, kg_file, dim=EMBEDDING_LENGTH)
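For reference, the neighbor-averaging behind context embeddings [1] can be sketched as follows (illustration only; `gen_context_embedding` from `utils.task_helper` is the actual implementation used above):

from collections import defaultdict

def context_embeddings(entity_vecs, triples):
    # entity_vecs: (num_entities, dim) array of TransE entity embeddings.
    # triples: iterable of (head_id, tail_id, relation_id), as in Fast-TransX's train2id.txt.
    neighbors = defaultdict(set)
    for h, t, _ in triples:
        neighbors[h].add(t)
        neighbors[t].add(h)
    context = np.zeros_like(entity_vecs)
    for e, nbrs in neighbors.items():
        context[e] = entity_vecs[list(nbrs)].mean(axis=0)  # average over immediate neighbors
    return context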
load_np_from_txt(
os.path.join(OutFile_dir_KG, 'entity2vec.vec'),
os.path.join(OutFile_dir_DKN, 'entity_embedding.npy'),
)
load_np_from_txt(
os.path.join(OutFile_dir_KG, 'context2vec.vec'),
os.path.join(OutFile_dir_DKN, 'context_embedding.npy'),
)
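The two calls above convert the plain-text TransE output into .npy matrices for DKN. Roughly, the conversion amounts to the following sketch (an assumption: the actual `load_np_from_txt` in `utils.task_helper` may add a padding row or other handling):

def vec_txt_to_npy(txt_file, npy_file):
    # Parse whitespace-separated floats, one embedding per line, and save as .npy.
    mat = np.loadtxt(txt_file, dtype=np.float32)
    np.save(npy_file, mat)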
format_word_embeddings(
os.path.join(OutFile_dir, 'word2vec.txt'),
os.path.join(InFile_dir, 'word2idx.pkl'),
os.path.join(OutFile_dir_DKN, 'word_embedding.npy')
)
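Finally, the word vectors are aligned to the word-index mapping so that row i of word_embedding.npy holds the vector of the word with index i. A hedged sketch of that alignment (the helper name and the padding-row convention below are assumptions; `format_word_embeddings` in `utils.task_helper` is the actual implementation):

def build_word_embedding_matrix(w2v_txt, word2idx_pkl, out_npy, dim=32):
    with open(word2idx_pkl, 'rb') as f:
        word2idx = pickle.load(f)  # assumed dict: word -> integer index
    mat = np.zeros((len(word2idx) + 1, dim), dtype=np.float32)  # row 0 assumed reserved for padding
    with open(w2v_txt, 'r', encoding='utf-8') as f:
        next(f)  # skip the "vocab_size dim" header of the word2vec text format
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if word in word2idx:
                mat[word2idx[word]] = vec
    np.save(out_npy, mat)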
[1] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.
[2] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E
[3] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 3111-3119.
[5] Gensim Word2vec embeddings: https://radimrehurek.com/gensim/models/word2vec.html
[6] Fangzhao Wu, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html