examples/07_tutorials/KDD2020-tutorial/step2_pretraining-embeddings.ipynb
<i>Copyright (c) Recommenders contributors.</i>
<i>Licensed under the MIT License.</i>
This notebook trains word embeddings and entity embeddings for DKN initializations.
import os
import pickle
import time

import numpy as np
from gensim.models import Word2Vec

from utils.general import *
from utils.task_helper import *
class MySentenceCollection:
    """Restartable sentence iterator: each __iter__ call reopens the file,
    yielding one tokenized (space-separated) sentence per line."""
    def __init__(self, filename):
        self.filename = filename
        self.rd = None

    def __iter__(self):
        self.rd = open(self.filename, 'r', encoding='utf-8', newline='\r\n')
        return self

    def __next__(self):
        line = self.rd.readline()
        if line:
            return list(line.strip('\r\n').split(' '))
        else:
            self.rd.close()
            raise StopIteration
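gensim needs a corpus it can scan more than once (once to build the vocabulary, then once per training epoch), which is why the class reopens the file in `__iter__` instead of being a plain generator. A quick sketch of that behavior, assuming the sentence file produced by the earlier data-preparation step is already in place:

# Each iter() call reopens the file, so iteration restarts from the beginning.
sentences = MySentenceCollection('data_folder/my/sentence.txt')
print(next(iter(sentences)))  # first tokenized sentence
print(next(iter(sentences)))  # same sentence again: the corpus is restartable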
InFile_dir = 'data_folder/my'
OutFile_dir = 'data_folder/my/pretrained-embeddings'
OutFile_dir_KG = 'data_folder/my/KG'
OutFile_dir_DKN = 'data_folder/my/DKN-training-folder'
Word2vec [4] can learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. We use the word2vec implementation in Gensim [5] to generate the word embeddings.
def train_word2vec(Path_sentences, OutFile_dir):
    OutFile_word2vec = os.path.join(OutFile_dir, r'word2vec.model')
    OutFile_word2vec_txt = os.path.join(OutFile_dir, r'word2vec.txt')
    create_dir(OutFile_dir)
    print('start to train word embedding...', end=' ')
    my_sentences = MySentenceCollection(Path_sentences)
    # Use more epochs for better accuracy. Note: gensim>=4 renames size -> vector_size and iter -> epochs.
    model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=8, iter=10)
    model.save(OutFile_word2vec)
    model.wv.save_word2vec_format(OutFile_word2vec_txt, binary=False)
    print('\tdone . ')
Path_sentences = os.path.join(InFile_dir, 'sentence.txt')
t0 = time.time()
train_word2vec(Path_sentences, OutFile_dir)
t1 = time.time()
print('time elapses: {0:.1f}s'.format(t1 - t0))
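As a quick sanity check, the saved model can be reloaded and queried for nearest neighbors. A small sketch; the query token 'news' is an arbitrary example and may not be in your corpus vocabulary:

model = Word2Vec.load(os.path.join(OutFile_dir, 'word2vec.model'))
print(model.wv.most_similar('news', topn=5))  # words closest to 'news' in the learned embedding space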
We leverage a graph embedding model to encode entities into embedding vectors.
We use an open-source implementation of TransE (https://github.com/thunlp/Fast-TransX) for generating knowledge graph embeddings:
!bash ./run_transE.sh
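TransE [2] represents each relation as a translation in the embedding space: for a true triple (h, r, t), training pushes h + r close to t. A minimal numpy sketch of that scoring idea (illustration only, not the Fast-TransX code invoked above):

def transe_score(h, r, t):
    # Dissimilarity of a triple: lower means (h, r, t) is more plausible.
    return np.linalg.norm(h + r - t, ord=1)  # L1 norm; TransE also admits L2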
DKN takes into account both an entity's own embedding and its context embedding, where the context embedding of an entity is the average of the embeddings of its immediate neighbors in the knowledge graph [1].
##### Build context embeddings
EMBEDDING_LENGTH = 32
entity_file = os.path.join(OutFile_dir_KG, 'entity2vec.vec')
context_file = os.path.join(OutFile_dir_KG, 'context2vec.vec')
kg_file = os.path.join(OutFile_dir_KG, 'train2id.txt')
gen_context_embedding(entity_file, context_file, kg_file, dim=EMBEDDING_LENGTH)
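For reference, the neighbor-averaging behind context embeddings [1] can be sketched as follows (illustration only; `gen_context_embedding` from `utils.task_helper` is the actual implementation used above):

from collections import defaultdict

def context_embeddings(entity_vecs, triples):
    # entity_vecs: (num_entities, dim) array of TransE entity embeddings.
    # triples: iterable of (head_id, tail_id, relation_id), as in Fast-TransX's train2id.txt.
    neighbors = defaultdict(set)
    for h, t, _ in triples:
        neighbors[h].add(t)
        neighbors[t].add(h)
    context = np.zeros_like(entity_vecs)
    for e, nbrs in neighbors.items():
        context[e] = entity_vecs[list(nbrs)].mean(axis=0)  # average over immediate neighbors
    return context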
load_np_from_txt(
os.path.join(OutFile_dir_KG, 'entity2vec.vec'),
os.path.join(OutFile_dir_DKN, 'entity_embedding.npy'),
)
load_np_from_txt(
os.path.join(OutFile_dir_KG, 'context2vec.vec'),
os.path.join(OutFile_dir_DKN, 'context_embedding.npy'),
)
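The two calls above convert the plain-text TransE output into .npy matrices for DKN. Roughly, the conversion amounts to the following sketch (an assumption: the actual `load_np_from_txt` in `utils.task_helper` may add a padding row or other handling):

def vec_txt_to_npy(txt_file, npy_file):
    # Parse whitespace-separated floats, one embedding per line, and save as .npy.
    mat = np.loadtxt(txt_file, dtype=np.float32)
    np.save(npy_file, mat)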
format_word_embeddings(
os.path.join(OutFile_dir, 'word2vec.txt'),
os.path.join(InFile_dir, 'word2idx.pkl'),
os.path.join(OutFile_dir_DKN, 'word_embedding.npy')
)
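Finally, the word vectors are aligned to the word-index mapping so that row i of word_embedding.npy holds the vector of the word with index i. A hedged sketch of that alignment (the helper name and the padding-row convention below are assumptions; `format_word_embeddings` in `utils.task_helper` is the actual implementation):

def build_word_embedding_matrix(w2v_txt, word2idx_pkl, out_npy, dim=32):
    with open(word2idx_pkl, 'rb') as f:
        word2idx = pickle.load(f)  # assumed dict: word -> integer index
    mat = np.zeros((len(word2idx) + 1, dim), dtype=np.float32)  # row 0 assumed reserved for padding
    with open(w2v_txt, 'r', encoding='utf-8') as f:
        next(f)  # skip the "vocab_size dim" header of the word2vec text format
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if word in word2idx:
                mat[word2idx[word]] = vec
    np.save(out_npy, mat)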
[1] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.
[2] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E
[3] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., Red Hook, NY, USA, 3111-3119.
[5] Gensim Word2vec embeddings: https://radimrehurek.com/gensim/models/word2vec.html
[6] Fangzhao Wu, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html