Add New Vocab

app/src/main/cpp/python/add_new_vocab.ipynb

You can add new special tokens to a pre-trained SentencePiece model.

Run this code from google/sentencepiece/python/src/sentencepiece, so that sentencepiece_model_pb2 can be imported.

Load pre-trained sentencepiece model

A pre-trained model file is required; the code below reads it from old.model.

python
import sentencepiece_model_pb2 as model

# Parse the serialized pre-trained model into a ModelProto message.
m = model.ModelProto()
with open("old.model", "rb") as f:
    m.ParseFromString(f.read())

Load the tokens you want to add

Prepare the list of new tokens you want to add, one token per line in special_tokens.txt.

python
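# Read one token per line, skipping blank lines (e.g. from a trailing newline).
with open("special_tokens.txt", "r", encoding="utf-8") as f:
    special_tokens = [line.strip() for line in f if line.strip()]
special_tokens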

Add new tokens to sentencepiece model

python
for token in special_tokens:
    # Create a new piece entry with score 0 and append it to the vocabulary.
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = 0
    m.pieces.append(new_token)
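
Appending to `m.pieces` does not check whether a piece already exists, so it may be worth filtering the token list first. The following is a minimal sketch, not part of the original notebook: the helper name `select_new_tokens` and the example tokens are hypothetical, and stripping surrounding whitespace is an assumption about how the token file is written.

```python
def select_new_tokens(existing_pieces, candidates):
    """Return candidates that are non-empty, not already present, and unique.

    Order of first appearance is preserved. The name and behavior of this
    helper are illustrative assumptions, not part of sentencepiece.
    """
    seen = set(existing_pieces)
    selected = []
    for token in candidates:
        token = token.strip()  # assumption: surrounding whitespace is not meaningful
        if token and token not in seen:
            seen.add(token)
            selected.append(token)
    return selected


# Hypothetical vocabulary and candidate tokens, for illustration only.
existing = ["<unk>", "<s>", "</s>", "\u2581hello"]
candidates = ["<lang_en>", "<s>", "", "<lang_en>", "<lang_fr>"]
print(select_new_tokens(existing, candidates))  # ['<lang_en>', '<lang_fr>']
```

With the model loaded as `m`, the existing pieces could be collected with `[p.piece for p in m.pieces]` and the filtered result used in place of `special_tokens` in the loop.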

Save new sentencepiece model

Write the updated model to new.model, then load that file in your NLP system.

python
with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())
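
After saving, one way to confirm that the tokens were added is to look each one up in the loaded model and see whether it resolves to the unknown id. The sketch below is an assumption-laden illustration rather than the notebook's own method: `missing_pieces` is a hypothetical helper, and `FakeProcessor` is a stand-in that only mimics the `unk_id()` and `piece_to_id()` methods of `sentencepiece.SentencePieceProcessor`, so the example runs without a model file.

```python
def missing_pieces(processor, tokens):
    """Return the tokens that the processor maps to its unknown id.

    `processor` is any object with `unk_id()` and `piece_to_id(piece)`.
    Note: a token that is itself the unknown piece would also be reported.
    """
    unk = processor.unk_id()
    return [t for t in tokens if processor.piece_to_id(t) == unk]


class FakeProcessor:
    """Illustrative stand-in for a loaded model (hypothetical, not sentencepiece)."""

    def __init__(self, vocab):
        self._vocab = vocab  # maps piece -> id; id 0 is reserved for unknown here

    def unk_id(self):
        return 0

    def piece_to_id(self, piece):
        return self._vocab.get(piece, 0)


# Hypothetical vocabulary: only <lang_en> was added successfully.
fake = FakeProcessor({"<unk>": 0, "<s>": 1, "</s>": 2, "<lang_en>": 3})
print(missing_pieces(fake, ["<lang_en>", "<lang_fr>"]))  # ['<lang_fr>']
```

With the real library installed, a processor created with `sentencepiece.SentencePieceProcessor(model_file="new.model")` could be passed in place of `FakeProcessor`; an empty result would suggest every token in the list is present in the new vocabulary.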