app/src/main/cpp/python/add_new_vocab.ipynb
A pre-trained SentencePiece model is required
import sentencepiece_model_pb2 as model
m = model.ModelProto()
# Read the serialized model and close the file handle when done
with open("old.model", "rb") as f:
    m.ParseFromString(f.read())
Prepare the list of new tokens you want to add
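# One token per line; skip empty lines so a trailing newline does not add an empty piece
with open("special_tokens.txt", "r", encoding="utf-8") as f:
    special_tokens = [line for line in f.read().splitlines() if line]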
special_tokens
for token in special_tokens:
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = 0
    m.pieces.append(new_token)
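The loop above appends every token unconditionally. If a token already exists in the model, or appears twice in the list, the vocabulary ends up with a duplicate piece, which may cause problems when the model is loaded. A minimal sketch of a filter that could be applied first is shown here; `select_new_pieces` is a hypothetical helper name, not part of SentencePiece.

```python
def select_new_pieces(existing_pieces, candidates):
    """Return candidates that are not yet in the vocabulary.

    Order is preserved, and empty strings and repeats are dropped.
    Hypothetical helper for illustration only.
    """
    seen = set(existing_pieces)
    fresh = []
    for token in candidates:
        if not token or token in seen:
            continue  # skip empty strings and anything already present
        seen.add(token)
        fresh.append(token)
    return fresh
```

In this notebook it could be used as `special_tokens = select_new_pieces((p.piece for p in m.pieces), special_tokens)` before running the loop.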
Save the new SentencePiece model so it can be loaded into your NLP system
with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())
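After saving, it can be worth confirming that the added tokens resolve to real ids. The sketch below assumes the `sentencepiece` Python package is installed and that `new.model` was written as above; `missing_tokens` and `load_and_check` are hypothetical helper names. It reports any token that still maps to the unknown id.

```python
def missing_tokens(processor, tokens):
    """Return the tokens that the processor maps to its unknown id.

    `processor` is expected to provide unk_id() and piece_to_id(), as
    sentencepiece.SentencePieceProcessor does. Note that the unknown
    piece itself (for example "<unk>") would also be reported here.
    """
    unk = processor.unk_id()
    return [t for t in tokens if processor.piece_to_id(t) == unk]


def load_and_check(model_path, tokens):
    """Load a saved model and list the tokens it does not know (sketch)."""
    # Imported lazily so the helper above stays usable without the package.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return missing_tokens(sp, tokens)
```

Calling `load_and_check("new.model", special_tokens)` should return an empty list if every new token was added successfully.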