bindings/python/examples/using_the_visualizer.ipynb
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O /tmp/bert-base-uncased-vocab.txt
from tokenizers import BertWordPieceTokenizer
from tokenizers.tools import EncodingVisualizer
EncodingVisualizer.unk_token_regex.search("aaa[udsnk]aaa")
text = """Mathias Bynens 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞': Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩) in a string, 💩💩💩💩💩💩💩💩💩💩💩💩 and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it."""
tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)
visualizer = EncodingVisualizer(tokenizer=tokenizer)
visualizer(text)
First we make some annotations with the Annotation class
from tokenizers.tools import Annotation
anno1 = Annotation(start=0, end=2, label="foo")
anno2 = Annotation(start=2, end=4, label="bar")
anno3 = Annotation(start=6, end=8, label="poo")
anno4 = Annotation(start=9, end=12, label="shoe")
annotations = [
anno1,
anno2,
anno3,
anno4,
Annotation(start=23, end=30, label="random tandem bandem sandem landem fandom"),
Annotation(start=63, end=70, label="foo"),
Annotation(start=80, end=95, label="bar"),
Annotation(start=120, end=128, label="bar"),
Annotation(start=152, end=155, label="poo"),
]
visualizer(text, annotations=annotations)
Every system has its own representation of annotations. That's why we can instantiate the EncodingVisualizer with a convertion function.
funnyAnnotations = [dict(startPlace=i, endPlace=i + 3, theTag=str(i)) for i in range(0, 20, 4)]
funnyAnnotations
def converter(funny):
return Annotation(start=funny["startPlace"], end=funny["endPlace"], label=funny["theTag"])
visualizer = EncodingVisualizer(tokenizer=tokenizer, default_to_notebook=True, annotation_converter=converter)
visualizer(text, annotations=funnyAnnotations)
!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json" -O /tmp/roberta-base-vocab.json
!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt" -O /tmp/roberta-base-merges.txt
from tokenizers import ByteLevelBPETokenizer
roberta_tokenizer = ByteLevelBPETokenizer.from_file("/tmp/roberta-base-vocab.json", "/tmp/roberta-base-merges.txt")
roberta_visualizer = EncodingVisualizer(tokenizer=roberta_tokenizer, default_to_notebook=True)
roberta_visualizer(text, annotations=annotations)