Using The Visualizer - Tokenizers

python

!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O /tmp/bert-base-uncased-vocab.txt

python

from tokenizers import BertWordPieceTokenizer
from tokenizers.tools import EncodingVisualizer

python

EncodingVisualizer.unk_token_regex.search("aaa[udsnk]aaa")

python

text = """Mathias Bynens 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞': Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩) in a string, 💩💩💩💩💩💩💩💩💩💩💩💩 and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it."""

python

tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)
visualizer = EncodingVisualizer(tokenizer=tokenizer)

Visualizing Tokens With No Annotations

python

visualizer(text)

Visualizing Tokens With Aligned Annotations

First we make some annotations with the Annotation class

python

from tokenizers.tools import Annotation

python

anno1 = Annotation(start=0, end=2, label="foo")
anno2 = Annotation(start=2, end=4, label="bar")
anno3 = Annotation(start=6, end=8, label="poo")
anno4 = Annotation(start=9, end=12, label="shoe")
annotations = [
    anno1,
    anno2,
    anno3,
    anno4,
    Annotation(start=23, end=30, label="random tandem bandem sandem landem fandom"),
    Annotation(start=63, end=70, label="foo"),
    Annotation(start=80, end=95, label="bar"),
    Annotation(start=120, end=128, label="bar"),
    Annotation(start=152, end=155, label="poo"),
]

python

visualizer(text, annotations=annotations)

Using A Custom Annotation Format

Every system has its own representation of annotations. That's why we can instantiate the EncodingVisualizer with a convertion function.

python

funnyAnnotations = [dict(startPlace=i, endPlace=i + 3, theTag=str(i)) for i in range(0, 20, 4)]
funnyAnnotations

python

def converter(funny):
    return Annotation(start=funny["startPlace"], end=funny["endPlace"], label=funny["theTag"])


visualizer = EncodingVisualizer(tokenizer=tokenizer, default_to_notebook=True, annotation_converter=converter)

python

visualizer(text, annotations=funnyAnnotations)

Trying with Roberta

python

!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json" -O /tmp/roberta-base-vocab.json
!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt" -O /tmp/roberta-base-merges.txt

python

from tokenizers import ByteLevelBPETokenizer

roberta_tokenizer = ByteLevelBPETokenizer.from_file("/tmp/roberta-base-vocab.json", "/tmp/roberta-base-merges.txt")
roberta_visualizer = EncodingVisualizer(tokenizer=roberta_tokenizer, default_to_notebook=True)
roberta_visualizer(text, annotations=annotations)