
tokenizer.py


from typing import Callable

from labml.configs import BaseConfigs, option


Tokenizer Configurations

class TokenizerConfigs(BaseConfigs):
    tokenizer: Callable = 'character'

    def __init__(self):
        super().__init__(_primary='tokenizer')
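A minimal usage sketch (assumed, not part of this file): the configurations are typically handed to labml's experiment.configs together with an override dictionary. The experiment name below is hypothetical, and 'character' is already the default option, so it could be omitted.

from labml import experiment

conf = TokenizerConfigs()
experiment.create(name='tokenizer_example')            # hypothetical experiment name
experiment.configs(conf, {'tokenizer': 'character'})   # 'character' is the default option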


Basic English tokenizer

We use a character-level tokenizer in this experiment. You can switch to the basic English tokenizer by setting

'tokenizer': 'basic_english'

in the configurations dictionary when starting the experiment.

@option(TokenizerConfigs.tokenizer)
def basic_english():
    from torchtext.data import get_tokenizer
    return get_tokenizer('basic_english')
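For reference, a rough sketch of what torchtext's basic_english tokenizer does: it lowercases the text and splits on whitespace and punctuation. The output shown is illustrative and may differ slightly between torchtext versions.

from torchtext.data import get_tokenizer

tokenizer = get_tokenizer('basic_english')
tokens = tokenizer("A Basic English Tokenizer!")
# tokens is roughly ['a', 'basic', 'english', 'tokenizer', '!']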


Character level tokenizer

def character_tokenizer(x: str):
    return list(x)
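For example, the character level tokenizer simply splits a string into its individual characters:

character_tokenizer('labml')  # ['l', 'a', 'b', 'm', 'l']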


Character level tokenizer configuration

@option(TokenizerConfigs.tokenizer)
def character():
    return character_tokenizer
