This is an annotated PyTorch experiment to train a Primer EZ transformer.
It is based on our vanilla transformer experiment; we reuse the same experiment setup and add the Primer EZ modifications.
from labml import experiment
from labml.configs import option
from labml_nn.transformers import TransformerConfigs
from labml_nn.transformers.basic.autoregressive_experiment import Configs
from labml_nn.transformers.configs import FeedForwardConfigs
from labml_nn.transformers.primer_ez import SquaredReLU
Add the option of squared ReLU to the configurable feed forward module.
@option(FeedForwardConfigs.activation, 'SquaredReLU')
def _squared_relu():
    return SquaredReLU()
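For reference, squared ReLU simply squares the output of a standard ReLU, y = max(0, x)^2. A minimal sketch of what the imported SquaredReLU computes (illustrative stand-in only; the experiment uses the module imported above):

import torch
import torch.nn as nn

class SquaredReLUSketch(nn.Module):
    """Illustrative stand-in for SquaredReLU: y = relu(x) ** 2."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) ** 2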
Add the option of Multi-DConv-Head Attention to the configurable transformer.
@option(TransformerConfigs.encoder_attn, 'MultiDConvHeadAttention')
def _d_conv_mha(c: TransformerConfigs):
    from labml_nn.transformers.primer_ez import MultiDConvHeadAttention
    return MultiDConvHeadAttention(c.n_heads, c.d_model, dropout_prob=c.dropout)
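Multi-DConv-Head Attention applies a 3-wide, causal, depth-wise convolution along the sequence dimension to the query, key, and value projections of each head. A minimal sketch of such a convolution (illustrative, not the imported implementation; it assumes inputs of shape [seq_len, batch_size, heads, d_k]):

import torch
import torch.nn as nn

class SpatialDepthWiseConvSketch(nn.Module):
    """Illustrative depth-wise convolution over the sequence dimension.

    Each of the d_k channels gets its own kernel (groups=d_k).
    """
    def __init__(self, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_k, d_k, kernel_size=kernel_size,
                              padding=kernel_size - 1, groups=d_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len, batch_size, heads, d_k = x.shape
        # Fold heads into the batch and put channels where Conv1d expects them
        x = x.permute(1, 2, 3, 0).reshape(batch_size * heads, d_k, seq_len)
        # Crop the trailing positions so position t only sees inputs <= t (causal)
        x = self.conv(x)[:, :, :-(self.kernel_size - 1)]
        return x.reshape(batch_size, heads, d_k, seq_len).permute(3, 0, 1, 2)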
Add the option of Multi Depth-wise Shared Conv Head Attention to the configurable transformer. This is a variation we tried.
@option(TransformerConfigs.encoder_attn, 'MultiDSharedConvHeadAttention')
def _d_shared_conv_mha(c: TransformerConfigs):
    from labml_nn.transformers.primer_ez.variations import MultiDSharedConvHeadAttention
    return MultiDSharedConvHeadAttention(c.n_heads, c.d_model, dropout_prob=c.dropout)
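Where the depth-wise convolution sketched above learns a separate kernel per channel, this variation shares the kernel across channels. A hypothetical sketch of that change (the reshape treats every channel as an independent 1-channel sequence):

import torch
import torch.nn as nn

class SharedDepthWiseConvSketch(nn.Module):
    """Hypothetical: a single 3-wide kernel shared by every channel."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # One 1-channel convolution; the same kernel filters all channels
        self.conv = nn.Conv1d(1, 1, kernel_size=kernel_size, padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, heads, d_k]
        seq_len, batch_size, heads, d_k = x.shape
        x = x.permute(1, 2, 3, 0).reshape(batch_size * heads * d_k, 1, seq_len)
        x = self.conv(x)[:, :, :-(self.kernel_size - 1)]  # crop to stay causal
        return x.reshape(batch_size, heads, d_k, seq_len).permute(3, 0, 1, 2)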
Add the option of Multi Depth-wise Per Head Conv Head Attention to the configurable transformer. This is a variation we tried.
@option(TransformerConfigs.encoder_attn, 'MultiDPHConvHeadAttention')
def _d_per_head_conv_mha(c: TransformerConfigs):
    from labml_nn.transformers.primer_ez.variations import MultiDPHConvHeadAttention
    return MultiDPHConvHeadAttention(c.n_heads, c.d_model, dropout_prob=c.dropout)
def main():
Create experiment
    experiment.create(name="primer_ez")
Create configs
    conf = Configs()
Override configurations
    experiment.configs(conf, {
Use character level tokenizer
        'tokenizer': 'character',
Prompt separator is blank
        'prompt_separator': '',
Starting prompt for sampling
        'prompt': 'It is ',
Use Tiny Shakespeare dataset
        'text': 'tiny_shakespeare',
Use a context size of 256
        'seq_len': 256,
Train for 128 epochs
        'epochs': 128,
Batch size 32
        'batch_size': 32,
Switch between training and validation 10 times per epoch
        'inner_iterations': 10,
Model size
        'd_model': 512,
        'transformer.ffn.d_ff': 2048,
Use Adam optimizer
        'optimizer.optimizer': 'Adam',
        'optimizer.learning_rate': 2.5e-4,
Use squared ReLU activation in the feed forward network. Replace this with 'ReLU' for standard ReLU.
        'transformer.ffn.activation': 'SquaredReLU',
Use Multi-DConv-Head Attention for encoder attention. Replace this with 'mha' for the original multi-head attention.
        'transformer.encoder_attn': 'MultiDConvHeadAttention',
    })
Set models for saving and loading
    experiment.add_pytorch_models({'model': conf.model})
Start the experiment
    with experiment.start():
Run training
        conf.run()
if __name__ == '__main__':
    main()
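As the comments above note, the two Primer EZ changes can be reverted for a baseline run by switching the activation back to 'ReLU' and the encoder attention back to 'mha'. A sketch of the overrides (only these two keys change; all other settings stay as above):

    experiment.configs(conf, {
        # ... same dataset, model size, and training settings as above ...
        'transformer.ffn.activation': 'ReLU',
        'transformer.encoder_attn': 'mha',
    })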