Here we train a transformer that uses Fuzzy Tiling Activation in the Feed-Forward Network. We use it as a language model and train it on the Tiny Shakespeare dataset for demonstration.
However, this is probably not the ideal task for FTA, and we believe FTA is more suitable for modeling data with continuous variables.
import copy

import torch
import torch.nn as nn

from labml import experiment
from labml.configs import option
from labml_nn.activations.fta import FTA
from labml_nn.experiments.nlp_autoregression import NLPAutoRegressionConfigs
from labml_nn.transformers import MultiHeadAttention, TransformerLayer
from labml_nn.transformers.utils import subsequent_mask

class FeedForwardFTA(nn.Module):

d_model is the number of features in a token embedding
d_ff is the number of features in the hidden layer of the FFN
activation is the FTA activation module
dropout is the dropout probability for the hidden layer

    def __init__(self, d_model: int, d_ff: int,
                 activation: FTA,
                 dropout: float = 0.1):
        super().__init__()
Layer one parameterized by weight W1 and bias b1
        self.layer1 = nn.Linear(d_model, d_ff)
Layer two parameterized by weight W2 and bias b2
        self.layer2 = nn.Linear(d_ff * activation.expansion_factor, d_model)
Hidden layer dropout
        self.dropout = nn.Dropout(dropout)
Activation function f
        self.activation = activation

    def forward(self, x: torch.Tensor):
$f(x W_1 + b_1)$

        x = self.activation(self.layer1(x))
Apply dropout
        x = self.dropout(x)

        return self.layer2(x)
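As a rough usage sketch (not part of the source; the sizes here are hypothetical), the FFN keeps the input and output feature counts equal while FTA widens the hidden layer by its expansion factor:

fta = FTA(-1., 1., 0.2, 0.05)
ffn = FeedForwardFTA(d_model=256, d_ff=256, activation=fta, dropout=0.1)
x = torch.randn(64, 16, 256)          # [seq_len, batch_size, d_model]
assert ffn(x).shape == (64, 16, 256)  # output has d_model features again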
This is an autoregressive transformer model that uses Feed-Forward Networks with [Fuzzy Tiling Activations](index.html).
class AutoregressiveTransformer(nn.Module):

n_tokens is the number of tokens in the vocabulary
d_model is the embedding size
n_layers is the number of transformer layers
layer is the layer. We use n_layers copies of this for the transformer.

    def __init__(self, n_tokens: int, d_model: int, n_layers: int, layer: TransformerLayer):
        super().__init__()
Transformer with n_layers layers
        self.transformer_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
Token embedding layer
        self.emb = nn.Embedding(n_tokens, d_model)
Readout layer
        self.readout = nn.Linear(d_model, n_tokens)
The mask will be initialized on the first call
        self.mask = None

x are the input tokens of shape [seq_len, batch_size]

    def forward(self, x: torch.Tensor):
Create auto-regressive mask
        if self.mask is None or self.mask.size(0) != len(x):
Subsequent mask, which masks out tokens from attending to future tokens
            self.mask = subsequent_mask(len(x)).to(x.device)
Get the token embeddings
        x = self.emb(x)
Transformer encoder
        for layer in self.transformer_layers:
            x = layer(x=x, mask=self.mask)
Get logits
        x = self.readout(x)
Return results
        return x, None
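For illustration only (hypothetical sizes, mirroring how the model is assembled in the configuration code below), a forward pass maps token ids to per-position logits; the second return value is just a state placeholder:

fta = FTA(-1., 1., 0.2, 0.05)
layer = TransformerLayer(d_model=256,
                         feed_forward=FeedForwardFTA(d_model=256, d_ff=256, activation=fta, dropout=0.1),
                         self_attn=MultiHeadAttention(4, 256, dropout_prob=0.0),
                         dropout_prob=0.0)
model = AutoregressiveTransformer(n_tokens=65, d_model=256, n_layers=4, layer=layer)
tokens = torch.randint(0, 65, (128, 2))  # [seq_len, batch_size] of token ids
logits, _ = model(tokens)                # logits: [seq_len, batch_size, n_tokens]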
This inherits from NLPAutoRegressionConfigs
class Configs(NLPAutoRegressionConfigs):
Model
    model: AutoregressiveTransformer
Number of layers
    n_layers: int = 4
α and β for DeepNorm
    deep_norm_alpha: float
    deep_norm_beta: float
Number of heads in the attention
    n_heads: int = 4
Embedding size
    d_model: int = 256
Size of each attention head
    d_k: int = 16
Feed forward layer size
    d_ff: int = 256
FTA
    fta_lower_limit: float = -1.
    fta_upper_limit: float = +1.
    fta_delta: float = 0.2
    fta_eta: float = 0.05
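With these defaults the tiling spans [-1, +1] in steps of 0.2, i.e. (1 - (-1)) / 0.2 = 10 tiles, so each hidden feature is expanded ten-fold and layer2 of the FFN sees d_ff × 10 = 2560 input features. A small sanity-check sketch (assuming expansion_factor equals the number of tiles):

fta = FTA(-1., 1., 0.2, 0.05)
assert fta.expansion_factor == 10   # (upper_limit - lower_limit) / delta tiles per feature
assert FeedForwardFTA(d_model=256, d_ff=256, activation=fta).layer2.in_features == 2560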
@option(Configs.model)
def _model(c: Configs):
Create FTA activation module
    fta = FTA(c.fta_lower_limit, c.fta_upper_limit, c.fta_delta, c.fta_eta)
Create the transformer. We re-use TransformerLayer and MultiHeadAttention implementations.
    m = AutoregressiveTransformer(c.n_tokens, c.d_model, c.n_layers,
                                  TransformerLayer(d_model=c.d_model,
                                                   feed_forward=FeedForwardFTA(d_model=c.d_model,
                                                                               d_ff=c.d_ff,
                                                                               activation=fta,
                                                                               dropout=0.1),
                                                   self_attn=MultiHeadAttention(c.n_heads, c.d_model,
                                                                                dropout_prob=0.0),
                                                   dropout_prob=0.0))
Move to the device
    return m.to(c.device)

def main():
Create experiment
    experiment.create(name="fta", writers={'screen', 'labml'})
Create configs
    conf = Configs()
Override configurations
    experiment.configs(conf, {
Use character level tokenizer
        'tokenizer': 'character',
Prompt separator is blank
        'prompt_separator': '',
Starting prompt for sampling
        'prompt': 'It is ',
Use Tiny Shakespeare dataset
        'text': 'tiny_shakespeare',
Use a context size of 256
        'seq_len': 256,
Train for 32 epochs
        'epochs': 32,
Batch size 16
        'batch_size': 16,
Switch between training and validation 10 times per epoch
        'inner_iterations': 10,
Adam optimizer with no warmup
        'optimizer.optimizer': 'Adam',
        'optimizer.learning_rate': 3e-4,
    })
Set model(s) for saving and loading
    experiment.add_pytorch_models({'model': conf.model})
Start the experiment
    with experiment.start():
Run training
        conf.run()
if __name__ == '__main__':
    main()