Fuzzy Tiling Activation Experiment

Here we train a transformer that uses Fuzzy Tiling Activation in the Feed-Forward Network. We use it for a language model and train it on the Tiny Shakespeare dataset for demonstration.

However, this is probably not the ideal task for FTA, and we believe FTA is more suitable for modeling data with continuous variables.
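For reference, the sketch below illustrates what FTA does to a single scalar. This is an illustrative approximation based on the FTA definition, not the labml_nn implementation; the function name fta_sketch and the constants are ours. Each input value is mapped to a vector of fuzzy bin indicators over [lower_limit, upper_limit) with tile width delta and fuzziness eta, giving sparse, mostly binary activations.

import torch

lower_limit, upper_limit, delta, eta = -1., 1., 0.2, 0.05
# Tiling vector: one entry per tile; 10 tiles for these values
c = torch.arange(lower_limit, upper_limit, delta)

def fta_sketch(z: torch.Tensor) -> torch.Tensor:
    # Distance of z from each tile (zero inside the tile, positive outside)
    d = torch.clip(c - z[..., None], min=0.) + torch.clip(z[..., None] - delta - c, min=0.)
    # Fuzzy indicator: linear within eta of a tile boundary, saturating at 1 beyond it
    i_eta = d * (d <= eta).float() + (d > eta).float()
    return 1. - i_eta

print(fta_sketch(torch.tensor([0.33])))  # length-10 vector, non-zero only near the tile containing 0.33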

import copy

import torch
import torch.nn as nn

from labml import experiment
from labml.configs import option
from labml_nn.activations.fta import FTA
from labml_nn.experiments.nlp_autoregression import NLPAutoRegressionConfigs
from labml_nn.transformers import MultiHeadAttention, TransformerLayer
from labml_nn.transformers.utils import subsequent_mask

FFN module with FTA activation

class FeedForwardFTA(nn.Module):

  • d_model is the number of features in a token embedding
  • d_ff is the number of features in the hidden layer of the FFN
  • activation is the FTA activation module
  • dropout is the dropout probability for the hidden layer

def __init__(self, d_model: int, d_ff: int,
             activation: FTA,
             dropout: float = 0.1):

super().__init__()

Layer one, parameterized by weight $W_1$ and bias $b_1$

self.layer1 = nn.Linear(d_model, d_ff)

Layer two, parameterized by weight $W_2$ and bias $b_2$. Since FTA expands each feature by activation.expansion_factor, this layer takes d_ff * expansion_factor input features.

self.layer2 = nn.Linear(d_ff * activation.expansion_factor, d_model)

Hidden layer dropout

self.dropout = nn.Dropout(dropout)

Activation function $f$

self.activation = activation

def forward(self, x: torch.Tensor):

$f(x W_1 + b_1)$

x = self.activation(self.layer1(x))

Apply dropout

x = self.dropout(x)

return self.layer2(x)
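A quick way to see the effect of expansion_factor is to run the module on a dummy batch. This is a minimal sketch; the tensor sizes are arbitrary, and the assertions state our assumptions about the FTA interface rather than guarantees.

fta = FTA(-1., +1., 0.2, 0.05)                     # same hyperparameters as the Configs below
ffn = FeedForwardFTA(d_model=256, d_ff=256, activation=fta, dropout=0.1)

x = torch.randn(32, 16, 256)                       # dummy [seq_len, batch_size, d_model] input
h = fta(ffn.layer1(x))                             # FTA expands the hidden features
assert h.shape[-1] == 256 * fta.expansion_factor   # assumed behaviour of expansion_factor
assert ffn(x).shape == x.shape                     # the FFN projects back to d_model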

Auto-Regressive model

This is an autoregressive transformer model that uses Feed-Forward Networks with Fuzzy Tiling Activations.

class AutoregressiveTransformer(nn.Module):

  • n_tokens is the number of tokens in the vocabulary
  • d_model is the embedding size
  • n_layers is the number of transformer layers
  • layer is the transformer layer. We use n_layers copies of this for the transformer.

def __init__(self, n_tokens: int, d_model: int, n_layers: int, layer: TransformerLayer):

super().__init__()

Transformer with n_layers layers

self.transformer_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])

Token embedding layer

self.emb = nn.Embedding(n_tokens, d_model)

Readout layer

self.readout = nn.Linear(d_model, n_tokens)

The mask will be initialized on the first call

self.mask = None

  • x are the input tokens of shape [seq_len, batch_size]

def forward(self, x: torch.Tensor):

Create the auto-regressive mask if it hasn't been created yet or if the sequence length has changed

if self.mask is None or self.mask.size(0) != len(x):

Subsequent mask, which masks out tokens from attending to future tokens (see the small illustration after this class)

self.mask = subsequent_mask(len(x)).to(x.device)

Get the token embeddings

x = self.emb(x)

Transformer encoder

for layer in self.transformer_layers:
    x = layer(x=x, mask=self.mask)

Get logits

x = self.readout(x)

Return results

return x, None
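The mask created in forward above follows the usual causal pattern. The small sketch below illustrates it with torch.tril; the actual subsequent_mask helper is assumed to produce an equivalent lower-triangular pattern, possibly with an extra broadcast dimension.

seq_len = 4
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal)
# Row i is True only for columns j <= i, so token i cannot attend to future tokens.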

Configurations

This inherits from NLPAutoRegressionConfigs.

class Configs(NLPAutoRegressionConfigs):

Model

model: AutoregressiveTransformer

Number of layers

n_layers: int = 4

α and β for DeepNorm

deep_norm_alpha: float
deep_norm_beta: float

Number of heads in the attention

n_heads: int = 4

Embedding size

d_model: int = 256

Size of each attention head

d_k: int = 16

Feed forward layer size

d_ff: int = 256

FTA hyperparameters

fta_lower_limit: float = -1.
fta_upper_limit: float = +1.
fta_delta: float = 0.2
fta_eta: float = 0.05

Initialize the model

@option(Configs.model)
def _model(c: Configs):

Create FTA activation module

fta = FTA(c.fta_lower_limit, c.fta_upper_limit, c.fta_delta, c.fta_eta)

Create the transformer. We re-use TransformerLayer and MultiHeadAttention implementations.

m = AutoregressiveTransformer(c.n_tokens, c.d_model, c.n_layers,
                              TransformerLayer(d_model=c.d_model,
                                               feed_forward=FeedForwardFTA(d_model=c.d_model,
                                                                           d_ff=c.d_ff,
                                                                           activation=fta,
                                                                           dropout=0.1),
                                               self_attn=MultiHeadAttention(c.n_heads, c.d_model,
                                                                            dropout_prob=0.0),
                                               dropout_prob=0.0))

Move to the device

return m.to(c.device)

Create and run the experiment

def main():

Create experiment

experiment.create(name="fta", writers={'screen', 'labml'})

Create configs

conf = Configs()

Override configurations

experiment.configs(conf, {

Use character level tokenizer

'tokenizer': 'character',

Prompt separator is blank

'prompt_separator': '',

Starting prompt for sampling

'prompt': 'It is ',

Use Tiny Shakespeare dataset

'text': 'tiny_shakespeare',

Use a context size of 256

'seq_len': 256,

Train for 32 epochs

'epochs': 32,

Batch size of 16

'batch_size': 16,

Switch between training and validation 10 times per epoch

'inner_iterations': 10,

Adam optimizer with no warmup

'optimizer.optimizer': 'Adam',
'optimizer.learning_rate': 3e-4,
})

Set model(s) for saving and loading

experiment.add_pytorch_models({'model': conf.model})

Start the experiment

with experiment.start():

Run training

conf.run()

if __name__ == '__main__':
    main()
