Here we train a transformer that uses Fuzzy Tiling Activation in the Feed-Forward Network. We use it as a language model and train it on the Tiny Shakespeare dataset for demonstration.
However, this is probably not the ideal task for FTA, and we believe FTA is more suitable for modeling data with continuous variables.
import copy

import torch
import torch.nn as nn

from labml import experiment
from labml.configs import option
from labml_nn.activations.fta import FTA
from labml_nn.experiments.nlp_autoregression import NLPAutoRegressionConfigs
from labml_nn.transformers import MultiHeadAttention, TransformerLayer
from labml_nn.transformers.utils import subsequent_mask

class FeedForwardFTA(nn.Module):

d_model is the number of features in a token embedding
d_ff is the number of features in the hidden layer of the FFN
activation is the FTA activation module
dropout is the dropout probability for the hidden layer

    def __init__(self, d_model: int, d_ff: int,
                 activation: FTA,
                 dropout: float = 0.1):
        super().__init__()
Layer one parameterized by weight W1 and bias b1
        self.layer1 = nn.Linear(d_model, d_ff)
Layer two parameterized by weight W2 and bias b2
        self.layer2 = nn.Linear(d_ff * activation.expansion_factor, d_model)
Hidden layer dropout
        self.dropout = nn.Dropout(dropout)
Activation function f
        self.activation = activation

    def forward(self, x: torch.Tensor):
$f(x W_1 + b_1)$

        x = self.activation(self.layer1(x))
Apply dropout
        x = self.dropout(x)

        return self.layer2(x)
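As a rough usage sketch (not part of the source; the sizes here are hypothetical), the FFN keeps the input and output feature counts equal while FTA widens the hidden layer by its expansion factor:

fta = FTA(-1., 1., 0.2, 0.05)
ffn = FeedForwardFTA(d_model=256, d_ff=256, activation=fta, dropout=0.1)
x = torch.randn(64, 16, 256)          # [seq_len, batch_size, d_model]
assert ffn(x).shape == (64, 16, 256)  # output has d_model features again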
This is an autoregressive transformer model that uses Feed-Forward Networks with [Fuzzy Tiling Activations](index.html).
class AutoregressiveTransformer(nn.Module):

n_tokens is the number of tokens in the vocabulary
d_model is the embedding size
n_layers is the number of transformer layers
layer is the layer. We use n_layers copies of this for the transformer.

    def __init__(self, n_tokens: int, d_model: int, n_layers: int, layer: TransformerLayer):
        super().__init__()
Transformer with n_layers layers
        self.transformer_layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
Token embedding layer
        self.emb = nn.Embedding(n_tokens, d_model)
Readout layer
        self.readout = nn.Linear(d_model, n_tokens)
The mask will be initialized on the first call
        self.mask = None

x are the input tokens of shape [seq_len, batch_size]

    def forward(self, x: torch.Tensor):
Create auto-regressive mask
        if self.mask is None or self.mask.size(0) != len(x):
Subsequent mask, which masks out tokens from attending to future tokens
            self.mask = subsequent_mask(len(x)).to(x.device)
Get the token embeddings
        x = self.emb(x)
Transformer encoder
        for layer in self.transformer_layers:
            x = layer(x=x, mask=self.mask)
Get logits
        x = self.readout(x)
Return results
        return x, None
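For illustration only (hypothetical sizes, mirroring how the model is assembled in the configuration code below), a forward pass maps token ids to per-position logits; the second return value is just a state placeholder:

fta = FTA(-1., 1., 0.2, 0.05)
layer = TransformerLayer(d_model=256,
                         feed_forward=FeedForwardFTA(d_model=256, d_ff=256, activation=fta, dropout=0.1),
                         self_attn=MultiHeadAttention(4, 256, dropout_prob=0.0),
                         dropout_prob=0.0)
model = AutoregressiveTransformer(n_tokens=65, d_model=256, n_layers=4, layer=layer)
tokens = torch.randint(0, 65, (128, 2))  # [seq_len, batch_size] of token ids
logits, _ = model(tokens)                # logits: [seq_len, batch_size, n_tokens]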
This inherits from NLPAutoRegressionConfigs
class Configs(NLPAutoRegressionConfigs):
Model
    model: AutoregressiveTransformer
Number of layers
    n_layers: int = 4
α and β for DeepNorm
    deep_norm_alpha: float
    deep_norm_beta: float
Number of heads in the attention
    n_heads: int = 4
Embedding size
    d_model: int = 256
Size of each attention head
    d_k: int = 16
Feed forward layer size
    d_ff: int = 256
FTA
    fta_lower_limit: float = -1.
    fta_upper_limit: float = +1.
    fta_delta: float = 0.2
    fta_eta: float = 0.05
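With these defaults the tiling spans [-1, +1] in steps of 0.2, i.e. (1 - (-1)) / 0.2 = 10 tiles, so each hidden feature is expanded ten-fold and layer2 of the FFN sees d_ff × 10 = 2560 input features. A small sanity-check sketch (assuming expansion_factor equals the number of tiles):

fta = FTA(-1., 1., 0.2, 0.05)
assert fta.expansion_factor == 10   # (upper_limit - lower_limit) / delta tiles per feature
assert FeedForwardFTA(d_model=256, d_ff=256, activation=fta).layer2.in_features == 2560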
@option(Configs.model)
def _model(c: Configs):
Create FTA activation module
    fta = FTA(c.fta_lower_limit, c.fta_upper_limit, c.fta_delta, c.fta_eta)
Create the transformer. We re-use TransformerLayer and MultiHeadAttention implementations.
    m = AutoregressiveTransformer(c.n_tokens, c.d_model, c.n_layers,
                                  TransformerLayer(d_model=c.d_model,
                                                   feed_forward=FeedForwardFTA(d_model=c.d_model,
                                                                               d_ff=c.d_ff,
                                                                               activation=fta,
                                                                               dropout=0.1),
                                                   self_attn=MultiHeadAttention(c.n_heads, c.d_model,
                                                                                dropout_prob=0.0),
                                                   dropout_prob=0.0))
Move to the device
    return m.to(c.device)

def main():
Create experiment
    experiment.create(name="fta", writers={'screen', 'labml'})
Create configs
    conf = Configs()
Override configurations
    experiment.configs(conf, {
Use character level tokenizer
        'tokenizer': 'character',
Prompt separator is blank
        'prompt_separator': '',
Starting prompt for sampling
        'prompt': 'It is ',
Use Tiny Shakespeare dataset
        'text': 'tiny_shakespeare',
Use a context size of 256
        'seq_len': 256,
Train for 32 epochs
        'epochs': 32,
Batch size 16
        'batch_size': 16,
Switch between training and validation 10 times per epoch
        'inner_iterations': 10,
Adam optimizer with no warmup
        'optimizer.optimizer': 'Adam',
        'optimizer.learning_rate': 3e-4,
    })
Set model(s) for saving and loading
    experiment.add_pytorch_models({'model': conf.model})
Start the experiment
    with experiment.start():
Run training
        conf.run()
if __name__ == '__main__':
    main()