Position-wise Feed-Forward Network (FFN)
This is a PyTorch implementation of the position-wise feed-forward network used in transformers.
The FFN consists of two fully connected layers. The number of dimensions in the hidden layer, $d_{ff}$, is generally set to around four times that of the token embedding, $d_{model}$; for example, the original Transformer uses $d_{model} = 512$ and $d_{ff} = 2048$. So it is sometimes also called the expand-and-contract network.
There is an activation at the hidden layer, which is usually the ReLU (Rectified Linear Unit) activation, $\max(0, x)$.
That is, the FFN function is $$\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, x W_1 + b_1) W_2 + b_2$$ where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters.
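To make the formula concrete, here is a minimal sketch that computes it with plain tensor operations (the dimensions are the ones from the original Transformer; the random weights and inputs are illustrative only):

import torch

d_model, d_ff = 512, 2048
x = torch.randn(10, d_model)        # 10 token embeddings
W1 = torch.randn(d_model, d_ff)     # expand: d_model -> d_ff
b1 = torch.zeros(d_ff)
W2 = torch.randn(d_ff, d_model)     # contract: d_ff -> d_model
b2 = torch.zeros(d_model)

# FFN(x) = max(0, x W1 + b1) W2 + b2
out = torch.clamp(x @ W1 + b1, min=0) @ W2 + b2
assert out.shape == (10, d_model)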
Sometimes the GELU (Gaussian Error Linear Unit) activation is used instead of ReLU: $x \Phi(x)$, where $\Phi(x) = P(X \le x)$ for $X \sim \mathcal{N}(0, 1)$.
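As a quick sanity check (a sketch; torch.special.ndtr computes the standard normal CDF $\Phi$), PyTorch's exact GELU matches $x \Phi(x)$:

import torch
from torch import nn

x = torch.linspace(-3.0, 3.0, 7)
gelu = nn.GELU()                    # exact (erf-based) GELU
manual = x * torch.special.ndtr(x)  # x * Phi(x)
assert torch.allclose(gelu(x), manual, atol=1e-6)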
This is a generic implementation that supports different variants, including Gated Linear Units (GLU). We have also implemented experiments on these variants; a usage sketch follows the implementation below.
import torch
from torch import nn
class FeedForward(nn.Module):
* `d_model` is the number of features in a token embedding
* `d_ff` is the number of features in the hidden layer of the FFN
* `dropout` is the dropout probability for the hidden layer
* `activation` is the activation function applied to the hidden layer
* `is_gated` specifies whether the hidden layer is gated
* `bias1` specifies whether the first fully connected layer should have a learnable bias
* `bias2` specifies whether the second fully connected layer should have a learnable bias
* `bias_gate` specifies whether the fully connected layer for the gate should have a learnable bias

def __init__(self, d_model: int, d_ff: int,
             dropout: float = 0.1,
             activation=nn.ReLU(),
             is_gated: bool = False,
             bias1: bool = True,
             bias2: bool = True,
             bias_gate: bool = True):
super().__init__()
Layer one, parameterized by weight $W_1$ and bias $b_1$
self.layer1 = nn.Linear(d_model, d_ff, bias=bias1)
Layer two, parameterized by weight $W_2$ and bias $b_2$
self.layer2 = nn.Linear(d_ff, d_model, bias=bias2)
Hidden layer dropout
self.dropout = nn.Dropout(dropout)
Activation function $f$
self.activation = activation
Whether there is a gate
self.is_gated = is_gated
if is_gated:
If there is a gate, the linear layer that transforms the inputs to be multiplied by the gate, parameterized by weight $V$ and bias $c$
    self.linear_v = nn.Linear(d_model, d_ff, bias=bias_gate)
def forward(self, x: torch.Tensor):
$f(x W_1 + b_1)$
g = self.activation(self.layer1(x))
If gated, compute $f(x W_1 + b_1) \otimes (x V + c)$
if self.is_gated:
    x = g * self.linear_v(x)
Otherwise
else:
    x = g
Apply dropout
x = self.dropout(x)
$(f(x W_1 + b_1) \otimes (x V + c)) W_2 + b_2$ or $f(x W_1 + b_1) W_2 + b_2$, depending on whether it is gated
return self.layer2(x)
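As a usage sketch (the dimensions are illustrative, and the gated configuration follows the bias-free convention of the GLU-variants formulation), the module can act as the vanilla ReLU FFN or, with `is_gated=True` and a different activation, as a gated variant such as GEGLU:

d_model, d_ff = 512, 2048
x = torch.randn(2, 10, d_model)  # (batch, sequence, features)

# Vanilla FFN: max(0, x W1 + b1) W2 + b2
ffn = FeedForward(d_model, d_ff)

# GEGLU-style gated variant: GELU(x W1) multiplied elementwise by x V,
# typically used without biases
geglu = FeedForward(d_model, d_ff, activation=nn.GELU(),
                    is_gated=True, bias1=False, bias2=False, bias_gate=False)

assert ffn(x).shape == geglu(x).shape == x.shape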