Position-wise Feed-Forward Network (FFN)
This is a PyTorch implementation of the position-wise feed-forward network used in transformers.
The FFN consists of two fully connected layers. The number of dimensions in the hidden layer, $d_{ff}$, is generally set to around four times that of the token embedding, $d_{model}$; for example, the original Transformer uses $d_{model} = 512$ and $d_{ff} = 2048$. So it is sometimes also called the expand-and-contract network.
There is an activation at the hidden layer, which is usually the ReLU (Rectified Linear Unit) activation, $\max(0, x)$.
That is, the FFN function is $$\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, x W_1 + b_1) W_2 + b_2$$ where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters.
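To make the formula concrete, here is a minimal sketch that computes it with plain tensor operations (the dimensions are the ones from the original Transformer; the random weights and inputs are illustrative only):

import torch

d_model, d_ff = 512, 2048
x = torch.randn(10, d_model)        # 10 token embeddings
W1 = torch.randn(d_model, d_ff)     # expand: d_model -> d_ff
b1 = torch.zeros(d_ff)
W2 = torch.randn(d_ff, d_model)     # contract: d_ff -> d_model
b2 = torch.zeros(d_model)

# FFN(x) = max(0, x W1 + b1) W2 + b2
out = torch.clamp(x @ W1 + b1, min=0) @ W2 + b2
assert out.shape == (10, d_model)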
Sometimes the GELU (Gaussian Error Linear Unit) activation is used instead of ReLU: $x \Phi(x)$, where $\Phi(x) = P(X \le x)$ for $X \sim \mathcal{N}(0, 1)$.
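As a quick sanity check (a sketch; torch.special.ndtr computes the standard normal CDF $\Phi$), PyTorch's exact GELU matches $x \Phi(x)$:

import torch
from torch import nn

x = torch.linspace(-3.0, 3.0, 7)
gelu = nn.GELU()                    # exact (erf-based) GELU
manual = x * torch.special.ndtr(x)  # x * Phi(x)
assert torch.allclose(gelu(x), manual, atol=1e-6)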
This is a generic implementation that supports different variants, including Gated Linear Units (GLU). We have also implemented experiments on these variants; a usage sketch follows the implementation below.
import torch
from torch import nn
class FeedForward(nn.Module):
* `d_model` is the number of features in a token embedding
* `d_ff` is the number of features in the hidden layer of the FFN
* `dropout` is the dropout probability for the hidden layer
* `activation` is the activation function applied to the hidden layer
* `is_gated` specifies whether the hidden layer is gated
* `bias1` specifies whether the first fully connected layer should have a learnable bias
* `bias2` specifies whether the second fully connected layer should have a learnable bias
* `bias_gate` specifies whether the fully connected layer for the gate should have a learnable bias

def __init__(self, d_model: int, d_ff: int,
             dropout: float = 0.1,
             activation=nn.ReLU(),
             is_gated: bool = False,
             bias1: bool = True,
             bias2: bool = True,
             bias_gate: bool = True):
super().__init__()
Layer one, parameterized by weight $W_1$ and bias $b_1$
self.layer1 = nn.Linear(d_model, d_ff, bias=bias1)
Layer two, parameterized by weight $W_2$ and bias $b_2$
self.layer2 = nn.Linear(d_ff, d_model, bias=bias2)
Hidden layer dropout
self.dropout = nn.Dropout(dropout)
Activation function $f$
self.activation = activation
Whether there is a gate
self.is_gated = is_gated
if is_gated:
If there is a gate, the linear layer that transforms the inputs to be multiplied by the gate, parameterized by weight $V$ and bias $c$
    self.linear_v = nn.Linear(d_model, d_ff, bias=bias_gate)
def forward(self, x: torch.Tensor):
$f(x W_1 + b_1)$
g = self.activation(self.layer1(x))
If gated, compute $f(x W_1 + b_1) \otimes (x V + c)$
if self.is_gated:
    x = g * self.linear_v(x)
Otherwise
else:
    x = g
Apply dropout
x = self.dropout(x)
$(f(x W_1 + b_1) \otimes (x V + c)) W_2 + b_2$ or $f(x W_1 + b_1) W_2 + b_2$, depending on whether it is gated
return self.layer2(x)
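As a usage sketch (the dimensions are illustrative, and the gated configuration follows the bias-free convention of the GLU-variants formulation), the module can act as the vanilla ReLU FFN or, with `is_gated=True` and a different activation, as a gated variant such as GEGLU:

d_model, d_ff = 512, 2048
x = torch.randn(2, 10, d_model)  # (batch, sequence, features)

# Vanilla FFN: max(0, x W1 + b1) W2 + b2
ffn = FeedForward(d_model, d_ff)

# GEGLU-style gated variant: GELU(x W1) multiplied elementwise by x V,
# typically used without biases
geglu = FeedForward(d_model, d_ff, activation=nn.GELU(),
                    is_gated=True, bias1=False, bias2=False, bias_gate=False)

assert ffn(x).shape == geglu(x).shape == x.shape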