
#

Position-wise Feed-Forward Network (FFN)

This is a PyTorch implementation of the position-wise feedforward network used in transformers.

The FFN consists of two fully connected layers. The number of dimensions in the hidden layer, $d_{ff}$, is generally set to around four times that of the token embedding, $d_{model}$, so it is sometimes also called the expand-and-contract network.

There is an activation at the hidden layer, which is usually the ReLU (Rectified Linear Unit) activation, $\max(0, x)$.

That is, the FFN function is $\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, x W_1 + b_1) W_2 + b_2$, where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters.

Sometimes the GELU (Gaussian Error Linear Unit) activation is also used instead of ReLU: $x \Phi(x)$, where $\Phi(x) = P(X \le x)$ and $X \sim \mathcal{N}(0, 1)$.
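To make the formula concrete, here is a minimal sketch (separate from the module annotated below) that applies the FFN to a batch of embeddings with plain torch operations. The tensor shapes and the random weight initialization are illustrative assumptions, not values taken from the implementation.

import torch
import torch.nn.functional as F

d_model, d_ff = 512, 2048            # d_ff is typically about 4 * d_model
x = torch.randn(10, 32, d_model)     # (seq_len, batch_size, d_model)

W1, b1 = torch.randn(d_model, d_ff) / d_model ** 0.5, torch.zeros(d_ff)
W2, b2 = torch.randn(d_ff, d_model) / d_ff ** 0.5, torch.zeros(d_model)

# FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently
h = F.relu(x @ W1 + b1)              # expand: d_model -> d_ff
y = h @ W2 + b2                      # contract: d_ff -> d_model
# the GELU variant would use h = F.gelu(x @ W1 + b1) instead
assert y.shape == x.shape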

Gated Linear Units

This is a generic implementation that supports different variants, including Gated Linear Units (GLU). We have also implemented experiments on these variants.
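As a brief reminder (this summary of the commonly used gated variants is added here for context and is not part of the original page), the gated variants differ only in the activation $f$ applied to the first branch; with bias terms omitted:

$\text{GLU}(x) = \sigma(x W_1) \otimes (x V)$
$\text{ReGLU}(x) = \max(0, x W_1) \otimes (x V)$
$\text{GEGLU}(x) = \text{GELU}(x W_1) \otimes (x V)$
$\text{SwiGLU}(x) = \text{Swish}(x W_1) \otimes (x V)$

The output layer $W_2$ is then applied to this gated hidden representation, just as in the ungated case.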

import torch
from torch import nn

#

FFN module

class FeedForward(nn.Module):

#

  • d_model is the number of features in a token embedding
  • d_ff is the number of features in the hidden layer of the FFN
  • dropout is the dropout probability for the hidden layer
  • is_gated specifies whether the hidden layer is gated
  • bias1 specifies whether the first fully connected layer should have a learnable bias
  • bias2 specifies whether the second fully connected layer should have a learnable bias
  • bias_gate specifies whether the fully connected layer for the gate should have a learnable bias
def __init__(self, d_model: int, d_ff: int,
             dropout: float = 0.1,
             activation=nn.ReLU(),
             is_gated: bool = False,
             bias1: bool = True,
             bias2: bool = True,
             bias_gate: bool = True):

#

super().__init__()

#

Layer one, parameterized by weight $W_1$ and bias $b_1$

self.layer1 = nn.Linear(d_model, d_ff, bias=bias1)

#

Layer two, parameterized by weight $W_2$ and bias $b_2$

self.layer2 = nn.Linear(d_ff, d_model, bias=bias2)

#

Hidden layer dropout

self.dropout = nn.Dropout(dropout)

#

Activation function $f$

self.activation = activation

#

Whether there is a gate

self.is_gated = is_gated
if is_gated:

#

If there is a gate, the linear layer to transform inputs to be multiplied by the gate, parameterized by weight $V$ and bias $c$

self.linear_v = nn.Linear(d_model, d_ff, bias=bias_gate)

#

def forward(self, x: torch.Tensor):

#

$f(x W_1 + b_1)$

g = self.activation(self.layer1(x))

#

If gated, $f(x W_1 + b_1) \otimes (x V + c)$

if self.is_gated:
    x = g * self.linear_v(x)

#

Otherwise

else:
    x = g

#

Apply dropout

x = self.dropout(x)

#

$(f(x W_1 + b_1) \otimes (x V + c)) W_2 + b_2$ or $f(x W_1 + b_1) W_2 + b_2$ depending on whether it is gated

return self.layer2(x)
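A short usage sketch follows. The dimensions and the gated configuration below are illustrative assumptions (one plausible way to get a GELU-gated hidden layer with this class), not settings prescribed by the original page.

# Standard FFN: ReLU activation, no gate (the defaults)
ffn = FeedForward(d_model=512, d_ff=2048)

# A GELU-gated configuration: hidden = GELU(x W1 + b1) ⊗ (x V + c)
gated_ffn = FeedForward(d_model=512, d_ff=2048,
                        activation=nn.GELU(), is_gated=True)

x = torch.randn(10, 32, 512)   # (seq_len, batch_size, d_model)
assert ffn(x).shape == x.shape
assert gated_ffn(x).shape == x.shape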
