[View code on GitHub](https://github.com/labmlai/annotated_deep_learning_paper_implementations/tree/master/labml_nn/normalization/layer_norm/__init__.py)
This is a PyTorch implementation of Layer Normalization.
Layer normalization is a simpler normalization method that works in a wider range of settings. It transforms the inputs to have zero mean and unit variance across the features. Note that batch normalization computes the mean and variance of each feature across the samples in a batch, whereas layer normalization computes them for each sample across all of its features.
Layer normalization is generally used for NLP tasks.
We have used layer normalization in most of the transformer implementations.
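To make the difference in normalization axes concrete, here is a minimal sketch. The tensor sizes and the `1e-5` epsilon are illustrative assumptions, and the batch normalization part uses only the current batch statistics (no running averages, no learned scale or shift).

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)  # hypothetical batch: 4 samples, 8 features

# Layer normalization: statistics over the feature dimension, one mean/variance per sample
ln_mean = x.mean(dim=-1, keepdim=True)
ln_var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# Batch normalization (current-batch statistics only): one mean/variance per feature
bn_mean = x.mean(dim=0, keepdim=True)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

print(x_ln.mean(dim=-1))  # close to zero for every sample
print(x_bn.mean(dim=0))   # close to zero for every feature
```

The layer-normalized output does not change if other samples are added to or removed from the batch, which is one reason it suits variable-length sequence models.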
```python
from typing import Union, List

import torch
from torch import nn, Size
```
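Layer normalization $\text{LN}$ normalizes the input $X$ as follows:
When the input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings, where $B$ is the batch size and $C$ is the number of features, $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$:
$$\text{LN}(X) = \gamma \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{\operatorname{Var}}[X] + \epsilon}} + \beta$$
When the input $X \in \mathbb{R}^{L \times B \times C}$ is a batch of a sequence of embeddings, where $B$ is the batch size, $C$ is the number of channels, and $L$ is the length of the sequence, $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$:
$$\text{LN}(X) = \gamma \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{\operatorname{Var}}[X] + \epsilon}} + \beta$$
When the input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width, $\gamma \in \mathbb{R}^{C \times H \times W}$ and $\beta \in \mathbb{R}^{C \times H \times W}$. This is not a widely used scenario.
$$\text{LN}(X) = \gamma \frac{X - \underset{C, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{C, H, W}{\operatorname{Var}}[X] + \epsilon}} + \beta$$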
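The three input layouts differ only in which trailing dimensions the statistics are taken over. The sketch below illustrates this with PyTorch's built-in `torch.nn.functional.layer_norm` rather than the class defined on this page; the sizes are hypothetical and no gain or bias is applied.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, C, H, W = 2, 5, 4, 3, 3  # hypothetical sizes, chosen only for illustration

# (input, normalized_shape) for the three cases described above
cases = {
    'embeddings [B, C]': (torch.randn(B, C), (C,)),
    'sequence [L, B, C]': (torch.randn(L, B, C), (C,)),
    'images [B, C, H, W]': (torch.randn(B, C, H, W), (C, H, W)),
}

means = {}
for name, (x, normalized_shape) in cases.items():
    y = F.layer_norm(x, normalized_shape)  # eps defaults to 1e-5
    # Statistics are taken over the trailing dimensions covered by normalized_shape
    dims = tuple(range(-len(normalized_shape), 0))
    means[name] = y.mean(dim=dims)
    print(name, tuple(means[name].shape))
```

Each entry of `means` holds one value per normalized element (per sample, or per sequence position and sample), and every value should be close to zero.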
```python
class LayerNorm(nn.Module):
```
* `normalized_shape` $S$ is the shape of the elements (except the batch). The input should then be $X \in \mathbb{R}^{* \times S[0] \times S[1] \times ... \times S[n]}$
* `eps` is $\epsilon$, used in $\sqrt{\operatorname{Var}[X] + \epsilon}$ for numerical stability
* `elementwise_affine` is whether to scale and shift the normalized value

We've tried to use the same argument names as the PyTorch `LayerNorm` implementation.
```python
    def __init__(self, normalized_shape: Union[int, List[int], Size], *,
                 eps: float = 1e-5,
                 elementwise_affine: bool = True):
        super().__init__()
```
Convert normalized_shape to torch.Size
```python
        if isinstance(normalized_shape, int):
            normalized_shape = torch.Size([normalized_shape])
        elif isinstance(normalized_shape, list):
            normalized_shape = torch.Size(normalized_shape)
        assert isinstance(normalized_shape, torch.Size)

        self.normalized_shape = normalized_shape
        self.eps = eps
        self.elementwise_affine = elementwise_affine
```
Create parameters for γ and β for gain and bias
```python
        if self.elementwise_affine:
            self.gain = nn.Parameter(torch.ones(normalized_shape))
            self.bias = nn.Parameter(torch.zeros(normalized_shape))
```
`x` is a tensor of shape `[*, S[0], S[1], ..., S[n]]`. `*` could be any number of dimensions. For example, in an NLP task this will be `[seq_len, batch_size, features]`
```python
    def forward(self, x: torch.Tensor):
```
Sanity check to make sure the shapes match
```python
        assert self.normalized_shape == x.shape[-len(self.normalized_shape):]
```
The dimensions to calculate the mean and variance on
```python
        dims = [-(i + 1) for i in range(len(self.normalized_shape))]
```
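As a small illustration of the dimension selection above, the sketch below evaluates the same list comprehension for a hypothetical three-dimensional `normalized_shape` and shows the shape of the resulting statistics.

```python
import torch

normalized_shape = torch.Size([4, 3, 3])  # hypothetical [C, H, W]

# One negative index per normalized dimension, counted from the last dimension
dims = [-(i + 1) for i in range(len(normalized_shape))]
print(dims)

x = torch.randn(2, 4, 3, 3)  # hypothetical [B, C, H, W] input
mean = x.mean(dim=dims, keepdim=True)
print(mean.shape)  # reduced dimensions are kept with size 1, so the result broadcasts against x
```

Using negative indices means the same code works regardless of how many leading dimensions the input has.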
Calculate the mean over the normalized dimensions; i.e. the mean $\mathbb{E}[X]$ for each element
```python
        mean = x.mean(dim=dims, keepdim=True)
```
Calculate the mean of the squares over the same dimensions; i.e. $\mathbb{E}[X^2]$ for each element
```python
        mean_x2 = (x ** 2).mean(dim=dims, keepdim=True)
```
Variance of each element, $\operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
```python
        var = mean_x2 - mean ** 2
```
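The identity used above gives the biased (population) variance. The following sketch, with a hypothetical input, compares it against `torch.var` with `unbiased=False`; the two should agree up to floating-point error.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 16)  # hypothetical input: 3 elements with 16 features each

# Variance via E[X^2] - E[X]^2, as in the forward pass above
mean = x.mean(dim=-1, keepdim=True)
mean_x2 = (x ** 2).mean(dim=-1, keepdim=True)
var = mean_x2 - mean ** 2

# Reference: biased variance computed by PyTorch
reference = x.var(dim=-1, unbiased=False, keepdim=True)
max_diff = (var - reference).abs().max().item()
print(max_diff)
```

Note that this form can lose precision when the mean is large relative to the spread of the values, which is one more reason the $\epsilon$ term in the denominator is useful.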
Normalize $$\hat{X} = \frac{X - \mathbb{E}[X]}{\sqrt{\operatorname{Var}[X] + \epsilon}}$$
```python
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
```
Scale and shift $$\text{LN}(x) = \gamma \hat{X} + \beta$$
```python
        if self.elementwise_affine:
            x_norm = self.gain * x_norm + self.bias

        return x_norm
```
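As a sanity check of the steps in the forward pass, the sketch below repeats them in a standalone function and compares the result with PyTorch's `nn.LayerNorm`. The function name `layer_norm_sketch`, the tensor sizes, and the random gain and bias values are assumptions made for illustration; they are not part of the implementation on this page.

```python
import torch
from torch import nn


def layer_norm_sketch(x, normalized_shape, gain=None, bias=None, eps=1e-5):
    """Hypothetical standalone version of the forward pass described above."""
    dims = [-(i + 1) for i in range(len(normalized_shape))]
    mean = x.mean(dim=dims, keepdim=True)
    mean_x2 = (x ** 2).mean(dim=dims, keepdim=True)
    var = mean_x2 - mean ** 2
    x_norm = (x - mean) / torch.sqrt(var + eps)
    if gain is not None and bias is not None:
        x_norm = gain * x_norm + bias
    return x_norm


torch.manual_seed(0)
x = torch.randn(5, 2, 8)  # hypothetical [seq_len, batch_size, features]

reference = nn.LayerNorm(8)
with torch.no_grad():
    # Use non-trivial gain and bias so the affine step is exercised
    reference.weight.uniform_(0.5, 1.5)
    reference.bias.uniform_(-0.5, 0.5)

with torch.no_grad():
    out = layer_norm_sketch(x, torch.Size([8]), reference.weight, reference.bias,
                            eps=reference.eps)
    max_diff = (out - reference(x)).abs().max().item()
print(max_diff)
```

The maximum absolute difference should be at the level of floating-point rounding.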
Simple test
```python
def _test():
    from labml.logger import inspect

    x = torch.zeros([2, 3, 2, 4])
    inspect(x.shape)
    ln = LayerNorm(x.shape[2:])

    x = ln(x)
    inspect(x.shape)
    inspect(ln.gain.shape)


if __name__ == '__main__':
    _test()
```