Multi-Head Attention


Multi-head attention is an attention mechanism that runs several attention computations in parallel, each with its own learned projections of the queries, keys, and values. Each independent attention computation is called a "head." The outputs of all the heads are concatenated and linearly transformed to produce the final output. This allows the model to attend to different parts of the input sequence with different learned representations, capturing a richer set of relationships than a single attention head could.
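The split-attend-concatenate flow described above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name, the single-matrix projections `w_q`, `w_k`, `w_v`, `w_o`, and the random test data are all assumptions chosen for brevity (real implementations typically also support masking, batching, and separate encoder/decoder inputs).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (seq_len, d_model); each projection matrix: (d_model, d_model).
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split the model dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Each head runs scaled dot-product attention independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix them
    # with a final linear transformation.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # the output keeps the input shape: (4, 8)
```

Note that the per-head dimension is `d_model // num_heads`, so the total computation stays comparable to a single full-width attention while each head learns its own projection subspace.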

Visit the following resources to learn more: