docs/cpp/source/api/nn/transformer.md
Transformer layers use self-attention to process all positions of a sequence in parallel rather than one step at a time, which makes training efficient on parallel hardware. They are the foundation of modern NLP models (BERT, GPT) and are increasingly used in vision and other domains.
Key parameters:
- `d_model`: Dimension of the model (embedding dimension)
- `nhead`: Number of attention heads
- `num_encoder_layers` / `num_decoder_layers`: Number of stacked layers
- `dim_feedforward`: Dimension of the feedforward network
- `dropout`: Dropout rate for regularization

`torch::nn::Transformer` is the complete encoder-decoder transformer architecture.
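Example:

    #include <torch/torch.h>

    // Build an encoder-decoder transformer with 6 encoder and 6 decoder layers.
    auto transformer = torch::nn::Transformer(
        torch::nn::TransformerOptions()
            .d_model(512)
            .nhead(8)
            .num_encoder_layers(6)
            .num_decoder_layers(6)
            .dim_feedforward(2048)
            .dropout(0.1));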
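
The `d_model` and `nhead` values in the example are linked: multi-head attention splits the model dimension evenly across heads, so `d_model` must be divisible by `nhead`. A minimal plain-C++ sketch of that arithmetic follows. `head_dim` is a hypothetical helper written for illustration, not a libtorch function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of libtorch): number of features each
// attention head works on when d_model is split evenly across nhead heads.
inline std::int64_t head_dim(std::int64_t d_model, std::int64_t nhead) {
  // An uneven split is not meaningful, so treat it as a programming error.
  assert(nhead > 0 && d_model % nhead == 0);
  return d_model / nhead;
}
```

With the example's settings, `head_dim(512, 8)` gives 64 features per head.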
`torch::nn::TransformerEncoder` is a stack of encoder layers for processing source sequences.
`torch::nn::TransformerDecoder` is a stack of decoder layers for generating target sequences.
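
When a decoder generates a target sequence one position at a time, each position is normally prevented from attending to later positions. One common way to express this is an additive mask that holds 0 where attention is allowed and negative infinity where it is blocked. The sketch below builds such a mask in plain C++. `causal_mask` is a hypothetical helper written for illustration and is not the library's own mask routine.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: build an sz x sz additive attention mask.
// Entry (i, j) is 0 when target position i may attend to position j (j <= i)
// and -infinity when j lies in the future (j > i).
std::vector<std::vector<float>> causal_mask(std::size_t sz) {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  std::vector<std::vector<float>> mask(sz, std::vector<float>(sz, 0.0f));
  for (std::size_t i = 0; i < sz; ++i) {
    for (std::size_t j = i + 1; j < sz; ++j) {
      mask[i][j] = neg_inf;  // block attention to later positions
    }
  }
  return mask;
}
```

For `sz = 3`, the first row allows only position 0, while the last row allows all three positions.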
`torch::nn::TransformerEncoderLayer` is a single encoder layer with self-attention and a feedforward network.
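
To make the feedforward half of an encoder layer concrete, here is a plain-C++ sketch of a position-wise feedforward network applied to one position's feature vector: expand to `dim_feedforward`, apply an activation (ReLU is assumed here), and project back to `d_model`. The helper names `affine` and `feed_forward` and the nested-vector weight layout are assumptions for illustration. The sketch leaves out the dropout, residual connection, and layer normalization that a full layer also applies, and it is not how libtorch implements the layer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // row-major weights: W[output][input]

// Hypothetical helper: y = W x + b.
Vec affine(const Mat& W, const Vec& b, const Vec& x) {
  Vec y(b);  // start from the bias
  for (std::size_t i = 0; i < W.size(); ++i) {
    for (std::size_t j = 0; j < x.size(); ++j) {
      y[i] += W[i][j] * x[j];
    }
  }
  return y;
}

// Hypothetical helper: position-wise feedforward network.
// W1 is (dim_feedforward x d_model), W2 is (d_model x dim_feedforward).
Vec feed_forward(const Mat& W1, const Vec& b1,
                 const Mat& W2, const Vec& b2, const Vec& x) {
  Vec hidden = affine(W1, b1, x);                  // expand
  for (float& v : hidden) v = std::max(v, 0.0f);   // ReLU
  return affine(W2, b2, hidden);                   // project back
}
```

The same weights are applied independently to every position of the sequence, which is why the sketch takes a single vector rather than a whole sequence.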
`torch::nn::TransformerDecoderLayer` is a single decoder layer with self-attention, cross-attention, and a feedforward network.
`torch::nn::MultiheadAttention` implements scaled dot-product attention with multiple parallel heads.
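
As a rough illustration of what one head computes, the sketch below implements single-head scaled dot-product attention, softmax(Q Kᵀ / sqrt(d_k)) V, in plain C++ on nested vectors. The function name and the `Mat` alias are assumptions for illustration. The sketch omits masking, dropout, and the learned input and output projections, and it is not the library's implementation. A multi-head module runs this computation once per head on slices of width `d_model / nhead` and concatenates the results.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Hypothetical helper: single-head scaled dot-product attention.
// Q is (L x d_k), K is (S x d_k), V is (S x d_v); the result is (L x d_v).
// Assumes K and V are non-empty and have the same number of rows.
Mat scaled_dot_product_attention(const Mat& Q, const Mat& K, const Mat& V) {
  const std::size_t d_k = K.front().size();
  const std::size_t d_v = V.front().size();
  const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));
  Mat out(Q.size(), std::vector<float>(d_v, 0.0f));

  for (std::size_t i = 0; i < Q.size(); ++i) {
    // Scaled similarity of query i with every key.
    std::vector<float> w(K.size());
    for (std::size_t j = 0; j < K.size(); ++j) {
      float dot = 0.0f;
      for (std::size_t c = 0; c < d_k; ++c) dot += Q[i][c] * K[j][c];
      w[j] = dot * scale;
    }
    // Numerically stable softmax over the keys.
    const float max_score = *std::max_element(w.begin(), w.end());
    float sum = 0.0f;
    for (float& s : w) {
      s = std::exp(s - max_score);
      sum += s;
    }
    // Output row i is the attention-weighted average of the value rows.
    for (std::size_t j = 0; j < K.size(); ++j) {
      for (std::size_t c = 0; c < d_v; ++c) {
        out[i][c] += (w[j] / sum) * V[j][c];
      }
    }
  }
  return out;
}
```

A query that scores equally against all keys yields the plain average of the value rows, while a query strongly aligned with one key returns nearly that key's value row.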