# Sentence Transformer losses
All losses live in `sentence_transformers.losses`.
Losses are grouped by data shape. The #1 rule: pick a loss that matches your data, not the other way around.
| You have | Use |
|---|---|
| (anchor, positive) pairs | MultipleNegativesRankingLoss (or Cached variant for large batches) |
| (anchor, positive, negative) triplets | MultipleNegativesRankingLoss (it handles triplets natively) |
| (text1, text2, score) with score ∈ [-1, 1] or [0, 1] | CoSENTLoss (strongly recommended) |
| (text1, text2, label) with label ∈ {0, 1} | OnlineContrastiveLoss |
| (text, class_id) single column with integer class | BatchAllTripletLoss |
| (query, positive, negative, score_diff) | MarginMSELoss (distillation) |
| (text, teacher_embedding) | MSELoss (embedding distillation) |
| Multiple output dims from one training run | Wrap any of the above in MatryoshkaLoss |
| No labels at all, just sentences | DenoisingAutoEncoderLoss or ContrastiveTensionLossInBatchNegatives |
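Every loss below plugs into the same v3 trainer loop. A minimal sketch (the model name and dataset rows are illustrative, not prescriptive):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Column order maps positionally onto the loss's expected inputs.
train_dataset = Dataset.from_dict({
    "anchor": ["how do I reset my password?", "best pizza in brooklyn"],
    "positive": ["password reset instructions", "top-rated brooklyn pizzerias"],
})

loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```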
## Losses for (anchor, positive[, negative]) data

### MultipleNegativesRankingLoss (MNRL)
The default bi-encoder loss. Uses in-batch negatives: every other positive in the batch acts as a negative for the current anchor.
```python
loss = MultipleNegativesRankingLoss(model, scale=20.0)  # similarity_fct defaults to cos_sim
```
- Data: (anchor, positive) or (anchor, positive, negative). More columns = more explicit hard negatives per row.
- `scale=20.0` multiplies similarities by 20 (equivalent to a softmax temperature of 0.05). Tune it only if cosine similarities end up saturated.
- Set `batch_sampler=BatchSamplers.NO_DUPLICATES` on the training args. Otherwise duplicate anchors create false negatives.

### CachedMultipleNegativesRankingLoss
The same loss, but with gradient caching (GradCache): it forwards in mini-batches but computes the contrastive loss over the full batch. Use this when you want an effective batch size of 256+ but your GPU can only fit 32 forwards.
```python
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
```
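In config form, a sketch assuming `model` from above (the sizes are illustrative and follow the guidance below):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1024,           # the effective contrastive batch
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # still required, as with plain MNRL
    # gradient_checkpointing must stay off with Cached* losses (see Gotchas)
)
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)  # what fits per forward
```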
- Incompatible with `gradient_checkpointing=True` (see Gotchas).
- Set `mini_batch_size` to whatever `per_device_train_batch_size` would be if you couldn't use this. Then crank the actual `per_device_train_batch_size` up to what you want the effective batch to be (256+, 1024+).

### MultipleNegativesSymmetricRankingLoss
MNRL computed bidirectionally: it scores positives from both the (anchor -> positive) and (positive -> anchor) directions. Slightly better on retrieval-style tasks where the "anchor" and "positive" roles are soft (paraphrase, deduplication).
### CachedMultipleNegativesSymmetricRankingLoss
Cached variant of the above.
### GISTEmbedLoss
Like MNRL, but uses a guide model (a separate pretrained Sentence Transformer) to filter out false negatives before computing the contrastive loss. The guide model scores each potential negative; if it looks too similar to the positive, it's excluded.
```python
guide = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = GISTEmbedLoss(model, guide=guide)
```
### CachedGISTEmbedLoss
Cached + GIST.
### MegaBatchMarginLoss
In-batch margin-based triplet loss: for each anchor, find the hardest negative in the batch and apply a margin loss. An older pattern, usually outperformed by MNRL.
### TripletLoss
Classic triplet margin loss on explicit (anchor, positive, negative) rows. Uses a fixed margin and considers only the provided triplet; there is no in-batch mining.
```python
loss = TripletLoss(model, distance_metric=TripletDistanceMetric.EUCLIDEAN, triplet_margin=5)
```
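Quality here hinges on hard negatives (see Gotchas). Recent library versions ship a miner that can build such triplets; a sketch, assuming `sentence_transformers.util.mine_hard_negatives` is available in your version and that `train_dataset` has anchor/positive columns:

```python
from sentence_transformers.util import mine_hard_negatives

# Retrieves near-miss negatives for each (anchor, positive) row.
train_dataset = mine_hard_negatives(train_dataset, model, num_negatives=1)
```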
## Batch triplet losses: (text, class label) data
These mine triplets within the batch from samples sharing a label.
### BatchAllTripletLoss
For each anchor, form triplets from all positive/negative combinations in the batch. Maximum signal per batch.
```python
loss = BatchAllTripletLoss(model, margin=5)
```
- Data: a single text column plus an integer `label`. Needs multiple samples per label in each batch (set `batch_sampler=BatchSamplers.GROUP_BY_LABEL`).

### BatchHardTripletLoss
The same, but uses only the single hardest positive and hardest negative per anchor.
### BatchSemiHardTripletLoss
Semi-hard mining: uses negatives that are farther from the anchor than the positive, but still within the margin. Often more stable than fully-hard mining.
### BatchHardSoftMarginTripletLoss
Variant with a soft margin (softplus, `log(1 + exp(x))`) instead of a fixed-margin hinge.
When to use batch-triplet losses: classification-style datasets (labels are class IDs, not pair annotations). E.g. "train an embedder where samples from the same class are close."
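A minimal data-plus-sampler sketch for this family (the sentences and class IDs are made up):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

train_dataset = Dataset.from_dict({
    "sentence": ["great movie", "terrible film", "loved it", "awful acting"],
    "label": [1, 0, 1, 0],  # integer class IDs
})

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    batch_sampler=BatchSamplers.GROUP_BY_LABEL,  # keeps several samples per class in each batch
)
```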
## Losses for (text1, text2, score/label) pairs

### CoSENTLoss
The recommended regression loss for (text1, text2, score). Trains on pairwise ranking: for any two pairs (a, b) and (c, d) with score(a, b) > score(c, d), the model should score (a, b) higher. Much better than squared error.
```python
loss = CoSENTLoss(model, scale=20.0)
```
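Data rows look like this (sentences and scores invented for illustration):

```python
from datasets import Dataset

train_dataset = Dataset.from_dict({
    "sentence1": ["A plane is taking off.", "A man is playing a guitar."],
    "sentence2": ["An airplane is departing.", "A man is playing a piano."],
    "score": [0.95, 0.35],  # float similarity, here in [0, 1]
})
```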
- Data: (text1, text2, float_score). Labels can be in [0, 1] or [-1, 1].

### AnglELoss
Similar to CoSENT but uses angle-based optimization in complex space. Sometimes outperforms CoSENT on tasks with fine-grained similarity gradations. A strong alternative.
### CosineSimilarityLoss
Squared-error loss on cosine similarity: `mse(cos(text1, text2), label)`. Simpler than CoSENT, usually worse. Keep it for legacy setups / reproducibility.
### ContrastiveLoss
For (text1, text2, label) where label ∈ {0, 1}. Minimizes distance for positives; pushes negatives past a margin.
```python
loss = ContrastiveLoss(model, margin=0.5, distance_metric=SiameseDistanceMetric.COSINE_DISTANCE)
```
### OnlineContrastiveLoss
The same setup, but it ignores "easy" pairs (positives already close, negatives already far) and only optimizes the hard ones. Much more robust to label noise.
```python
loss = OnlineContrastiveLoss(model, margin=0.5)
```
Preferred over ContrastiveLoss for most practical labeled pair datasets.
### SoftmaxLoss
A classifier head on concatenated (u, v, |u - v|) embeddings, trained with cross-entropy. Useful when you have NLI-style multi-class labels (entailment / neutral / contradiction) and want a categorical loss. Historically important (it trained the first popular sentence embedding models) but generally outperformed by MNRL.
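Construction requires the embedding dimension and the label count up front. A sketch with three classes, as in NLI:

```python
from sentence_transformers.losses import SoftmaxLoss

loss = SoftmaxLoss(
    model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,  # entailment / neutral / contradiction
)
```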
## Distillation losses

### MSELoss
Regress the student's embedding to match a teacher's embedding.
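Teacher vectors are typically precomputed into the label column. A sketch, assuming a `train_dataset` with a `text` column; the teacher model name and the `text`/`label` column names are assumptions:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MSELoss

teacher = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Map each text to its teacher embedding; MSELoss reads it as the label.
train_dataset = train_dataset.map(
    lambda batch: {"label": teacher.encode(batch["text"])},
    batched=True,
)
loss = MSELoss(model)
```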
- Data: (text, teacher_embedding). The teacher embedding is a fixed vector per row.

### MarginMSELoss
For (query, positive, negative, score_diff): minimize `mse(student_score_diff, teacher_score_diff)`. The teacher is typically a cross-encoder that produced the score differences.
### DistillKLDivLoss
KL-divergence distillation: the student's softmax distribution over candidates should match the teacher's.
- Data: (query, passages[], teacher_scores[]).

See `../scripts/train_sentence_transformer_distillation_example.py` for the end-to-end pattern (its docstring covers Embedding MSE / Margin MSE / Listwise KL with full recipes).
## Wrapper losses
These don't have their own data shape: they wrap another loss and add a regularization objective.
### MatryoshkaLoss
Train once, deploy at any of several dimensions. Wraps any loss and computes it at multiple truncated dimensions, summing the results with per-dimension weights.
```python
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],  # relative weighting per dim
)
```
At inference, `SentenceTransformer(..., truncate_dim=128)` gives 128-dim output with ~95% of full quality.

### Matryoshka2dLoss
2D Matryoshka: reduce the dimension and the number of transformer layers in a single wrapper. Internally composes MatryoshkaLoss + AdaptiveLayerLoss, so you only need this one (don't wrap it in AdaptiveLayerLoss yourself). Deploy at any (dim, layers) pair at inference.
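Construction mirrors MatryoshkaLoss; a sketch with example dims:

```python
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

base_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```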
### AdaptiveLayerLoss
Wraps any loss and adds a term that trains each of the transformer's layers to be a valid exit point. Deploy with fewer layers at inference for faster encoding.
```python
from sentence_transformers.losses import AdaptiveLayerLoss

loss = AdaptiveLayerLoss(
    model,
    base_loss,
    n_layers_per_step=1,
    last_layer_weight=1.0,
    prior_layers_weight=1.0,
)
```
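To actually deploy with fewer layers, truncate the backbone after loading. A sketch that assumes a BERT-style HF backbone whose layers live at `encoder.layer` (the model path is hypothetical):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/adaptive-layer-model")
# Keep only the first 4 transformer layers (slicing a ModuleList returns a ModuleList).
model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:4]
```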
### GlobalOrthogonalRegularizationLoss (GOR)
A stand-alone regularizer (not a wrapper, despite living in this section). Penalizes embedding pairs whose dot product deviates from orthogonality, encouraging the model to spread embeddings across the full vector space. Use it alongside a primary contrastive loss by summing the two outputs in your own training step; it can help with downstream retrieval diversity.
## Unsupervised losses

### DenoisingAutoEncoderLoss (TSDAE)
A sentence-level denoising autoencoder: corrupt a sentence (drop tokens), then force the model to reconstruct it. Pretraining-style; useful for domain adaptation when you have unlabeled in-domain sentences.
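A minimal construction sketch (tying the encoder and decoder weights avoids a separate decoder checkpoint; preparing the data into corrupted/original pairs is not shown):

```python
from sentence_transformers.losses import DenoisingAutoEncoderLoss

loss = DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
```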
### ContrastiveTensionLoss
Unsupervised contrastive learning: two copies of the model encode the same sentence and should agree. Purely self-supervised.
### ContrastiveTensionLossInBatchNegatives
CT with in-batch negatives. Stronger than vanilla CT.
## Gotchas
- MultipleNegativesRankingLoss without `BatchSamplers.NO_DUPLICATES` will include duplicate anchors in the same batch, destroying the training signal. Always set the sampler.
- A `Cached*` loss + `gradient_checkpointing=True` = crash. Pick one.
- TripletLoss with bad negatives (too easy) = the loss hits zero fast and the model stops learning. Mine hard negatives first.
- MatryoshkaLoss around CachedMultipleNegativesRankingLoss: supported, but the cached loss's mini-batch semantics apply to the base loss only. Think twice before combining.