Dataset Overview

{eval-rst}

.. hint::

   **Quickstart:** Find `curated datasets <https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552>`_ or `community datasets <https://huggingface.co/datasets?other=sentence-transformers>`_, choose a loss function via this `loss overview <loss_overview.html>`_, and `verify <training_overview.html#dataset-format>`_ that it works with your dataset.

It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format). See Training Overview > Dataset Format to learn how to verify whether a dataset format works with a loss function.

In practice, most dataset configurations will take one of four forms:

Positive Pair: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) or asymmetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
- Examples: sentence-transformers/sentence-compression, sentence-transformers/coco-captions, sentence-transformers/codesearchnet, sentence-transformers/natural-questions, sentence-transformers/gooaq, sentence-transformers/squad, sentence-transformers/wikihow, sentence-transformers/eli5
Triplets: (anchor, positive, negative) text triplets. These datasets don't need labels.
- Examples: sentence-transformers/quora-duplicates, nirantk/triplets, sentence-transformers/all-nli
Pair with Similarity Score: A pair of sentences with a score indicating their similarity. Common examples are "Semantic Textual Similarity" datasets.
- Examples: sentence-transformers/stsb, PhilipMay/stsb_multi_mt.
Texts with Classes: A text with its corresponding class. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class.
- Examples: trec, yahoo_answers_topics.

Note that it is often simple to transform a dataset from one format to another, such that it works with your loss function of choice.

{eval-rst}


.. tip::

   You can use :func:`~sentence_transformers.util.hard_negatives.mine_hard_negatives` to convert a dataset of positive pairs into a dataset of triplets. It uses a :class:`~sentence_transformers.sentence_transformer.model.SentenceTransformer` model to find hard negatives: texts that are similar to the first dataset column, but are not quite as similar as the text in the second dataset column. Datasets with hard triplets often outperform datasets with just positive pairs.
   
   For example, we mined hard negatives from `sentence-transformers/gooaq <https://huggingface.co/datasets/sentence-transformers/gooaq>`_ to produce `tomaarsen/gooaq-hard-negatives <https://huggingface.co/datasets/tomaarsen/gooaq-hard-negatives>`_ and trained `tomaarsen/mpnet-base-gooaq <https://huggingface.co/tomaarsen/mpnet-base-gooaq>`_ and `tomaarsen/mpnet-base-gooaq-hard-negatives <https://huggingface.co/tomaarsen/mpnet-base-gooaq-hard-negatives>`_ on the two datasets, respectively. Sadly, the two models use a different evaluation split, so their performance can't be compared directly.

Multimodal Datasets

{eval-rst}


.. tip::

   Multimodal models require additional dependencies. Install them with e.g. ``pip install -U "sentence-transformers[image]"`` for image support. See `Installation <../installation.html>`_ for all options.

Dataset columns are not limited to text. When using a multimodal model (e.g. a vision-language model like Qwen/Qwen3-VL-Embedding-2B), columns can contain images, audio, video, or combinations of these modalities. The same dataset format categories described above (Positive Pair, Triplets, etc.) apply. The only difference is that one or more columns hold non-text data instead of strings.

Accepted column types

{eval-rst}

The following input types are supported:

- **Text**: strings.
- **Image**: PIL images, file paths, URLs, or numpy/torch arrays.
- **Audio**: file paths, numpy/torch arrays, dicts with ``"array"`` and ``"sampling_rate"`` keys, or (if ``torchcodec`` installed) :class:`torchcodec.AudioDecoder <torchcodec.decoders.AudioDecoder>` instances.
- **Video**: file paths, numpy/torch arrays, dicts with ``"array"`` and ``"video_metadata"`` keys, or (if ``torchcodec`` installed) :class:`torchcodec.VideoDecoder <torchcodec.decoders.VideoDecoder>` instances.
- **Multimodal dicts**: a dict mapping modality names to values, e.g. ``{"text": ..., "audio": ...}``. The keys must be ``"text"``, ``"image"``, ``"audio"``, or ``"video"``.
- **Chat messages**: a list of dicts with ``"role"`` and ``"content"`` keys for multimodal models that use an uncommon chat template to combine text and non-text inputs.

A common use case is matching text queries to document screenshots (images). This is simply a Positive Pair dataset where the first column contains text and the second column contains images:

python

from datasets import load_dataset

dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
"""
Dataset({
    features: ['query', 'image', 'negative_0', 'negative_1', 'negative_2', 'negative_3'],
    num_rows: 10000
})
"""
print(dataset[0]["query"])
# "What are the new anthropological perspectives on development as discussed by Quarles Van Ufford and Giri in 2003?"

print(dataset[0]["image"])
# <PIL.Image.Image image mode=RGB size=...>

Here, query is a text column and image is an image column. The first column is the anchor and the second is the positive, just like a standard Positive Pair dataset. The negative columns (negative_0 through negative_3) contain additional hard-negative images.

Automatic preprocessing

You do not need to manually tokenize text or transform images before training. The data collator calls the model's preprocess method on each column, which automatically detects the modality (text, image, audio, or video) and applies the appropriate preprocessing (tokenization, pixel processing, audio feature extraction, etc.). This means you can pass raw PIL.Image objects, file paths, or URLs directly in your dataset and they will be handled correctly.

Datasets on the Hugging Face Hub

{eval-rst}

The `Datasets library <https://huggingface.co/docs/datasets/index>`_ (``pip install datasets``) allows you to load datasets from the Hugging Face Hub with the :func:`~datasets.load_dataset` function::

   from datasets import load_dataset

   # Indicate the dataset id from the Hub
   dataset_id = "sentence-transformers/natural-questions"
   dataset = load_dataset(dataset_id, split="train")
   """
   Dataset({
      features: ['query', 'answer'],
      num_rows: 100231
   })
   """
   print(dataset[0])
   """
   {
      'query': 'when did richmond last play in a preliminary final',
      'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next."
   }
   """

For more information on how to manipulate your dataset see the Datasets Documentation.

{eval-rst}

.. tip::
   
   It's common for Hugging Face Datasets to contain extraneous columns, e.g. sample_id, metadata, source, type, etc. You can use :meth:`Dataset.remove_columns <datasets.Dataset.remove_columns>` to remove these columns, as they will be used as inputs otherwise. You can also use :meth:`Dataset.select_columns <datasets.Dataset.select_columns>` to keep only the desired columns.

Pre-existing Datasets

The Hugging Face Hub hosts 150k+ datasets, many of which can be converted for training embedding models. We are aiming to tag all Hugging Face datasets that work out of the box with Sentence Transformers with sentence-transformers, allowing you to easily find them by browsing to https://huggingface.co/datasets?other=sentence-transformers. We strongly recommend that you browse these datasets to find training datasets that might be useful for your tasks.

These are some of the popular pre-existing datasets tagged as sentence-transformers that can be used to train and fine-tune SentenceTransformer models:

Dataset	Description
GooAQ	(Question, Answer) pairs from Google auto suggest
Yahoo Answers	(Title+Question, Answer), (Title, Answer), (Title, Question), (Question, Answer) pairs from Yahoo Answers
MS MARCO Triplets (msmarco-distilbert-base-tas-b)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (msmarco-distilbert-base-v3)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (msmarco-MiniLM-L6-v3)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-cls-dot-v2)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-cls-dot-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-mean-dot-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (mpnet-margin-mse-mean-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (co-condenser-margin-mse-cls-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-mnrl-mean-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (distilbert-margin-mse-sym-mnrl-mean-v2)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (co-condenser-margin-mse-sym-mnrl-mean-v1)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
MS MARCO Triplets (BM25)	(Question, Answer, Negative) triplets from MS MARCO Passages dataset with mined negatives
Stack Exchange Duplicates	(Title, Title), (Title+Body, Title+Body), (Body, Body) pairs of duplicate questions from StackExchange
ELI5	(Question, Answer) pairs from ELI5 dataset
SQuAD	(Question, Answer) pairs from SQuAD dataset
WikiHow	(Summary, Text) pairs from WikiHow
Amazon Reviews 2018	(Title, review) pairs from Amazon Reviews
Natural Questions	(Query, Answer) pairs from the Natural Questions dataset
Amazon QA	(Question, Answer) pairs from Amazon
S2ORC	(Title, Abstract), (Abstract, Citation), (Title, Citation) pairs of scientific papers
Quora Duplicates	Duplicate question pairs from Quora
WikiAnswers	Duplicate question pairs from WikiAnswers
AGNews	(Title, Description) pairs of news articles from the AG News dataset
AllNLI	(Anchor, Entailment, Contradiction) triplets from SNLI + MultiNLI
NPR	(Title, Body) pairs from the npr.org website
SPECTER	(Title, Positive Title, Negative Title) triplets of Scientific Publications from Specter
Simple Wiki	(English, Simple English) pairs from Wikipedia
PAQ	(Query, Answer) from the Probably-Asked Questions dataset
altlex	(English, Simple English) pairs from Wikipedia
CC News	(Title, article) pairs from the CC News dataset
CodeSearchNet	(Comment, Code) pairs from open source libraries on GitHub
Sentence Compression	(Long text, Short text) pairs from the Sentence Compression dataset
Trivia QA	(Query, Answer) pairs from the TriviaQA dataset
Flickr30k Captions	Duplicate captions from the Flickr30k dataset
xsum	(News Article, Summary) pairs from XSUM dataset
Coco Captions	Duplicate captions from the Coco Captions dataset
Parallel Sentences: Europarl	(English, Non-English) pairs across numerous languages
Parallel Sentences: Global Voices	(English, Non-English) pairs across numerous languages
Parallel Sentences: MUSE	(English, Non-English) pairs across numerous languages
Parallel Sentences: JW300	(English, Non-English) pairs across numerous languages
Parallel Sentences: News Commentary	(English, Non-English) pairs across numerous languages
Parallel Sentences: OpenSubtitles	(English, Non-English) pairs across numerous languages
Parallel Sentences: Talks	(English, Non-English) pairs across numerous languages
Parallel Sentences: Tatoeba	(English, Non-English) pairs across numerous languages
Parallel Sentences: WikiMatrix	(English, Non-English) pairs across numerous languages
Parallel Sentences: WikiTitles	(English, Non-English) pairs across numerous languages

{eval-rst}


.. note::

   We advise users to tag datasets that can be used for training embedding models with ``sentence-transformers`` by adding ``tags: sentence-transformers``. We would also gladly accept high quality datasets to be added to the list above for all to see and use.

Dataset Overview

Dataset Overview

Multimodal Datasets

Accepted column types

Cross-modal dataset example

Automatic preprocessing

Datasets on the Hugging Face Hub

Pre-existing Datasets