Training

This folder contains various examples to fine-tune SentenceTransformers for specific tasks.

To get started, I recommend looking at the Semantic Textual Similarity (STS) or the Natural Language Inference (NLI) examples.

For documentation on how to train your own models, see the Training Overview.

Training Examples

  • adaptive_layer - Examples to train models whose layers can be removed on the fly for faster inference.
  • avg_word_embeddings - Examples to train models based on classical word embeddings like GloVe. These models are extremely fast, but less accurate than transformer-based models.
  • clip - Examples to train CLIP image models.
  • data_augmentation - Examples of how to apply data augmentation strategies to improve embedding models.
  • distillation - Examples to make models smaller, faster and lighter.
  • hpo - Examples with hyperparameter search to find the best hyperparameters for your task.
  • matryoshka - Examples of training embedding models whose embeddings can be truncated (allowing for faster search) with minimal performance loss.
  • ms_marco - Example training scripts for training on the MS MARCO information retrieval dataset.
  • multilingual - Existing monolingual models can be extended to various languages (paper). This folder contains a step-by-step guide to extending existing models to new languages.
  • nli - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sentence embeddings.
  • other - Various small examples, each showcasing one specific training case.
  • paraphrases - Examples for training models capable of recognizing paraphrases, i.e. understanding when texts have the same meaning despite using different words.
  • peft - Examples for training with PEFT adapters (e.g. LoRA) for parameter-efficient fine-tuning.
  • prompts - Examples and documentation for training and using embedding models with prompts / instructions.
  • quora_duplicate_questions - Quora Duplicate Questions is a large corpus of duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
  • sts - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we have a sentence pair and a score indicating the semantic similarity.
  • unsloth - Examples for fast LoRA / QLoRA fine-tuning with the Unsloth training framework on top of Sentence Transformers.
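
To make the STS setup described above concrete, here is a minimal, library-free sketch of the idea: each training example is a sentence pair plus a similarity score, and a model is penalized when the cosine similarity of its two embeddings differs from that score. This is similar in spirit to the library's CosineSimilarityLoss, but it is only an illustration, not the code used in the sts folder. The sentences, the scores, the 3-dimensional toy embeddings, and the helper names `cosine_similarity` and `cosine_mse_loss` are all assumptions made up for this example.

```python
import math

# Toy STS-style examples: (sentence_a, sentence_b, similarity score in [0, 1]).
# Sentences and scores are invented for illustration only.
sts_examples = [
    ("A man is playing a guitar.", "Someone plays a guitar.", 0.9),
    ("A man is playing a guitar.", "A chef is cooking pasta.", 0.1),
]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(embedding_pairs, gold_scores):
    """Mean squared error between predicted cosine similarities and gold scores."""
    errors = [
        (cosine_similarity(u, v) - score) ** 2
        for (u, v), score in zip(embedding_pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


# Hypothetical embeddings standing in for what a model would produce
# for each sentence pair above (a real model outputs hundreds of dimensions).
toy_embeddings = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction: cosine similarity 1.0
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal: cosine similarity 0.0
]

gold_scores = [score for _, _, score in sts_examples]
loss = cosine_mse_loss(toy_embeddings, gold_scores)
print(f"loss = {loss:.4f}")
```

During actual fine-tuning, the embeddings come from the model and the loss is minimized by gradient descent, so pairs with high gold scores are pulled together and pairs with low scores are pushed apart. See the sts folder and the Training Overview for the real training scripts.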