Bi-encoders (a.k.a. sentence embedding models) require substantial training data and fine-tuning on the target task to achieve competitive performance. However, in many scenarios, only little training data is available.
To address this practical issue, we release an effective data augmentation strategy known as <b>Augmented SBERT</b>, in which a high-performing but slow cross-encoder (BERT) labels a larger set of input pairs to augment the training data for the bi-encoder (SBERT).
For more details, refer to our publication, Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks, a joint effort by Nandan Thakur, Nils Reimers and Johannes Daxenberger of UKP Lab, TU Darmstadt.
Chien Vu also wrote a nice blog article on this technique: Advance BERT model via transferring knowledge from Cross-Encoders to Bi-Encoders
Scenario 1: Limited or small annotated datasets (few labeled sentence-pairs (1k-3k))
If you have specialized datasets in your company or research that are small or contain only a few labeled sentence pairs, you can apply the Augmented SBERT (in-domain) strategy: train a cross-encoder on your small gold dataset and use BM25 sampling to generate sentence-pair combinations not seen in it. Use the cross-encoder to label these unlabeled pairs to create the silver dataset. Finally, train a bi-encoder (i.e. SBERT) on the extended (gold + silver) dataset, as shown in train_sts_indomain_bm25.py.
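The labeling and merging part of this scenario can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from train_sts_indomain_bm25.py: `build_extended_dataset` and `TokenOverlapScorer` are hypothetical names, and the scorer is a toy stand-in for a trained cross-encoder. The only thing assumed about the cross-encoder is a `predict(list_of_pairs)` method that returns one score per pair.

```python
def build_extended_dataset(gold, candidates, cross_encoder):
    """Label unseen candidate pairs and return (gold + silver, silver).

    gold:          list of (sentence_a, sentence_b, score) with human labels
    candidates:    list of (sentence_a, sentence_b) without labels
    cross_encoder: any object with predict(list_of_pairs) -> sequence of scores
    """
    # Treat (a, b) and (b, a) as the same pair when checking for duplicates.
    seen = {frozenset((a, b)) for a, b, _ in gold}
    new_pairs = []
    for a, b in candidates:
        key = frozenset((a, b))
        if key not in seen:
            seen.add(key)
            new_pairs.append((a, b))

    # Weak labels from the cross-encoder form the silver dataset.
    scores = cross_encoder.predict(new_pairs) if new_pairs else []
    silver = [(a, b, float(s)) for (a, b), s in zip(new_pairs, scores)]
    return gold + silver, silver


class TokenOverlapScorer:
    """Hypothetical toy stand-in for a trained cross-encoder (Jaccard overlap)."""

    def predict(self, pairs):
        scores = []
        for a, b in pairs:
            ta, tb = set(a.lower().split()), set(b.lower().split())
            scores.append(len(ta & tb) / len(ta | tb) if ta | tb else 0.0)
        return scores


gold = [("A man is eating", "A man eats food", 0.9)]
candidates = [
    ("A man eats food", "A man is eating"),  # already in gold (reversed)
    ("A man is eating", "A dog barks"),      # new pair
    ("A dog barks", "A man is eating"),      # duplicate of the new pair
]
extended, silver = build_extended_dataset(gold, candidates, TokenOverlapScorer())
```

In a real setup, the stand-in would be replaced by a cross-encoder fine-tuned on the gold data, for example a sentence-transformers `CrossEncoder`, whose `predict` method accepts a list of sentence pairs.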
Scenario 2: No annotated datasets (Only unlabeled sentence-pairs)
If you have specialized datasets in your company or research that contain only unlabeled sentence pairs, you can apply the Augmented SBERT (domain-transfer) strategy: train a cross-encoder on an annotated source dataset (e.g. QQP). Use this cross-encoder to label your specialized unlabeled dataset, i.e. the target dataset. Finally, train a bi-encoder (i.e. SBERT) on your labeled target dataset, as shown in train_sts_qqp_crossdomain.py.
There are two major scenarios for the Augmented SBERT approach for pairwise-sentence regression or classification tasks.
We apply the Augmented SBERT (<b>In-domain</b>) strategy, which involves three steps:
Step 1: Train a cross-encoder (BERT) over the small (gold or annotated) dataset
Step 2.1: Create pairs by recombination and reduce the pairs via BM25 or semantic search
Step 2.2: Weakly label the new pairs with the cross-encoder (BERT). These silver pairs form the silver dataset
Step 3: Finally, train a bi-encoder (SBERT) on the extended (gold + silver) training dataset
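Step 2.1 can be illustrated with a small pure-Python BM25 sampler. This is a sketch under stated assumptions rather than the repository's implementation: the function names, the whitespace tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`, `k=3`) are illustrative choices, and a real setup would typically use a search engine or a BM25 library instead. For each sentence taken from the gold pairs, it retrieves the top-k most similar other sentences and keeps only combinations that are not already in the gold set.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bm25_top_k(sentences, k=3, k1=1.5, b=0.75):
    """Map each sentence index to the indices of its top-k BM25 neighbours."""
    if not sentences:
        return {}
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    term_freqs = [Counter(d) for d in docs]
    doc_freq = Counter(term for d in docs for term in set(d))
    # Okapi BM25 idf with +1 inside the log, so weights are never negative.
    idf = {t: math.log(1 + (n_docs - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}

    def score(query_tokens, doc_idx):
        tf = term_freqs[doc_idx]
        norm = k1 * (1 - b + b * len(docs[doc_idx]) / avg_len)
        return sum(
            idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm)
            for t in set(query_tokens)
            if t in tf
        )

    top = {}
    for i, query in enumerate(docs):
        scored = [(score(query, j), j) for j in range(n_docs) if j != i]
        scored.sort(key=lambda item: (-item[0], item[1]))
        # Drop neighbours that share no term with the query at all.
        top[i] = [j for s, j in scored[:k] if s > 0]
    return top


def candidate_pairs(gold_pairs, k=3):
    """Recombine sentences from the gold pairs into new, unseen pairs."""
    sentences = list(dict.fromkeys(s for pair in gold_pairs for s in pair))
    seen = {frozenset(pair) for pair in gold_pairs}
    pairs = []
    for i, neighbours in bm25_top_k(sentences, k=k).items():
        for j in neighbours:
            key = frozenset((sentences[i], sentences[j]))
            if key not in seen:
                seen.add(key)
                pairs.append((sentences[i], sentences[j]))
    return pairs


gold_pairs = [
    ("a cat sits on the mat", "a dog runs in the park"),
    ("the cat sleeps on the sofa", "stocks fell sharply today"),
]
new_pairs = candidate_pairs(gold_pairs, k=3)
```

The resulting `new_pairs` are the unlabeled candidates that Step 2.2 passes to the cross-encoder. Restricting each sentence to its top-k neighbours keeps the number of candidates linear in the number of sentences instead of quadratic.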
We apply the Augmented SBERT (<b>Domain-Transfer</b>) strategy, which involves three steps:
Step 1: Train a cross-encoder (BERT) from scratch over a source dataset for which we have annotations
Step 2: Use this cross-encoder (BERT) to label your target dataset i.e. unlabeled sentence pairs
Step 3: Finally, train a bi-encoder (SBERT) on the labeled target dataset
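The data flow of these three steps can be sketched as a small orchestration function. This is only an outline under stated assumptions, not the code from train_sts_qqp_crossdomain.py: `domain_transfer` and the toy stand-ins below are hypothetical, and the two training routines are passed in as callables so that any actual training code can be plugged in. The cross-encoder is assumed to expose `predict(list_of_pairs)`.

```python
def domain_transfer(source_gold, target_pairs, train_cross_encoder, train_bi_encoder):
    """Run the three domain-transfer steps and return (bi_encoder, labeled_target).

    source_gold:  list of (sentence_a, sentence_b, label) from the source domain
    target_pairs: list of (sentence_a, sentence_b) from the target domain, unlabeled
    """
    # Step 1: train the cross-encoder on the annotated source dataset.
    cross_encoder = train_cross_encoder(source_gold)
    # Step 2: label the unlabeled target pairs with the cross-encoder.
    scores = cross_encoder.predict(target_pairs)
    labeled_target = [(a, b, float(s)) for (a, b), s in zip(target_pairs, scores)]
    # Step 3: train the bi-encoder on the labeled target dataset only.
    bi_encoder = train_bi_encoder(labeled_target)
    return bi_encoder, labeled_target


class ToyCrossEncoder:
    """Hypothetical stand-in: scores pairs by token overlap, learns nothing."""

    def predict(self, pairs):
        scores = []
        for a, b in pairs:
            ta, tb = set(a.lower().split()), set(b.lower().split())
            scores.append(len(ta & tb) / len(ta | tb) if ta | tb else 0.0)
        return scores


def toy_train_cross_encoder(source_gold):
    # A real implementation would fine-tune a model on source_gold here.
    return ToyCrossEncoder()


def toy_train_bi_encoder(labeled_pairs):
    # A real implementation would fine-tune SBERT; this only records the data size.
    return {"num_training_pairs": len(labeled_pairs)}


source_gold = [("How do I learn Python?", "What is the best way to learn Python?", 1.0)]
target_pairs = [
    ("reset my password", "reset my password"),
    ("reset my password", "cancel an order"),
]
bi_encoder, labeled_target = domain_transfer(
    source_gold, target_pairs, toy_train_cross_encoder, toy_train_bi_encoder
)
```

Unlike the in-domain strategy, no gold target labels are merged in: the bi-encoder sees only the labels produced by the cross-encoder.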
The examples/sentence_transformer/training/data_augmentation folder contains simple training examples for each scenario:
train_sts_seed_optimization.py
train_sts_indomain_semantic.py
If you use the code for Augmented SBERT, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:
@article{thakur-2020-AugSBERT,
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
year = "2020",
url = "https://arxiv.org/abs/2010.08240",
}