
# E5 Text Embeddings

- [Multilingual E5 Text Embeddings: A Technical Report](https://arxiv.org/abs/2402.05672). Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024
- [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368). Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024
- [Text Embeddings by Weakly-Supervised Contrastive Pre-training](https://arxiv.org/abs/2212.03533). Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022

## LLM-based Models

| Model | BEIR | # of layers | embedding dimension | Hugging Face |
|---|---|---|---|---|
| E5-mistral-7b-instruct | 56.9 | 32 | 4096 | [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) |
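A minimal usage sketch following the pattern documented on the Hugging Face model card (queries are prefixed with a one-line task instruction, an EOS token is appended before padding, and the last token's hidden state is the embedding); treat it as illustrative and consult the model card for the exact preprocessing:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def last_token_pool(last_hidden_states, attention_mask):
    # Last-token pooling: use the hidden state of the final non-padding token.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    seq_lengths = attention_mask.sum(dim=1) - 1
    return last_hidden_states[torch.arange(last_hidden_states.shape[0]), seq_lengths]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
# A 7B model; on a GPU you may want torch_dtype=torch.float16 to halve memory.
model = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")

# Queries carry a task instruction; documents are embedded as-is.
texts = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat",
    "As a general guideline, protein requirements vary with age, sex, and activity level.",
]
# Tokenize, append the EOS token to each input, then pad (per the model card).
batch = tokenizer(texts, max_length=4095, truncation=True, return_attention_mask=False)
batch["input_ids"] = [ids + [tokenizer.eos_token_id] for ids in batch["input_ids"]]
batch = tokenizer.pad(batch, padding=True, return_attention_mask=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(last_token_pool(outputs.last_hidden_state, batch["attention_mask"]), p=2, dim=-1)
print((embeddings[0] @ embeddings[1]).item())  # cosine similarity of query vs. document
```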

## English Pre-trained Models

| Model | BEIR | # of layers | embedding dimension | Hugging Face |
|---|---|---|---|---|
| E5-small-v2 | 49.0 | 12 | 384 | [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) |
| E5-base-v2 | 50.3 | 12 | 768 | [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) |
| E5-large-v2 | 50.6 | 24 | 1024 | [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) |
| E5-small | 46.0 | 12 | 384 | [intfloat/e5-small](https://huggingface.co/intfloat/e5-small) |
| E5-base | 48.8 | 12 | 768 | [intfloat/e5-base](https://huggingface.co/intfloat/e5-base) |
| E5-large | 50.0 | 24 | 1024 | [intfloat/e5-large](https://huggingface.co/intfloat/e5-large) |
| E5-small-unsupervised | 40.8 | 12 | 384 | [intfloat/e5-small-unsupervised](https://huggingface.co/intfloat/e5-small-unsupervised) |
| E5-base-unsupervised | 42.9 | 12 | 768 | [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) |
| E5-large-unsupervised | 44.2 | 24 | 1024 | [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) |

Models with the `-unsupervised` suffix are pre-trained only on unlabeled data.
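For these BERT-style E5 models, every input must be prefixed with `query: ` or `passage: `, and the embedding is the average of the token states. A minimal sketch with transformers, mirroring the usage documented on the model cards:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states, attention_mask):
    # Mean pooling over real tokens: zero out padding positions, then average.
    hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2")

# Every input needs a "query: " or "passage: " prefix.
texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average protein requirement for women ages 19 to 70 is 46 grams per day.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(average_pool(outputs.last_hidden_state, batch["attention_mask"]), p=2, dim=-1)
print((embeddings[0] @ embeddings[1]).item())  # cosine similarity of query vs. passage
```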

## Multilingual Pre-trained Models

| Model | BEIR | # of layers | embedding dimension | Hugging Face |
|---|---|---|---|---|
| multilingual-e5-small | 46.6 | 12 | 384 | [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) |
| multilingual-e5-base | 48.9 | 12 | 768 | [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) |
| multilingual-e5-large | 51.4 | 24 | 1024 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) |
| multilingual-e5-large-instruct | 52.5 | 24 | 1024 | [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) |
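The `query: ` / `passage: ` prefix convention carries over unchanged. If you prefer the sentence-transformers API, a one-call sketch (assuming the package is installed; `normalize_embeddings=True` returns unit vectors, so dot products are cosine similarities):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
# Prefixes are still required; queries and passages may be in different languages.
embeddings = model.encode(
    [
        "query: how much protein should a female eat",
        "passage: Der tägliche Proteinbedarf einer erwachsenen Frau liegt bei etwa 46 Gramm.",
    ],
    normalize_embeddings=True,
)
print(embeddings.shape)         # (2, 384) for the small model
print(embeddings[0] @ embeddings[1])  # cosine similarity across languages
```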

## Install Python Package Requirements

```shell
pip install -r requirements.txt
```

For `e5-mistral-7b-instruct`, `transformers>=4.34` is required to load the Mistral model.

## Evaluate on the BEIR Benchmark

After installing the required Python packages, run the following command on a GPU machine:

```shell
bash scripts/eval_mteb_beir.sh intfloat/e5-small-v2
```

By default, the evaluation script uses all available GPUs; see the snippet below to restrict it.
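To limit which GPUs are used, the standard `CUDA_VISIBLE_DEVICES` environment variable should work (a general CUDA mechanism, not a flag of the script itself):

```shell
# Restrict the evaluation to GPUs 0 and 1 only.
CUDA_VISIBLE_DEVICES=0,1 bash scripts/eval_mteb_beir.sh intfloat/e5-small-v2
```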

Caution: the evaluation can take quite a long time (~10 hours) because the entire corpus must be encoded. For `intfloat/e5-mistral-7b-instruct`, it can take even longer (several days).

## Evaluate on the MTEB Benchmark

Run the following command:

```shell
bash scripts/eval_mteb_except_retrieval.sh intfloat/e5-small-v2
```

For multilingual models, add the `--multilingual` flag:

```shell
bash scripts/eval_mteb_except_retrieval.sh intfloat/multilingual-e5-base --multilingual
```

## Other Resources

The data for our proposed synthetic task personalized passkey retrieval is available at https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval.
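A quick way to pull and inspect the dataset, assuming the Hugging Face datasets library is installed:

```python
from datasets import load_dataset

# Download the synthetic personalized passkey retrieval data from the Hub.
ds = load_dataset("intfloat/personalized_passkey_retrieval")
print(ds)  # available splits, features, and row counts

first_split = next(iter(ds.values()))
print(first_split[0])  # inspect the fields of a single example
```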

## Troubleshooting

If you encounter an out-of-memory (OOM) error, try reducing the batch size.

## Citation

If you find our paper or models helpful, please consider citing as follows:

```
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}

@article{wang2023improving,
  title={Improving Text Embeddings with Large Language Models},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2401.00368},
  year={2023}
}

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
```

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

This project has adopted the Microsoft Open Source Code of Conduct.