examples/sentence_transformer/applications/parallel-sentence-mining/README.md
Bitext mining is the process of finding parallel (translated) sentence pairs across monolingual corpora. For example, suppose you have a set of English sentences:
This is an example sentences.
Hello World!
My final third sentence in this list.
And a set of German sentences:
Hallo Welt!
Dies ist ein Beispielsatz.
Dieser Satz taucht im Englischen nicht auf.
Here, you want to find all translation pairs between the English set and the German set.
The two correct translation pairs are:
Hello World! Hallo Welt!
This is an example sentences. Dies ist ein Beispielsatz.
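To make the idea concrete, here is a minimal sketch of one simple way to mine such pairs: embed every sentence, compare embeddings with cosine similarity, and keep a pair only if the two sentences are each other's nearest neighbor and the similarity clears a threshold. The 3-dimensional vectors below are made up by hand purely for illustration (in practice a multilingual sentence embedding model would produce them), and the helper names `cosine` and `mine_pairs` as well as the `0.8` threshold are assumptions of this sketch, not part of this example's scripts.

```python
import math

english = [
    "This is an example sentences.",
    "Hello World!",
    "My final third sentence in this list.",
]
german = [
    "Hallo Welt!",
    "Dies ist ein Beispielsatz.",
    "Dieser Satz taucht im Englischen nicht auf.",
]

# Hypothetical toy embeddings, hand-picked so that translations point in
# similar directions. A real setup would compute these with a multilingual model.
en_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.2, 0.1, 0.9]]
de_vecs = [[0.1, 0.95, 0.0], [0.95, 0.05, 0.1], [0.4, 0.4, -0.6]]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbors
    whose cosine similarity is at least `threshold`."""
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target candidate for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source candidate for that target sentence.
        best_src = max(range(len(sims)), key=lambda k: sims[k][j])
        if best_src == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


pairs = mine_pairs(en_vecs, de_vecs)
for i, j, score in pairs:
    print(f"{score:.3f}  {english[i]}  <->  {german[j]}")
```

With these toy vectors, the sketch keeps exactly the two pairs listed above and leaves the third English and third German sentence unmatched. The mutual nearest neighbor check is what prevents an untranslated sentence from being attached to its least-bad candidate.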
Usually you apply this method to large corpora. For example, you might want to find all translated sentence pairs between the English Wikipedia and the Chinese Wikipedia.
We follow the setup from Artetxe and Schwenk, Section 4.3, to find translated sentences in two datasets: