Paraphrase Generation

Paraphrase generation is the task of generating an output sentence that preserves the meaning of the input sentence but contains variations in word choice and grammar. See the example given below:

| Input | Output |
| --- | --- |
| The need for investors to earn a commercial return may put upward pressure on prices | The need for profit is likely to push up prices |

PARANMT-50M

The PARANMT-50M dataset is designed for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Trigram (baseline) | 47.4 | Wieting and Gimpel, 2018 | Unavailable |
| Unsupervised BART w/ Dynamic Blocking | 20.9 | Niu et al., 2020 | Unavailable |
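The tables on this page rank models by BLEU, which measures n-gram overlap between a generated paraphrase and a reference. The following is a minimal sketch of a simplified, single-reference, sentence-level BLEU, assuming whitespace tokenization and no smoothing. The function names `ngrams` and `sentence_bleu` are illustrative only. Published scores are typically computed at corpus level with a standard toolkit and reported on a 0-100 scale, so this sketch is not expected to reproduce the numbers above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU with uniform n-gram weights.

    Assumption: lowercased whitespace tokenization and no smoothing, so any
    n-gram order with zero matches makes the whole score 0.0.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)

    # The brevity penalty discourages candidates shorter than the reference.
    if len(cand_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the need for profit is likely to push up prices"
candidate = "the need for profit will likely push up prices"
print(round(100 * sentence_bleu(reference, candidate), 1))
```

Note that for paraphrase generation a high BLEU against the reference does not by itself show that the output differs from the input, which is why some papers report additional metrics alongside it.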

QQP-Pos

The QQP-Pos dataset is a paraphrase generation dataset with 400K source-target pairs. Each pair is labelled as negative if the two questions are not duplicates and as positive otherwise.
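Because only duplicate questions are paraphrases of each other, a generation model would presumably be trained on the positive pairs. The sketch below shows one way to filter labelled pairs into source-target pairs. The `QuestionPair` type, its field names, and the example questions are hypothetical and are not taken from the actual dataset files.

```python
from typing import NamedTuple


class QuestionPair(NamedTuple):
    """Hypothetical record layout for one labelled question pair."""
    question1: str
    question2: str
    is_duplicate: bool  # True = positive (paraphrase), False = negative


def positive_pairs(pairs):
    """Keep only positive pairs and return them as (source, target) tuples."""
    return [(p.question1, p.question2) for p in pairs if p.is_duplicate]


# Made-up examples for illustration only.
examples = [
    QuestionPair("How do I learn Python?",
                 "What is the best way to learn Python?", True),
    QuestionPair("How do I learn Python?",
                 "How do I cook rice?", False),
]

training_pairs = positive_pairs(examples)
for source, target in training_pairs:
    print(source, "->", target)
```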

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Unsupervised BART w/ Dynamic Blocking | 26.76 | Niu et al., 2020 | Unavailable |
| ParafraGPT-UC | 35.9 | Bui et al., 2020 | Code |

MULTIPIT, MULTIPITCROWD and MULTIPITEXPERT

Past efforts to create paraphrase corpora consider only a single paraphrase criterion and do not account for the fact that the desired “strictness” of semantic equivalence in paraphrases varies from task to task (Bhagat and Hovy, 2013; Liu and Soh, 2022). For example, for the purpose of tracking unfolding events, “A tsunami hit Haiti.” and “303 people died because of the tsunami in Haiti” are sufficiently close to be considered paraphrases, whereas for paraphrase generation, the extra information “303 people dead” in the latter sentence may lead models to learn to hallucinate and generate more unfaithful content. The authors of the MULTIPIT paper present a data collection and annotation method to address these issues.

MULTIPIT is a Multi-Topic Paraphrase in Twitter corpus that consists of a total of 130K sentence pairs with crowdsourced (MULTIPITCROWD) and expert (MULTIPITEXPERT) annotations. MULTIPITCROWD is a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter.

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| DeBERTaV3-large | 92.00 | Improving Large-scale Paraphrase Acquisition and Generation | Unavailable |

MULTIPITEXPERT is an expert-annotated set of 5.5K sentence pairs that uses a stricter definition, which makes it more suitable for acquiring paraphrases for generation purposes.

| Model | F1 | Paper / Source | Code |
| --- | --- | --- | --- |
| DeBERTaV3-large | 83.20 | Improving Large-scale Paraphrase Acquisition and Generation | Unavailable |
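The MULTIPIT results are reported as F1 on a paraphrase identification task, where each sentence pair is classified as a paraphrase or not. As a minimal sketch, and assuming the score is the F1 of the positive (paraphrase) class expressed as a percentage, which this page does not state explicitly, it could be computed as follows. The function name `f1_score` and the labels are illustrative.

```python
def f1_score(gold, predicted):
    """F1 of the positive class for binary labels (1 = paraphrase, 0 = not)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    false_pos = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if true_pos == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up labels for five sentence pairs.
gold_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 1]
print(round(100 * f1_score(gold_labels, predicted_labels), 2))
```

With these labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.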