Natural language inference

Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".

Example:

| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |
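The task format above can be represented as labeled premise/hypothesis pairs. A minimal sketch in Python (the `NLIExample` class and `LABELS` tuple are illustrative, not part of any dataset release):

```python
from dataclasses import dataclass

# The three NLI labels: the hypothesis is entailed by, contradicts,
# or is undetermined given the premise.
LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of LABELS

# Examples taken from the table above.
examples = [
    NLIExample(
        "A soccer game with multiple males playing.",
        "Some men are playing a sport.",
        "entailment",
    ),
    NLIExample(
        "A man inspects the uniform of a figure in some East Asian country.",
        "The man is sleeping.",
        "contradiction",
    ),
]

# Every example carries exactly one of the three labels.
assert all(ex.label in LABELS for ex in examples)
```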

SNLI

The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.
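Accuracy here is simply the fraction of pairs whose predicted label matches the gold label. A minimal sketch (the label strings are illustrative):

```python
def accuracy(gold, predicted):
    """Fraction of examples where the predicted label matches the gold label."""
    assert len(gold) == len(predicted), "gold and predictions must align"
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["entailment", "neutral", "contradiction", "entailment"]
pred = ["entailment", "contradiction", "contradiction", "entailment"]
print(accuracy(gold, pred))  # → 0.75 (3 of 4 labels match)
```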

State-of-the-art results can be seen on the SNLI website.

MultiNLI

The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.

Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.

| Model | Matched | Mismatched | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| RoBERTa (Liu et al., 2019) | 90.8 | 90.2 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Official |
| XLNet-Large (ensemble) (Yang et al., 2019) | 90.2 | 89.8 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Official |
| MT-DNN-ensemble (Liu et al., 2019) | 87.9 | 87.4 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | Official |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 87.6 | 87.2 | Training Complex Models with Multi-Task Weak Supervision | Official |
| Finetuned Transformer LM (Radford et al., 2018) | 82.1 | 81.4 | Improving Language Understanding by Generative Pre-Training | |
| Multi-task BiLSTM + Attn (Wang et al., 2018) | 72.2 | 72.1 | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | |
| GenSen (Subramanian et al., 2018) | 71.4 | 71.3 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | |

SciTail

The SciTail entailment dataset consists of 27k premise/hypothesis pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild": hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

| Model | Accuracy | Paper / Source |
| --- | --- | --- |
| Finetuned Transformer LM (Radford et al., 2018) | 88.3 | Improving Language Understanding by Generative Pre-Training |
| Hierarchical BiLSTM Max Pooling (Talman et al., 2018) | 86.0 | Natural Language Inference with Hierarchical BiLSTM Max Pooling |
| CAFE (Tay et al., 2018) | 83.3 | A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference |