# Coreference resolution
Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities.
Example:

```
             +-------------+
             |             |
"I voted for Obama because he was most aligned with my values", she said.
 |                                                  |           |
 +--------------------------------------------------+-----------+
```

"I", "my", and "she" belong to one cluster, and "Obama" and "he" belong to another.
Experiments are conducted on the data of the CoNLL-2012 shared task, which uses OntoNotes coreference annotations. Papers report the precision, recall, and F1 of the MUC, B<sup>3</sup>, and CEAF<sub>φ4</sub> metrics using the official CoNLL-2012 evaluation scripts. The main evaluation metric is the average F1 of the three metrics.
| Model | Avg F1 | Paper / Source | Code |
|---|---|---|---|
| wl-coref + RoBERTa | 81.0 | Word-Level Coreference Resolution | Official |
| s2e+Longformer-Large | 80.3 | Coreference Resolution without Span Representations | Official |
| Xu et al. (2020) | 80.2 | Revealing the Myth of Higher-Order Inference in Coreference Resolution | Official |
| Joshi et al. (2019)<sup>1</sup> | 79.6 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | Official |
| Joshi et al. (2019)<sup>2</sup> | 76.9 | BERT for Coreference Resolution: Baselines and Analysis | Official |
| Kantor and Globerson (2019) | 76.6 | Coreference Resolution with Entity Equalization | Official |
| Fei et al. (2019) | 73.8 | End-to-end Deep Reinforcement Learning Based Coreference Resolution | |
| (Lee et al., 2017)+ELMo (Peters et al., 2018)+coarse-to-fine & second-order inference (Lee et al., 2018) | 73.0 | Higher-order Coreference Resolution with Coarse-to-fine Inference | Official |
| (Lee et al., 2017)+ELMo (Peters et al., 2018) | 70.4 | Deep contextualized word representations | |
| Lee et al. (2017) | 67.2 | End-to-end Neural Coreference Resolution | |
<a name="myfootnote1">[1]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+SpanBERT (Joshi et al., 2019)
<a name="myfootnote2">[2]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+BERT (Devlin et al., 2019)
Experiments are conducted on the GAP dataset. The metrics are F1 score on Masculine (M) and Feminine (F) examples, Overall F1, and a Bias factor calculated as F / M.
| Model | Overall F1 | Masculine F1 (M) | Feminine F1 (F) | Bias (F/M) | Paper / Source | Code |
|---|---|---|---|---|---|---|
| Attree et al. (2019) | 92.5 | 94.0 | 91.1 | 0.97 | Gendered Ambiguous Pronouns Shared Task: Boosting Model Confidence by Evidence Pooling | GREP |
| Chada et al. (2019) | 90.2 | 90.9 | 89.5 | 0.98 | Gendered Pronoun Resolution using BERT and an extractive question answering formulation | CorefQA |
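The Bias column is simply the ratio of the two gendered F1 scores; values below 1.0 indicate weaker performance on feminine examples. Using the numbers from the table above:

```python
def bias_factor(feminine_f1: float, masculine_f1: float) -> float:
    """GAP bias factor: feminine F1 divided by masculine F1."""
    return feminine_f1 / masculine_f1

# Attree et al. (2019): F = 91.1, M = 94.0
print(round(bias_factor(91.1, 94.0), 2))  # -> 0.97
# Chada et al. (2019): F = 89.5, M = 90.9
print(round(bias_factor(89.5, 90.9), 2))  # -> 0.98
```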