Summarization is the task of producing a shorter version of one or several documents that preserves most of the input's meaning.
For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
- They only assess content selection and do not account for other quality aspects such as fluency, grammaticality, coherence, or factual consistency.
- Content selection is assessed mostly via lexical overlap, even though an abstractive summary can express the same content as a reference with little or no lexical overlap.
- They were designed to be used with multiple reference summaries per input, whereas most recent datasets provide only a single reference.
Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
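For reference, full-length ROUGE F1-scores of the kind listed in the tables below can be computed with an off-the-shelf implementation. The sketch below uses the `rouge_score` package; this is an assumption about one convenient implementation, not the evaluation script of any particular paper, and different ROUGE implementations and options can produce slightly different numbers.

```python
# Minimal sketch of full-length ROUGE F1 scoring with the `rouge_score`
# package (pip install rouge-score). The example strings are made up.
from rouge_score import rouge_scorer

reference = "police killed the gunman"
candidate = "the gunman was shot down by police"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # signature: score(target, prediction)

for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```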
The CNN / Daily Mail dataset as processed by Nallapati et al. (2016) has been used for evaluating summarization. The dataset contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. Models are evaluated with full-length F1-scores of ROUGE-1, ROUGE-2, ROUGE-L, and optionally METEOR. A multilingual version of the CNN / Daily Mail dataset is also available in five languages (French, German, Spanish, Russian, Turkish).
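As an illustration only, the processed dataset is also distributed on the Hugging Face hub under the `cnn_dailymail` identifier; the sketch below assumes that distribution (with the `3.0.0` configuration corresponding to the non-anonymized variant) rather than the original preprocessing scripts of Nallapati et al. (2016) or See et al. (2017).

```python
# Sketch: load CNN / Daily Mail from the Hugging Face hub. The "3.0.0"
# configuration is assumed to be the non-anonymized (See et al., 2017) variant.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print(cnn_dm)                    # ~287k / 13k / 11k train / validation / test pairs
example = cnn_dm["validation"][0]
print(example["article"][:300])  # source news article
print(example["highlights"])     # multi-sentence reference summary
```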
The following models have been evaluated on the entity-anonymized version of the dataset introduced by Nallapati et al. (2016).
The following models have been evaluated on the non-anonymized version of the dataset introduced by See et al. (2017).
The first table covers extractive models, while the second covers abstractive approaches.
The Gigaword summarization dataset was first used by Rush et al., 2015 and represents a sentence summarization / headline generation task with very short input documents (31.4 tokens on average) and summaries (8.3 tokens on average). It contains 3.8M training, 189k development and 1,951 test instances. Models are evaluated with ROUGE-1, ROUGE-2 and ROUGE-L using full-length F1-scores.
Results below are ranked by ROUGE-2 score.
(*) Rush et al., 2015 report ROUGE recall; the table here contains the ROUGE F1-scores for Rush's model as reported by Chopra et al., 2016.
X-Sum (standing for Extreme Summarization), introduced by Narayan et al., 2018, is a summarization dataset which does not favor extractive strategies and calls for an abstractive modeling approach.
The idea of this dataset is to create a short, one-sentence news summary.
Data is collected by harvesting online articles from the BBC.
The dataset contains 204,045 samples for the training set, 11,332 for the validation set, and 11,334 for the test set. On average, an article is 431 words long (~20 sentences) and a summary is 23 words long. It can be downloaded here.
Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L.
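As a hedged sketch of how such corpus-level scores could be obtained, the snippet below assumes the dataset is available on the Hugging Face hub under the `xsum` identifier (with `document` and `summary` fields) and that `predictions` holds one generated summary per test article; it averages per-example F1, which may differ slightly from the averaging used by individual papers.

```python
# Sketch: corpus-level ROUGE-1/2/L F1 on the X-Sum test split.
# The hub identifier "xsum" and the aligned `predictions` list are assumptions.
from datasets import load_dataset
from rouge_score import rouge_scorer

xsum_test = load_dataset("xsum", split="test")
predictions = ["..."] * len(xsum_test)  # placeholder: one generated summary per article

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
for example, pred in zip(xsum_test, predictions):
    scores = scorer.score(example["summary"], pred)  # single-sentence reference
    for name in totals:
        totals[name] += scores[name].fmeasure

for name, total in totals.items():
    print(f"{name} F1: {total / len(xsum_test):.4f}")  # mean F1 over the test set
```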
Similar to Gigaword, task 1 of DUC 2004 is a sentence summarization task. The dataset contains 500 documents with 35.6 tokens on average and summaries with 10.4 tokens on average. Due to its small size, neural models are typically trained on other datasets and only tested on DUC 2004. Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L recall @ 75 bytes.
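Because the limited-length recall setting truncates system output to 75 bytes, a small sketch of that truncation step is shown below; it only approximates the `-b 75` option of the official ROUGE-1.5.5 toolkit, and the example strings and helper name are made up for illustration.

```python
# Sketch: truncate a system summary to 75 bytes before computing ROUGE recall,
# approximating the "-b 75" limited-length setting used for DUC 2004.
from rouge_score import rouge_scorer

def truncate_bytes(text: str, limit: int = 75) -> str:
    """Keep at most `limit` bytes of the UTF-8 encoded text."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")

reference = "US officials say a suspect in the embassy bombings was arrested in Pakistan"
system = truncate_bytes("A suspect in the 1998 embassy bombings has reportedly been arrested in Pakistan")

# DUC 2004 provides multiple references per document; a single one is used here for brevity.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
print({name: round(score.recall, 3) for name, score in scorer.score(reference, system).items()})
```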
This dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain. For more details, refer to TL;DR: Mining Reddit to Learn Automatic Summarization.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper/Source | Code |
|---|---|---|---|---|---|
| Transformer + Copy (Gehrmann et al., 2019) | 22 | 6 | 17 | Generating Summaries with Finetuned Language Models | |
| Unified VAE + PGN (Choi et al., 2019) | 19 | 4 | 15 | VAE-PGN based Abstractive Model in Multi-stage Architecture for Text Summarization | |
This dataset contains approximately 10 million (webpage content, abstractive snippet) pairs and 3.5 million (query term, webpage content, abstractive snippet) triples for the novel task of (query-biased) abstractive snippet generation for web pages. The corpus is compiled from ClueWeb09, ClueWeb12 and the DMOZ Open Directory Project. For more details, refer to Abstractive Snippet Generation.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Usefulness | Paper/Source | Code |
|---|---|---|---|---|---|---|
| Anchor-context + Query biased (Chen et al., 2020) | 25.7 | 5.2 | 20.1 | 66.18 | Abstractive Snippet Generation | |
Sentence compression produces a shorter sentence by removing redundant information, preserving the grammaticality and the important content of the original sentence.
The Google dataset was built by Filippova et al., 2013 (Overcoming the Lack of Parallel Data in Sentence Compression). The first release contained only 10,000 sentence-compression pairs, but an additional 200,000 pairs were released later.
Example of a sentence-compression pair:
Sentence: Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, according to promoters Golden Boy
Compression: Floyd Mayweather is open to fighting Amir Khan in the future.
In short, this is a deletion-based task where the compression is a subsequence of the original sentence. Of the 10,000 pairs in the eval portion (repository), the first 1,000 sentences are used for automatic evaluation, and the 200,000 pairs are used for training.
Models are evaluated using the following metrics (a minimal sketch of both appears after the results table below):
- F1: token-level F1 between the tokens kept in the gold compression and the tokens kept in the generated compression.
- Compression rate (CR): the length of the compression in characters divided by the length of the original sentence.
| Model | F1 | CR | Paper / Source | Code |
|---|---|---|---|---|
| SLAHAN with syntactic information (Kamigaito et al. 2020) | 0.855 | 0.407 | Syntactically Look-Ahead Attention Network for Sentence Compression | https://github.com/kamigaito/SLAHAN |
| BiRNN + LM Evaluator (Zhao et al. 2018) | 0.851 | 0.39 | A Language Model based Evaluator for Sentence Compression | https://github.com/code4conference/code4sc |
| LSTM (Filippova et al., 2015) | 0.82 | 0.38 | Sentence Compression by Deletion with LSTMs | |
| BiLSTM (Wang et al., 2017) | 0.8 | 0.43 | Can Syntax Help? Improving an LSTM-based Sentence Compression Model for New Domains | |
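Below is a minimal sketch of how the two metrics above could be computed for a single sentence-compression pair, assuming both the gold and system compressions are token subsequences of the input sentence; the helper names are hypothetical and this is not the evaluation script used by the papers in the table.

```python
# Sketch: token-level F1 and compression rate (CR) for deletion-based
# sentence compression. Kept tokens are compared by their positions in the
# original sentence, assuming both compressions are subsequences of it.
def kept_positions(sentence_tokens, compression_tokens):
    """Greedily align the compression to the sentence and return the kept indices."""
    positions, start = [], 0
    for tok in compression_tokens:
        idx = sentence_tokens.index(tok, start)  # raises ValueError if not a subsequence
        positions.append(idx)
        start = idx + 1
    return set(positions)

def f1_and_cr(sentence, gold, system):
    sent_toks = sentence.split()
    gold_kept = kept_positions(sent_toks, gold.split())
    sys_kept = kept_positions(sent_toks, system.split())
    overlap = len(gold_kept & sys_kept)
    precision = overlap / len(sys_kept) if sys_kept else 0.0
    recall = overlap / len(gold_kept) if gold_kept else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    cr = len(system) / len(sentence)  # character-level compression rate
    return f1, cr

sentence = "Floyd Mayweather is open to fighting Amir Khan in the future , according to promoters"
gold = "Floyd Mayweather is open to fighting Amir Khan in the future"
system = "Floyd Mayweather is open to fighting Amir Khan"
print(f1_and_cr(sentence, gold, system))
```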