Summarization is the task of producing a shorter version of one or several documents that preserves most of the input's meaning.
For summarization, automatic metrics such as ROUGE and METEOR have serious limitations; among them, pn_summary provides only a single reference summary per article. Therefore, tracking progress and claiming state-of-the-art results based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
A few resources exist for abstractive and extractive summarization in Persian, but some are not available online or have no curators. Some of them, such as Pasokh, still appear in academic papers. Thanks to researchers' efforts in this field, a dataset called Persian News Summarization (known as pn_summary) has been prepared for both Persian summarization tasks and made available online.
The Persian News Summary (known as pn_summary) is a well-structured summarization dataset for the Persian language. It consists of 93,207 online news articles (out of 200,000 crawled articles) from 6 news agencies, in 18 news categories ranging from economy to tourism. Each document (article) includes the long original text as well as a human-generated summary. Models are evaluated with full-length F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L, and optionally METEOR.
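To make the reported scores easier to read, here is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L F1 can be computed for one candidate summary against a single reference. It is an illustration only, not the scoring script used by any of the listed papers: the helper names (`ngrams`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical, tokenization is plain whitespace splitting (real Persian evaluation would normally need proper normalization and tokenization), and ROUGE-L is computed over the whole summary as one sequence rather than sentence by sentence.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Combine an overlap count into an F1 score in [0, 1]."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap (whitespace tokens, an assumption)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 over the whole summary treated as a single token sequence."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    # Leaderboards usually report these values multiplied by 100.
    for name, score in [
        ("ROUGE-1", rouge_n(candidate, reference, 1)),
        ("ROUGE-2", rouge_n(candidate, reference, 2)),
        ("ROUGE-L", rouge_l(candidate, reference)),
    ]:
        print(f"{name}: {100 * score:.2f}")
```

Because every score depends on exact token overlap with one reference, a correct paraphrase that uses different words scores low, which is the limitation noted above.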
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
|---|---|---|---|---|---|---|
| BERT2BERT (ParsBERT) + mT5 (Farahani et al., 2020) | 44.01 | 25.07 | 37.76 | - | Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization | Official |
Pasokh is a summarization dataset covering 6 news categories from 7 news agencies. It comes in two forms, Single-Document (SD) and Multi-Document (MD), with 100 and 1,000 records, respectively. Each document comes with 5 sample summaries, both extractive and abstractive.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
|---|---|---|---|---|---|---|
| Based on NER (SD) (Khademi and Fakhredanesh, 2020) | 47.20 | 33.40 | - | - | Persian Automatic Text Summarization Based on Named Entity Recognition | - |
| Based on NER (SD) (Khademi et al., 2020) | 45.40 | 30.10 | - | - | Conceptual Text Summarizer: A new model in continuous vector space | - |
| Feature Extraction (SD) (Rezaei et al., 2019) | 78.00 | 71.00 | 74.00 | - | Features in Extractive Supervised Single-document Summarization: Case of Persian News | Official |
| Multi-Feature Extraction (SD) (Kermani and Ghanbari, 2019) | 48.70 | 42.60 | - | - | Extractive Persian Summarizer for News Websites | - |