The impact of term-weighting schemes and similarity measures on extractive multi-document text summarization

Automatic text summarization is currently a topic of great interest in many fields of knowledge. Extractive multi-document text summarization methods aim to condense the textual information in a document collection by covering its main content while reducing redundant information. The scientific literature offers different term-weighting schemes and similarity measures, both of which are necessary for implementing an automatic summarization system. However, to the best of the authors' knowledge, no study has analyzed the performance of the different schemes and measures. In this paper, all possible combinations of the most common term-weighting schemes and similarity measures used in extractive multi-document text summarization have been implemented, compared, and analyzed. Experiments have been performed with the Document Understanding Conferences (DUC) datasets, and model performance has been assessed with eight Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics and the execution time. Results show that the best term-weighting scheme is the term-frequency inverse-sentence-frequency scheme, and the best similarity measure is cosine similarity. Moreover, the combination of the two obtained the best average results in 87.5% of the ROUGE scores compared with the other combinations.
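
To make the best-performing combination concrete, the following is a minimal Python sketch of term-frequency inverse-sentence-frequency (TF-ISF) weighting paired with cosine similarity. It assumes one common formulation, in which the weight of a term in a sentence is its raw count multiplied by log(N / sf), where N is the number of sentences and sf is the number of sentences containing the term; the exact variant, preprocessing, and tokenization used in the paper may differ. The function names and the toy sentences are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Build sparse TF-ISF vectors (dict: term -> weight) for tokenized sentences.

    Assumed formulation: weight = tf(term, sentence) * log(N / sf(term)).
    """
    n = len(sentences)
    # Sentence frequency: number of sentences in which each term occurs.
    sf = Counter()
    for tokens in sentences:
        sf.update(set(tokens))
    vectors = []
    for tokens in sentences:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / sf[t]) for t, c in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy example (hypothetical data): two related sentences and one unrelated.
sentences = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stocks fell sharply today".split(),
]
vectors = tf_isf_vectors(sentences)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this toy input the first two sentences share several terms, so `sim_related` is positive, while the third sentence shares none with the first, so `sim_unrelated` is zero. In an extractive summarizer, pairwise similarities of this kind are typically used both to score how well a sentence covers the collection and to penalize redundancy among selected sentences.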
