Improving NCD accuracy by combining document segmentation and document distortion

publication date

  • January 2014

start page

  • 223

end page

  • 245

issue

  • 1

volume

  • 41

International Standard Serial Number (ISSN)

  • 0219-1377

Electronic International Standard Serial Number (EISSN)

  • 0219-3116

abstract

  • Compression distances have been applied to a broad range of domains because of their parameter-free nature, wide applicability and leading efficacy. However, they have a characteristic that can be a drawback in particular circumstances: when they are used to compare two objects of very different sizes, they do not consider them similar even if one is a substring of the other. This work addresses this issue when compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Document segmentation is used to tackle the above-mentioned drawback, while document distortion is used to help compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither or applying them separately. These results are consistent across datasets of diverse nature. © 2013, Springer-Verlag London.
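
The size-mismatch drawback described in the abstract can be illustrated with a minimal sketch. It assumes the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib standing in for the compressor C. The fixed-size, aligned segmentation and the helper names (`csize`, `ncd`, `segments`, `segmented_ncd`) are illustrative assumptions, not the segmentation method of the paper, and document distortion is not shown. The "documents" are synthetic random bytes, chosen only so that the effect is easy to see.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


def segments(doc: bytes, size: int) -> list:
    """Split a document into consecutive fixed-size chunks (assumed scheme)."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]


def segmented_ncd(query: bytes, doc: bytes) -> float:
    """Smallest NCD between the query and any query-sized segment of doc."""
    return min(ncd(query, seg) for seg in segments(doc, len(query)))


# Synthetic data: a 1000-byte query embedded in a 20000-byte document.
rng = random.Random(0)


def noise(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


query = noise(1000)
doc = noise(5000) + query + noise(14000)

whole = ncd(query, doc)           # query vs. the full, much larger document
best = segmented_ncd(query, doc)  # query vs. its best-matching segment

print(f"NCD against whole document: {whole:.3f}")
print(f"NCD against best segment:   {best:.3f}")
```

In this sketch the whole-document distance stays high even though the query is a substring of the document, because max(C(x), C(y)) is dominated by the larger object. Comparing against query-sized segments removes that imbalance. The query is placed on a segment boundary here for simplicity; a real segmentation would have to cope with unaligned matches.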

keywords

  • algorithmic information theory; data compression; document representation; information filtering; word removal; information theory; document segmentation; substring; information retrieval systems