Analysis and study on text representation to improve the accuracy of the normalized compression distance
Articles
Overview
published in
- AI COMMUNICATIONS Journal
publication date
- January 2012
start page
- 381
end page
- 384
issue
- 4
volume
- 25
Digital Object Identifier (DOI)
International Standard Serial Number (ISSN)
- 0921-7126
Electronic International Standard Serial Number (EISSN)
- 1875-8452
abstract
- This thesis takes a small step towards a better understanding of both the nature of texts and the nature of compression distances. It does so by exploring the effects that several distortion techniques have on the Normalized Compression Distance (NCD), one of the most successful distances in the family of compression distances. The experimental results show that changing the representation of texts by applying one of the explored distortion techniques can be beneficial in both NCD-based document clustering and NCD-based document search. © 2012 - IOS Press and the authors. All rights reserved.
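For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of the NCD in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The choice of zlib as the compressor is an assumption made for illustration; this record does not say which compressors the work used. The `mask_words` helper is a hypothetical example of a text distortion and should not be read as the authors' technique.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the inputs share much information; values
    near 1 suggest they share little. zlib is an assumed compressor.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def mask_words(text: bytes, words: set) -> bytes:
    """Hypothetical distortion: replace each listed word with asterisks
    of the same length, keeping the text length unchanged."""
    tokens = text.split(b" ")
    masked = [b"*" * len(t) if t in words else t for t in tokens]
    return b" ".join(masked)
```

A distorted representation can then be compared in the same way as the original, for example `ncd(mask_words(a, stop), mask_words(b, stop))` for two documents `a` and `b` and a word set `stop` chosen by the caller.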
Classification
subjects
- Information Science
keywords
- algorithmic information theory; data compression; document clustering; document retrieval; information filtering; information retrieval; information retrieval systems; information theory; normalized compression distance; text representation; word removal