compression-based text clustering; contextual information; word removal; distortion techniques; intrinsic nature; main characteristics; subject matters; text clustering; textual data; textual information; information systems; mathematical models; cluster analysis