A case-based reasoning system for recommendation of data cleaning algorithms in classification and regression tasks

authors

CORRALES MÚÑOZ, DAVID CAMILO
LEDEZMA ESPINO, AGAPITO ISMAEL
CORRALES, JUAN CARLOS

published in

APPLIED SOFT COMPUTING Journal

publication date

February 2020

start page

1

end page

13

volume

90

Digital Object Identifier (DOI)

https://doi.org/10.1016/j.asoc.2020.106180

full text

http://hdl.handle.net/10016/35917

International Standard Serial Number (ISSN)

1568-4946

Electronic International Standard Serial Number (EISSN)

1872-9681

abstract

Recently, advances in Information Technologies (social networks, mobile applications, Internet of Things, etc.) have generated a deluge of digital data, but converting these data into useful information for business decisions is a growing challenge. Exploiting this massive amount of data through the knowledge discovery (KD) process involves identifying valid, novel, potentially useful and understandable patterns in a huge volume of data. However, preparing the data is a non-trivial refinement task that requires technical expertise in data cleaning methods and algorithms. Consequently, using a suitable data analysis technique is a headache for inexpert users. To address these problems, we propose a case-based reasoning (CBR) system to recommend data cleaning algorithms for classification and regression tasks. In our approach, we represent the problem space by the meta-features of the dataset, its attributes, and the target variable. The solution space contains the data cleaning algorithms used for each dataset. We represent the cases through a Data Cleaning Ontology. The case retrieval mechanism is composed of a filter phase and a similarity phase. In the first phase, we define two filter approaches, based on clustering and quartile analysis, which retrieve a reduced number of relevant cases. The second phase ranks the cases retrieved by the filter approaches by scoring the similarity between the new case and each retrieved case. The proposed retrieval mechanism was evaluated by a panel of judges, who scored the similarity between a query case and every case in the case base (ground truth). Against the judges' ranking, the retrieval mechanism reaches an average precision of 94.5% in the top 3, 84.55% in the top 7, and 78.35% in the top 10.
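The two-phase retrieval described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the meta-features, the filter rule, or the similarity measure, so everything concrete here is an assumption made for illustration: the three meta-features (instance count, attribute count, missing-value ratio), the quartile rule with its `min_shared` threshold, the range-normalised distance, and the case names and solutions. The clustering-based filter is not shown.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical case base: meta-features are
# [number of instances, number of attributes, missing-value ratio].
CASES = [
    {"name": "A", "meta": [100, 5, 0.10], "solution": ["imputation"]},
    {"name": "B", "meta": [120, 6, 0.12], "solution": ["imputation", "outlier removal"]},
    {"name": "C", "meta": [5000, 40, 0.00], "solution": ["dimensionality reduction"]},
    {"name": "D", "meta": [8000, 60, 0.30], "solution": ["imputation", "label correction"]},
    {"name": "E", "meta": [150, 8, 0.08], "solution": ["outlier removal"]},
]


def quartile_bin(value, cuts):
    """Return 0..3, the index of the quartile bin that `value` falls into."""
    return bisect_right(cuts, value)


def filter_by_quartile(query, cases, min_shared=2):
    """Phase 1 (assumed rule): keep cases that share a quartile bin with
    the query on at least `min_shared` meta-features."""
    n_features = len(query)
    # Three quartile cut points per meta-feature, computed over the case base.
    cuts = [quantiles([c["meta"][i] for c in cases], n=4) for i in range(n_features)]
    kept = []
    for case in cases:
        shared = sum(
            quartile_bin(query[i], cuts[i]) == quartile_bin(case["meta"][i], cuts[i])
            for i in range(n_features)
        )
        if shared >= min_shared:
            kept.append(case)
    return kept


def similarity(query, meta, ranges):
    """Phase 2 (assumed measure): 1 minus the mean range-normalised
    absolute difference, so 1.0 means identical meta-features."""
    diffs = [
        min(abs(q - m) / r, 1.0) if r else 0.0
        for q, m, r in zip(query, meta, ranges)
    ]
    return 1.0 - sum(diffs) / len(diffs)


def retrieve(query, cases, k=3, min_shared=2):
    """Filter the case base, then return the top-k cases by similarity."""
    n_features = len(query)
    ranges = [
        max(c["meta"][i] for c in cases) - min(c["meta"][i] for c in cases)
        for i in range(n_features)
    ]
    candidates = filter_by_quartile(query, cases, min_shared)
    ranked = sorted(
        candidates,
        key=lambda c: similarity(query, c["meta"], ranges),
        reverse=True,
    )
    return ranked[:k]


QUERY = [110, 5, 0.11]  # meta-features of a hypothetical new dataset
for case in retrieve(QUERY, CASES):
    print(case["name"], case["solution"])
```

In this sketch the filter discards cases whose meta-features sit in different quartiles from the query before any similarity is computed, which is the role the abstract assigns to the first phase: reducing the number of cases that the ranking phase has to score.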
