Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets

authors

published in

EXPERT SYSTEMS WITH APPLICATIONS Journal

publication date

July 2020

start page

1

end page

44

issue

113264

volume

149

Digital Object Identifier (DOI)

https://doi.org/10.1016/j.eswa.2020.113264

full text

http://hdl.handle.net/10016/30751

International Standard Serial Number (ISSN)

0957-4174

Electronic International Standard Serial Number (EISSN)

1873-6793

abstract

Hellinger Distance (HD) is a splitting metric that has been shown to have an excellent performance for imbalanced classification problems for methods based on Bagging of trees, while also showing good performance for balanced problems. Given that Random Forests (RF) use Bagging as one of two fundamental techniques to create diversity in the ensemble, it could be expected that HD is also effective for this ensemble method. The main aim of this article is to carry out an extensive investigation on important aspects about the use of HD in RF, including handling of multi-class problems, hyper-parameter optimization, metrics comparison, probability estimation, and metrics combination. In particular, HD is compared to other commonly used splitting metrics (Gini and Gain Ratio) in several contexts: balanced/imbalanced and two-class/multi-class. Two aspects related to classification problems are assessed: classification itself and probability estimation. HD is defined for two-class problems, but there are several ways in which it can be extended to deal with multi-class and this article studies the performance of the available options. Finally, even though HD can be used as an alternative to other splitting metrics, there is no reason to limit RF to use just one of them. Therefore, the final study of this article is to determine whether selecting the splitting metric using cross-validation on the training data can improve results further. Results show HD to be a robust measure for RF, with some weakness for balanced multi-class datasets (especially for probability estimation). Combination of metrics is able to result in a more robust performance. However, experiments of HD with text datasets show Gini to be more suitable than HD for this kind of problems.

Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets Articles

Overview

authors

published in

publication date

start page

end page

issue

volume

Digital Object Identifier (DOI)

full text

International Standard Serial Number (ISSN)

Electronic International Standard Serial Number (EISSN)

abstract

Classification

subjects

keywords