Electronic International Standard Serial Number (EISSN)
1873-765X
abstract
Feature selection is a recurrent research topic in modern regression analysis, which strives to build interpretable models, using sparsity as a proxy, without sacrificing predictive power. The best subset selection problem is central to this statistical task: its goal is to identify the subset of covariates of a given size that provides the best fit in terms of an empirical loss function. In this work, we address the problem of feature and functional form selection in additive regression models under a mathematical optimization lens. Penalized splines (P-splines) are used to estimate the smooth functions involved in the regression equation, which allows us to state the feature selection problem as a cardinality-constrained mixed-integer quadratic program (MIQP) in terms of both linear and non-linear covariates. To strengthen this MIQP formulation, we develop tight bounds for the regression coefficients. A matheuristic approach, which encompasses a preprocessing step, the construction of a warm-start solution, the MIQP formulation, and the large neighborhood search metaheuristic paradigm, is proposed to handle larger instances of the feature and functional form selection problem. The performance of the exact and the matheuristic approaches is compared on simulated data. Furthermore, our matheuristic is compared to other methodologies in the literature that have publicly available implementations, using both simulated and real-world data. We show that the stated approach is competitive in terms of predictive power and in the selection of the correct subset of covariates with the appropriate functional form. A public Python library containing implementations of all the methodologies developed in this paper is available.
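As an illustration of the best subset selection problem the abstract refers to, the sketch below enumerates all covariate subsets of a given cardinality and keeps the one with the smallest residual sum of squares. This brute-force enumeration is only a didactic stand-in for the paper's MIQP formulation (which a MIP solver would handle at scale); the function name and the simulated data are hypothetical, not taken from the paper's library.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Brute-force best subset selection: among all column subsets of
    size k, return the one minimizing the residual sum of squares.
    Stands in for the cardinality-constrained MIQP described above."""
    n, p = X.shape
    best_rss, best_S = np.inf, None
    for S in itertools.combinations(range(p), k):
        Xs = X[:, S]
        # Least-squares fit restricted to the columns in S
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        rss = float(r @ r)
        if rss < best_rss:
            best_rss, best_S = rss, S
    return best_S, best_rss

# Simulated example: only covariates 1 and 4 carry signal
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = 2 * X[:, 1] - 3 * X[:, 4] + 0.1 * rng.standard_normal(100)
S, rss = best_subset(X, y, 2)
print(S)  # the true support (1, 4) is recovered
```

The exhaustive search is combinatorial in the number of covariates, which is precisely why the paper resorts to an MIQP formulation with tightened coefficient bounds and a matheuristic for larger instances.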
Classification
subjects
Computer Science
Statistics
keywords
penalized splines; feature selection; functional form selection; matheuristic; mixed-integer optimization