Feature and functional form selection in additive models via mixed-integer optimization

publication date

  • April 2025

start page

  • 106945

volume

  • 176

International Standard Serial Number (ISSN)

  • 0305-0548

Electronic International Standard Serial Number (EISSN)

  • 1873-765X

abstract

  • Feature selection is a recurrent research topic in modern regression analysis, which strives to build interpretable models, using sparsity as a proxy, without sacrificing predictive power. The best subset selection problem is central to this statistical task: its goal is to identify the subset of covariates of a given size that provides the best fit in terms of an empirical loss function. In this work, we address the problem of feature and functional form selection in additive regression models under a mathematical optimization lens. Penalized splines (P-splines) are used to estimate the smooth functions involved in the regression equation, which allows us to state the feature selection problem as a cardinality-constrained mixed-integer quadratic program (MIQP) in terms of both linear and non-linear covariates. To strengthen this MIQP formulation, we develop tight bounds for the regression coefficients. A matheuristic approach, which encompasses the use of a preprocessing step, the construction of a warm-start solution, the MIQP formulation and the large neighborhood search metaheuristic paradigm, is proposed to handle larger instances of the feature and functional form selection problem. The performance of the exact and the matheuristic approaches is compared on simulated data. Furthermore, our matheuristic is compared to other methodologies in the literature that have publicly available implementations, using both simulated and real-world data. We show that the stated approach is competitive in terms of predictive power and in the selection of the correct subset of covariates with the appropriate functional form. A public Python library is available with all the implementations of the methodologies developed in this paper.
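
    For orientation, the best subset selection problem mentioned in the abstract is commonly written as a cardinality-constrained MIQP of the following generic form. This is a textbook sketch for linear covariates only, not the paper's formulation, which also covers P-spline terms and is not reproduced in this record. The symbols are assumptions introduced here: response y, design matrix X with p columns, coefficients \beta, binary selection indicators z, subset size k, and a coefficient bound M (a big-M constant, presumably where tight coefficient bounds such as those the abstract describes would enter).

    ```latex
    \begin{aligned}
    \min_{\beta \in \mathbb{R}^{p},\; z \in \{0,1\}^{p}} \quad
      & \lVert y - X\beta \rVert_2^2 \\
    \text{subject to} \quad
      & -M z_j \le \beta_j \le M z_j, \qquad j = 1, \dots, p, \\
      & \sum_{j=1}^{p} z_j \le k .
    \end{aligned}
    ```

    Setting z_j = 0 forces \beta_j = 0, so at most k covariates enter the fit. A smaller valid M generally gives a tighter continuous relaxation.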

subjects

  • Computer Science
  • Statistics

keywords

  • penalized splines; feature selection; functional form selection; matheuristic; mixed-integer optimization