Distributed nonlinear semiparametric support vector machine for big data applications on Spark frameworks

In recent years there has been a noticeable increase in the number of available Big Data infrastructures. This has driven the adaptation of traditional machine learning techniques to address large-scale problems in distributed environments. Kernel methods such as support vector machines (SVMs) suffer from scalability problems because of their nonparametric nature and the complexity of their training procedures. In this paper, we propose a new and efficient distributed implementation of a training procedure for nonlinear semiparametric (budgeted) SVMs, called distributed iterative reweighted least squares (IRWLS). The algorithm uses k-means to select the centroids of the semiparametric model and a new distributed implementation of the IRWLS optimization procedure to find the weights of the model. We have implemented the proposed algorithm in Apache Spark and benchmarked it against other state-of-the-art methods, both full SVM (p-pack SVM) and budgeted (budgeted stochastic gradient descent). Experimental results show that the proposed algorithm achieves higher accuracy while controlling the size of the final model, and that it also offers high performance in terms of run time and efficiency when processing very large datasets (the computation time grows linearly with the number of training patterns).
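The two stages named in the abstract (k-means for the centroids, IRWLS for the weights) can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: it uses NumPy only, simulates data partitions with `np.array_split`, and assumes a Gaussian kernel, a hinge loss, and an unpenalized bias term. All function names and parameters (`train`, `partial_sums`, `gamma`, `C`, `n_parts`) are hypothetical choices made for illustration. The point it tries to show is that each IRWLS step only needs per-partition sums of small matrices, which is why such a step maps naturally onto a map/reduce pattern.

```python
import numpy as np

def rbf_kernel(X, centroids, gamma):
    # Gaussian kernel between every row of X and every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kmeans(X, m, rng, iters=20):
    # Plain Lloyd iterations (assumption: any k-means variant would do here).
    centroids = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def partial_sums(Xp, yp, centroids, beta, gamma, C):
    # "Map" step: one partition's contribution to the weighted normal equations.
    Kp = np.hstack([rbf_kernel(Xp, centroids, gamma), np.ones((len(Xp), 1))])
    e = 1.0 - yp * (Kp @ beta)                          # hinge-loss errors
    a = np.where(e > 0, C / np.maximum(e, 1e-4), 0.0)   # IRWLS sample weights
    return Kp.T @ (a[:, None] * Kp), Kp.T @ (a * yp)

def train(X, y, m=8, gamma=1.0, C=10.0, n_parts=4, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans(X, m, rng)
    # Regularizer: kernel matrix among centroids; the bias row/column stays zero.
    Kc = np.zeros((m + 1, m + 1))
    Kc[:m, :m] = rbf_kernel(centroids, centroids, gamma)
    beta = np.zeros(m + 1)
    # Simulated partitions; in a cluster these would be distributed chunks.
    parts = list(zip(np.array_split(X, n_parts), np.array_split(y, n_parts)))
    for _ in range(iters):
        sums = [partial_sums(Xp, yp, centroids, beta, gamma, C) for Xp, yp in parts]
        # "Reduce" step: add the small (m+1)x(m+1) matrices and solve.
        A = sum(s[0] for s in sums) + Kc + 1e-8 * np.eye(m + 1)
        b = sum(s[1] for s in sums)
        new_beta = np.linalg.solve(A, b)
        converged = np.linalg.norm(new_beta - beta) < 1e-6
        beta = new_beta
        if converged:
            break
    return centroids, beta

def predict(X, centroids, beta, gamma=1.0):
    K = np.hstack([rbf_kernel(X, centroids, gamma), np.ones((len(X), 1))])
    return np.sign(K @ beta)
```

In this sketch the model size is fixed by `m` (the number of centroids), and the work per iteration is one pass over the training patterns plus one small linear solve, which is consistent with the linear growth in computation time that the abstract reports. Labels are assumed to be in {-1, +1}.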
