Handling Ill-Conditioned Omics Data with Deep Probabilistic Models

publication date

  • September 2023

issue

  • 9

volume

  • 27

International Standard Serial Number (ISSN)

  • 2168-2194

Electronic International Standard Serial Number (EISSN)

  • 2168-2208

abstract

  • The advent of high-throughput technologies has increased the dimensionality of omics datasets, which limits the application of machine learning methods because of the severe imbalance between the number of observations and the number of features. In this scenario, dimensionality reduction is essential to extract the relevant information within these datasets and project it into a low-dimensional space, and probabilistic latent space models are becoming popular given their capability to capture the underlying structure of the data as well as the uncertainty in the information. This article aims to provide a general classification and dimensionality reduction method based on deep latent space models that tackles two of the main problems that arise in omics datasets: the presence of missing data and the limited number of observations relative to the number of features. We propose a semi-supervised Bayesian latent space model that infers a low-dimensional embedding driven by the target label: the Deep Bayesian Logistic Regression (DBLR) model. During inference, the model also learns a global vector of weights that allows it to make predictions given the low-dimensional embedding of the observations. Since this kind of dataset is prone to overfitting, we introduce an additional probabilistic regularization method based on the semi-supervised nature of the model. We compared the performance of the DBLR against several state-of-the-art methods for dimensionality reduction, on both synthetic and real datasets with different data types. The proposed model provides more informative low-dimensional representations, outperforms the baseline methods in classification, and can naturally handle missing entries.
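
The prediction step mentioned in the abstract (a global weight vector applied to a low-dimensional embedding) can be illustrated with a minimal sketch. This is not the DBLR inference procedure: it assumes the embedding `Z` and the weights `w` are already available as point estimates, and the names `predict_proba`, `Z`, `w`, and `b` are hypothetical, chosen only for illustration.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def predict_proba(Z, w, b=0.0):
    """Class-1 probability for each row of Z.

    Z : (n_samples, k) low-dimensional embedding (assumed given here).
    w : (k,) global weight vector shared by all observations.
    b : scalar bias (an assumption; the abstract does not mention one).
    """
    return sigmoid(Z @ w + b)


# Toy data: 5 observations embedded in a 2-dimensional latent space.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
w = np.array([1.5, -0.5])  # hypothetical weights, not learned values

probs = predict_proba(Z, w)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 for a hard label
```

Because the classifier acts only on the `k` latent dimensions rather than on the original features, the number of weights stays small even when the feature count far exceeds the number of observations.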

keywords

  • Bayesian; classification; deep generative model; dimensionality reduction; latent space model; missing data; semi-supervised; VAE