netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis

  1. Benjamin J. Raphael (2)
  1. Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA;
  2. Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA;
  3. Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08540, USA;
  4. Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
  • Present addresses: (5) SAMSI and Department of Statistical Science, Duke University, USA; (6) Genomics plc.

  • Corresponding author: braphael{at}princeton.edu
  • Abstract

    Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene–gene interactions, encouraging pairs of genes with known interactions to lie close to each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene–gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
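
    The idea of network-regularized non-negative matrix factorization can be illustrated with a minimal sketch. This is not the netNMF-sc implementation: it assumes a squared-error loss and the standard multiplicative updates for graph-regularized NMF, and the function name `network_regularized_nmf`, its parameters (`lam`, `n_iter`, `eps`), and the genes-by-cells orientation of `X` are illustrative choices. The penalty `lam * trace(W.T @ L @ W)`, with `L` the Laplacian of the gene–gene network `A`, pulls the low-dimensional representations of interacting genes toward each other.

    ```python
    import numpy as np

    def network_regularized_nmf(X, A, k, lam=1.0, n_iter=200, eps=1e-10, seed=0):
        """Sketch: factor X (genes x cells) as W @ H with a gene-network penalty.

        Minimizes ||X - W H||_F^2 + lam * trace(W^T L W), where L = D - A is the
        Laplacian of the symmetric, non-negative gene-gene adjacency matrix A.
        Assumption: squared-error loss with multiplicative updates, chosen for
        brevity; the published method may use a different loss and optimizer.
        """
        rng = np.random.default_rng(seed)
        n_genes, n_cells = X.shape
        W = rng.random((n_genes, k)) + eps   # gene loadings, kept non-negative
        H = rng.random((k, n_cells)) + eps   # cell loadings, kept non-negative
        D = np.diag(A.sum(axis=1))           # degree matrix of the network
        L = D - A                            # graph Laplacian
        history = []
        for _ in range(n_iter):
            # Multiplicative updates preserve non-negativity of W and H.
            W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
            H *= (W.T @ X) / ((W.T @ W) @ H + eps)
            loss = np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)
            history.append(loss)
        return W, H, history

    # Toy usage: 6 genes x 8 cells, with two known gene-gene interactions.
    rng = np.random.default_rng(1)
    X_toy = rng.poisson(3.0, size=(6, 8)).astype(float)
    A_toy = np.zeros((6, 6))
    A_toy[0, 1] = A_toy[1, 0] = 1.0
    A_toy[2, 3] = A_toy[3, 2] = 1.0
    W_toy, H_toy, history_toy = network_regularized_nmf(X_toy, A_toy, k=2, lam=0.5)
    X_imputed = W_toy @ H_toy  # dense reconstruction, including zero entries
    ```

    In this sketch the product `W @ H` plays the role of the imputed expression matrix, and the columns of `H` give a low-dimensional representation of each cell that could be passed to a clustering method.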

    Footnotes

    • Received April 18, 2019.
    • Accepted November 19, 2019.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
