Automated quality control and cell identification of droplet-based single-cell data using dropkick
- Cody N. Heiser1,2,
- Victoria M. Wang1,3,
- Bob Chen1,2,
- Jacob J. Hughey2,4,5 and
- Ken S. Lau1,2,6,7
- 1Epithelial Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA;
- 2Program in Chemical and Physical Biology, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA;
- 3Department of Computer Science, Vanderbilt University, Nashville, Tennessee 37232, USA;
- 4Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA;
- 5Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37232, USA;
- 6Department of Cell and Developmental Biology, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA;
- 7Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA
Abstract
A major challenge for droplet-based single-cell sequencing technologies is distinguishing true cells from uninformative barcodes in data sets with disparate library sizes confounded by high technical noise (i.e., batch-specific ambient RNA). We present dropkick, a fully automated software tool for quality control and filtering of single-cell RNA sequencing (scRNA-seq) data with a focus on excluding ambient barcodes and recovering real cells bordering the quality threshold. By automatically determining data set–specific training labels based on predictive global heuristics, dropkick learns a gene-based representation of real cells and ambient noise, calculating a cell probability score for each barcode. Using simulated and real-world scRNA-seq data, we benchmarked dropkick against conventional thresholding approaches and EmptyDrops, a popular computational method, showing greater recovery of rare cell types and exclusion of empty droplets and noisy, uninformative barcodes. We show for both low- and high-background data sets that dropkick's weakly supervised model reliably learns which genes are enriched in ambient barcodes and draws a multidimensional boundary that is more robust to data set–specific variation than existing filtering approaches. dropkick provides a fast, automated tool for reproducible cell identification from scRNA-seq data that is critical to downstream analysis and compatible with popular single-cell Python packages.
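The weakly supervised idea described above can be sketched on toy data: a global heuristic supplies noisy training labels, a classifier learns a gene-based representation from them, and each barcode receives a cell probability. This is a minimal illustration, not dropkick's implementation. The simulated expression profiles, the cutoff at the mean log total count, the plain logistic regression, and every function name below are assumptions chosen for this example.

```python
import math
import random


def simulate_barcodes(n_cells=60, n_low_cells=5, n_empty=140, n_genes=8, seed=0):
    """Toy data (hypothetical): returns (counts, kinds), kinds in {'cell', 'low_cell', 'empty'}."""
    rng = random.Random(seed)
    half = n_genes // 2
    # Assumed profiles: the first half of the genes is enriched in ambient barcodes.
    ambient_profile = [4.0] * half + [1.0] * (n_genes - half)
    cell_profile = [1.0] * half + [4.0] * (n_genes - half)
    spec = [
        ("cell", n_cells, cell_profile, (200, 400)),
        ("low_cell", n_low_cells, cell_profile, (45, 65)),  # real cells with small libraries
        ("empty", n_empty, ambient_profile, (20, 60)),
    ]
    counts, kinds = [], []
    for kind, n, profile, (lo, hi) in spec:
        for _ in range(n):
            depth = rng.randint(lo, hi)
            row = [0] * n_genes
            # Draw `depth` transcripts, each assigned to a gene by the profile weights.
            for g in rng.choices(range(n_genes), weights=profile, k=depth):
                row[g] += 1
            counts.append(row)
            kinds.append(kind)
    return counts, kinds


def heuristic_labels(counts):
    """Noisy training labels: 1 if log total counts exceeds the mean log total, else 0."""
    log_totals = [math.log(sum(row)) for row in counts]
    cutoff = sum(log_totals) / len(log_totals)
    return [1 if lt > cutoff else 0 for lt in log_totals]


def gene_fractions(counts):
    """Rescale each barcode to gene fractions (times n_genes) to remove library size."""
    return [[c * len(row) / sum(row) for c in row] for row in counts]


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Cell probability for one barcode under the fitted linear model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Unregularized logistic regression fitted by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    counts, kinds = simulate_barcodes()
    labels = heuristic_labels(counts)
    X = gene_fractions(counts)
    w, b = fit_logistic(X, labels)
    for kind in ("cell", "low_cell", "empty"):
        probs = [predict(w, b, x) for x, k in zip(X, kinds) if k == kind]
        print(kind, round(sum(probs) / len(probs), 3))
```

In this toy setup the small-library cells are meant to fall below the heuristic cutoff, so they enter training labeled as ambient. Because the classifier sees gene composition rather than library size, it can still score them as cell-like, which is the kind of recovery near the quality threshold that the abstract describes.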
Footnotes
- [Supplemental material is available for this article.]
- Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.271908.120.
- Freely available online through the Genome Research Open Access option.
- Received October 1, 2020.
- Accepted March 3, 2021.
This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.











