An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data
- * Corresponding author; email: jtlu{at}bcm.edu
Abstract
Next-generation sequencing is a powerful approach for discovering genetic variation, but sensitive variant calling and haplotype inference from population sequencing data remain challenging. We describe methods for high-quality discovery, genotyping, and phasing of SNPs from low-coverage (~5X) sequencing of populations, implemented in a pipeline called SNPTools. Our pipeline contains several innovations that specifically address the challenges of low-coverage population sequencing: (1) Effective Base Depth (EBD), a non-parametric statistic that enables more accurate statistical modeling of sequencing data; (2) Variance Ratio Scoring, a variance-based statistic that discovers polymorphic loci with high sensitivity and specificity; and (3) BAM-specific Binomial Mixture Modeling (BBMM), a clustering algorithm that generates robust genotype likelihoods from heterogeneous sequencing data. Lastly, we develop an imputation engine that refines raw genotype likelihoods to produce high-quality phased genotypes/haplotypes. Designed for large population studies, SNPTools has an input/output (I/O)- and storage-aware design that improves computing performance on large sequencing datasets. We apply SNPTools to the International 1000 Genomes Project (1000G) Phase 1 low-coverage dataset and obtain genotyping accuracy comparable to that of SNP microarrays.
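To make the notion of a "raw genotype likelihood" concrete, the following is a minimal sketch of a textbook binomial read-count model at a single biallelic site. It is not the BBMM algorithm described above, which estimates its parameters per BAM file; the function names and the fixed `error_rate` are assumptions made purely for illustration.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each genotype.

    Returns [L(hom-ref), L(het), L(hom-alt)]. The fixed error_rate is an
    illustrative assumption, not a value taken from SNPTools.
    """
    depth = ref_reads + alt_reads
    # Expected fraction of alternate-allele reads for 0, 1, or 2 alt copies.
    alt_fraction = (error_rate, 0.5, 1.0 - error_rate)
    return [
        comb(depth, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for p in alt_fraction
    ]


def normalize(likelihoods):
    """Scale likelihoods so that they sum to one."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


if __name__ == "__main__":
    # Roughly 5X coverage: 3 reference reads and 2 alternate reads.
    print(normalize(genotype_likelihoods(3, 2)))
    # A sample with no reads at this site: every genotype is equally likely.
    print(normalize(genotype_likelihoods(0, 0)))
```

At low coverage many samples have few or no reads at a given site, so per-sample likelihoods of this kind are often flat or ambiguous; this is the gap that a refinement step such as the imputation engine mentioned above is meant to close.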
- Received July 16, 2012.
- Accepted December 27, 2012.
- © 2013, Published by Cold Spring Harbor Laboratory Press
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.











