A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets

Joshua M Akey (University of Washington)

Corresponding author; email: akeyj@u.washington.edu

Abstract

A continuing challenge in the analysis of massively large sequencing datasets is quantifying and interpreting non-neutrally evolving mutations. Here, we describe a flexible and robust approach based on the site frequency spectrum to estimate the fraction of deleterious and adaptive variants from large-scale sequencing datasets. We applied our method to ~1 million SNVs identified in high-coverage exome sequences of 6,515 individuals. We estimate that the fraction of deleterious nonsynonymous SNVs is higher than previously reported, quantify the effects of genomic context, codon bias, chromatin accessibility, and number of protein-protein interactions on deleterious protein-coding SNVs, and identify pathways and networks that have likely been influenced by positive selection. Furthermore, we show that the fraction of deleterious nonsynonymous SNVs is significantly higher for Mendelian versus complex disease loci and in exons harboring dominant versus recessive Mendelian mutations. In summary, as genome-scale sequencing data accumulates in progressively larger sample sizes, our method will enable increasingly high-resolution inferences into the characteristics and determinants of non-neutral variation.

  • Received December 7, 2015.
  • Accepted April 14, 2016.

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

ACCEPTED MANUSCRIPT

Genome Res. gr.203059.115. Published by Cold Spring Harbor Laboratory Press.
