Discordant calls across genotype discovery approaches elucidate variants with systematic errors

  Mark J. Daly (1, 2, 11, 12)

  1. Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  2. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  3. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
  4. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  5. The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, Ohio 43215, USA
  6. Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio 43210, USA
  7. ITMO University, Saint-Petersburg, 197101, Russia
  8. Almazov National Medical Research Center, St. Petersburg, 197341, Russia
  9. Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  10. Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, Darlinghurst, New South Wales 2010, Australia
  11. Institute for Molecular Medicine Finland, University of Helsinki, FI-00290 Helsinki, Finland
  12. These authors contributed equally to this work.

  • Corresponding authors: elizabeth.atkinson{at}bcm.edu, mykyta.artomov{at}nationwidechildrens.org
  • Abstract

    Large-scale high-throughput sequencing data sets have been transformative for informing clinical variant interpretation and for use as reference panels for statistical and population genetic efforts. Although such resources are often treated as ground truth, we find that in widely used reference data sets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters, yet are systematically different in their genotype calls across genotype discovery approaches. The inclusion of such discordant sites in study designs involving multiple genotype discovery strategies could bias results and lead to false-positive hits in association studies owing to technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across genotype discovery approaches, characterize the error mode of wrong calls, provide a list of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other data sets. We find that different genotype discovery approaches have different sets of variants at which this problem occurs, but there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, although different populations are powered for the discovery of different variants. We find that the most common error mode is that of a variant being heterozygous for one approach and homozygous for the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls.
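    The error modes described in the abstract can be illustrated with a small tally of paired genotype calls. The sketch below is not the metric or classifier from the paper. It assumes a hypothetical setting in which the same samples have calls from both approaches at a single variant, with genotypes encoded as alternate-allele counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, None = missing). The function names and toy data are illustrative only.

```python
from collections import Counter

# Assumed encoding: alternate-allele count per sample; None marks a missing call.
LABELS = {0: "hom_ref", 1: "het", 2: "hom_alt"}


def tally_error_modes(genome_calls, exome_calls):
    """Count disagreeing (genome, exome) genotype pairs for one variant.

    Returns a Counter keyed by (genome_label, exome_label) and the number
    of samples with a non-missing call in both approaches.
    """
    if len(genome_calls) != len(exome_calls):
        raise ValueError("call lists must cover the same samples in the same order")
    modes = Counter()
    compared = 0
    for g, e in zip(genome_calls, exome_calls):
        if g is None or e is None:
            continue  # skip samples missing a call in either approach
        compared += 1
        if g != e:
            modes[(LABELS[g], LABELS[e])] += 1
    return modes, compared


def discordance_rate(genome_calls, exome_calls):
    """Fraction of jointly called samples whose genotypes disagree."""
    modes, compared = tally_error_modes(genome_calls, exome_calls)
    return sum(modes.values()) / compared if compared else float("nan")


# Toy data (hypothetical): eight samples at one site.
genome = [1, 1, 0, 2, 1, None, 0, 1]
exome = [0, 0, 0, 2, 1, 1, 0, 2]

modes, compared = tally_error_modes(genome, exome)
for (g_label, e_label), n in modes.most_common():
    print(f"genome={g_label} exome={e_label}: {n}")
print(f"discordance rate: {discordance_rate(genome, exome):.3f} over {compared} samples")
```

    In this toy example the most frequent disagreement is heterozygous in the genome call and homozygous reference in the exome call, the pattern the abstract reports as the majority of miscalls.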

    Footnotes

    • Received March 20, 2023.
    • Accepted May 19, 2023.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
