Rapid SARS-CoV-2 surveillance using clinical, pooled, or wastewater sequence as a sensor for population change

  1. Barun Mathema (affiliation 6)
  1. Institute for Comparative Genomics, American Museum of Natural History, New York, New York 10024, USA;
  2. Section for Hologenomics, The Globe Institute, University of Copenhagen, DK-1353 Copenhagen, Denmark;
  3. Department of Ecology, Evolution, and Environmental Biology, Columbia University, New York, New York 10027, USA;
  4. Division of Pediatric Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA;
  5. Department of Pediatrics, Perelman College of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA;
  6. Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, New York 10032, USA
  • Corresponding authors: anarechania{at}amnh.org, bm2055{at}cumc.columbia.edu, planetp{at}chop.edu
  • Abstract

    The COVID-19 pandemic has highlighted the critical role of genomic surveillance for guiding policy and control. Timeliness is key, but sequence alignment and phylogeny slow most surveillance techniques. Millions of SARS-CoV-2 genomes have been assembled. Phylogenetic methods are ill equipped to handle this sheer scale. We introduce a pangenomic measure that examines the information diversity of a k-mer library drawn from a country's complete set of clinical, pooled, or wastewater sequence. Quantifying diversity is central to ecology. Hill numbers, or the effective number of species in a sample, provide a simple metric for comparing species diversity across environments. The more diverse the sample, the higher the Hill number. We adopt this ecological approach and consider each k-mer an individual and each genome a transect in the pangenome of the species. Structured in this way, Hill numbers summarize the temporal trajectory of pandemic variants, collapsing each day's assemblies into genome equivalents. For pooled or wastewater sequence, we instead compare days using survey sequence divorced from individual infections. Across data from the UK, USA, and South Africa, we trace the ascendance of new variants of concern as they emerge in local populations well before these variants are named and added to phylogenetic databases. Using data from San Diego wastewater, we monitor these same population changes from raw, unassembled sequence. This history of emerging variants senses all available data as it is sequenced, intimating variant sweeps to dominance or declines to extinction at the leading edge of the COVID-19 pandemic.
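
To make the Hill number idea in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the standard definition of the Hill number of order q, D_q = (sum_i p_i^q)^(1/(1-q)), with the q = 1 case taken as the exponential of Shannon entropy, and it treats each distinct k-mer as a "species" whose proportion p_i is its share of all k-mers counted. The function names `kmer_counts` and `hill_number`, the value of k, and the toy sequences are hypothetical and chosen only for illustration. The transect-style structuring across genomes described in the abstract is not reproduced here.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count k-mers across a set of sequences (hypothetical helper).

    K-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def hill_number(counts, q):
    """Hill number (effective number of types) of order q.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers counted")
    props = [c / total for c in counts.values() if c > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    # Toy data only: two short made-up sequences, k chosen arbitrarily.
    day_sample = ["ACGTACGTGG", "ACGTTCGTGG"]
    counts = kmer_counts(day_sample, k=3)
    for q in (0, 1, 2):
        print(q, hill_number(counts, q))
```

When every k-mer is equally frequent, all orders of q return the number of distinct k-mers; as a few k-mers come to dominate, higher orders of q fall below the richness. This matches the abstract's statement that more diverse samples yield higher Hill numbers.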

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.278594.123.

    • Freely available online through the Genome Research Open Access option.

    • Received October 3, 2023.
    • Accepted September 11, 2024.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
