Accurate estimation of intraspecific microbial gene content variation in metagenomic data with MIDAS v3 and StrainPGC

  1. Katherine S. Pollard (affiliations 1, 2, 8)
  1. The Gladstone Institute of Data Science and Biotechnology, San Francisco, California 94158, USA;
  2. Chan Zuckerberg Biohub San Francisco, San Francisco, California 94158, USA;
  3. Department of Biomedical Engineering, University of Calgary, Calgary, Alberta T2N 1N4, Canada;
  4. Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel;
  5. Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 7610001, Israel;
  6. Department of Gastroenterology, University of California, San Francisco, California 94115, USA;
  7. Benioff Center for Microbiome Medicine, Department of Medicine, University of California San Francisco, San Francisco, California 94143, USA;
  8. Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94158, USA
  • Corresponding author: katherine.pollard@gladstone.ucsf.edu
  • Abstract

    Metagenomics has greatly expanded our understanding of the human gut microbiome by revealing a vast diversity of bacterial species within and across individuals. Even within a single species, different strains can have highly divergent gene content, affecting traits such as antibiotic resistance, metabolism, and virulence. Methods that harness metagenomic data to resolve strain-level differences in functional potential are crucial for understanding the causes and consequences of this intraspecific diversity. The enormous size of pangenome references, strain mixing within samples, and inconsistent sequencing depth present challenges for existing tools that analyze samples one at a time. To address this gap, we updated the MIDAS pangenome profiler, now released as version 3, and developed StrainPGC, an approach to strain-specific gene content estimation that combines strain tracking and correlations across multiple samples. We validate our integrated analysis using a complex synthetic community of strains from the human gut and find that StrainPGC outperforms existing approaches. Analyzing a large, publicly available metagenome collection from inflammatory bowel disease patients and healthy controls, we catalog the functional repertoires of thousands of strains across hundreds of species, capturing extensive diversity missing from reference databases. Finally, we apply StrainPGC to metagenomes from a clinical trial of fecal microbiota transplantation for the treatment of ulcerative colitis. We identify two Escherichia coli strains, from two different donors, that are both frequently transmitted to patients but have notable differences in functional potential. StrainPGC and MIDAS v3 together enable precise, intraspecific pangenomic investigations using large collections of metagenomic data without microbial isolation or de novo assembly.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279543.124.

    • Freely available online through the Genome Research Open Access option.

    • Received May 3, 2024.
    • Accepted March 6, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
