Microbial strain-level population structure and genetic diversity from metagenomes

Among the human health conditions linked to microbial communities, phenotypes are often associated with only a subset of strains within causal microbial groups. Although it has been critical for decades in microbial physiology to characterize individual strains, this has been challenging when using culture-independent high-throughput metagenomics. We introduce StrainPhlAn, a novel metagenomic strain identification approach, and apply it to characterize the genetic structure of thousands of strains from more than 125 species in more than 1500 gut metagenomes drawn from populations spanning North and South American, European, Asian, and African countries. The method relies on per-sample dominant sequence variant reconstruction within species-specific marker genes. It identified primarily subject-specific strain variants (<5% inter-subject strain sharing), and we determined that a single strain typically dominated each species and was retained over time (for >70% of species). Microbial population structure was correlated in several distinct ways with the geographic structure of the host population. In some cases, discrete subspecies (e.g., for Eubacterium rectale and Prevotella copri) or continuous microbial genetic variations (e.g., for Faecalibacterium prausnitzii) were associated with geographically distinct human populations, whereas few strains occurred in multiple unrelated cohorts. We further estimated the genetic variability of gut microbes, with Bacteroides species appearing remarkably consistent (0.45% median number of nucleotide variants between strains), whereas P. copri was among the most plastic gut colonizers. We thus characterize here the population genetics of previously inaccessible intestinal microbes, providing a comprehensive strain-level genetic overview of the gut microbial diversity.


SUPPLEMENTARY TABLES
. Comparison of the markers reconstructed from the HMP mock communities with StrainPhlAn and MIDAS, versus the markers from the known reference genomes used to build the mock communities. For two mock communities with evenly and staggered distributed abundances, we report the rate of single-nucleotide errors, the absolute number of single nucleotide errors, and the length of the concatenated marker alignment obtained by StrainPhlAn and MIDAS. StrainPhlAn obtained errors below 0.1% for all reconstructed strains except for Staphylococcus aureus and Clostridium beijerinckii; however, when we performed metagenomic assembly, we confirmed that these two organisms have divergent genomes compared to the reference (only 98.98% and 99.57% average identity for the reconstructed contigs). These two genomes thus represent outliers for biological rather than validation reasons. MIDAS could only reconstruct 3 strains from both mock communities (missing reconstructions are marked with "NA") and it showed substantially higher error rates compared to those of StrainPhlAn. For StrainPhlAn, we compared the sequence of the reconstructed markers against the sequence of the markers of the genomes of the strains used in the mock community (the "true" genomes). For MIDAS we similarly compared the "true" genomes against the reference genome selected by the method edited with the suggested nucleotide variations, limiting the comparison to the regions of the genome that MIDAS reports in the output. Table S2. The comparison of StrainPhlAn and MIDAS performance on the synthetic dataset of Bacteroides dorei 20x coverage. We report the alignment length of strains obtained by two methods and the single nucleotide variant (SNV) between them and the reference genomes. MIDAS failed to reconstruct all Bacteroides dorei strains.  Table S7. StrainPhlAn accurately reconstructs the marker sequences of strains from gut metagenomes. We evaluated the accuracy of StrainPhlAn in reconstructing the markers of the genome of Bifidobacterium animalis subsp. lactis CNCM I-2494 in 7 MetaHIT samples (the subset of the 19 individuals subjected to B. animalis intake in which this species is present in the metagenome at >2x coverage). These 7 samples are from subjects that consumed a predefined fermented milk product containing Bifidobacterium animalis subsp. lactis CNCM I-2494 whose genome is publicly available. Our pipeline reconstructed the targeted strain for the 7 samples with coverage >2x achieving less than 0.01% single nucleotide errors, which is more than ten times smaller than the average nucleotide variation observed between strains (1.3%) computed on the markers from the sequenced reference genomes in this species ( Fig. S1. The StrainPhlAn pipeline. The input for StrainPhlAn is a set of metagenomic samples and, optionally, a collection of reference genomes. StrainPhlAn then reconstructs all dominant strains present in the samples by mapping the reads against the MetaPhlAn2 markers with Bowtie2 (8) and reconstructing the sample-specific consensus sequence for all the markers. Markers are also extracted form reference genomes (if provided by the user) using Blastn (10). For each species, the pipeline then reconstructs the concatenated multiple sequence alignment from the single consensus sequences and uses it to build the phylogenetic tree. Other output formats include ordination plots and heatmaps representing strain-level genetic relations.

StrainPhlAn MIDAS
Supplemental Fig. S2. Marker genes have a genetic variability consistent with that of core genes. For the 32 most prevalent gut microbiome species with at least three sequenced genomes, we analyzed the SNV variability of core genes (genes present in all the genomes in a species at >90% percentage identity) and the SNV variability of the marker genes used by StrainPhlAn. The boxplot reports the median SNV rate of each core or marker gene across the pairs of orthologues of the genes in the available reference genomes. In only 7 cases (asterisk appended to the species name) the twosided Kolmogorov-Smirnov statistical test found that SNV variabilities associated with core genes and marker genes have a significantly difference distribution, but the effect size of the variation is still small.
Supplemental Fig. S3. The distribution of SNV rates of marker genes is consistent with that of core genes. We report here the histograms of the genetic variability of marker genes and core genes computed as reported in Supplemental Fig. S2. The histograms confirm that the genetic variability of marker genes and core genes is distributed consistently. It also highlights that, for some species, a considerable fraction of core (and marker) genes show almost no variations in different strains, although this may be the consequence of a reduced diversity of the strains with sequenced genomes.

Supplemental Fig. S4. StrainPhlAn performance in reconstructing strains from 4 species (Bacteroides dorei, Bacteroides fragilis, Bacteroides ovatus, Bifidobacterium longum) using 72 synthetic and semi-synthetic datasets.
The synthetic samples were generated from the reference genomes of the four target species with simulated sequencing noise and confounding non-target species at increasing coverages (see Methods). Semi-synthetic datasets also include a large fraction of reads from real gut metagenomes (see Methods). The boxplots show the distances (in nucleotide difference rates) between the reconstructed markers of target strains from synthetic and semi-synthetic samples and the reference genomes of the target strains. StrainPhlAn could recover the strains precisely (with the distance of <0.025% SNVs rate) even at coverages as low as 3x for all species. Additionally, the SNV rates tend to converge to zero when increasing the coverage.

Supplemental Fig. S6. The comparison of the strains reconstructed by StrainPhlAn and ConStrains on the synthetic datasets of two species Bacteroides ovatus, and Bifidobacterium longum at 20x coverage. (A, C)
The phylogenetic trees built from reference genomes and from corresponding synthetic samples by StrainPhlAn. In parenthesis, we report the strain IDs assigned by ConStrains. (B, D) The heatmap of SNV distances for the reference genomes on the whole core genome as obtained by the application of Prokka (5) and Roary (6) with the corresponding ConStrains strain ID.
Supplemental Fig. S9. Distribution of non-polymorphic site prevalence in samples for the 40 most prevalent gut bacterial species. Each point represents, for each sample-species pair, the fraction of reconstructed marker positions that are non-polymorphic. In parenthesis we quantify the percentage of strains with >99.9% of non-polymorphic sites. The fraction of non-polymorphic sites varies from sample to sample and from species to species. Supplemental Fig. S10. Rates of non-polymorphic positions and within-species dominant strain dominance are not correlated with relative abundance of the species. We contrasted here the non-polymorphic rates (A) and dominant strain frequency (B) against the relative abundance of the species the strain belongs to. Each point represent a different sample/species combination.

Supplemental Fig. S11. Strain retention rates and strain divergence in multiple longitudinal samples from the same subjects (131 from the HMP and 78 from MetaHIT). (A, C)
The scatter plot of the retention rate versus the sampling interval between two samples of the same subject at two different time-points in the HMP and MetaHIT datasets respectively. For each subject, the retention rate is the proportion of species in which the subject harbors the same strain in the second time-point as the first. Two strains are considered to be the same if their normalized distance is less than µintrametahit + 3intra-metahit where µintra-metahit and intra-metahit are the median and standard deviation of the intra-metahit dominant distribution, respectively. (B, D) The scatter plot and histogram of the normalized distance versus sampling interval of two samples of the same subject at both time-points in the HMP and MetaHIT strains.