Metagenomic analysis with strain-level resolution reveals fine-scale variation in the human pregnancy microbiome

  1. David A. Relman1,2,3,4
  1. 1March of Dimes Prematurity Research Center at Stanford University, Stanford, California 94305, USA;
  2. 2Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, California 94305, USA;
  3. 3Department of Medicine, Stanford University School of Medicine, Stanford, California 94305, USA;
  4. 4Infectious Diseases Section, Veterans Affairs Palo Alto Health Care System, Palo Alto, California 94304, USA;
  5. 5Department of Earth and Planetary Science, University of California, Berkeley, California 94720, USA;
  6. 6Department of Pediatrics, Stanford University School of Medicine, Stanford, California 94305, USA;
  7. 7Department of Statistics, Stanford University, Stanford, California 94305, USA;
  8. 8Earth and Environmental Science, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
  1. 9 These authors contributed equally to this work.

  • Corresponding author: relman{at}stanford.edu
  • Abstract

    Recent studies suggest that the microbiome has an impact on gestational health and outcome. However, characterization of the pregnancy-associated microbiome has largely relied on 16S rRNA gene amplicon-based surveys. Here, we describe an assembly-driven, metagenomics-based, longitudinal study of the vaginal, gut, and oral microbiomes in 292 samples from 10 subjects sampled every three weeks throughout pregnancy. Nonhuman sequences in the amount of 1.53 Gb were assembled into scaffolds, and functional genes were predicted for gene- and pathway-based analyses. Vaginal assemblies were binned into 97 draft quality genomes. Redundancy analysis (RDA) of microbial community composition at all three body sites revealed gestational age to be a significant source of variation in patterns of gene abundance. In addition, health complications were associated with variation in community functional gene composition in the mouth and gut. The diversity of Lactobacillus iners-dominated communities in the vagina, unlike most other vaginal community types, significantly increased with gestational age. The genomes of co-occurring Gardnerella vaginalis strains with predicted distinct functions were recovered in samples from two subjects. In seven subjects, gut samples contained strains of the same Lactobacillus species that dominated the vaginal community of that same subject and not other Lactobacillus species; however, these within-host strains were divergent. CRISPR spacer analysis suggested shared phage and plasmid populations across body sites and individuals. This work underscores the dynamic behavior of the microbiome during pregnancy and suggests the potential importance of understanding the sources of this behavior for fetal development and gestational outcome.

    The importance of the microbiota in human nutrition, immune function, and physiology is well known (Fujimura et al. 2010; Garrett et al. 2010; Charbonneau et al. 2016), and disturbance (such as antibiotic use and disease) has important effects on the microbiome (Cho and Blaser 2012; Costello et al. 2012). Pregnancy is a natural disturbance that can be understood as a special immune state. During gestation, significant hormonal, physiological, and immunological changes allow for and promote the growth of a developing fetus (Nuriel-Ohayon et al. 2016). For example, some of the changes that occur in pregnancy resemble metabolic syndrome (weight gain, glucose intolerance, and low-level inflammation, among other symptoms). Given the roles of the microbiome during other states of health, it is presumed to provide fundamental support for fetal development as well (Charbonneau et al. 2016). Despite this, few studies of the human microbiome in pregnancy, beyond community composition-based analyses, have been reported. One study investigating the gut microbiota at one time point during each of the first and third trimesters suggested that community composition changes during pregnancy (Koren et al. 2012). Similarly, abundances of viable counts of common oral bacteria have been found to be altered during gestation, especially during early pregnancy, compared to nonpregnant women (Nuriel-Ohayon et al. 2016). However, a detailed longitudinal 16S ribosomal RNA (rRNA) gene survey of the gut and oral microbiota from 49 women during pregnancy demonstrated relative stability over time (DiGiulio et al. 2015).

    The vaginal bacterial community composition of pregnant and nonpregnant individuals has been classified into community state types (CST) based on the presence and relative abundance of organisms characterized with the 16S rRNA gene (Ravel et al. 2011; Hickey et al. 2012; DiGiulio et al. 2015). More recently, an alternative classification for vaginal communities was proposed based on the frequency of Lactobacillus iners, Lactobacillus crispatus, and Gardnerella vaginalis (Callahan et al. 2017). Studies of vaginal community composition based on 16S rRNA gene amplicon sequence analysis in pregnant and nonpregnant individuals have reported relative stability throughout gestation, although transitions between CSTs within subjects have been observed (Gajer et al. 2012; Romero et al. 2014; DiGiulio et al. 2015; MacIntyre et al. 2015; Stout et al. 2017). In addition, studies have established associations between the vaginal microbiota and race/ethnicity (Hillier et al. 1995; Zhou et al. 2007; Ravel et al. 2011; The Human Microbiome Project Consortium 2012; Hyman et al. 2014) and, more recently, associations with premature birth (DiGiulio et al. 2015; Callahan et al. 2017; Kindinger et al. 2017; Stout et al. 2017; Tabatabaei et al. 2018).

    Despite this recent attention to the microbiome during pregnancy, relatively few studies have discriminated among bacterial strains, and few have described the phage populations in the human microbiome during pregnancy. Since CRISPR spacers are mainly derived from phage, plasmids, and other mobile elements (for extensive review, see Horvath and Barrangou 2010; Karginov and Hannon 2010; Marraffini and Sontheimer 2010), the CRISPR-Cas system can be used to identify these sequences (Bolotin et al. 2005; Mojica et al. 2005; Pourcel et al. 2005; Andersson and Banfield 2008; Sun et al. 2016) where spacers reflect exposures to phage and plasmids in a whole community metagenomic data set.

    The goal of this work was to survey the genome content and functional potential of microbial communities during gestation. We describe the vaginal, gut, and oral microbiomes from 10 pregnant subjects from first trimester through delivery, with the goal of broadening our understanding of the functional potential and the dynamics of the human microbiome during pregnancy.

    Results

    Community structure from assembly-driven metagenomics

    Ten pregnant women whose microbiota have been studied previously using 16S rRNA gene amplicon data (DiGiulio et al. 2015) were selected for further gene and genome composition analysis. Subjects were selected in part based on their compliance in providing samples on a regular basis from multiple body sites. Six subjects delivered at term and four delivered preterm (i.e., <37 wk gestation). Five subjects (four who delivered at term, and one who delivered preterm) were diagnosed with having some type of pregnancy complication: preeclampsia, type 2 diabetes, and oligohydramnios (Supplemental Table S1).

    Shotgun metagenomic sequencing was performed on 292 vaginal, oral, and gut samples that had been collected, on average, every 3 wk over the course of gestation (Table 1; Supplemental Table S1). Community composition and estimates of diversity were evaluated from full-length 16S rRNA gene sequences that were reconstructed from metagenomics reads using EMIRGE (Miller et al. 2011) and from the average number of single copy ribosomal protein (RP) sets encoded in assembled scaffolds. Based on reconstructed 16S rRNA gene sequences, 1553 taxa (clustered at 97% identity) were identified. In total, 22,737 predicted proteins annotated as one of the 16 RPs described for phylogenetic classification in Hug et al. (2016) were encoded in 4650 assembled scaffolds at all body sites. The RP sets were clustered at 99% average amino acid identity (AAI) into 4024 taxa (see Methods; Table 1).

    Table 1.

    Metagenomic sequencing statistics

    Estimates of richness and evenness from reconstructed 16S rRNA genes differed depending on body site: Vaginal communities were the least diverse and gut communities were the most diverse (Supplemental Fig. S1A). Vaginal communities appeared to be largely undersampled with the metagenomic data in comparison to the previously published 16S rDNA amplicon data (DiGiulio et al. 2015), perhaps due to high levels of human sequence contamination in the metagenomic data (Supplemental Fig. S1B, vagina). It is also possible that the amplification of 16S rDNA leads to higher richness than expected. In contrast, gut communities showed higher richness and evenness (visualized as the slope of the curves) when viewed with metagenomics than with 16S rRNA gene surveys (Supplemental Fig. S1B, gut), possibly due to higher resolution from deep sequencing in the metagenomics data and/or from the so-called “universal” 16S rRNA gene primers not amplifying all microbial targets. On the other hand, the richness of the oral communities recovered from metagenomics appeared to approximate that observed in the 16S rDNA amplicon data (Supplemental Fig. S1B, saliva).

    Unlike gut or salivary communities, vaginal communities were generally dominated by one organism based on whole-genome abundance measures (Fig. 1A) and reconstructed 16S rRNA gene analysis (Supplemental Fig. S2A). The taxonomic classifications, community state types, and community dynamics revealed through analysis of 16S rRNA gene sequences assembled by EMIRGE as well as from ribosomal protein sequences resembled those revealed by 16S rRNA gene amplicon sequence data (Supplemental Fig. S2; DiGiulio et al. 2015). Communities dominated by L. iners showed a clear increase in taxonomic richness toward the end of pregnancy (subjects Term2, Term3, Term4, and Pre4), whereas those dominated by L. crispatus appeared to remain stable over time (subjects Term5, Term6, Pre1, and Pre3) (Fig. 1A). To explore the observed association between increasing community richness and community domination by L. iners, we re-analyzed the 16S rRNA amplicon data reported previously (Supplemental Fig. S3; DiGiulio et al. 2015; Callahan et al. 2017). Shannon's diversity significantly increased with gestational age in L. iners- and L. crispatus-dominated communities in the Stanford population (F = 32.07, P < 0.0001; and F = 8.32, P = 0.004, respectively) but not in the UAB population. When L. iners- and L. crispatus-dominated communities were each pooled from both populations, gestational age trends achieved significance in both cases (L. iners, F = 23.23, P < 0.0001; L. crispatus, F = 8.61, P < 0.0034). These data suggest dynamic vaginal community composition over the course of gestation.

    Figure 1.

    Community structure of the pregnancy microbiome over time in 10 subjects. (A) Relative abundance of vaginal genome bins (y-axis). Abundance was estimated from the number of reads that mapped to each bin and normalized by the length of the bin. The top 10 most abundant vaginal taxa are displayed (see key at bottom). (Unbinned) Sequences that were not assigned to a classified genome bin. (B,C). Relative abundance (y-axis) of the top 50 most abundant taxa across all subjects in saliva (B) and gut (C) samples, respectively. Gut samples from subject Pre3 were not available. Species abundance was estimated from the average read counts of single-copy ribosomal protein (RP) sets (at least one of 16), summed over scaffolds sharing RPs clustered at 99% amino acid identity. Each taxon is represented by a distinct color (see key for selected taxa at bottom) and is classified at the most resolved level possible. Co-occurring strains of Rothia mucilaginosa, Streptococcus sanguinis, and Streptococcus parasanguinis in B, and of Bacteroides vulgatus and Akkermansia spp. CAG:344 in C are highlighted with black boxes.

    Due to the high diversity of gut and saliva communities, the relative abundances of 16 single-copy ribosomal protein sets were used to estimate the dynamics of gut and saliva communities over time. The relative abundances of the most abundant taxa in the gut and saliva showed relative stability over time within subjects but high inter-individual variability (Fig. 1B,C). For example, the nine most abundant taxa in the saliva samples of subject Term5 were relatively consistent in abundance over all time points, representing, at most, 20% of the community within that subject, but were distinct from the most abundant taxa in the saliva of the other subjects (Fig. 1B). In addition, strains could be resolved from RP cluster variation within scaffolds, and co-occurring bacterial strains were observed in gut and saliva samples within subjects (Fig. 1B,C, black boxes).

    Most taxa identified by 16S rRNA gene reconstruction (classified at 97% average nucleotide identity, or ANI) and RP sets (classified at 99% average amino acid identity, or AAI) were shared between body sites among all subjects, but the level of sharing was most evident between vaginal and gut samples (Supplemental Fig. S1C). Phyla found uniquely in gut samples included the Bacteria, Verrucomicrobia and Synergistetes, the Archaea, Euryarchaeota, and the Eukarya, Stramenopiles (Blastocystis hominis), whereas phyla found uniquely in oral samples included the Bacteria, Spirochaetes and SR1. Novel taxa were detected in oral and gut samples: Three 16S rRNA sequences in saliva could not be classified to a known phylum and three taxa represented by novel RP sets could not be classified at the domain level based on searches in the public databases. Similarly, six 16S rRNA gene sequences in gut samples could not be classified at the phylum level, and 10 RP sets remained unclassified at the domain level.

    Sources of variation in community-wide gene profiles

    To gain insights into all potential sources of variation in the patterns of gene abundance across all subjects and samples, nonmetric multidimensional scaling (NMDS) was performed on variance-stabilized abundances of gene families represented in the UniRef90 database. NMDS revealed three major sources of variation in the patterns of gene abundance: subject, gestational age, and health complication. Samples clustered primarily based on subject for each body site (Fig. 2A; Supplemental Fig. S4B,C). In addition, when considering vaginal samples, gestational age was a major source of variation for subjects with high diversity communities (Supplemental Fig. S4A, Pre4, Term3, and Term4). Taxonomic assignment was also associated with gene abundance profiles in the vagina. Specifically, gene profiles from Gardnerella vaginalis appeared correlated with positive scores on NMDS2 (samples from subject Term4); Lactobacillus gasseri, Sneathia, and Prevotella gene abundances were associated with positive NMDS1 scores (samples from subject Term2); and L. crispatus, L. iners, and Lactobacillus jensenii gene abundances were associated with negative scores along NMDS2 (samples from all other subjects) (Fig. 2B). NMDS plots of variance-stabilized UniRef90 gene family abundances in the distal gut microbiota showed a subtle effect of Shannon's diversity index on sample variation (Supplemental Fig. S5), and plotting Shannon's diversity index from ribosomal protein sets, from reconstructed 16S rRNA gene sequences via EMIRGE, and from UniRef90 gene families indicated a decrease in diversity over time in our 10 subjects (Supplemental Fig. S5). Finally, in oral and gut communities, health complications explained some variation in community gene composition: Samples from the three subjects who were diagnosed with preeclampsia (two who delivered at term and one preterm) appeared to cluster separately from samples collected from subjects with other health states and pregnancy outcomes (Fig. 2D,F).

    Figure 2.

    Sources of variation in abundance of UniRef90 gene families across all subjects and samples. Top: (A,B,D,F) Nonmetric multidimensional scaling (NMDS) plots from Bray-Curtis distance matrices of variance-stabilized gene family abundances. The stress, “s” (the amount of variability unexplained by the NMDS ordination), is shown on each plot. Bottom: (C,E,G) Redundancy analysis (RDA) plots from variance-stabilized gene family abundances. The P-values for the RDA plots were estimated with the anova.cca function from the vegan package in R. (A,B) NMDS split plot from vaginal samples: The “samples” plot (A) was color-coded based on subject, whereas the “genes” plot (B) was color-coded based on the taxonomic classification of genes (gray dots: genes belonging to other taxa). (C) RDA plot from vaginal samples, constrained by gestational age (GA) in weeks and by the most abundant taxon in each sample. Samples are color-coded based on the most abundant taxon. (D,E) NMDS and RDA plots from gut samples. (F,G) NMDS and RDA plots from saliva samples. Gestational age and health complication were used to constrain the RDA analysis in E and G. Complication: uncomplicated (five subjects); preeclampsia (three subjects); other, i.e., type 2 diabetes (one subject); and oligohydramnios (one subject).

    To quantify the relative influence of gestational age, dominant taxon (vaginal microbiota), and health complication on community composition, redundancy analysis (RDA) was performed on variance-stabilized gene family abundances. Although the effect of gestational age was significant only for vaginal communities (F = 2.71, P = 0.017) when data from all subjects and samples were considered (Fig. 2C,E,G), gestational age trends were significant at all body sites and for most subjects and showed stronger effects when the data for each subject were examined separately (Fig. 3; Supplemental Fig. S6). For vaginal samples, significant trends with gestational age were observed in six of 10 subjects (subjects Pre1, Pre2, Pre3, Term2, Term3, Term4) even in low-diversity communities dominated by L. crispatus (subjects Pre1 and Pre3) (Supplemental Fig. S6A). The most abundant taxon in vaginal communities and health complications (preeclampsia, diabetes, oligohydramnios, or uncomplicated pregnancy) in gut and oral communities were highly significant sources of variation in gene family abundances based on RDA of the overall data (vagina: F = 14.97, P = 0.001; saliva: F = 11.20, P = 0.001; gut: F = 12.34, P = 0.001) (Fig. 2C,E,G), although the numbers of subjects with each of these complications were small.

    Figure 3.

    Gestational age trends for abundances of gene families in saliva samples for each subject. Gestational age was used to constrain the redundancy analysis of variance-stabilized gene family abundances within individuals. Gestational age effect is observed along the x-axis (RDA1 axis), and points within plots were connected based on the resulting ordination scores. P-values were calculated with ANOVA on the RDA ordination constraint using the anova.cca function of the vegan package in R.

    Overall pathway composition and abundance remained stable over time at all body sites when all subjects were viewed together (Supplemental Fig. S7, top plots). Not surprisingly, core metabolic pathways consistently accounted for most of the overall pathway composition and abundance in all subjects. When pathways were viewed individually, their abundances were stable over time (e.g., glycolysis at all body sites) (Supplemental Fig. S7, bottom plots); however, some pathways were variable over gestational age. For example, the relative abundance of the enterobactin biosynthesis pathway (a siderophore-mediated iron uptake system [Raymond et al. 2003]) decreased toward the end of pregnancy in most gut and oral samples. Yet, the enterobactin biosynthesis pathway identified in some samples remains stable at low basal levels (ENTBACSYN) (Supplemental Fig. S7, gut, bottom plots). These results suggest that iron may have been in plentiful supply as pregnancy progressed in the oral and gut sites, leading to enrichment of organisms without enterobactin biosynthesis capabilities. In addition, the increasing relative abundance of the pyruvate fermentation to acetate and lactate pathway toward the end of pregnancy in gut samples (PWY-5100) (Supplemental Fig. S7, gut) suggests that fermenters might become enriched in the gut during pregnancy.

    Gardnerella vaginalis strains co-occur in the vagina

    Gardnerella vaginalis is an important species associated with both bacterial vaginosis (Schwebke et al. 2014) and an increased risk of preterm birth (Callahan et al. 2017). The near-complete draft genomes of six strains of G. vaginalis from the vaginal samples of four subjects were recovered in this study (Fig. 1A, colored light and dark green). G. vaginalis genomes were abundant early in pregnancy in two subjects who delivered preterm (Pre2 and Pre4), and they dominated samples later in pregnancy in two subjects who delivered at term (Term3 and Term4). Two subjects carried co-occurring G. vaginalis strains (Figs. 1A, 4A,B, subjects Term4 and Pre2). The strains shared >99% average pairwise nucleotide identity in the 16S rRNA gene, yet single nucleotide polymorphism (SNP) patterns were evident in this gene (Supplemental Fig. S8A). One strain was generally more abundant than the other throughout pregnancy: In subject Term4, both G. vaginalis strains were found at low relative abundance (<7% for each strain in any sample) (Fig. 4B), whereas in subject Pre2, the two strains co-existed at roughly equal abundances early in pregnancy and then one assumed dominance as gestation progressed (Fig. 4A).

    Figure 4.

    Gardnerella vaginalis genome analysis. 16S rRNA gene abundance of G. vaginalis strains recovered from each of two subjects, Pre2 (A) and Term4 (B). G. vaginalis genotypic groups, C1 and C2, are colored according to the classification in C. Relative abundance (left y-axis). Estimated iRep values are plotted for G. vaginalis strains in subject Pre2 (right y-axis). (C) Phylogenomic tree of 40 G. vaginalis strains genomes, including 34 available in GenBank. The six genomes recovered in the current study are shown in bold. Colored bars represent genotypic groups within the G. vaginalis phylogeny, where colors for clades 1–4 match the G. vaginalis genotypic groups defined by Ahmed et al. (2012). FastTree branch support for the most visible nodes is shown. (D) Radial representation of the same phylogenomic tree displayed in C, where leaves are colored based on 16S rRNA V4 sequence variant classification defined by Callahan et al. (2017). Genomes for which a full-length 16S rRNA sequence or V4 sequence were not available are shown in black.

    To determine whether replication rate may have been associated with the relative abundance differences in G. vaginalis strains within each subject, iRep values were determined. iRep values have been employed to infer replication rates of microbial genome types from draft quality genomes reconstructed from metagenomic data and can be used to suggest that a particular microbial type is actively growing at the time of sampling (Brown et al. 2015). As described in Brown et al. (2015), an iRep value of 2 would indicate that most of the population is replicating one copy of its chromosome (Brown et al. 2015). The iRep values for the co-occurring G. vaginalis strains in subject Pre2 indicate both strains were actively replicating at the beginning of pregnancy (iRep ∼2.5) when both strains co-existed at relatively equal abundances. After gestational week 16, the Pre2_C1 strain remained dominant, with a stable replication rate value of ∼1.5 (that is, roughly 50% of the population replicating), whereas the Pre2_C2 strain dropped below iRep calculation limits due to low abundance (Fig. 4A).

    A full-length multiple genome alignment and phylogenomic tree of the six genomes recovered in this study and 34 complete and draft genomes available in GenBank revealed five divergent clades (Fig. 4C,D). The dominant genomes recovered from subjects Term4 and Pre2 (Term4_C1 and Pre2_C1) were more similar to those of a similar genome type found in other subjects than to the co-occurring genome within the same subject (Fig. 4C). When using the ATCC 14019 type strain genome as reference, the total number of SNPs across the genomes did not correlate with the group phylogeny. In other words, a divergent clade does not necessarily harbor more SNPs than strains within Clade 1 to which the ATCC 14019 strain belongs (SNP counts are displayed next to the genome ID in Supplemental Fig. S8B).

    From over 50,600 genes predicted in all 40 strain genomes, ∼4000 genes were unique to a genome, and the rest could be clustered at >60% AAI over 70% alignment coverage into >3300 orthologous groups. In total, 41,245 orthologous genes were present in at least half of the genomes within each of the five clades described above (1763 orthologous groups), and ∼44% of these represented the core genome (cluster 4, Supplemental Fig. S9A). Furthermore, hierarchical clustering suggested groups of orthologous genes specific to each genome clade: genes shared within Clade 1 genomes (cluster 6), genes in Clade 2 genomes (cluster 13), genes in Clade 3 genomes (cluster 1), genes in Clade 4 genomes (cluster 3), genes in Clade 5 genomes (cluster 5), and one cluster composed of genes shared between Clade 3, Clade 4, and Clade 5 genomes (Supplemental Fig. S9A). A survey of the functional categories represented within these clusters indicated that, although there is redundancy of functions across all clusters (most of them also represented in the core genome cluster), some functions are enriched in clade-specific genomes. For example, hypothetical proteins are enriched in Clade 5 genomes; proteins involved in vitamin and cofactor metabolism, and CRISPR Cas genes are enriched in Clade 3 genomes; and membrane proteins, DNA recombination/repair proteins, and proteins with transferase activity appear enriched in Clade 2 genomes. The results suggest a clade-specific enrichment of functional categories in G. vaginalis genomes.

    A recent detailed survey of SNP patterns in the V4 region of the 16S rRNA gene from a longitudinal study of vaginal communities from pregnant women classified G. vaginalis strains into amplicon sequence variants (ASVs) (Callahan et al. 2017), and the three most frequent sequence variants could be mapped onto divergent clades on a phylogenomic tree (Fig. 4D). The genomes and full-length 16S rRNA gene of two of the described ASVs were recovered in this study: variant G2 (Term4_C1 and Pre2_C1) (tree leaves colored in orange, Fig. 4D); and variant G3 (Term3_C2, Term4_C2, and Pre4_C5), which intermingle across the phylogenomic tree (tree leaves colored in green, Fig. 4D).

    Lactobacillus genomes vary in their mobile elements

    Vaginal bacterial communities of many women are dominated by Lactobacillus iners (Ravel et al. 2011; DiGiulio et al. 2015), an organism that has been associated with both health and disease (Macklaim et al. 2011). Recent studies have shown that L. iners is associated with high diversity CSTs, whereas L. crispatus (dominant species in CST I) is thought to be more protective against disease (Callahan et al. 2017; Smith and Ravel 2017). We recovered six near-complete L. iners genomes, as well as four fragmented L. crispatus genomes (completeness was assessed based on single-copy genes and average genome size).

    A comparison of the L. iners genomes with the KEGG Automatic Annotation Server (Moriya et al. 2007) revealed no major differences in terms of the presence or absence of genes of known function. Alignment of the consensus L. iners genome sequences from five of the six strains to the NCBI reference strain DSM 13335 using Mauve showed that the genomes share high levels of synteny (the genome from Pre4 was not used due to its high fragmentation) (see Methods; Fig. 5A). Primary differences between strains were the variable presence of mobile element-like genes including classic phage and plasmid genes (such as phage integrases, phage lysins, phage portal genes, as well as genes putatively involved in restriction-modification systems, toxin-antitoxin systems, and antibiotic transport), and many hypothetical genes (Fig. 5A, islands). Analysis of orthologous gene groups among the six genomes assembled here and 18 other publicly available genomes confirmed that L. iners genomes are highly conserved, sharing 90.5% of their genes (Supplemental Fig. S10).

    Figure 5.

    Comparative genomics of Lactobacillus spp. and estimated replication rates. (A) Multiple genome alignment of five L. iners genomes recovered through metagenomics along with the reference strain DSM 13335 (see Methods). A modified alignment created in Mauve shows the shared genomic context, as well as genomic islands unique to a strain (white areas within a conserved block). (B,C) Relative abundance and estimated genome replication values (iRep) for L. iners and L. crispatus, respectively.

    The community composition of L. crispatus-dominated communities remained generally stable through pregnancy and tended to have low richness (Figs. 1, 5C). A full genome alignment and analysis of orthologous genes of 41 draft genomes of L. crispatus, including six reconstructed in this study, grouped 80,484 genes into 4705 orthologous groups present in at least two genomes. Hierarchical clustering of the presence/absence of orthologous groups suggested two divergent clusters, with more than half of the genes representing the core genome (Supplemental Fig. S11). The two L. crispatus types differed in the presence of core SNPs (SNPs present in all genomes) as well as in the presence of cluster-specific orthologous groups. A previous comparative genomics analysis of 10 L. crispatus genomes identified ∼1200 core genes as well as genome-specific genes encoding up to 40% genes of unknown function (Ojala et al. 2014). Characterization of the L. crispatus group-specific genes might improve our understanding of the role of this species in human urogenital health.

    iRep values were estimated for L. iners in subjects Term2, Term3, Term4, and Pre4 and for L. crispatus in subjects Term5, Term6, Pre1, and Pre3 to determine whether the different Lactobacillus genomes were replicating (see Methods). iRep values fluctuated around 2 (most members of the population were replicating one copy of their chromosome) for both organisms and appeared independent of relative abundance (Fig. 5B,C). On average, L. crispatus had a slightly higher iRep value (2.0) than did L. iners (1.8) (∼100% of the L. crispatus population and 80% of the L. iners population replicating one copy of their genome) (Fig. 5B,C).

    CRISPR-Cas diversity and CRISPR spacer targets

    In order to investigate the phage and plasmid populations at the different body sites over the course of pregnancy, we characterized the types and distribution of the CRISPR-Cas systems in the vaginal, gut, and saliva assemblies (Fig. 6A). Prior studies have defined several CRISPR-Cas type systems based on the composition and genomic architecture of Cas genes (Makarova et al. 2015; Burstein et al. 2016). Here, we identified CRISPR-Cas systems using the five-type classification scheme derived from Makarova et al. (2015), four of which were identified in our metagenomics data. Type I (thought to be the most common CRISPR-Cas system type [Makarova et al. 2015]) was the dominant CRISPR-Cas system in gut samples, whereas Type II was the dominant system in saliva samples (Fig. 6A).

    Figure 6.

    Diversity and distribution of CRISPR-Cas systems and CRISPR spacer targets (mobile elements). (A) Distribution of CRISPR-Cas system types in the gut (gray) and saliva (black) samples in this study (I, II, III, V). Vaginal samples are not shown due to low bacterial diversity. The black lines indicate standard deviation. (B) Frequency of matches between spacers and scaffolds. This graph shows the total number of matches by body site (outlined gray, vaginal spacers; light gray, saliva spacers; and dark gray, gut spacers). There were a total of 78,054 matches between spacers and all pregnancy scaffolds. Overall, we detected 3254 spacer types from vaginal samples, 36,477 spacer types from gut samples, and 36,279 spacer types from saliva samples.

    In total, there were 137, 618, and 1255 “repeat types” among vaginal, saliva, and gut samples, respectively (Supplemental Table S2). Each repeat type represents a repeat sequence found in a CRISPR locus; different CRISPR loci (within or between organisms) can share the same repeat type. Many more spacer types were found in the gut (36,477) and saliva (36,279) samples than in the vaginal samples (3254). Although there were twice as many repeat types in the gut than in saliva, the similar total number of spacers in the samples from each site suggested there were either longer CRISPR loci in organisms (on average) in saliva or there were more CRIPSR loci that shared the same repeat type among organisms in saliva.

    We used the CRISPR spacers to detect matches in the NCBI database of viral genomes as well as across the data generated in this study. We found significantly more matches (and sequence similarity) to the data generated from our samples than to the NCBI virus database (Supplemental Fig. S12A). Within the metagenomic data, there were 78,054 total matches between spacers and scaffolds (Fig. 6B). On average, ∼32% of the spacers had at least one match to a scaffold of one of the body sites (vaginal: 31%; saliva: 38%; gut: 28%). Gut and saliva spacers had the most matches to scaffolds from the same body site. However, vaginal spacers had more matches to the gut scaffolds, both in terms of total matches between spacers and scaffolds (Fig. 6B) as well as in terms of the number of spacers with at least one match (Supplemental Fig. S12B). This was also true when vaginal spacers were viewed between and within individuals (Supplemental Table S3). It is possible that sequencing depth may have resulted in under-sampling of vaginal scaffold matches or that there were more scaffolds from the gut that serve as potential targets for vaginal spacers due to the higher diversity in the gut.

    Typical vaginal taxa are detected in the gut

    Whereas common gut Lactobacillus species such as L. acidophilus, L. rhamnosus, and L. lactis (The Human Microbiome Project Consortium 2012) were identified in the metagenomic sequences assembled from gut communities, species typically seen in vaginal communities were found in seven of the 10 subjects studied here. Near-complete genomes and shorter genomic fragments (3.5–180 kb binned) of L. iners (in Term2, Term4, and Pre4), L. crispatus (in Term5, Term6, and Pre1), and G. vaginalis (in Pre2 and Pre4) were recovered from gut samples.

    In all cases, the typical vaginal taxon found in the gut (at very low abundance) was the dominant bacterial species in the corresponding vaginal community of the same subject. Furthermore, genomic fragments from both L. iners and G. vaginalis, which were assembled from only one vaginal sample in subject Pre4 (Fig. 1), were detected in that subject's gut. Sharing of L. iners, L. crispatus, and G. vaginalis between the vagina and gut was supported by an assessment of the presence of these vaginal bacteria in gut samples from the larger Stanford cohort. Of 41 subjects who provided both vaginal and gut samples (both stool and rectal swabs) (DiGiulio et al. 2015), 34 had 16S rRNA gene sequences of typical vaginal bacteria in their gut samples, and 33 had sequences from as many as five of their own vaginal species present in at least one of their gut samples (Supplemental Table S4). Evaluation of whole-genome core SNPs and clusters of orthologous genes in L. iners and L. crispatus reconstructed genome sequences indicated that the vaginal and gut sequences recovered within the same subject were more closely related to each other than to genome sequences assembled or identified in other subjects, suggesting that strains of vaginal Lactobacillus species from within the same subject come from the same ancestor (Supplemental Figs. S11A, S12A).

    The vaginal and gut genomes recovered here likely came from different populations, as evidenced by the amount of variation observed in full-genome and genomic fragment alignments within subjects and the average pairwise nucleotide identity between the co-assembled genomes (Supplemental Fig. S13). Furthermore, SNPs within the vaginal read data mapping to the co-assembled genomes of L. iners and L. crispatus from the vagina indicated strain variation (L. iners and L. crispatus were deeply sequenced to a median of 219× coverage, with a range between 17× and 1180×). However, the frequency of SNPs unique to the gut read data that mapped to the corresponding vaginal strain genome from the same subject indicates that the pool of sequences used to assemble each of the genomes in the vagina and gut were not the same (Table 2, number of SNPs deemed significant based on the probability that a SNP is a product of random variation or sequencing error). For example, 1% of the SNPs that mapped to the vaginal L. iners genome from subject Term4 were unique to that subject's gut read data (>400 SNPs distributed across the genome). Given that L. iners was assembled from the vagina of subject Term4 at 168× coverage, we would expect to recover almost all SNPs present in the gut read data (which achieved 7× coverage) if the sequences came from the same source.

    Table 2.

    SNP analysis of the read data from vaginal and gut samples that mapped to the near-complete genomes of L. iners and L. crispatus recovered from the vagina

    Discussion

    The human vagina, oral cavity, and gut in combination with the microbial organisms, their genomes, and functions within these body sites (microbiomes) reveal mutualistic relationships in which the host provides nutrients and a specialized niche, and the microbiota provides protection against pathogens and beneficial signals to the immune system, among other benefits (Fujimura et al. 2010; Ma et al. 2012). Here, we used whole-community shotgun sequencing of human microbiome samples from three body sites (vagina, oral cavity, and gut) to characterize metagenome composition patterns over the course of pregnancy. High inter-individual variability and dynamic community composition across gestational age were observed at all body sites.

    In the vagina, a direct relationship between gestational age and increasing taxonomic richness was observed in L. iners-dominated communities. This finding was also evident from re-evaluating the previously published 16S rRNA gene amplicon data from the Stanford population (DiGiulio et al. 2015; Callahan et al. 2017). Although vaginal communities from subjects with abundant Shuttleworthia in the UAB cohort also showed statistically significant increases in taxonomic richness with gestational age, this association might be spurious due to the low sample size in Shuttleworthia-dominated communities. A significant gestation-associated increase in Shannon's diversity index in L. iners- and L. crispatus-dominated communities from the Stanford, but not the UAB population, may be due to their different racial backgrounds, with a higher frequency of Caucasian and Asian individuals at Stanford and a higher frequency of African American women at UAB (Callahan et al. 2017). Another difference is that, unlike at Stanford, women at UAB were treated with progesterone because of their history of prior preterm birth (Callahan et al. 2017). Although others have shown no effect of progesterone administration on microbiota composition over time (Kindinger et al. 2017), we cannot rule out an effect here.

    L. crispatus has been correlated with stability over time in pregnancy, whereas L. iners has been associated with higher CST transition frequencies and instability (DiGiulio et al. 2015; Kindinger et al. 2017). L. iners is thought to be less competitive with Gardnerella vaginalis and thus, to permit more frequent transitions to a high diversity community type (Ravel et al. 2011; Brooks et al. 2017; Callahan et al. 2017). Our data lend further support to the distinct ecology and less protective properties of this particular Lactobacillus species.

    Analyses of community gene content indicated that although the individual was the strongest source of sample-to-sample variation, a significant gestational age (time) effect was observed for most subjects at all body sites. In addition, the dominant taxon (in vaginal communities) and adverse pregnancy outcome, e.g., preeclampsia (in gut and saliva communities) also appeared to be significant factors. For example, samples from subjects with a BMI > 25 appear to cluster closer to each other than to samples from other subjects. The small number of subjects in this study and the even smaller number with adverse pregnancy outcomes render any effort at this point to infer biological significance to this association problematic.

    A previous analysis of the gut microbiota at two time points during pregnancy from a larger cohort of pregnant women reported an increase in diversity from the first to the third trimester (Koren et al. 2012). In contrast, our data indicated a decrease in diversity over time in our 10 subjects. Detailed longitudinal analyses of larger numbers of subjects, well-characterized for the presence or absence of gestational diabetes, BMI, and adverse outcomes, will be needed for identification of biological leads.

    The genomes of six strains of Gardnerella vaginalis were reconstructed and analyzed. Although G. vaginalis has been associated with bacterial vaginosis and with increased risk of preterm birth (Schwebke et al. 2014; DiGiulio et al. 2015; Callahan et al. 2017), it is also commonly found in vaginal CST IV communities of healthy women (Romero et al. 2014). G. vaginalis “genotypes” have been proposed previously: Four main types were identified based on ∼473 concatenated core-gene alignments (Ahmed et al. 2012); three groups were suggested based on conserved chaperonin cpn60 genes (Schellenberg et al. 2016); four genotypes were proposed with molecular genotyping (Santiago et al. 2011); and four clades were suggested from phylogenies reconstructed from core-gene sets (Malki et al. 2016). Phylogenomic trees reconstructed from full-genome alignments, as well as from core SNPs, based on 34 publicly available genomes along with the genomes recovered here suggested a fifth divergent clade within the G. vaginalis tree. We identified groups of genes enriched in specific functions within each clade. None of the publicly available genome sequences were generated from isolates from pregnant subjects: Some came from asymptomatic healthy women, whereas other “pathogenic” strains were isolated from subjects with bacterial vaginosis (BV) (Yeoman et al. 2010). However, healthy versus “pathogenic” strains did not map to specific clades. Consistent with previous comparative genomics reports of G. vaginalis (Ahmed et al. 2012; Malki et al. 2016), the G. vaginalis genomes recovered here show high degrees of re-arrangements with many unique islands. Although multiple G. vaginalis strains were isolated from the same subject previously (Ahmed et al. 2012), the co-occurring genomes were not compared and the importance of co-occurring strains was not addressed. The results presented here suggest specialized functions encoded by closely related G. vaginalis strains and the possibility that different strains play different roles in health and disease.

    CRISPR-Cas systems are intrinsically related to the mobile elements (such as phage and plasmid populations) that have targeted the genomic host in which they reside. Thus, CRISPR spacers allow an indirect assessment of the mobile element pool in the sampled habitat. The relative number of spacer matches in gut and saliva samples from this study indicate that phage populations are better adapted to their local environments than to environments found in other individuals. Previous findings have shown that samples from the same individual provide the highest number of matches for spacers from that same individual (Pride et al. 2011; Rho et al. 2012; Robles-Sikisaka et al. 2013; Abeles et al. 2014). However, we detected phage populations that are common enough between body sites and subjects to be targeted by the same CRISPR spacers, suggesting that similar mobile elements have been encountered by the microbial communities at different body sites and by different individuals.

    Although many Lactobacillus species have been identified in the human gut, they represent relatively small proportions in 16S rRNA gene amplicon surveys (for review, see Walter 2008). Lactobacillus crispatus, Lactobacillus iners, Lactobacillus gasseri, and Gardnerella vaginalis are generally found in the vagina (The Human Microbiome Project Consortium 2012; Romero et al. 2014; for review, see Hickey et al. 2012), although L. crispatus and L. gasseri have also been detected in the human gut (for review, see Walter 2008), and other species, such as Lactobacillus rhamnosus, have been isolated from the vagina (Pascual et al. 2008). Near-complete genomes and genomic fragments of L. iners, L. crispatus, and G. vaginalis were recovered from gut samples in this study, and the Lactobacillus species found in the gut of a specific subject was the dominant species found in the vagina of the same subject. Genome reconstruction of identical strains from samples of different body sites in the same subject has been reported previously based on metagenomic data from infants (Olm et al. 2017). One explanation for the finding of the same species in vaginal and gut samples of the same subject is contamination during the self-sampling procedure (likely when rectal swabs were provided in lieu of stool samples) or contamination during library preparation. However, given that the concurrent vaginal species occurred at high frequencies in both stool and rectal swab samples, and given the variation observed in the reconstructed genomes (from re-arrangements and genomic islands) and in the SNP patterns, we believe that the strains are not identical and are, therefore, likely not due to contamination. Colonization of the vagina and gut body sites is unlikely to occur independently (the same vaginal species was always found in the woman's gut, with only one exception in the Stanford cohort). Colonization may occur from the mother at birth (especially if the subjects were delivered vaginally) or colonization from self at an earlier time point. Although there are no previous reports, to our knowledge, about shared bacterial species in the vagina and gut of the same individual, the phenomenon is unlikely to be limited to pregnant women. Examination of taxa reported in the Human Microbiome Project (HMP) (The Human Microbiome Project Consortium 2012) indicated the presence of L. iners and Gardnerella in gut samples from a few female subjects but in none of the male subjects. The paucity of evidence of L. iners, L. crispatus, and G. vaginalis in gut samples in the HMP study may be due to the difficulty in classifying short amplicon sequences at the species level, in addition to the shallow sequencing achieved in the HMP (The Human Microbiome Project Consortium 2012).

    The results reported here suggest dynamic behavior in the microbiome during pregnancy and highlight the importance of genome-resolved strain analyses to further our understanding of the establishment and evolution of the human microbiome.

    Methods

    Sample selection, DNA sample preparation, and whole-community metagenomic sequencing

    A subset of subjects from the cohort examined with 16S rRNA gene amplicon analysis in DiGiulio et al. (2015) was selected for this study. Six of these subjects (Term1–6) gave birth at term (after 37 wk) whereas four of these subjects (Pre1–4) gave birth prematurely. Two of the term pregnancies and one preterm pregnancy were complicated by preeclampsia, one by oligohydramnios, and one by type 2 diabetes. Subjects and samples were selected based on whether samples were available at regular intervals (every three gestational weeks) from all three body sites and on similar gestational dates across subjects. A total of 292 samples (101 vaginal swabs, 101 saliva, and 90 stool or rectal swabs) were selected for shotgun metagenomic analysis. No negative controls were performed for this study.

    Sequence processing and assembly

    More than 7 billion paired-end reads were obtained, low quality bases were trimmed using Sickle with default parameters (https://github.com/najoshi/sickle), adapter sequences were removed with SeqPrep (https://github.com/jstjohn/SeqPrep), and sequences >100 bp were retained. Trimmed reads were mapped to the human genome version GRCh37 with BWA (Li and Durbin 2009), and human reads were filtered out. Alignments were originally performed before release of Genome Reference Consortium Human Build 38 in December 2013. Regardless of the reference, some human-derived reads usually remain after filtering due to natural variation in the human genome. We do not expect that there would be any differences in our conclusions if we were to have used Build 38, as the mapping across our data sets was done relative to the same human genome Build. Nonhuman reads were co-assembled across all time points per subject, per body site, with IDBA-UD (Peng et al. 2012) using default parameters. Scaffolds from vaginal samples were binned based on %GC and coverage using the public knowledgebase, ggKbase (http://ggkbase.berkeley.edu), as well as time-series information using ESOM (Dick et al. 2009). Genes were predicted with Prodigal (Hyatt et al. 2010), tRNAs with tRNA-scan (Lowe and Eddy 1997), and functional annotations were assigned from BLAST (Camacho et al. 2008) searches against the KEGG (Kanehisa and Goto 2000) and UniRef100 (The UniProt Consortium 2017) databases.

    Assessing the community composition

    Full-length 16S rRNA gene sequences were reconstructed per body site with EMIRGE (Miller et al. 2011), using the SILVA database SSURef 111 (Pruesse et al. 2007) and run with 100 iterations. Sequences that were identified as chimeras with any two of the three chimera check software packages, UCHIME (Edgar et al. 2011), DECIPHER (Wright et al. 2012), and Bellerophon (Huber et al. 2004), were removed. The full-length 16S rRNA gene sequences from the six G. vaginalis strain genomes and those from 22 publicly available genomes were aligned with MUSCLE (Edgar 2004) and visualized with Geneious (Kearse et al. 2012). 16S rRNA gene amplicon data from our previous work (DiGiulio et al. 2015; Callahan et al. 2017) were used to evaluate richness trends with gestation.

    Calculations of relative abundances of genes and genome bins

    Nonhuman reads were mapped to the co-assembled scaffolds from each subject (per sample) using Bowtie 2 (Langmead and Salzberg 2012) with default parameters. Vaginal genome bin abundance was calculated by dividing the numbers of reads that mapped to all scaffolds within a bin by the total number of reads that mapped to all scaffolds within a subject.

    Sixteen single-copy ribosomal protein gene sets were used to estimate bacterial diversity within body sites, as previously described (Hug et al. 2016). In total, 22,737 ribosomal proteins annotated as L14b/L23e, L15, L16/L10E, L18P/L5E, L2 rplB, L22, L24, L3, L4/L1e, L5, L6, S10, S17, S19, S3, and S8 were identified in 4650 scaffolds from all body sites. Individual RP gene sets were clustered at 99% average amino acid identity with usearch64 (Edgar 2010), and scaffolds containing the same sets of clustered RPs were considered to be derived from the same taxon (accounting for a total of 4024 taxa). Scaffold coverage was used to estimate read counts for predicted genes, UniRef90 gene families, and RP abundance tables (taxa abundance tables).

    Statistical analysis

    NMDS ordination was performed on Bray-Curtis distance matrices of variance-stabilizing transformed UniRef90 gene family abundance tables using the DESeq2 (Love et al. 2014) and phyloseq (McMurdie and Holmes 2013) packages in R (R Core Team 2017). RDA was performed directly on variance-stabilizing transformed UniRef90 gene family abundance tables using the same R packages used for NMDS, with the formula: genes ∼ gestational age + complication (for gut and oral samples), and genes ∼ gestational age + top taxon (for vaginal samples). The 16S rRNA gene tables, UniRef90 gene family tables, RP tables, pathway abundance tables, and R code are available at the Stanford Digital Repository at https://purl.stanford.edu/vp282bh3698 (.rmd files contain the code, .RData files contain data tables).

    Functional pathway presence and abundance analysis

    Functional pathways were predicted with HUMAnN2 (Abubucker et al. 2012) using the read FASTQ files. Pathway abundance tables were variance-stabilized with DESeq2 in R and are available in the .RData files at the Stanford Digital Repository at https://purl.stanford.edu/vp282bh3698.

    G. vaginalis, L. iners, and L. crispatus phylogenetic trees and genome comparisons

    Thirty-four publicly available G. vaginalis strain genomes plus the six G. vaginalis genomes recovered in this study were aligned with progressiveMauve (Darling et al. 2010). Similarly, six L. iners genomes assembled from vaginal samples, the two from gut samples, and 16 publicly available L. iners genomes were aligned with progressiveMauve, with L. iners DSM 13335 as a reference. In addition, 37 publicly available L. crispatus genomes (assembled in <120 scaffolds), as well as six L. crispatus genomes reconstructed from vaginal and gut samples, were aligned with progressiveMauve, with the L. crispatus ST1 representative genome as a reference. The output of progressiveMauve was visualized and exported in standard xmfa format using Gingr (Treangen et al. 2014) and converted to FASTA format using a publicly available Perl script (https://github.com/kjolley/seq_scripts/blob/master/xmfa2fasta.pl). For G. vaginalis, the multiple genome alignment in FASTA format was used to construct the phylogenetic trees using FastTree (Price et al. 2009) version 2.1.8 with parameters –nt –gtr –gamma and visualized using Geneious (Kearse et al. 2012). Orthologous gene groups between G. vaginalis, L. iners, and L. crispatus genomes were exported from the Mauve alignments based on at least 60% identity over 70% alignment coverage. Orthologous groups were classified into functional categories represented in the Gene Ontology (GO) database by mapping GI gene numbers to GO IDs in bioDBnet (Mudunuri et al. 2009). Hierarchical clustering on the Jaccard distances of the presence/absence of orthologous groups was performed with the vegan package (https://cran.r-project.org/package=vegan), and the heat map was created with the pheatmap package (https://cran.r-project.org/package=pheatmap) in R (R Core Team 2017). The number of clusters was determined visually from the hierarchical clustering, and the list of genes within each cluster was obtained with the function cutree from the R package “stats” (R Core Team 2017). The enrichment of genes within functional categories was estimated from the number of genes in each functional category within each cluster divided by the total number of orthologous genes in each cluster.

    kSNP3.0 (Gardner et al. 2015) was used to build phylogenetic trees from whole-genome SNP patterns in Lactobacillus iners, Lactobacillus crispatus, and Gardnerella vaginalis. iRep (with default parameters) was used to estimate the replication rate of genomes reconstructed from the vaginal metagenomic data sets as in Brown et al. (2015).

    Strain comparison across body sites

    Bowtie 2 (Langmead and Salzberg 2012) was used to identify reads from the gut data set that mapped to the genome of each Lactobacillus species assembled from the gut. Then, Bowtie 2 was used to map that subset of reads against the corresponding vaginal population genome assembled from the same subject. Reads from the vaginal samples were also mapped to the corresponding Lactobacillus species assembled from the vagina. VarScan (Koboldt et al. 2012) was used to identify the SNPs, with parameters minimum coverage >3 and P value < 0.05.

    Identification of CRISPR repeats/spacers, predicted Cas proteins, and CRISPR-Cas systems

    The program Crass (Skennerton et al. 2013), with default settings, was used to identify CRISPR repeats and spacers from all reads from each sample. The output was parsed using CRISPRtools (http://ctskennerton.github.io/crisprtools/). HMMER3 (Mistry et al. 2013) was used to search all predicted proteins against a Cas proteins HMM database created from alignments from references (Makarova et al. 2015; Shmakov et al. 2015). The HMM database can be found in Supplemental Data 6 from Burstein et al. (2016). CRISPR-Cas system types were identified based on the presence of two predicted Cas proteins. CRISPR-Cas types were only calculated for gut and saliva samples (vaginal samples had low organismal diversity). Cas types that were incomplete were marked as unknown/unclassified.

    Detection of potential phage and plasmid sequences and dynamics of spacer matching

    For body site-specific analysis, CRISPR repeats and spacers were extracted from the reads combined across all sites. For individual analyses, CRISPR repeats and spacers were combined for each body site within subjects (Term1 vaginal reads, Term 1 gut reads, Term1 saliva reads, etc.). A spacer was recorded as a match to a scaffold if the spacer matched with at least 85% identity across 85% of the spacer length against a scaffold using BLASTN (Camacho et al. 2008), with the parameter “-task blastn-short”.

    Data access

    The sequencing data generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject) under accession number PRJNA288562. The specific BioSample accession numbers (292 in total) are listed in Supplemental Table S5. Assemblies, protein sequences, gbk files, gene and protein abundance tables, metadata, and R code are publicly available at the Stanford Digital Repository at https://purl.stanford.edu/vp282bh3698. R markdown and data files are available as Supplemental Data. The 29 genome assemblies have been submitted to the NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/) under accession numbers QWBX00000000, QWBY00000000, QWBZ00000000, QWCA00000000, QWCB00000000, QWCC00000000, QWCD00000000, QWCE00000000, QWCF00000000, QWCG00000000, QWCH00000000, QWCI00000000, QWCJ00000000, QWCK00000000, QWCL00000000, QWCM00000000, QWCN00000000, QWCO00000000, QWCP00000000, QWCQ00000000, QWCR00000000, QWCS00000000, QWCT00000000, QWCU00000000, QWCV00000000, QWCW00000000, QWCX00000000, QWCY00000000, and QWCZ00000000.

    Acknowledgments

    We thank study participants, as well as Cele Quaintance, Nick Scalfone, Ana Laborde, March of Dimes Prematurity Research Center study coordinators, and nursing staff in the obstetrical clinics and the labor and delivery unit of Lucille Packard Children's Hospital. We also thank Alvaro Hernandez at the University of Illinois Roy J. Carver Biotechnology Center. We thank David Burstein for advice and scripts for Cas protein analysis and Elizabeth Costello for helpful discussions. This research was supported by the March of Dimes Prematurity Research Center at Stanford University School of Medicine, the Thomas C. and Joan M. Merigan Endowment at Stanford University (D.A.R.), the Chan Zuckerburg Biohub (D.A.R.), and the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health under Award 1K99HD090290 (D.S.A.G.).

    Footnotes

    • Received February 15, 2018.
    • Accepted August 28, 2018.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    | Table of Contents

    Preprint Server