Accurate reconstruction of bacterial pan- and core genomes with PEPPAN

Bacterial genomes can contain traces of a complex evolutionary history, including extensive homologous recombination, gene loss, gene duplications, and horizontal gene transfer. To reconstruct the phylogenetic and population history of a set of multiple bacteria, it is necessary to examine their pangenome, the composite of all the genes in the set. Here we introduce PEPPAN, a novel pipeline that can reliably construct pangenomes from thousands of genetically diverse bacterial genomes that represent the diversity of an entire genus. PEPPAN outperforms existing pangenome methods by providing consistent gene and pseudogene annotations extended by similarity-based gene predictions, and identifying and excluding paralogs by combining tree- and synteny-based approaches. The PEPPAN package additionally includes PEPPAN_parser, which implements additional downstream analyses, including the calculation of trees based on accessory gene content or allelic differences between core genes. To test the accuracy of PEPPAN, we implemented SimPan, a novel pipeline for simulating the evolution of bacterial pangenomes. We compared the accuracy and speed of PEPPAN with four state-of-the-art pangenome pipelines using both empirical and simulated data sets. PEPPAN was more accurate and more specific than any of the other pipelines and was almost as fast as any of them. As a case study, we used PEPPAN to construct a pangenome of approximately 40,000 genes from 3052 representative genomes spanning at least 80 species of Streptococcus. The resulting gene and allelic trees provide an unprecedented overview of the genomic diversity of the entire Streptococcus genus.


Introduction
Soon after the first bacterial genome was sequenced (Fleischmann et al. 1995), it became clear that the genomic contents varied between individual strains within a prokaryotic species. Variable genomic content is caused by the gain or loss of singleton ORFan genes (Daubin and Ochman 2004), genomic islands, selfish DNA (plasmids, bacteriophages, integrative conjugative elements) and/or widespread horizontal gene transfer (Abby et al. 2012;Szöllösi et al. 2012;Croucher et al. 2014). Thus, the designation "pan-genome" was introduced to refer to the entire gene contents of a 3 bacterial species or set of strains (Tettelin et al. 2005). Bacterial pan-genomes can be divided into the core genome, which consists of the subset of genes that are present in all genomes and the accessory genome, which consists of genes which are variably present among individual genomes. The core genome often contains phylogenetic signals reflecting the vertical accumulation of mutations, and can be used for assignments of bacterial strains to populations. Early pan-genome analyses (Tettelin et al. 2005;Hogg et al. 2007) addressed these problems by running TBLASTN gene-against-genome comparisons, but such inconsistencies between genome annotations are not addressed by the latest generation of pan-genome pipelines, which do not include a reannotation step. Namely, genes which have been fragmented by assembly errors or pseudogenisation may still be relevant to cell function (Goodhead and Darby 2015) and should therefore be included in pan-genomes. The identification of such gene fragments requires comparisons against intact analogs (Lerat and Ochman 2005), but automatic annotation pipelines instead annotate them as multiple intact genes, reducing the size of the estimated core genome and overestimating the overall size of the pan-genome.
The second problem in computing a pan-genome is that of differentiating orthologous genes, which have evolved by vertical descent, from paralogous genes derived from gene duplications or horizontal gene transfer (HGT) events. Paralogous genes can become fixed in populations, but many are gained or lost multiple times. This generates complex patterns of presence/absence along the phylogeny. Therefore, including paralogous genes in a phylogenetic analysis may lead to inaccurate interpretations.
State-of-the-art pan-genome pipelines implement either graph-or tree-based algorithms for the identification of paralogous genes (Altenhoff et al. 2019). However, tree-based algorithms (used by PanX) which reconcile gene trees with a species tree do not scale well to large datasets. Graph-based algorithms (used by Roary and PIRATE) run faster because they ignore phylogenetic relationships between genomes, but perform poorly on benchmark datasets (Ding et al. 2018).
Here we present PEPPAN, a novel pipeline for calculating pan-genomes that specifically deals with the problems described above. We describe the algorithms implemented within PEPPAN, and show that it outperforms other pan-genome methods on both empirical and simulated datasets. As a demonstration of PEPPAN's capabilities, we present a pan-genome calculated from 3052 representatives of Streptococcus, a highly diverse genus.

A brief overview of PEPPAN
PEPPAN's workflow consists of the following five successive groups of operations ( Fig.   1 and Supplemental Fig. S1) with additional details in Supplemental Text 1.

5
(1) Identifying representative gene sequences. The inputs for PEPPAN consist of GFF3 formatted genome assemblies (Ensembl Release 98 2019). PEPPAN also accepts inputs of additional nucleotide sequences, which are used to refine gene predictions. To reduce the number of genes used in downstream analyses, PEPPAN iteratively clusters genes using Linclust (Steinegger and Soding 2017), resulting in a single representative gene sequence per 90% nucleotide homology cluster.
(2) Identifying all gene candidates in all genomes. Each representative gene is aligned to all genomes using both BLASTn (Altschul et al. 1990), which accurately locates short inserts and deletions (INDELs), and DIAMOND (Buchfink et al. 2015), which generates amino acid alignments and has greater sensitivity with divergent sequences than BLASTn. Alignments are re-scored and all sequences with homology ≥ 50% across ≥ 50% of the representative sequence (Supplemental Text 1.2) are clustered in a Neighbour-Joining tree using RapidNJ (Simonsen et al. 2011).
(3) Identifying clusters of orthologous genes. PEPPAN identifies putative orthologs by calculating a paralogous score for each branch in a gene cluster tree (see Supplemental Text 1.3.2) based on ratio of the pairwise genetic distances of candidate genes within each cluster to the average genetic distances of their host genomes. Using average genetic distances avoids potential errors that can be introduced by using a 'species' tree to reconcile individual gene cluster trees (Altenhoff et al. 2019). Branches with a paralogous score of >1 are iteratively pruned until none remain. The remaining monophyletic subtrees are treated as putative orthologs.
The genomic locations of multiple putative orthologs may overlap in some genomes due to either inconsistent genome annotations or a failure to cluster divergent orthologous sequences in the first stage. These conflicts are resolved by retaining the ortholog with the greatest information score (see Supplemental Text 1.3.3), and eliminating all other gene candidates for that region.
The remaining gene candidates from each genome are ordered according to their genomic coordinates, and the final set of orthologous genes is identified based on synteny (see Supplemental Text 1.3.4). 7 genome assemblies. Subsequently, we designed a new simulation tool, SimPan, which uses SimBac (Brown et al. 2016) to simulate the dynamics of pan-genome evolution via recombination, HGT, gene gain and loss as well as the creation of paralogs (Supplemental Text 2).

Benchmarking pan-genome pipelines on 15 curated genomes
Nuccio and Bäumler re-annotated 15 complete genomes of S. enterica (Nuccio and Bäumler 2014). They removed existing annotations for unreliable short genes, performed new BLASTN and TBLASTN alignments to identify previously not annotated genes, corrected the start positions of falsely annotated genes and predicted the existence of pseudogenes based on alignments with orthologous intact CDSs. The result of these efforts is a unique set of consistently annotated genomes from a single species, which we equated with the 'ground truth' with which to compare the results from the pan-genome pipelines.
First, we compared the manual re-annotation with three sets of gene annotations for each of the 15 S. enterica genomes: (1) the original annotation that had been submitted to GenBank ("Submitter"), (2) an automated re-annotation from RefSeq (Haft et al. 2018) that was generated with PGAP (Tatusova et al. 2016), (3) a novel annotation using PROKKA (Seemann 2014), another popular bacterial annotation pipeline. Genes that had been eliminated by Nuccio and Bäumler as being "unreliable" were removed from all three annotations for consistency. We then examined the degree of concordance between the pan-genome published by Nuccio and Bäumler with the pan-genomes calculated by each of the pipelines. Concordance was estimated by calculating the adjusted Rand index (ARI) (Rand 1971), which is a measure of similarity between clustering results. For Roary or PIRATE we only report results from the run with the greatest ARI among three parallel runs with varying minimum identity (50, 80 or 95%), because the optimal value of this parameter differs for various levels of diversity ((Ding et al. 2018) and our own observations).
All pipelines successfully calculated a pan-genome from each of the four annotations, except that "Submitter" annotations never ran to completion with PanX. The PEPPAN pan-genomes consistently yielded ARIs of ~0.98 relative to the manual pan-genome 8 ( Fig. 2A, histograms). This is not surprising because PEPPAN re-calculates gene annotations in a fashion which resembles that of the manual curation. All the other pipelines yielded lower ARI values which varied between the annotation methods. The PROKKA annotations yielded ARIs of 0.97 with Roary, PanX, and OrthoMCL and 0.96 with PIRATE. The ARIs were 0.95-0.96 for the PGAP annotations from RefSeq, and 0.93-0.94 for the "Submitter" annotations. We also performed hierarchical clustering using the Neighbor-Joining algorithm on pair-wise comparisons of the ARI scores across all 14 pan-genomes (Fig. 2B). The three pan-genomes predicted by PEPPAN formed a tight cluster with high pair-wise ARI (0.99), which clustered tightly with the curated pan-genome (ARI=0.98). In contrast, pan-genomes generated by the other pipelines clustered according to annotation source rather than pipeline methodology.
For each of the three annotation sources, the pan-genome predicted by Roary was the most distinct whereas pan-genomes predicted by OrthoMCL, PanX, and PIRATE clustered more tightly. These results may reflect the fact that Roary differs from the other pipelines by performing an additional splitting of paralogs on the basis of synteny.
Pseudogenes prediction. The core genome defined by Nuccio and Bäumler contained 2,838 CDSs that were intact in all 15 genomes and 783 others that were disrupted in at least one genome. PEPPAN predicted marginally more intact CDSs, and slightly fewer pseudogenes, from all three annotations than were present in the manual annotations ( Fig. 2A, circles). The number of pseudogenes for each genome was also very similar between the manual curations and PEPPAN's automated predictions. We note that PEPPAN consistently predicts fewer pseudogenes for extraintestinal strains than those for those linked to gastrointestinal disease (Fig. 2C), This is an interesting observation, as accumulation of pseudogenes has been linked to host specialization in Salmonella (Parkhill et al. 2001;Holt et al. 2008;Nuccio and Bäumler 2014;Zhou et al. 2014;Zhou et al. 2018b).
Roary, OrthoMCL and PanX do not predict any disrupted genes. PIRATE reports 'gene diffusion', a measure of the frequency with which CDSs that are intact in some genomes are split into two or more fragments in others. However, it did not detect any gene diffusion in the RefSeq and GenBank annotations, and only one instance with the 9 PROKKA annotations. PIRATE also failed to predict fragmented genes. Similar to the ARI comparisons described above, the total numbers of predicted core CDSs varied according to annotation source for all pipelines other than PEPPAN. The four pipelines reported 3,301-3,515 core CDSs from PROKKA annotations ( Fig. 2A right). These numbers are similar to the total number of intact core CDSs plus pseudogenes within the curated pan-genome, indicating that PROKKA predicted many pseudogenes as intact CDSs. Roary, PIRATE and OrthoMCL only detected 2,418-2,484 core genes in the originally submitted genomes, suggesting inconsistencies between individual genome annotations. In contrast, all four pipelines predicted 2,901-2,957 core CDSs from the RefSeq annotations, and these numbers were similar to the numbers of intact core CDSs in the curated pan-genome (2,838), or as predicted by PEPPAN (2,918-2,961).
Inaccurate prediction of orthologs. Inconsistent ortholog calls relative to the manually curated pan-genome (Nuccio and Bäumler 2014) also contributed to variation in the numbers of core CDS predicted by the different pipelines. We designated as 'false splits' those cases where a single ortholog cluster in the curated pan-genome was split into multiple ortholog clusters by a pipeline. Similarly, 'false merges' occurred when multiple orthologous clusters in the curated pan-genome were assigned to a single orthologous cluster. We identified 4,695 'backbone genes' in the curated pan-genome which were present in the most recent common ancestor (MRCA) and 3,364 'mobile' genes, which were associated in one or more genomes with mobile genetic elements, and which were absent from the MRCA. For backbone genes, PEPPAN made the fewest false splits and false merges of all five pipelines, followed by PanX (Fig. 2D). False merges were made four times as often by all pipelines for mobile genes than for backbone genes, and false splits were up to two times as frequent (Fig. 2E vs. 2D). Roary generated the highest number of false calls, while PEPPAN generated the lowest.

Simulating pan-genome datasets
The analyses above indicate that the backbone and mobile genes might differ in their rates of gain and loss during evolution. In order to test the abilities of pan-genome pipelines to handle varying rates of gene gain and loss, we created SimPan (https://github.com/zheminzhou/SimPan) to simulate the evolution of real bacterial pangenomes (Supplemental Fig. S2 and Table S1; see Supplemental Text 2 for details). In brief, SimPan uses SimBac (Brown et al. 2016) to generate a clonal genomic phylogeny.
This clonal phylogeny is subjected to random homologous recombination, resulting in different "local trees" that reflect the individual ancestries of backbone and mobile genes.
Random INDEL events leading to loss or gain of blocks of genes are simulated along the branches of these local trees until the average number of genes per genome and in the core genome attain user-specified parameters '--aveSize' and '--nCore' (Supplemental Table S1). This results in a presence/absence matrix of all backbone and mobile genes. We simulated five genomic datasets each containing 15 genomes, using parameters derived from the curated S. enterica pan-genome, with each genome containing a mean of 3,621 core genes and 879 accessory genes (simulations a-e). We arbitrarily assigned 5% of the backbone genes and 40% of the mobile genes to paralogous clusters, and varied their mean percentage sequence identities between each set of simulations (Fig   3, inset). Simulation c represents the simplest pan-genome construction scenario, with high sequence identity (98%) between genes in an ortholog cluster and low sequence identity (60%) between genes in a paralog cluster. Simulations a and b have decreasing levels of identity between orthologs to simulate more diverse species while simulations d and e have increasing levels of identity between paralogs in order to simulate recent gene duplications.

Pipeline performance on simulated genomes
Pan-genomes calculated from each simulated dataset by PEPPAN, Roary, PIRATE, PanX, and OrthoMCL were compared to the original pan-genomes produced by SimPan PEPPAN correctly predicted all core genes in simulations b, c, and e, and only missed 2-3 core genes in the two remaining datasets (Fig. 3A, circles). Roary correctly predicted all single-copy core genes for simulation c, but failed to identify any multi-copy core genes for any dataset, likely due to its aggressive synteny-based paralog identification step. PIRATE, PanX, and OrthoMCL significantly underestimated the number of core genes when only single-copy core genes were counted, suggesting a high frequency of false splitting of paralog clusters. Indeed, the frequency of false merges was particularly high for backbone genes with these three pipelines, and the frequency of false splits was high with Roary and OrthoMCL (Fig. 3B). All pipelines made multiple false merges of mobile genes, possibly because of their predominance among paralog clusters, and Roary also made large numbers of false splits (Fig. 3C).
Overall, PEPPAN made the fewest false calls for both backbone and mobile genes, which explains its higher ARI scores.
The effects of missing gene annotations on the pan-genome. As shown above, inconsistent or inaccurate gene annotations are problematic for calculating reliable pangenomes. We simulated this effect by randomly deleting 0.1%, 1% or 2% of the gene annotations from simulation c ( Fig 3D). Because PEPPAN re-assigns individual genes to ortholog clusters, it was unaffected by these manipulations. However, the missing annotations yielded drastically reduced ARI scores ( Fig. 3D. histograms) and core genome sizes (Fig. 3D, circles) for the other pipelines, and ARI scores became progressively worse with the proportion of missing annotations.
Computation time. We generated 10 additional simulations of 20 to 200 genomes with the same parameters as simulation c, and measured the running wall times to calculate a pan-genome for all five pipelines using 4 processors on a server with 1 TB of memory and 40 CPU cores (Fig. 3E). OrthoMCL was the slowest and needed >24 h for 60 genomes. PanX was at least eightfold slower than the other three pipelines, and needed 500 min for 200 genomes, despite using a divide-and-conquer algorithm on datasets 12 with >50 genomes. Both Roary and PIRATE scaled very well, and each completed the calculations on 200 genomes within 30 minutes. PEPPAN is about twice as slow as either Roary or PIRATE, and needed 63 minutes for 200 genomes. The good scalability of these pipelines is likely related to the pre-clustering step, which reduces the number of genes used in downstream comparisons. However, this pre-clustering step becomes less efficient with increasing genetic diversity: in an independent simulation of 200 genomes with only 90% sequence identity, the runtime for all three pipelines increased by at least twofold relative to simulation c (PEPPAN: 144 min; Roary: 132; PIRATE: 60). We generated a dataset of 3052 high-quality genomes (Supplemental Table S2A) representing the entire taxonomic diversity of Streptococcus (see Methods). PEPPAN took 5 days to construct a pan-genome from this dataset. The resulting pan-genome contained 39,042 genes, twice as many as a previous pan-genome based on 138 Streptococcus genomes (Gao et al. 2014). In agreement with the earlier conclusions by Gao et al., the rarefaction curve showed no sign of plateauing, and the pan-genome continued to expand with each new genome added ( Fig 4A). Gao et al. estimated that the pan-genome would expand by 62 genes for each new genome, whereas we estimate a lower rate of 39 genes per new genome for a randomly sampled set of 138 genomes. However, the growth rate dropped dramatically with the increased number of genomes, and we estimate that the future expansion rate of the pan-genome is only 4.4 new genes for every newly added genome.
In contrast to earlier studies (Gao et al. 2014), which defined a strict core genome of 278 orthologs, we found only 182 genes that were shared across all Streptococcus 13 genomes ( Fig 4B, inset). Each of these was disrupted in at least one of the 14,115 Streptococcus genomes in RefSeq. This is a common problem for core genome analyses, especially because the multiple contigs within draft genomes can result in the absence of multiple genes from genome assemblies. Core genome schemes used for cgMLST are therefore usually based on a relaxed core, consisting of single-copy genes present in the large majority of representative isolates (Moura et al. 2016;Alikhan et al. 2018;Zhou et al. 2020). Our analyses identified 754 genes that were present in at least 2900 (95%) of the representative streptococcal genomes (Supplemental Table S3, Fig.   4B). However, most of the 754 genes were present in multiple copies in some genomes, leaving a final relaxed core of 292 single-copy genes that are suitable for identifying core genomic relationships and evolutionary history (Table 1; Supplemental Table S3).

Taxonomic clusters within Streptococcus
Streptococcus taxonomy is a highly dynamic area of research ( Single-linkage agglomerative clustering of pair-wise ANI values calculated from the 3052 representative genomes revealed 223 clusters (Supplemental Table S2). For the 29 clusters containing ≥ 10 genomes, we also identified a dominant species designation from NCBI metadata, as shown in Supplemental Table S4. Information on each cluster's pan-genome can be found in Supplemental Text 4 and Supplemental Table S5.
We used PEPPAN_parser to generate two trees of the 3052 representative genomes based on the presence or absence profiles of 39,042 pan genes (Fig. 5A) and on the allelic variation profiles of 292 relaxed core genes (Fig, 5B). The topology of the first tree reflects similarities in pan-genome content and the topology of the second tree reflects sequence similarities within core genes. Unsurprisingly, the details of these two topologies differed somewhat. In particular, the core gene tree contained an unresolved, 14 star-like radiation which we attribute to distinct sequences in all of the core genes from highly diverse species. However, despite these differences in deep branching topology, both trees showed comparable tight clustering of genomes corresponding to each of the 29 common taxonomic groupings. This tight clustering indicates that the topologies of both trees are congruent at the ANI95% level. Both trees also support published taxonomic assignments of subspecies. For example, MG_29 corresponds to S. gallolyticus and includes its three subspecies gallolyticus, macedonicus, and  Table 1). After excluding multi-copy genes, the sizes of the group-specific, 95% relaxed core genomes ranged from 380 (Mutans) to 672 (Suis) genes ( Fig. 4C-D, Table 1).
In accord with prior observations (Kilian et al. 2008; Kilian and Tettelin 2019), numerous discrepancies differentiate the ANI95% groups and the taxonomic designations in RefSeq. Some discrepancies reflect inaccurate metadata, but others reflect true discrepancies between ANI95% clusters and taxonomic designations made by expert microbiologists. For example, S. mitis spans 44 distinct ANI95% clusters (  Table S4). Similarly, S. oralis straddles multiple, distinct ANI95% clusters, as did each of the three S. oralis subspecies oralis, tigurinus and dentisani defined by Roary and Pirate both use MCL, a graph-based clustering approach (Enright et al. 2002) to split paralogous clusters. MCL identifies a strict optimal threshold that separates orthologous genes from paralogous genes, and scales well with the numbers of genes.
This approach is accurate for closely related genomes, but is error-prone when datasets contain both closely-related and distantly-related genomes, because a single optimal clustering threshold does not exist for both extremes. PIRATE thus failed to split many paralogous clusters from real (Figs. 2D, E) and simulated (Figs. 3B, C) genomes, especially for more diverse datasets (Fig. 3A). Roary implements an additional syntenybased approach to identify and split unresolved paralog clusters, but this approach also failed to correctly split orthologs into multiple clusters (Figs. 2D, E, 3B, C). In contrast, PEPPAN identifies an optimal threshold for each gene and uses that threshold to split paralogous branches in the gene trees. This allows accurate estimates of pan-genomes even in datasets of highly divergent genomes.
PanX uses a "divide and conquer" strategy for the gene comparisons, which is computationally demanding. In addition, PanX constructs a gene tree for every potential gene cluster, which, similar to other tree-based approaches, involves the alignment of

Effects of internal annotations by PEPPAN. Our benchmarking analyses on real and
simulated genomes revealed the strong impact of inconsistent annotations on the pangenome predictions ( Fig. 2A). Indeed, differences in annotation influenced the quality of the pan-genome more than pipeline algorithms (Fig. 2B), and decreased the number of core genes by up to one-third for some pipelines ( Fig. 2A). PEPPAN avoids this problem by implementing a similarity-based gene prediction step. Accordingly, pangenomes predicted by PEPPAN varied only slightly with different annotations ( Fig. 2A,   B). Draft genome assemblies based on 454 or IonTorrent sequencing include elevated numbers of single-base insertions and deletions due to inaccurate sequencing (Shao et al. 2013;Zhang et al. 2015). Including such genomes in an analysis reduces the quality of the pan-genome for all state-of-the-art pipelines. However, PEPPAN simply scores genes disrupted by artificial INDELs as frameshifts, making such inaccurate genomes easier to identify.
Finally, it is worth noting that the current similarity-based internal annotation algorithm implemented in PEPPAN is optimized for prokaryotes, and does not work for eukaryotic genomes, where multiple exons of a gene can be separated by introns of >10kb. Apart from this limitation, however, the other technological advantages in PEPPAN will also work on eukaryotic genomes. PEPPAN could therefore be extended for use on eukaryotes with collaboration from experts in eukaryotic genomics.
Relevance to MLST schemes. Alikhan et al. (2018) described a pan-genome for all of Salmonella based on 537 genomes that had been derived by a precursor of PEPPAN in 2015. That pan-genome was used to develop a wgMLST scheme of 21,065 loci and a cgMLST scheme of 3002 genes. The same publication also described a reference set of 926 genomes that represented the diversity of almost 120,000 Salmonella genomes on the basis of rMLST. After completion of this manuscript, we became aware of a new publication (Park and Andam 2020) which used Roary to calculate a pan-genome of 84,041 S. enterica genes and 2085 soft-core genes from those 926 representative genomes after re-annotation with PROKKA. Such applications of Roary are strongly discouraged by its documentation, which recommends against using Roary on diverse groups of organisms such as all Salmonella. We ran PEPPAN on the same 926 representative genomes. The resulting pan-genome contained 30,000 fewer pan-genes and 1200 more soft-core genes than the calculations by Park and Andam (Supplemental Table S6), confirming that Roary struggled with this task. The high resolution and continued reliability that EnteroBase offers in downstream analyses of phylogenetic relationships between genomes are in part due to the accurate, smaller pan-genome and larger core-genome that was calculated by PEPPAN. The analyses presented here identified a reliable relaxed soft-core genome consisting of 292 single copy genes for Streptococcus, which is currently being used to establish an EnteroBase database for this diverse genus. and Bäumler excluded some "unreliable" short genes from their manual re-curation. In order to exclude these genes in our analyses as well, the genomic coordinates of each gene in each of the three annotations (Submitter, RefSeq, PROKKA) were compared with the coordinates of "reliable genes" in Table S1 of Nuccio and Bäumler. Only genes with co-ordinates overlapping those of a reliable gene by ≥ 90% were used here for further comparisons.

Preparation of simulated datasets.
All simulated datasets were generated using SimPan (Supplemental Text 2) with the Pan-genome pipelines.
The following versions of the individual pipelines and command lines were used for all benchmark datasets.
(1) PEPPAN with a Git HEAD of f721513 was run in the Python 3.6 environment as: python PEPPA.py -t 4 -p PEPPAN --pseudogene 0.9 --min_cds 45 *.gff (2) Roary 3.6.0+dfsg-4 was installed as a Ubuntu APT package and run as: roary -p 4 -o roary -f roary -i <identity> -s -v -y *.gff Three runs of Roary were performed for each dataset with the additional parameters "-i 50", "-i 80" or "-i 95". The data reported here are from the runs with the parameter '-i 80' because that consistently yielded the best ARI values.

Generating ANI95% clusters of Streptococcus genomes.
A summary table of all genomes deposited in RefSeq was downloaded on 20 June of 2019 (see S. enterica genomes above). 14,115 bacterial records that contained "Streptococcus" in the "organism_name" field were extracted from the table (Supplemental Table S7), and the files for each record were downloaded as described above. MASH (Ondov et al. 2016) Table S2B), leaving a dataset of 3052 high-quality genomes (Supplemental Table S2A) that represents the entire taxonomic diversity of Streptococcus. Pair-wise ANI values were calculated from the 3052 representative genomes with FastANI v1.2 (Jain et al. 2018), and these genomes were grouped into ANI95% clusters using the AgglomerativeClustering function with linkage=single and distance_threshold=0.05.

Data Access
Source code for PEPPAN is accessible at https://github.com/zheminzhou/PEPPAN and as Supplemental Code S1. Source code for SimPan is accessible at https://github.com/zheminzhou/SimPan and as Supplemental Code S2. Trees presented in Figure

Competing interest statement
The authors declare no competing interests.  R  e  c  o  g  n  i  z  i  n  g  t  h  e  p  s  e  u  d  o  g  e  n  e  s  i  n  b  a  c  t  e  r  i  a  l  g  e  n  o  m  e  s  .  N  u  c  l  e  i  c  A  c  i  d  s  R  e  s  3  3   :  3  1  2  5  -3  1  3  2  .   L  i  L  ,  S  t  o  e  c  k  e  r  t  C  J  ,  J  r  .  ,  R  o  o  s  D  S  .  2  0  0  3  .  O  r  t  h  o  M  C  L  :  i  d  e  n  t  i  f  i  c  a  t  i  o  n  o  f  o  r  t  h  o  l  o  g  g  r  o  u  p  s  f  o  r  e  u  k  a  r  y  o  t  i  c  g  e  n  o  m  e  s  .  G  e  n  o  m  e  R  e  s  1  3   :  2  1  7  8  -2  1  8 S  i  m  o  n  s  e  n  M  ,  M  a  i  l  u  n  d  T  ,  P  e  d  e  r  s  e  n  C  N  S  .  I  n  f  e  r  e  n  c  e  o  f  l  a  r  g  e  p  h  y  l  o  g  e  n  i  e  s  u  s  i  n  g  N  e  i  g  h  b  o  u  r  -J  o  i  n  i  n  g  .  2  0  1  1  .  B  i  o  m  e  d  i  c  a  l  E  n  g  i  n  e  e  r  i  n  g  S  y  s  t  e  m  s  a  n  d  T  e  c  h  n  o  l  o  g  i  e  s  :  3  r  d  I  n  t  e  r  n  a  t  i  o  n  a  l  J  o  i  n  t  C  o  n  f  e  r  e  n  c  e  ,  B  I  O  S  T  E  C  2  0  1  0  .  C  o  m  m  u  n  i  c  a  t  i  o  n  s  i  n  C  o  m  p  u  t  e  r  a  n  d  I  n  f  o  r  m  a  t  i  o  n  S  c  i  e  n  c  e  ,  3  3  4  -3 P  e  n  t  o  n  C  R  ,  X  u  e  C  ,  W  a  n  g  Q  ,  Z  h  e  n  g  T  ,  T  i  e  d  j  e  J  M  .  2  0  1  5  .  E  v  a  l  u  a  t  i  o  n  o  f  t  h  e  I  o  n  T  o  r  r  e  n  t  p  e  r  s  o  n  a  l  g  e  n  o  m  e  m  a  c  h  i  n  e  f  o  r  g  e  n  e  -t  a  r  g  e  t  e  d  s  t  u  d  i  e  s  u  s  i  n  g  a  m  p  l  i  c  o  n  s  o  f  t  h  e  n  i  t  r  o  g  e  n  a  s  e  g  e  n  e  n  i  f  H  .  A  p  p  l  E  n  v  i  r  o  n  M  i  c  r  o  b  i  o  l  8  1   :  4  5  3  6  -4  5  4  5  .   Z  h  a  o  Y  ,  W  u  J  ,  Y  a  n  g  J  ,  S  u  n  S  ,  X  i  a  o  J  ,  Y  u  J  .  2  0  1  2  .  P  G  A  P  :  p  a  n  -g  e  n  o  m  e  s  a  n  a  l  y  s  i  s  p  i  p  e  l  i  n  e  .  B  i  o  i  n  f  o  r  m  a  t  i  c  s  2        clusters as shown in the inset. Many Streptococcus species have been assigned to one of six traditional taxonomic groups whose names are shown outside colored arcs.

28
These trees define from the Suis group which contains S. suis. A black arrow in Figure   B shows the root of the tree, where multiple branches radiate directly outwards due to lack of resolution of cgMLST for such distant taxa. All ANI95% cluster information can be found in Supplemental    No. core genes (circles) P E P P A P E P P A R o a r y R o a r y S im u la t io n P I R A T E P I R A T E P a n X P a n X