Complete avian malaria parasite genomes reveal features associated with lineage-specific evolution in birds and mammals

Avian malaria parasites are prevalent around the world and infect a wide diversity of bird species. Here, we report the sequencing and analysis of high-quality draft genome sequences for two avian malaria species, Plasmodium relictum and Plasmodium gallinaceum. We identify 50 genes that are specific to avian malaria, located in an otherwise conserved core of the genome that shares gene synteny with all other sequenced malaria genomes. Phylogenetic analysis suggests that the avian malaria species form an outgroup to the mammalian Plasmodium species, and using amino acid divergence between species, we estimate the avian- and mammalian-infective lineages diverged in the order of 10 million years ago. Consistent with their phylogenetic position, we identify orthologs of genes that had previously appeared to be restricted to the clades of parasites containing Plasmodium falciparum and Plasmodium vivax, the species with the greatest impact on human health. From these orthologs, we explore differential diversifying selection across the genus and show that the avian lineage is remarkable in the extent to which invasion-related genes are evolving. The subtelomeres of the P. relictum and P. gallinaceum genomes contain several novel gene families, including an expanded surf multigene family. We also identify an expansion of reticulocyte binding protein homologs in P. relictum, and within these proteins, we detect distinct regions that are specific to nonhuman primate, humans, rodent, and avian hosts. For the first time in the Plasmodium lineage, we find evidence of transposable elements, including several hundred fragments of LTR-retrotransposons in both species and an apparently complete LTR-retrotransposon in the genome of P. gallinaceum.


Supplemental Figures
Supplemental Figure S1. Conservation of synteny in the core regions of chromosomes. ACT (Artemis Comparison Tool) screenshot showing a comparison of centromere-proximal regions of an illustrative chromosome in P. berghei ANKA (Pberg) (chromosome 7), P. gallinaceum (Pgal) (scaffold 61), P. relictum (Prel) (chromosome 5) and P. falciparum 3D7 (Pf3D7) (chromosome 4). This conservation of synteny in the core regions is representative for all other Plasmodium chromosomes. The red blocks represent sequence similarity (TBLASTX). The centromere is shown in green. Yellow coloured boxes represent genes. The graph shows the GC-content, the highest GC content is 40%, the GC content at the centromere is 3%.   Figure S4. Comparisons of P. knowlesi, P. relictum , P. gallinaceum and P. falciparum 3D7 reveal avian malaria-specific genes Screenshots of comparisons using ACT that show (A) an avian malaria specific ApiAP2 protein (PGAL8A_00142800, PRELSG_113400) and (B) a gene of unknown function (green) within the core region of a chromosome that is only present in P. relictum (PRELSG_0909800). Grey bars represent the forward and reverse DNA strands. The red blocks between sequences represent sequence similarity (TBLASTX). PKNH, P. knowlesi; Prel, P. relictum; Pgal, P. gallinaceum; and Pf3D7, P. falciparum 3D7. Figure S5. 3D modeling of Ku70. (A) A three-dimensional model of Ku70 (PGAL8A_000014200) created using I-TASSER (Yang et al. 2015). The structure is visualized as a rainbow cartoon by the JSmol applet. (B) Ku70 (PGAL8A_000014200) structure is shown in cartoon format, while the highest scoring identified structural analog in PDB (1JEQ, Ku heterodimer, homo sapiens) is displayed using backbone trace. Figure S6. Differentially distributed pseudogenes between P. relictum and P. gallinaceum. Screenshots from ACT showing: (A) An avian-malaria specific gene in a core region of a chromosome and encoding a protein kinase. The kinase is a pseudogene in P. relictum (PRELSG_0314000) but a functional gene in P. gallinaceum (PGAL8A_00181200). (B) A degenerate multidrug-resistance associated protein 1, MRP1 (PRELSG_0028300) in P. relictum. There is an additional MRP1 (PRELSG_1445800) in P. relictum that is not pseudogenised. The red blocks between sequences represent sequence similarity (TBLASTX). Coloured boxes represent genes. Pseudogenes in P. relictum are shown in grey with multiple frameshifts. Figure S7. Comparisons of P. knowlesi, P. relictum, P. gallinaceum and P. falciparum 3D7 redefine lineage specific distributions. Screenshot of comparisons using ACT that show (A) an ATPase1 that was thought to be Laveraniaspecific (PF3D7_0516100) but is present in P. gallinaceum (PGAL8A_00025200) and P. relictum (PRELSG_1015800). (B) An ApiAP2 protein (PKNH_1015400) that was thought to be specific to the knowlesi/vivax/cynomolgi clade but is also present in P. gallinaceum (PGAL8A_00039400) and P. relictum (PRELSG_1014000). PKNH, P. knowlesi; Prel, P. relictum; Pgal, P. gallinaceum; and Pf3D7, P. falciparum 3D7. The red blocks between sequences represent sequence similarity (TBLASTX). Coloured boxes represent genes. Figure S8. Shikimate Pathway highlighting differences in the avian malaria genomes. Chorismate synthase (PGAL8A_00151500, PRELSG_1125300) and aminodeoxy-chorismate synthase (PGAL8A_00435300, PRELSG_0720000) are pseudogenised in P. gallinaceum and P. relictum. The pentafunctional AROM polypeptide is missing in P. gallinaceum and P. relictum (orthologue in P. falciparum is PF3D7_0206300). The metabolic pathway illustration is from the Malaria Parasite Metabolic Pathway site (http://mpmp.huji.ac.il; Ginsburg and Abdel-Haleem 2016).

Supplemental Figure S12. Maximum likelihood tree of avian malaria retrotransposons.
Tree showing the differences between Eimeria and P. gallinaceum/P. relictum transposable elements (TE). It further shows the similarity of the avian TEs to the TEs found in Haemoproteus tartakovskyi. Sequences were selected from the top 100 best hits from an NCBI BLAST of the P. gallinaceum sequence, plus the Eimeria and yeast sequence.  Tables   Supplemental Table S1. Core genes only present in P. gallinaceum 8A, P. relictum SGS1. Pseudogenes are indicated (*).

Relationship between Plasmodium species
The phylogenetic tree presented in this paper is robust to changes in the substitution model used for phylogenetic inference. The same phylogeny is from a maximum-likelihood tree under both partitioned and non-partitioned models (Supplemental Fig. S3A) and maximum parsimony analysis also agrees. Our result is also not just a result of trimming the alignment to remove poorly-aligned regions, as this tree also maximises the likelihood of a non-trimmed concatenated alignment under a simple substitution model. Simpler, non-model based approaches produce different trees, with neighbour-joining and simple amino acid distance producing a phylogeny matching that shown by Pick et al (Pick et al. 2011), with a clade of Laverania and avian malaria species, and the primateinfective species outside Laverania forming a clade related to a clade of rodent malaria (Supplemental Fig. S3B).

Whole genome sequencing of P. gallinaceum
From 20ng of the enriched genomic DNA whole genome amplification (WGA) was performed with REPLI-g Mini Kit (Qiagen) following a modified protocol (Oyola et al. 2014). Nuclease-free water and all tubes were UV-treated before use. WGA reactions were performed in 0.2 ml PCR tubes. Buffer D1 stock solution (Qiagen) was reconstituted by adding 500 µl of nuclease-free water and a working solution was prepared by mixing the stock solution and nuclease-free water in the ratio of 1:3.5 respectively. Buffer N1 was modified to include Tetramethylammonium chloride (TMAC) at a concentration of 300 mM. To denature DNA templates, 5 µl of the DNA solution was mixed with 5ul of buffer D1 (working solution prepared as described above). The mixture was vortexed and centrifuged briefly before incubating at room for 3 min. Denatured DNA was neutralized by adding 10 µl of the modified buffer N1. Neutralized DNA was mixed by vortexing and centrifuged briefly. To amplify the DNA template, denatured and neutralized sample was mixed with 29 µl of REPLI-g Mini Reaction Buffer and 1ul of REPLI-g Mini DNA polymerase to obtain a final reaction volume of 50 µl. The reaction mixture was incubated at 30 o C for 16 hr using an MJ thermocycler with the heating lid set to track at +5 0 C. Amplified DNA was cleaned using Agencourt Ampure XP beads (Beckman Coulter) using sample to beads ration of 1:1 and eluted with 50 µl of EB (Qiagen).

Collection of genomic DNA from P. relictum
Heavily infected mosquito midguts were obtained from a laboratory line of Cx. pipiens quinquefasciatus (SLAB) that had been placed in a cage and allowed to blood feed from a heavily infected canary following standard laboratory protocols (Cornet et al. 2013). Two such cages, each with 70 mosquitoes, were set up in this way (bird parasitaemias were 4.45% and 7.89%). After the blood meal, mosquitoes were kept at 25 o C and 80% relative humidity and dissected 7 days later to coincide with the midgut (oocyst) stage of the Plasmodium infection. Midguts were dissected and oocyst numbers assessed using standard laboratory procedures (Zélé et al. 2014). Total DNA was extracted from a single pool of 50 heavily infected midguts (>100 oocysts) using the QIAGEN protocol and materials (DNeasy 96 Tissue Kit, Qiagen NV, Venlo, The Netherlands) and total DNA was eluted in the final step with 100µL RNase free water (Qiagen).

Phylogenetic analysis
OrthoMCL v2.0 was used to cluster predicted proteins from 19 species of Apicomplexan parasites, including 11 previously published Plasmodium species: *P. berghei, *P. chabaudi, *P. yoelli, *P. cynomolgi, *P. falciparum, *P. knowlesi (Pain et al. 2008 with default parameters and an inflation parameter of 1.5. The output was parsed to identify a total of 881 clusters that were single-copy and present in all 19 species. Amino acid sequences for all of these clusters were aligned using mafft v7.205 (Katoh and Standley 2013) with the '--auto' flag and other parameters as defaults. These alignments were trimmed using GBlocks v0.91b (Castresana 2000) to keep well-aligned blocks of at least four consecutive well-aligned columns, separated by up to four less-conserved columns, and to discard columns with gap characters in at least 50% of sequences. All trimmed gene cluster alignments with more than 10 amino acid residues (879 out of 881) were kept for subsequent analysis. Subsequent phylogenetic analyses were all based on this alignment of 289,315 amino acid residues, from 879 single-copy gene clusters.
Bayesian phylogenetic inference was performed using PhyloBayes 3.3f (Lartillot et al. 2009) under a CAT mixture model, allowing the rate of substitutions to vary between sites according to a discretised gamma distribution and the substitution process at each site to come from a mixture of amino acid composition matrices but with a single underlying Poisson process for the substitution process. We ran 8 independent MCMC chains of at least 60,000 steps each. The final 1500 trees from each chain were concatenated for inference (discarding approximately 20,000 steps per chain as burn-in). While model parameter estimates had not all converged across all chains, tree topologies appeared to be following visualisation with "R We There Yet?" (https://github.com/danlwarren/RWTY). Maximum-likelihood phylogenetic analysis using RAxML v.8.0.24 (Stamatakis 2014) was performed using a partitioned model where the alignment for each locus was assigned the best-fitting model under BIC from the set of empirical amino acid substitution matrices available in that version of RAxML and using observed amino acid composition, and under a single LG4X model for the whole alignment with maximum-likelihood estimates of amino acid composition. Additional analyses used PAUP v4.0b10 and Phylip v3.6.9 (Felsenstein 2005) for parsimony and neighbour-joining analysis of standard AA pairwise distances (under the JTT model) and Log-Det distances calculated using LDDist v1.3.2 (Thollesson 2004).
To generate the RBP and transposable element (TE) trees, we trimmed the alignments with Gblocks in Seaview version 4.3.1 (Galtier et al. 1996) allowing the loosest settings. The models were estimated with RAXML and 100 bootstraps. The models PROTGAMMALG4M and PROTGAMMAGTR were used for the TE and the RPB analyses, respectively. To select the sequence for the TE tree, we BLAST-searched the P. gallinaceum TE against the non-redundant nucleotide database, took all the hits and included the TE sequences from Eimeria and yeast (U6KAF4_9EIME, U6GBW4_9EIME and YG31B_YEAST). Using a simple randomisation approach the association of GC content with distinct TE clades was tested. 10,000 sets of random GC content were constructed with the same size as each of the four main TE clades and the frequency in which the observed GC partitioning was reproduced. For both P. gallinaceum clades, the results were significant (p ~= 0.0059 for the big clade of higher mean percentage GC and p < 0.0001 for the smaller clade of lower mean percentage GC).

Dating analysis
Using the Bayesian coalescence method G-PhoCS (Gronau et al. 2011) and several genotypes, the divergence times of P. malariae and P. malariae-like have previously been estimated to be similar to that of P. falciparum and the chimpanzee parasite P. reichenowi (Rutledge et al. 2017). Using the same method, the divergence of Plasmodium ovale wallikeri and P. o. curtisi was estimated to have occurred 5 times earlier. By analyzing mutation rates and in vivo data, the divergence of P. falciparum and P. reichenowi has been estimated to have occurred approximately 200,000 years ago (Otto et al. 2017) and the P. ovale split must have therefore occurred around 1 million years ago. This is a revised estimate from Rutledge et al (Rutledge et al. 2017) where the date of the P. ovale split had been calibrated on previously published estimates for the P. reichenowi-P.falciparum (3.5 -5.5 MYA) split.
With only a single representative sample for each avian-infective species, it is not possible to use G-PhoCS to estimate divergence times but the dates for other species dates can be used as guides.
We used a method based on a Total Least Squares regression and the existence of a molecular clock specific to Plasmodium (Silva et al. 2015) to estimate speciation dates. To implement and test the method, we first generated amino acid alignments of 18 species from 2,915 one-to-one orthologues.
This set included species whose speciation times have been dated using coalescence modelling (Rutledge et al. 2017;Otto et al. 2017). Different to the original work of Silva et al. (Silva et al. 2015) we generated an alignment including all the protein sequence of all the 18 species, rather than pairwise comparisons -which is now possible due to better genome sequences, for example for P. reichenowi (Otto et al. 2014b) that had just around 445 one-to-one orthologues in the original work. The alignments for each orthologous group were performed with mafft (--auto parameter), and further trimmed with Gblocks (Talavera and Castresana 2007) (parameter -t=p --b5=h -p=n -b4=2) to exclude gaps and badly aligned regions. Following the original work, we obtained the control file with PAML (Yang 2007) and the R code from the author for the Total Least Squares regression.
First, the known speciation timings were evaluated. As expected, the speciation of P. ovale walikeri and P. o. curtsi was predicted to be 5x earlier than the P. malariae and P. malariae-like split, confirming the validity of the method. Further, the relative time between the split of P. reichenowi with P. falciparum and P. praefalciparum with P. falciparum was predicted as in (Otto et al. 2017). In contrast, the relative timing of P. malariae separating from P. malariae-like and P. reichenowi from P. falciparum was predicted to be 2.5x apart. This apparent discrepancy is probably due to the huge difference in GC content and the resulting amino acid bias that influences the molecular clock.
Next the method was applied to all orthologous core genes ( Figure 2) but with the 250 genes with the highest dN values excluded, as non-neutrally evolving outliers, resulting in 3646 orthologues. Using previous estimates (Otto et al. 2017), the P. ovale split can be calibrated to 1 million years ago. The divergence of the avian and mammalian Plasmodium lineages therefore occurred between 10-13 million years ago ( Figure 2). The P. ovale split has previously been estimated to be around 3 million years ago (Sutherland et al. 2010), resulting in an estimate of 30-40 million years ago for the split of avian and mammalian Plasmodium, still an order of magnitude more recent than the estimated dates for the avian and mammal lineages diverging around 320 million years ago (Kumar and Hedges 1998).