Evidence That Plant-Like Genes in Chlamydia Species Reflect an Ancestral Relationship between Chlamydiaceae, Cyanobacteria, and the Chloroplast

  1. Fiona S.L. Brinkman1,2,3,12,13,
  2. Jeffrey L. Blanchard4,5,
  3. Artem Cherkasov3,
  4. Yossef Av-Gay6,
  5. Robert C. Brunham7,
  6. Rachel C. Fernandez2,
  7. B. Brett Finlay2,
  8. Sarah P. Otto8,
  9. B.F. Francis Ouellette9,
  10. Patrick J. Keeling10,
  11. Ann M. Rose3,
  12. Robert E.W. Hancock2, and
  13. Steven J.M. Jones11
  1. 1Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6; 2Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z3; 3Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada, V6H 3N1; 4Promega Corporation, Madison, Wisconsin 53711, USA; 5National Center for Genome Resources, Santa Fe, New Mexico 87505, USA; 6Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada, V5Z 4E3; 7University of British Columbia Centre for Disease Control, Vancouver, British Columbia, Canada, V5Z 4R4; 8Department of Zoology, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4; 9Centre for Molecular Medicine and Therapeutics, Vancouver, British Columbia, Canada, V5Z 4H4; 10Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4; and 11Genome Sequence Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada, V5Z 4E6

Abstract

An unusually high proportion of proteins encoded in Chlamydia genomes are most similar to plant proteins, leading to proposals that a Chlamydia ancestor obtained genes from a plant or plant-like host organism by horizontal gene transfer. However, during an analysis of bacterial–eukaryotic protein similarities, we found that the vast majority of plant-like sequences in Chlamydia are most similar to plant proteins that are targeted to the chloroplast, an organelle derived from a cyanobacterium. We present further evidence suggesting that plant-like genes in Chlamydia, and other Chlamydiaceae, are likely a reflection of an unappreciated evolutionary relationship between the Chlamydiaceae and the cyanobacteria-chloroplast lineage. Further analyses of bacterial and eukaryotic genomes indicate the importance of evaluating organellar ancestry of eukaryotic proteins when identifying bacteria-eukaryote homologs or horizontal gene transfer and support the proposal that Chlamydiaceae, which are obligate intracellular bacterial pathogens of animals, are not likely exchanging DNA with their hosts.

[Supplementary Material is available online at http://www.genome.org and at http://www.pathogenomics.bc.ca/BAE-watch.html.]

The Chlamydiaceae family of bacteria includes several pathogens of animals and two important obligate human pathogens, Chlamydia trachomatis and Chlamydophila pneumoniae (Everett et al. 1999a; note that Chlamydophila pneumoniae was previously called Chlamydia pneumoniae). C. trachomatis is the causative agent of the sexually transmitted disease Chlamydia, the most frequently reported infectious disease in the U.S. and Canada and one of the leading causes of female infertility, ectopic pregnancy, and chronic pelvic pain (Division of STD Prevention 2000). It is also the causative agent of the ocular disease trachoma, one of the leading causes of blindness worldwide. C. pneumoniae causes acute respiratory infections and has been implicated in the development of atherosclerosis (Campbell et al. 1998). All Chlamydiaceae require intracellular infection of a host cell to replicate, complicating efforts to study these pathogens and develop a vaccine. To aid research, genome sequences have been obtained for five chlamydial strains comprising three species (Stephens et al. 1998; Kalman et al. 1999; Read et al. 2000; Shirai et al. 2000), and one of the most surprising observations from genome analyses has been the relatively high proportion of genes with highest similarity to plant sequences (Stephens et al. 1998). The obligate intracellular lifestyle of these bacteria has led to proposals that a Chlamydia ancestor obtained such genes from a plant or plant-like amoebal host organism by horizontal gene transfer (Stephens et al. 1998; Wolf et al. 1999b; Lange et al. 2000; Royo et al. 2000). Presumably, the intimate association between the Chlamydiaceae and their host cells would increase the chance of horizontal exchange of genes between host and bacterium. However, we present evidence that such plant-like genes in the Chlamydiaceae do not reflect horizontal gene transfer between these bacteria and their hosts.
Rather, the plant genes appear to be derived from the cyanobacterial endosymbiont that gave rise to the chloroplast, and their similarity to homologs in the Chlamydiaceae reflects an ancient evolutionary relationship between Chlamydiaceae, cyanobacteria, and the chloroplast. Further analyses support our proposal that Chlamydiaceae are not likely exchanging DNA with their hosts and indicate the importance of evaluating the organellar ancestry of eukaryotic proteins.

RESULTS AND DISCUSSION

Analysis of Unusual Bacteria–Eukaryote Protein Similarities: Confirmation They Disproportionately Involve Chlamydiaceae, Cyanobacteria, and Rickettsia

We have developed an automated analysis of protein similarity based on BLAST (Altschul et al. 1997) to detect bacterial proteins notably more similar in primary sequence to eukaryotic proteins over other bacterial or archaeal proteins (and, conversely, eukaryotic proteins notably more similar to bacterial proteins over eukaryotic or archaeal proteins). A publicly available version of our analysis is at www.pathogenomics.bc.ca/BAE-watch.html (under the first three options). Although this analysis has obvious limitations (see Methods) and is not a substitute for phylogenetic analysis, we found it to be a useful aid in investigating bacteria–eukaryotic protein similarities at the primary sequence level.

This analysis showed that 65% of bacterial proteins identified with the highest similarity to a eukaryotic protein involved Chlamydia, Chlamydophila, Synechocystis, and Rickettsia, although these organisms only accounted for 14% of the genes analyzed (Fig. 1; Supplementary material; http://www.pathogenomics.bc.ca/BAE-watch.html). The proteins identified from Rickettsia were found to be disproportionately of the “energy production and conversion” functional category, and the Synechocystis and Chlamydiaceae proteins were found to be disproportionately similar to plant proteins. For Rickettsia and Synechocystis this was expected, due to the ancestral relationship between Rickettsia (an α-proteobacterium) and the energy-producing mitochondria and the ancestral relationship between Synechocystis (a cyanobacterium) and the chloroplast of plants and algae (Andersson et al. 1998; Reumann and Keegstra 1999). It is well known that a large proportion of organellar proteins are encoded by nuclear genes and that these proteins are targeted to the organelle posttranslationally using a transit peptide. It is thought that most of these genes were transferred from the endosymbiotic bacterium to the host nucleus during the transition of endosymbiont to organelle (Gray 1992). The “eukaryotic” genes identified from Rickettsia and Synechocystis are, therefore, not surprisingly predominantly similar to genes encoding proteins that function in the mitochondria and the chloroplast, respectively. A report proposing many horizontal gene transfer events between Rickettsia and eukaryote nuclear genes (Wolf et al. 1999b) did not include consideration of the movement of organellar genes into the nuclear genome, a phenomenon that has been known for some time (Weeden 1981) but is only now becoming more appreciated in eukaryotic genomics (Blanchard and Lynch 2000; Rujan and Martin 2001).

Figure 1.

Proportion of proteins, predicted from complete bacterial genomes, which share highest similarity to eukaryotic proteins (according to analysis with default stringency settings; see http://www.pathogenomics.bc.ca/BAE-watch.html). Results for those organisms with a higher proportion than expected are circled. Similar results are obtained when different stringency cutoffs are used (see Supplementary Material available online at http://www.genome.org).

Plant-Like Genes in Chlamydiaceae: Plant Homologs Tend to Function in the Chloroplast

The notable number of plant-like genes in Chlamydiaceae genomes was more puzzling because Chlamydiaceae have no described relationship with any organelle. It was previously proposed that Chlamydia species obtained the genes from a host their ancestor had previously infected, such as a plant-like amoeba (due to the existence of Chlamydia-like organisms that infect Acanthamoeba, although Acanthamoeba is actually closely related to animals and fungi), whereas others suggested that they had simply obtained the genes from a plant (Stephens et al. 1998; Wolf et al. 1999b; Lange et al. 2000; Royo et al. 2000). However, analysis of multiple Chlamydiaceae genomes revealed a high level of conservation, suggesting they have been subjected to little horizontal gene transfer with other genera (Read et al. 2000). So where do the plant-like genes in Chlamydiaceae come from? Our comparison of eukaryotic genomes to those of Chlamydiaceae revealed that of the 18 cases of Chlamydia genes previously proposed to have been horizontally acquired from plants (Wolf et al. 1999a; Lange et al. 2000; Royo et al. 2000), 15 are similar to genes encoding proteins that function in the chloroplast in plants and the remaining 3 do not show a significant Chlamydia-plant relationship when subjected to phylogenetic analysis (Table 1). Furthermore, with the completion of the first plant genome (The Arabidopsis Genome Initiative 2000), we identified an additional 19 Chlamydiaceae proteins that are most similar to plant proteins; 15 of these plant proteins are chloroplast targeted, 2 are predicted to be mitochondrial, and the remaining 2 do not bear out a significant Chlamydiaceae-plant relationship after phylogenetic analysis (Table 1). Additional Chlamydiaceae genes have also been previously noted to share highest similarity with proteins encoded in the chloroplast genome (Wolf et al. 1999b).
It therefore appears that the vast majority of plant-like genes in Chlamydiaceae correspond to plant genes that are derived from, and function in, the chloroplast.

Table 1.

Subcellular Localization in Plants of Proteins Similar to Chlamydia Proteins According to Low-Stringency BAE- (bacteria, archaea, and eukarya) Watch Analysis

Evidence Chlamydiaceae, Cyanobacteria, and the Chloroplast Share an Ancient, Ancestral Relationship

With apparent links between Chlamydiaceae and chloroplast genes, we wondered whether Chlamydiaceae share a closer relationship with the chloroplast and cyanobacteria than is presently recognized. Previous phylogenetic analysis using small-subunit ribosomal RNA sequences did indeed suggest that Synechocystis and Chlamydiaceae form sister groups (Nelson et al. 2000), and this was confirmed through a bootstrapped analysis we performed with more cyanobacterial, Chlamydiaceae, and chloroplast sequences (data not shown). However, such analysis does not group these lineages with high confidence. This is most likely due to a significant divergence time between these lineages, which severely limits the phylogenetic information (informative sites) available and also reduces the number of gene sequences that can be analyzed adequately. For analysis of such evolutionary relationships, it is becoming increasingly apparent that one should investigate multiple analyses and that such analyses should be carefully chosen for their appropriateness given the level of divergence being investigated. Character-based analysis of more slowly evolving molecular features is another approach (Qiu and Palmer 1999) that appears suitable in this case. Genomic characters, such as the presence or absence of signature sequences, introns, or genes in conserved operons, have been previously used to delineate a number of major groupings, including uniting certain charophycean green algae with plants (Baldauf et al. 1990; Manhart and Palmer 1990), grouping fungi and animals to the exclusion of plants and protists (Baldauf et al. 1996), and developing our picture of animal phylogeny (Boore et al. 1995). We therefore analyzed the ribosomal superoperon of 36 complete microbial genomes and 10 chloroplast genomes, investigating gene acquisition and loss from this operon as a slowly evolving character-based analysis.
We identified several unique shared characters that unite Chlamydiaceae and Synechocystis/cyanobacteria exclusively and additional nonunique shared characters (Fig. 2). Another previously published slowly evolving character-based analysis of an unspliced group I intron in 23S rRNA also supports a link between Chlamydiaceae and the chloroplast lineage (Everett et al. 1999b). These results are also supported by analysis of the incomplete genome of the cyanobacterium Synechococcus sp. strain WH8102 (preliminary sequence data obtained from the DOE Joint Genome Institute (JGI) at http://www.jgi.doe.gov/JGI_microbial/html), which shares the same unique and nonunique characters. Thus multiple genomes from the cyanobacterial and Chlamydiaceae lineages support this sisterhood. In addition, all 10 completely sequenced chloroplast genomes that we analyzed also share these characters (see Fig. 2 for a representative chloroplast analysis and see Methods for a list of the others). However, there has been additional gene loss from the chloroplast ribosomal superoperon (primarily through apparent transfers of genes to the plant nuclear genome; Fig. 2; data not shown). These observations, together with the existence of a higher than expected proportion of apparent chloroplast protein homologs in Chlamydiaceae genomes (and some weak phylogenetic analyses), appear to link Chlamydiaceae with the cyanobacterial/chloroplast lineage.
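The distinction between unique and nonunique shared characters can be illustrated with a short sketch. This is not the procedure used in the study; it only shows how gene losses from an operon could be classified once the gene content of the operon is known for each genome. The function name `shared_losses`, the taxon labels, and the toy gene sets are hypothetical and are chosen only to mirror the pattern described for S10, S14, and L30.

```python
from typing import Dict, List, Set, Tuple


def shared_losses(
    operon_genes: Dict[str, Set[str]],
    focal_taxa: Set[str],
    reference_genes: Set[str],
) -> Tuple[List[str], List[str]]:
    """Classify genes missing from the operon in every focal taxon.

    Returns (unique, nonunique): 'unique' losses occur in the focal taxa
    and nowhere else; 'nonunique' losses occur in all focal taxa but also
    in at least one other taxon.
    """
    unique: List[str] = []
    nonunique: List[str] = []
    for gene in sorted(reference_genes):
        # Taxa whose operon lacks this gene.
        lacking = {taxon for taxon, genes in operon_genes.items() if gene not in genes}
        if focal_taxa <= lacking:
            if lacking == focal_taxa:
                unique.append(gene)
            else:
                nonunique.append(gene)
    return unique, nonunique


# Toy, hypothetical operon contents (placeholder taxa, not real genome data).
toy_operons = {
    "chlamydia_like": {"L15"},
    "cyanobacterium_like": {"L15"},
    "outgroup_1": {"S10", "S14", "L30", "L15"},
    "outgroup_2": {"S10", "S14", "L15"},  # also lacks L30
}
unique, nonunique = shared_losses(
    toy_operons,
    focal_taxa={"chlamydia_like", "cyanobacterium_like"},
    reference_genes={"S10", "S14", "L30", "L15"},
)
print(unique, nonunique)
```

In this toy input, the losses of S10 and S14 are restricted to the two focal taxa, whereas the loss of L30 is shared with one outgroup and is therefore reported as nonunique.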

Figure 2.

Unique shared-derived characters of the ribosomal super operon that unite cyanobacteria and Chlamydiaceae. Two unique shared-derived characters on the ribosomal super operon (the loss of ribosomal proteins S10 and S14) unite the Chlamydiaceae and cyanobacteria to the exclusion of other bacteria with genomes that have been completely sequenced (black boxes; note that S10 and S14 are present elsewhere on the chromosome). Loss of L30 (dashes; note that L30 does not appear to be present elsewhere in these genomes, according to TBLASTN analysis) is not a unique shared-derived character to the exclusion of all other bacteria but offers further support for a relationship between the Chlamydiaceae and cyanobacteria. In addition, all 10 chloroplast genomes examined (Porphyra purpurea chloroplast is shown as a representative) and an unfinished cyanobacterial genome (Synechococcus spp.) also share the same characters (i.e., loss of S10, S14, and L30 from the super operon); however, the chloroplasts are missing additional genes from this region (i.e., L15 in the region shown) that have been primarily transferred to the plant nucleus. Boxes with strikethroughs mark genes that have relocated in Deinococcus and Aquifex to form a separate operon. Note that the genome annotation for Aquifex did not report L29; however, we did positively identify this gene in Aquifex using TBLASTN. Another unique character uniting Chlamydiaceae, cyanobacteria, and the chloroplast, which is not illustrated in this figure, is that S10 is found as part of the separate S7/S12 operon in only the Chlamydiaceae, cyanobacteria, and chloroplast sequences examined.

Genome Composition Analysis Suggests Chlamydiaceae Are Not Exchanging Genes with Their Hosts

In further support of the lack of horizontal gene transfer between Chlamydiaceae and their eukaryotic hosts, we also find that chlamydial genomes have been subjected to a low rate of recent DNA exchange with organisms of differing G+C ratios. The average G+C ratio for the genome of a particular microbial organism is often characteristic, with regions of DNA of unusual G+C ratios sometimes thought to reflect recent horizontal transfer of DNA from an organism with a differing G+C ratio. For Chlamydiaceae that are thought to infect only humans, the average G+C ratio of all genes or open reading frames (ORFs) from their genomes is 41% ± 2.5% (Table 2), whereas for humans the G+C ratio of their genes averages ∼52% ± 8% (Nakamura et al. 2000; note that other mammals have a mean G+C ratio for genes that is similar to humans). Chlamydiaceae have a notably lower variance in their G+C ratio for genes than is observed for any other microbe whose genome has been sequenced to date (Table 2). In contrast, other bacteria, such as Neisseria species that have been shown to undergo frequent horizontal gene transfer, exhibit a much higher variance in %G+C for genes in their genomes (standard deviation up to ± 7%; Table 2). Although analysis of variance in gene %G+C for genomes cannot reveal horizontal acquisition of genes of the same G+C ratio and other factors such as level of gene expression can affect G+C ratios for a given gene, this low variance for whole chlamydial genomes is consistent with the lack of horizontal gene transfer suggested from the unrelated analysis of gene conservation and gene synteny in complete Chlamydiaceae genomes (Read et al. 2000). The apparently clonal nature of Chlamydia (and apparent lack of horizontal gene transfer) may be due to their ecological isolation from other bacteria, as a result of their intracellular lifestyle (Read et al. 2000).
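As an illustration of the kind of summary reported in Table 2, the following minimal sketch computes the mean and standard deviation of %G+C over a set of coding sequences. It is a sketch under stated assumptions rather than the code used for the study: the function names are hypothetical, ambiguous bases are ignored, and the sample standard deviation is used because the text does not specify which estimator was applied.

```python
from statistics import mean, stdev
from typing import Iterable, Tuple


def gc_percent(sequence: str) -> float:
    """Percent G+C of one sequence, counting only unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


def gc_summary(coding_sequences: Iterable[str]) -> Tuple[float, float]:
    """Mean and sample standard deviation of %G+C across genes.

    Assumption: sample (n - 1) standard deviation; at least two genes required.
    """
    values = [gc_percent(seq) for seq in coding_sequences]
    return mean(values), stdev(values)


# Toy sequences for illustration only (not real genes).
toy_mean, toy_sd = gc_summary(["ATGC", "GGCC", "ATAT"])
print(f"mean %G+C = {toy_mean:.1f}, SD = {toy_sd:.1f}")
```

A genome whose genes cluster tightly around one %G+C value yields a small standard deviation, which is the property discussed above for chlamydial genomes.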

Table 2.

Percent G + C Mean and Standard Deviations Determined from All Predicted Protein Coding Regions for Complete Genomes of Pathogenic Bacteria (as of April 2001)

Expanding the Analysis to Other Bacteria: Many Bacteria–Eukaryotic Protein Similarities May Reflect Bacterial Origin of Mitochondria and the Chloroplast

To further evaluate the involvement of organellar proteins in cases where bacterial genes are most similar to eukaryotic genes, we conducted a comparison of 162,003 genes from 37 bacterial and eukaryotic genome sequences (http://www.pathogenomics.bc.ca/BAE-watch.html). Although computational identification of organelle targeting signals has limitations (Emanuelsson et al. 2000), we found that the majority of bacterial proteins that are most similar to eukaryotic proteins share similarity to proteins that are known, or are proposed by TargetP analysis, to function in mitochondria or chloroplast organelles (see http://www.pathogenomics.bc.ca/BAE-watch.html and the section entitled “Bacterial proteins most similar to eukaryotic proteins”). Although Chlamydia, Synechocystis, and Rickettsia contain a far greater proportion of eukaryote-like genes than all other bacterial genomes analyzed (Fig. 1; Supplementary Material is available online at http://www.genome.org), this shows that one must be careful when examining proteins that share unusually high similarity between bacteria and eukaryotes to consider the possibility that a gene has organellar ancestry. In essence, it would appear that the bacterial origin of mitochondria and the chloroplast, coupled with the apparent horizontal transfer of genes from the organellar genome to the nuclear genome of eukaryotes, must be considered a potential complicating factor of any analysis of bacterial–eukaryotic protein similarity.

Implications

Our analysis indicates that the plant-like genes in Chlamydiaceae are most similar to plant genes with protein products that function in the chloroplast. We propose that the high proportion of plant-like genes in Chlamydiaceae is not due to horizontal gene transfer with a plant or related organism, but rather is a reflection of an ancient, ancestral relationship between the Chlamydiaceae and the cyanobacterial ancestor of the chloroplast. Regardless of the degree of relatedness between Chlamydiaceae and cyanobacteria, analysis of both Chlamydiaceae and other bacteria indicates that organellar ancestry must be considered in any case where a eukaryotic gene shares higher-than-expected similarity to bacterial homologs. One may wonder why Chlamydiaceae and other bacteria contain genes that share notable sequence similarity with organellar genes when there are species such as Synechocystis and Rickettsia that share an even closer relationship with the ancestors of organelles. First, it must be emphasized that the number of such genes is far smaller than the number of organellar genes that share highest similarity to cyanobacterial or rickettsial genes (Fig. 1). This is particularly notable for nonchlamydial bacteria if a high step ratio filter is used (see Methods for step ratio description) because BLAST is known for ordering sequences poorly in its output (Koski and Golding 2001) and such filtering aids in the removal of such BLAST ordering artifacts. It is also becoming increasingly apparent that gene loss plays a significant role in bacterial genome evolution (Mira et al. 2001; Salzberg et al. 2001). From this study, and others (Salzberg et al. 2001), it is clear that many cases of unusual bacteria–eukaryotic gene similarities are most likely a reflection of gene loss in a related lineage, coupled with our currently small taxonomic sampling of data at the genomic level.
For example, Synechocystis may have lost a gene that is still present in Chlamydiaceae and the chloroplast, making the chlamydial gene appear most similar to the chloroplast counterpart in our analysis. Indeed, our analysis is currently only based on a single completed cyanobacterial genome, so it is quite possible that other cyanobacteria may still have orthologs of the gene (and when identified, this gene would be expected to be most similar to the chloroplast homolog). Consistent with this, most cases of plant-Chlamydiaceae gene similarity notably lack a Synechocystis homolog for comparison (or the homolog appears to be a paralog). These isolated cases (far fewer than the number of cases of Synechocystis genes resembling chloroplast genes) probably reflect gene loss in the Synechocystis lineage.

The apparent lack of horizontal gene transfer involving Chlamydia, both from their eukaryotic hosts (this paper) and from other bacterial genera (Read et al. 2000; this paper), suggests that Chlamydia may be a useful model for studies of gene evolutionary rates and for determining to what degree factors other than horizontal gene transfer can affect certain genomic properties. The observation of an evolutionary relationship between Chlamydia and cyanobacteria could have significance for Chlamydia research, as existing knowledge of cyanobacteria may stimulate new ways of thinking about the function and control of pathogenic Chlamydia.

METHODS

Protein/Gene Datasets and Phylogenetic Analysis

We analyzed complete published eukaryotic genomes (Homo sapiens, Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae) for genes most similar to bacteria and, conversely, complete published bacterial genomes for genes most similar to eukaryotes (all pathogens are listed in the Supplementary Table [available online at http://www.genome.org], as well as Synechocystis sp. PCC6803, Escherichia coli K12, Bacillus subtilis 168, Aquifex aeolicus VF5, Buchnera sp. APS, Bacillus halodurans, Lactococcus lactis ssp. lactis IL1403, and Thermotoga maritima MSB8). For the human proteins, the ENSEMBL March 2001 dataset freeze was used (originally called version 8.0). For the genomic character analyses of the ribosomal superoperon, additional analyses were performed on chloroplast genes from Porphyra, Cyanophora, Odontella, Plasmodium, Euglena, Marchantia, Rice, Tobacco, Chlorella, and Nephroselmis. (See Acknowledgments for links to associated genome sequence publications and genome centers.)

Phylogenetic analysis was performed using the neighbor-joining method of PHYLIP (http://evolution.genetics.washington.edu/phylip.html) for prealigned 16S rRNA genes from the Ribosomal Database Project II (http://rdp.cme.msu.edu/) for the following organisms: Pyrococcus furiosus (i.e., an archaeal sequence used to root the tree), Thermotoga maritima, Aquifex pyrophilus, Bacillus subtilis, Chlamydophila pneumoniae, Chlamydophila psittaci, Chlamydia muridarum, Chlamydia trachomatis, Synechococcus PCC6301, Synechocystis PCC6803, Microcystis viridis, Escherichia coli, Caulobacter crescentus, Rickettsia prowazekii, Zea mays (mitochondrial sequence), and chloroplast sequences from Chlamydomonas reinhardtii, Klebsormidium flaccidum, Zea mays, and Nicotiana tabacum.

Bacteria–Eukarya Protein Comparison Method

All complete bacterial and eukaryotic genomes mentioned above were compared using BLAST (Altschul et al. 1997) and MSPCRUNCH to a database of all proteins, including SWISS-PROT, TREMBL, and human proteins from the ENSEMBL March 2001 dataset. The results were placed in an ACEDB database (http://www.acedb.org) and related using TaxIDs to taxonomy information from the National Center for Biotechnology Information (NCBI) Taxonomy database. The resulting database was queried for those bacterial proteins more similar to eukaryotic proteins than to other bacterial proteins (and those eukaryotic proteins most similar to bacterial proteins). This approach capitalizes on the significant evolutionary distance between the three domains of life (bacteria, archaea, and eukarya) and the presence in genetic databases of a number of completely sequenced genomes from all three domains (this increases the significance of a protein from one domain being more similar to a protein from another domain). A step ratio scoring system (see below) was developed to further filter the results and identify proteins that are substantially more similar to a protein from another domain of life than to proteins from the same domain. This scoring system is necessary to filter from the analysis any proteins that are highly conserved in all organisms and that BLAST scoring alone may identify as most similar to another domain's protein by chance. Previous analyses of proteins with highest similarity to proteins from other domains of life have suffered from failing to use a sufficiently stringent scoring system or from not ensuring that their scoring system is flexible enough to handle varying rates of gene evolution. This scoring system has normalized, flexible cutoffs. The database front end also facilitates filtering of various taxonomic groups of organisms from the analysis to identify, for example, bacterial genes conserved in a genus or family that share significant similarity to eukaryotic genes.
Proteins that are annotated by SWISS-PROT as being encoded in an organelle, or containing an organelle transit peptide according to TargetP (Emanuelsson et al. 2000), are specifically highlighted in the database because the ancestor of mitochondria and the chloroplast is known to be bacterial; so organellar genes, or organellar genes that have moved to the nucleus, tend to be most similar to bacterial genes (Andersson et al. 1998; Reumann and Keegstra 1999; Rujan and Martin 2001). A publicly available version of our analysis that has been expanded to analyze all bacterial genomes and to make all cross-domain comparisons between bacteria, archaea, and eukarya is available at www.pathogenomics.bc.ca/BAE-watch.html. Note that there are obvious limitations to this analysis: It only detects primary sequence similarities detectable by BLAST, it is not useful for identification of proteins highly conserved between all domains of life, its effectiveness is limited by the number of known genes in databases (although this will improve over time), and it is limited by the accuracy of organellar transit peptide prediction algorithms.

Score Calculation for the Step Ratio Used to Calculate the Significance of a Match

The following is performed for each case of cross-domain similarity detected (for example, a query bacterial protein is found by BLAST to have highest similarity to a eukaryotic protein). First, a given query protein (in the example, the bacterial protein) is compared to itself using BLAST to generate a “self-blast” bit score for its alignment to itself. This value is used to normalize all bit scores in the BLAST output (i.e., each bit score in the BLAST output is divided by this self-blast bit score). The difference between each pair of consecutive normalized bit scores is calculated going down the list of hits, and the maximum of these differences (the most significant “step” down in the BLAST scores) is identified for all hits until a hit is observed to a protein belonging to the same domain as the query protein (for example, bacterial). The ratio of this maximum difference over the max ratio is the step ratio (the max ratio is the normalized bit score for the alignment of the query protein [i.e., bacterial protein] with its top hit [i.e., eukaryotic protein]). A high step ratio score therefore reflects a substantial drop in bit score between the top-hit (i.e., eukaryote) sequence and the first same-domain (i.e., bacterial) sequence in the BLAST output list. A high step ratio score cutoff therefore selects against proteins that are highly conserved in all organisms (a highly conserved protein would not have much of a drop in bit score between a top-hit protein and other proteins in the BLAST output). This facilitates the removal of proteins that BLAST records as being most similar to a protein of another domain but that are essentially artifacts of the inability of BLAST to order similarly related sequences in their correct order (Koski and Golding 2001). We have found that a step ratio score cutoff of 10 removes the majority of such undesirable highly conserved proteins from the analysis.
However, this value may be adjusted by the user and often a higher value is required to reduce false-positives.
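The calculation described above can be sketched in a few lines of code. This is an illustrative reading of the description, not the BAE-watch implementation: the function name `step_ratio`, the `(bit score, domain)` hit representation, and the handling of edge cases (no hits, or a top hit from the query's own domain) are assumptions. The sketch returns the ratio as a fraction between 0 and 1; how this maps onto the cutoff of 10 quoted above (for example, as a percentage) is not stated in the text and is left open here.

```python
from typing import List, Optional, Tuple

Hit = Tuple[float, str]  # (BLAST bit score, domain of the matched protein)


def step_ratio(self_score: float, hits: List[Hit], query_domain: str) -> Optional[float]:
    """Step ratio for one query protein, following the description in the text.

    self_score   -- bit score of the query aligned to itself
    hits         -- BLAST hits (excluding the self hit) as (bit score, domain)
    query_domain -- domain of the query, e.g. "bacteria"

    Returns None when there is no cross-domain top hit (assumed convention).
    """
    if self_score <= 0 or not hits:
        return None
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    if ranked[0][1] == query_domain:
        return None  # top hit is from the same domain: not a cross-domain case
    normalized = [score / self_score for score, _ in ranked]
    max_step = 0.0
    for i in range(1, len(ranked)):
        # Drop in normalized bit score between consecutive hits.
        max_step = max(max_step, normalized[i - 1] - normalized[i])
        if ranked[i][1] == query_domain:
            break  # stop once the first same-domain hit has been reached
    # Divide by the "max ratio": the normalized score of the top hit.
    return max_step / normalized[0]


# Hypothetical bit scores for a bacterial query (illustration only).
example = step_ratio(
    200.0,
    [(150.0, "eukarya"), (140.0, "eukarya"), (60.0, "bacteria"), (10.0, "bacteria")],
    "bacteria",
)
conserved = step_ratio(200.0, [(190.0, "eukarya"), (188.0, "bacteria")], "bacteria")
print(example, conserved)
```

In the first hypothetical case the largest step is the drop from the second eukaryotic hit to the first bacterial hit, giving a comparatively high ratio; in the second, a highly conserved protein shows almost no drop and so receives a ratio near zero.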

WEB SITE REFERENCES

http://evolution.genetics.washington.edu/phylip.html; PHYLIP home page.

http://HypothesisCreator.net/iPSORT/; iPSORT.

http://rdp.cme.msu.edu/; Ribosomal Database Project II.

http://www.acedb.org; ACEDB genome database system.

http://www.jgi.doe.gov/JGI_microbial/html; DOE Joint Genome Institute Microbial Genomics.

http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/euk_o.html; NCBI's list of organelle sequences.

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/linksOrg.html; NCBI's list of genome centers.

http://www.pathogenomics.bc.ca/BAE-watch.html; BAE-watch database.

http://www.pathogenomics.bc.ca; BC Pathogenomics Project web site.

http://www.pathogenomics.bc.ca/IslandPath.html; IslandPath.

http://www.tigr.org/tdb/mdb/mdbcomplete.html; TIGR Microbial Database.

Acknowledgments

We thank all Pathogenomics Project members (www.pathogenomics.bc.ca/people.html) for comments and suggestions, Olof Emanuelsson (Stockholm) for assistance with large-scale use of TargetP before software licensing was available, and the many genome centers that published sequence data required for this analysis (see http://www.tigr.org/tdb/mdb/mdbcomplete.html, http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/linksOrg.html, and http://www.ncbi.nlm.nih.gov:80/PMGifs/Genomes/euk_o.html). This work was funded by the Peter Wall Institute for Advanced Studies. J.L.B.'s research was supported in part by the Promega Postdoctoral Fellowship program under the guidance of Michael Slater. Bioinformatics applications mentioned in this paper can be accessed through the Pathogenomics Project Web site at http://www.pathogenomics.bc.ca.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

  • 12 Corresponding author.

  • 13 Present address: Department of Molecular Biology and Biochemistry, Simon Fraser University, 8888 University Drive, Burnaby, British Columbia, Canada, V5A 1S6.

  • E-MAIL brinkman{at}sfu.ca; FAX (604) 291-5583.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.341802. Article published online before print in July 2002.

    • Received April 9, 2002.
    • Accepted May 23, 2002.
