Published online before print June 2, 2008, doi: 10.1101/gr.075556.107
Genome Res. 18:1163-1170, 2008
©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08 $5.00
Open Access Article
Methods
Identification and analysis of ancestral hominoid transcriptome inferred from cross-species transcript and processed pseudogene comparisons
1 Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan; 2 Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-yi County 600, Taiwan; 3 Division of Biostatistics and Bioinformatics, National Health Research Institutes, Miaoli County 350, Taiwan; 4 Institute of Bioinformatics, National Chiao-Tung University, Hsinchu City 300, Taiwan
Comparative transcriptomics studies in hominoids are difficult because of the lack of EST information in the great apes. Nevertheless, processed pseudogenes (PPGs), which are reverse-transcribed ancient transcripts present in the current genome, can be regarded as a virtual transcript resource that may compensate for the paucity of ESTs in non-human hominoids. Here we show that chimpanzee PPGs can be applied to the identification of novel human exons/alternatively spliced variants (ASVs) and to the inference of the ancestral hominoid transcriptome and chimpanzee exon loss events. We develop a method for comparatively extracting novel transcripts from PPGs (designated "CENTP") and identify 643 novel human exons/ASVs. RT-PCR-sequencing experiments confirmed >50% of the tested exons/ASVs, supporting the effectiveness of the CENTP pipeline. With reference to the ancestral transcriptome inferred by CENTP, 47 chimpanzee exon loss events are identified. Furthermore, by combining out-group and PPG information, we identify 20 chimpanzee-specific exon loss and 10 human-specific exon gain events. We also demonstrate that the ancestral transcriptome and exon loss/gain events inferred based on comparisons of current transcripts may be incomplete (or occasionally inappropriate) because ancestral transcripts may not be represented in the ESTs of existing species. Finally, functional analysis reveals that the novel exons identified based on chimpanzee transcripts are significantly enriched in genes related to translation regulatory activity and viral life cycle, suggesting different expression levels of the associated transcripts, and thus divergent splicing isoform composition between human and chimpanzee in these functional categories.
The complexity of a transcriptome is directly related to the proteome size and functional versatility of organisms (Graveley 2001
For genetically close but phenotypically divergent species, such as human and the common chimpanzee (Pan troglodytes), transcriptome evolution is considered relevant to interspecies functional divergence. However, comparative studies of transcriptomes in hominoids have been hampered by the paucity of expressed sequence tag (EST) information and experimentally validated transcripts in the great apes. Recently, Shemesh et al. (2006)
Meanwhile, it has been demonstrated that cross-species EST-to-genome comparisons are suitable for identification of uncharacterized exons/alternatively spliced variants (ASVs) (Chuang et al. 2004
In this study, we develop a method for comparatively extracting novel transcripts from PPGs (designated "CENTP"). By cross-species PPG-to-genome mapping, we can not only detect unannotated human exons/ASVs, but also infer the transcriptome in the Homo–Pan common ancestor. With reference to the ancestral transcriptome, we can identify chimpanzee exon loss events without having to reference out-group information. In addition, we demonstrate that inference of exon loss events based on comparisons with out-group sequences may be inappropriate if PPGs are not considered. Finally, we functionally analyze the ASVs that are lost in chimpanzee, and briefly discuss the possible impacts of these events on Homo–Pan functional divergence.
More than 600 novel human exons/ASVs are identified by CENTP
Table 1 lists the 643 CENTP-identified novel human exons (named "CENTP exons") that are absent in current annotation databases or EST libraries (see Fig. 1 and Methods). These novel exons also represent novel human ASVs because no transcripts that include the CENTP exons have been characterized. For simplicity, we refer to CENTP exons identified based on human PPGs, chimpanzee PPGs, and chimpanzee transcripts (collectively called "CENTP cDNAs") as CENTPH_PPG, CENTPC_PPG, and CENTPC_gene exons, respectively. As expected, the number of CENTPH_PPG exons (121) is much larger than that of CENTPC_PPG exons (29). This is understandable because the number of the extracted human PPGs is larger than that of chimpanzee PPGs, and human PPGs may have preserved more human expression information. In addition, CENTP identifies a much larger number of human PPG-based novel exons than a recent study that applied human PPGs to ASV detection (Shemesh et al. 2006
Notably, a large number of potentially novel human exons (469) are inferred from chimpanzee transcripts, lending solid support for the power of novel exon/ASV detection based on cross-species EST-to-genome comparisons. Previously, we have demonstrated that the power of such a comparative approach is negatively related to the interspecies divergence level (Chen et al. 2006
In terms of ASV types, CENTP identifies a total of 434 cassette-on exons (Fig. 2A) and 209 retained introns (Fig. 2B). Note that two types of cassette-on exons are identified here: simple and complex (see Fig. 2A and Methods). Of these exons, 387 are located in coding sequences (CDSs). The remaining 256 are located in untranslated regions (UTRs). It is worth noting that the majority of CENTP cassette-on exons are located in CDSs rather than in UTRs. This is because UTRs in most cases are the initial/terminal exons in the transcripts in which they reside, and such exons cannot pass the CENTP filters (for accuracy, CENTP only identifies novel exons located between two well-known exons; see Methods). On the other hand, more CENTP retained introns are located in UTRs than in CDSs, which is consistent with the report of Galante et al. (Galante et al. 2004
To validate the CENTP-identified exons/ASVs, three subsets from CENTPC_PPG, CENTPH_PPG, and CENTPC_gene (21, 29, and 28 events, respectively) are selected for RT-PCR-sequencing verification (Supplemental Table 1). More than 50% (40/78) of the tested exons are experimentally confirmed. These include simple/complex cassette-on exons and retained introns. The RT-PCR results of the confirmed exons are given in Supplemental Figures 1–3. Our results indicate that a considerable proportion of the ASVs identified based on human/chimpanzee PPGs or chimpanzee transcripts are still active in the human transcriptome.
Ancestral hominoid transcriptomes and chimpanzee exon loss events inferred from PPGs
We then examine these 27 exons using the CENTP pipeline to determine whether these exons are really lost or simply unannotated in chimpanzee. We align the chimpanzee PPGs that include these exons against the introns of their parent genes and examine the matches using the CENTP exon-checking rules stated in Methods. If novel exons are identified, they are considered currently unannotated chimpanzee exons that are included in the active transcripts of both human and chimpanzee. Otherwise, exon loss events are thought to have occurred in the chimpanzee lineage. Two types of exon loss events are expected: exon deletion and pseudogenization. In the former case, the PPG-derived exons will be non-alignable against the introns of their parent genes. In the latter case, the exons should be conserved in the introns of the parent genes but have incurred frameshift or nonsense mutations, or loss of splicing signals. In fact (also see Table 2), both cases are observed for Set 1 and Set 6 exons. We therefore identify 16 chimpanzee exon loss events, five potentially novel chimpanzee exons (newly annotated by CENTP), and six exons whose status remains uncertain for lack of information. For the chimpanzee exon loss events, we further estimate the time of pseudogenization by calculating the genetic distances between PPGs that support these 16 exons and their parent genes. All except one of the PPG–parent gene pairs have distances much larger than 2.6% (Supplemental Table 2), which is the largest background human–chimpanzee sequence divergence (in 1-Mb windows across the autosomes; ranging from 0.4 to 2.6%) (Chimpanzee Sequencing and Analysis Consortium 2005
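The comparison against the 2.6% bound can be illustrated with a minimal sketch. It assumes a pre-computed pairwise alignment of a PPG and its parent gene and uses an uncorrected p-distance; both the distance measure and the function names are illustrative assumptions, since the text here does not state how the genetic distances were computed.

```python
# Largest background human-chimpanzee divergence quoted in the text (2.6%).
MAX_BACKGROUND_DIVERGENCE = 0.026


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected proportion of mismatches between two aligned sequences.

    Columns containing a gap ('-') or an ambiguous base ('N') are skipped.
    This is an assumed, simplified measure, not necessarily the one used
    in the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return mismatches / compared


def exceeds_background_divergence(distance: float) -> bool:
    """True if a PPG-parent gene distance is above the 2.6% bound."""
    return distance > MAX_BACKGROUND_DIVERGENCE
```

A pair whose distance exceeds the bound is more diverged than any 1-Mb autosomal window in the human–chimpanzee comparison; a model-corrected distance would be preferable for real data.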
To investigate whether these exon loss events are actually specific to chimpanzee, we retrieved the macaque/mouse orthologous genes and examined whether these 16 exons were present in these genes. In fact, 13 exons are well-annotated or can be identified by CENTP in the macaque or mouse genome (Supplemental Table 2), implying that they represent chimpanzee-specific exon loss events. Although the lineage-specificity of the other three exons remains unclear, current PPG evidence appears to suggest that they represent chimpanzee exon loss events (rather than human gain of exons). This example demonstrates the usability of PPGs as an indicator to distinguish between exon losses and gains when out-group information is unavailable.
Meanwhile, Set 3 exons (88 exons) are observed in neither Ensembl-annotated chimpanzee transcripts nor chimpanzee PPGs (Fig. 3). Again, these exons may be either lost or not yet annotated in chimpanzee. Among the 88 exons, 22 have no chimpanzee orthologs, and four are too short (<12 bp) for BLAST alignments. We examined the remaining 62 exons using the CENTP pipeline and identified nine potentially novel chimpanzee exons. With reference to the macaque/mouse orthologous genes, seven of the remaining 53 exons are found to result from chimpanzee-specific exon loss events (Table 2). Meanwhile, the other 46 exons, which are not found in the macaque/mouse genomes, may represent human exon gain events. However, considering the incompleteness of the macaque genomic sequences and annotations, more evidence is required before any conclusions can be drawn. We therefore calculated the genetic distances between PPGs that support the 46 exons and their parent genes. We find that 12 PPG–parent gene pairs (covering 15 exons) have distances <2.6% (Supplemental Table 3). Ten out of the 12 gene pairs (covering 10 exons) have distances even
Overall, 69 potential absent-in-chimpanzee exons are identified from Sets 1, 3, and 6, of which 20 and 10 represent possible chimpanzee-specific exon loss and human-specific exon gain events, respectively. Note that 47 (16 plus 31) chimpanzee exon loss events are identified without referring to out-group information.
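The totals quoted above can be recomputed from their components, as in the short sketch below. All numbers are taken from the text; the function and key names are illustrative, and reading the "31" as the 46 exons minus the 15 exons covered by low-distance PPG pairs is an assumption that happens to match the arithmetic.

```python
def tally_exon_events() -> dict:
    """Recompute the exon counts quoted in the text from their components."""
    set3_total = 88
    examined = set3_total - 22 - 4      # minus exons without orthologs and exons <12 bp
    absent_set3 = examined - 9          # minus potentially novel chimpanzee exons
    not_in_outgroup = absent_set3 - 7   # minus chimpanzee-specific losses in Set 3
    losses_sets_1_6 = 16                # exon loss events from Sets 1 and 6
    return {
        "examined": examined,
        "absent_in_chimpanzee": losses_sets_1_6 + absent_set3,
        "chimp_specific_loss": 13 + 7,  # Sets 1/6 plus Set 3
        # Assumption: 31 = 46 exons minus the 15 covered by low-distance pairs.
        "loss_without_outgroup": losses_sets_1_6 + (not_in_outgroup - 15),
    }
```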
The implications of PPGs in comparative and evolutionary studies
In another case (see Fig. 5) when both the active transcript and PPG of ASV 1 are absent in chimpanzee, one may consider ASV 2 as the Homo–Pan ancestral form. In this case, out-group species (e.g., macaque) information may be used to distinguish between chimpanzee exon loss and human exon gain events. If the active transcript of ASV 1 is observed in macaque, most likely an exon loss event has occurred in chimpanzee and ASV 1 was present in the human–chimpanzee–macaque common ancestor. On the other hand, if ASV 2 rather than ASV 1 is active in macaque, one may speculate that ASV 2 was the ancestral form of these three primates, and an exon gain event has occurred in human. Nevertheless, if a macaque ASV 1 PPG is found, a second scenario is also likely, that an exon loss event has occurred in both chimpanzee and macaque, and ASV 1 represents the human–chimpanzee–macaque ancestral form.
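The reasoning in this case can be written as a small decision function. This is only a sketch of the logic described above for the situation where neither the active transcript nor the PPG of ASV 1 is found in chimpanzee; the function name, argument names, and returned labels are illustrative assumptions.

```python
def infer_event(macaque_asv1_active: bool, macaque_asv1_ppg: bool) -> str:
    """Interpret macaque evidence when ASV 1 is absent from chimpanzee.

    macaque_asv1_active: an active ASV 1 transcript is observed in macaque.
    macaque_asv1_ppg:    a macaque PPG derived from ASV 1 is found.
    """
    if macaque_asv1_active:
        # ASV 1 was present in the common ancestor of the three primates.
        return "chimpanzee exon loss"
    if macaque_asv1_ppg:
        # The PPG shows ASV 1 was once transcribed in the macaque lineage,
        # so a loss in both chimpanzee and macaque cannot be excluded.
        return "ambiguous: human exon gain, or exon loss in chimpanzee and macaque"
    # Only ASV 2 is supported in macaque.
    return "human exon gain"
```

The middle branch is the point of the example: without PPG evidence, the last two cases would be indistinguishable and both would be called a human exon gain.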
In sum, these examples illustrate that PPGs may significantly affect our inference of the ancestral state of the transcriptome. PPGs are therefore a valuable resource for evolutionary transcriptomics studies.
Functional influences of transcriptome evolution
The CENTP pipeline
CENTP makes use of the "CENTP cDNAs," including chimpanzee PPGs, human PPGs, and chimpanzee genes, to identify potentially novel human exonic sequences. As shown in Figure 1, we first retrieved 6932 chimpanzee PPGs (5904 from Yale [Zhang et al. 2003
Subsequently, four exon-checking filters were used to eliminate potential false positives (Fig. 1). The meta-CENTP exons that passed all four rules were regarded as novel human exons (termed "CENTP exons"): Rule 1, both flanking exonic regions of each identified exon must overlap with a well-annotated human transcript to avoid accidental matches; Rule 2, the identified cassette-on exons must be flanked by legal splicing sites (i.e., GT-AG/GC-AG); Rule 3, the identified exons located in CDSs must not disrupt the reading frame or contain any premature stop codons; Rule 4, the meta-CENTP exons that overlapped with human ESTs (GenBank UniGene) were discarded to ensure the novelty of the CENTP exons. Through these filtering processes, the majority of meta-CENTP exons were removed (from >4 million to 643 exons). Some of the CENTP exons were examined for validity using RT-PCR-sequencing (see Supplemental material for details).
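Rules 2 and 3 lend themselves to a short illustration. The sketch below is not the CENTP implementation: it assumes the candidate exon is supplied as a plus-strand DNA string together with the two intronic bases on each side, reads Rule 2 as "upstream intron ends in AG, downstream intron begins with GT or GC", and reads Rule 3 as "exon length is a multiple of three and no in-frame stop codon lies within the exon". The function names and the phase convention are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_legal_splice_sites(upstream2: str, downstream2: str) -> bool:
    """Rule 2 (as interpreted here).

    upstream2:   last two bases of the intron preceding the exon (acceptor).
    downstream2: first two bases of the intron following the exon (donor).
    """
    return upstream2.upper() == "AG" and downstream2.upper() in {"GT", "GC"}


def preserves_reading_frame(exon: str, phase: int = 0) -> bool:
    """Rule 3 (as interpreted here) for an exon inserted into a CDS.

    phase: number of leading exon bases that complete a codon started in
    the upstream exon (0, 1, or 2). Codons that span the exon boundaries
    are not checked in this simplified version.
    """
    exon = exon.upper()
    if len(exon) % 3 != 0:
        return False  # inserting the exon would shift the downstream frame
    for i in range(phase, len(exon) - 2, 3):
        if exon[i:i + 3] in STOP_CODONS:
            return False  # premature stop codon inside the exon
    return True
```

For example, `preserves_reading_frame("ATAGAAACT")` passes in phase 0 (codons ATA, GAA, ACT) but fails in phase 1, where the first complete codon is TAG.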
In addition, CENTP can identify simple and complex cassette-on exons (Fig. 2). These two exon types are defined in the European Bioinformatics Institute Alternative Splicing Database (EBI-ASD) (Stamm et al. 2006
Computation of substitution rates
Data retrieval and availability
We thank Wen-Hsiung Li for experimental assistance. This work was supported by the Genomics Research Center, Academia Sinica, Taiwan (T.J.C.); the National Health Research Institutes (NHRI), Taiwan (under contract NHRI-EX97-9408PC) (T.J.C.); the National Science Council, Taiwan (under contract NSC 96-2628-B-001-005-MY3) (T.J.C.); and NHRI intramural funding (F.C.C.).
5 Corresponding authors. E-mail trees{at}gate.sinica.edu.tw; fax 886-2-27898757.
E-mail fcchen{at}nhri.org.tw; fax 886-37-586467. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.075556.107.
Akey, J.M., Eberle, M.A., Rieder, M.J., Carlson, C.S., Shriver, M.D., Nickerson, D.A., and Kruglyak, L. 2004. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2: e286. doi: 10.1371/journal.pbio.0020286.
Black, D.L. 2003. Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72: 291–336.
Black, D.L. and Grabowski, P.J. 2003. Alternative pre-mRNA splicing and neuronal function. Prog. Mol. Subcell. Biol. 31: 187–216.
Boue, S., Letunic, I., and Bork, P. 2003. Alternative splicing and evolution. BioEssays 25: 1031–1034.
Bracco, L. and Kearsey, J. 2003. The relevance of alternative RNA splicing to pharmacogenomics. Trends Biotechnol. 21: 346–353.
Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S., Reich, J., and Bork, P. 2000. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 474: 83–86.
Buratti, E., Baralle, M., and Baralle, F.E. 2006. Defective splicing, disease and therapy: Searching for master checkpoints in exon definition. Nucleic Acids Res. 34: 3494–3510.
Caceres, J.F. and Kornblihtt, A.R. 2002. Alternative splicing: Multiple control mechanisms and involvement in human disease. Trends Genet. 18: 186–193.
Carlton, M.B., Colledge, W.H., and Evans, M.J. 1995. Generation of a pseudogene during retroviral infection. Mamm. Genome 6: 90–95.
Chen, F.C. and Chuang, T.J. 2005. ESTviewer: A web interface for visualizing mouse, rat, cattle, pig and chicken conserved ESTs in human genes and human alternatively spliced variants. Bioinformatics 21: 2510–2513.
Chen, F.C. and Chuang, T.J. 2007. Different alternative splicing patterns are subject to opposite selection pressure for protein reading frame preservation. BMC Evol. Biol. 7: 179. doi: 10.1186/1471-2148-7-179.
Chen, F.C., Chen, C.J., Ho, J.Y., and Chuang, T.J. 2006. Identification and evolutionary analysis of novel exons and alternative splicing events using cross-species EST-to-genome comparisons in human, mouse and rat. BMC Bioinformatics 7: 136. doi: 10.1186/1471-2105-7-136.
Chen, F.C., Chaw, S.M., Tzeng, Y.H., Wang, S.S., and Chuang, T.J. 2007a. Opposite evolutionary effects between different alternative splicing patterns. Mol. Biol. Evol. 24: 1443–1446.
Chen, F.C., Chen, C.J., and Chuang, T.J. 2007b. INDELSCAN: A web server for comparative identification of species-specific and non-species-specific insertion/deletion events. Nucleic Acids Res. 35: W633–W638.
Chen, F.C., Chen, C.J., Li, W.H., and Chuang, T.J. 2007c. Human-specific insertions and deletions inferred from mammalian genome sequences. Genome Res. 17: 16–22.
Chen, F.C., Wang, S.S., Chaw, S.M., Huang, Y.T., and Chuang, T.J. 2007d. Plant gene and alternatively spliced variant annotator. A plant genome annotation pipeline for rice gene and alternatively spliced variant identification with cross-species expressed sequence tag conservation from seven plant species. Plant Physiol. 143: 1086–1095.
Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69–87.
Chuang, T.J., Chen, F.C., and Chou, M.Y. 2004. A comparative method for identification of gene structures and alternatively spliced variants. Bioinformatics 20: 3064–3079.
Cooper, T.A. and Mattox, W. 1997. The regulation of splice-site selection, and its role in human disease. Am. J. Hum. Genet. 61: 259–266.
Duma, D., Jewell, C.M., and Cidlowski, J.A. 2006. Multiple glucocorticoid receptor isoforms and mechanisms of post-translational modification. J. Steroid Biochem. Mol. Biol. 102: 11–21.
Esnault, C., Maestre, J., and Heidmann, T. 2000. Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24: 363–367.
Faustino, N.A. and Cooper, T.A. 2003. Pre-mRNA splicing and human disease. Genes & Dev. 17: 419–437.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967–974.
Galante, P.A., Sakabe, N.J., Kirschbaum-Slager, N., and de Souza, S.J. 2004. Detection and evaluation of intron retention events in the human transcriptome. RNA 10: 757–765.
Garcia-Blanco, M.A., Baraniak, A.P., and Lasda, E.L. 2004. Alternative splicing in disease and therapy. Nat. Biotechnol. 22: 535–546.
Gene Ontology Consortium. 2001. Creating the gene ontology resource: Design and implementation. Genome Res. 11: 1425–1433.
Goncalves, I., Duret, L., and Mouchiroud, D. 2000. Nature and structure of human genes that generate retropseudogenes. Genome Res. 10: 672–678.
Goodman, S.J., Branda, C.S., Robinson, M.K., Burdine, R.D., and Stern, M.J. 2003. Alternative splicing affecting a novel domain in the C. elegans EGL-15 FGF receptor confers functional specificity. Development 130: 3757–3766.
Graur, D. and Li, W.-H. 2000. Fundamentals of molecular evolution, 2nd ed. Sinauer Associates, Sunderland, MA.
Graveley, B.R. 2001. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 17: 100–107.
Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., and Shoemaker, D.D. 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302: 2141–2144.
Kan, Z., Rouchka, E.C., Gish, W.R., and States, D.J. 2001. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 11: 889–900.
Kan, Z., Castle, J., Johnson, J.M., and Tsinoremas, N.F. 2004. Detection of novel splice forms in human and mouse using cross-species approach. Pac. Symp. Biocomput. 2004: 42–53.
Karro, J.E., Yan, Y., Zheng, D., Zhang, Z., Carriero, N., Cayting, P., Harrison, P., and Gerstein, M. 2007. Pseudogene.org: A comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res. 35: D55–D60.
Lee, H.K., Kwak, H.Y., Hur, J., Kim, I.A., Yang, J.S., Park, M.W., Yu, J., and Jeong, S. 2007. Beta-catenin regulates multiple steps of RNA metabolism as revealed by the RNA aptamer in colon cancer cells. Cancer Res. 67: 9315–9321.
Li, X. and Manley, J.L. 2006. Alternative splicing and control of apoptotic DNA fragmentation. Cell Cycle 5: 1286–1288.
Lim, C.P. and Cao, X. 2006. Structure, function, and regulation of STAT proteins. Mol. Biosyst. 2: 536–550.
Maniatis, T. and Tasic, B. 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418: 236–243.
Mighell, A.J., Smith, N.R., Robinson, P.A., and Markham, A.F. 2000. Vertebrate pseudogenes. FEBS Lett. 468: 109–114.
Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 29: 2850–2859.
Musunuru, K. 2003. Cell-specific RNA-binding proteins in human disease. Trends Cardiovasc. Med. 13: 188–195.
Nekrutenko, A., Makova, K.D., and Li, W.H. 2002. The KA/KS ratio test for assessing the protein-coding potential of genomic regions: An empirical and simulation study. Genome Res. 12: 198–202.
Scotlandi, K., Zuntini, M., Manara, M.C., Sciandra, M., Rocchi, A., Benini, S., Nicoletti, G., Bernard, G., Nanni, P., Lollini, P.L., et al. 2007. CD99 isoforms dictate opposite functions in tumour malignancy and metastases by activating or repressing c-Src kinase activity. Oncogene 26: 6604–6618.
Shemesh, R., Novik, A., Edelheit, S., and Sorek, R. 2006. Genomic fossils as a snapshot of the human transcriptome. Proc. Natl. Acad. Sci. 103: 1364–1369.
Stamm, S., Riethoven, J.J., Le Texier, V., Gopalakrishnan, C., Kumanduri, V., Tang, Y., Barbosa-Morais, N.L., and Thanaraj, T.A. 2006. ASD: A bioinformatics resource on alternative splicing. Nucleic Acids Res. 34: D46–D55.
Vanin, E.F. 1985. Processed pseudogenes: Characteristics and evolution. Annu. Rev. Genet. 19: 253–272.
Venables, J.P. 2004. Aberrant and alternative splicing in cancer. Cancer Res. 64: 7647–7654.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.
Yang, Z. and Nielsen, R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17: 32–43.
Zhang, Z., Harrison, P.M., Liu, Y., and Gerstein, M. 2003. Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13: 2541–2558.
Zhang, Z., Carriero, N., Zheng, D., Karro, J., Harrison, P.M., and Gerstein, M. 2006. PseudoPipe: An automated pseudogene identification pipeline. Bioinformatics 22: 1437–1439.
Received December 12, 2007; accepted in revised form March 20, 2008.