Identification of Putative Noncoding RNAs Among the RIKEN Mouse Full-Length cDNA Collection

  1. Koji Numata1,2,
  2. Akio Kanai2,
  3. Rintaro Saito2,4,
  4. Shinji Kondo4,
  5. Jun Adachi4,
  6. Laurens G. Wilming6,
  7. David A. Hume7,
  8. RIKEN GER Group4,
  9. GSL Members5,8,
  10. Yoshihide Hayashizaki4,5, and
  11. Masaru Tomita2,3,9
  1. 1Graduate School of Media and Governance, Bioinformatics Program, Keio University, Fujisawa, Kanagawa 252-8520, Japan
  2. 2Institute for Advanced Biosciences, Keio University, Fujisawa, Kanagawa 252-8520, Japan
  3. 3Department of Environmental Information, Keio University, Fujisawa, Kanagawa 252-8520, Japan
  4. 4Laboratory for Genome Exploration Research Group, RIKEN Genomic Science Center (GSC), RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
  5. 5Genome Science Laboratory, RIKEN, Hirosawa, Wako, Saitama 351-0198, Japan
  6. 6Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK
  7. 7Institute for Molecular Bioscience and School of Molecular and Microbial Sciences, University of Queensland, St Lucia, Brisbane, QLD, 4072, Australia

Abstract

With the sequencing and annotation of genomes and transcriptomes of several eukaryotes, the importance of noncoding RNA (ncRNA)—RNA molecules that are not translated to protein products—has become more evident. A subclass of ncRNA transcripts are encoded by highly regulated, multi-exon, transcriptional units, are processed like typical protein-coding mRNAs and are increasingly implicated in regulation of many cellular functions in eukaryotes. This study describes the identification of candidate functional ncRNAs from among the RIKEN mouse full-length cDNA collection, which contains 60,770 sequences, by using a systematic computational filtering approach. We initially searched for previously reported ncRNAs and found nine murine ncRNAs and homologs of several previously described nonmouse ncRNAs. Through our computational approach to filter artifact-free clones that lack protein coding potential, we extracted 4280 transcripts as the largest-candidate set. Many clones in the set had EST hits, potential CpG islands surrounding the transcription start sites, and homologies with the human genome. This implies that many candidates are indeed transcribed in a regulated manner. Our results demonstrate that ncRNAs are a major functional subclass of processed transcripts in mammals.

Noncoding RNA (ncRNA) is a global term for transcripts that lack an apparent open reading frame (ORF) and do not encode a protein product. Until recently, the only known functional ncRNAs were ribosomal RNA, transfer RNA, and several small nucleolar RNAs. Other classes of small ncRNAs—such as microRNAs (miRNAs), C/D box snoRNAs, small interfering RNAs (siRNAs), and small temporal RNAs (stRNAs)—have been identified and characterized in all three domains of life (bacteria, archaea, and eukarya) based on experimental expression analysis and computational screening (Sharp 2001; Eddy 2002; Grosshans and Slack 2002; Tang et al. 2002; Wassarman 2002). Lau et al. (2001) had reported that abundantly expressed 21- to 24-nt-long miRNAs are found in Caenorhabditis elegans, and the pattern of their expression varies with developmental stages (Lau et al. 2001). These classes of small ncRNAs are believed to contribute to processes such as transcriptional regulation, translational repression, and mRNA degradation (Storz 2002).

Longer ncRNAs, sometimes referred to as mRNA-like ncRNAs, form a quite distinct class. Unlike the classes noted above, they are processed like mRNA; that is, they are transcribed by RNA polymerase II, spliced, polyadenylated, and conceivably capped (Erdmann et al. 2000, 2001). For example, Xist, which acts as an X-chromosome inactivator to achieve dosage compensation in mammalian females, encodes a 17-kb-long RNA molecule with no significant ORFs, even though it is comprised of seven exons and polyadenylated (Hong et al. 1999; Nesterova et al. 2001). Such mRNA-like processed ncRNAs have been identified in plants and animals and are expressed in a tissue-specific manner (Erdmann et al. 2001). The diversity of such transcripts in the mammalian transcriptome has not been evident from genome sequencing, because exon boundaries are difficult to define, and ncRNAs tend to be less conserved between mammalian species. In this situation, the RIKEN mouse full-length cDNA collection, called FANTOM2 clone set (Okazaki et al. 2002), provides the largest available resource in any mammal for the discovery of candidate functional ncRNAs. The set contains many sequences that do not show apparent protein coding regions according to human-curated annotation. Some of these are likely to be the result of incomplete cDNA synthesis or incompletely processed transcripts (e.g., 3′ untranslated regions [UTRs]), or perhaps transcriptional “noise,” so the identification of strong candidate functional ncRNAs requires additional annotation criteria.

There are two ways to identify ncRNAs computationally. One is the “genome based” approach, which detects ncRNAs from genomic sequence. More than 200 candidate ncRNA genes are predicted in Escherichia coli by computational comparative genomics using “intergenic” sequence data from four related bacteria (Rivas et al. 2001). Two similar approaches are reported by other groups, and at least 20 ncRNA genes have been experimentally confirmed in E. coli (Argaman et al. 2001; Wassarman et al. 2001). This approach is not feasible in the more complex mammalian genomes.

The other approach is “transcripts-based.” MacIntosh et al. (2001) attempted to identify and characterize new ncRNAs by using Arabidopsis thaliana EST sequences. Through systematic computational screening, they extracted dozens of ncRNA candidates and putative RNAs encoding small peptides. The investigators concluded that there are numerous functional ncRNAs in A. thaliana.)

This study describes the initial effort at comprehensive identification of mammalian mRNA-like processed ncRNAs based on the comprehensive mouse transcriptome survey provided by the RIKEN Mouse Gene Encyclopedia project.

RESULTS AND DISCUSSION

Characterization of Previously Reported ncRNAs in FANTOM2 Clone Set

To assess the representation of ncRNAs in the FANTOM2 clone set (see below), we initially identified previously described ncRNAs. The set of previously reported ncRNAs was based on the Noncoding RNAs Database (http://biobases.ibch.poznan.pl/ncRNA/), which was constructed by Erdmann et al (2000, 2001). The query sequence set from the database contains ncRNAs from both mammalian and nonmammalian origins, including plants and flies.

We used a homology search based on BLASTN and Ssearch (Pearson 1991), to search for 18 of the murine known ncRNAs, and nine of them were identified in the FANTOM2 clone set. Likewise, putative homologs of several ncRNAs previously described in other mammalian organisms were also found (Table 1). We have newly identified homologs of rat NTAB (French et al. 2001), 7H4 (Velleca et al. 1994), human NTT (Liu et al. 1997), NCRMS (Chan et al. 2002), U19 snoRNA host gene (Bortolin and Kiss 1998), and hamster adapt33 (Wang et al. 1996) in the FANTOM2 clone set. These results indicate that ncRNAs are well represented in the FANTOM2 clone set, and sequence conservation across species is a useful criterion in annotation and validation.

Table 1.

Summary of Previously Identified ncRNAs or Their Possible Homologs Found in FANTOM2 Clone Set

Computational Screening of the Novel ncRNA Candidates

The FANTOM2 clone set contains 60,770 cDNA clones selected from >260 normalized, subtracted, and full-length enriched cDNA libraries of C57BL/6J strain of mouse. The 60,770 clones of FANTOM2-set were clustered into 33,409 transcriptional units (TUs), and approximately half of the TUs contained a deduced protein sequence (Representative Protein Set [RPS]) based on ORF prediction and/or homology with known proteins. The remaining set of cDNAs (15,815 sequences) that are defined as non–protein-coding TUs (Okazaki et al. 2002) represent the starting set for identification of ncRNAs. To eliminate other possible sources of transcripts that lack an apparent functional ORF, such as UTR-only sequences (incomplete cDNA synthesis), unprocessed mRNAs with retained introns, and chimeric cDNA clones, we applied the following strategy of computational filtering. In addition to removing the RPS, we eliminated sequences that showed any homology with known protein sequences, even if they did not contain any evidence of an ORF. We then mapped the remaining sequences to the mouse genome (MGSCv3). Comparison between the alignments with the genome and exon predictions by GENSCAN (Burge and Karlin 1997) was also considered. Because GENSCAN may fail to identify untranslated regions of protein-coding transcripts, we eliminated any sequences that mapped within 10 kb of any predicted exon on the grounds that they may be part of the same TU, for example, an alternative 3′UTR or splice variant (see Methods).

Consequently, we extracted 4280 transcripts as the candidate set of ncRNAs. The procedures for computational screening and number of remaining sequences in each filtering step are shown in Table 2. The average length of the candidate clones was 1778.9 nt, and that of predicted protein-coding transcripts in RTS was 2131.8 nt. Likewise, the average length of the longest ORF of resulting candidates was 200.6 nt, whereas that of transcripts in RPS was 1088.7 nt. A full set of accession numbers of the candidate set is provided in the Supplementary tables, and the annotation of each individual candidate can be assessed at the FANTOM2 Web interface as described in the overview of the RIKEN project (Okazaki et al. 2002).

Table 2.

Summary of Computational Screening for Novel ncRNA Candidates

Characterization of the Candidate Set

The strategy used here is conservative, and we know that not all known ncRNAs would meet these criteria. Several previously reported ncRNAs were eliminated at each filtering step of the strategy described above, but as we would hope, the frequency within the remaining set increased (albeit the number of known ncRNAs is too small to assess the statistical validity of this assessment). To characterize the candidate set further, we used several additional criteria. First, we conducted a homology search against publicly available EST sequences of mouse, human, and rat, as well as against the human genome. We additionally searched CpG islands in 5′ boundary regions and polyadenylation signals in 3′ ends (see more details in Methods). CpG islands and polyadenylation signal are observed not only in a large number of protein-coding genes but also in several known ncRNAs. This can be taken as supporting evidence that they are actually transcribed by RNA polymerase II. We also determined the intron–exon boundaries of the loci encoding the transcripts, and we identified the subset of candidates that is produced by splicing of a primary transcript. These results are summarized in Table 3.

Table 3.

Summary of Further Characterization of the ncRNA Candidates

There were 1200 (28.0%) ncRNA candidates, which were homologous with mouse EST sequences (BLASTN with e value lower than 1.0e - 100). This indicates that approximately one fourth of the candidates have independent evidence of reproducible expression. This is not an especially stringent criterion, because the RIKEN Project is itself the largest mouse EST project, and library construction involves strong selection to avoid redundancy. Furthermore, functional ncRNAs may not be abundant transcripts. More interestingly, 111 (2.6%) clones showed strong homology with human ESTs, and 252 (5.8%) clones showed strong homology to rat ESTs. Furthermore, 454 (10.6%) clones could be aligned with the human genome sequence at ≥50% homology and ≥70% of length. Sequence conservation is also not a strict criterion for exclusion or functional significance, because several known ncRNAs would fail to meet this criterion.

We identified CpG islands in the 5′ boundary genomic region for 919 (21.5%) of the clones. CpG islands are associated with ∼40% of promoters for mammalian genes, most commonly those of housekeeping genes that are widely expressed (Takai and Jones 2002). A polyadenylation signal was found at the 3′ end of 1395 (32.6%) of the clones. This is a rather more stringent criterion, implying that the transcript is a genuine polyadenylated mRNA-like molecule. However, cDNAs that lack this signal might have arisen by internal oligo-dT priming but might still be bona fide ncRNAs. To clarify the robustness of these criteria, we plotted the average of CpG observed/expected (O/E) ratio and frequency of poly-A–like signals around the 5′ boundaries of mapped genomic regions, and the 3′ end of cDNA sequences, respectively. As shown in Figures 1 and 2, there is a clear peak of the CpG O/E ratio and poly-A signals that delineates the subsets of transcripts in the candidate set into separate classes.

Figure 1

Average of CpG observed/expected (O/E) value around putative transcription start site. The average of CpG O/E ratio for each transcription start site (from 3 kb upstream to 0.5 kb downstream) of the 4280 largest-candidate set was plotted. The set contains 919 sequences, which have potential CpG islands surrounding transcription start site, as referred in Table 3. Putative transcription start sites (TSSs) were defined by 5′ boundaries of mapped genomic regions as indicated by an arrow. CpG O/E ratio was calculated every 200-bp window with sliding 20 bp. The formula for producing CpG O/E ratio is described in Methods.

Figure 2

Frequencies of polyadenylation signal-like sequences upstream of 3′ end. Frequencies of polyadenylation signal-like sequences located upstream of 3′ end sites were plotted. The sequence pattern AATAAA/ATTAAA was searched for every position from the 3′ end of the 4280 largest-candidate set. The set contains 1395 sequences, which contain polyA-signal like sequence in the 3′ end, as mentioned in Table 3.

Among the candidate set that could be mapped to the genome, 1150 (26.9%) revealed multiple exons. Several known ncRNAs such as Xist, Gas5, and BIC (Smith and Steitz 1998; Hong et al. 1999; Tam 2001) are known to undergo splicing. Again, this criterion provides a strong indication that the transcript is a genuine product of RNAPol II–mediated transcription and is likely to be functional.

In a separate analysis, the FANTOM2 cDNA set was found to contain >2400 sense–antisense pairs (Okazaki et al. 2002; Kiyosawa et al. 2003). Three hundred twenty-three (7.5%) members of the candidate set are also included in the antisense transcript candidates. Antisense transcripts have been implicated in transcription control, especially in genomic imprinting. For example, AIR, which is an antisense transcript from Igf2r locus, silences three kinds of paternal imprinting genes (Sleutels et al. 2002). In another study (Holmes et al. 2003), new and novel antisense transcripts from among the FANTOM2 set were mapped to the imprinted GNAS locus.

Among the candidate set, 68.0% of the clones fitted at least one of the criteria, and 54.8% of these fitted clones satisfied more than two criteria, as shown in Table 3. Therefore, we believe that the set contains many potential ncRNAs and that a substantial subset will be shown to be functional in some aspect of mammalian biology. Twenty-five strong candidates extracted by our filtering are listed as examples for putative ncRNAs in Table 4. The average length of the longest ORFs of them is 195.2 nt. As an additional index that the transcripts are unlikely to code for even a small peptide, in all of the clones except TF14562 and TF9816, the longest ORFs are not started from the first ATG of each clone, which is normally used as the initiation codon in >90% of mammalian protein coding transcripts. It remains possible that a small subset does, indeed, encode small proteins. Detailed annotation of candidate CDS encoding proteins between 50 and 99 amino acids is described elsewhere (Grimmond et al. 2003), and candidates that arise from this analysis will be eliminated from the candidate set.

Table 4.

Twenty-Five Examples of ncRNA Candidates

The next stage in further validation of these candidate ncRNAs is documentation of their regulation and expression profiles. A number of candidates are isolated from tissue-specific and stage-specific libraries (Table 4). The ncRNAs that show tissue-restricted expression would clearly be candidates for functions in differentiation and development. Examples in this class would include BC1 and NTAB (French et al. 2001; Muddashetty et al. 2002), which are specifically transcribed in the brain. They make a complex with certain ribonucleoproteins (RNPs) and regulate the RNA translation, transport, and turnover. The FANTOM2 Web interface for all candidate transcripts provides some indication of expression pattern based on the profile of libraries from which ESTs have been identified. The RIKEN project includes systematic cDNA microarray analysis of all of the clones in the FANTOM2 set, and some of this information is already in the public domain (Bono et al. 2003).

In conclusion, the FANTOM2 cDNA collection clearly contains thousands of candidate ncRNAs, a significant subset of which has all the characteristics of an mRNA other than protein-coding function. The era of the central dogma, DNA-RNA-protein, as the major conduit for expression of genome-encoded biological information, is clearly at an end.

METHODS

Computational Screening of the Candidates

The Mouse Representative Transcripts Set (RTS) was used as the nonredundant sequence set for this transcriptome screening. First, the RTS sequences that are defined as non–protein coding TU by FANTOM consortium, that is, RTS that cannot produce proteins in the Representative Protein Set (RPS), were extracted (Okazaki et al. 2002). Homology search with known amino-acid sequences (ftp://us.expasy.org/databases/sp_tr_nrdb/) according to BLASTX (http://www.ncbi.nlm.nih.gov/blast) was performed to eliminate any likelihood of homology with a protein-coding transcript. If the result of BLASTX was produced with e value <1.0e - .05, the clone was eliminated. It should be noted that this criterion would eliminate transcripts from expressed pseudogenes. As the next filtering step, cDNA sequences were aligned to genomic sequence by using BLAST and SIM4 (http://globin.cse.psu.edu/dist/sim4/). If they were aligned at >90% identity over >90% of their length, cDNA clones were kept, otherwise they were discarded. In addition, the prediction of protein-coding regions in genomic sequences by GENSCAN (http://genes.mit.edu/GENSCANinfo.html) was also carried out. The remaining sequences were kept as the candidates, if the entire 10-kb region of genome sequence around the mapped region—the size was determined according to previous work (MacIntosh et al. 2001)—did not overlap with protein-coding regions predicted by GENSCAN.

Further Characterization of the Largest Candidate Set

All of the homology searches with publicly available EST sequences were performed by BLASTN. Only EST sequences with e value <1.0e - 100 were regarded as corresponding homologous mouse ESTs (ftp://ftp.ncbi.nih.gov/blast/db/est_mouse.Z), and sequences with E-values lower than 1.0e - 50 were regarded as the likely human (ftp://ftp.ncbi.nih.gov/blast/db/est_human.Z) and rat (ftp://ftp.ncbi.nih.gov/genomes/R_norvegicus/rn_est.gz) orthologous ESTs. Reverse hits were not considered.

A homology search with human genomic sequences was performed according to BLAST and SIM4 in the identical way as gene mapping to the mouse genome sequence (see Computational Screening of the Candidates). Because there is less selection pressure on contiguous homology of noncoding sequences, hits of 70% of the full-length, and at least 50% of nucleotide identity, were considered significant.

The CpG island analysis was performed according to calculation of CpG O/E ratio and (G+C) content for every 200-bp window with moving 1-bp intervals around 5′ boundaries of aligned region of cDNAs. If the region had (G+C) content ≥50% and CpG O/E ratio ≥ 0.6, it was considered as a CpG island. CpG O/E ratio was calculated by using the Gardiner-Garden and Frommer formula (Gardiner-Garden and Frommer 1987), ([number of CG × N]/[number of C × number of G]), where N denotes the total number of nucleotides in the analyzed sequence. The search of polyadenylation signal is based on statistic pattern search. Two hexamer sequences, AATAAA or ATTAAA, were searched for each 30 nucleotides of the 3′ end.

Acknowledgments

We acknowledge Dr. I. Yamanaka, Dr. H. Bono, and I. Nikaido for technical support; S. Fujimori, A. Sakurai, and H. Kochiwa for helpful discussions; and M. Chishima for surveys of previously identified ncRNAs.

Footnotes

  • [Supplemental material is available online at www.genome.org.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1011603.

  • 8 Takahiro Arakawa, Piero Carninci, and Jun Kawai.

  • 9 Corresponding author. E-MAIL mt{at}sfc.keio.ac.jp; FAX 81 (466) 47-5099.

    • Accepted January 28, 2003.
    • Received November 22, 2002.

References

WEB SITE REFERENCES

| Table of Contents

Preprint Server