A Guide to the Mammalian Genome
- Yasushi Okazaki1,3,4 and
- David A. Hume2
- 1Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
- 2Institute for Molecular Bioscience, University of Queensland, Brisbane, Q4072, Australia
Sequencing of a Transcriptome
The rapid completion and public release of the mouse and human genome sequences have led to a downward revision of the number of “genes” predicted in the mammalian genome, to around 30,000 (Mouse Genome Sequencing Consortium, Waterston et al. 2002). In simpler organisms such as yeast, estimating gene number is comparatively straightforward, because most of the genome clearly encodes proteins, and individual genes generally have a well-defined start and finish and a single mRNA output. In mammals, the task is much more complex. Only a small proportion of the genome encodes mRNAs that in turn encode protein, and protein-coding sequence is interspersed with large introns and intergenic regions. Even protein-coding genes have proven difficult to annotate reliably (Kawai et al. 2001), and non-protein-coding genes are essentially impossible to annotate a priori.
The key to reliable annotation of a mammalian genome is the comprehensive characterization of its transcriptional output, the transcriptome. There are two approaches to this problem. The most common is high-throughput sequencing of cDNA ends to produce expressed sequence tags (ESTs). For mouse and human, and to a lesser extent for many other mammals, there are millions of EST sequences in various repositories. EST sequences can be computationally assembled into clusters, as in the UniGene projects (http://www.ncbi.nlm.nih.gov/UniGene). This approach has many drawbacks, arising from cDNA cloning, from sequence quality, and from the computation itself, but the most compelling is that the assembled sequences are generated in silico and are not necessarily supported by a physical clone. It is also rather inefficient: even with the best subtraction and normalization, abundant transcripts have been sequenced thousands of times, whereas many rare transcripts are absent from EST databases. EST assemblies are particularly difficult to interpret when multigene families or complex alternative splicing are involved.
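As a rough illustration of what "computationally assembled into clusters" can mean, the toy sketch below groups ESTs that share short exact subsequences (k-mers) using single-linkage clustering. It is not the UniGene procedure; the function names (`kmers`, `cluster_ests`) and the parameters `k` and `min_shared` are hypothetical choices made for this example only, and real pipelines must also handle sequencing errors, reverse complements, and repeats.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs sharing at least min_shared distinct k-mers.

    ests maps an EST name to its sequence. Linkage is transitive
    (single linkage), implemented with a small union-find structure.
    k and min_shared are illustrative defaults, not tuned values.
    """
    names = list(ests)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Index: k-mer -> names of the ESTs that contain it.
    index = defaultdict(set)
    for name in names:
        for kmer in kmers(ests[name], k):
            index[kmer].add(name)

    # Count distinct shared k-mers for every pair of ESTs.
    shared = defaultdict(int)
    for members in index.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Merge pairs that pass the threshold.
    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for name in names:
        clusters[find(name)].append(name)
    return sorted(sorted(group) for group in clusters.values())
```

Because linkage is transitive, two ESTs from different members of a multigene family, or from alternatively spliced forms that share an exon, can end up in one cluster, which is one reason such assemblies are hard to interpret. A resulting cluster is also purely a computational object: nothing in it guarantees that a single physical clone spans the whole assembled sequence.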
The alternative approach is …











