Origin of phenotypes: Genes and transcripts

Thomas R. Gingeras
Affymetrix, Inc., Santa Clara, California 95051, USA

Abstract

While the concept of a gene has been helpful in defining the relationship of a portion of a genome to a phenotype, this traditional term may not be as useful as it once was. Currently, “gene” has come to refer principally to a genomic region producing a polyadenylated mRNA that encodes a protein. However, the recent emergence of a large collection of unannotated transcripts with apparently little protein-coding capacity, collectively called transcripts of unknown function (TUFs), has begun to blur the physical boundaries and genomic organization of genic regions, with noncoding transcripts often overlapping protein-coding genes on the same (sense) and opposite (antisense) strands. Moreover, they are often located in intergenic regions, making the genic portions of the human genome an interleaved network of both annotated polyadenylated and nonpolyadenylated transcripts, including splice variants with novel 5′ ends extending hundreds of kilobases. This complex transcriptional organization and other recently observed features of genomes argue for the reconsideration of the term “gene” and suggest that transcripts may be used to define the operational unit of a genome.

New technical and conceptual insights have often prompted reconsiderations of what constitutes fundamental functional elements in a genome. In 1909, influenced by the writings of Hugo de Vries, Wilhelm Johannsen coined the term “gene” (Churchill 1974; Stamhuis et al. 1999). It was an attempt to provide a term that would represent an element that connected an inherited physical entity to an observable phenotype (Fig. 1A). Empirical findings and conceptual proposals made in the mid-20th century focused on the structural entities composing a gene. Notably, the elucidation of the structure of DNA (Watson and Crick 1953) and the subsequent unraveling of the processes of DNA replication and RNA transcription led to the identification of new elements in the genome, which, in turn, helped to sharpen an understanding of both the physical properties and definition of the term “gene” (Fig. 1B). Not long after the first description of the double helical structure of DNA, Francis Crick published a Central Dogma proposition as an operational framework describing how information stored in the sequence of DNA was transferred from the genome into functional protein products (Crick 1958). Two unstated implications from Crick’s proposition emerged after publication. First, genes were viewed as discrete bounded elements, from which RNA was transcribed to carry stored information from the DNA to the cell for protein synthesis. Second, it was interpreted that the flow of information from DNA was unidirectional, with genes having the limited role of encoding protein synthesis information. Twelve years later, Crick responded to criticisms that the Central Dogma proposition was an oversimplification and further clarified his intended meaning. He restated that there was a versatile role for RNA that could allow for information to flow back to the genome. Although the details were understandably sparse, Crick noted that RNAs should not be considered single-purpose functional elements (Crick 1970). However, as the field of molecular genetics matured, with few notable exceptions, the functional roles for RNAs as products of genes remained focused on the production of proteins.

Figure 1.

Evolution of the gene model and its relationship to wild-type and mutant phenotypes. Over the past century, the definition of a gene has been improved and refined from its conceptual origin in the early 1900s (A) with the discovery of RNA and DNA structures (B), splicing (C), and lastly, widespread unannotated transcription (D). Exonic regions are depicted as blue boxes with transcripts shown as arrows below (spliced and unspliced). A hypothetical mutation is shown as a red triangle. Note that as the definition of a gene grows to include multiple transcripts, a single mutation can now affect many different transcripts and thus could potentially have multiple and more subtle phenotypes.

Are genes exclusively composed of protein-coding transcripts?

Efforts by subsequent generations of scientists have centered on adding greater molecular definition to the physical structure of genes and on achieving a greater understanding of how phenotypes are derived from genes. Studies aimed at defining the structure and organization of genes, characterizing the molecular structures of the RNA and protein products of genes, and determining how gene expression is regulated at both the transcript and protein levels have led to many landmark discoveries. These include gene cloning, transcription factor-gene interactions, and RNA splicing, editing, transport, and translation, to mention just a few (Fig. 1C). For the most part, these advances have taken place in the context of studying individual genes or restricted portions of genomes. Meanwhile, there was a growing realization that comprehensive answers to questions concerning the structure, function, and regulation of genes and their relationships to phenotypes required the analysis of large portions or entire genomes for most organisms.

The completion of a working draft of the human genome (Lander et al. 2001; Venter et al. 2001) provided one of the prerequisites for the development of the field of genomics. The number of genes that constitute the human genome was one of the first well-publicized genome-wide questions to be posed. While the ensuing debate may have been unnecessarily exaggerated, it seems clear that the proposed estimates were and still are based primarily on the default definition of a gene as a protein-coding functional element. As a reflection of this perception bias, protein-coding genes currently dominate the contents of most genome databases (Lander et al. 2001; Waterston et al. 2002; Zhang 2002; Parra et al. 2003).

The question of how many genes are present in the human genome led to a second query centered on the completeness of the cataloged collection of known protein-coding genes. The technical approaches used to answer this question, undertaken not only for the human genome but also for Arabidopsis, worm, fly, and mouse, have included in-depth full-length cDNA cloning, tiling microarrays, determination of the transcript 5′ ends using cap analysis of gene expression (CAGE), 3′ ends using serial analysis of gene expression (SAGE), and both 5′ and 3′ ends with gene-identification signature analysis using paired-end ditags (GIS-PET) (for review, see Johnson et al. 2005; Carninci 2006; Willingham and Gingeras 2006; Kapranov et al. 2007b). Each of these approaches has made an effort to interrogate genomes in an unbiased fashion (i.e., without regard to the knowledge of the location of previously identified protein-coding and noncoding genes). In this way, empirically based maps could be compared with the maps composed of annotated protein-coding genes. In turn, it would be possible to assess the completeness of catalogs of genes for each genome. In addition, the total number of genes for each genome could be estimated, leading to the surprising observation that despite the 30-fold difference in genome size and vast differences in organismal complexity, humans have a number of genes comparable to that of the nematode worm (22,726 vs. 20,060 genes, respectively).

A by-product of these studies was the unanticipated but unanimous conclusion that there was a significantly greater amount of transcriptional output from genomes than could be accounted for by our current collection of annotated protein-coding transcripts. Most of the newly identified unannotated transcripts were observed to have little protein-coding capacity (i.e., <100 amino acids) (Kapranov et al. 2002). These observations indicated that there exists a large collection of transcripts within cells that are not involved in directing protein synthesis (Fig. 1D). This large collection of transcribed regions has euphemistically been called the “dark matter” of the genome (Johnson et al. 2005) because, until recently, these transcripts have escaped detection despite a considerable history of cDNA and EST cloning experiments. Although these transcripts appear to have reduced coding potential and have provisionally been termed noncoding transcripts, there is no formal evidence that these transcripts do not encode short polypeptides. Thus, the term transcripts of unknown function (TUFs) has recently been suggested as their interim collective name (Cheng et al. 2005).
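The coding-capacity criterion quoted above (<100 amino acids) can be illustrated with a minimal sketch. This is not the procedure used in the cited studies; it assumes a simple definition of an open reading frame (the first ATG in a frame through the next in-frame stop codon), takes the transcript as a cDNA string (T rather than U), and uses hypothetical function names chosen only for illustration.

```python
# Illustrative sketch only. The 100-amino-acid cutoff comes from the text;
# the ATG-to-stop ORF definition and all function names are assumptions.

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODING_AA = 100  # transcripts with a longest ORF below this are "little coding capacity"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence made of A, C, G, T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf_aa(seq: str, both_strands: bool = False) -> int:
    """Length in amino acids of the longest ATG-initiated ORF closed by a stop codon."""
    strands = [seq.upper()]
    if both_strands:
        strands.append(reverse_complement(seq))
    best = 0
    for strand in strands:
        for frame in range(3):
            start = None  # position of the ATG that opened the current ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Count codons from the ATG up to, but excluding, the stop.
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def has_little_coding_capacity(seq: str) -> bool:
    """True when the longest ORF on the given strand is shorter than the cutoff."""
    return longest_orf_aa(seq) < MIN_CODING_AA


short_tx = "ATG" + "GCC" * 10 + "TGA"   # 11-codon ORF
long_tx = "ATG" + "GCC" * 120 + "TAA"   # 121-codon ORF
print(has_little_coding_capacity(short_tx), has_little_coding_capacity(long_tx))
```

As the text notes, falling below such a cutoff is not evidence that a transcript encodes no short polypeptide; the check only flags reduced coding potential.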

Prevalence of TUFs in nonprotein-coding regions of genomes

Additional confirmation of the prevalence of TUFs indicates a consistent picture of a large and until recently unannotated collection of stable cytosolic polyadenylated and nonpolyadenylated transcripts comprising approximately half of the human and mouse transcriptome. Initial analyses of the transcribed regions identified by independent technical approaches show that more than half are observed by at least two different methods (Chen et al. 2002, 2004; Shiraki et al. 2003; Carninci et al. 2005, 2006; Ng et al. 2005; Ge et al. 2006).

The complexity and cellular localization of these unannotated transcripts have also proven to be unexpected. Transcriptional analysis of 10 human chromosomes demonstrates that unannotated nonpolyadenylated transcripts originating from intergenic regions of these chromosomes comprise the major proportion of the transcriptional output of the human genome (Cheng et al. 2005). In addition, nuclear and cytosolic compartmentalization of both polyadenylated and nonpolyadenylated unannotated transcripts has been observed using tiling arrays and cDNA sequencing analyses (Cheng et al. 2005; Kiyosawa et al. 2005).

Several studies have estimated that ∼10% of the nonrepeat sequences of the genome appear to be transcribed, polyadenylated, spliced in a high proportion of transcripts, and transported into the cytosol (Kapranov et al. 2002; Lian et al. 2003; Martone et al. 2003; Rinn et al. 2003; Yelin et al. 2003; Cheng et al. 2005). Considering the annotated transcripts present in RefSeq and GENCODE (Harrow et al. 2006) databases, as well as all ESTs recorded in dbEST, more than half of the detected transcribed sequences are not observed to align with these annotated transcripts (The ENCODE Project Consortium 2007; Kapranov et al. 2007a). These unannotated transcribed regions are approximately evenly distributed within and between gene boundaries.

These results were confirmed by several groups who participated in the National Human Genome Research Institute-sponsored Encyclopedia of DNA Elements (ENCODE) project, which focused its research efforts on 44 diverse regions of the human genome (∼1%) to identify and characterize the functional elements present in these sequences (The ENCODE Project Consortium 2007). Analyses of the sites of transcription in these regions are presented in this special issue of Genome Research. Several striking observations consistent with the presence of a large representation of TUFs were made. First, it was estimated that for the nearly 400 annotated genes present in the ENCODE regions, the protein-coding loci averaged 5.4 transcripts per gene with only 1.7 potentially encoding proteins (Denoeud et al. 2007; The ENCODE Project Consortium 2007). Second, >65% of these genes possess 5′ distal (108,000 bp on average) previously unannotated, tissue-specific transcription start sites (TSSs) and promoter regions, many of which are parts of TUFs (Denoeud et al. 2007). Third, large numbers of protein-coding genes in these regions have isoforms that are composed of exons located in genomic nonprotein-coding regions (introns and intergenic regions) (Rozowsky et al. 2007). Fourth, analysis of transcribed unannotated ENCODE regions reveals the potential to fold into stable RNA structures (Washietl et al. 2007). Fifth, a compilation of all previously annotated and empirically detected RNAs found in the ENCODE studies indicates that to produce these RNAs, >90% of genomic sequence appears to be transcribed as nuclear primary transcripts (The ENCODE Project Consortium 2007).

The existence of this additional layer of transcriptional complexity has prompted several questions concerning: (1) the likelihood of the functional significance of widespread transcription; (2) the relationship of TUFs to protein-coding transcripts; and (3) their regulation, structure, and genomic organization. While answers to some of these questions are emerging, studies focused on noncoding transcripts of known biological function have begun to reveal a complexity in genome organization not captured by the current collection of annotations, prompting a reconsideration of what constitutes the fundamental functional element of the genome and how it relates to phenotypic variation.

Well-characterized noncoding transcripts of known function

Well-characterized noncoding transcripts with known functions include ribosomal (r)RNAs, transfer (t)RNAs, small nuclear (sn)RNAs, small nucleolar (sno)RNAs, as well as small RNA components of RNase P and other protein complexes (for review, see Eddy 2001; Storz 2002; Prasanth and Spector 2007). Another class of noncoding RNAs includes microRNAs (miRNAs) and exogenous small interfering RNAs (siRNAs), both of which participate in the RNA interference pathway (RNAi) and have regulatory functions at transcriptional and post-transcriptional levels. Several of the structural and regulatory features of these known nonprotein-coding RNAs are notable and can be used as characteristics of functional transcripts.

snoRNAs

The first notable feature is the range of lengths of snoRNA transcripts, varying from 60 to 300 nucleotides (nt). This variability in transcript size suggests either that flexibility in transcript length is required to carry out similar functions or that this class of noncoding RNAs may have multiple functions (see below). Second, these stable transcripts carry out their modifications of rRNAs in association with a set of proteins to form a collection of small nucleolar particles (Bachellerie et al. 2002). This association with specialized proteins is shared by many other protein-coding and noncoding transcripts, and the specificity it confers is instructive: throughout the cell, noncoding transcripts appear to provide a context-specific function to a common set of protein factors. Third, snoRNA transcripts in higher eukaryotes are processed from introns of mRNAs, thus serving as one of the first examples of the functional importance of intronic portions of preprocessed transcripts and blurring the boundaries of gene organization. Fourth, computational studies of the Saccharomyces cerevisiae genome have identified many novel methylation-guide snoRNAs that are involved in rRNA modification (Lowe and Eddy 1999; Schattner et al. 2004), indicating that although this is a well-established functional class of noncoding transcripts, the membership of this class is still growing. Finally, recent studies indicate that a number of snoRNA transcripts do not possess sequences that are fully complementary to rRNA targets (Jady and Kiss 2000; Li et al. 2005), which not only presents a challenge in identifying these targets, but also suggests that a larger network of cellular proteins and/or other transcripts outside of the rRNA complex may be required to assist snoRNAs in carrying out their functions. This latter finding opens the possibility that snoRNAs may have functions other than modification of rRNAs and spliceosomal RNAs. One such function, regulation of alternative splicing of a transcript encoded in trans, has recently been demonstrated for one snoRNA, HBII-52 (Kishore and Stamm 2006).

RNA interference (RNAi): miRNAs and siRNAs

Both miRNAs and siRNAs have been shown to be sequence-specific transcriptional and post-transcriptional regulators of gene expression (Doench et al. 2003; Bartel 2004; Meister and Tuschl 2004; Zamore and Haley 2005; Kim and Nam 2006). These two classes of noncoding transcripts also possess many distinguishing characteristics that are essential for their biological functions, and as such, may exemplify common characteristics shared by newly identified noncoding transcripts.

RNAi noncoding transcripts operate as double-stranded RNA molecules, with each strand being ∼21–23 nt in length in their ultimately functional forms. It is known that both types of RNAi molecules are produced from relatively long pri(mary)-transcripts by RNase III classes of endoribonucleases. The miRNAs are first processed in the nucleus by RNASEN (formerly DROSHA). Following transport out of the nuclear compartment, DICER1, a dsRNA-specific endonuclease, processes the 70-mer pre-transcripts into the biologically active double stranded 21–23-mers. However, with the exception of a few cases, relatively little is known about the primary transcripts that give rise to the 70-mer precursors (pre-) of miRNAs or to siRNAs. The fully processed siRNAs and miRNAs are incorporated into the RNA-induced silencing complexes (RISC), which target specific mRNA transcripts to interfere with target RNA stability or translation (Nelson et al. 2003; Bartel 2004; Cullen 2004; Lee et al. 2004; Tijsterman and Plasterk 2004; Rivas et al. 2005; Zamore and Haley 2005; Kim and Nam 2006).

These two classes of nc-RNAi transcripts also possess several characteristics that are similar to those previously described for snoRNA transcripts, including: (1) production from much larger precursor RNA molecules; (2) the frequent mapping of pri-RNA transcripts to genomic sites previously considered less biologically relevant (i.e., intergenic and intronic regions); (3) the association of primary, precursor, and mature miRNAs and siRNAs with specific protein complexes to achieve biological functionality; (4) the ability of a single RNAi or snoRNA transcript to regulate multiple transcripts in trans using partial sequence complementarity; and (5) the likelihood that the current catalog of RNAi transcripts is significantly underestimated (Lewis et al. 2003, 2005; Krek et al. 2005; Kishore and Stamm 2006).

Other characterized functional noncoding RNAs

In addition to the short RNA species discussed above, there is a growing number of other noncoding RNAs with established or likely biological functions (for review, see Mattick and Makunin 2006; Willingham and Gingeras 2006; Prasanth and Spector 2007). These RNAs can range in length from 21 to 30 nt (e.g., 21U RNAs and piRNAs) through hundreds of nucleotides (e.g., 330 bp for 7SK snRNA) to ∼100 kb (e.g., 108 kb for Air RNA) (Prasanth and Spector 2007). Furthermore, functional noncoding RNAs have been shown to act via protein (e.g., NRON) (Willingham et al. 2005), RNA (e.g., some natural antisense transcripts) (Wahlestedt 2006), DNA (e.g., Xist) (Avner and Heard 2001), or combinations of these types of interactions (e.g., the promoter-specific noncoding RNA of the DHFR gene that interacts with promoter DNA as well as components of the core transcriptional machinery) (Martianov et al. 2007).

Thus, the characteristics observed to be part of the regulation, structure, and genomic organization of well-characterized noncoding transcripts of known function (e.g., snoRNAs, miRNAs, siRNAs, and others) represent potential hallmarks, several of which are shared by TUFs, which could be used to help to identify other classes of functional noncoding transcripts.

Transcripts of unknown function (TUFs)

TUFs identified from analysis of cytosolic polyadenylated RNAs appear to share at least four characteristics with RNAi and snoRNA transcripts. The first of these shared characteristics is that some of these unannotated transcripts appear to be part of a regulatory system for protein-coding gene expression. Several groups have shown that cis-encoded unannotated antisense transcripts on a wide genomic scale are found to be simultaneously expressed with their paired sense transcript (Cawley et al. 2004; Katayama et al. 2005; Kiyosawa et al. 2005). This expression is observed to be either coordinately or discordantly regulated with the sense transcript; therefore, antisense transcripts cannot be assumed to have a simple antagonistic RNAi-mediated influence on the complementary transcript. However, when compared with genes without antisense transcripts, antisense transcript pairs are considerably more likely to have this genomic organization evolutionarily preserved, suggesting that some functional relationship is being retained (Dahary et al. 2005). On an individual gene level, the antisense regulation of MYCN (Krystal et al. 1990), HIF1A (Thrash-Bingham and Tartof 1999), and IME4 (Hongay et al. 2006) may point the way to how some of the antisense transcripts may carry out the regulation of their cognate sense genes. In yeast, entry into meiosis is controlled by IME4 and its regulation by an antisense transcript through what appears to be a mechanism of transcription interference. Diploid cells with IME4 antisense transcription have reduced sense transcripts and do not enter meiosis. Furthermore, human diseases ranging from breast cancer and lymphoma to thalassemia have been linked to naturally occurring antisense transcripts (for review, see Wahlestedt 2006).

The second shared characteristic is regulation of the TUFs by independent promoter elements not necessarily associated with the regulation of protein-coding genes. The majority of the binding sites of MYC and SP1 on chromosomes 21 and 22 and of CREB1 on chromosome 21 were located in introns, exons, and intergenic regions (Cawley et al. 2004; Euskirchen et al. 2004). Many of these sites contained evidence of unannotated transcription in close proximity. The large-scale sequencing of more than 12 million CAGE tags from multiple mouse and human tissues permitted the genome-wide mapping of transcriptional start sites (TSSs) (Carninci et al. 2006). Widespread unannotated transcription was supported by an abundance of intergenic TSSs. Furthermore, the significant appearance of TSSs within internal exons and 3′ UTRs of annotated genes suggests multiple overlapping transcripts for many known genes.

The third shared characteristic is that the genomic locations encoding these TUFs correspond to regions thought to be biologically less important (introns and intergenic regions). Bertone et al. (2004) noted that 38% of the transcriptionally active regions (TARs) they detected while interrogating the entire human genome using tiling arrays were located more than 10 kb from any previously annotated gene. Schadt et al. (2004) for human chromosomes 20 and 22 and Cheng et al. (2005) for 10 human chromosomes also reported that ∼25% of the oligonucleotide probes on their respective microarrays detected evidence of transcription emanating from intergenic regions.

Of note, the maps of transcribed sequences created using microarrays are very conservative. The thresholds used to determine whether a hybridizing signal is background or real signal have been set to select for the highest 2%–10% of the possible probes. Since the estimated copy number of many of the detected TUFs is low (estimated to be between less than one and 10 copies per cell), most of these transcripts are not reported, because lowering the threshold to capture them would risk increasing the number of false-positive calls. In addition, only a relatively small number of differentiated and undifferentiated mammalian cell types/tissues have been analyzed by each of the laboratories using the five methodological approaches mentioned previously (cDNA cloning, microarrays, CAGE, SAGE, PET ditags). In-depth analysis of the full range of cell types found in mammals is likely to reveal additional members for each of the TUF categories (see below). Therefore, a fourth characteristic shared with noncoding RNAs of known function is that the transcript membership of each of the general categories of TUFs is undoubtedly underestimated.
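The effect of such a conservative threshold can be sketched as follows. This is a simplified illustration rather than the analysis pipeline of any cited study: it assumes a single list of probe intensities, treats the threshold as a rank cutoff that keeps the top 2%–10% of probes, and uses hypothetical function names.

```python
import math

# Illustrative sketch only. The 2%-10% range comes from the text; the
# rank-based cutoff and the function names are simplifying assumptions.


def top_fraction_cutoff(intensities, fraction):
    """Return the lowest intensity that still falls within the top `fraction` of probes."""
    if not intensities:
        raise ValueError("at least one probe intensity is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(intensities, reverse=True)
    n_keep = max(1, math.floor(len(ranked) * fraction))
    return ranked[n_keep - 1]


def call_transcribed(intensities, fraction=0.10):
    """Flag each probe as positive when its intensity reaches the cutoff.

    Ties at the cutoff are all kept, so slightly more than `fraction`
    of the probes can be called positive.
    """
    cutoff = top_fraction_cutoff(intensities, fraction)
    return [value >= cutoff for value in intensities]


# Toy data: 100 probes with intensities 1..100.
probes = list(range(1, 101))
lenient = call_transcribed(probes, fraction=0.10)
strict = call_transcribed(probes, fraction=0.02)
print(sum(lenient), sum(strict))
```

Under either setting, a probe with mid-range signal (for example, the probe with intensity 50 in the toy data) is not called, which is the sense in which low-copy transcripts go unreported.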

Potential categories of TUFs

Three general organizational categories for the observed unannotated transcribed sequences can be identified. These categories are defined based on the relationship of TUFs to the structure and organization of the protein-coding transcripts. The first category consists of those TUFs that are complementary to sense transcripts. Relative to the sense transcript, these antisense transcripts can occur in cis (transcripts that overlap sense transcripts and for at least some portion of their length are completely complementary to exonic and/or intronic portions of sense transcripts) and trans (transcripts that are synthesized at a genomic site distal from the sense-transcribed region and may be only partially complementary to the sense transcript) (Kumar and Carmichael 1998; Vanhee-Brossollet and Vaquero 1998). The prominent presence of antisense transcripts in the genome has only recently been appreciated. Computational analyses of cDNA databases have estimated that from 8% (Shendure and Church 2002; Yelin et al. 2003) to 20% (Chen et al. 2004) of well-characterized coding genes have at least one overlapping antisense transcript. Empirical studies have increased this estimate to >50% (Cheng et al. 2005), with the majority being unannotated transcripts. Comprehensive analyses of large cDNA, CAGE, and PET ditag libraries report similar occurrences of antisense transcription, with as many as 72% of all annotated transcription units having an antisense transcript (Kiyosawa et al. 2003, 2005; Katayama et al. 2005).

The second category of unannotated transcribed sequences corresponds to isoforms of well-characterized protein-coding transcripts. Using a combination of techniques including microarray analysis, rapid amplification of cDNA ends (RACE), RT–PCR, and sequencing of isolated cDNA clones, Kapranov et al. (2005) have noted that novel isoforms have been identified for almost every well-characterized protein-coding transcript examined. These experiments were later greatly expanded to include all annotated genes within the boundaries of the 1% of the human genome represented by the ENCODE regions (Denoeud et al. 2007; The ENCODE Project Consortium 2007). Strikingly, 90% of the 399 genes have either a previously unannotated exon or a new TSS (Denoeud et al. 2007; The ENCODE Project Consortium 2007). These novel isoforms include extended or shortened annotated exons as well as new exons. In fact, a combination of tiling arrays and RT–PCR/RACE experiments revealed that many human and Drosophila genes have extensive previously unannotated 5′ exons that are often noncoding UTRs. In Drosophila, the average size of newly predicted first introns was found to be >10-fold larger than estimated from RefSeq annotations (Manak et al. 2006), whereas in human ENCODE regions, new first introns averaged 108 kb with 23% of new introns >200 kb (The ENCODE Project Consortium 2007).

Expressed pseudogenes are a special version of this second category and may also contribute to the pool of unannotated transcribed sequences. While a pseudogene may have lost its ability to code for a functional protein, it may still be transcribed. An estimated 20,000 processed and unprocessed pseudogenes are present in the human genome (Torrents et al. 2003). However, this is likely to be an underestimate, since these analyses under-represent evolutionarily older and smaller pseudogenes. A recent revision of the state of the human genome sequence estimates that there will be more pseudogenes than functional protein-coding genes in the human genome (International Human Genome Sequencing Consortium 2004). Pseudogene transcripts have previously been shown to be functional by helping to regulate the mRNA stability and/or translation of their homologous coding genes (Hatfield et al. 2002; Zhang et al. 2002; Hirotsune et al. 2003; Yano et al. 2004). These findings demonstrate that expressed pseudogenes may be associated with specific regulatory role(s), and further highlight the potential functional significance of some of the unannotated transcripts. Approximately 10%–14% of the array-detected unannotated transcribed sequences found expressed in 10 human chromosomes may map to pseudogene loci (Cheng et al. 2005). Consistent with these results, the ENCODE Consortium, using a variety of experimental techniques, conservatively estimated that 19% of pseudogenes located within the ENCODE regions are transcribed (The ENCODE Project Consortium 2007; Zheng et al. 2007).

The third category consists of transcripts that either overlap intron regions of well-characterized annotated gene transcripts (on the same strand) or are entirely found within intergenic regions. Analysis of the structure and organization of TUFs using microarrays, RACE, and cloning/sequencing methods indicated that ∼10% of the interrogated unannotated polyadenylated cytosolic TUFs were found to be located entirely in the intergenic regions, while another 10% of TUFs were found to be entirely included in the intronic regions of annotated protein-coding transcripts (Cheng et al. 2005). These transcripts often appear to be located near genomic regions that bind an assortment of transcription factors and contain localized histone modifications that alter the chromatin structure in a manner conducive for active transcription (The ENCODE Project Consortium 2004, 2007; Kapranov et al. 2007a).
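The three organizational categories described above can be summarized in a minimal classification sketch. It is an illustration under stated assumptions, not a method taken from the cited studies: coordinates are taken to be zero-based and half-open, only cis relationships are considered (trans-antisense transcripts require sequence comparison and are not detected), an opposite-strand overlap takes precedence over other labels, and all class names, field names, and labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Illustrative sketch only. Class names, labels, and the precedence of
# categories are assumptions; coordinates are zero-based and half-open.


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int
    end: int
    strand: str                         # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs within the gene


@dataclass(frozen=True)
class Transcript:
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_tuf(tuf: Transcript, genes: Sequence[Gene]) -> str:
    """Assign an unannotated transcript to one organizational category."""
    hits = [
        g for g in genes
        if g.chrom == tuf.chrom and overlaps(tuf.start, tuf.end, g.start, g.end)
    ]
    if not hits:
        return "intergenic"            # third category, outside gene bounds
    if any(g.strand != tuf.strand for g in hits):
        return "cis_antisense"         # first category (cis only)
    # All remaining hits are on the same strand as the transcript.
    if any(overlaps(tuf.start, tuf.end, s, e) for g in hits for s, e in g.exons):
        return "candidate_isoform"     # second category, shares exonic sequence
    return "intronic"                  # third category, same strand, no exon overlap


gene = Gene("chr1", 1000, 5000, "+", ((1000, 1200), (4000, 5000)))
for tx in (
    Transcript("chr1", 2000, 2500, "-"),
    Transcript("chr1", 900, 1100, "+"),
    Transcript("chr1", 2000, 2500, "+"),
    Transcript("chr1", 6000, 7000, "+"),
):
    print(tx, classify_tuf(tx, [gene]))
```

A transcript labeled "intronic" here overlaps a gene span on the same strand without touching an annotated exon; it may still extend beyond the gene boundary, so a stricter test would be needed to require full containment within an intron.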

Evolutionary conservation of TUFs

Overall, while ∼5% of the human and mouse genomes appear to be under purifying evolutionary selection, and ∼60% of these genomic regions occur outside the boundaries of the well-annotated exons, sequences detected as being part of unannotated transcribed sequences align to only a small percent of these conserved regions (The ENCODE Project Consortium 2007). Kampa et al. (2004) and Bertone et al. (2004) report that ∼20%–24% of the unannotated transfrag and TAR sequences have substantial BLAST alignments with the mouse genome. Thus, the majority of detected unannotated transcribed sequence appears not to be strongly conserved relative to the mouse genome.

This characteristic of reduced evolutionary conservation makes TARs and TUFs unattractive candidates both for functional importance and for categorization as genes under traditional criteria (Snyder and Gerstein 2003). However, given the stated bias toward protein-coding transcripts in the formation of these criteria, it may prove premature to reach such conclusions. First, additional analyses are needed to address whether there is evolutionary conservation not detected using these traditional analysis approaches. One interesting possibility is that these unannotated transcribed sequences exhibit more recent evolutionary change, and thus may be more closely associated with the primate branch of the mammalian lineage. Indeed, a search for sequences most rapidly evolving in the human lineage identified a noncoding RNA with brain-specific expression patterns (Pollard et al. 2006). Second, the types of sequence conservation observed for protein-coding transcripts and for mature miRNA molecules may not be observed in either precursors to these short RNAs or other mature functional noncoding transcripts (e.g., NRSE dsRNA) (Kuwabara et al. 2004). Furthermore, the large noncoding RNA, XIST, is essential for sex chromosome dosage compensation in mammals, and yet exhibits rapid evolution of primary sequences despite an overall conservation of gene structure and organization (Nesterova et al. 2001). Third, noncoding transcripts may adopt secondary structures essential for their function, and these structures may permit certain latitude in primary sequence composition. Computational analysis of RNA structural conservation based on base pairing and thermodynamic stability identified more than 30,000 RNA elements across the human genome, with approximately half mapping outside of known genes (Washietl et al. 2005). Focusing on the approximate third of the human genome not alignable with mouse, a significant number of these nonconserved regions were found to have signatures of RNA structure and, impressively, were twice as likely to overlap a tiling array-detected transfrag (Torarinsson et al. 2006). Lastly, the general lack of evolutionary conservation for TUFs may be explained if the TUFs represent larger precursor transcripts that are post-transcriptionally processed to produce short RNAs, which themselves do have a higher degree of conservation, as noted by Ponjavic et al. (2007). For example, mature miRNA sequences can be quite conserved across the animal kingdom, and yet their longer precursor sequences often lack significant conservation. Indeed, this observation has recently been extended to a large number of entirely new classes of short RNAs that are overlapped by nuclear TUFs, raising the possibility that some distinct proportion of unannotated nuclear transcription could serve as precursors for short RNA species (Kapranov et al. 2007a).

A collective network of transcripts and other regulatory elements results in a phenotype

The finding that noncoding transcripts are an expanding class of biologically important molecules has been discussed by many authors (for reviews, see Eddy 2001, 2002; Mattick 2001, 2004, 2005; Mattick and Gagen 2001; Huttenhofer et al. 2002, 2005; Szymanski and Barciszewski 2002; Morey and Avner 2004). However, it is recognized that not all of the newly discovered transcripts are likely to be biologically important. Thus, additional independent empirical evidence is required to support their biological relevance. Traditionally, support for functionality is derived from genetic and biochemical experiments that demonstrate a measurable phenotype associated with the investigated RNAs. These experiments, however, require that a measurable phenotype be observable. This has not always been straightforward even for protein-coding genes. Of the 96% of open reading frames in yeast that were mutated by gene deletion and assayed under six growth conditions, <7% were required for growth (Giaever et al. 2002). Similarly, only 8.9% of the predicted genes in worm have a detectable phenotype after RNAi inhibition (Kamath et al. 2003). Thus, identifying phenotypic responses for the newly identified TUFs will likely be challenging and certainly time consuming, as it has been for the vast majority of protein-coding transcripts.

It is likely that some of the TUFs and noncoding RNAs that have recently been identified will be members of the already identified classes of functional noncoding transcripts such as RNAi and snoRNAs. Yet, other TUFs and noncoding RNA transcripts will likely be involved in additional biological processes for which RNAs have been shown to be important components, such as genomic imprinting (Sleutels et al. 2002; Takada et al. 2002), regulation of transcription, DNA replication, RNA stability, processing, and translation (Storz 2002; Willingham and Gingeras 2006; Prasanth and Spector 2007). Some TUFs may simply be products of transcription and regulatory processes, with the RNAs themselves having little or no direct inherent functional value and the biologically important function residing in the transcriptional process itself. Finally, since RNA is well suited to the recognition of other nucleic acids by base pairing and to interacting with cellular protein components by virtue of its folding capabilities, some of the identified TUFs and noncoding RNAs are likely to be involved in processes not currently associated with RNA transcripts.

It has been proposed that this additional layer of complexity embodied by the intricate network of noncoding transcripts within a cell provides two important higher order functional capabilities to genomes (Mattick 2001, 2004, 2005; Szymanski et al. 2003). The first functionality provides a means to increase the informational and operative capabilities of genomes, while the number of protein-coding genes remains relatively similar across evolutionary distances. Protein diversity can be substantially increased using multiple splice isoforms as well as using chimeric gene fusions (discussed in Kapranov et al. 2007b). Indeed, such “tandem-chimerism” and gene fusion have been proposed as a common cellular mechanism for increasing protein diversity (Akiva et al. 2006; Parra et al. 2006). The second functionality is to contribute to RNA-based mechanisms (discussed above) that carry out many of the regulatory processes required for the increased capabilities of higher organisms and to communicate the status of these regulated processes.

As described above, several classes of noncoding transcripts not only physically interact with protein-coding transcripts and their protein products, but are also organizationally embedded within or proximal to protein-coding transcripts (Fig. 2). This has served not only to blur the physical boundaries of genes, but also to increase the complexity of determining which sequences in a gene serve which functions. The abundant presence of cis-antisense transcripts, for example, allows for the same nucleotides present in a protein-coding transcript to be part of a noncoding transcript, which, in turn, may play a role in the regulation of the same (or another) protein-coding transcript. The recently reported whole-genome transcript mapping study of both long and short RNAs and their inter-relationship lends strong support to a model of gene organization that is decidedly not colinear (Kapranov et al. 2007a). Hundreds of thousands of new short RNA species were discovered, and a significant class of promoter-associated short RNAs was found to correlate with expression of the associated long mRNAs (Kapranov et al. 2007a). Thus, in light of this overlapping interleaved network of protein-coding and noncoding transcripts, it seems appropriate to reconsider the concept of a gene in describing the relationship of a portion of a genome to a phenotype.

Figure 2.

Transcriptional complexity of a gene. Hypothetical gene cluster with detailed zoom-in for highlighted gene demonstrates that a single gene can have multiple transcriptional start sites (TSSs) as well as many interleaved coding and noncoding transcripts. Exons are shown as red boxes and TSSs are green right-angled arrows. Known short RNAs such as snoRNAs and miRNAs can be processed from intronic sequences and novel species of short RNAs that cluster around the beginning and ends of genes have recently been discovered (see text).

What is a gene and are transcripts fundamental operational units?

The current definition of a gene (as defined by the HUGO Gene Nomenclature Committee) is a DNA segment that contributes to phenotype/function, and in the absence of demonstrated phenotype/function, a gene may be characterized by sequence, transcription, or homology (Wain et al. 2002). Accordingly, this definition would arguably include the DNA regions that regulate the “contribution” leading to the phenotype/function. Inclusion of regulatory regions along with the entire transcribed regions (intronic and exonic) is appropriate given that the levels of transcription and the efficiency of transcript processing (both examples of a contribution) often influence the phenotypes/functions. Proximal and distal regulatory elements such as promoters, enhancers, and insulators would therefore be considered parts of a gene under such a definition. Thus, defining the functional components for any gene could include many clustered and dispersed portions of a genome. Additionally, the concept of relating a DNA region to a corresponding phenotype/function is further complicated by multiple transcripts that use the same sequence space on the same and opposite strands, that are often each controlled by their own distinct regulatory regions, and that may extend the boundaries of protein-coding transcripts (Fig. 2). If each of the transcripts sharing sequence space with a protein-coding gene is capable of effecting the same phenotype/function, then a gene can consist of multiple (coding and noncoding) transcripts and regulatory regions (Fig. 1D). This increased complexity of both the components of a gene and its boundaries begs for a simpler operational unit that can be used to link a specific DNA sequence to phenotype/function. Individual RNA transcripts provide these fundamental operational elements.

Using transcripts as the fundamental operational element for linking discrete genomic sequences to specific phenotypes/functions has two advantages. First, it allows for the straightforward cataloging and identification of the single or multiple RNAs that influence the same phenotype. Second, it separates the operational components that contribute to phenotypes/functions from other genomic elements that directly contribute to phenotypes/functions, but whose influences may be subtle and/or whose location may be very distal from the site of transcription.
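One way to picture a transcript-centric catalog is the minimal sketch below. It is illustrative only: the class, field, and function names are hypothetical, the coordinates and phenotype labels are invented toy data, and nothing here reflects an existing annotation format or database schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """One RNA transcript treated as the operational unit (hypothetical fields)."""
    name: str
    chrom: str
    start: int              # 0-based, inclusive
    end: int                # exclusive
    strand: str             # "+" or "-"
    coding: bool            # True if protein coding
    phenotypes: frozenset = frozenset()


def shares_sequence_space(a: Transcript, b: Transcript) -> bool:
    """True if the two transcripts overlap on the same chromosome, on either strand."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def is_cis_antisense(a: Transcript, b: Transcript) -> bool:
    """True if the transcripts overlap and lie on opposite strands."""
    return shares_sequence_space(a, b) and a.strand != b.strand


def catalog_by_phenotype(transcripts):
    """Group transcript names by each phenotype they are annotated as influencing."""
    catalog = defaultdict(list)
    for t in transcripts:
        for p in sorted(t.phenotypes):
            catalog[p].append(t.name)
    return dict(catalog)


# Invented toy data: an mRNA, an overlapping antisense TUF, and an intergenic TUF.
mrna = Transcript("mRNA_1", "chr1", 1000, 5000, "+", True, frozenset({"phenotype_A"}))
antisense = Transcript("TUF_as", "chr1", 4000, 7000, "-", False, frozenset({"phenotype_A"}))
intergenic = Transcript("TUF_ig", "chr1", 9000, 9500, "+", False, frozenset({"phenotype_B"}))

catalog = catalog_by_phenotype([mrna, antisense, intergenic])
```

In this toy catalog, a coding and a noncoding transcript that share sequence space are both listed under the same phenotype, without first having to decide where the boundaries of a "gene" lie.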

Clearly, our understanding of the complexity of how information in genomes is organized, regulated, and expressed has grown in recent years. The identification of an abundant collection of polyadenylated and nonpolyadenylated transcripts with highly reduced protein-coding potential, which are found in many cell types from many organisms, together with the elucidation of the complex relationship of these transcripts to the protein-coding transcripts, exemplifies this increased complexity. Correspondingly, if the biological relevance of the bulk of these novel transcripts continues to be confirmed by subsequent experiments, this increased complexity most certainly will necessitate a reconsideration of the definition of a gene and require the use of an alternative term to help define the fundamental operational unit that relates genomic sequences to phenotypes/functions.

Acknowledgments

A special thanks to A. Willingham who updated and assisted in restructuring this manuscript from an earlier written version and for the generation of the figures, as well as to K. Kong, P. Kapranov, and R. Duttagupta for helpful literature, organization editing suggestions, and helpful discussions. This work has been funded in part with Federal Funds from the National Cancer Institute, National Institutes of Health under Contract No. N01-CO-12400, the National Human Genome Research Institute under Grant number U01 HG003147, and from Affymetrix, Inc.

Footnotes

  • E-mail tom_gingeras{at}affymetrix.com; fax (408) 481-0422.

  • Article is online at http://www.genome.org/cgi/doi/10.1101/gr.6525007

  • Freely available online through the Genome Research Open Access option.
