The DART classification of unannotated transcription within the ENCODE regions: Associating transcription with known and novel loci
- Joel S. Rozowsky1,8,
- Daniel Newburger1,
- Fred Sayward2,
- Jiaqian Wu3,
- Greg Jordan1,
- Jan O. Korbel1,
- Ugrappa Nagalakshmi3,
- Jin Yang2,
- Deyou Zheng1,
- Roderic Guigó4,
- Thomas R. Gingeras5,
- Sherman Weissman6,
- Perry Miller2,7,
- Michael Snyder3, and
- Mark B. Gerstein1,7,8
- 1 Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA;
- 2 Center for Medical Informatics, Yale University, New Haven, Connecticut 06520-8009, USA;
- 3 Molecular, Cellular, and Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA;
- 4 Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 37-49, 08003, Barcelona, Catalonia, Spain;
- 5 Affymetrix, Inc., Santa Clara, California, 92024, USA;
- 6 Department of Genetics, Yale University, New Haven, Connecticut 06520, USA;
- 7 Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
Abstract
For the ∼1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of “unannotated transcription.” We use a number of disparate features to classify the 6988 novel TARs—array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that ∼14% of the novel TARs can be associated with known genes, while ∼21% can be clustered into ∼200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs and many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
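The association step described in the abstract (linking a novel TAR to proximal exons with correlated expression profiles) can be sketched as follows. This is only an illustration, not the authors' implementation: the function names, the dictionary layout, and the toy data are hypothetical, and the 20 kb window and 0.9 correlation cutoff are assumed defaults borrowed from the footnote on P-value estimation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def associate(tar, exons, max_dist=20_000, min_r=0.9):
    """Names of exons near `tar` whose profiles correlate with it.

    max_dist and min_r are assumed values for this sketch.
    Coordinates are treated as zero-based, half-open intervals.
    """
    hits = []
    for exon in exons:
        if exon["chrom"] != tar["chrom"]:
            continue
        # Gap between the two intervals; 0 if they overlap or touch.
        gap = max(exon["start"] - tar["end"], tar["start"] - exon["end"], 0)
        if gap <= max_dist and pearson(tar["profile"], exon["profile"]) >= min_r:
            hits.append(exon["name"])
    return hits


# Toy data: 11 expression values per region (hypothetical numbers).
base = [1, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10]
tar = {"chrom": "chr1", "start": 50_000, "end": 50_200, "profile": base}
exons = [
    {"name": "near_correlated", "chrom": "chr1", "start": 55_000,
     "end": 55_300, "profile": [2 * v + 1 for v in base]},
    {"name": "near_anticorrelated", "chrom": "chr1", "start": 45_000,
     "end": 45_300, "profile": [-v for v in base]},
    {"name": "far_correlated", "chrom": "chr1", "start": 200_000,
     "end": 200_300, "profile": base},
]
matched = associate(tar, exons)
print(matched)
```

Only the exon that is both within the window and positively correlated is retained; TARs with no such exon would remain candidates for the later clustering step.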
Footnotes
8 Corresponding authors.
E-mail joel.rozowsky{at}yale.edu; fax (203) 432-5175.
E-mail mark.gerstein{at}yale.edu; fax (360) 838-7861.
[Supplemental material is available online at www.genome.org.]
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5696007
9 This estimated P-value of <0.05 accounts for multiple testing: the expression profile of a novel TAR is compared with, on average, 19 known exons within 20 kb. The P-value for obtaining a Pearson correlation of 0.9 between two 11-dimensional vectors is <10⁻³.
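The arithmetic behind this footnote can be checked with a short calculation. The sketch below is not the authors' procedure: it assumes the two profiles are independent and normally distributed under the null hypothesis, uses the standard null density of the sample correlation, f(r) ∝ (1 − r²)^((n−4)/2), and applies a Bonferroni-style correction for 19 comparisons. The function name is hypothetical, and the footnote does not say which correction was actually used.

```python
from math import gamma


def pearson_null_pvalue(r, n, steps=10_000):
    """Two-sided P(|R| >= r) for the sample Pearson correlation of n points,
    assuming independent, normally distributed data (an assumption of this
    sketch). Integrates the null density with Simpson's rule."""
    if n < 4 or not 0.0 <= r <= 1.0:
        raise ValueError("need n >= 4 and 0 <= r <= 1")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    expo = (n - 4) / 2
    # Normalizing constant: Beta(1/2, (n - 2) / 2)
    norm = gamma(0.5) * gamma((n - 2) / 2) / gamma((n - 1) / 2)

    def density(x):
        return max(0.0, 1.0 - x * x) ** expo

    h = (1.0 - r) / steps
    total = density(r) + density(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(r + i * h)
    upper_tail = total * h / 3 / norm
    return 2 * upper_tail  # the null density is symmetric about zero


p_single = pearson_null_pvalue(0.9, 11)
p_corrected = min(1.0, 19 * p_single)  # Bonferroni over 19 exons (assumed)
print(f"single test: {p_single:.2e}, corrected: {p_corrected:.2e}")
```

Under these assumptions the single-test value falls below 10⁻³ and the corrected value stays below 0.05, consistent with both figures quoted in the footnote.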
10 There are 828 putative composite promoters on the list from The ENCODE Project Consortium (2007), which is a set of both known and predicted promoters. Promoters were predicted using multiple ChIP-chip data sets for promoter specific transcription factors and modifications. This set of promoters is available at DART.gersteinlab.org.
11 ARC accepts files in Browser Extensible Data (BED) format and files containing inclusive intervals. The BED format uses a zero-based, half-open coordinate system. It was developed for the UCSC Genome Browser and is described fully at http://genome.ucsc.edu/FAQ/FAQformat#format1. The inclusive intervals option accepts one-based, closed coordinates as used by Ensembl.
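The difference between the two coordinate conventions amounts to shifting the start position by one. A minimal sketch of the conversion follows; the function names are hypothetical and are not part of ARC, and zero-length features are deliberately left out because a closed interval cannot represent them.

```python
def bed_to_inclusive(start, end):
    """Zero-based, half-open [start, end) -> one-based, closed [start, end]."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end for a BED interval")
    return start + 1, end


def inclusive_to_bed(start, end):
    """One-based, closed [start, end] -> zero-based, half-open [start, end)."""
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end for an inclusive interval")
    return start - 1, end


def bed_length(start, end):
    """Number of bases covered by a half-open interval."""
    return end - start


def inclusive_length(start, end):
    """Number of bases covered by a closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
bed = (0, 100)
inclusive = bed_to_inclusive(*bed)
print(bed, "->", inclusive)  # (0, 100) -> (1, 100)
```

Only the start coordinate changes, so the covered length is the same in both systems even though the length formulas differ by one.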
- Received June 26, 2006.
- Accepted November 22, 2006.
Freely available online through the Genome Research Open Access option.
- Copyright © 2007, Cold Spring Harbor Laboratory Press