Numerous Novel Annotations of the Human Genome Sequence Supported by a 5′-End–Enriched cDNA Collection

Abstract

A collection of 90,000 human cDNA clones, generated to increase the fraction of “full-length” cDNAs available, was analyzed by sequence alignment to the human genome assembly. Using this collection, we defined 552 gene models not found in LocusLink, each with a coding region of at least 300 bp. The exon composition proposed for these novel genes showed an average of 4.7 exons per gene. In 20% of the cases, at least half of the exons predicted for new genes coincided with evolutionarily conserved regions defined by sequence comparisons with the pufferfish Tetraodon nigroviridis. Within this subset, CpG islands were observed at the 5′ end of 75% of the genes. In-frame stop codons upstream of the initiator ATG were present in 49% of the new genes, and 16% contained a coding region comprising at least 50% of the cDNA sequence. This cDNA resource also provided candidate small protein-coding genes, which are usually not included in genome annotations. In addition, analysis of a sample from this cDNA collection indicates that ∼380 gene models described in LocusLink could be extended at their 5′ end by at least one new exon. Finally, this cDNA resource provided experimental support for annotations based exclusively on predictions, and thus represents a resource that substantially improves the human genome annotation.
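Two of the criteria quoted above, an in-frame stop codon upstream of the initiator ATG and the fraction of the cDNA occupied by the coding region, can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the 0-based coordinates, and the assumption that the initiator ATG position is already known are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions,
# not taken from the study.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(cdna: str, atg_pos: int) -> bool:
    """Return True if a stop codon occurs upstream of the ATG at atg_pos
    (0-based) in the same reading frame as that ATG."""
    seq = cdna.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG")
    # Step backwards one codon at a time, staying in the frame of the ATG.
    for i in range(atg_pos - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


def coding_fraction(cdna_len: int, cds_start: int, cds_end: int) -> float:
    """Fraction of the cDNA covered by the coding region, where the coding
    region spans [cds_start, cds_end) in 0-based coordinates."""
    return (cds_end - cds_start) / cdna_len


# TAA at position 0 is in frame with the ATG at position 6.
in_frame = has_upstream_inframe_stop("TAAGCCATGAAATGA", 6)
# Here the upstream TAA is shifted by one base, so it is out of frame.
out_of_frame = has_upstream_inframe_stop("CTAAGCATGAAATGA", 6)
# A 500-bp coding region in a 1,000-bp cDNA covers half of the sequence.
half = coding_fraction(1000, 100, 600)
```

An upstream in-frame stop is commonly read as evidence that the open reading frame does not extend further toward the 5′ end, which is why it is a useful check on a 5′-end–enriched clone collection.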

Footnotes

  • [The sequence data from this study have been submitted to EMBL under accession nos. BX323813, BX323814, BX324295–BX465182, AL513551–AL583711.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1481104. Article published online before print in February 2004.

  • 5 Corresponding author. E-MAIL betina{at}genoscope.cns.fr; FAX 33-1-60-87-25-14.

  • 2 Present address: LGI-BioInformatic, Aventis Pharma S.A., 94400, Vitry-Sur-Seine, France.

  • 3 Present address: European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

  • 4 Present address: Genomining, 92120, Montrouge, France.

    • Accepted December 2, 2003.
    • Received April 30, 2003.
