De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: Performance of Markov-dependent genome feature scoring

Pontus Larsson (1), Andrea Hinas (2,4), David H. Ardell (3,5,6), Leif A. Kirsebom (1), Anders Virtanen (1), and Fredrik Söderbom (2,6)

  1. Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, SE-75124 Uppsala, Sweden;
  2. Department of Molecular Biology, Biomedical Center, Swedish University of Agricultural Sciences, SE-75124 Uppsala, Sweden;
  3. Linnaeus Centre for Bioinformatics, Biomedical Center, SE-751 24 Uppsala, Sweden

Abstract

Genome data are increasingly important in the computational identification of novel regulatory non-coding RNAs (ncRNAs). However, most ncRNA gene-finders are either specialized to well-characterized ncRNA gene families or require comparisons of closely related genomes. We developed a method for de novo screening for ncRNA genes with a nucleotide composition that stands out against the background genome based on a partial sum process. We compared the performance when assuming independent and first-order Markov-dependent nucleotides, respectively, and used Karlin-Altschul and Karlin-Dembo statistics to evaluate the significance of hits. We hypothesized that a first-order Markov-dependent process might have better power to detect ncRNA genes since nearest-neighbor models have been shown to be successful in predicting RNA structures. A model based on a first-order partial sum process (analyzing overlapping dinucleotides) had better sensitivity and specificity than a zeroth-order model when applied to the AT-rich genome of the amoeba Dictyostelium discoideum. In this genome, we detected 94% of previously known ncRNA genes (at this sensitivity, the false positive rate was estimated to be 25% in a simulated background). The predictions were further refined by clustering candidate genes according to sequence similarity and/or searching for an ncRNA-associated upstream element. We experimentally verified six out of 10 tested ncRNA gene predictions. We conclude that higher-order models, in combination with other information, are useful for identification of novel ncRNA gene families in single-genome analysis of D. discoideum. Our generalizable approach extends the range of genomic data that can be searched for novel ncRNA genes using well-grounded statistical methods.
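The core idea of scoring overlapping dinucleotides with a partial sum process can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pseudocount, the log-odds scoring scheme, and the toy training sequences are all assumptions made for illustration. Each dinucleotide is scored by the log ratio of its first-order Markov transition probability under a hypothetical "ncRNA-like" model versus a background model, and the highest-scoring segment is found by resetting the running sum whenever it drops to zero or below. The significance evaluation with Karlin-Altschul and Karlin-Dembo statistics is not covered here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def transition_probs(seq, pseudocount=1.0):
    """Estimate first-order Markov transition probabilities P(b | a).

    The pseudocount (an assumption of this sketch) keeps every
    transition probability strictly positive.
    """
    counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[(a, b)] for b in ALPHABET)
        row_total += pseudocount * len(ALPHABET)
        for b in ALPHABET:
            probs[(a, b)] = (counts[(a, b)] + pseudocount) / row_total
    return probs


def log_odds_scores(target_probs, background_probs):
    """Log-likelihood-ratio score for each dinucleotide (a, b)."""
    return {
        pair: math.log(target_probs[pair] / background_probs[pair])
        for pair in target_probs
    }


def max_scoring_segment(seq, scores):
    """Return (score, start, end) of the best segment, i.e. seq[start:end].

    The running partial sum over overlapping dinucleotides is reset to
    zero whenever it becomes non-positive; the maximum excursion seen
    so far is kept as the best segment.
    """
    best = (0.0, 0, 0)
    running = 0.0
    start = 0
    for i in range(len(seq) - 1):
        running += scores.get((seq[i], seq[i + 1]), 0.0)
        if running <= 0.0:
            running = 0.0
            start = i + 1
        elif running > best[0]:
            # Dinucleotide i covers bases i and i + 1, hence end = i + 2.
            best = (running, start, i + 2)
    return best


# Toy data (hypothetical): an AT-rich background and a GC-rich "signal".
background = "ATATTTAATA" * 20
signal = "GCGGCCGCGC" * 20
scores = log_odds_scores(transition_probs(signal), transition_probs(background))

genome = "ATATTTAATA" * 3 + "GCGGCCGCGC" * 2 + "ATATTTAATA" * 3
score, start, end = max_scoring_segment(genome, scores)
print(f"best segment: {start}-{end}, score {score:.2f}")
```

In this toy example the GC-rich insert occupies positions 30 to 50 of `genome`, and the reported segment should roughly coincide with it. A zeroth-order variant would score single nucleotides instead of overlapping pairs; the reset-at-zero scan stays the same.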

Footnotes

  • 4 Present addresses: Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Room 3050, Cambridge, MA 02138, USA;

  • 5 School of Natural Sciences, University of California, Merced, CA 95344, USA.

  • 6 Corresponding authors.

    6 E-mail dardell{at}ucmerced.edu; fax (209) 228-4060.

    6 E-mail fredde{at}xray.bmc.uu.se; fax 46-18-536971.

  • [Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to GenBank under accession nos. EF551319 and EF551320.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.069104.107.

    • Received July 14, 2007.
    • Accepted March 11, 2008.