Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes

  1. Mark Borodovsky2,3
  1. 1 Georgia Tech;
  2. 2 Georgia Tech
  • * Corresponding author; email: borodovsky{at}gatech.edu
  • Abstract

    In the conventional view of prokaryotic genome organization, promoters precede operons, and ribosome binding sites (RBSs) with the Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse picture motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training to find species-specific (native) genes, along with an array of pre-computed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS-2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic of leaderless transcription as well as non-canonical RBS patterns. To assess the accuracy of GeneMarkS-2, we used genes validated by COG annotation, proteomics experiments, and N-terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, screening ~5,000 representative prokaryotic genomes with GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBSs in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. Modeling the different types of sequence motifs that regulate gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.

    • Received September 29, 2017.
    • Accepted May 16, 2018.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    ACCEPTED MANUSCRIPT
