Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes
Abstract
In a conventional view of the prokaryotic genome organization, promoters precede operons and RBS sites with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS 2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of pre-computed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred). Importantly, we designed GeneMarkS 2 to identify several types of distinct sequence patterns (signals) involved in gene expression control, among them the patterns characteristic for leaderless transcription as well as non-canonical RBS patterns. To assess the accuracy of GeneMarkS 2 we used genes validated by COG annotation, proteomics experiments, and N terminal protein sequencing. We observed that GeneMarkS-2 performed better on average in all accuracy measures when compared with the current state-of-the-art gene prediction tools. Furthermore, the screening of ~5,000 representative prokaryotic genomes made by GeneMarkS-2 predicted frequent leaderless transcription in both archaea and bacteria. We also observed that the RBS sites in some species with leadered transcription did not necessarily exhibit the Shine-Dalgarno consensus. The modeling of different types of sequence motifs regulating gene expression prompted a division of prokaryotic genomes into five categories with distinct sequence patterns around the gene starts.
- Received September 29, 2017.
- Accepted May 16, 2018.
- Published by Cold Spring Harbor Laboratory Press
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.











