BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA

  1. Mario Stanke (1,2,6)
  1. Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany;
  2. Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany;
  3. U.S. Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA;
  4. Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA;
  5. School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  6. These authors contributed equally to this work.

  • Corresponding authors: katharina.hoff@uni-greifswald.de, alexandre.lomsadze@bme.gatech.edu
  • Abstract

    Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP, which integrates all three data types (genomic sequence, RNA-seq, and protein data). Here we present the BRAKER3 pipeline, which builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2: the transcript-level F1-score increases by about 20 percentage points on average, and the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools: MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
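
    The transcript-level F1-score reported above can be illustrated with a minimal sketch. It assumes that F1 is the harmonic mean of sensitivity and precision, and that a predicted transcript counts as correct only if its set of coding exons matches a reference transcript exactly. The function names and the tuple layout for exons are hypothetical and are not taken from the BRAKER3 code base.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical representation: one coding exon as (seqid, strand, start, end).
Exon = Tuple[str, str, int, int]
# A transcript is the set of its coding exons (assumption for this sketch).
Transcript = FrozenSet[Exon]


def f1_score(true_positives: int, n_predicted: int, n_reference: int) -> float:
    """Harmonic mean of sensitivity and precision, as a fraction in [0, 1]."""
    if true_positives == 0 or n_predicted == 0 or n_reference == 0:
        return 0.0
    sensitivity = true_positives / n_reference  # share of reference transcripts found
    precision = true_positives / n_predicted    # share of predictions that are correct
    return 2 * sensitivity * precision / (sensitivity + precision)


def transcript_f1(predicted: Iterable[Transcript],
                  reference: Iterable[Transcript]) -> float:
    """Transcript-level F1 under exact-match counting (assumed definition)."""
    pred_set = set(predicted)
    ref_set = set(reference)
    exact_matches = len(pred_set & ref_set)
    return f1_score(exact_matches, len(pred_set), len(ref_set))
```

    For example, with 4 reference transcripts and 5 predicted transcripts of which 3 match exactly, sensitivity is 0.75, precision is 0.60, and F1 is about 0.67. An increase of 20 percentage points would correspond to a change such as 0.47 to 0.67 on this scale (illustrative numbers, not results from the article).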

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.278090.123.

    • Freely available online through the Genome Research Open Access option.

    • Received June 10, 2023.
    • Accepted February 28, 2024.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
