Ultrafast functional profiling of RNA-seq data for nonmodel organisms

  1. Jianguo Xia1,2
  1. 1Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada;
  2. 2Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada;
  3. 3Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada;
  4. 4Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, Ontario K1A 0H3, Canada
  • Corresponding author: jeff.xia{at}mcgill.ca
  • Abstract

    Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.269894.120.

    • Freely available online through the Genome Research Open Access option.

    • Received August 5, 2020.
    • Accepted February 18, 2021.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server