DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools

  1. Mitchell R. Vollger2
  1. Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA;
  2. Division of Medical Genetics, University of Washington, Seattle, Washington 98195, USA;
  3. Department of Statistics, University of Washington, Seattle, Washington 98195, USA;
  4. Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA;
  5. Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA;
  6. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA;
  7. Brotman Baty Institute for Precision Medicine, Seattle, Washington 98195, USA
  8. These authors contributed equally to this work.

  • Corresponding authors: absterga{at}uw.edu, mvollger{at}uw.edu
  • Abstract

    Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation and the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using Pacific Biosciences (PacBio) single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either the PacBio or Oxford Nanopore Technologies (ONT) sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kb long DNA molecules with an ∼1000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.
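
The conversion between molecular (read) and reference coordinate systems mentioned above can be illustrated with a minimal sketch. It is not the fibertools implementation. It assumes only the standard SAM CIGAR semantics, and the function name `lift_to_reference` and its parameters are hypothetical names chosen for illustration. Given an alignment start and a CIGAR string, it maps 0-based positions on the read (for example, m6A calls) to 0-based reference positions, returning `None` for positions with no reference counterpart.

```python
import re

# SAM CIGAR operations that advance the read (query) and/or the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def lift_to_reference(ref_start, cigar, query_positions):
    """Map 0-based read positions to 0-based reference positions.

    Hypothetical helper for illustration only. Read positions are taken
    in the orientation in which the alignment is stored. Positions inside
    insertions or soft clips have no reference coordinate and map to None.
    """
    wanted = set(query_positions)
    lifted = {}
    q, r = 0, ref_start  # current offsets on the read and on the reference
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        in_q, in_r = op in CONSUMES_QUERY, op in CONSUMES_REF
        if in_q:
            for pos in wanted:
                if q <= pos < q + length:
                    # Aligned blocks carry a reference coordinate;
                    # insertions and soft clips do not.
                    lifted[pos] = r + (pos - q) if in_r else None
            q += length
        if in_r:
            r += length
    return [lifted.get(p) for p in query_positions]


# Read aligned at reference position 100 with a soft clip, an insertion,
# and a deletion.
print(lift_to_reference(100, "5S10M2I3D10M", [0, 5, 14, 15, 17, 26]))
# → [None, 100, 109, None, 113, 122]
```

Positions in the soft clip (0) and in the insertion (15) have no reference coordinate, whereas the position after the 3 bp deletion (17) is shifted by the deleted bases. This is why per-molecule coordinates cannot simply be offset by the alignment start in structurally variable regions.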

    Footnotes

    • Received February 9, 2024.
    • Accepted May 21, 2024.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
