Long-read reconstruction of many diverse haplotypes with devider

  1. Heng Li¹,²
  1. Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA;
  2. Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02215, USA;
  3. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA;
  4. Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;
  5. Department of Veterinary Population Medicine, University of Minnesota, St. Paul, Minnesota 55421, USA
  • Corresponding author: jshaw{at}ds.dfci.harvard.edu
  • Abstract

    Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. Here, we present devider, an algorithm for haplotyping small sequences, such as viruses or genes, from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Oxford Nanopore Technologies (ONT) long-read data set containing seven HIV strains, devider recovers 97% of the haplotype content and has the most accurate abundance estimates while taking <4 min and 1 GB of memory for >8000× coverage. Benchmarking on synthetic mixtures of antimicrobial-resistance (AMR) genes shows that devider recovers 83% of haplotypes, 23 percentage points higher than the next best method. On real Pacific Biosciences (PacBio) and ONT data sets, devider recapitulates previously known results in seconds, disentangling a bacterial community with more than 10 strains and an HIV-1 coinfection data set. We use devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline-resistance gene with >18,000× coverage and six haplotypes for a CfxA2 beta-lactamase gene. We find clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.
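
    To make the idea of a positional de Bruijn graph over an allele alphabet concrete, the following is a minimal toy sketch, not devider's actual implementation. It assumes each read has already been reduced to a mapping from informative-site index to observed allele; the function names `positional_kmers` and `build_positional_dbg` and the `min_count` filter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def positional_kmers(read, k):
    """Yield (start, alleles) for every run of k adjacent site indices in a read.

    `read` is a dict mapping an informative-site index (0, 1, 2, ...) to the
    allele observed at that site. This representation is an assumption.
    """
    positions = sorted(read)
    for i in range(len(positions) - k + 1):
        window = positions[i:i + k]
        # Keep only windows whose site indices are adjacent (no gaps).
        if window[-1] - window[0] == k - 1:
            yield (window[0], tuple(read[p] for p in window))


def build_positional_dbg(reads, k, min_count=1):
    """Return (nodes, edges) as Counters of positional k-mers and adjacencies.

    A node is (start_site, allele_tuple), so identical allele strings at
    different positions stay distinct. `min_count` is a hypothetical
    threshold for dropping k-mers that are likely sequencing errors.
    """
    nodes, edges = Counter(), Counter()
    for read in reads:
        kmers = list(positional_kmers(read, k))
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            # Link k-mers that start at neighboring sites within the same read.
            if b[0] == a[0] + 1:
                edges[(a, b)] += 1
    keep = {n for n, c in nodes.items() if c >= min_count}
    nodes = Counter({n: c for n, c in nodes.items() if n in keep})
    edges = Counter(
        {e: c for e, c in edges.items() if e[0] in keep and e[1] in keep}
    )
    return nodes, edges
```

    In this toy setting, reads drawn from two haplotypes that differ at every site produce two disjoint paths through the graph, which is the property that lets a path-based method separate similar sequences. Error handling, read-to-graph alignment, and abundance estimation are outside the scope of this sketch.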

    Footnotes

    • Received February 14, 2025.
    • Accepted September 9, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
