Graph-based deep reinforcement learning for haplotype assembly with Ralphi

  Victoria Popic (1,2)

  1. Broad Clinical Labs, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  2. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  3. Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, USA
  4. Department of Mathematics, MIT, Cambridge, Massachusetts 02139, USA

  Corresponding author: vpopic{at}broadinstitute.org
  Abstract

    Haplotype assembly is the problem of reconstructing the combination of alleles on the maternally and paternally inherited chromosome copies. Individual haplotypes are essential to our understanding of how combinations of different variants impact phenotype. In this work, we focus on read-based haplotype assembly of individual diploid genomes, which reconstructs the two haplotypes directly from read alignments at variant loci. We introduce Ralphi, a novel deep reinforcement learning framework for haplotype assembly, which integrates the representational power of deep learning with reinforcement learning to accurately partition read fragments into their respective haplotype sets. To set the reward objective for reinforcement learning, our approach uses the classic reduction of the problem to the maximum fragment cut formulation on fragment graphs, in which nodes correspond to reads and edge weights capture the conflict or agreement of the reads at shared variant sites. We train Ralphi on a diverse data set of fragment graph topologies derived from genomes in the 1000 Genomes Project. We show that Ralphi achieves lower error rates than state-of-the-art methods, at comparable or longer haplotype block lengths, for short and long reads at varying coverage in standard human genome benchmarks.
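To make the fragment graph formulation concrete, the following is a minimal sketch and not Ralphi's implementation. It assumes each read fragment is a mapping from variant site index to observed allele (0 or 1), and it assumes one common weighting convention: an edge weight equals the number of conflicting shared sites minus the number of agreeing shared sites. The function names (`edge_weight`, `build_fragment_graph`, `cut_value`) and the toy fragments are hypothetical. Under this convention, a partition of reads into two haplotype sets scores well when the edges crossing the cut are mostly conflicts.

```python
from itertools import combinations


def edge_weight(frag_a, frag_b):
    """Conflicts minus agreements over sites covered by both fragments.

    Assumed convention for illustration; returns None when the fragments
    share no variant site (no edge).
    """
    shared = frag_a.keys() & frag_b.keys()
    if not shared:
        return None
    conflicts = sum(frag_a[s] != frag_b[s] for s in shared)
    agreements = len(shared) - conflicts
    return conflicts - agreements


def build_fragment_graph(fragments):
    """Return {(i, j): weight} for every pair of fragments sharing a site."""
    edges = {}
    for i, j in combinations(range(len(fragments)), 2):
        w = edge_weight(fragments[i], fragments[j])
        if w is not None:
            edges[(i, j)] = w
    return edges


def cut_value(edges, side):
    """Sum the weights of edges whose endpoints lie on different sides."""
    return sum(w for (i, j), w in edges.items() if side[i] != side[j])


# Toy fragments: {variant site index: observed allele}.
fragments = [
    {0: 0, 1: 0, 2: 1},
    {1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1},
    {2: 0, 3: 0},
]

edges = build_fragment_graph(fragments)
good = cut_value(edges, [0, 0, 1, 1])  # agreeing reads 0 and 1 kept together
bad = cut_value(edges, [0, 1, 0, 1])   # agreeing reads 0 and 1 split apart
print(good, bad)
```

In this toy example, the partition that keeps the two agreeing reads on the same side yields the larger cut value, which is the quantity a maximum fragment cut objective rewards.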

    Footnotes

    • Received February 16, 2025.
    • Accepted October 20, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
