Pangenome-based genome inference using integer programming

  1. Chirag Jain1
  1. 1Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, Karnataka 560012, India;
  2. 2Department of Computer Science, The University of Texas at Dallas, Dallas, Texas 75080, USA;
  3. 3Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  4. 4Center for Digital Medicine, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
  • Corresponding author: chirag{at}iisc.ac.in
  • Abstract

    Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single-nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping method. Our optimization framework identifies a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-hard and develop efficient integer-programming solutions. We benchmark the algorithm using downsampled short-read data sets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs.

    Footnotes

    • Received February 16, 2025.
    • Accepted July 29, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    Articles citing this article

    This Article

    1. Genome Res. © 2025 Chandra et al.; Published by Cold Spring Harbor Laboratory Press

    Article Category

    Share

    Preprint Server