A new compression strategy to reduce the size of nanopore sequencing data

  1. Ira W Deveson6
  1. Garvan Institute of Medical Research, Murdoch Children's Research Institute, UNSW Sydney, University of Peradeniya;
  2. Garvan Institute of Medical Research;
  3. University of Peradeniya;
  4. UNSW Sydney;
  5. UNSW Sydney, Garvan Institute of Medical Research, Murdoch Children's Research Institute;
  6. Garvan Institute of Medical Research, Murdoch Children's Research Institute, UNSW Sydney
  • * Corresponding author; email: hasindu{at}unsw.edu.au
  • Abstract

    Nanopore sequencing is an increasingly central tool for genomics. Despite rapid advances in the field, large data volumes and computational bottlenecks continue to pose major challenges. Here we introduce ex-zd, a new data compression strategy that helps address the large size of raw signal data generated during nanopore experiments. Ex-zd encompasses both a lossless compression method, which modestly outperforms all current methods for nanopore signal data compression, and a 'lossy' method, which can be used to achieve dramatic additional savings. The latter component works by reducing the number of bits used to encode signal data. We show that the three least significant bits in signal data generated on instruments from Oxford Nanopore Technologies (ONT) predominantly encode noise. Their removal reduces file sizes by half without impacting downstream analyses, including basecalling and detection of modified DNA or RNA bases. Ex-zd compression saves hundreds of gigabytes on a single ONT sequencing experiment, thereby increasing the scalability, portability, and accessibility of nanopore sequencing.
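
The bit-reduction idea behind the lossy method can be illustrated with a minimal sketch. This is not the ex-zd implementation: the function names `drop_low_bits` and `restore_scale`, the sample values, and the assumption that raw signal samples are non-negative integers are all hypothetical and chosen only for illustration. The sketch shows that discarding the three least significant bits changes each sample by at most 7 raw units, which is the range the abstract attributes predominantly to noise.

```python
def drop_low_bits(samples, bits=3):
    """Discard the `bits` least significant bits of each integer sample.

    Assumes non-negative integer samples (hypothetical input format).
    """
    return [s >> bits for s in samples]


def restore_scale(truncated, bits=3):
    """Shift samples back to the original scale.

    The discarded bits cannot be recovered; they come back as zeros.
    """
    return [t << bits for t in truncated]


# Made-up raw signal values, for illustration only.
raw = [503, 498, 511, 620, 615]

truncated = drop_low_bits(raw)      # smaller values, fewer bits per sample
approx = restore_scale(truncated)   # same scale as `raw`, low bits zeroed

# Largest per-sample difference introduced by the truncation.
max_error = max(abs(r - a) for r, a in zip(raw, approx))
print(truncated, approx, max_error)
```

With three bits removed, the per-sample error is always smaller than 2^3 = 8 raw units. How the truncated values are then encoded and compressed to reach the reported file-size reduction is a separate step that this sketch does not attempt to reproduce.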

    • Received October 2, 2024.
    • Accepted May 2, 2025.

    This manuscript is Open Access.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International license), as described at http://creativecommons.org/licenses/by/4.0/.

    ACCEPTED MANUSCRIPT


    1. Genome Res. gr.280090.124 Published by Cold Spring Harbor Laboratory Press
