Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2

  1. Ben Langmead1
  1. Johns Hopkins University
  • * Corresponding author; email: langmea{at}cs.jhu.edu
  • Abstract

    A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing dataset. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. While existing tools can easily compare 10,000s of genomes, relevant datasets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. We describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog, but discards use of leading zero count in favor of a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. Dashing 2 is free, open source software.

    • Received January 6, 2023.
    • Accepted June 30, 2023.

    This manuscript is Open Access.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International license), as described at http://creativecommons.org/licenses/by/4.0/.

    OPEN ACCESS ARTICLE
    ACCEPTED MANUSCRIPT

    This Article

    1. Genome Res. gr.277655.123 Published by Cold Spring Harbor Laboratory Press

    Article Category

    ORCID

    Share

    Preprint Server