RT Journal
A1 Joudaki, Amir
A1 Meterez, Alexandru
A1 Mustafa, Harun
A1 Groot Koerkamp, Ragnar
A1 Kahles, André
A1 Rätsch, Gunnar
T1 Aligning distant sequences to graphs using long seed sketches
JF Genome Research
JO Genome Research
YR 2023
FD April 18
DO 10.1101/gr.277659.123
SP gr.277659.123
UL http://genome.cshlp.org/content/early/2023/04/18/gr.277659.123.abstract
AB Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and demonstrate that it yields a better time-accuracy trade-off in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of 25%. For such queries, longer sketch-based seeds yield a 4× increase in recall compared to exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.