k-mer approaches for biodiversity genomics

  1. Kamil S. Jaron3
  1. 1Johns Hopkins University, School of Medicine, Baltimore, Maryland 21205, USA;
  2. 2Centre for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Campus UAB, Cerdanyola del Vallès, 08193 Barcelona, Spain;
  3. 3Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom;
  4. 4Center for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, 0313 Oslo, Norway;
  5. 5University College London, UCL Department of Genetics, Evolution & Environment, London, WC1E 6BT, United Kingdom
  • 6 Present address: Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 114 18 Stockholm, Sweden

  • Corresponding author: kamil.jaron{at}sanger.ac.uk
  • Abstract

    The wide array of currently available genomes displays a wonderful diversity in size, composition, and structure and is quickly expanding thanks to several global biodiversity genomics initiatives. However, sequencing of genomes, even with the latest technologies, can still be challenging for both technical (e.g., small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g., germline-restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years, k-mer-based techniques have become popular to overcome some of these challenges. They are based on the simple process of dividing the analyzed sequences (e.g., raw reads or genomes) into a set of subsequences of length k, called k-mers, and then analyzing the frequency or sequences of those k-mers. Analyses based on k-mers allow for a rapid and intuitive assessment of complex sequencing data sets. Here, we provide a comprehensive review to the theoretical properties and practical applications of k-mers in biodiversity genomics with a special focus on genome modeling.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279452.124.

    • Freely available online through the Genome Research Open Access option.

    • Received April 11, 2024.
    • Accepted January 9, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

    Articles citing this article

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server