Highly accurate assembly polishing with DeepPolisher

  and the Human Pangenome Reference Consortium (3)
  1. UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95060, USA;
  2. Google Incorporated, Mountain View, California 94043, USA;
  4. Center for Genomic Discovery, Mohammed Bin Rashid University, Dubai Health, P.O. Box 505055, UAE;
  5. Dubai Health Genomic Medicine Center, Dubai Health, P.O. Box 505055, UAE;
  6. McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA;
  7. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK;
  8. Center for Applied and Translational Genomics (CATG), Mohammed Bin Rashid University of Medicine and Health Sciences, P.O. Box 505055, Dubai, United Arab Emirates;
  9. Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA;
  10. Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA;
  11. UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA;
  12. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA;
  13. The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY 10065, USA;
  14. Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA;
  15. Human Technopole, 20157 Milan, Italy;
  16. Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
  17. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA;
  18. Canadian Center for Computational Genomics, McGill University, Montréal, QC H3A 0G1, Canada;
  19. Department of Human Genetics, McGill University, Montréal, QC H3A 0G1, Canada;
  20. Victor Phillip Dahdaleh Institute of Genomic Medicine, Montréal, QC H3A 0G1, Canada;
  21. Google LLC, Mountain View, CA 94043, USA;
  22. Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, USA;
  23. University of Florence, Department of Biology, Sesto Fiorentino (FI) 50019, Italy;
  24. Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA;
  25. Arizona State University, Consortium for Science, Policy & Outcomes, Washington, DC 20006, USA;
  26. Center for Digital Medicine, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  27. Department for Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  28. German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research, 40225 Düsseldorf, Germany;
  29. University of Cambridge, Cambridge Biomedical Campus, Cambridge CB2 0QQ, UK;
  30. Wellcome Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK;
  31. Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, 40225 Düsseldorf, Germany;
  32. Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA;
  33. ISEM, Univ Montpellier, CNRS, IRD, 34095 Montpellier, France;
  34. Institut Universitaire de France, 75005 Paris, France;
  35. Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA;
  36. Department of Bioethics & Humanities, University of Washington School of Medicine, Seattle, WA 98195, USA;
  37. Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA;
  38. Department of Anthropology, University of Kansas, Lawrence, KS 66045, USA;
  39. University of Manchester, Manchester M13 9PL, UK;
  40. Traditional, ancestral and unceded territory of the Gabrielino/Tongva peoples, Institute for Society & Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  41. Traditional, ancestral and unceded territory of the Gabrielino/Tongva peoples, Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  42. Traditional, ancestral and unceded territory of the Gabrielino/Tongva peoples, Division of General Internal Medicine & Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  43. Medical and Population Genomics Lab, Sidra Medicine, P.O. Box 26999, Doha, Qatar;
  44. Montreal Heart Institute, Montreal, Quebec H1T 1C8, Canada;
  45. Center for Genomic Health, Yale University School of Medicine, New Haven, CT 06510, USA;
  46. Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, USA;
  47. Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA;
  48. Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA;
  49. Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA;
  50. Sun Yat-sen University, Guangzhou 510275, China;
  51. Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA;
  52. Center for Medical Genomics, Penn State University, University Park, PA 16802, USA;
  53. Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA 98195, USA;
  54. Coriell Institute for Medical Research, Camden, NJ 08103, USA;
  55. Department of Biology, Penn State University, University Park, PA 16802, USA;
  56. Department of Biomedical Science, College of Health Sciences, Qatar University, P.O. Box 2713, Doha, Qatar;
  57. Department of Genetic Medicine, Weill Cornell Medicine-Qatar, P.O. Box 24144, Doha, Qatar;
  58. IRSD - Digestive Health Research Institute, University of Toulouse, INSERM, INRAE, ENVT, UPS, 31024 Toulouse, France;
  59. MATCH biosystems, S.L., 03202 Elche, Spain;
  60. Universidad Miguel Hernández de Elche, 03202 Elche, Spain;
  61. Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan;
  62. University of Pisa, 56126 Pisa, Italy;
  63. Institute of Genetics and Biomedical Research, UoS of Milan, National Research Council, 20133 Milan, Italy;
  64. Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, 00290 Helsinki, Finland;
  65. Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA;
  66. University of Amsterdam, 1012 WX Amsterdam, Netherlands;
  67. GenomeArc Inc., Mississauga, ON L4W 5M1, Canada;
  68. Department of Biology and Biotechnologies “Charles Darwin”, University of Rome “La Sapienza”, 00185 Rome, Italy;
  69. Center for Genomics, Loma Linda University School of Medicine, Loma Linda, CA 92350, USA;
  70. PacBio, Menlo Park, CA 94025, USA;
  71. The first affiliated hospital of Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China;
  Corresponding authors: awcarroll{at}google.com, bpaten{at}ucsc.edu, shafin{at}google.com

  Abstract

    Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
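
The abstract pairs a QV improvement of 3.4 with a 54% error reduction. The short sketch below shows how those two figures relate, assuming QV follows the standard Phred scale (QV = -10 log10 of the per-base error rate). The function names are illustrative only and are not part of DeepPolisher or any HPRC tooling.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Fraction of errors removed when QV rises by delta_qv.

    Assumes Phred scaling: new_rate / old_rate = 10 ** (-delta_qv / 10),
    so the result does not depend on the starting QV.
    """
    return 1 - 10 ** (-delta_qv / 10)


def qv_gain_from_error_reduction(fraction: float) -> float:
    """QV gain that corresponds to removing the given fraction of errors."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return -10 * math.log10(1 - fraction)


if __name__ == "__main__":
    # A 3.4 QV gain, as reported in the abstract.
    print(f"{error_reduction_from_qv_gain(3.4):.0%}")  # prints 54%
    # Halving the error count corresponds to roughly a 3 QV gain.
    print(f"{qv_gain_from_error_reduction(0.5):.2f}")  # prints 3.01
```

Under this assumption, "reducing errors by approximately half" and "a QV gain of roughly 3" are two statements of the same result, which is why the reported 3.4 gain maps to slightly more than a 50% reduction.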

    Footnotes

    • 3 A complete list of the HPRC authors appears at the end of this paper.

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280149.124.

    • Received October 22, 2024.
    • Accepted April 30, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.


    Genome Res. 35: 1595-1608 © 2025 Mastoras et al.; Published by Cold Spring Harbor Laboratory Press
