RT Journal
A1 Mastoras, Mira
A1 Asri, Mobin
A1 Brambrink, Lucas
A1 Hebbar, Prajna
A1 Kolesnikov, Alexey
A1 Cook, Daniel E.
A1 Nattestad, Maria
A1 Lucas, Julian
A1 Won, Taylor S.
A1 Chang, Pi-Chuan
A1 Carroll, Andrew
A1 Paten, Benedict
A1 Shafin, Kishwar
A1 Human Pangenome Reference Consortium
T1 Highly accurate assembly polishing with DeepPolisher
JF Genome Research 
JO Genome Research 
YR 2025 
FD July 01 
VO 35 
IS 7 
SP 1595 
OP 1608 
DO 10.1101/gr.280149.124 
UL http://genome.cshlp.org/content/35/7/1595.abstract 
AB Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.