Balancing Gene Ontology annotation specificity in protein function prediction based on the protein sequence large graph

Abstract

Accurate protein function prediction is fundamental to advancing drug discovery and precision medicine and to understanding complex biological systems. Although Gene Ontology (GO) provides a standardized framework for protein annotation, a critical challenge persists: the imbalance between low-specificity and high-specificity GO terms. This imbalance creates blind spots in our understanding of protein function landscapes, particularly in clinically relevant pathways. Here, we present ProGO-PSL, a novel large graph architecture designed to resolve this imbalance. ProGO-PSL simultaneously leverages explicit domain identifiers from InterPro and implicit evolutionary context from multiple sequence alignments, fusing these complementary data sources within an imbalance learning framework. Our model consistently outperforms state-of-the-art methods by 5%–15% across all specificity levels, on both a benchmark data set and an independent test set, demonstrating robust generalization. Furthermore, ProGO-PSL generates interpretable representations that clarify the relationships between low- and high-specificity GO terms, enabling a more complete functional characterization of the proteome. This work accelerates the identification of therapeutic targets in previously uncharacterized biological pathways.