Resource

Genome-scale phylogenetic function annotation of large and diverse protein families

    • 1 University of Chicago;
    • 2 University of California, Berkeley;
    • 3 Harvard University
Published July 22, 2011. https://doi.org/10.1101/gr.104687.109

Abstract

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a significant revision of the approach (SIFTER version 2.0) that allows annotations to be made on a genomic scale. We confirm that SIFTER 2.0 produces predictions as precise as those of the earlier version of SIFTER on a carefully studied family and on a collection of one hundred protein families with limited functional diversity. We have added an approximation method to SIFTER 2.0, and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, which was previously inaccessible to the SIFTER framework because of its 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (compared with 34.0% for BLAST). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 complete fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these same proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is now capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses.
