Markup | Genome Research

Table 1.

Selected Examples of Prediction Accuracy in Different Areas of Sequence Analysis[i]

Prediction of	Acc × cov[ii]	Accuracy (%)	Coverage (% of reference set)	Reference[iii]
Human promoters	0.35	50	70% of annotated test set	Prestridge (1995); P. Bucher (pers. comm.)
Human regulatory RNA elements	0.34	85	40% of new DNA	Dandekar and Sharma (1998)
Human genes (only presence)	0.49	70	70% of chromosome 22	Dunham et al. (1999) and refs. therein
Human SNPs by EST comparison	0.21	70	30% of all proteins with SNP	Buelow et al. (1999); Sunyaev et al. (2000)
Human alternative splicing	0.45	90	50% of all splice sites	Hanke et al. (1999)
Transmembranes (only presence)	0.85	85	99% of annotated test set	Tusnady and Simon (1998) and refs. therein
Signal peptides (only presence)	0.90	90	100% of annotated test set	Nielsen et al. (1999)
GPI anchors (incl. cleavage site)	0.72	72	100% of annotated test set	Eisenhaber et al. (1999)
Coiled coil (only presence)	0.81	90	90% of annotated coiled coil	Lupas (1996)
Secondary structure (three states)	0.77	77	100% of 3D test set	Jones (1999) and refs. therein
Buried or exposed residues	0.74	74	100% of 3D test set	Rost (1996)
Residue hydration	0.72	72	100% of 3D test set	Ehrlich et al. (1998)
Protein folds (in Mycoplasma)	0.49	98	50% of Mycoplasma ORFs	Teichmann et al. (1999) and refs. therein
Homology (several methods)	0.49	98	50% of 3D test set	Muller et al. (1999) and refs. therein
Functional features by homology	0.63	90	70% of unicellular genomes	Bork and Koonin (1998); Brenner (1999)
Function association by context	0.25	50	10% high confidence in yeast	Marcotte et al. (1999b)
Cellular localization (two states)	0.77	77	100% of annotated test set	Andrade et al. (1998)

[i] The numbers given are in many cases crude estimates, taken or sometimes even extrapolated from the literature, and have an expected accuracy of ∼70%. Direct comparison of the numbers might be misleading, as the context is not properly explained here. Furthermore, although most of the examples are two-state predictions, the percentages do not take into account random occurrences of the states. All test sets are most likely biased (e.g., current 3D test sets do not contain many compositionally biased regions, which probably contain up to 15% of all residues, and annotation test sets are far from perfect; see text); the real accuracy is thus probably lower.

[ii] To make the numbers more comparable, accuracy has been multiplied by coverage; some methods report accuracy for different degrees of coverage, which roughly justifies this procedure. However, the product is often biased toward sensitivity, as specificity cannot be properly taken into account. Most features predicted with an accuracy × coverage >0.70 are of a structural nature and at best only indirectly imply a certain functionality.

[iii] Only one recent reference is given; where indicated, references therein should also be considered, as other reports do not always agree with the numbers given.
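
The accuracy × coverage score described in footnote [ii] is a simple product of the two percentage columns. As a minimal sketch, the following Python snippet recomputes it for four rows of the table; the function name `acc_times_cov` and the choice of rows are illustrative assumptions, not part of the original analysis.

```python
def acc_times_cov(accuracy_pct, coverage_pct):
    """Return accuracy x coverage as a fraction, rounded to two decimals.

    Both arguments are percentages (0-100), as listed in Table 1.
    """
    return round(accuracy_pct * coverage_pct / 10000, 2)


# (feature, accuracy in %, coverage in %), values copied from Table 1
rows = [
    ("Human promoters", 50, 70),
    ("Signal peptides (only presence)", 90, 100),
    ("Coiled coil (only presence)", 90, 90),
    ("Protein folds (in Mycoplasma)", 98, 50),
]

# Map each feature to its combined score
scores = {name: acc_times_cov(acc, cov) for name, acc, cov in rows}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The product rewards only methods that are both accurate and broadly applicable: fold prediction in Mycoplasma reaches 98% accuracy, but because it covers only 50% of the ORFs its combined score stays at 0.49, well below the 0.70 mark discussed in footnote [ii].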