Protein domain embeddings for fast and accurate similarity search

Table 1.

Protein benchmarks for homology/similarity detection

| Benchmark^a | No. of pairs (homologs) | Homolog definition | Example proteins [domain architecture] |
|---|---|---|---|
| pfam-max50 | 10,450 (5228) | Identical domain architecture; <50 aa between domains | Q9VFJ2 [PF03946, PF00298]; P53875 [PF03946, PF00298] |
| pfam-nomax50 | 71,988 (36,278) | Identical domain architecture; no constraint on the number of amino acids between domains | Q15149 [PF03501, CL0188, CL0188, PF00681]; Q9QXS1 [PF03501, CL0188, CL0188, PF00681] |
| pfam-local | 15,273 (7602) | Share some domains, but not all | P40791 [PF00319, PF12347]; Q8VWM8 [PF00319, PF01486] |
| gene3d-nomax50 | 58,163 (29,109) | Same as pfam-nomax50 but based on CATH domains | P52917 [1.20.58.280, 3.40.50.300]; Q9ZNT0 [1.20.58.280, 3.40.50.300] |
| supfam-nomax50 | 49,365 (24,708) | Same as pfam-nomax50 but based on SCOP domains | Q9T0N8 [56176, 55103]; P46681 [56176, 55103] |

  • ^a The benchmark names (pfam-max50, gene3d-nomax50, and so on) indicate the domain database used to define the homologs. The second column lists the total number of pairs in each benchmark, with the number of homologous pairs in parentheses. All benchmarks consist of full-length proteins. The third column gives each benchmark's definition of homology, and the last column shows the domain architectures of an example pair of proteins.
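The homolog definitions in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' benchmark code: the function names, the labels "global", "local", and "other", and the coordinate layout are all assumptions made for illustration. It treats a domain architecture as an ordered list of domain identifiers, and it assumes that the number of amino acids between domains is the gap between the end of one domain and the start of the next. How the benchmarks handle pairs that fail the linker constraint, or pairs with the same domains in a different order, is not stated here, so the sketch leaves both cases as "other" or to the caller.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (domain ID, start, end), 1-based inclusive coordinates.
Domain = Tuple[str, int, int]


def classify_pair(arch_a: Sequence[str], arch_b: Sequence[str]) -> str:
    """Label a protein pair by comparing ordered domain architectures.

    "global": identical architecture (the *-max50 / *-nomax50 style definition)
    "local":  some domains shared, but not all (the pfam-local style definition)
    "other":  anything else (no shared domain, or same domains reordered)
    """
    if list(arch_a) == list(arch_b):
        return "global"
    set_a, set_b = set(arch_a), set(arch_b)
    if set_a & set_b and set_a != set_b:
        return "local"
    return "other"


def max_linker_length(domains: List[Domain]) -> int:
    """Longest stretch of residues between consecutive domains (0 if fewer than two)."""
    ordered = sorted(domains, key=lambda d: d[1])
    gaps = [nxt[1] - cur[2] - 1 for cur, nxt in zip(ordered, ordered[1:])]
    return max(gaps, default=0)


# Architectures taken from the example column of Table 1.
print(classify_pair(["PF03946", "PF00298"], ["PF03946", "PF00298"]))
print(classify_pair(["PF00319", "PF12347"], ["PF00319", "PF01486"]))

# Made-up coordinates (not from the paper) to show the <50 aa check.
toy = [("PF_A", 1, 100), ("PF_B", 131, 200)]
print(max_linker_length(toy) < 50)
```

Under these assumptions, a max50-style benchmark would accept a "global" pair only when `max_linker_length` is below 50 for both proteins, whereas a nomax50-style benchmark would skip that check.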

This Article

  1. Genome Res. 34: 1434-1444
