The ProPhylER dataflow pipeline. (Solid arrows) Automated steps; (dashed arrows) curated steps. (A) Generating clusters of functionally conserved homologs. (A, step i) Single-linkage clusters are built from all-by-all BLAST searches of protein sequences (gray-filled circles) from 13 fully sequenced genomes. Edges of the clusters (lines of varying thickness joining sequences) represent similarity scores between cluster members. (A, step ii) The MinCut routine (line through clusters) separates clusters at their weakest edges (lowest scores). (A, step iii) Manual curation rejoins overcut clusters (dashed circle). (A, step iv) Each eukaryotic sequence in UniProt (black-filled circles) is added to its best-matching cluster. (B) Building alignments and trees for ProPhylER clusters. (B, step i) The initial alignment is built (bars), and sequences containing or creating excessive alignment gaps are flagged for potential removal (underlined numbers). (B, step ii) Manual curation removes any problem sequences. (B, step iii) The remaining cluster sequences are realigned, and a maximum-likelihood phylogenetic tree is built. This gene tree is compared with its corresponding species tree, and each internal node is annotated as either a speciation (white node) or a gene duplication (black node) event. (C) Predictive analyses are generated using the information in cluster alignments and trees, and are displayed with ProPhylER's graphical user interfaces.
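
The clustering idea in panel A, steps i and ii, can be illustrated with a minimal Python sketch. It is not ProPhylER's implementation: the sequence names, scores, and the `threshold` cutoff are hypothetical, and `weakest_edge` only reports the lowest-scoring edge inside a cluster, which is a simplification of a true minimum-cut routine (one that minimizes the total weight of the edges removed).

```python
from collections import defaultdict


def single_linkage_clusters(scores, threshold):
    """Group sequences into single-linkage clusters.

    scores: dict mapping (seq_a, seq_b) -> similarity score (hypothetical values).
    threshold: minimum score for an edge to link two sequences (assumed cutoff).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen sequences.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        # Single linkage: one sufficiently strong edge merges two clusters.
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for seq in list(parent):
        clusters[find(seq)].add(seq)
    return sorted(sorted(members) for members in clusters.values())


def weakest_edge(scores, cluster):
    """Return the lowest-scoring edge whose endpoints both lie in the cluster."""
    members = set(cluster)
    internal = {
        pair: score
        for pair, score in scores.items()
        if pair[0] in members and pair[1] in members
    }
    return min(internal, key=internal.get) if internal else None


# Hypothetical pairwise scores standing in for all-by-all BLAST results.
example_scores = {
    ("s1", "s2"): 250.0,
    ("s2", "s3"): 40.0,
    ("s3", "s4"): 310.0,
    ("s5", "s6"): 12.0,
}
example_clusters = single_linkage_clusters(example_scores, threshold=30.0)
candidate_cut = weakest_edge(example_scores, example_clusters[0])
```

With these made-up scores, `s1` through `s4` fall into one cluster because a chain of edges above the cutoff connects them, and the weak `s2`-`s3` edge is the candidate cut point, the kind of split that the manual curation in step iii could later reverse if it proved too aggressive.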
