TY  - JOUR
A1  - Markova-Raina, Penka
A1  - Petrov, Dmitri
T1  - High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes
Y1  - 2011/03/10 
JF  - Genome Research 
JO  - Genome Research 
DO  - 10.1101/gr.115949.110 
SP  - gr.115949.110 
UR  - http://genome.cshlp.org/content/early/2011/03/10/gr.115949.110.abstract 
N2  - We investigate the effect of aligner choice on inferences of positive selection and rates of protein evolution in the 12 Drosophila genomes. The study is a whole-genome analysis based on the GLEAN-R consensus set, and includes all genes either with annotated orthologs in all 12 species (~6690 genes) or in the six melanogaster group species (~8560 genes). We compare six popular aligners (PRANK, T-Coffee, CLUSTALW, Probcons, AMAP and Muscle), and estimate the rate of protein evolution and infer the presence of positive selection using widely used PAML site-specific models. We find that the choice of aligner strongly influences the estimates of positive selection. These differences persist when we use (i) different stringency cutoffs, (ii) different PAML models, (iii) alignments with or without gaps and/or additional masking, (iv) per-site statistics versus gene-wise LRT statistics, (v) closely related melanogaster group species versus more distantly related 12 Drosophila genomes. Furthermore, we find that these differences are consequential for downstream analyses such as the determination of the over/underrepresented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are in fact misaligned at the codon level. The rate of false positives ranged between 45% and 82% depending on the aligner used, the selection inference method and the closeness of included species. PRANK, which has recently been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a high and unacceptable for most applications rate of false positives of ~45-50%. We investigate the problems leading to such a high error of misalignment and identify misannotations and indels, many of which appear to be located in disordered protein regions, as the primary culprits. Finally, we discuss possible workaround approaches to this apparently pervasive problem in genome-wide evolutionary analyses.
ER  -