
Evolutionary signatures for protein-coding gene identification. (A) Within coding regions, triplet substitutions are biased toward conservative codon substitutions (codon substitution frequencies; CSF). Additionally, indels in coding regions are strongly biased toward lengths that are a multiple of three (reading frame conservation; RFC). (B) The color of each codon substitution between the D. melanogaster sequence and an informant sequence corresponds to the log-odds score of observing that substitution in a coding region versus a noncoding region. (C) Quantitative metrics of RFC and CSF distinguish coding from noncoding regions. Shown in blue are 5,567 coding exons of well-studied genes, and in orange are 22,019 regions chosen uniformly at random from the noncoding part of the genome, with the same length distribution as the exons. The CSF score is length-normalized, and the discrete RFC score is dithered by adding random noise drawn uniformly from (−0.5, 0.5) for the purposes of visualization.
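The quantities named in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the frequency tables `coding_freq` and `noncoding_freq`, the probability floor, the choice to normalize by the number of ungapped aligned codons, and the gap-based proxy for reading frame conservation are all assumptions made for illustration only.

```python
import math
import random
import re


def codon_log_odds(pair, coding_freq, noncoding_freq, floor=1e-6):
    """Log-odds of a codon substitution in coding vs. noncoding regions.

    `pair` is (target_codon, informant_codon). The two frequency tables are
    hypothetical inputs; `floor` avoids log(0) for unseen pairs.
    """
    p_coding = max(coding_freq.get(pair, 0.0), floor)
    p_noncoding = max(noncoding_freq.get(pair, 0.0), floor)
    return math.log(p_coding / p_noncoding)


def csf_like_score(target, informant, coding_freq, noncoding_freq):
    """Sum log-odds over differing codons, divided by the codons compared.

    Both sequences are assumed to be aligned and in frame. Codons that
    contain a gap character are skipped (an assumption of this sketch).
    """
    total, n_codons = 0.0, 0
    for i in range(0, min(len(target), len(informant)) - 2, 3):
        a, b = target[i:i + 3], informant[i:i + 3]
        if "-" in a or "-" in b:
            continue
        n_codons += 1
        if a != b:  # only substitutions contribute to the score
            total += codon_log_odds((a, b), coding_freq, noncoding_freq)
    return total / n_codons if n_codons else 0.0


def frame_preserving_gap_fraction(aligned_seq):
    """Fraction of gap runs whose length is a multiple of three.

    A simplified proxy for the indel bias in panel (A), not the RFC score
    used in the figure.
    """
    gaps = [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]
    if not gaps:
        return 1.0
    return sum(g % 3 == 0 for g in gaps) / len(gaps)


def dither(values, rng=None):
    """Add uniform noise of magnitude at most 0.5 to discrete scores."""
    rng = rng or random.Random(0)
    return [v + rng.uniform(-0.5, 0.5) for v in values]
```

With toy tables in which a substitution is four times more frequent in coding than in noncoding regions, that substitution contributes log 4 to the sum before normalization. Dithering only spreads overlapping discrete values for plotting and does not change which integer a point is nearest to, apart from exact ties at ±0.5.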











