Comparison of conservation of cis-regulatory elements (CREs) to two types of control sites
| Analysis | Group 1 vs. group 2 | CRE vs. nearby | CRE vs. random intergenic | Nearby vs. random intergenic |
|---|---|---|---|---|
| Per site analysis | Group 1 mean per site % identity | 51.3% | 51.3% | 47.8% |
| | Group 2 mean per site % identity | 47.8% | 42.9% | 42.9% |
| | Difference of means (group 1 – group 2) | 3.6% | 8.4% | 4.9% |
| | Difference of means resampling p-value | 0.05 | 0.003 | 1E-5 |
| | Distribution comparison KS p-value | 0.026 | 0.0016 | 2E-6 |
| Per base analysis | Group 1 mean per base % identity | 47.8% | 47.8% | 46.3% |
| | Group 2 mean per base % identity | 46.3% | 42.4% | 42.4% |
| | Difference of means (group 1 – group 2) | 1.5% | 5.4% | 3.9% |
| | Difference of means resampling p-value | 0.24 | 0.05 | 5.8E-4 |
[i] For each CRE, 20 random intergenic controls (RICs) were generated by randomly choosing sites of the same length as the CRE, on the same chromosome and strand, and rejecting any that overlapped a known gene. In addition, 10 nearby control sites were generated for each CRE by adding positive and negative (i.e., 3′ and 5′) offsets of 50, 100, 150, 200, and 250 bp to the coordinates of the true CRE. Percentage identities for all CRE and control sites were computed relative to the reference alignment, on both a per site and a per base basis. Unaligned bases, mismatches, and D. melanogaster insertions contributed zeros to the % identity results; D. pseudoobscura insertions were ignored. The distributions of % identity values were clearly not normal, so we avoided tests such as the t-test that assume normality. We compared the per site and per base mean % identities of each group using a resampling test, in which the p-value of the observed difference was estimated as the frequency (over a million trials) with which a mean as large as or larger than the observed CRE mean occurred in an equal-sized sample of control sites. Similarly, the p-value of the difference between the two control sets was estimated using a randomization test (over a million trials) in which the two sets were mixed and then repartitioned into corresponding mock control sets. We compared the distributions using the Kolmogorov-Smirnov (KS) test, which assesses whether two samples are likely to have come from the same continuous distribution.
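
The nearby-control construction and the two resampling procedures described in this note can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the analysis: the function names (`nearby_controls`, `resampling_p_value`, `randomization_p_value`) are hypothetical, sampling without replacement and a one-sided comparison are assumptions the note does not state, and the default trial count is reduced from the million trials reported above to keep the example fast.

```python
import random
from statistics import mean


def nearby_controls(start, end, offsets=(50, 100, 150, 200, 250)):
    """Return the 10 intervals obtained by shifting a CRE at (start, end)
    by each offset in both directions (hypothetical helper)."""
    return [(start + sign * off, end + sign * off)
            for off in offsets
            for sign in (-1, 1)]


def resampling_p_value(cre_values, control_values, trials=10_000, rng=None):
    """Estimate how often an equal-sized sample of control sites has a mean
    at least as large as the observed CRE mean.

    Assumption: samples are drawn without replacement.
    """
    if rng is None:
        rng = random.Random(0)
    observed = mean(cre_values)
    n = len(cre_values)
    hits = 0
    for _ in range(trials):
        if mean(rng.sample(control_values, n)) >= observed:
            hits += 1
    return hits / trials


def randomization_p_value(set_a, set_b, trials=10_000, rng=None):
    """Mix two control sets, repartition them into mock sets of the original
    sizes, and count how often the mock difference of means is at least as
    large as the observed one.

    Assumption: one-sided test on mean(set_a) - mean(set_b).
    """
    if rng is None:
        rng = random.Random(0)
    observed = mean(set_a) - mean(set_b)
    pooled = list(set_a) + list(set_b)
    n_a = len(set_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if mean(pooled[:n_a]) - mean(pooled[n_a:]) >= observed:
            hits += 1
    return hits / trials
```

With a finite number of trials, the smallest nonzero p-value such an estimate can report is 1/trials, which is why small p-values like those in the table require on the order of a million trials. The sketch shifts coordinates only; it does not check for negative positions or chromosome ends.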