Whole-genome analysis of Alu repeat elements reveals complex evolutionary history

Table 1.

Finding the Ya5 subfamily in the set of all Alu elements


(A) All Alus: nucleotide frequencies

Position     1     2     3     4     5
A         0.01  0.02  0.06  0.03  0.31
C         0.02  0.87  0.01  0.63  0.02
G         0.00  0.02  0.91  0.03  0.64
T         0.96  0.08  0.01  0.31  0.02

(B) k-means clustering

Cluster 1 (90% of Alus)

Position     1     2     3     4     5
A         0.01  0.03  0.06  0.03  0.31
C         0.02  0.97  0.01  0.63  0.03
G         0.00  0.00  0.91  0.03  0.64
T         0.96  0.00  0.01  0.31  0.02

Cluster 2 (10% of Alus)

Position     1     2     3     4     5
A         0.01  0.00  0.07  0.03  0.34
C         0.02  0.00  0.01  0.64  0.02
G         0.00  0.22  0.91  0.03  0.63
T         0.98  0.78  0.02  0.31  0.02

(C) All Alus: binucleotide frequencies relative to expected

Positions  1,2  1,3  1,4  1,5  2,3  2,4  2,5  3,4  3,5  4,5
A,A          1    1    1    1    5    1    1    1    1    1
A,C          1    1    1    1    1    1   12    1    5    1
A,G          1    1    1    1    1    1    1    1    1    1
A,T          1    1    1    1    1    2    1    1    1    1
C,A         12    5    1    1    1    1    1    1    1    1
C,C          1    1    1   11    1    1    1    1    1    1
C,G          1    1    1    1    1    1    1    1    1    1
C,T          1    1    2    1    1    1    1    1    1    1
G,A          1    1    1    2    2    1    1    1    1    1
G,C          1    1    1    1    1    1    1    1    1    1
G,G          1    1    1    1    1    1    1    1    1    1
G,T          1    1    1    1    1    1    1    1    1    1
T,A          1    1    1    1    1    1    1    1    1    1
T,C          1    1    1    1    1    1    1    1    1    2
T,G          1    1    1    1    1    1    1    1    1    1
T,T          1    1    1    1    1    1    1    1    1    1

(D) Our algorithm

Cluster 1 (99.3% of Alus)

Position     1     2     3     4     5
A         0.01  0.02  0.06  0.03  0.32
C         0.02  0.88  0.01  0.64  0.02
G         0.00  0.02  0.92  0.03  0.64
T         0.97  0.08  0.01  0.30  0.02

Cluster 2 (0.7% of Alus)

Position     1     2     3     4     5
A         0.01  0.98  0.95  0.00  0.02
C         0.95  0.01  0.00  0.06  0.91
G         0.00  0.01  0.05  0.00  0.07
T         0.05  0.00  0.00  0.93  0.00

  • For simplicity, we considered only the 5 Alu positions with diagnostic mutations in the Ya5 subfamily (positions 91, 98, 146, 175, and 238, assuming that positions of the AluSx consensus sequence are labeled from 1 to 282). In each table, entries corresponding to the Ya5 consensus are underlined. In (A), entries corresponding to the Alu consensus are indicated in boldface type. In (B) and (D), entries corresponding to the consensus of each respective cluster are indicated in boldface type. (A) The nucleotide frequency profile of all Alus. (B) Frequency profiles for the 2 clusters returned by k-means clustering with k = 2, which does not find the Ya5 subfamily. (C) Ratio of actual versus expected biprofile frequencies at each pair of positions, rounded to the nearest integer. (D) Frequency profiles for the 2 clusters found by our algorithm, which finds the Ya5 subfamily.
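
The quantities tabulated in (A) and (C) can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this table: the "expected" frequency of a nucleotide pair is taken to be the product of the two single-position frequencies (that is, the positions are treated as independent), and the input is a toy set of 5-letter strings invented for illustration, not the Alu data. The function names `frequency_profile` and `pair_ratios` are hypothetical, and positions are indexed from 0 in the code rather than from 1 as in the table.

```python
from collections import Counter
from itertools import combinations

NUCLEOTIDES = "ACGT"


def frequency_profile(seqs):
    """Return profile[pos][nt]: fraction of sequences with nt at pos."""
    n = len(seqs)
    profile = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        profile.append({nt: counts[nt] / n for nt in NUCLEOTIDES})
    return profile


def pair_ratios(seqs):
    """Return ratios[(i, j)][(a, b)]: observed / expected pair frequency.

    ASSUMPTION: expected = freq of a at i times freq of b at j
    (independence between positions). Entries whose expected
    frequency is zero are reported as None.
    """
    n = len(seqs)
    profile = frequency_profile(seqs)
    ratios = {}
    for i, j in combinations(range(len(seqs[0])), 2):
        pair_counts = Counter((s[i], s[j]) for s in seqs)
        table = {}
        for a in NUCLEOTIDES:
            for b in NUCLEOTIDES:
                expected = profile[i][a] * profile[j][b]
                observed = pair_counts[(a, b)] / n
                table[(a, b)] = observed / expected if expected > 0 else None
        ratios[(i, j)] = table
    return ratios


# Toy data (invented): a majority pattern plus a 10% minority pattern
# that differs at every position.
seqs = ["TCGCG"] * 90 + ["CAATC"] * 10
profile = frequency_profile(seqs)
ratios = pair_ratios(seqs)

# Rounded to the nearest integer, as in panel (C).
print(round(ratios[(0, 1)][("C", "A")]))
print(round(ratios[(0, 1)][("T", "C")]))
```

In this toy set the minority combination C,A at the first two positions occurs about 10 times more often than independence would predict, while the majority combination T,C stays close to 1. This is the kind of signal that large entries in (C) represent: co-occurring minority nucleotides that a single-position profile such as (A) averages away.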

This Article

  1. Genome Res. 14: 2245-2252
