Can Alkan; Maria Francesca Cardone; Claudia Rita Catacchio; Francesca Antonacci; Stephen J. O'Brien; Oliver A. Ryder; Stefania Purgato; Monica Zoli; Giuliano Della Valle; Evan E. Eichler; Mario Ventura

Figure 1.

The RepeatNet algorithm. (A) The layout of the alphoid repeat array in the centromere and the paired-end inserts in the centromeric region is shown. Note that since the centromere is larger than the inserts (fosmids, plasmids, BACs, or short inserts used in next-generation sequencing), both ends of the same insert contain alphoid sequence. (B) Close-up view of a paired-end insert over the alphoid repeat array. We also show all possible k-mers (sliding by 1 bp) that can be generated from the reads. (C) The ideal case for the k-mer structure in the end sequences. When both ends of a paired-end insert contain alphoid sequence, we expect that the k-mers in the forward end will be represented with their reverse-complement counterparts in the reverse end. For simplicity, we show only the nonoverlapping k-mers; however, RepeatNet considers all possible overlapping k-mers. In this figure, w₁-m₁, w₂-m₂, w₃-m₃, w′₁-m′₁, w′₂-m′₂, w′₃-m′₃, w′′₁-m′′₁, w′′₂-m′′₂, w′′₃-m′′₃ are the k-mer pairs that are reverse complements of each other, and the triplet k-mer groups (w₁-w′₁-w′′₁), (w₂-w′₂-w′′₂), (w₃-w′₃-w′′₃) are highly similar k-mers. In the case of exact repeats, these k-mers are identical. (D) Since k-mer pairs w₁-m₁, w₂-m₂, and w₃-m₃ exist in the same read pairs, we put an edge between the nodes that represent such k-mers. (E) The repeat graph for the ideal case of a 31-mer tandem repeat with exact repeat units is shown. This graph includes 20 vertices for 20 k-mer pairs that can be generated from a 31-mer repeat structure, and there exists an edge between all pairs of k-mers. Note that this graph is a clique of size 20. For non-ideal cases, the clique property will be lost; however, the graph will still be very dense in terms of the average degree of the vertices. RepeatNet finds such dense subgraphs of the repeat graph with a heuristic that selects the vertex with the highest degree, and other vertices that share an edge with this selected vertex.
Alternatively, a maximum density subgraph algorithm can be used (Fratkin et al. 2006), though this algorithm has a high running time complexity of O(n·m·log(n²/m)).
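
The graph construction and the highest-degree heuristic described in the caption can be illustrated with a minimal Python sketch. This is not the RepeatNet implementation: the function names (`revcomp`, `kmers`, `paired_kmers`, `build_repeat_graph`, `densest_neighborhood`), the choice to label each node by its forward-end k-mer, and the toy 10-bp repeat unit are all assumptions made for illustration, and exact matching is used where real data would need to tolerate sequence divergence.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """All overlapping k-mers of seq (sliding by 1 bp), as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def paired_kmers(fwd, rev, k):
    """Forward-end k-mers whose reverse complement occurs in the reverse end.

    Assumption: each (w, m) pair is represented by its forward k-mer w.
    """
    rev_set = kmers(rev, k)
    return {w for w in kmers(fwd, k) if revcomp(w) in rev_set}


def build_repeat_graph(read_pairs, k):
    """Connect k-mer pairs that are observed in the same read pair."""
    graph = defaultdict(set)
    for fwd, rev in read_pairs:
        nodes = paired_kmers(fwd, rev, k)
        for u in nodes:
            graph[u] |= nodes - {u}
    return graph


def densest_neighborhood(graph):
    """Heuristic from the caption: the highest-degree vertex plus its neighbors."""
    if not graph:
        return set()
    hub = max(graph, key=lambda v: len(graph[v]))
    return {hub} | graph[hub]


# Toy ideal case: an exact tandem repeat covering both ends of one insert.
unit = "AAGGCTTCAG"           # hypothetical repeat unit, not real alphoid sequence
forward_end = unit * 3
reverse_end = revcomp(unit * 3)
graph = build_repeat_graph([(forward_end, reverse_end)], k=4)
dense = densest_neighborhood(graph)
```

In this exact-repeat toy case every node is adjacent to every other node, so the heuristic returns the whole vertex set, mirroring the clique in panel E. With diverged repeat units the graph would be dense rather than complete, and the heuristic would return only the neighborhood of the best-connected k-mer pair.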

Genome-wide characterization of centromeric satellites from multiple mammalian genomes
