Evaluation of the linear representation of centromeric arrays. (A) Estimate of accurate WGS sequences in the processed linear representations of the X (black) and Y (gray) centromeric arrays. Read libraries and the linearized X and Y centromeric arrays are reformatted into k-mer libraries (where k = 50–400 bp, with a 1-bp slide in both strand orientations), and the proportion of sequences in the initial read database that are also observed in the final database is determined. (B) Estimate of sequences observed in the linearized centromeric arrays relative to the initial WGS sequence database, where proportions less than one reflect the gain of novel sequence windows due to the Markov chain model. (C) To determine the improvement in long-range array prediction with increasing model order, simulated long reads were generated at random from each linearized centromeric array (with read length defined by monomer order 3–23, assuming an average monomer length of 171 bp), and the longest arrangement of correctly ordered monomers was normalized to the total length of the array.
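
The k-mer comparison described in panels A and B can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`reverse_complement`, `kmer_set`, `shared_proportion`) and the toy sequences are hypothetical, and the sketch assumes that a k-mer library is the set of all length-k windows taken with a 1-bp slide, stored in both strand orientations. The proportion is computed relative to whichever library is passed as the query, so either direction of comparison can be expressed by swapping the arguments.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def kmer_set(sequences, k):
    """Collect every length-k window (1-bp slide) in both strand orientations."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            kmers.add(window)
            kmers.add(reverse_complement(window))
    return kmers


def shared_proportion(query_seqs, reference_seqs, k):
    """Fraction of distinct query k-mers that also occur in the reference."""
    query = kmer_set(query_seqs, k)
    if not query:
        return 0.0
    reference = kmer_set(reference_seqs, k)
    return len(query & reference) / len(query)


# Toy data (hypothetical): two short "reads" and one "linearized array".
reads = ["ACGTAC", "GTACGG"]
array = ["ACGTACGG"]

# Proportion of array k-mers that are supported by the read library.
array_in_reads = shared_proportion(array, reads, k=4)

# Proportion of read k-mers that are retained in the array.
reads_in_array = shared_proportion(reads, array, k=4)
```

In practice k would range over 50–400 bp as stated in the caption; k = 4 is used here only to keep the toy example readable. A proportion below one in either direction means that some k-mers in the query library have no counterpart in the reference library.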
