Kristoffer Sahlin

Figure 6.

An illustration of three minstrobes (A) and three randstrobes (B) with (n = 3, ℓ = 3, w_min = 3, w_max = 5), and one hybridstrobe (C) with (n = 3, ℓ = 3, w_min = 3, w_max = 5, x = 3) generated from a DNA string of 16 letters. With parameters n = 3 and ℓ = 3, each strobemer consists of three strobes (substrings), each of length 3. The position of the first strobe, m₁, in each of the strobemers is highlighted in blue. The remaining strobes are each chosen from a window of w_max − w_min + 1 = 3 positions using the selection method of minstrobes (A), randstrobes (B), or hybridstrobes (C). The possible start positions of strobes m₂ and m₃ are highlighted in green and red, respectively. For the minstrobe method A, the 3-mer minimizer hash values (under a made-up hash function in the figure) are shown above the DNA string and come from computing h(m) for each 3-mer strobe m. The position of each hash value corresponds to the first position of the 3-mer strobe. The minimizer values in all relevant strobe windows of length 3 in the figure are indicated by gray squares. For the minstrobe method, strobes m₂ and m₃ are selected independently based on the minimizer value in each strobe window. This gives a high similarity between nearby strobemers (which share minimizers). The three minstrobes produced are shown to the right in A. For the randstrobe method B, strobes m₂ and m₃ are selected depending on the previous strobes, namely, by h(m | m₁, …, m_{i−1}). The function producing the conditional dependence is irrelevant for the purpose of illustration. Here we use string concatenation of the previous strobes to produce the dependence, but any other function producing conditional dependence will suffice. Because of the conditional dependence in the hash function, randstrobes are more randomly (but deterministically) distributed across the sequence.
For the hybridstrobe method C, strobes m₂ and m₃ are each selected from one of the x subwindows, chosen by the remainder of the previous strobe's hash value (modulo x). Each subwindow has individually computed minimizers, similar to the minstrobes. However, letting the choice of subwindow depend on the remainder of the previous strobe creates more sampling variability than minstrobes.
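The three selection rules described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the hash `h` is CRC32 standing in for the made-up hash function of the figure, the 16-letter string `SEQ` is invented, positions are 0-based, and the window for strobe m_j of the strobemer starting at i is assumed to be the start positions i + w_min + (j − 2)·w_max through i + (j − 1)·w_max, which gives w_max − w_min + 1 candidates. For hybridstrobes, "remainder of the previous strobe" is read as h(previous strobe) mod x, and the window is assumed to be cut into x contiguous subwindows. The function names are hypothetical.

```python
import zlib


def h(s):
    """Stand-in hash (CRC32); an assumption, not the hash used in the figure."""
    return zlib.crc32(s.encode("ascii"))


def strobe_window(i, j, w_min, w_max):
    """Candidate start positions (0-based) for strobe m_j, j >= 2 (assumed layout)."""
    return range(i + w_min + (j - 2) * w_max, i + (j - 1) * w_max + 1)


def strobemer_positions(seq, i, n, l, w_min, w_max, method, x=1):
    """Start positions of the n strobes of the strobemer at i, or None if a
    window runs past the end of seq."""
    starts = [i]
    for j in range(2, n + 1):
        cand = list(strobe_window(i, j, w_min, w_max))
        if cand[-1] + l > len(seq):
            return None
        if method == "minstrobe":
            # Independent minimizer of the window.
            key = lambda p: h(seq[p:p + l])
        elif method == "randstrobe":
            # Conditional dependence via concatenation of the previous strobes.
            prefix = "".join(seq[q:q + l] for q in starts)
            key = lambda p, prefix=prefix: h(prefix + seq[p:p + l])
        elif method == "hybridstrobe":
            k = len(cand)
            if not 1 <= x <= k:
                raise ValueError("x must be between 1 and the window size")
            # The previous strobe picks one of x contiguous subwindows ...
            prev = seq[starts[-1]:starts[-1] + l]
            r = h(prev) % x
            cand = cand[r * k // x:(r + 1) * k // x]
            # ... and the minimizer of that subwindow becomes the strobe.
            key = lambda p: h(seq[p:p + l])
        else:
            raise ValueError("unknown method: " + method)
        starts.append(min(cand, key=key))  # leftmost position wins ties
    return tuple(starts)


def strobemers(seq, n, l, w_min, w_max, method, x=1):
    """All strobemers (as tuples of strobe start positions) whose windows fit in seq."""
    out = []
    for i in range(len(seq) - l + 1):
        pos = strobemer_positions(seq, i, n, l, w_min, w_max, method, x)
        if pos is None:
            break
        out.append(pos)
    return out


SEQ = "ACGCGTACGAATCACG"  # made-up 16-letter string, not the one in the figure

if __name__ == "__main__":
    for method in ("minstrobe", "randstrobe", "hybridstrobe"):
        for pos in strobemers(SEQ, 3, 3, 3, 5, method, x=3):
            print(method, pos, "".join(SEQ[p:p + 3] for p in pos))
```

With the caption's parameters and x = 3, each hybridstrobe subwindow in this sketch holds a single position, so the previous strobe alone determines the next one; larger windows would leave room for a minimizer choice inside each subwindow.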

Effective sequence similarity detection with strobemers
