Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases

Figure 2.

DNA polymerase kinetics in SMRT sequencing is a function of the local sequence context of the incorporation site, motivating a conditional random field (CRF) approach to KVE detection. (A) Heatmap of the coefficient of determination (R²) for the IPD variance at the incorporation site of a SMRT sequencing reaction explained by local sequence context. This heatmap suggests that the seven bases upstream of and the two bases downstream from the incorporation site are the most informative, and that bases beyond this context provide little additional information about the enzyme kinetics. (B) Scatter plot comparing IPDs in identical sequence contexts between whole-genome amplified E. coli and M. genitalium samples. Each point represents the log of the IPD for a given 10-bp context (seven bases upstream of and two bases downstream from the incorporation site) in E. coli (y-axis) and M. genitalium (x-axis); 2500 points sampled from the 1,048,576 possible 10-mer contexts are shown here for ease of viewing. The strong correlation (Pearson's correlation coefficient = 0.91) between IPDs in identical contexts assayed from completely independent sequencing runs of different species demonstrates that the context effects are highly consistent between experiments. (C) Graphical representation of the CRF model. The hidden variables represent the modification states for site i, while the observed variables represent the IPD values for site i that inform on the modification status of the site. In this model we consider interactions between the incorporation site and the two nearest neighboring sites on each side of it. The edges between the variables indicate that there can be interactions between the local sites, with the associated edge parameters representing the degree of interaction among the nodes. The rate parameters represent the exponential rates for the two possible rate classes at each position i, while the proportion parameters represent the proportion of molecules in state k at position i.
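
The quantities named in panels B and C can be illustrated with a minimal sketch. It assumes that the two rate classes at a site combine as a simple two-component exponential mixture whose proportions sum to 1, which is one reading of the caption rather than the authors' stated implementation. The function names, parameter names, and numeric values are hypothetical, and the neighbor interactions of the CRF are not modeled here.

```python
import math

def context_count(upstream=7, downstream=2):
    """Number of distinct sequence contexts: upstream bases, the
    incorporation site itself, and downstream bases, 4 choices each."""
    return 4 ** (upstream + 1 + downstream)

def mixture_pdf(ipd, rates, proportions):
    """Density of an IPD value under a mixture of exponential rate classes.

    rates       -- exponential rate for each class (assumed positive)
    proportions -- fraction of molecules in each class (assumed to sum to 1)
    """
    if ipd < 0:
        return 0.0
    return sum(p * lam * math.exp(-lam * ipd)
               for lam, p in zip(rates, proportions))

def log_likelihood(ipds, rates, proportions):
    """Log-likelihood of independent IPD observations at a single site."""
    return sum(math.log(mixture_pdf(x, rates, proportions)) for x in ipds)

if __name__ == "__main__":
    # Hypothetical values: a fast class and a slow class of molecules.
    rates = (2.0, 0.5)
    proportions = (0.75, 0.25)
    print(context_count())
    print(log_likelihood([0.3, 1.2, 4.0], rates, proportions))
```

With the default arguments, `context_count()` reproduces the 1,048,576 possible 10-mer contexts cited in panel B. Setting one proportion to 1 reduces the mixture to a single exponential, which corresponds to a site where all molecules fall in one rate class.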

This Article

  1. Genome Res. 23: 129-141
