Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney

Efficient storage of high throughput DNA sequencing data using reference-based compression

Figure 1.

Schematic of the compression technique. (A) Reads are first aligned to an established reference. (B) Unaligned reads are then pooled to create a specific “compression framework” for this data set. (C) The base pair information is then stored using specific offsets of reads on the reference, with substitutions, insertions, or deletions encoded in separate data structures.
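Step (C) of the caption can be illustrated with a minimal sketch. It assumes a gap-free alignment, so only substitutions are handled; insertions and deletions would go in analogous separate structures. The record layout and the names `encode_read` and `decode_read` are hypothetical, chosen for illustration, and are not the authors' actual format.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the paper's format):
# (offset on the reference, read length, list of (position in read, base)).
EncodedRead = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, offset: int, read: str) -> EncodedRead:
    """Store a gap-free aligned read as its offset plus its mismatching bases."""
    window = reference[offset:offset + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read differs from the reference.
    substitutions = [
        (i, base)
        for i, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return offset, len(read), substitutions


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read from the reference and the stored differences."""
    offset, length, substitutions = encoded
    bases = list(reference[offset:offset + length])
    for i, base in substitutions:
        bases[i] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "GTAGGT"  # aligns at offset 2 with one substitution (C -> G)
encoded = encode_read(reference, 2, read)
restored = decode_read(reference, encoded)
```

A read that matches the reference exactly reduces to an offset and a length, which is where the storage saving in this kind of scheme comes from.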