Autoencoders for genomic variation analysis

Table 2. Compression benchmark for subsets of SNPs of human Chromosome 22

| Data | 10,000 SNPs | 50,000 SNPs | 80,000 SNPs | 317,400 SNPs |
| --- | --- | --- | --- | --- |
| Original size | 112.27 | 561.33 | 898.14 | 3536.40 |
| Gzip (clevel 9) | 6.48 (×17.3) [1 m 0.691 s] | 40.68 (×13.8) [6 m 9.370 s] | 65.20 (×13.8) [10 m 45.854 s] | 263.30 (×13.4) [44 m 5.341 s] |
| ZPAQ (clevel 3) (Mahoney 2005) | 5.92 (×18.9) [1 m 59.611 s] | 28.83 (×19.5) [9 m 52.687 s] | 45.18 (×19.9) [24 m 39.143 s] | 183.38 (×19.3) [98 m 46.042 s] |
| Zstandard (Collet and Kucherawy 2018) | 11.29 (×9.9) [0 m 0.209 s] | 57.08 (×9.8) [0 m 1.017 s] | 92.75 (×9.7) [0 m 2.143 s] | 372.74 (×9.5) [0 m 6.535 s] |
| Genozip (Lan et al. 2021) | 0.94 (×119.4) [0 m 12.899 s] | 29.89 (×18.8) [0 m 2.681 s] | 48.67 (×18.5) [0 m 3.249 s] | 200.13 (×17.7) [0 m 11.741 s] |
| bref3 (Browning et al. 2018) | 4.35 (×25.8) [0 m 1.383 s] | 19.91 (×28.2) [0 m 4.322 s] | 27.31 (×32.9) [0 m 10.709 s] | 115.52 (×30.6) [0 m 22.916 s] |
| VQ-VAE + Zstandard (ours) | 3.42 (×32.83) [0 m 12.905 s] | 25.37 (×22.12) [1 m 0.564 s] | 40.17 (×22.4) [1 m 42.669 s] | 160.68 (×22.0) [6 m 37.681 s] |
| VQ-VAE + Genozip (ours) | 3.59 (×31.3) [0 m 6.984 s] | 19.44 (×28.9) [0 m 14.447 s] | 27.77 (×32.3) [0 m 26.471 s] | 115.24 (×30.7) [1 m 23.828 s] |
  • File sizes are given in MB and compared between methods, along with each method's compression factor (in parentheses) and running time (in brackets). We mark in bold the top two choices based on compression factors.
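
The compression factors in Table 2 appear consistent with the ratio of original to compressed file size. The following minimal sketch assumes that definition (it is not stated explicitly in this excerpt) and recomputes the factors for the 10,000-SNP column from the sizes copied out of the table. The names `compression_factor` and `top_two` are hypothetical helpers introduced only for illustration.

```python
# Sizes in MB for the 10,000-SNP subset, copied from Table 2.
ORIGINAL_MB = 112.27
COMPRESSED_MB = {
    "Gzip (clevel 9)": 6.48,
    "ZPAQ (clevel 3)": 5.92,
    "Zstandard": 11.29,
    "Genozip": 0.94,
    "bref3": 4.35,
    "VQ-VAE + Zstandard": 3.42,
    "VQ-VAE + Genozip": 3.59,
}


def compression_factor(original_mb: float, compressed_mb: float) -> float:
    """Assumed definition: original size divided by compressed size."""
    return original_mb / compressed_mb


def top_two(original_mb: float, compressed: dict[str, float]) -> list[str]:
    """Return the two method names with the largest compression factors."""
    factors = {
        name: compression_factor(original_mb, size)
        for name, size in compressed.items()
    }
    return sorted(factors, key=factors.get, reverse=True)[:2]


if __name__ == "__main__":
    for name, size in COMPRESSED_MB.items():
        print(f"{name}: x{compression_factor(ORIGINAL_MB, size):.1f}")
    print("Top two:", top_two(ORIGINAL_MB, COMPRESSED_MB))
```

The same two functions can be applied to the other columns by swapping in the corresponding original and compressed sizes.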

This Article

  1. Genome Res. 36: 348-360
