Ultrafast and scalable variant annotation and prioritization with big functional genomics data

Dandan Huang; Xianfu Yi; Yao Zhou; Hongcheng Yao; Hang Xu; Jianhua Wang; Shijie Zhang; Wenyan Nong; Panwen Wang; Lei Shi; Chenghao Xuan; Miaoxin Li; Junwen Wang; Weidong Li; Hoi Shan Kwan; Pak Chung Sham; Kai Wang; Mulin Jun Li

doi:10.1101/gr.267997.120

¹ Tianjin Medical University;
² The University of Hong Kong;
³ The Chinese University of Hong Kong;
⁴ Mayo Clinic;
⁵ Sun Yat-sen University;
⁶ Children's Hospital of Philadelphia

* Corresponding author; email: mulinli@connect.hku.hk

Abstract

Advances in large-scale genomics studies have enabled the compilation of cell type-specific, genome-wide DNA functional elements at high resolution. As the volume of functional annotation data and sequencing variants grows, existing variant annotation algorithms lack the efficiency and scalability to process big genomic data, particularly when annotating whole-genome sequencing variants against a huge database with billions of genomic features. Here, we develop VarNote to rapidly annotate genome-scale variants against large and complex functional annotation resources. Equipped with a novel index system and a parallel random-sweep searching algorithm, VarNote shows substantial performance improvements (two to three orders of magnitude) over existing algorithms at different scales. It supports both region-based and allele-specific annotations and introduces advanced functions for the flexible extraction of annotations. By integrating massive base-wise and context-dependent annotations in the VarNote framework, we introduce three efficient and accurate pipelines to prioritize the causal regulatory variants for common diseases, Mendelian disorders and cancers.
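
To make the distinction between region-based and allele-specific annotation concrete, the following is a minimal illustrative sketch in Python. It is not VarNote's implementation and does not use its index system or its parallel random-sweep search; it relies on a plain linear scan. The `Variant` and `Record` types, the function names `region_hits` and `allele_hits`, and the toy records are all hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position
    ref: str
    alt: str


class Record(NamedTuple):
    """A hypothetical annotation record (not VarNote's on-disk format)."""
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    ref: Optional[str]  # None for interval-only annotations
    alt: Optional[str]
    label: str


def region_hits(variant: Variant, records: List[Record]) -> List[Record]:
    """Region-based: keep every record whose interval covers the variant position."""
    return [
        r for r in records
        if r.chrom == variant.chrom and r.start <= variant.pos <= r.end
    ]


def allele_hits(variant: Variant, records: List[Record]) -> List[Record]:
    """Allele-specific: additionally require the same position, REF and ALT."""
    return [
        r for r in region_hits(variant, records)
        if r.start == variant.pos and r.ref == variant.ref and r.alt == variant.alt
    ]


# Toy data, invented for this example only.
RECORDS = [
    Record("chr1", 100, 200, None, None, "enhancer"),
    Record("chr1", 150, 150, "A", "G", "score_A>G"),
    Record("chr1", 150, 150, "A", "T", "score_A>T"),
    Record("chr2", 150, 150, "A", "G", "other_chrom"),
]
QUERY = Variant("chr1", 150, "A", "G")

if __name__ == "__main__":
    print([r.label for r in region_hits(QUERY, RECORDS)])
    print([r.label for r in allele_hits(QUERY, RECORDS)])
```

In this toy setting, the region-based query returns every record overlapping the position, including the interval annotation and both allele-level scores, whereas the allele-specific query keeps only the record whose REF and ALT match the query variant. A linear scan like this does not scale to billions of features, which is the problem that indexed search is meant to address.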

Received June 28, 2020.
Accepted September 22, 2020.

Published by Cold Spring Harbor Laboratory Press

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.