Highly accurate quantification of allelic gene expression for population and disease genetics

Analysis of allele-specific gene expression (ASE) is a powerful approach for studying gene regulation, particularly when sample sizes are small, such as for rare diseases, or when studying the effects of rare genetic variation. However, detection of ASE events relies on accurate alignment of RNA sequencing reads, where challenges still remain, particularly for reads containing genetic variants or those that align to many different genomic locations. We have developed the Personalised ASE Caller (PAC), a tool that combines multiple steps to improve the quantification of allelic reads, including personalized (i.e., diploid) read alignment with improved allocation of multimapping reads. Using simulated RNA sequencing data, we show that PAC outperforms standard alignment approaches for ASE detection, reducing the number of sites with incorrect biases (>10%) by ∼80% and increasing the number of sites that can be reliably quantified by ∼3%. Applying PAC to real RNA sequencing data from 670 whole-blood samples, we show that genetic regulatory signatures inferred from ASE data more closely match those from population-based methods that are less prone to alignment biases. Finally, we use PAC to characterize cell type–specific ASE events that would be missed by standard alignment approaches, and in doing so identify disease relevant genes that may modulate their effects through the regulation of gene expression. PAC can be applied to the vast quantity of existing RNA sequencing data sets to better understand a wide array of fundamental biological and disease processes.


Supplemental Material for "Highly accurate quantification of allelic gene expression for population and disease genetics"
This document contains the following supplementary material for the manuscript:

Assessment of variant calling accuracy on ASE detection
Accuracy of variant calls is a known source of error in ASE analysis. For example, variants erroneously called as heterozygous would show complete monoallelic expression in ASE analysis. Further, while ASE analysis is typically carried out only with SNVs, the call accuracy of short indels and copy number variants (CNV) can influence the mapping of reads when personalised genomes are used. As such, we analysed the ability of GATK (Van der Auwera et al. 2013) to identify heterozygous variants using simulated whole genome sequencing data as described in the methods. In total there are 4,042,773 data points in the PGP VCF file for NA12877 and 4,011,226 in the GATK output, showing a true positive rate of 99.22%. The GATK VCF file also contains an additional 5,389 data records not present in the original data, leading to a false positive rate of 0.134%. The GATK VCF misses 36,939 data points, with false negative rate being 0.914%.
Following this, we generated two parental genomes within AlleleSeq (Rozowsky et al. 2011) using phased DNA variant calls from PGP (Eberle et al. 2017) for NA12877 and aligned simulated RNA-seq data to each genome individually before selecting the best alignment for each read pair for the two mappings (see Methods). In the ground truth data there are 13,211 heterozygous sites that have at least 20× coverage. Using this alignment approach, the number of heterozygous sites that also have at least 20× coverage was 12,405, with an average coverage of ~149×, and the correlation between the reference allele ratio at these sites for the aligned data versus the ground truth was R 2 =0.962. In total, 161 sites showed an absolute difference in reference allele ratio (versus the ground truth) of greater than 10%, and 72 sites showed a ratio >20% (Supplementary Table 1).

Effects of read trimming and softclipping
Many RNA-seq processing pipelines perform read trimming for adaptors, base quality and polyA tails before alignment. In ASE analysis, this may lead to biased results if alternative alleles at the ends of reads are preferentially trimmed. As such, we compared the effects of read trimming against using soft-clipping within STAR (Dobin et al. 2013) on allele counts at heterozygous sites, and found a much closer agreement with the ground truth for the latter approach. Including read trimming (stringency of 3bp, removing adaptors and terminal bases with Phred qualities lower than 30), the number of heterozygous sites with at least 20× coverage in both aligned and ground truth data was 12,161 and the correlation of reference allele ratios versus the ground truth decreased to R 2 =0.947. Similarly, the number of sites showing an absolute difference in reference allele ratio of >10% and >20% was 252 and 69, respectively (Supplementary Table 1), showing that applying softclipping with STAR is favourable to read trimming in these circumstances.

Effects of local phasing of variants (phASER)
Accurate phasing of genetic variants is required to construct correct parental genomes.
However, it is difficult to accurately phase rare alleles if they are not present in reference datasets. As such, we used the read-aware mode of phASER (Castel et al. 2016) within our pipeline, which improves local phasing by considering whether nearby genetic variants fall on the same or opposite reads (or pairs). Using this approach, we see a marginal improvement in all parameters: 12,415 heterozygous sites have coverage of at least 20× in aligned data, the correlation between reference allele ratios increases slightly to R 2 =0.963 (versus the ground truth), and the number of sites showing an absolute difference in reference allele ratio between ground truth and aligned data decreases slightly to 157 for differences >10% and 68 for differences >20% (Supplementary Table 1).

Effects of recovering multi-mapping reads
In many RNA-seq experiments, reads that do not align uniquely to the reference genome are typically discarded. For ASE analysis, this can lead to biased allele counts at specific loci, if the expression level of one or both alleles is underestimated due to non-unique alignment.
Methods to recover reads that align to multiple locations (from hereon 'multi-mapping' reads) exist, but current ASE detection methods do not incorporate these reads into the analysis. In order to include such reads, we used RSEM (Li and Dewey 2011) within our pipeline to assign a single location for multi-mapping reads based on read depth of uniquely aligned reads (see Methods) for each parental genome, and then incorporated these reads into the final aligned files before selecting the best alignment for each read pair for the two mappings. Applying this approach, we again improve the accuracy of the alignment, with 12,448 heterozygous sites having at least 20× coverage, the correlation of reference allele proportion at the sites between the ground truth and aligned data of R 2 =0.968, and 140 and 46 sites showing an absolute difference in reference allele ratio of >10% and >20%, respectively (Supplementary Table   1).

Supplementary Figures
Supplementary Figure 1. Ground truth data generation for individual NA12877. In Platinum Genomes VCF, variants were verified using multiple sequencing platforms and analysis methods, and conflicting calls resolved using parental genomic information. Platinum Genomes VCF together with a reference genome were used to generate ground truth genomes using AlleleSeq. Ground truth genomes were used to simulate RNA sequencing reads using RSEM. Parameters were obtained from RNA-sequencing reads of LCLs from individuals NA12890 and 12889 that were the actual parents, obtained from the Geuvadis Project. The simulated RNA reads were used to count coverage at each heterozygous site, called ground truth allele counts. Ground truth genomes were also used to simulate WGS using ART. Parameters for this were obtained from HipSci sample HPSI0114i-eipl_1. Simulated WGS were used to obtain variant calls using GATK best practises. This VCF, together with simulated RNA-seq reads, were used for PAC to obtain allelic count data that were compared against ground truth allele counts at heterozygous sites. Figure 2. Distribution of reference allele ratios in ground truth data. All heterozygous sites in ground truth data with 20× coverage are shown. Ratio of 0.5 implies that both alleles are expressed at equal ratios. KDE = kernel density estimate.

Supplementary Figure 3. Gene-wise comparison of ASE and eQTL signals.
Correlation of allelic fold change (aFC) values derived from ASE and eQTL analyses from 670 GTEx whole blood samples. Genes with a significant eQTL (q-value < 5%) and gene-level ASE information for at least 10 individuals were selected. Pearson's correlation coefficients are shown for eQTL versus ASE aFCs derived using standard alignment (A) and WASP (B).

A) B)
Supplementary