Scalable summary-statistics-based heritability estimation method with individual genotype level accuracy
- 1Department of Computer Science, University of California, Los Angeles, Los Angeles, California 90095, USA;
- 2Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts 02115, USA;
- 3Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA;
- 4Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California 90095, USA;
- 5Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California 90095, USA
Abstract
SNP heritability, the proportion of phenotypic variation explained by genotyped SNPs, is an important parameter in understanding the genetic architecture underlying various diseases and traits. Methods that aim to estimate SNP heritability from individual genotype and phenotype data are limited by their ability to scale to Biobank-scale data sets and by the restrictions in access to individual-level data. These limitations have motivated the development of methods that only require summary statistics. Although the availability of publicly accessible summary statistics makes them widely applicable, these methods lack the accuracy of methods that utilize individual genotypes. Here we present a SUMmary-statistics-based Randomized Haseman-Elston regression (SUM-RHE), a method that can estimate the SNP heritability of complex phenotypes with accuracies comparable to approaches that require individual genotypes, while exclusively relying on summary statistics. SUM-RHE employs Genome-Wide Association Study (GWAS) summary statistics and statistics obtained on a reference population, which can be efficiently estimated and readily shared for public use. Our results demonstrate that SUM-RHE obtains estimates of SNP heritability that are substantially more accurate compared with other summary statistic methods and on par with methods that rely on individual-level data.
The exponentially decreasing cost of genotyping and sequencing technologies has led to an increase in the number and size of biobanks (Bycroft et al. 2018; Johnson et al. 2023; Kurki et al. 2023), covering a wide range of populations. With large samples of phenotype and genotype data now available in these biobanks, one of the major analyses often performed is estimating heritability, defined as the phenotypic variance explained by the variance in the genotype (Falconer and Mackay 1996). Heritability estimates in these large data sets have allowed researchers to better delineate the scope of the role genetics play in complex traits, ranging from schizophrenia (Niarchou et al. 2020) to height (Yang et al. 2015), and have assisted investigations into their genetic architectures (Lappalainen et al. 2024). Most heritability estimation methods fit linear mixed models (LMMs) (Yang et al. 2011; Loh et al. 2015a,b) to map the variation in genotypes measured at single-nucleotide polymorphisms (SNPs) to the variation in phenotypes and thereby estimate the SNP heritability, that is, the proportion of phenotypic variance explained by genotyped SNPs. Given the high dimensionality of the genotypes and the large sample sizes of biobanks, fitting or parameter estimation in LMMs is computationally prohibitive. Many methods have been proposed to reduce computational complexity while retaining statistical accuracy (Yang et al. 2011; Zhou and Stephens 2012; Loh et al. 2015a,b; Zhou 2017; Wu and Sankararaman 2018; Pazokitoroudi et al. 2020). These methods, although highly accurate, generally take hours or days to run and require access to individual genotypes and phenotypes.
The rise of large-scale biobanks has also brought increased attention to the issue of genomic privacy owing to a surge in security breaches (Frizzo-Barker et al. 2016; Savatt et al. 2019; Akyüz et al. 2021). Consequently, additional measures have been implemented to safeguard individual information throughout its processing, storage, and sharing (Wan et al. 2022). Gaining access to raw individual-level data is now more challenging, exemplified by the UK Biobank's decision to restrict access to its WGS data to its cloud server (Deflaux et al. 2023). Given these developments, there is a growing preference for summary-statistics-based methods due to their portability and speed, even though they may sacrifice some statistical power compared with methods that use individual-level data (Bulik-Sullivan et al. 2015; Shi et al. 2016; Zhou 2017; Hou et al. 2019; Speed and Balding 2019). Such a loss in statistical power is particularly pronounced in smaller sample sizes and may result in inflated estimates of heritability owing to underestimation of linkage disequilibrium (LD) (Zhou 2017), even if correct reference summary statistics were used.
To address these challenges of heritability estimation in large biobanks, we propose SUMmary-statistics Randomized Haseman–Elston regression (SUM-RHE), by extending our previous work, Randomized Haseman–Elston regression (RHE) (Wu and Sankararaman 2018; Pazokitoroudi et al. 2020), to work exclusively on summary statistics. This adaptation leverages the observation that the trace estimates of the squared genetic relatedness matrix (GRM), which are needed to compute the method-of-moments (MoM) estimator underlying RHE, can be related to population-level parameters. By combining these trace estimates from a reference sample with Genome-Wide Association Study (GWAS) summary statistics from a target sample (consisting of individuals sampled from the same population as the reference sample), we can reconstruct the MoM estimates for the target sample without access to the individual data. In comprehensive simulations across various genetic architectures and scenarios, we show that SUM-RHE estimates are on par with methods that rely on individual-level data, and are substantially more accurate than popular summary statistic-based methods, all while exclusively utilizing summary statistics.
Methods
Background on heritability estimation and LMMs
Early attempts to calculate SNP heritability of complex traits by aggregating SNPs identified as GWAS-significant have revealed
the issue of missing heritability, as this estimate of heritability was significantly lower than the narrow-sense heritability
estimated in other studies (e.g., twin studies) (Manolio et al. 2009). The seminal work by Yang et al. (2010) reduced this discrepancy by jointly modeling all the SNPs, such that their effect sizes come from a distribution of some
fixed variance that quantifies the genetic variation. In this LMM framework, the standardized phenotype vector y is modeled as a linear combination of SNP effect sizes β multiplied by the standardized genotype matrix X of M SNPs and N individuals with uniform noise ϵ:
(1)
where the additive effect sizes β are drawn from an arbitrary distribution D with mean zero and variance of
, and the environmental/noise effects ϵ are drawn from a distribution with variance
. In the original work by Yang et al. (2010) and GCTA (Yang et al. 2011), the distribution D was chosen as a normal distribution. The SNP heritability is then defined as the proportion of genetic
variance over total phenotypic variance,
.
One approach to estimating the variance components
is to find the maximum likelihood estimator (MLE) and its variants, such as restricted maximum likelihood (REML) estimators
(Yang et al. 2011; Loh et al. 2015b). These methods often rely on iterative optimizations, which tend to be inefficient, and could lead to biased estimates owing
to the normality assumption (Zhou 2017; Wu and Sankararaman 2018). On the other hand, MoM approaches such as the Haseman–Elston regression (HE) (Haseman and Elston 1972; Sham and Purcell 2001), RHE (Wu and Sankararaman 2018; Pazokitoroudi et al. 2020), MQS (Zhou 2017), or LDSC (Bulik-Sullivan et al. 2015), only require solving the normal equations and do not make any assumptions on the underlying distribution D. Here, we briefly
discuss the MoM estimator (HE), which sets the foundation for our work.
Heritability estimation from individual genotype data using MoM and randomized MoM
The HE MoM estimator of the parameters
can be obtained by minimizing the discrepancy between the sample covariance
and the population covariance matrices. The population covariance is given as
(2)
where
is defined as the GRM. We want to find the estimates of the parameters
that minimize the Frobenius norm (the measure of discrepancy) between the two covariance matrices. This is equivalent to
solving the following normal equations:
(3)
where tr(K) = N and
, given both X and y are standardized. Equation 3 has the analytical solution for the variance components
:
(4)
giving the MoM estimate for heritability
:
(5)
The biggest bottleneck in the equation is calculating
. An exact calculation of the trace involves forming the matrix
, which has a computational complexity of
. Given M ≈ 1,000,000 and N ≈ 1,000,000, this is not tractable in modern biobanks. One of the main contributions of RHE-reg (Wu and Sankararaman 2018) and RHE-mc (Pazokitoroudi et al. 2020) is the efficient estimation of
by leveraging the fact that the trace of
can be approximated by a stochastic trace estimator (Girard 1989; Hutchinson 1989):
(6)
where zb are independent and identically distributed random vectors such that
and
. In both RHE-reg and RHE-mc, random vectors sampled from the standard normal distribution are used. Numerically, it was found
that using B ≈ 100 can estimate the trace of the squared GRM matrix
with high accuracy for moderate sample sizes of N ≈ 5000 and B ≈ 10 for large samples sizes (Wu and Sankararaman 2018; Pazokitoroudi et al. 2020). This stochastic trace estimator reduces the computational complexity to
. Additional optimizations, such as the mailman algorithm (Liberty and Zucker 2009) and implementation of a streaming version of the algorithm, reduce the computational complexity to
and the memory complexity to
, thus allowing estimation of heritability across millions of SNPs and individuals.
Extensive benchmarking of the RHE methods has shown that their performance is on par with other methods that require individual-level data, such as GCTA or BOLT-REML (Wu and Sankararaman 2018; Pazokitoroudi et al. 2020). RHE offers a distinct advantage over these likelihood-based methods, which tend to scale poorly, as well as other MoM approaches in terms of computational and memory efficiency (e.g., HE) or statistical efficiency (LDSC) (Pazokitoroudi et al. 2020). However, RHE is still limited in that it requires access to individual-level genotype and phenotype data, which restricts its applicability in cases in which such data are unavailable or only GWAS summary statistics are available.
Heritability estimation from summary statistics
In this work, we further extend RHE to work exclusively with GWAS summary statistics. Our key observation is the fact that
the left-hand side (LHS) of the normal equations (Equation 3) is related to the LD in the population (Bulik-Sullivan 2015; Zhou 2017) and not the phenotype. Thus, if we can summarize the trace estimate for a reference sample drawn from a population, we can
use these trace estimates to reconstruct the corresponding trace estimates for a target sample drawn from the same population.
Indeed, we find that the expected value of the trace of
can be related to the LD scores of the SNPs. Furthermore, the RHS of the normal equations can be computed from GWAS summary
statistics obtained on the target population.
The LD score of a variant j is defined as the sum of squared correlation with all the variants,
(7)
then
is
(8)
where R is the M × M correlation matrix (or the LD matrix), and
is the sample correlation between variants j, k. This gives
For large sample sizes (N), we have
where ρjk is the expected correlation or population LD between SNPs j and k. Assuming (Xj, Xk) are normally distributed with mean zero and covariance
, we can use Isserlis’ theorem to compute
. We then have
(9)
Substituting Equation 9 into Equation 8,
(10)
where
is the expected or population-level LD score associated with SNP j, whereas ρ can be interpreted as the average (expected) LD across all SNPs in genotype X.
Let
denote the GWAS effect size estimates for SNP j obtained by linear regression. Given the observed count Nj of the SNP, we have
because the genotypes are standardized. The standard error of the GWAS estimate is
. Thus, if we define the vector of adjusted z-scores,
we have that
(11)
Substituting Equations 10 and 11 into Equation 5 gives us
(12)
where
. Given ρ,
can be computed using summary statistics using Equation 12. However, computing ρ requires the exact κ = tr(K2), which is computationally intractable. Instead, we approximate ρ by using the stochastic trace estimates
described in Equation 6, such that
. This gives us the randomized MoM estimator of h2 that can be calculated from summary statistics:
(13)
We propose releasing
as “trace summaries,” which can then be combined with phenotype-specific GWAS summary statistics z to estimate heritability. The estimator in Equation 14 assumes that the
was computed on the same genotypes used to generate the GWAS summary statistics. In settings in which
cannot be computed on the same genotypes, we can use
computed on a reference genotype data set drawn from a population that is similar to the population that was used to generate
GWAS summary statistics (such as The 1000 Genomes Project). This is under the assumption that the LD structures of similar
populations will also be related.
Estimating standard errors
To calculate the standard error of our estimator, we perform SNP-level block jackknife resampling, as done in RHE-mc (Pazokitoroudi et al. 2020). When generating the trace summaries with individual genotypes, we report the
values estimated from jackknife subsamples. Excluding the same SNPs from the PLINK GWAS summary statistics, we can compute
the denominator in Equation 14,
, to get the jackknife replicate of
:
Following the execution on all SNP blocks, we employ jackknife resampling to obtain SE estimates:
(14)
Results
Simulations under varied genetic architectures
We assessed the performance of SUM-RHE against methods that require individual genotypes—RHE (RHE-mc run with a single component), GCTA-GREML, and BOLT-REML—and methods that can work with summary statistics—LDSC and SumHer (Speed and Balding 2019)—on the task of estimating genome-wide heritability. We applied all methods to unrelated White British individuals genotyped on M = 454,207 common SNPs (MAF > 0.01 excluding SNPs in the MHC region) typed on the UK Biobank Axiom array. Because of the computational scalability of GCTA-GREML and BOLT-REML, we tested these and other methods in a small-scale setting in which the number of individuals in the target data set was set to Ntarget = 10,060 (we term this the 10k sample). In addition, we compared all the remaining methods in a large-scale setting in which the number of individuals in the target data set was set to Ntarget = 50,112 (termed the 50k sample). For the summary statistic methods, the GWAS summary statistics were computed on the target data sets. These methods also require population statistics, calculated in the remainder of the N = 291,273 unrelated White British individuals as the reference data set. For the small-scale simulation, the reference set has Nref = 281,213, and for the large-scale simulation, we set Nref = 241,161. On each of the reference sets, we generated SUM-RHE trace summary statistics, LDSC reference LD scores, and LDAK SNP taggings.
We then simulated phenotypes corresponding to nine different genetic architectures:
and causal ratio P = 1.0, 0.1, 0.01 (where the causal ratio represents the proportion of variants with non-zero effects), each with 100 replicates.
Table 1 summarizes the inputs for each method: For the calculation of LDSC LD scores, we used the entire Nref reference sample with a window size of 2 Mb. SUM-RHE trace summaries were calculated by aggregating the trace estimates of
25 runs on the reference set with B = 100 (equivalent to stochastic trace estimation with
random vectors) (for more information, see Supplemental Fig. S2) and 1000 jackknife blocks, yielding a single trace summary statistic with 1000 jackknife estimates of
. SumHer was run assuming the GCTA model to calculate the SNP taggings (consistent with the genetic architecture assumed in
our simulations). RHE was run with B = 100 and 1000 jackknife blocks as well. BOLT-REML/GCTA-GREML was run with default parameter settings. GWAS summary statistics
for each simulated phenotype were generated using PLINK 2.0. Figure 1 summarizes the heritability estimates on the target data. Across the 18 different settings (genetic architectures and sample
sizes) we tested, we found that the accuracy of SUM-RHE was comparable to RHE (Fig. 1) with the mean-squared error of SUM-RHE close to one relative to RHE, despite relying only on the summary statistics (Fig. 2; for the MSE of each method, see Supplemental Fig. S1).
Comparison of SNP heritability estimates across methods. SUM-RHE heritability estimates are comparable to those from RHE or BOLT-REML/GCTA-GREML and are significantly more accurate than those of LDSC and SumHer. Because of computational limitations, BOLT-REML/GCTA-GREML was not run on the N = 50k simulations. Dashed hatches denote individual-level data methods.
Mean squared error (MSE) of heritability estimates of each method relative to RHE. The dot and error bar denote the relative MSE and the 95% CI calculated based on bootstrap resampling (using 10,000 bootstrap samples), respectively. Although the MSE of SUM-RHE is within ±5% of the MSE of RHE, the MSE of LDSC ranges from 245% to 472%, whereas the MSE of SumHer-GCTA ranges from 92% to 320%. (Diamond) BOLT-REML and GCTA-GREML have relative MSE in the range of 80% and 106% for the 10k samples and were not benchmarked on the 50k samples.
Inputs for the different methods evaluated
SUM-RHE has substantially improved accuracy over other summary-statistic-based methods: LDSC exhibits MSE ranging from 244% to 478% relative to that of SUM-RHE (mean: 356%), whereas SumHer-GCTA has MSE ranging from 94% to 331% relative to that of SUM-RHE (mean: 167%). SumHer-GCTA has lower MSE than SUM-RHE for low heritability (h2 = 0.1) and high polygenicity (causal ratio, P = 1.0). The improvement in MSE is particularly pronounced with smaller sample sizes (N = 10,060).
We also tested the calibration of SUM-RHE by simulating phenotypes with
and testing the hypothesis of
with a rejection threshold of α = 0.05 for both N = 10,060 and N = 50,112 to find that SUM-RHE is well calibrated (Table 2).
Calibration of the methods
Simulations with a mixture model
We further test the robustness of our method by simulating phenotypes with a combination of large- and small-effect SNPs.
We selected the first π = 0.05 of the SNPs to account for γ = 0.25 of the total SNP heritability, whereas the rest of the SNPs accounted for the
remainder. The causal SNPs were then selected at random with fixed probability α, such that the effect sizes for the large-effect SNPs were sampled from a different distribution than for the small-effect
SNPs. Specifically, the effect sizes were sampled from two distributions:
Figure 3 shows the boxplots of estimates from LDSC, SumHer-GCTA, SUM-RHE, and RHE, whereas Figure 4 plots the SE and MSE of the three summary-statistics methods (LDSC, SumHer-GCTA, SUM-RHE) relative to that of RHE. We observe that both LDSC and SumHer-GCTA show larger MSEs than SUM-RHE, similar to our previous simulations. LDSC has a MSE in the range of 266% to 444% relative to that of SUM-RHE (mean: 348%), and SumHer has a MSE in 102% to 304% (mean: 151%). These results indicate that SUM-RHE is robust under the mixture model simulations and attains accuracy comparable to individual-level methods across the scenarios we tested.
Comparison of SNP heritability estimates across methods on simulations with mixtures of large and small genetic effects.
Comparison of MSE and SE of the summary-statistics-based methods against RHE on simulations with mixtures of large and small genetic effects. Here we report the relative MSE of the five methods on the mixture-of-effect simulations. Their performances are similar to those in the previous simulations (Fig. 2). SUM-RHE has an MSE within ±5% relative to RHE.
Runtime measurements
We compared the runtime of SUM-RHE to other methods (Table 3). The heritability estimation step for all the summary statistic methods is computationally efficient irrespective of sample size. The generation of the reference statistics will depend on the size of the reference data set but is typically a one-time computation that is relatively efficient even for data sets with hundreds of thousands of individuals with access to a compute cluster.
Runtime estimates of the six methods
Application to traits in the UK Biobank
Finally, we applied three of the summary statistic methods as well as one of the scalable individual genotype-based method (RHE) to real UK Biobank phenotypes measured on N = 291,273 unrelated White British individuals paired with genotypes assayed on 454,207 common array SNPs (we ran SumHer with both GCTA and LDAK SNP taggings for real phenotypes). Here we plot the 15 quantitative traits with the highest z-scores of SUM-RHE heritability estimates, from overall health (z = 34.3) to albumin (z = 17.9), ordered by the heritability estimates. For the summary-statistics-based methods (LDSC, SUM-RHE, SumHer-LDAK/GCTA), we use in-sample statistics.
As expected, SUM-RHE has estimates that agree well with RHE estimates (Fig. 5). SUM-RHE estimates tend to lie in between those from LDSC and SumHer (with both GCTA and LDAK SNP taggings), consistent with our previous work (Pazokitoroudi et al. 2020).
Discussion
Here we propose a summary-statistics-based heritability estimation method, SUM-RHE, that has performance comparable to that of individual genotype-based methods. SUM-RHE is accurate, fast, and highly portable. It uses a trace summary statistic calculated by aggregating stochastic trace estimates and PLINK GWAS statistics. In the era of large biobanks, SUM-RHE will be a useful tool in estimating heritability while maintaining the privacy of the patients.
We conclude with a discussion of limitations and directions for future work. First, heritability estimates from SUM-RHE are accurate under the assumption that the summary statistics are free of confounding owing to population stratification and cryptic relatedness. In a setting in which the summary statistics are affected by confounders, LDSC could potentially be more robust (as confounding would affect the intercept of LDSC whereas the slope would provide a robust estimator of heritability). Second, our preliminary experiments suggest that SUM-RHE retains its accuracy even when reference trace estimates are computed using a smaller number of random vectors on a smaller number of individuals (as low as Nref = 30k and B = 1000) (see Supplemental Fig. S2). These results suggest that the computation of trace summaries can be even more efficient. Third, SUM-RHE is not applicable to the setting of MAF and LD-dependent architectures, nor does it estimate partitioned or local heritability. These applications will require computation and release of partitioned trace summaries. We view this as a promising direction for future work.
Software availability
SUM-RHE source code is available at GitHub (https://github.com/sriramlab/SUMRHE) and as Supplemental Code. Access to the UK Biobank data (genotype and phenotypes measured at baseline) requires an approved application. Details on the application and approval process can be found at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.
Competing interest statement
The authors declare no competing interests.
Acknowledgments
This research was conducted using the UK Biobank Resource under application 33127. We thank the participants of UK Biobank for making this work possible. This work was supported by the National Institutes of Health (GM125055, HG006399) and the National Science Foundation (CAREER-1943497).
Author contributions: S.S. conceived and supervised the project. M.J., A.P., and S.S. developed the methods. M.J. and S.S. wrote the manuscript. M.J. and Z.L. wrote the software code and performed the analyses. All authors read, reviewed, and approved the final manuscript.
Footnotes
-
[Supplemental material is available for this article.]
-
Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279207.124.
-
Freely available online through the Genome Research Open Access option.
- Received February 26, 2024.
- Accepted July 12, 2024.
This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
















