A spectral component approach leveraging identity-by-descent graphs to address recent population structure in genomic analysis

  1. Eimear E. Kenny1,2,8
  1. 1Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA;
  2. 2Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA;
  3. 3Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA;
  4. 4Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA;
  5. 5Department of Neurology, University of California Los Angeles, Los Angeles, California 90095, USA;
  6. 6Department of Human Genetics, University of California Los Angeles, Los Angeles, California 90095, USA;
  7. 7Department of Computational Medicine, University of California Los Angeles, Los Angeles, California 90095, USA;
  8. 8Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA
  • 9 Present address: Gencove, New York, NY 10016, USA

  • Corresponding author: ruhollah.shemirani{at}mssm.edu
  • Abstract

    Population structure is a well-known confounder in statistical genetics, particularly in genome-wide association studies (GWAS), in which it can lead to inflated test statistics and spurious associations. Traditional methods, such as principal components (PCs), commonly used to adjust for population structure, are limited in capturing fine-scale, nonlinear patterns that arise from recent demographic events, patterns that are crucial for understanding rare variant effects. To address this challenge, we propose a novel method called spectral components (SPCs), which leverages identity-by-descent (IBD) graphs to capture and transform local, nonlinear fine-scale population structure into continuous representations that can be seamlessly integrated into genetic analysis pipelines. Using both simulated data sets and empirical data from the UK Biobank (N ≈ 420,000), we demonstrate that SPCs outperform PCs in adjusting for fine-scale population structure. In simulations, SPCs explain >90% of the fine-scale population structure with fewer components, whereas PCs capture <5%. In the UK Biobank, SPCs reduce the inflation of P-values in the GWAS of an environmental-driven phenotype by 12% compared with PCs, while maintaining a similar performance to PCs in height, a highly heritable phenotype. Additionally, SPCs improve rare variant association analyses, reducing genomic inflation (e.g., from 7.6 to 1.2 in one analysis), and provide more accurate heritability estimates. Spatial autocorrelation analysis further confirms the ability of SPCs to account for environmental effects, reducing Moran's I for both environmental and heritable phenotypes more effectively than PCs. Overall, our findings demonstrate that SPCs provide a robust, scalable adjustment for recent population structure, offering a powerful alternative or complement to PCs in large-scale biobank studies.

    Population structure is a critical factor in statistical genetic studies, as it can confound results if not properly addressed (Price et al. 2006; Lawson et al. 2020). This structure arises from genetic and environmental differences across populations, shaped by historical processes such as migration (Gravel et al. 2013), genetic drift (Nei and Tajima 1981), selection (Berg et al. 2019; Sohail et al. 2019; Mathieson et al. 2023), and assortative mating (Border et al. 2022). These processes create systematic differences in variant frequencies within and between populations that can violate key statistical assumptions in genetic analyses (Price et al. 2010; Sohail et al. 2019). For example, population structure can violate the assumption of independence among individuals in genetic analyses, such as genome-wide association studies (GWAS), leading to false-positive associations and inflated test statistics (Cook et al. 2020). To mitigate these effects, genetic analysis pipelines often use principal components (PCs) (Price et al. 2006) and/or generalized linear mixed models (GLMMs) (Kang et al. 2010). PCs represent major axes of genetic variation derived from dimensionality reduction of a genotype matrix, or the genetic relatedness matrix (GRM), which stores pairwise genetic similarity based on variants. GLMMs complement this approach by simultaneously accounting for population structure and cryptic relatedness by modeling a random effect whose covariance structure is defined by the estimated GRM. Both PC- and GLMM-based strategies have been successfully used to address population structure in analysis involving common variants. However, it is less clear whether these methods will be similarly effective for rare variants and more complex fine-scale population structures (Mathieson and McVean 2012; Privé et al. 2020; Zaidi and Mathieson 2020).

    Large-scale sequencing studies have revealed patterns of recently arisen rare variants and have shown that they tend to be geographically localized (Gravel et al. 2011; Fu et al. 2013). This fine-scale population structure impacting rare variants has been detected in several populations, including Icelandic (Helgason et al. 2005), British (Cook et al. 2020; Nait Saada et al. 2020; Gilbert et al. 2022), and Japanese (Sakaue et al. 2020) populations, revealing distinct genetic patterns shaped by historical movements over the past few thousand years. However, these studies also highlight a critical challenge: Traditional methods to account for population structure struggle to capture this discrete, nonlinear population structure occurring within recent timescales. The nonlinear relationships and sparsity in the GRM introduced by rare variants can make rare variant PCs sensitive to noise and violate the linear assumptions of PCA, complicating the interpretation of results (Baye et al. 2011; Ma and Shi 2020; Zaidi and Mathieson 2020). Likewise, incorporating rare variant information in GLMMs introduces several limitations, including higher computational costs and reduced power compared with common variant counterparts (Lee et al. 2014; Wainschtein et al. 2022). These findings underscore the need for novel analytical methods tailored to the unique properties of rare variants in statistical genetic analyses, enabling more accurate modeling of fine-scale population structure and the complex genetic architecture they reveal.

    To address these limitations, estimates of pairwise haplotypes shared identical-by-descent, or identity-by-descent (IBD), offer a more precise approach for detecting recent fine-scale population structure in large genomic data sets (Shemirani et al. 2021). The length of shared IBD segments provides a direct measure of evolutionary time; longer segments indicate more recent population structure, typically within the past dozen generations (Browning and Browning 2015). At the population level, the aggregated lengths of IBD segments shared between all pairs of individuals across the genome can be utilized to construct graphs of genetic similarity, in which individuals are represented as nodes, and shared IBD haplotypes are represented as edges (Belbin et al. 2021). Previous studies have shown the advantages of utilizing IBD data to represent recent population structure through the extraction of clusters of individuals who share IBD haplotypes across the genome (Dai et al. 2020; Belbin et al. 2021; Caggiano et al. 2023). These IBD-based clusters reveal population structure missed by traditional methods, leading to differential associations with health outcomes, unique enrichment of casual rare variants, and distinctive performance of polygenic risk scores compared with clusters identified with common variants alone (Belbin et al. 2021; Johnson et al. 2022).

    Here, we introduce spectral components (SPCs), a graph theory-driven approach that transform the discrete, localized signals of recent population structure, represented by an IBD graph, into a continuous representation of genetic similarity that can be used in statistical genetic analysis pipelines. Our method leverages graph Laplacian transformations (Grone et al. 1990), a fundamental concept in computational fields, such as dimensionality reduction (Kunegis et al. 2010), graph-based machine learning (Si et al. 2014), and network analysis (Shuman et al. 2013), owing to their ability to encode graph topology and smooth representations of variations in node features across the graph (Shuman et al. 2013). Previous work has applied spectral approaches to common variant data and demonstrated they can be more robust than PCs in finding hidden population structure compared with GRMs (Lee et al. 2010). In this study, we use simulated and empirical data from the UK Biobank project to explore the efficacy of SPCs in adjusting for the effects of population structure compared with traditional methods. We also examine the performance of these approaches in rare variant analysis under environmental confounding. We demonstrate how SPCs extract patterns of population structure that are resilient to the presence of discrete and localized structure in both variant and environmental association, via leveraging graph spectral decomposition.

    Results

    Overview of SPC calculation

    Previous work has demonstrated the efficacy of IBD graphs in capturing fine-scale population structure signals absent from GRMs (Zaidi and Mathieson 2020; Caggiano et al. 2023; Browning and Browning 2024). We developed a pipeline to extract individual frequency components underlying a graph of IBD sharing at a population level, or SPCs (Fig. 1). SPCs were calculated in four steps. First, we phased genotype data and estimated IBD using iLASH (Shemirani et al. 2021). We then aggregated the total sums of IBD sharing for each pair of individuals, which were then utilized to generate an undirected, unweighted relatedness graph, with each participant represented as a node. IBD relatedness in this graph was represented as connecting edges between respective nodes, if their sum of IBD sharing surpassed a minimum threshold of 6 cM. Third, we generated an adjacency matrix representation of this graph. Lastly, we calculated the spectral components of this matrix as SPCs. To conduct comparative analysis, we further calculated alternative PCs of this graph, along with the PCs of the original genotype data, as well as rare variants.

    Figure 1.

    General schema for calculating SPCs. First, phased genotypes are used to estimate segments of genome shared between individuals IBD. Second, the IBD segments are utilized to generate a relatedness graph with vertices representing participants and edges representing aggregated IBD sharing between pairs of participants above a threshold of 6 cM genome-wide. Third, the IBD relatedness graph is transformed into an adjacency matrix (A). Finally, the adjacency matrix A is transformed into its Laplacian form. The eigenvectors corresponding to the smallest eigenvalues of this matrix comprise the set of SPCs.

    Evaluating performance to detect fine-scale population structure on a simulated data set

    To evaluate the performance of SPCs for capturing fine-scale population structure in genomic data sets, we first simulated a population, following the method of Zaidi and Mathieson (2020). We simulated a five-by-five grid of 25 demes with a migration rate among them that emulates a homogeneous population structure (Fig. 2A). This configuration closely resembles the recent fine-scale population structure observed in the White British cohort in the UK Biobank (see Methods). By simulating this nuanced structure, we aimed to evaluate how well SPCs can discern subtle genetic differences that are often overlooked by traditional methods, thereby providing a more detailed understanding of population stratification.

    Figure 2.

    Reconstruction of recent population structure by covariates. (A) A schematic of our simulation of a cohort with homogeneous population structure with 25 demes and constant migration (5% per generation) between neighboring demes. All demes coalesce to single ancestral deme 200 generations ago. Consequently, simulated individuals have similar genetic backgrounds in terms of common variants, while simultaneously harboring distinct structure in terms of rare variation and environment of origin. (B,C) PC and SPC representation of the shared genetic ancestry among simulated samples. Dots in either panel represent samples colored based on the deme of origin.

    The simulated data set included 50,000 diploid samples. Only the Chromosome 1 was simulated for computational tractability. We calculated PCs of common and rare variants. We then downsampled the data set to retain only the common variants, phased the data to replicate phasing artifacts, and estimated aggregated IBD sharing using iLASH.

    Compared to PCs, SPCs demonstrate significant improvements in capturing population structure (Fig. 2B,C). We used correlation with deme of origin as a basic measurement of the ability of SPCs to dissect fine-scale population structure. We indexed demes based on their order on horizontal and vertical axes to generate coordinates of origin for each simulated sample, akin to place of birth for an individual. The first 100 PCs explain 4% and 5% of the variation in the vertical and horizontal coordinates, respectively, of the deme of origin for simulated samples. However, the first 10 SPCs were sufficient to explain 90% of the variation in the same categories of data. This is reflected in better visual separation between samples based on their deme of origin.

    Evaluating performance for disentangling environmental and genetic effects

    To evaluate the performance of SPCs as covariates in genomic analysis pipelines, we simulated seven continuous phenotypes (Fig. 3A), of which two were nonheritable and five had different heritability schemes. The first, referred to as “environmental smooth,” was a continuous phenotype in which variation solely originated from indirect environmental effects. The mean value of the phenotype decreased from north to south (Fig. 3A), and the effects of linear dependencies on deme of origin among the mean values of outcomes were observed. The second, which Zaidi and Mathieson (2020) called “environmental sharp,” aimed to investigate the effects resulting from the invalidity of common linear assumptions (Fig. 3A). For this phenotype, a single target deme had a phenotype distribution of Formula with a nonzero μ, whereas other demes had a phenotype distribution of Formula; mimicking nonlinear indirect environmental effects. Because of variation stemming only from environmental effects, the expected narrow-sense heritability of both environmental phenotypes simulated is zero.

    Figure 3.

    Comparison of PCs and SPCs as covariates to correct for the effects of population structure on the outcomes in simulations. (A) The schema of simulated phenotypes. First is the environmental smooth phenotype, in which the values are determined based on the vertical coordinate of origin for each individual. The phenotype values are drawn from a normal distribution with shared standard deviation. The mean of the distribution in each row is lower than the row above, resulting in a difference of twice the standard deviation between the top and the bottom row. Second, the environmental sharp phenotype is determined by the deme of origin. The phenotype values for all demes other than D6 are drawn from a zero-centered normal distribution, whereas in D6, phenotype values are drawn from a normal distribution with the mean set to twice the standard deviation. Finally, the two polygenic phenotypes were determined by assigning effect sizes to causal variants, randomly selected from windows of 10,000 bp, and calculating the polygenic score based on the effect sizes. Effect sizes were drawn from a random normal distribution with a mean of zero and a variance derived from minor allele frequency, heritability, and selective pressure. σs determines the expected heritability of the phenotype, whereas α determines the selective pressure, with negative values resulting in higher effect sizes assigned to variant with low minor allele frequency. Tested values for σs were 0.8 for the high heritability phenotype and 0.3 for the medium heritability phenotype. α was set to −0.5 across both experiments. (B) Proportion of variation in the phenotypes explained by the first 25 PCs and SPCs in four categories of simulated phenotypes. Adjustment methods are listed on the x-axes, and the y-axes represent proportion of variance explained by each model. The error bars represent the standard deviation. (C) Inflation of P-values in a GWAS analysis of simulated phenotypes using common variants only. (D) Inflation of P-values in a GWAS analysis of simulated phenotypes using rare variants (10 < MAC < 100).

    The next three phenotypes, on the other hand, were heritable (h2 = 0.1, 0.3, 0.8) and were affected only by direct genetic effects and random noise. These phenotypes, which we refer to as “polygenic phenotypes,” are modeled through the randomized selection of causal variants with effect sizes assigned based on their allele frequency and the total heritability of the trait. This approach of simulating effect sizes results in negligible correlation with the deme of birth, as it does not allocate a high impact to common variants that undergo significant drift, subsequently impacting mean phenotype values per deme (Schoech et al. 2019). Absence of such confounding effects eliminates both the biases described earlier and the requirement to adjust for them. Thus, fixed-effect covariates, such as PCs, are not effective in addressing possible confounding effects for this phenotype. Instead, linear mixed models can be used in such scenarios to account for variance components induced by genetic effects (Price et al. 2010).

    The last two phenotypes, henceforth referred to as “hybrid” phenotypes, combine environmental and polygenic phenotypes via replacing the random noise components of the polygenic phenotype with environmental phenotypes. The hybrid smooth phenotype is generated via adding the polygenic phenotype with a heritability of 0.3 and the environmental smooth phenotype (reweighted to comprise 0.7 of the variance of the phenotype). The hybrid sharp phenotype is generated similarly using the environmental sharp with the same weighting scheme. The environmental category was designed to evaluate a type-1 error rate, whereas the polygenic phenotypes assess power. Finally, the hybrid phenotypes evaluate how well calibrated each model is in the presence of both background polygenic effects and confounding environmental effects.

    To measure the efficacy of SPCs in adjusting for environmental effects on phenotypes, we used proportion of variation in the phenotype explained by covariates (PVE) (Fig. 3B; Supplemental Table 1). For the smooth phenotype, the first 25 SPCs accounted for 83% of the PVE, whereas the same number of PCs explained only 0.7%. For the sharp phenotype, the PVE for both models decreased, although the SPCs still explained 40-fold more environmental effects than PCs (2.5% of PVE vs. 0.06%). SPCs achieved their lowest PVE in the analysis of the heritable phenotypes with no environmental effects from recent population structure, in which both models explained <0.5% in the variation of the phenotypes across all heritability models (Supplemental Fig. 1). PVE values from the PC-based model were slightly higher than the SPC-based models (0.1% vs. 0.4%, respectively). Analyzing the hybrid smooth phenotypes as outcomes resulted in a PVEs of 58% and 0.7% for the SPC-based and PC-based models, respectively. PVEs for the hybrid sharp phenotype were 1.7% and 0.2% for the two models, respectively (Supplemental Fig. 1). Overall, these simulations demonstrate that SPCs more effectively capture environmental effects compared with PCs, while not significantly impacting estimates of true heritable effects.

    Evaluating performance for GWAS

    To evaluate how SPCs perform in the context of GWAS, we conducted GWAS using all seven phenotypes, focusing on the distribution of Z-scores and the inflation of P-values as a measure of the unadjusted effects of population structure (Fig. 3C; Devlin and Roeder 1999; Yang et al. 2011). In the GWAS of common variants for the smooth phenotype, the genomic inflation factor (denoted by λgc) was 1.101 (confidence intervals [CI]: 1.084–1.118) when including SPCs in the model compared with 5.410 (CI: 5.305–5.517) when including PCs in the model, indicating that SPCs were better at controlling for population structure. The difference in performance in the sharp model was less pronounced, although statistically significant, at 1.102 (CI: 1.066–1.138) for the PC-based models and 1.037 (CI: 1.019–1.054) for the SPC-based ones. This pattern was not observed in the inflation of P-values in the GWAS of the polygenic phenotype, in which the credible intervals for the λgc overlapped each other for all heritability rates (Supplemental Fig. 2C). In the GWAS of hybrid phenotypes, in which the environmental noise has nonrandom structure, λgc was significantly higher in the PC-adjusted model (λgc = 4.498 [CI: 4.355–4.648]) compared with the SPC-adjusted model (λgc = 1.581 [CI: 1.524–1.635]) for the hybrid smooth phenotypes (Supplemental Fig. 3C). λgc for both models increased compared with the scenario with same narrow-sense heritability (h = 0.3) and random environmental noise. Although SPCs had a lower λgc in the GWAS of the hybrid sharp phenotype (λgc = 1.321 [CI: 1.287–1.355]) compared with PCs (λgc = 1.365 [CI: 1.334–1.396]), their CIs overlapped.

    To better understand the impact of population structure in the analysis of the polygenic phenotype, we stratified variants into causal and noncausal, referring to variants neither truly associated with the phenotype nor in linkage disequilibrium (LD) with causal variants. We assumed that any systematic differences in calculated P-values or effect sizes of noncausal variants in between the two models are likely owing to population structure rather than true genetic association. Among the noncausal variants, P-values calculated using PCs were consistently lower (Wilcoxon P-value: 1.54 × 10−53), which suggests they are more affected by confounding. However, the mean effect size estimates for these variants were higher for the SPC models, suggesting the lower P-values in the PC model were not owing to higher power but may rather reflect differences in how population structure is accounted for. For causal variants and variants in LD with causal variants, there were no differences in estimated effect sizes using SPCs and PCs.

    We used the hybrid phenotypes to evaluate the calibration of each model's effect size estimate against the ground-truth simulated effects. Looking exclusively at estimated effect sizes for variants in high LD (r2 > 0.1) with the causal variants in the GWAS of medium heritability phenotype, both models showed similar accuracy in estimating the true simulated effect sizes. (Supplemental Fig. 4A). Using the same effect sizes along with nonrandom environmental noise, in the form of the hybrid smooth phenotype, resulted in drastically lower accuracy in the PC-adjusted model, while not significantly affecting the accuracy SPC-adjusted model (Supplemental Fig. 4B). Conversely, when looking at variants in low LD (r2 < 0.1) with the causal variants, under random environmental noise, the PC-adjusted model found two false-positive variants with a P-value lower than the multiple-testing threshold (P-value < 5 × 10−8), whereas the SPC-adjusted model found one. After adding structured noise in the hybrid smooth models, the SPC-adjusted model highlighted the same variant, whereas the PC-adjusted model highlighted 1874 false-positive variants that passed the multiple testing threshold.

    To explore the efficacy of SPCs in adjusting for confounding in rare variant analysis, we conducted GWAS of rare variants with all simulated phenotypes. The GWAS of rare variants showed similar λgc trends to the GWAS of common variants (Fig. 3D) across all phenotypes, with the performance gap decreasing because of lower power, with the exceptions of smooth and hybrid smooth phenotypes. In GWAS of the smooth phenotype, λgc of the PC model increased significantly to 7.576 (CI: 7.514–7.639), whereas λgc of the SPC model exhibited a modest increase to 1.119 (CI: 1.114–1.124). Similarly, in the analysis of the hybrid smooth phenotype, we estimated λgc values of 5.480 (CI: 5.336–5.630) and 1.162 (CI: 1.157–1.167) for the PC-adjusted and the SPC-adjusted models, respectively.

    As an additional analysis, we used the environmental phenotypes with no underlying causal genetic variation to calculate false-positive rates in the GWAS of common and rare variants. In the GWAS of the smooth phenotype, adjustment using PCs resulted in 1918.9 (std: 22.38) false-positive variants passing the Bonferroni multiple testing threshold (5 × 10−8) in the analysis of common variants and 12,042.35 (std: 71.32) false positives in the analysis of rare variants. Adjustment by SPCs resulted in 3.6 (std: 4.86) and 0.3 (std: 0.56) false positives in the analysis of common and rare variants, respectively, suggesting that the SPC method has significantly lower type I error rates in the presence of population structure confounding. These results demonstrate the robustness of SPC adjustment to both rare and common allele frequency variants.

    Comparison against alternative strategies of covariate extraction

    We next compared SPCs against alternative strategies to calculate fixed-effect covariates to adjust for recent population structure. Previous work by Zaidi and Mathieson (2020) demonstrated that PCs derived from rare variants can partially alleviate the limitations of common PCs. In our simulations, SPCs outperformed PCs of rare variants in terms of both predictive performance (R2 score) and decreasing the inflation of P-values. Although PCs of rare variants had a higher R2 score with the environmental phenotypes compared with PCs of common variants, their performance was consistently lower than that of SPCs (Supplemental Fig. 1; Supplemental Table 1).

    Additionally, PCs derived from the IBD relatedness matrix were previously used to improve representation of recent population structure compared with PCs in simulations (Zaidi and Mathieson 2020). In line with those findings, we explored alternative strategies for the calculation of continuous covariates using the IBD relatedness matrix for comparison with SPCs. We calculated 11 additional sets of covariates from IBD relatedness data, comparing PCs to spectral components and exploring a range of possible minimum IBD sharing thresholds in the generation of the relatedness matrix (6 cM, 10 cM, and 15 cM). We found SPCs, calculated using a threshold of 6 cM, to be among the best performers in terms of PVE of simulated phenotype (Supplemental Fig. 1). We also repeated the SPC-adjusted GWAS using two higher IBD sharing thresholds (10 cM and 15 cM) and found that the 6 cM threshold yielded λgc values that were equal or lower than those of higher thresholds (Supplemental Fig. 5). Overall, our results demonstrate that SPCs provide a more accurate adjustment for recent population structure than alternative covariate strategies (for further details, see Supplemental Material).

    Evaluating performance in narrow-sense heritability estimation

    Finally, we evaluated the impact of SPCs on estimating narrow-sense heritability, the proportion of phenotypic variance that is owing to a genetic component (Yang et al. 2017), in which adjustment of phenotype values using PCs is a common measure to lower the inflation of estimates owing to population structure (Abdellaoui et al. 2019; Cook et al. 2020). We calculated the narrow-sense heritability of the smooth phenotype using the GREML method implemented in the GCTA software package along with a GRM calculated based on common variants. This resulted in a heritability estimate of 1.00, indicating the phenotype is fully heritable genetically (Supplemental Fig. 6). Adjustment using PCs of rare variants resulted in a heritability estimate of 0.45. However, adjustment using SPCs resulted in a 12-fold decrease of the estimate to 0.08, suggesting a significantly lower genetic heritability component, which is in line with the nonheritability of the phenotype. This illustrates the improved effectiveness of SPCs against PCs when analyzing phenotypes with strong environmental components.

    To test for overcorrection of genetic factors that lead to the underestimation of narrow-sense heritability (Evans et al. 2018), we calculated the heritability of the polygenic phenotypes. Across all heritability values, the estimated CIs of narrow-sense heritability for both approaches overlapped with each other (Supplemental Table 2). Both approaches yielded heritability estimates similar to the unadjusted heritability estimate, demonstrating that the mixed effects model deployed by GCTA adequately addresses population structure in these scenarios. We simulated another heritable phenotype with high heritability (0.8) and lower polygenicity to explore the effects of lower polygenicity on heritability estimates. There, uncorrected estimated heritability for this phenotype was 0.43 (std: 0.04). The estimated heritability after correcting for SPCs did not significantly change (0.42, std: 0.04). However, correcting for PCs decreased the heritability estimate to 0.36 (std: 0.04) (Supplemental Fig. 7). Together, this suggests that SPCs have advantages over PCs in accurately estimating heritability in the context of population stratification.

    Measuring the effects of upstream noise

    Artifacts of noise and errors in IBD estimation pipelines, including phasing switch errors, false-positive IBD segments, missing IBD segments, and incorrect IBD segment boundaries, negatively affect the quality of IBD estimation and, consequently, global sharing estimates. For example, we previously showed that the boundaries of IBD segments might be underestimated (Shemirani et al. 2021). We measured the effects of these artifacts on the performance of SPCs in terms of PVE by comparing SPCs derived from inferred IBD (estimated SPCs) against SPCs calculated using ground-truth IBD data extracted directly from the simulated ancestral recombination graphs (ARGs; cf. Methods). We calculated the PVE in the hybrid smooth phenotype by the first 25 PCs, SPCs, and ground-truth SPCs (Supplemental Fig. 8). Although ground-truth SPCs outperformed the estimated SPCs, their performance was qualitatively similar, with PVEs of 0.59 versus 0.58 for ground-truth and estimated SPCs, respectively, suggesting that upstream artifacts do not cause serious hindrance to the performance of SPCs.

    Analysis of the UK Biobank

    We then sought to evaluate the performance of SPCs in real data using genetic data from individuals with self-reported White British ancestry in the UK Biobanks (UKBB; N ∼ 420,000). We focused on three continuous phenotypes (Fig. 4A): easting coordinates of birthplace (eastings), which display a linear geographical cline and have negligible true heritability; body fat percentage (BFP), which is established to have medium heritability (h2 ∼ 0.3); and height, which is established to have high heritability (h2 ∼ 0.7) (Karczewski et al. 2024). In terms of proportion of variance explained, we saw similar results as the simulated data, with SPCs outperforming PCs for all phenotypes (Fig. 4B). The difference was particularly significant for eastings. The GREML narrow-sense heritability estimates revealed similar patterns, although the gap in estimates was smaller compared with the simulated analysis. The heritability analysis for height and BFP resulted in the same estimates in both models (0.7 ± 0.01 for height, ∼0.31 ± 0.01 for BFP), whereas the estimates for eastings were 0.63 (std: 0.01) and 0.04 (std: 0.01), using PCs and SPCs, respectively. This distinction between easting and the other two phenotypes is similar to that of environmental smooth and polygenic phenotypes in the simulations, respectively.

    Figure 4.

    Comparison of PCs and SPCs as covariates to correct for the effects of population structure on the outcomes in the UK Biobank. (A) The structure of each phenotype. We analyzed three phenotypes with varying degrees of true heritability, increasing from left to right. Heritability estimation using GREML and LDSC methods, after correcting for recent population structure using both approaches, is shown, along with display of average phenotype values in each administrative county in the United Kingdom. (B) Proportion of explained variance in each phenotype by a model that only includes PCs or SPCs as covariates. Error bars represent standard deviation. (C) QQ plot of the distribution of P-values in the GWAS of common variants for each phenotype that measure the ability of covariates to adjust for the effects of recent population structure on the outcomes. (D) QQ plot of the distribution of P-values in the GWAS of rare variants (MAF < 1%) in the imputed data set. (E) Spatial autocorrelation of each phenotype measured in terms of Moran's I as a comparison of the ability of each set of covariates to correct against the artifacts of environmental effects.

    Next, we conducted GWAS for all three phenotypes to measure the inflation of the test statistics owing to population structure, using the first 25 PCs and SPCs, age, age squared, sex, and an age/sex interaction variable as covariates. To account for indirect genetic effects, and cryptic relatedness, which were not present in the simulated phenotypes, we used the BOLT-LMM software, which applies a Bayesian shrinkage prior model to adjust for those effects (Loh et al. 2015). In the GWAS of eastings, SPCs resulted in a lower inflation of P-values, as evidenced by a lower genomic inflation factor compared with PCs: 1.200 and 1.369, respectively. In contrast, the GWAS of height and BFP showed no significant differences in the inflation of P-values as both approaches yielded the same inflation factor (Fig. 4C). We then conducted a GWAS of rare variants using imputed genotype data, using BOLT-LMM. Looking exclusively at the imputed rare variants (MAF < 0.01), we observed lower genomic inflation factor values for both models in all three phenotypes along with similar patterns as the nonimputed GWAS, with SPCs having a lower inflation factor (1.147) compared with PCs (1.311) in easting and both models having the same inflation factor for height and BFP (Fig. 4D).

    We then sought to explore the differences between the effect sizes estimated by each model. LD score regression found identical heritability estimates from both models in all phenotypes (Fig. 4A), with a higher intercept for PCs in easting GWAS (1.362 vs. 1.172). This suggests the difference in performance in eastings was mostly driven by population structure confounding in the PC model (Bulik-Sullivan et al. 2015). For height and BFP, aggregate analysis revealed negligible differences, with the SPC-adjusted slope being lower but not significantly. We explored this further for height by extracting the list of loci for which the estimated single variant effect sizes differed significantly between the two approaches. We found that the minor alleles in these loci tended to be more uncommon, compared to all tested variants (Kolmogorov–Smirnov test P-value: 5 × 10−83), highlighting that the impact of SPCs is most pronounced at lower allele frequencies. The same pattern was also observed in the results for the eastings and BFP GWAS.

    We calculated spatial autocorrelation of the phenotypes, in the form of Moran's I, as a measure of indirect environmental effects (Moran 1948). A previous study showed the spatial autocorrelation of phenotype distributions is reduced after adjustment using PCs, illustrating the efficacy of PC adjustment in addressing confounding owing to such effects (Abdellaoui et al. 2019). Here we replicated that experiment and added SPC adjustment (Fig. 4E) as another comparison approach. The Moran's I for all three phenotypes decreased after adjustment using PCs, 0.52 to 0.41 for easting, 0.22 to 0.12 for BFP, and 0.32 to 0.15 for height. Adjustment using SPCs further reduced Moran's I estimates to 0.38 for easting and 0.10 for BFP and height. In summary, our analysis demonstrates that SPCs consistently outperform PCs in reducing population structure confounding, particularly for phenotypes with strong environmental components, and show comparable performance for highly and moderately heritable traits like height and BFP, especially when considering rare variant effect sizes and spatial autocorrelation.

    As a sensitivity analysis, we conducted a comparison of SPCs against other covariate extraction techniques by evaluating their PVE on a random subset of the UK Biobank participants (N = 50,000) (Supplemental Fig. 9). SPCs outperformed both PCs of rare and common variants. We note that in this data set (unlike in our simulations), PCs of rare variants could not outperform PCs of common variants. We then calculated additional component sets based on minimum IBD sharing thresholds (6 cM, 10 cM, and 15 cM), graph weighting schemes (weighted vs. unweighted graphs), and component calculation methods (spectral vs. principal components of IBD sharing), resulting in 12 sets of covariates (including SPCs), replicating our analysis using simulated data. For each sharing threshold, SPCs outperformed their alternatives, highlighting the ability of spectral components in extracting localized, fine-scale signals of population structure, along with the advantages of utilizing unweighted matrices, especially in the first five and the first 10 dimensions. We further showed how a higher sharing threshold can negatively impact PVE by excluding a significant proportion of IBD signals (Fig. 5). Together with our simulated results, these analyses demonstrate that SPCs calculated at the 6 cM threshold provide a more accurate adjustment for recent population structure than alternative adjustment strategies (for further details, see Supplemental Material).

    Figure 5.

    Bar plots of the PVE in eastings, height, and BFP explained by SPCs (derived from binary IBD relatedness), IBD PCs (derived from binary IBD relatedness), and PCs (derived from genetic variants) in the UK Biobank. Exploring the effects of a minimum IBD sharing parameter and decomposition approach on the performance of covariates. PCs of common and rare variants are displayed on the rightmost side of the figure. For other components, labels on the x-axis indicate the minimum IBD sharing threshold used to generate the components. Spectral (SPC) and principal (IBD PC) components were calculated per relatedness graph, indicated by diagonal and horizontal shading lines, respectively. SPCs are highlighted in brighter colors. PCs of common and rare variants are subscripted as PC-Common and PC-Rare, respectively.

    To measure the robustness of SPCs in the presence of ascertainment bias, we explored the effects of network sparsity on the SPCs. In PC calculation, a lack of balance in representation may result in the PC's inability to capture population structure signals from subpopulations with lower representation (Lee et al. 2010). The IBD analysis of the UKBB cohort, owing to its demographic history, leads to a densely connected graph, in which most individuals have hundreds of IBD connections to others (Supplemental Fig. 10). However, imbalances in representation result in a limited number of participants with lower connectivity. Here, connectivity (or node degree) is the number of edges per node. We searched for individuals with the lowest connectivity levels in a random subset (N = 50,000). The lowest 20 node degrees ranged from four to 131 (mean = 84.65; median = 103). We looked at the distribution of SPC components to see if the nodes with low connectivity cluster near zero owing to lower variability in their latent representation, suggesting failure to capture their underlying structure. We calculated the variance of each SPC dimension and compared them to the variances of a random sample of the same size (Supplemental Fig. 11) and found that the variances calculated for lower connectivity nodes were not significantly lower than those of the random sample. When applying this to PCs, low-connectivity nodes exhibited significantly reduced variance. The distributions across SPC dimensions further illustrate SPCs’ robustness to ascertainment bias (Supplemental Fig. 12).

    Discussion

    We introduced SPCs, a graph-based approach that leverages IBD graphs to capture localized, nonlinear fine-scale population structure and transforms them into a continuous representation for statistical genetics analyses. This method overcomes some limitations of PCs, which are optimized for modeling continuous linear ancestral relationships but struggle to fully account for recent demographic events that shape patterns of rare variation. Our results demonstrate that SPCs outperform PCs, explaining >90% of fine-scale population structure in simulations while reducing genomic inflation in GWAS for both environmental and heritable phenotypes. Notably, SPCs improved rare variant association analyses by reducing genomic inflation, while providing more accurate heritability estimates and avoiding overcorrection for the highly heritable trait of height. Spatial autocorrelation analysis further confirmed that SPCs better control for confounding by environmental effects, reducing Moran's I more effectively than PCs. These findings establish SPCs as a powerful and scalable tool to improve population structure adjustment and confounding control in large-scale, diverse biobank studies.

    This work builds on previous research on IBD-based or rare variant–based GRMs, which have shown that the latter models can often overcorrect for population structure, leading to loss of true signal in genetic analysis. For example, several studies (Browning and Browning 2015; Evans et al. 2018) demonstrate that overcorrection can occur because IBD GRMs disproportionately emphasize recent, localized relatedness, which can obscure genuine associations. Additionally, it extends prior applications of spectral approaches to common variant data, which demonstrated they can be more robust than PCs in finding hidden population structure (Lee et al. 2010). However, those applications use variants that are shared identical-by-state (IBS), rather than haplotypes shared IBD, which lack genealogical context and are expected to have poorer resolution of confounding for recent population structure and rare variant association. The distinction between fixed effects and random effects on these models further complicates the adjustment process. Fixed effects, like PCs, explicitly control for population structure by removing specific sources of variation, whereas random effects models, like those incorporating GLMMs, aim to partition variance across all genetic relationships but may inadvertently attenuate rare variant signals by treating them as background noise. SPCs provide a middle ground by capturing fine-scale population structure without overcorrection, using a fixed-effect approach tailored to discrete, localized signals, while maintaining the broader interpretability of linear mixed models. Moreover, the use of SPCs showed a clear reduction in genomic inflation for rare variants, therefore providing a significantly more robust adjustment. Other work on alternative approaches, such as using ARGs to infer genome-wide genealogies that can be combined with GLLMs to improve power for rare variant association has recently been shown to scale to large biobank data sets (Nait Saada et al. 2020; Zhang et al. 2023; Gunnarsson et al. 2024; Christ et al. 2026). Although these approaches show promise, they currently have some limitations, such as implicating large genomic regions in variant association and requiring methodological improvements to LMM scalability. Future work could explore the use of PCs, ARGs, and SPCs in tandem as a powerful tool for addressing the challenges posed by population structure in modern genomics (Ma and Shi 2020; Zaidi and Mathieson 2020).

    Another focus of this study was to examine how SPCs capture environmental effects, which often overlap with recent population structure (Zaidi and Mathieson 2020). The increased accuracy of SPCs compared with PCs is especially important given the increasing recognition of environmental confounding in GWAS, in which geography and patterns of recent demography can lead to spurious associations. In this context, SPCs explain a significantly greater proportion of variance for environmental phenotypes, such as easting coordinates. This result aligns with earlier work (Leslie et al. 2015; Abdellaoui et al. 2019; Cook et al. 2020; Zaidi and Mathieson 2020), highlighting the difficulty of decoupling environmental and genetic factors in geographically clustered populations. Our findings support this previous work, demonstrating that population structure confounding remains a major challenge even in large-scale homogeneous cohorts like the White Britons from the UK Biobank. We showed this for the highly polygenic trait of height, as demonstrated by measuring the spatial autocorrelation in the form of Moran's I. SPCs were able to further decrease the spatial autocorrelation of height compared with PCs. Our results also support the use of SPCs in heritability analysis, in which adjustment using SPCs reduced genetic heritability estimates by accounting for environmental confounding without overcorrection. By improving adjustment for the environmental confounding effects, SPCs provide a critical advance in the accuracy of genomic analysis for phenotypes that are heavily influenced by recent demography or environmental variables. We used easting coordinates as an extreme example of such phenotypes. Other phenotypes, or their determinants, such as temperature and pollutant exposures, income, and access to health care, depending on the data set, might express linear or nonlinear clines with respect to geography and genetic ancestry. These determinants can in turn affect health outcomes, including laboratory measurements such as vitamin D levels, which have been shown to be affected by altitude (Calvo 2020).

    We highlight several limitations to this study and directions for future development. First, our study was restricted to unrelated samples of White British descent in the UK Biobank. SPCs should be further evaluated in more diverse data sets in which relatedness, cryptic family structure, admixture, and population differentiation may introduce additional challenges. Second, although SPCs consistently outperform PCs in this study, additional work is needed to further validate the power of SPCs across settings such as smaller cohorts with sparser graphs, phenotypes with lower heritability, and dichotomous outcomes in which collinearity may be more pronounced. Third, further work could extend SPCs to additional analytic frameworks, like burden testing, generalized additive models, and polygenic risk score calculation, to assess their ability to improve prediction accuracy and calibration. Fourth, although our analysis focused on SPCs from global IBD graphs, alternative strategies merit exploration, including clustering-based partitioning of IBD graphs and multisegment IBD detection approaches (Browning and Browning 2024; Temple and Thompson 2025), which may capture additional fine-scale relatedness. Finally, SPCs could also be evaluated as covariates in IBD-based mapping of quantitative trait loci, as their ability to resolve fine-scale population structure differences suggests they may help improve mapping accuracy (Chen et al. 2023; Cai and Browning 2025).

    In summary, SPCs offer a robust, continuous alternative to PCs for adjusting for recent population structure in genomic studies. By capturing fine-scale, nonlinear demographic patterns, SPCs provide more accurate adjustment for confounding owing to environmental effects and recent ancestry, particularly in large-scale cohorts. As large-scale biobanks continue to grow in size and diversity, the impact of rare variants in genetic analysis will become increasingly pronounced. In this context, tools like SPCs will be essential for ensuring accurate results, particularly as researchers delve deeper into the genetic architecture of rare variants on common disease and traits influenced by population-structured dynamics. By addressing the combined challenges of population structure, environmental confounding, and rare variant associations, SPC-based approaches pave the way to uncover novel insights into complex disease architecture.

    Methods

    Calculating spectral and principal components

    The calculation of SPCs can be described in four main steps (Fig. 1).

    In step 1, we conducted QC and phasing on array data. The phased data were used by the IBD estimation algorithm, iLASH (https://github.com/roohy/iLASH), to detect pairwise for IBD for each data set. An in-depth explanation of genotype QC and phasing, along with details about running iLASH and estimating IBD was previously described (Shemirani et al. 2021). Throughout this paper, we use the term IBD to refer to long haplotype segments (≥6 cM) that are shared IBS and, given their length, are extremely likely to be inherited from a recent common ancestor. This operational definition is consistent with prior work (Gusev et al. 2009; Browning and Browning 2015; Shemirani et al. 2021). We acknowledge that some definitions of IBD differ, and we have added this clarification to reduce ambiguity. The results of the first step can be represented as S the set of all IBD segments long enough to be detected. Each segment has a starting point and an end point, s1, s2, located on autosomal Chromosome k:Formula In step 2, the aggregated sums of pairwise IBD sharing (i.e., the total or combined length of IBD haplotypes that are shared per pair of individuals) were calculated and used to generate an IBD relatedness graph. On this graph, nodes represented participants, and edge weights represented the aggregated IBD sharing. A minimum threshold of aggregated IBD sharing is enforced both as a QC measure (Belbin et al. 2021) and as a reflection of the population structure of the data set:Formula To calculate SPCs. we used a threshold of 6 cM. However, we explored two other thresholds of 10 cM and 15 cM for comparison (cf. Supplemental Material).

    In step 3, the n × n adjacency matrix A was generated, representing the relatedness graph, in which rows and columns both represent the n participants in the data set. Each element aij in this matrix is equal to one if the aggregated length of IBD segments shared between the two participants i and j on the graph is above the threshold of l = 6 cM; otherwise, the value of the elements is set to zero:Formula We refer to this matrix as the binary IBD relatedness matrix. We also explored the use of nonbinary relatedness matrices for comparison owing to their similarity with common GRMs (cf. Supplemental Material). Unlike GRMs, the majority of entries in the IBD relatedness matrix are zero. In our cohort, >98% of the entries were zero.

    In step 4, any graph adjacency matrix can be transformed into a Laplacian form (Shuman et al. 2013). Here we defined a symmetrically normalized Laplacian:Formula where I is the identity matrix; D is defined as the degree matrix, a diagonal matrix in which each element dii is equal to the total number of neighbors for participant i; and D+ is the (Moore–Penrose) inverse of D. The eigenvectors of the Laplacian matrix are commonly referred to as the spectral components of the graph. Briefly, SPCs are the spectral components of the binary IBD relatedness graph with a cut-off threshold of 6 cM. Thus, akin to PCs, SPCs attribute a set of numerical values to each vertex based on its projections on the set of principal axes of variation in the graph. The level of detail represented by each axis depends on the corresponding eigenvalue associated with it. Eigenvectors associated with the smaller eigenvalues will assign similar number to neighbors on the graph. As the associated eigenvalue increases, similar values are assigned to participants that are incrementally further from each other on the graph (for further details, see Supplemental Material).

    We have implemented this pipeline as a Python library that transforms flat text IBD graph files to SciPy sparse matrices and uses CPU or GPU acceleration to calculate SPCs using the same library.

    Generating simulated data

    We used the MSprime Python library to implement the simulation pipeline to ensure the distribution of IBD connections and their length is concordant with real data sets. The discrete time Wright–Fisher (DTWF) model, implemented in MSprime, was chosen to accurately simulate global IBD sharing networks under realistic demographic scenarios that result in long-range correlations between haplotypes (Baumdicker et al. 2022). Following guidelines on the simulation of homogeneous populations with recent structure akin to that of the UK Biobank (Zaidi and Mathieson 2020), our pipeline simulated N-by-N grids of demes with a constant migration rate of 0.05. We set N = 5 (25 demes) for our main simulation of Chromosome 1 with 50,000 diploid individuals. We also simulated a replication data set with N = 4 with 22 chromosomes and 8000 diploid individuals. The length of the chromosomes in the 8000-sample simulation was adopted from the HapMap Project (International HapMap Consortium et al. 2007). Chromosome-specific recombination rates were adopted from the stdpopsim library metadata (Adrion et al. 2020). In the 50,000-sample simulation, we used the same library to obtain Chromosome 1 specifications, with 248,956,422 bp and a recombination rate of 1.149 × 10−8. The migration rate of 0.05 has been shown to establish demes with average divergence, measured by FST, equal to that of neighboring administrative counties in the UK Biobank data set (Leslie et al. 2015). All demes originate from a single ancestral population 300 generations ago. The first 150 generations going back from the current day were simulated using the DTWF model, implemented in the MSprime library to ensure the IBD length distribution is similar to that observed in human genotype data sets. The simulation pipeline is available publicly on GitHub.

    Although calculating true IBD data is readily possible with MSprime, we sought to replicate common noise present in real-world applications, such as switch errors and false-positive IBD segments. The simulated tree sequences were converted to whole-genome sequence data in PLINK 1.9 format in order to delete the ground-truth phase information. The data were then filtered to retain only common variants (MAF > 1%) and then phased using Eagle v4.2 with default parameters (https://alkesgroup.broadinstitute.org/Eagle/). Genetic map files for Eagle were generated using the recombination rates applied in the simulations. We have provided the code to generate these files in our simulation GitHub repository. IBD segments were estimated using the iLASH software v1.0.1 with default parameters (min_length 2.9, perm_count: 20, shingle_size: 15, shingle_overlap: 0, auto_slice: True, cm_overlap: 1, bucket_count: 5, match_threshold: 0.99, interest_threshold: 0.70). For the comparative analysis of downstream applications, seven types of phenotypes were simulated (Fig. 3A). The first category, referred to as environmental smooth phenotypes, are nonheritable phenotypes with a distribution that depends on the deme of origin. All values were generated using a normal distribution with constant variance across all demes. Moving vertically from top to bottom, the mean of the normal distribution decreases monotonically per row. It decreases from a value of 2σ for the top row to zero for the bottom row. Briefly, in a five-by-five grid simulation, this means the phenotype values on the top row are drawn from the distribution Formula, the values on the second row are drawn from the distribution Formula, each subsequent row has a mean that is 0.5σ smaller than the previous, and last row has a mean of zero. The second phenotype, called the environmental sharp phenotype, is also nonheritable with a normal distribution with a nonzero mean (Formula) at the target deme and a zero-centered normal distribution (Formula) in all other demes. We used a σ = 1 in our analysis for this paper. We generated three heritable polygenic phenotypes, to which we refer, based on their heritability, as polygenic low heritability (h2 = 0.1), polygenic medium heritability (h2 = 0.3), and polygenic high heritability (h2 = 0.08). For these phenotypes, we selected causal SNPs randomly from each window of 10 kb along the genome (24,813 causal variants per phenotype) with effect sizes drawn from Formula, where Formula is calculated based on selective pressure parameter α = −0.5, and the total heritability of the phenotype (σs = 0.1, 0.3, and 0.8) via the formula Formula. We then used the weights along with the simulated whole-genome sequence data and the score calculation functionality in the PLINK software (‐‐score; https://zzz.bwh.harvard.edu/plink/plink2.shtml) to generate the mean phenotype values for each individual. The remaining variance of the phenotypes were added using a random normal distribution with variances of 0.9, 0.7, and 0.2, respectively.

    Finally, we simulated two hybrid phenotypes by combining the environmental smooth and sharp phenotypes and the polygenic phenotype with medium heritability (h2 = 0.3) by reweighting the environmental phenotype to have a variance of 0.7 and adding the values of the two phenotypes to generate hybrid smooth and hybrid sharp phenotypes, respectively.

    LD data to detect variants tagging causal ones were generated using PLINK with the following parameters: “‐‐list-all, ‐‐show-tags, ‐‐tag-kb 2000, ‐‐tag-r2 0.1”.

    Ground-truth IBD data were extracted using the ibd_segment function in the tskit library with a minimum span of 3 cM and a maximum TMRCA of 200.

    UK Biobank data

    Phased genotype data along with the imputed data for the UK Biobank project were downloaded through the accession code 84541. The UK Biobank participants were genotyped using the Applied Biosystems UK BiLEVE axiom array by Affymetrix 807,411. The genotype data were phased using the SHAPEIT4 software and the 1000 Genomes Project as the reference panel (Auton et al. 2015; Bycroft et al. 2018; Delaneau et al. 2019). After removing indels from the 22 autosomal chromosomes, a total of 655,532 SNPs were retained. Data field 21000, based on self-reported race/ethnicity, was used to extract (N = 471,931) individuals with White British ancestry. Information on allele frequency and missingness in both phasing and QC process for GWAS data is available in the work of Loh et al. (2018). We estimated fine-scale IBD sharing using the iLASH software (Shemirani et al. 2021) using the same parameters mentioned in the previous section. Second-degree relatives were identified using KING (parameters: ‐‐related ‐‐degree 2; N = 24,799). From each pair of these relatives, at least one individual was excluded from the analysis (N = 405,508). Height information was accessed using data field 20 (N = 352,254). BFP was accessed using data field 23099 (N = 346,655). The easting coordinates of birth were accessed using data field 130 (N = 343,845). PCs were calculated using PLINK2 software. Phased data were available only for the hg19 human genome assembly. As our analyses rely on aggregate genomic information, using newer assemblies would not affect the results.

    Genome-wide association study

    The GWAS of the simulated data was conducted using the GCTA software (https://yanglab.westlake.edu.cn/software/gcta/) with default parameters and the FastGWA option to run linear regression (Jiang et al. 2019). For the common variant analysis, we excluded variants with MAF < 1% using PLINK2 software. For the rare/uncommon variant analysis, we excluded variants with MAF > 1% using the same software. The GWAS in UK Biobank was conducted using the BOLT-LMM software v1 (https://alkesgroup.broadinstitute.org/BOLT-LMM/) to account for cryptic relatedness and direct genetic effects, absent in the simulated phenotype analysis. Age, age squared, sex, age by sex, and age squared by sex were used as covariates (Loh et al. 2015). The LD reference panel was calculated using genotype data from individuals with reported European ancestry in The 1000 Genomes Project (https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com; access date January 2023). The default genetic map file available in the SHAPEIT4 software package (Delaneau et al. 2019), calculated for Human Genome version 19 (hg19), was used for genetic distances. Minor allele frequency of 0.001 and a minimum INFO score of 0.3 filters were used for the imputed data.

    Computing environment

    All analyses in this paper were performed on the high-performance computing cluster at Mount Sinai. The compute node used featured an Intel Emerald Rapids 8568Y+ CPU with 96 cores at 2.3 GHz (20 cores were used for our applications) and 1.5 TB of RAM (∼100 GB were used during our runs). The operating system was CentOS v7.0, with GCC v8.3.0 and Python v3.5.

    Code availability

    A Python implementation of the SPC algorithm is available as Supplemental Code and at GitHub (https://github.com/roohy/spc). Our simulation pipeline is available as Supplemental Code and at GitHub (https://github.com/roohy/msprime_ancestry_simulation).

    Data access

    Simulated genotype data, along with phenotypes is available on the Mendeley repository available at https://data.mendeley.com/datasets/9dcpdbzv4m/1 (doi:10.17632/9dcpdbzv4m.1).

    Competing interest statement

    The authors declare no competing interests.

    Acknowledgments

    This work was supported by the National Human Genome Research Institute of the National Institutes of Health (NIH) under award numbers R01HG011345, U01HG010971, U01HG011715-04, and UMHG010971-06. This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the NIH under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

    Author contributions: Conceptualization was by R.S., E.E.K., C.R.G., and N.Z. Methodology was developed by R.S., E.E.K., and N.Z. Conceptualization of the simulation of population structure was by R.S. and G.M.B. Formal analysis was by R.S., E.E.K., and N.Z. Reviewing and editing were by all authors. Supervision was by E.E.K.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280659.125.

    • Freely available online through the Genome Research Open Access option.

    • Received March 14, 2025.
    • Accepted December 10, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server