A spectral component approach leveraging identity-by-descent graphs to address recent population structure in genomic analysis

  Eimear E. Kenny (1, 2, 8)

  1. Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA;
  2. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA;
  3. Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA;
  4. Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA;
  5. Department of Neurology, University of California Los Angeles, Los Angeles, California 90095, USA;
  6. Department of Human Genetics, University of California Los Angeles, Los Angeles, California 90095, USA;
  7. Department of Computational Medicine, University of California Los Angeles, Los Angeles, California 90095, USA;
  8. Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA
  • 9 Present address: Gencove, New York, NY 10016, USA

  • Corresponding author: ruhollah.shemirani{at}mssm.edu
  Abstract

    Population structure is a well-known confounder in statistical genetics, particularly in genome-wide association studies (GWAS), in which it can lead to inflated test statistics and spurious associations. Principal components (PCs), the traditional method for adjusting for population structure, are limited in capturing the fine-scale, nonlinear patterns that arise from recent demographic events; these patterns are crucial for understanding rare variant effects. To address this challenge, we propose a novel method called spectral components (SPCs), which leverages identity-by-descent (IBD) graphs to capture local, nonlinear fine-scale population structure and transform it into continuous representations that can be seamlessly integrated into genetic analysis pipelines. Using both simulated data sets and empirical data from the UK Biobank (N ≈ 420,000), we demonstrate that SPCs outperform PCs in adjusting for fine-scale population structure. In simulations, SPCs explain >90% of the fine-scale population structure with fewer components, whereas PCs capture <5%. In the UK Biobank, SPCs reduce the inflation of P-values in the GWAS of an environmentally driven phenotype by 12% compared with PCs, while performing similarly to PCs for height, a highly heritable phenotype. Additionally, SPCs improve rare variant association analyses, reducing genomic inflation (e.g., from 7.6 to 1.2 in one analysis), and provide more accurate heritability estimates. Spatial autocorrelation analysis further confirms the ability of SPCs to account for environmental effects, reducing Moran's I for both environmental and heritable phenotypes more effectively than PCs. Overall, our findings demonstrate that SPCs provide a robust, scalable adjustment for recent population structure, offering a powerful alternative or complement to PCs in large-scale biobank studies.
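
    The abstract does not state how SPCs are computed from an IBD graph. As a rough illustration of the general idea only, the sketch below applies a generic spectral embedding (eigenvectors of a normalized graph Laplacian) to a small, made-up IBD-sharing matrix. The function name `spectral_components`, the toy weights, and the choice of Laplacian are all assumptions made for this example; they should not be read as the authors' implementation.

```python
import numpy as np


def spectral_components(ibd_weights, n_components):
    """Generic Laplacian-eigenvector embedding of a weighted graph.

    ibd_weights: square array of pairwise IBD sharing (hypothetical units).
    Returns (eigenvalues, eigenvectors) for the n_components smallest
    non-trivial eigenpairs. Assumes the graph is connected, so that only
    the first eigenvector is trivial.
    """
    W = np.asarray(ibd_weights, dtype=float)
    W = (W + W.T) / 2.0          # enforce symmetry
    np.fill_diagonal(W, 0.0)     # ignore self-sharing

    degree = W.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(laplacian)

    # Drop the first (trivial, eigenvalue ~0) eigenvector
    return eigvals[1:n_components + 1], eigvecs[:, 1:n_components + 1]


# Toy data (invented for illustration): two groups of four individuals,
# with strong IBD sharing within a group and weak sharing between groups.
n_per_group = 4
ibd = np.full((2 * n_per_group, 2 * n_per_group), 0.1)
ibd[:n_per_group, :n_per_group] = 10.0
ibd[n_per_group:, n_per_group:] = 10.0

values, components = spectral_components(ibd, n_components=1)

# The first component takes one sign in the first group and the opposite
# sign in the second, giving a continuous covariate that encodes group
# membership derived purely from the sharing graph.
group_a_sign = np.sign(components[:n_per_group, 0])
group_b_sign = np.sign(components[n_per_group:, 0])
```

    In a regression-based association test, columns such as `components` could be included as covariates in the same way PCs usually are; whether this matches the published SPC procedure is not something the abstract confirms.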

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280659.125.

    • Freely available online through the Genome Research Open Access option.

    • Received March 14, 2025.
    • Accepted December 10, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
