Identification and validation of supervariants reveal novel loci associated with human white matter microstructure

As an essential part of the central nervous system, white matter coordinates communications between different brain regions and is related to a wide range of neurodegenerative and neuropsychiatric disorders. Previous genome-wide association studies (GWASs) have uncovered loci associated with white matter microstructure. However, GWASs suffer from limited reproducibility and difficulties in detecting multi-single-nucleotide polymorphism (multi-SNP) and epistatic effects. In this study, we adopt the concept of supervariants, a combination of alleles in multiple loci, to account for potential multi-SNP effects. We perform supervariant identification and validation to identify loci associated with 22 white matter fractional anisotropy phenotypes derived from diffusion tensor imaging. To increase reproducibility, we use United Kingdom (UK) Biobank White British (n = 30,842) data for discovery and internal validation, and UK Biobank White but non-British (n = 1927) data, Europeans from the Adolescent Brain Cognitive Development study (n = 4399) data, and Europeans from the Human Connectome Project (n = 319) data for external validation. We identify 23 novel loci on the discovery set that have not been reported in the previous GWASs on white matter microstructure. Among them, three supervariants on genomic regions 5q35.1, 8p21.2, and 19q13.32 have P-values lower than 0.05 in the meta-analysis of the three independent validation data sets. These supervariants contain genetic variants located in genes that have been related to brain structures, cognitive functions, and neuropsychiatric diseases. Our findings provide a better understanding of the genetic architecture underlying white matter microstructure.


S1. Simulation studies for supervariant discovery and internal validation procedure
To evaluate whether the discovery and internal validation procedure controls false positives, we perform simulation studies that apply this procedure to 22 simulated null phenotypes. Specifically, we directly use the 2,723 SNP sets from the real data analysis as the genotype data. Then, we randomly generate 22 continuous phenotypes without genetic effects by $Y_{ij} = X_{ij} + \epsilon_{ij}$, where the covariate $X_{ij} \sim N(0, 1)$ and the noise $\epsilon_{ij} \sim N(0, 1)$, for individual $i$ and phenotype $j = 1, \ldots, 22$.
We follow the same discovery and internal validation procedure shown in Figure 1B (threshold on part 1: 0.05/(22×2723×2); threshold on part 2: 0.05/22) and repeat our proposed procedure 10 times. We then repeat this simulation 10 times to evaluate the type I error rate. At the thresholds above, no SNP set passes both the discovery and validation requirements more than 5 times, suggesting that the procedure used in the real data analysis controls the type I error well.
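The null simulation and the two significance thresholds described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the seed handling, and the reading of "more than 5 times" as "at least 6 of the 10 repeats" are assumptions made for the example, and the supervariant construction itself is not reproduced.

```python
import random

N_PHENOTYPES = 22   # number of simulated null phenotypes
N_SNP_SETS = 2723   # number of SNP sets taken from the real data analysis
ALPHA = 0.05


def bonferroni_thresholds(alpha=ALPHA, n_pheno=N_PHENOTYPES, n_sets=N_SNP_SETS):
    """Return (part 1 threshold, part 2 threshold) as stated in the text."""
    part1 = alpha / (n_pheno * n_sets * 2)
    part2 = alpha / n_pheno
    return part1, part2


def simulate_null_phenotypes(n_individuals, n_pheno=N_PHENOTYPES, seed=0):
    """Generate Y = X + eps with X ~ N(0, 1) and eps ~ N(0, 1), no genetic effect.

    Returns two lists of length n_pheno (covariates and phenotypes), each
    holding one list of n_individuals values.
    """
    rng = random.Random(seed)  # hypothetical seeding choice, for reproducibility
    covariates, phenotypes = [], []
    for _ in range(n_pheno):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]
        eps = [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]
        covariates.append(x)
        phenotypes.append([xi + ei for xi, ei in zip(x, eps)])
    return covariates, phenotypes


def passes_majority(pass_flags, max_allowed=5):
    """True if a SNP set passed both stages in more than `max_allowed` repeats."""
    return sum(bool(flag) for flag in pass_flags) > max_allowed


part1, part2 = bonferroni_thresholds()
covariates, phenotypes = simulate_null_phenotypes(n_individuals=100)
```

In this sketch a SNP set would count as a false positive only when `passes_majority` returns `True` for its 10 pass/fail flags.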

S2. Conditional analysis of known common SNPs for supervariants
To evaluate whether the novel loci identified in this study are independent of those reported in previous GWASs, we perform a conditional analysis on the 539 unique leading SNPs identified in the largest previous GWAS on DTI-derived phenotypes (Zhao et al. 2021a). Specifically, we consider two strategies. First, we include one leading SNP at a time as a covariate in the regression model of the supervariant and phenotype, while adjusting for age (at imaging), sex, imaging site, age-squared, the interaction between age and sex, the interaction between age-squared and sex, and the top 10 PCs. Second, we aggregate the 539 leading SNPs into a single score by additive coding and include it as a covariate in the regression model of the supervariant and phenotype to adjust for the joint effects of the 539 SNPs. In the following table, we summarize the original p-values of the three supervariants, the maximal p-values of the supervariants across the conditional analyses of the 539 known SNPs, and the p-values of the supervariants after adjusting for the 539 SNPs jointly. The supervariants retain low p-values in the conditional analysis, suggesting that these loci are independent of the previously reported ones.
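The two conditioning strategies can be illustrated with an ordinary least squares sketch. Several points here are assumptions for the example and are not taken from the study: the function names are hypothetical, the additive score is taken to be an unweighted sum of allele counts (0/1/2), and the p-value uses a normal approximation to the t-statistic, which is reasonable only for large samples.

```python
import math

import numpy as np


def additive_score(dosages):
    """Collapse an (n, m) matrix of allele counts into one score per individual.

    Assumption: 'additive coding' means an unweighted sum across SNPs.
    """
    return np.asarray(dosages, dtype=float).sum(axis=1)


def supervariant_pvalue(y, supervariant, covariates):
    """Two-sided p-value of the supervariant term in an OLS fit.

    Uses a normal approximation to the t-statistic (large-sample assumption).
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    X = np.column_stack([np.ones(n), supervariant, covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - X.shape[1]
    sigma2 = float(resid @ resid) / dof
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    z = float(beta[1]) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def conditional_analysis(y, supervariant, base_covariates, leading_snps):
    """Return (max p-value over one-SNP-at-a-time models, joint-score p-value)."""
    leading_snps = np.asarray(leading_snps, dtype=float)
    one_at_a_time = [
        supervariant_pvalue(
            y, supervariant, np.column_stack([base_covariates, leading_snps[:, k]])
        )
        for k in range(leading_snps.shape[1])
    ]
    joint = supervariant_pvalue(
        y, supervariant,
        np.column_stack([base_covariates, additive_score(leading_snps)]),
    )
    return max(one_at_a_time), joint
```

Here `base_covariates` stands for the adjustment terms listed above (age, sex, site, their interactions, and the top PCs), already assembled into a numeric matrix with one row per individual.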

S3. Analysis of the impact of different splitting strategies on the power of supervariant identification
To evaluate how the splitting strategy may affect the power of the analysis, we consider a variety of splitting ratios for the two random subsets of the dataset, from the extremely unbalanced 1:9, 2:8, 8:2, and 9:1 to the relatively balanced 3:7, 4:6, 6:4, and 7:3. We investigate how robust the supervariants identified through the even split are to these different splitting ratios. Specifically, in the UKB British dataset, we randomly split the dataset with splitting ratios 1:9, 2:8, 3:7, 4:6, 6:4, 7:3, 8:2, and 9:1, respectively, and then follow the same steps as with the 5:5 ratio to construct and validate supervariants. In the following table, we summarize the number of supervariants that are reproducible under these splitting ratios and the percentage of overlap with those obtained using the 5:5 splitting ratio. Under ratios 6:4, 4:6, 3:7, and 7:3, most of the supervariants that we report can also be identified. However, as the sizes of the two parts become more unbalanced, the number of identified supervariants decreases. A small first part of the dataset may result in inaccurate estimation of effect sizes and ranks, while a small second part may leave too few samples for the association test when validating the supervariants. Thus, the splitting strategy affects the results of the analysis, but relatively balanced splitting ratios lead to robust results.
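The random splitting and the overlap summary can be sketched as below. The helper names are hypothetical, and the overlap percentage is computed relative to the supervariants found with the 5:5 split, which is one plausible reading of the text rather than a confirmed definition.

```python
import random

# Splitting ratios considered in the text, plus the 5:5 reference split.
RATIOS = [(1, 9), (2, 8), (3, 7), (4, 6), (5, 5), (6, 4), (7, 3), (8, 2), (9, 1)]


def split_by_ratio(sample_ids, ratio, seed=0):
    """Randomly split sample IDs into two disjoint parts of sizes a:b."""
    a, b = ratio
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # hypothetical seeding, for reproducibility
    n_first = round(len(ids) * a / (a + b))
    return ids[:n_first], ids[n_first:]


def overlap_percentage(found, reference):
    """Percentage of reference supervariants that are also in `found`.

    Assumption: the denominator is the set identified with the 5:5 split.
    """
    reference = set(reference)
    if not reference:
        return 0.0
    return 100.0 * len(set(found) & reference) / len(reference)


# Example: sizes of the two parts for each ratio on 1,000 hypothetical samples.
part_sizes = {
    ratio: tuple(len(part) for part in split_by_ratio(range(1000), ratio))
    for ratio in RATIOS
}
```

The first part would then be used to construct supervariants and the second part to validate them, following the same steps as for the 5:5 split.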

S4. Image processing and derivation of mean fractional anisotropy
We perform consistent standard registration and QC steps based on the ENIGMA-DTI pipeline (Jahanshad et al. 2013; Kochunov et al. 2014) for the different datasets (http://enigma.ini.usc.edu/protocols/dti-protocols/). Specifically, we first use linear registration to register each FA image to the ENIGMA fractional anisotropy (FA) template at 1 × 1 × 1 mm spatial resolution in the MNI-ICBM-152 standard space. We then apply nonlinear registration to align the linearly registered FA images to this standard space and mask the registered FA images with the template brain mask. Next, we project the ENIGMA skeleton onto the registered images. Finally, we extract the tract-based, tract-averaged mean FA from each image. The full data analysis steps are summarized as follows.