Inference of selective forces on house mouse genomes during secondary contact in East Asia

The house mouse (Mus musculus), which is commensal to humans, has spread globally via human activities, leading to secondary contact between genetically divergent subspecies. This pattern of genetic admixture can provide insights into the selective forces at play in this well-studied model organism. Our analysis of 163 house mouse genomes, with a particular focus on East Asia, revealed substantial admixture between the subspecies castaneus and musculus, particularly in Japan and southern China. We revealed, despite the different level of autosomal admixture among regions, that all Y Chromosomes in the East Asian samples belonged to the musculus-type haplogroup, potentially explained by genomic conflict under sex-ratio distortion owing to varying copy numbers of ampliconic genes on sex chromosomes, Slx and Sly. Our computer simulations, designed to replicate the observed scenario, show that the preferential fixation of musculus-type Y Chromosomes can be achieved with a slight increase in the male-to-female birth ratio. We also investigated the influence of selection on the posthybridization of the subspecies castaneus and musculus in Japan. Even though the genetic background of most Japanese samples closely resembles the subspecies musculus, certain genomic regions overrepresented the castaneus-like genetic components, particularly in immune-related genes. Furthermore, a large genomic block (∼2 Mbp) containing a vomeronasal/olfactory receptor gene cluster predominantly harbored castaneus-type haplotypes in the Japanese samples, highlighting the crucial role of olfaction-based recognition in shaping hybrid genomes.


Introduction
Our simulations replicated the scenario observed in southern China, where house mouse populations primarily possess the genetic background of the subspecies castaneus, yet their Y Chromosomes are exclusively of the subspecies musculus-type.We postulated that the castaneus-type Y Chromosomes in the southern Chinese population were displaced by musculus-type Y Chromosomes, driven by sexratio distortion (SD) linked to the Sly/Slx loci.In the model, individuals possessing genotypes that cause SD continually migrate from another population.To determine the feasibility of this exclusive Y Chromosome replacement within a plausible range of parameters and timeframes, we executed simulations under varying conditions.

Model
We modeled a Wright-Fisher diploid population A of size N, which consists of males and females with the XY sex-determination system.Population A receives migrants from population B at a rate of m (fraction of migrants in the recipient population) per generation.Assuming that populations A and B are already well differentiated genetically, for each individual in population A, we consider five key genetic parameters: sex, a genomic fraction derived from population B (GA), genotypes at the SD loci on the X (GXSD) and/or Y (GYSD) Chromosomes, genotypes at the two X-chromosomal hybrid incompatibility (XHI) loci (GXHI1 and GXHI2) adjacent to the SD locus, and mitochondrial genotype (GM).We assumed that the SD and XHI loci are recombined at a rate of r per generation.In the following definitions, we assigned the value 0 and 1 to the original alleles in populations A and B, respectively.
We assumed that the autosomal genome would experience a sufficient amount of recombination.
The feature of the offspring's genome, GA, is therefore scored by an average of the parental values.
GA of the F1 hybrid between an individual from non-admixed populations A and B would thus become 0.5, and GA of the backcrossed hybrid in population A would be 0.25.The genetic distance between parental genomes at autosomal locus, dAA, was defined as the absolute difference between the GA values of a father and a mother.Similarly, the genetic distance between autosomes and X Chromosomes, dXA, was defined as the average of the difference (GXHI) between GA and GXHI1 and between GA and GXHI2.Since a female offspring has two X Chromosomes, GXHI in females was averaged over the two chromosomes.The genetic distance between autosomal and mitochondrial genomes, dMA, and the genetic distance between X and Y Chromosomal genomes, dXY, were defined in a similar way.
The SD locus is located on both X and Y Chromosomes.We assumed that the SD locus has copy number polymorphisms and that allele type 1 in population B has a higher copy number than allele type 0 in population A. In a male germline, a sex chromosome that has an SD allele with a higher copy number than the other sex chromosome is assumed to have a higher probability of transmission to the offspring.The parameter α denotes the level of SD when X and Y Chromosomes mismatch in male gametes; when the allele on the Y Chromosome is 1 and the allele on the X Chromosome is 0, the male-to-female ratio of offspring would be α, and vice versa.
The effect of hybrid incompatibility was evaluated for each reproduced offspring.Assuming the multiplicative effect of hybrid incompatibility between autosomal genomes, between the X Chromosome and autosomal genomes, and between the mitochondrial and autosomal genomes, the fitness of offspring, w, is defined according to equation (1): where RAA, RXA, RMA, and RXY represent amplitude factors for each effect.

Simulations
We developed a Python script for an individual-based population genetics simulation, which consists of Nm males and Nf females in population A. Nm and Nf may not always be equal, but their sum, N, remains constant.Each generation in the simulation has two phases.
In the first phase, recombination between each XHI locus and SD locus on the X Chromosomes in females is randomly assigned.Random variables are drawn from the Poisson distribution with a mean of 2rNf, and the specified number of recombination events is randomly allocated to the pool of X Chromosomes in females.If mutations are considered, they are assigned to each genome during this phase in a similar manner.
In the second phase, mating pairs are randomly selected from population A. However, there is a probability of m that one mate will be replaced by a migrant from population B. The offspring's sex ratio is determined by the father's genotype, as described in the previous section.After selecting mating pairs, the offspring's fitness is evaluated.A random variable is drawn from a uniform distribution U(0, 1).If the random variable is higher than w, the individual is discarded, and a new mating pair is chosen until N offspring are produced for the next generation.This process is repeated for an adequate number of generations.

Results
We performed simulations using various parameter settings and present results with N = 2,000 for up to 200 (0.1N) generations.The number of generations is sufficient, considering that house mice have a population size in the order of 100,000-1,000,000, and we take into account the admixture events occurring around these 10,000 (0.1N) generations.We also fixed the migration rate Nm = 1.At this migration rate, we expect that introgressed alleles would not easily reach fixation under neutrality.
We initially ignored recombination between the SD and XHI loci (r = 0) and discuss its effect below.
In all simulations, we considered 1,000 instances.
We initially assessed the required intensity of SD for the rapid fixation of introgressed X and Y Chromosomes, examining whether the introgressed Y Chromosomes fix more rapidly than the X Chromosomes.By varying the value of α while setting all incompatibility parameters (RAA, RXA, RMA, and RXY) to 0, we examined the fixation rates of introgressed X and Y Chromosomes over 0.1N generations.The results indicated that a 10%-20% increase in the male-to-female birth ratio suffices for the rapid fixation of introgressed Y Chromosomes (Supplemental Figure S6A).Conversely, the fixation of introgressed X Chromosomes consistently lagged behind that of Y Chromosomes because of the design of our model, where the SD always occurs in male gametes.An increase of α to 40%-60% led to the fixation of introgressed X Chromosomes within 0.1N generations (Supplemental Figure S6A), but this fixation typically occurred later than that of the Y Chromosomes.Trajectories of allele frequencies in population A with α = 1.5 are shown in Supplemental Figure S7A.In simulations of the reverse scenario, where X and Y Chromosomes with short haplotypes introgressed into a population with long X and Y haplotypes, the fixation of introgressed X and Y Chromosomes did not occur, aligning with expectations.
We then examined the impact of fertility reduction caused by SD.This reduction can be considered as incompatibility between X and Y Chromosomes in males.We set the male-to-female SD parameter, α, to 1.5 and varied the value of RXY.The results, depicted in Supplemental Figure S6B, show that RXY has a minimal effect until it reaches above 0.5.An RXY value of 0.5 corresponds to a fertility reduction of 39%, indicating that even with a 50% increase in SD, the rapid fixation of introgressed Y Chromosomes can still occur despite a relatively high level of fertility reduction.
Although the above results clearly showed the advantage of introgressed Y Chromosomes over introgressed X Chromosomes in the recipient population without the influence of additional factors, we further evaluated the effect of hybrid incompatibility between different genomes because of the accumulating evidence of hybrid incompatibility loci between genomes.In addition, we have observed less introgression of X Chromosomes in southern China and Japan.By changing various hybrid incompatibility parameters (RAA, RXA, and RMA) with α = 1.5, RXY = 0, and r = 0 (no recombination between SD and XHI loci), we first investigated the effect of hybrid incompatibility due to a mismatch in the autosomal genomes (RAA = 1, RXA = 0, and RMA = 0).The parameter set corresponds to a fitness reduction of F1 hybrids to 39%.We found that the level of hybrid incompatibility would certainly reduce the rate of introgression, but the fixation probability of introgressed alleles was still high with the selective advantage of introgressed X and Y Chromosomes (Supplemental Table S2).The trajectories of allele frequencies in population A are shown in Supplemental Figure S7B.The effect of the XHI locus was incorporated into the next case (RAA = 0, RXA = 1, and RMA = 0).The X Chromosomal effect on hybrid incompatibility greatly reduced the fixation rate of introgressed X Chromosomes but did not change the fixation rate of introgressed Y Chromosomes (Supplemental Table S2).The trajectories of allele frequencies in population A are shown in Supplemental Figure S7C.We also incorporated the effect of mitonuclear incompatibility, where introgressed mitochondrial genomes are incompatible with the genomic background of recipients.Fixation rates of introgressed alleles in the case of RAA = 0, RXA = 0, and RMA = 0 are presented in Supplemental Table S2 and the trajectories are shown in Supplemental Figure S7D.
We also investigated the effects of recombination between the SD and XHI loci on genomic introgression.Our findings reveal that even with a high recombination rate, the rate of fixation for introgressed X Chromosomes remains significantly reduced in scenarios involving autosome-X Chromosome incompatibility (Supplemental Figure S8).This outcome is attributed to the fact that recombination among X Chromosomes only occurs within female germlines, leading to an efficient elimination of introgressed X Chromosomes before the SD and XHI loci can be separated through recombination.Furthermore, our model posits that the SD locus is situated between two distinct XHI loci, diminishing the likelihood of the SD locus becoming unlinked from both XHI loci within a single generation.

Demographic parameter estimation
To illustrate the distribution of castaneus-enriched genomic blocks in the Japanese population under neutrality, we inferred demographic parameters for the five population samples (M.spretus, SPR; Kazakhstan, KAZ; Korea, KOR; India (Leh), IND; Japan, JPN) using Fastsimcoal2 software (Excoffier et al. 2013).The total number of samples used for the analysis was 52.The model is presented in Supplemental Figure S6.For simplicity, we assumed that each population size remained constant unless a population split or admixture occurred, that migration rates were symmetric, and that a single pulse admixture event took place during the establishment of the Japanese population.We also assumed that the admixture forming the Japanese population occurred 3,000 generation ago, with the effects of this assumption discussed later in the text.
We generated joint site frequency spectrum data with minor allele frequencies for the five populations using the same samples employed in calculation of the f4 ratio (α), which estimates the genomic proportion of musculus-derived alleles in the Japanese samples.In addition to the mappability filter, we filtered our SNVs using the threshold of "ExcessHet>30", marked by GATK software (Van der Auwera et al. 2002).
The limitations of our genotyping pipeline meant that the monomorphic non-variant sites in the genome were not specified.To address this, we estimated the number of these sites.This estimation was essential to scale the parameters accurately.Our mappability mask identified 1,764,828,698 autosomal sites that passed the filtering process.Out of the 52 samples, we opted for the sites where all samples had been genotyped, leading to a reduction in polymorphic sites by a factor of 0.878.In addition, on the basis of the H1KG project, we postulated that 93% of the autosomal genomes were accessible (http://ftp.1000genomes.ebi.ac.uk/).By combining these data, the total analyzed sites amounted to 1,440,533,483.
We assumed a mutation rate of 5.9 × 10 -9 and a generation time of 1 year, and repeated the maximum-likelihood estimation procedure 300 times, selecting the parameter sets that demonstrated the highest likelihood.We also conducted parametric bootstrap resampling 100 times to capture the 95% confidence interval.Details on the estimated population parameters can be found in Supplemental Table S3.
Following the optimization of parameters, we evaluated the fit of these parameters to the observed data.The log maximum-likelihood value post-optimization was ˗263,231,275, compared with the highest possible log-likelihood of ˗258,320,847.Although this deviation is notable, a comparison of the expected and observed folded site frequency spectra within the Japanese population revealed a close match across all frequencies, with an exception at the frequency of 0.5 (see Supplemental Figure S10).This discrepancy at the 0.5 frequency could be attributed to our exclusion of SNVs that appeared heterozygous in many samples, applied by the ExcessHet>30 filter.

Computer simulations
We generated 20-kbp-length genomic fragments corresponding to the samples using the maximumlikelihood estimators in Supplemental Table S3, with a recombination rate of 1 × 10 ˗8 per site per generation, and estimated α for 100,000 repetitions.We assigned P-values to the observed data using this distribution.When we did not fix the admixture timing for the Japanese population and estimated the timing using Fastsimcoal2, we obtained a slightly earlier admixture event estimate of > 5,000 generations ago, which was a much earlier timepoint than the assumed value.We are uncertain whether this deviation is due to our simplified model and assumptions; however, the older admixture event causes the distribution of α to skew toward higher values, making our test more conservative.
musculus may have reached the Japanese archipelago alongside millet farming, rather than rice farming.
Given that 80%-90% of the modern Japanese genome is derived from continental migrants, human migration from continental Asia to the Japanese archipelago was extensive and continuous (Hanihara 1991;Kanzawa-Kiriyama et al. 2019;Cooke et al. 2021;Osada and Kawai 2021).Considering that all musculus-type mitochondrial and Y-chromosomal haplotypes in the Japanese samples are derived from one or two haplotypes from the Korean musculus lineage, it is likely that there were a limited number of subspecies of musculus migrants.The subsequent population explosion of the subspecies musculus indicates its success in the rice-crop-oriented society in Japan (Yonekawa et al. 1988).
However, how and when the subspecies castaneus reached the Japanese archipelago remain unclear.
Future studies using ancient samples may help to answer these questions.
Our analysis reveals a clear pattern of differentiation between the coastline of the Japanese archipelago.Recent large-scale genome analyses of the modern Japanese population have not shown such a distinct trend (Watanabe et al. 2021).In addition, the majority of animal species in the Japanese archipelago show genetic clustering between north-east and south-west regions (e.g., (Tsuchiya 1974;Kakioka et al. 2012;Okamoto and Hikida 2012;Tominaga et al. 2015;Dufresnes et al. 2016;Kato et al. 2020;Ito et al. 2021).We propose that this unusual pattern of differentiation was shaped based on anthropogenic reasons, particularly shipping trades.In the 18th century, Japan developed highly organized shipping trade routes along the Sea of Japan and Pacific Ocean coastlines.Considering that the subspecies castaneus mainly inhabited northern Japan, as evidenced by mitochondrial genomes, travel by sea may have facilitated the spread of genetic components of castaneus from northern Japan to the Sea of Japan coast.Although we do not have direct evidence to support the hypothesis, it would be worthwhile to test this in future studies using additional samples, including ancient samples.
Chromosomes exhibited a clear bimodal distribution, which allowed us to classify the samples as male or female based on their respective ranges.

Supplemental Method 5
Construction of genome-wide genealogies using RELATE We estimated the genome-wide genealogies using RELATE v.1.1.9 (Speidel et al. 2019).The input files were converted from variant call format (.vcf) to haplotype format (.haps) using the RelateFileFormats command with the "--mode ConvertFromVcf" option implemented in the RELATE package.Non-biallelic SNVs were removed from the analysis, and the M. spretus genome was used to assign derived mutations in M. musculus.A germline mutation rate of 5.7 × 10 ˗9 per bp per generation (Milholland et al. 2017), a generation time of 1 year, and an initial effective population size parameter of 120,000, were used for estimation.We then used the EstimatePopulationSize.sh script in the RELATE package to re-estimate the effective population size changes over time.We set the threshold to remove uninformative trees at 0.5.