The rate and spectrum of new mutations in mice inferred by long-read sequencing

  1. Diethard Tautz2
  1. 1Institute of Ecology and Evolution, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom;
  2. 2Department for Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, 24306 Plön, Germany
  • Corresponding author: elcortegano{at}protonmail.com
  • Abstract

    All forms of genetic variation originate from new mutations, making it crucial to understand their rates and mechanisms. Here, we use long-read sequencing from Pacific Biosciences (PacBio) to investigate de novo mutations that accumulated in 12 inbred mouse lines derived from three commonly used inbred strains (C3H, C57BL/6, and FVB) maintained for 8 to 15 generations in a mutation accumulation (MA) experiment. We built chromosome-level genome assemblies based on the MA line founders’ genomes and then employed a combination of read and assembly-based methods to call the complete spectrum of new mutations. On average, there are about 45 mutations per haploid genome per generation, about half of which (54%) are insertions and deletions shorter than 50 bp (indels). The remainder are single-nucleotide mutations (SNMs; 44%) and large structural mutations (SMs; 2%). We found that the degree of DNA repetitiveness is positively correlated with SNM and indel rates and that a substantial fraction of SMs can be explained by homology-dependent mechanisms associated with repeat sequences. Most (90%) indels can be attributed to microsatellite contractions and expansions, and there is a marked bias toward 4 bp indels. Among the different types of SMs, tandem repeat mutations have the highest mutation rate, followed by insertions of transposable elements (TEs). We uncover a rich landscape of active TEs, notable differences in their spectrum among MA lines and strains, and a high rate of gene retroposition. Our study offers novel insights into mammalian genome evolution and highlights the importance of repetitive elements in shaping genomic diversity.

    All forms of genetic variation originate from new mutations, so it is important to know the rate at which the different types of changes in the DNA sequence arise and their underlying mechanisms. New mutations are fundamental for many questions in evolutionary biology, and efforts to estimate their rates originated early in the history of genetics by phenotypic analysis of mutation accumulation (MA) lines of Drosophila melanogaster (Muller 1928; Bateman 1959; Mukai 1964) and in humans by pedigree analysis (for review, see Nachman 2004). In recent decades, DNA sequencing techniques have greatly enhanced the ability to detect new mutations, allowing the estimation of mutation rates at the molecular level (Halligan and Keightley 2009; Bergeron et al. 2022). This has enabled mutation rates to be estimated in a wide range of taxa (Lynch et al. 2023; Wang and Obbard 2023), including humans and mice (Ohno 2019). However, the most widely available short-read sequencing technologies only allow inference of the rate and spectrum of single-nucleotide mutations (SNMs) and short insertions and deletions in nonrepetitive genomic regions. Thanks to recent advancements in long-range sequencing techniques, it is now possible to overcome these limitations and to infer the complete spectrum of new mutations, including structural mutations (SMs) (López-Cortegano et al. 2023).

    Often defined as mutations longer than 50 bp (Mahmoud et al. 2019), SMs represent a heterogeneous and complex group of mutations that have many diverse mechanisms influencing their rate and genomic distribution. Copy-number variants, including deletions and duplications, may affect gene dosage and are important for adaptation, gene family evolution, and disease (Zhang et al. 2009). Chromosome rearrangements, such as inversions and translocations, alter the landscape of recombination and are involved in processes such as speciation and sex chromosome evolution (Potter et al. 2017). All of these types of SMs emerge predominantly owing to failure of the repair of double-strand breaks and involve mechanisms that typically depend on either homologous recombination or end-joining (So et al. 2017). Thus, chromosome instability and processes such as nonallelic homologous recombination result in “hotspots” of structural variants near highly repetitive sequences (Coghlan et al. 2005; Nesta et al. 2021). Transposable elements (TEs) are repeated mobile DNA sequences whose activity can lead to mutations by means of excision and insertion. Such events can occur in functionally important sequences (van't Hof et al. 2016). In addition, TEs can contribute to new genes and regulatory sequences (Long et al. 2013; Chuong et al. 2017; Etchegaray et al. 2021) and can also cause diverse types of SMs, either from nonallelic homologous recombination between interspersed TE copies or directly as a consequence of transposition. For example, LINE retrotransposons occasionally promote large deletions upon aberrant insertion into the genome (Rodriguez-Martin et al. 2020) and can act as substrate for ectopic recombination, resulting in diverse types of SM (Balachandran et al. 2022).

    SMs have been found to be common but tend to be present in populations at low frequencies, owing to their deleterious effects (Chakraborty et al. 2019). This suggests that they contribute to a genomic architecture of rare alleles with high effect sizes and could reveal new insights concerning the genomic architecture of human complex traits (Cmero et al. 2020; Gao et al. 2023; Olson et al. 2023). For example, information on structural variation in genome-wide association studies can contribute to narrowing the missing heritability gap for human diseases (Billingsley et al. 2023; Kaivola et al. 2023). Despite their broad importance, our understanding of how SMs originate and at which rates remains limited. Long-read sequencing is a relatively new technology for improving SM discovery (Hollox et al. 2022), and its application promises a more comprehensive description of the spectrum of new mutations. For example, Pacific Biosciences (PacBio) long reads have been used to discover de novo mutations in MA lines of the green alga Chlamydomonas (López-Cortegano et al. 2023), revealing that 5% of mutations are SMs and highlighting the important contribution of active TEs in shaping the SM spectrum. Current estimates of SM rates in mammals are limited to studies that used short-read technology (Uchimura et al. 2015; Collins et al. 2020; Belyeu et al. 2021; Flynn et al. 2021; Lee et al. 2023; Steensma et al. 2023), but this may result in an inferred SM spectrum that is biased toward simple copy-number variants. The main objective of the present study is to obtain the complete spectrum of the new SMs that have accumulated in an MA experiment in mice that does not suffer from the above limitation.

    In addition to enabling the detection of large SMs, long-read technology also offers a more robust means than short-read sequencing for identifying mutations within highly repeated regions, such as microsatellite sequences. This capability also enables the accurate estimation of the rate of indels, which we arbitrarily define here as insertions and deletions shorter than 50 bp, including microsatellite contractions and expansions. It is notable that even in compact bacterial genomes, long-read sequencing has led to improved indel rate estimates (Zhou et al. 2023). Copy-number variants of tandem repeats of 1–6 bp, known as simple sequences or microsatellites (Tautz 1993), have significant medical importance owing to their involvement in inherited and de novo acquired disorders (Mirkin 2007; Hannan 2018; Trost et al. 2020; Mitra et al. 2021). Indels and other mutations occurring within microsatellite regions are, however, frequently overlooked. In cases in which indel rates have been estimated in mice by short-read sequencing, they have generally been substantially lower than SNM rates (e.g., ∼5% of the SNM rate) (Uchimura et al. 2015). However, this may be because of the calling criteria, which excluded repeat regions (Uchimura et al. 2015). Similarly in humans, genome-wide indel rate estimates have been low (Ohno 2019). Limited evidence combining short- and long-read sequencing indicates that indel rates are higher than SNM rates (Sun et al. 2012; Kristmundsdottir et al. 2023), but these estimates are restricted to annotated microsatellites of 1–6 bp motifs and may not provide a comprehensive picture of the genome-wide indel rate or its association with sequence repetitiveness. Microsatellites and, more generally, tandem repeat sequences are known to be among the fastest evolving sequences. Consequently, when the genomic background of the aligned samples differs from the reference genome, short tandem repeats may be unusually susceptible to alignment failures. The use of long reads for assembling strain-specific reference genomes may therefore become an essential step for accurately detecting copy-number variations and SMs in fast-evolving sequences, such as repeats (Noyes et al. 2022).

    The mouse is a well-established model mammal that has a relatively short generation time, allowing for the accumulation of a large number of mutations in a short time. Furthermore, the mouse genome is similar to humans, and this has been of immense value for biomedical research (Schofield et al. 2012; Groza et al. 2023). The mouse additionally serves as a model for evolutionary studies (Guénet and Bonhomme 2003; Harr et al. 2016). To gain a deeper understanding of mammal genome evolution, we conducted an MA experiment in mice. Lines were maintained by full-sib mating to minimize the effective population size and the effectiveness of selection acting on newly arising mutations. To study variation in the mutation rate among mouse strains, our experiment included MA lines of three different commonly used inbred mouse strains (C3H, C57BL/6, and FVB). We used PacBio HiFi technology to sequence four MA samples from each of the three strains after 8 to 15 generations of MA. To aid the analysis of the complex and repetitive mouse genome (Lilue et al. 2018; Nurk et al. 2022; Liao et al. 2023), we produced strain-specific chromosome-level assemblies for use as reference genomes in order to improve mutation calling and to avoid reference bias. By employing variant calling approaches based on both sequencing reads and genome assemblies, we investigated the rate and spectrum of new mutations in mice, thereby offering insight into the causes of variation among inbred strains, and the evolution of mammalian genomes.

    Results

    We used PacBio HiFi long reads from inbred mouse MA line founders of the strains C3H/HeNRj (C3H), C57BL/6JRj (BL6), and FVB/NRj (FVB) to generate chromosome-level assemblies (Supplemental Figs. S1–S4). These assemblies were then used as reference genomes for calling mutations. At the contig level, the assemblies have a mean N50 ≈ 50 Mb (Supplemental Table S1) and were thereafter nearly as contiguous as the current GRCm39 reference genome (N50 = 59.5 Mb), which is derived from the C57BL/6J strain. They are also substantially more contiguous than the current C3H/HeJ v3 and FVB/NJ v3 reference genomes (N50 ≈ 2–8 Mb). We subsequently defined ∼96% of the genome as “callable” for each strain (Supplemental Table S2), namely, the sites where mutations could be called with high confidence (see Methods). Uncallable sites were usually located within highly repetitive acrocentric centromeres (Supplemental Figs. S1–S3). Chromosome X had a lower callability in BL6 and FVB than in C3H (i.e., ∼74% vs. 93%) (Supplemental Table S2), presumably owing to the somewhat lower contiguity of the former two assemblies (Supplemental Table S1) and the high density of TEs on that chromosome (Supplemental Figs. S1–S3).

    We also sequenced four MA samples from each strain after 8 to 15 generations of the MA experiment and built de novo assemblies for all samples. Subsequently, SMs were called from read alignments using Sniffles (Smolka et al. 2024), from assembly mappings using dipcall (Li et al. 2018), and from a multisample pangenome built with minigraph (Li et al. 2020). Leveraging the low error rate of our assemblies (genomic quality values QV > 45 on the Phred scale) (Supplemental Table S1), we employed DeepVariant (Poplin et al. 2018) to detect SNMs and indels from read alignments. All variant calls were merged, filtered and curated, guided by the visualization of read and assembly alignments using the Integrative Genomics Viewer (IGV) (Robinson et al. 2011), among other tools (see Methods).

    A list of the types and location of the mutations found in this study is provided in Supplemental Tables S3–S5. The average mutation rate, including all types of mutations across mouse strains, was calculated as M = 44.8 mutations per haploid genome per generation (corresponding to μ = 1.80 × 10−8 mutations per nucleotide site per generation). Mutation rates for each MA sample by type of mutation are detailed in Supplemental Table S6 and displayed in Figure 1. The spectrum of mutations was dominated by SNMs and indels, which occurred at similar rates, and comprised 33% to 66% of mutations across samples (Fig. 1). The proportion of SMs ranged from ∼1% to 5% (Fig. 1). In the sections below, we briefly detail results for SNMs and then elaborate on the rate and spectrum of new indels and SMs.

    Figure 1.

    The complete spectrum of new mutations. The mutation rate per haploid genome per generation (M) for different types of mutations (in colors) is shown for four MA samples from each strain. The percentages at the top of each bar indicate the proportion of SMs.

    Single-nucleotide mutations

    Based on the discovery of 2981 SNMs, we estimate that approximately 20 new SNMs arise de novo per haploid genome per generation in the mouse MA lines (i.e., M = 19.6 and μ = 7.86 × 10−9 SNMs per site per generation). This estimate falls within the range of previous estimates obtained using short-read sequencing (Adewoye et al. 2015; Uchimura et al. 2015; Milholland et al. 2017; Lindsay et al. 2019). It is also close to the estimate of μ = 7.9 × 10−9 obtained by Chebib et al. (2021) based on the observed number of SNMs segregating in the founders of our MA experiment. Moreover, this estimate is not significantly different from our recent estimate obtained by Illumina sequencing of 59 MA line samples (μ = 6.7 × 10−9; Wilcoxon rank-sum test, P = 0.07) (López-Cortegano et al. 2024). The spectrum of SNMs differs from the expectation based on equally frequent SNM types and a GC content of 41.7% and is characterized by a strong bias toward C→T transitions. Furthermore, nearly one-third of C→T mutations occur at methylated CpG sites (Supplemental Fig. S5). In the FVB samples, there was also a marked twofold bias toward C→A transversions, which agrees with previous studies in this strain (Supplemental Fig. S5; Dumont 2019; López-Cortegano et al. 2024). Overall, the SNM spectra were similar to what we previously observed based on Illumina sequencing (χ2 test, χ25 < 0.4, P > 0.9) (López-Cortegano et al. 2024).

    Indels

    We identified a total of 3997 indels, resulting in a rate of M = 24.4 indels per haploid genome per generation (μ = 9.76 × 10−9). Overall, this rate is slightly higher than that observed for SNMs. Compared with previous studies in mice based on short-read sequencing, the higher rate of indels in this study can be largely attributed to the inclusion of microsatellite contractions and expansions, because 90.5% of indels occurred in sites annotated as microsatellites (Supplemental Table S4). Considering the total length of microsatellite sequences within the callable genome, the average indel rate in sites annotated in microsatellites was two orders of magnitude higher than the rate for nonannotated regions (μ = 1.02 × 10−7 vs. μ = 1.05 × 10−9). To better understand the association of indels with tandemly repeated sequences, we investigated the relationship between the mutation rate and the linguistic complexity of the nucleotide sequence, which is a measure of the degree of sequence repetitiveness (Orlov and Potapov 2004). In the C3H data, we observed that on a logarithmic scale the rate of both insertions and deletions increased by almost three orders of magnitude, from μ ≈ 10−9.5 in genomic regions with average repetitiveness (near-zero scaled linguistic complexity) to μ ≈ 10−6 in the most repetitive, least complex sequences (Fig. 2). The rate of SNMs increased approximately linearly from μ = 10−7.7 to 10−8.7 (Fig. 2). We also observed similar relationships between the mutation rate and sequence repetitiveness in the other mouse strains and in data from an MA experiment in the green algae Chlamydomonas reinhardtii (Supplemental Fig. S6; López-Cortegano et al. 2023). The association of the mutation rate with a quantitative measure of sequence repetitiveness also confirms a previous hypothesis on the role of tandem repeats and irregularly spaced repeats as a source of genetic variation (Tautz et al. 1986).

    Figure 2.

    Relationship between SNM and indel rates (on a log10 scale) and sequence repetitiveness. Sequence repetitiveness was measured as the linguistic complexity of 101 bp genomic windows and then scaled to standard deviation units, with a mean of zero. On this scale, more negative values indicate more repetitive sequences. The observed rates of different types of mutations (in colors) are shown in data points. Dashed lines represent regression model fits of mutation rate against linguistic complexity. For SNMs, a linear model was fitted, whereas indel data fitted best a logistic regression; 95% confidence intervals are presented for all regression models.

    We observed that insertions occurred ∼30% more frequently than deletions, but deletions were slightly longer than insertions on average (4.54 vs. 3.93 bp; Wilcoxon rank-sum test, P = 1.24 × 10−2) (Supplemental Fig. S7). The spectrum of new indels was strongly biased toward 4 bp insertions and deletions in all strains, and these represent 66.3% of indels (Supplemental Fig. S7A). Almost all (99.2%) of these 4 bp indels are associated with microsatellite sequences. Among the indels identified in microsatellites (which comprise 90.5% of all indels), the most common microsatellite sequence motif was AGAT (>44%, Supplemental Fig. S7B), highlighting its hypermutability in mice. Our data do not, however, allow us to determine whether modifiers of the indel rate are present in any of the strains (Maksimov et al. 2023). Regarding indels of other lengths, 95.5% involved contractions and expansions of microsatellites that have a motif length equal to the indel length or an integer multiple/divisor of it (Supplemental Fig. S7C), consistent with polymerase slippage as a mechanism of mutagenesis (Ellegren 2004; Montgomery et al. 2013). Short (1 bp) indels predominated in the spectrum of indels unrelated to microsatellite annotation (66.9%) (Supplemental Fig. S7A).

    Structural mutations

    Although the strain-specific reference assemblies derived from the MA line founders are highly contiguous, >74% of the variant calls obtained from the assembly mapping methods were rejected as false positives following manual visualization of read alignments in IGV, underscoring the need for improved assembly mapping, alignment, and calling methods, particularly for long tandem repeats. In total, we report the discovery of 133 de novo SMs (61 in C3H, 39 in BL6, and 33 in FVB) (Supplemental Table S5). On average, we estimate that about one SM arises de novo per haploid genome per generation in the mouse genome (i.e., M = 0.90 and μ ≈ 3.60 × 10−10), and there is no statistical difference in the rate among strains (Kruskal–Wallis test [KW test], χ22 = 0.81, P = 0.67). Among the different types of SM, their rates were also similar across strains (KW test, χ22 < 2.84, P > 0.24) (Fig. 3A). Out of eight possible categories of SMs considered, only five were found. These are contractions and expansions of tandem repeats, deletions, duplications, and insertions of mobile elements. We found no evidence of new inversions, translocations, or excisions of mobile elements. Based on their rates and length distribution, the spectra of SMs can be categorized into two main types of mutations (Fig. 3B). SMs <1 kb are predominantly contractions and expansions of tandem repeat sequences. The spectra of SMs >1 kb differ to a greater extent among strains and are dominated by TE insertions, with remarkable variation in the types of active TEs (see below). SMs >10 kb were only present in BL6.

    Figure 3.

    Rates and spectra of SMs for three mouse strains. (A) The SM rate per haploid genome per generation (M) is plotted as open points and boxplots for different types of SMs (in colors). Data points represent individual MA samples. (B) The spectrum of new SMs by rate (M) and length (in kilobases, rounded to the nearest 0.5 kb).

    Tandem repeat mutations, which include contractions and expansions, are the most common type of SM, contributing M = 0.47 genome-wide (corresponding to μ ≈ 1.88 × 10−10). However, their rate is likely underestimated because highly repetitive sequences are more likely to be uncallable (Supplemental Figs. S1–S3). Additionally, the genome-wide rate of tandem repeat mutations can be recalculated to account for the fact that these mutations can only be called in tandem repeat sequences. This adjustment increases the mutation rate approximately 10-fold, implying a rate of about μ ≈ 2.18 × 10−9 mutations per tandem repeat site per generation.

    Although most contractions and expansions were <1 kb, we observed a single instance of an expansion of a ∼3 kb TCTTCT satellite into a higher-order repeat, possibly >100 kb (Supplemental Fig. S8). Although DNA strand slippage during the polymerization reaction is commonly implicated in short tandem repeat mutations (Schlötterer and Tautz 1992; Ellegren 2004), this cannot explain large expansions such as the one described here. An alternative mechanism is proposed in the Discussion section.

    Repeat sequences were also commonly involved in deletions (M = 8.12 × 10−2, μ ≈ 3.26 × 10−11) and duplications (M = 4.37 × 10−2, μ ≈ 1.76 × 10−11). To elucidate the mechanisms underlying deletions and duplications, we searched for evidence of homology between their breakpoints, following the method of Belyeu et al. (2021). Our analysis revealed that only the shortest (113 bp) and the two longest (6.5 and ∼45 kb) deletions appeared to have arisen through nonhomologous end-joining repair of double-strand breaks. Most (75%) of the remaining deletions could be attributed to homology-mediated DNA repair mechanisms. Among these, three deletions ranging in length between 600 and 1500 kb could be attributed to homologous recombination between interspersed paralogous sequences, namely, nonallelic homologous recombination. These included deletions of regions flanked by repeat annotation, such as TEs, and the deletion of previously duplicated TE sequences (Supplemental Fig. S9). Another six deletions ranging in length between 200 and 2100 bp could be attributed to microhomology-mediated end-joining involving homology tracts <10 bp (Supplemental Fig. S9). Microhomology-based mechanisms of DNA repair, such as break-induced replication, can be mutagenic (Kramara et al. 2018), and we observed evidence of this. Two deletions attributed to microhomology-mediated repair were only separated by 63 bp, whereas another was in close proximity to a 147 bp duplication. Among the remaining five duplications in our data set, three could be explained by nonallelic homologous recombination (Supplemental Fig. S10). Finally, the two smallest duplications, which are shorter than 150 bp, could not be attributed to any homology-mediated mechanism.

    Mobile elements

    Insertions of mobile elements occurred at an average rate of M = 0.30 per haploid genome per generation (corresponding to μ ≈ 1.22 × 10−10) (Fig. 3). This rate suggests that the majority (60%) of individuals in a mouse population experience one new insertion per generation. Most (∼93%) mobile insertions were owing to TEs, with the exception of three gene retrocopies in the FVB strain and one in C3H (see below). We found substantial heterogeneity in the rate and spectra of mobile elements among MA samples of the same strain (Fig. 4).

    Figure 4.

    Mobile insertions. The rate of insertions per haploid genome per generation (M) is shown by MA sample (on the y-axis) and type of mobile element (in colors).

    All active TEs identified corresponded to class I transposons (i.e., retrotransposons). The most active TEs in BL6 and FVB samples were long interspersed element-1 (LINE-1 or L1) transposons (MBL6 = 0.155 and MFVB = 0.174, vs. MC3H = 0.032). However, there was a notable difference in the length of L1 insertions between BL6 and FVB, BL6 exhibiting large and nearly complete L1 elements insertions (ranging from 6.5 kb to 7.3 kb), whereas 50% of L1 insertions in FVB were 5′-truncated to lengths <1.6 kb. Although L1-dependent nonautonomous transposons were mobilized in all strains, their activity was most pronounced in the FVB samples (MFVB = 0.261, vs. MC3H = 0.105, and MBL6 = 0.078). The majority (82%) of insertions of L1-dependent short interspersed elements (SINEs) were of type B2 (∼82%), whereas B1 SINE insertions were only observed in one FVB sample.

    In contrast to BL6 and FVB, TE activity in C3H was predominantly caused by endogenous retroviruses (ERVs). Among these, intracisternal A-particles (IAPs) (Kuff and Lueders 1988) were the most active TEs in this C3H, inserting at similar rates among samples of this strain (M ≈ 0.206) (Fig. 4). IAPs are indeed a type of ERV that is known to be highly active in C3H mice (Ferraj et al. 2023). We identified only one insertion of a 6.3 kb murine ERV in a single C3H sample, but ERV activity was also evident in BL6 via the insertion of a large (1.9 kb) ERV3-dependent nonautonomous TE named “mammal apparent LTR retrotransposon” (MaLR).

    To gain further insight into TE activity in the three mouse strains, we used the megablast algorithm (Camacho et al. 2009) to trace individual TE insertions back to their putative mother copies in the MA line founders’ genomes. For this analysis, only fully autonomous TEs that had multiple insertions in our samples were considered. By filtering the BLAST results by their highest score, all matches had sequence identities >99% and an E-value of zero. We found that nearly all (∼70%) IAP insertions were derived from unique mother copies not shared by other insertions (Supplemental Table S7), contradicting the hypothesis that low-copy “master copies” exist in the genome and are responsible for most recent insertion events (Rebollo et al. 2020). This result is further evidenced by the highly variable length of IAP insertions, which ranged from 5283 bp to 5491 bp (Supplemental Table S7). Among L1 insertions >500 bp, all were traced back to unique copies in the C3H, BL6, and FVB MA founder genomes.

    In addition to TE insertions, we also found four gene retrocopies (i.e., RNA-mediated gene duplications). Retrocopies were identified by using BLASTX to query DNA insertions against the release 106 of the Ensembl peptide library of the GRCm39 reference, which returned 100% sequence identity in all cases. One large retrocopy (3.3 kb long) in C3H corresponded to the gene S-adenosylmethionine decarboxylase 1 (Amd1). The remaining retrocopies were in FVB and corresponded to the bromodomain containing 2 (Brd2), 60S ribosomal protein L12 (Rpl12), and the survival motor neuron 1 (Smn1) genes. Independent retrocopies of two of these genes (Brd2 and Smn1) were also detected in retrocopy surveys of natural mouse populations (Zhang et al. 2021; Zhang and Tautz 2022). According to the EBI Gene Expression Atlas (Kapushesky et al. 2010; Moreno et al. 2022), these genes are highly expressed in the testis and more so than in the ovaries (Supplemental Fig. S11). This parallels the known effect of paternal age on SNM rates (Kong et al. 2012), suggesting that new retrocopies might predominantly originate in males. Additionally, it implies that the actual rates of de novo retroposition could effectively double our rate estimates (see below), because retrocopies would primarily arise in the male germline. Retrocopies were present in C3H and FVB samples with many insertions of L1 or L1-dependent elements (Fig. 4), suggesting high L1 activity. The rates of retroposition in these two strains were MC3H = 1.58 × 10−2 and MFVB = 6.52 × 10−2. The combined rate of new retrocopies across strains and MA samples was M = 2.70 × 10−2, a rate substantially higher than previously described in mammals (Ewing et al. 2013).

    Discussion

    Characterizing the complete spectrum of new mutations is important for understanding the origin and nature of genetic variation. However, mammalian genomes are complex and are rich in repeat sequences, and this complicates the analysis of new variation arising from mutation. Many previous studies on the rate of de novo mutation in humans and mice have focused solely on SNMs (e.g., Kong et al. 2012; Lindsay et al. 2019). Several studies have also reported rates of de novo structural variation in mammals (e.g., Kloosterman et al. 2015; Belyeu et al. 2021; Steensma et al. 2023), but these studies have employed short-read technology, which limits the SMs detected primarily to copy-number variants and hinders the discovery of mutations within repetitive sequences. Here, we provide the first estimates of the de novo mutation rate in the house mouse (Mus musculus) obtained from long-read sequencing of individuals from MA lines. To avoid reference bias and minimize false-positive variant calls, we built strain-specific chromosome-level assemblies from the MA line founders. Using a combination of read and assembly-based approaches, we obtained estimates of the SNM rate that are highly consistent with previous estimates from the same lines based on Illumina sequencing. We estimate that each individual mouse experiences approximately 90 new mutations per generation, of which ∼54% are indels. Additionally, we find that SMs comprise ∼2% of new mutations, involving ∼1.5 kb of sequence change per generation and accounting for >90% of the changes in base pairs, which presumably have larger effects on quantitative phenotypes and fitness-related traits than SNMs and indels (Chakraborty et al. 2019; Mérot et al. 2020).

    We show that the degree of sequence complexity is strongly associated with the mutation rate (Fig. 2; Supplemental Fig. S6). About 90% of indels in our study occurred within low complexity, highly repetitive microsatellite sequences, a sequence class that is usually excluded from short-read analyses. In cases in which indels have been included, short-read sequencing has yielded lower estimates of indel rates, but this is likely because of challenges in variant calling in repetitive regions. For example, Uchimura et al. (2015) reported an indel rate of μ = 3.1 × 10−10, consistent with our estimate in the least repetitive regions (μ ≈ 3.2 × 10−10, μ ≈ 10−9.5 in Fig. 2).

    Our finding that indel rates may be considerably higher than previously believed has important implications across several fields. Tandem repeat mutations are implicated in numerous medical conditions and disorders (Mirkin 2007), including cancers (Vilar and Gruber 2010). Moreover, it has been proposed that tandem repeat contractions and expansions could explain part of the missing heritability for quantitative traits, because these variants are often undetected in the absence of dedicated analysis techniques. These variants, however, may have substantial effects on quantitative traits (Gymrek and Goren 2021; Mukamel et al. 2021). Our findings in mice suggest that indel rates in other mammals may also exceed current estimates (Venn et al. 2014; Harland et al. 2018; Deng et al. 2019; Ohno 2019). However, it is noteworthy that a substantial fraction of indels are associated with “AGAT” microsatellite instability. The reasons for the hypermutability of “AGAT” microsatellites is unknown to us, as is whether this phenomenon extends to other species.

    Tandem repeat mutations, such as indels and large contractions and expansions, can be explained by stepwise models of DNA strand slippage or nonallelic homologous recombination, also known as ectopic recombination. However, the mechanisms underlying the expansion we observed affecting many repeat copies (Supplemental Figs. S7, S8) remain poorly understood. For example, mutations such as the 30-fold expansion of a 3 kb satellite in MA sample BL6 2 (Supplemental Fig. S8) cannot be explained by processes such as DNA strand slippage. An alternative mechanism, extrachromosomal rolling-circle replication followed by reinsertion, has been proposed as a mechanism for the amplification and dispersal of satellite DNA (Garrido-Ramos 2017). Evidence suggests that this mechanism is active in vertebrates (Rossi et al. 1990; Feliciello et al. 2006; Cohen et al. 2010) including in the human germline (Henriksen et al. 2022). It is possible that this mechanism could explain the expansion of a large DNA satellite in MA sample 2 of BL6 (Supplemental Fig. S8). The same mechanism could also explain previous results in mice involving tandemly repeated small nucleolar regulatory RNA (SNORD) loci, in which inter-generational changes in the copy number of SNORD repeats were observed (Keshavarz et al. 2021). However, SNORD clusters consist of tandem repeats that are substantially longer than the PacBio reads obtained here, and we were unable to detect evidence of copy-number variation in these loci. De novo extrachromosomal circular DNA (eccDNA) variants have previously been observed to be transmitted in the germline (Stoll and Roth 1983; Turner et al. 2008), but the use of the circular consensus sequencing (CCS) mode with PacBio, which involves circularizing DNA, did not allow the analysis of eccDNA in our study.

    In addition to tandem repeats, interspersed repeats can mediate large SMs, such as deletions and duplications, via homology-dependent mechanisms (Supplemental Figs. S9, S10). Indeed, ectopic recombination is widely recognized as a major contributor to structural variation in the genome (Gray 2000; Parks et al. 2015; Belyeu et al. 2021; Balachandran et al. 2022), and many instances are mediated by copies of TE sequences. Interspersed TE sequences represent a substantial fraction of mammalian genomes and have been linked to genome size evolution (Naville et al. 2019). Considering all types of SM, we estimate that the mouse genome expands at an average rate of ∼1 kb per generation and that TEs contribute ∼70% of this expansion. TEs are also recognized to play important and diverse roles in adaptation, health, and genome evolution (Konkel and Batzer 2010; van't Hof et al. 2016; Ricci et al. 2018). For example, both L1 and IAP elements can contribute regulatory sequences that could reshape the transcription landscape (Ferraj et al. 2023). TE insertions stand out in our data because of their high rates relative to other types of SM (Fig. 3) and their diverse spectrum among samples and strains (Fig. 4). Notably, we detected movement of IAP elements, a kind of element that is known to be highly active in C3H mice (Gagnier et al. 2019; Rebollo et al. 2020). We also saw movements of L1 elements in all strains, and this is expected because these elements are believed to be 20–30 times more active in mice than in humans (Gagnier et al. 2019). Our analyses appear to suggest a substantially higher rate of de novo TE insertion in mice than in humans (Maksakova et al. 2006; Feusier et al. 2019), but future pedigree-based studies in humans using long-read technology may lead to higher estimates of TE insertion rates.

    Additionally, pedigree-based studies could help determine whether LINEs and SINEs are predominantly inherited via the male germline. A previous study estimating the TE insertion bias from polymorphic TE variants on Chromosome X and the autosomes suggested that the majority of TE insertions occur in the male germline (Nellåker et al. 2012). Consistent with this result, the transcription profile of retrocopied genes in our study (Supplemental Fig. S11) suggests that TEs, particularly LINEs, are most active in the male germline. Because our MA experimental design minimizes selection, our results also suggest that the high rate of parentally inherited de novo TE insertions could be owing to distinct TE suppression pathways between male and female germlines (Almeida et al. 2022). A previous estimate for the generation of new retrocopies obtained by comparing two closely related natural populations of mice suggested a rate of 200 new retrocopies per 10,000 years (Zhang et al. 2021), approximately equivalent to M = 1 × 10−2, assuming two generations per year. The rates MC3H and MFVB observed in the MA lines are higher than this estimated rate for natural populations, supporting the idea that retrotransposed copies are subject to negative selection owing to frequent interference with the expression of the parental genes (Zhang et al. 2021). Finally, it is noteworthy that rates of TE insertion were similar among strains, yet there was substantial variation in the spectra of active TEs (Fig. 4). This finding is biologically meaningful, because different TEs have the potential to cause distinct types of SMs (López-Cortegano et al. 2023) and to induce differential changes in chromatin state (Ferraj et al. 2023), which may be key for understanding the evolution and divergence of closely related lineages. Given the similarities in the relative abundances of TEs and their evolutionary ages across each strain's genome (Supplemental Fig. S12), the observed differences in active TE spectra are likely because of variations in TE suppression machinery. The highly polymorphic family of Krüppel-associated box zinc finger proteins (KRAB-ZFPs), known to bind and repress TEs in mammals (Wolf et al. 2020), varies in gene copy number among the three strains we studied (Supplemental Fig. S13) and also in natural populations of mice (Pezer et al. 2015). This supports the idea that structural changes in clusters encoding KRAB-ZFPs may underlie differences in the rates and spectra of active TEs.

    Many of the new mutations occurring in mice seem to be mediated by various types of repeats, as repeat sequences underlie many of the SMs observed, and greatly increase the rate of SNMs and indels. Although we used long-read sequencing technology, it is possible that we underestimated rates for contractions and expansions in large, uncallable tandem repeat regions. In our study, these regions include the centromeres (Supplemental Figs. S1–S3), which are acrocentric in mice and consist of repeated ∼120 bp motifs forming tens to hundreds of kilobases of long satellite sequences (Iwata-Otsubo et al. 2017; Packiaraj and Thakur 2024). Rodents are karyotypically diverse (Bengtsson 1980) and have experienced more chromosome rearrangements than other mammals, including primates (Bourque et al. 2004). Cytogenetic evidence (Patton and Sherwood 1983; Piálek et al. 2005) suggests that these rearrangements may frequently involve the centromeres (e.g., pericentric inversions and Robertsonian fusions and fissions), which would go undetected in our study. However, it should be noted that previous evidence suggests that chromosome rearrangements are rare in mouse inbred lines. For example, a comparison of our MA founders’ genome assemblies with previous assemblies for the same strains (Lilue et al. 2018) revealed that the only large rearrangement was a 58 Mb inversion (Supplemental Fig. S4), previously described by Akeson et al. (2006) in C3H/HeJ mice. This suggests that inversions and translocations in mice could be highly deleterious or occur at extremely low de novo rates. However, it has been suggested that inversions may have high rates in other mammalian taxa (e.g., in gibbons) (Mercuri et al. 2022). The use of ultralong reads (e.g., Oxford Nanopore Technologies) could help resolving large repeats in genome assemblies and enhance the detection of SMs. Nevertheless, further investigation is required to fully understand the complete spectrum of new mutations, which will depend on the development of improved and more accessible long-read sequencing technologies, along with tools for analyzing large tandem repeats.

    Methods

    Biological samples, high-molecular-weight DNA extraction, and sequencing

    The mouse samples sequenced in this study were part of an MA experiment described in parallel studies (Chebib et al. 2024; López-Cortegano et al. 2024). Briefly, one pair of inbred founders from each of the three mouse strains C3H/HeNRj (C3H), C57BL/6JRj (BL6), and FVB/NRj (FVB) were obtained from Janvier Labs (https://janvier-labs.com), and mated within the facilities of the Max Planck Institute for Evolutionary Biology. From the founders, mice were bred by full-sib mating to establish MA lines for each strain. The C3H, BL6, and FVB MA lines were bred for 15, 8, and 10 generations, respectively. The shorter duration of the BL6 and FVB experiments reflects their discontinuation prior to the opportunity to sequence MA samples from multiple strains with long reads. To limit the influence of environmental factors, all mice were maintained under similar conditions in individually ventilated cages (green line, Tecniplast), with the same diet (“1328 forti” pellets, Altromin). We also limited the effect of paternal age on the rate of mutation by breeding mice within a narrow range of ages (∼12 and 20 weeks) (López-Cortegano et al. 2024). Maintenance and handling procedures adhered to the German animal welfare law (Tierschutzgesetz) and the Federation of European Laboratory Animal Science Associations (FELASA) guidelines. The necessary permits for housing and caring for the mice were acquired from the local veterinary office “Veterinäramt Kreis Plön” (permit number PLÖ-0004697).

    We sequenced the C3H, BL6, and FVB founders and four mice from each strain randomly selected from different MA lines. High-molecular-weight (HMW) DNA was extracted from 8 to 15 mg of kidney tissue using the Bionano animal tissue DNA isolation kit (Plug Lysis). DNA was then eluted into TE buffer and sequenced at the Max Planck Institute of Molecular Cell Biology and Genetics using the PacBio Sequel II platform with the CCS mode to generate HiFi reads. The read coverage values for each sample are given in Supplemental Table S1.

    Raw PacBio files were processed with ccs 6.4.0, part of the PacBio pbccs package (https://ccs.how/), using the option “‐‐hifi-kinetics” to generate highly accurate consensus HiFi reads with kinetic information for 5-methylcytosine (5mC) prediction. Details on the prediction of 5mC sites are given in the Supplemental Material.

    Genome assembly

    De novo contig-level assemblies were generated for all samples sequenced using hifiasm 0.16.1 (Cheng et al. 2021, 2022). Unless otherwise stated, two partially phased assemblies returned by hifiasm for each mouse were used. These partially phased assemblies provide information of the diploid genome, but accurate phasing was not possible in the absence of Hi-C data or sequencing data from the parents.

    Chromosome-level assemblies were generated for each strain using the data from two MA founders. To do so, highly contiguous primary assemblies were first generated to the contig level with hifiasm using a pool of PacBio HiFi reads from the MA founders, following the assumption that the two full-sib inbred founders are nearly identical at the sequence level (Chebib et al. 2021). When running hifiasm on the MA founders, the option “-l0” was invoked to disable the purge duplication step, as recommended for primary assemblies from highly inbred genomes. Raw assemblies were then filtered to retain only high-quality contigs that have at least 5× mean read coverage support and a minimum length of 30 kb. Finally, the polished assemblies were mapped for manual scaffolding to the GRCm39 reference genome (NCBI Genomes database [https://www.ncbi.nlm.nih.gov/home/genomes/] assembly accession GCA_000001635.9). C3H and FVB assemblies were also validated following their mapping against the C3H/HeJ v3 (GCA_921997125.2) and FVB/NJ v3 (GCA_921998635.2) reference genomes, respectively. Further details can be found in the Supplemental Material.

    Genome alignment

    As described below, we used three distinct variant calling methods to detect SMs, each based on a different genome alignment approach. First, aligning individual MA sample reads to the reference genome. Second, aligning individual MA sample assemblies to the reference. Third, aligning assemblies from multiple MA samples into strain-specific “pangenomes.” In all cases, the chromosome-level assemblies of each strain served as reference genomes.

    Read alignments were produced with the PacBio recommended software pbmm2 1.7.0 (https://github.com/PacificBiosciences/pbmm2), using the “‐‐preset HIFI” option. Read alignment files were further processed with the command “calmd” from SAMtools v.1.13 (Danecek et al. 2021) in order to retain mismatching positions in the reads. Assembly alignments for single MA samples were produced with minimap2 v.2.24 (Li 2018), including the option “-x asm5” for the alignment of assemblies that have low divergence from the reference. Pangenomes were produced with the software minigraph 0.17 (Li et al. 2020), using the recommended options “-c” for base alignment, and “-x ggs” to generate an output file in graphical fragment assembly (GFA) format. In addition, pangenomes were produced in hierarchical alignment (HAL) format with cactus 2.7.0 (Garrison et al. 2018; Armstrong et al. 2020) by applying the Minigraph-Cactus pangenome pipeline (Hickey et al. 2024).

    Genome annotation

    Gene annotation for the GRCm39 reference genome was downloaded from Ensembl (https://ftp.ensembl.org/pub/release-106/gff3/mus_musculus/). Gene coordinates from GRCm39 were transferred to the C3H, BL6, and FVB chromosome-level assemblies using the HAL pangenome files and the software halLiftover 2.1 (Hickey et al. 2013).

    Repeat annotation was generated de novo for each reference genome by running RepeatMasker 4.1.2 (https://www.repeatmasker.org) with the option “-species ‘Mus musculus’” in order to include annotation of known murine repeats and TEs from the open library Dfam 3.3 (Storer et al. 2021). The repeat annotation produced by RepeatMasker was complemented with microsatellite and satellite sequences annotated by Tandem Repeats Finder (TRF) (Benson 1999), run with options “2 7 7 80 10 50 1000 -f -d -m -ngs.” When more than one TRF annotation overlapped a genomic site, we selected the annotation with the shortest repeat motif, and if several annotations remained overlapping a site, we then selected the longest annotation.

    As a quantitative measure of sequence repetitiveness, we generated annotations for values of “linguistic complexity” (Orlov and Potapov 2004) using the software NeSSie (Berselli et al. 2018). Such a parameter is measured as the k-mer richness in a sequence, namely, the count of unique k-mers observed in a sequence over the maximum possible in that sequence length. Linguistic complexity was calculated at base-pair resolution (using option “-s 1”) in genomic windows of 50 bp extending upstream of and downstream from each reference site, and scaled to mean zero and to one standard deviation unit.

    Callable sites

    Only genomic sites where mutations could be called with high confidence were considered as “callable” sites. Here, we used a criterion similar to the one defined previously in a study of C. reinhardtii (López-Cortegano et al. 2023). The basis of this approach is to exclude highly repetitive regions where assembly mapping is inaccurate. The software minimap2 was used to map each sample's primary assembly against the reference genome, retaining only mapping sections with the highest mapping quality (MQ = 60). Then, coordinates when mapping failed for any sample were logged, and if the distance between these coordinates for any pair of samples was <30 kb, the entire region was deemed uncallable. Callable sites were further required to be supported by a minimum coverage of five reads for the two MA founders and for at least one MA sample. Because we observed that uncallable sites were usually colocalized with tandem repeat annotation >30 kb, we also removed these regions from the set of callable sites. Finally, uncallable regions at a distance <30 kb were merged together and deemed uncallable. It should be noted that using this criterion, uncallable sites will consist mostly of highly repetitive sequences (Supplemental Figs. S1–S3), and this will likely result in the underdetection of SMs in tandem repeats.

    Identification of de novo SNMs and indels

    We used DeepVariant 1.4.0 (Poplin et al. 2018) to call SNMs and indels from PacBio read alignments enabling the option “‐‐model_type=PACBIO.” A combined VCF containing mutation calls from all the samples was then generated with GLnexus 1.3.1 (Yun et al. 2021) using the configuration preset for raw DeepVariant VCF files (“‐‐config DeepVariant_unfiltered”). To improve indel variant calling, VCF files were postprocessed and left-aligned using gvcf_norm v.0.1.0 (https://github.com/mlin/gvcf_norm) and BCFtools norm 1.9 (Danecek et al. 2021). Mutation candidates were then filtered to retain biallelic sites where the mutation was unique to one MA sample, supported by a minimum of five reads, and a genotype quality (GQ) of 30 in the Phred scale. Homozygous calls required at least 75% of reads supporting the mutation, whereas heterozygous calls required between 25% and 75% of reads supporting the mutation. All calls were inspected visually using the Integrative Genomics Viewer (IGV) 2.13.2 (Robinson et al. 2011), and variants with ambiguous support were removed as false positives (e.g., Supplemental Fig. S14).

    Identification of de novo SMs

    We used a combination of read and assembly-based approaches to call SMs, followed by extensive manual curation. As mentioned above, each approach was based on alignments of different types of genomic data: reads, assemblies, and pangenomes.

    SMs were called from read alignments with Sniffles 2.0.2 (Smolka et al. 2024). To ease the identification of mutations called in multiple samples, VCF files were merged using SURVIVOR v.1.0.7 (Jeffares et al. 2017). To call mutations from the assembly alignment of individual MA samples, we used the minimap2-based pipeline dipcall (Li et al. 2018). For all male samples, we included the pseudoautosomal region (PAR) as required by dipcall. Based on the mapping to the GRCm39 reference, the PARs were on Chromosome X with coordinates 160,721,147–161,549,611 for C3H, 162,666,591–163,590,379 for BL6, and 168,647,154–169,177,736 for FVB. SMs were also called from pangenome GFA files using the gfatools 0.4 command “bubble” (https://github.com/lh3/gfatools).

    All SM calls were inspected visually in both read and assembly alignments using IGV, and those deemed not to be false positives (e.g., Supplemental Fig. S14) were classified into one of eight types: contraction or expansion of tandem repeats, deletion, duplication, excision or insertion of mobile elements (including TEs and retrogenes), inversion, and translocation. Consistent calls of a mutation in the reads and in the assembly alignments were necessary for the mutation to be considered genuine, and a minimum depth of five reads supporting the mutation was necessary. Large SM calls and chromosome rearrangements were also visualized in Ribbon v1.1 (Nattestad et al. 2021). Heterozygote SMs that were ambiguous in the partially phased assembly were inspected visually with Bandage 0.8.1 (Wick et al. 2015). Insertions and other SM breakpoints were queried with BLAST 2.13.0 (Camacho et al. 2009) against a local database of TEs and queried against the Dfam database (Storer et al. 2021; https://dfam.org) to identify known TE families. Self-to-self dotplots were also generated with the MAFFT v7 online server (Katoh et al. 2019) in order to visualize mutation calls, such as copy-number variants, aiding the determination of whether deletions and duplications calls corresponded to contractions and expansions of tandem repeats. In addition, only mutations that were unique to a single MA sample were considered as genuine mutations. Exceptionally, some TE families could be expected to excise in multiple samples if active, but no excisions were found. When contractions or expansions of different lengths occurred in the same microsatellite or satellite in different samples, each distinct contraction or expansion event was counted as a unique mutation. Similarly, if a MA line was heterozygous for two unique mutations (e.g., one contraction plus one expansion) in a tandem repeat, these were considered as two mutation events.

    Calculation of the mutation rate

    The per haploid genome per generation mutation rate (M) was estimated as M = Ni/Naj, where Ni is the total number of mutations found in sample i, and Na(j) is the number of ancestors in the strain j’s pedigree. To calculate the per nucleotide per generation mutation rate (μ), M was divided by 2Nc(j), where Nc is the length of the haploid callable genome of strain j. Additionally, both rates were multiplied to a correction factor (α) similarly as described by López-Cortegano et al. (2024). In short, α is calculated using a mutation dropping approach on each strain's pedigree to calculate the proportion of heterozygous and homozygous mutations expected to be recovered in the sequenced samples. Therefore, α accounts for two sources of bias: (1) the loss of mutations in unsequenced ancestors of the MA samples, and (2) the exclusion of genuine mutations that occurred early in the MA experiment and that were excluded because they were not unique to a single MA sample. The values of α used were 3.67, 3.00, and 3.22, for C3H, BL6, and FVB, respectively.

    When calculating μ at specific regions, for example, at microsatellite sequences for indels, the value of Nc was adjusted accordingly. For SMs other than contractions and expansions, we considered providing μ estimates for repetitive and nonrepetitive regions. However, several challenges complicate their estimation and limit their interpretability. For example, deletions and duplications may contain repeats, but only repeats at the breakpoints are likely to be relevant. Specifically, for SMs mediated by nonallelic homologous recombination, only certain pairs of repeats could be involved, making it difficult to adjust Nc for “repetitive” versus “nonrepetitive” regions in a meaningful way. Regarding TE insertions, different TE families may have different target-site preferences and different values of μ at different contexts. Given these difficulties and the limited number of SMs detected compared to SNMs and indels, we did not partition the rates of SMs per genome annotations.

    Data access

    All PacBio HiFi reads and chromosome-level assemblies generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA1112615.

    Competing interest statement

    The authors declare no competing interests.

    Acknowledgments

    We thank Cornelia Burghardt and Michaela Schwarz for laboratory work and handling the DNA sequencing and Kristian Ullrich for the CCS processing of BL6 and FVB PacBio reads. Analyses performed here made use of the high performance computing resources at the Ashworth Compute Co-operative Cluster (AC3) at the Institute of Ecology and Evolution of the University of Edinburgh. This project has received funding from the European Research Council under the European Union's Horizon 2020 Research and Innovation program (grant agreement no. 694212) and from institutional funds of the Max-Planck Society (Max-Planck-Gesellschaft).

    Author contributions: E.L.-C., J.C., D.T., and P.D.K. conceived the study. E.L.-C. did the computational analyses. A.J., A.V., and S.K. generated the mouse lines and samples. E.L.-C. wrote the manuscript with contributions by P.D.K. and D.T.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279982.124.

    • Freely available online through the Genome Research Open Access option.

    • Received September 6, 2024.
    • Accepted November 26, 2024.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    Articles citing this article

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server