Chromatin structure influences rate and spectrum of spontaneous mutations in Neurospora crassa

Although mutation rates have been extensively studied, variation in mutation rates throughout the genome is poorly understood. To understand patterns of genetic variation, it is important to understand how mutation rates vary. Chromatin modifications may be an important factor in determining variation in mutation rates in eukaryotic genomes. To study variation in mutation rates, we performed a mutation accumulation (MA) experiment in the filamentous fungus Neurospora crassa and sequenced the genomes of the 40 MA lines that had been propagated asexually for approximately 1015 [1003,1026] mitoses. We detected 1322 mutations in total and observed that the mutation rate was higher in regions of low GC, in domains of H3K9 trimethylation, in centromeric regions, and in domains of H3K27 trimethylation. The rate of single-nucleotide mutations in euchromatin was 2.46[2.19,2.77]×10−10. In contrast, the mutation rate in H3K9me3 domains was 10-fold higher: 2.43 [2.25,2.62]×10−9. We also observed that the spectrum of single-nucleotide mutations was different between H3K9me3 and euchromatic domains. Our statistical model of mutation rate variation predicted a moderate amount of extant genetic variation, suggesting that the mutation rate is an important factor in determining levels of natural genetic variation. Furthermore, we characterized mutation rates of structural variants, complex mutations, and the effect of local sequence context on the mutation rate. Our study highlights that chromatin modifications are associated with mutation rates, and accurate evolutionary inferences should take variation in mutation rates across the genome into account.

of nuclei in a given area of hyphae, we used the counts of nuclei and the hyphal areas measured 848 from the microscope images to obtain the number of nuclei per µm 2 . We had images for both VM 849 and sorbose plates, in total we collected 519 measurements. To estimate average density of nuclei 850 for VM and sorbose we used a model where we allowed standard deviations to differ for VM and 851 sorbose media: where y i is the ith density measurement, ρ is the intercept, β s is the effect of sorbose medium, x i 853 an indicator variable for sorbose, α σ is the intercept for standard deviation, and β σ is the effect of 854 sorbose medium on standard deviation. The average density of nuclei in VM medium is ρ and the 855 density of nuclei in sorbose is obtained as ρ s = ρ + β s .

856
To estimate the average size of colonies on sorbose plates, we plated conidia on sorbose plates 857 as in the MA experiment and photographed the plates. Millimeter paper was used as a scale. 858 Colony area was measured from these images with ImageJ2 version 2.0.0-rc-43/1.50e. The pixels 859 per millimeter calibration value was set by measuring the number of pixels per 1 mm of millimeter 860 paper. The images were enhanced with the sharpen tool to make the colony outlines more distinct.

861
The colony area was measured using the elliptical selection tool. We used 10 different genotypes α g ∼ N(ᾱ, σ g ) α ∼ N(0, 3) σ, σ g ∼ hT(3, 0, 10) where y i is the ith area measurement,ᾱ is the overall mean, α j is the mean for jth genotype, σ g genotype, σ g is the genotype standard deviation, and σ is the error standard deviation. Priors Kit, and DNA quality was checked by running 2 µL of the sample on an 0.8% agarose gel. 913 Read mapping and genotyping 914 To be able to map reads to the mating type locus in the mat a strains, we included the mating type 915 a region, as well as the mitochondrial genome, as additional contigs.  were considered correct if the simulated and detected SVs were 1) of the same type 2) on same chromosome and 3) both start and stop locations were within 50 bp. The callers that performed the 983 best were DELLY and LUMPY as they showed high sensitivity score and low false discovery rate 984 (FDR) score (Table S6), and they were selected to call SVs on the MA lines.  (Table S7).

1009
For genotyping CNVs in the MA lines we excluded MA line sites if the start or stop location of 1010 these where within 500 bp of any site detected in the ancestor. Also, we only retained the sites that 1011 were detected by both callers CNVnator and CNVseq (if 1000 bp or less overlapped at the start or 1012 end location). The remaining sites were manually verified by inspecting the alignment file in IGV.

1013
However, we did not find any evidence of copy number changes in the MA lines.
where τ j is an offset term for class j that allows taking into account differences in the abundance To model the effects of epigenetic domains and GC-content on mutation rate we used the following 1042 model: where y i is the number of mutations a class of i intervals contained, τ i is the number of class i interindicates presence or absence of H3K27me3, and c i indicates presence or absence of centromeric 1046 region. β coefficients are the corresponding effects and α is the intercept.

1047
Model selection was a combination of biological and statistical reasoning, and we tested mod-1048 els representing plausible biological hypotheses. For instance, we had a clear biological reason to 1049 expect that GC-content influences mutation rate, and we saw a large improvement in model pre-

1057
When we assessed how well did the mutation model predict the natural genetic variation we 1058 used the predicted mutation rates from model S10 as a response and θ W calculated from a popula- 2). Those regions that were marked as duplicates, but which did not overlap with H3K9me3 or sitions. As C → T transitions were not over-represented, action of RIP is unlikely to be responsible 1068 for these mutations, which is expected as RIP is active only during meiosis.

Effects of local sequence context
To analyze effects of local base composition on the mutation rate, we estimated the effects of the We compared different linear models (Table S4) with the same reasoning as above. We did not 1079 include an intercept in this model, as we wanted to obtain estimates for all trinucleotide classes, 1080 and not set one class as the intercept against which the others are compared. This does not alter any 1081 biological conclusions. 1082 We further investigated how the flanking base pairs influenced the relative mutation rates of 1083 the trinucleotides. We extracted estimates of the relative mutation rates for the trinucleotides from 1084 model S11, and used these as a response in a model where we predicted relative mutation rates with 1085 the identities of the flanking base pairs and the mutating base. Since our estimates of the relative 1086 mutation rates contain uncertainty, we included the estimated error of the relative mutation rates in 1087 the model. The model was: where y obs,i is the median of ith observed relative mutation rate, y sd,i is the observed standard 1089 deviation of the ith relative mutation rate, y est,i is the ith estimated relative mutation rate, α is the 1090 intercept, β b is the effect of C:G relative to A:T for the mutating base, β 5 is the effect of C:G relative low GC-content, and while we did observe that coverage went down in regions of < 15% GC, 1108 those regions constitute a very small fraction of the genome. Next, we explored the accuracy of our 1109 mutation calls after the mutations had been called by our pipeline and manually inspected in IGV. 1110 We observed that overwhelming majority of mutations had the highest possible genotype quality 1111 score determined by the GATK pipeline ( Figure S1). Median genotype quality for mutations was  Figure S14, S15, S16, S17, S18, S19, S20, S21, S22). When mutations had genotype quality 1122 score of 99 they were unambiguous ( Figure S14, S17, S20). When genotype qualities were around   (Table S1). As such, there are likely not many reads  (Table S1). This allowed us to discriminate 1161 between true mutations and mapping errors. With this kind of sequencing depth, sequencing errors 1162 are simply not an issue anymore and they have no impact on calling the mutations, e.g. Figure S18 1163 shows a mutation in repetitive region that as a consequence has higher frequency of sequencing 1164 errors, but with so many reads identifying the real mutation is not a problem. same mycelium). However, we did not find any evidence of true mutations in heterokaryotic state.

1171
Whenever sites appeared as heterozygous, multiple sites were found close together ( Figure S19),

1172
indicating that read mapping errors were the more likely explanation. Because of these factors, 1173 our study differs substantially from studies that need to call heterozygous sites from data with low 1174 sequencing depth and the problem of calling genotypes correctly is of different nature.

1175
In summary, overwhelming majority of mutations that our pipeline detected had the highest

1192
We simulated 40 different MA lines for each scenario with a transition / transversion rate of 1.08.

1193
We then generated simulated reads from these simulated genomes, using DWGSM (Homer, 2021), or assembly errors, it does give us confidence that we will be able to detect a real difference in mu-1228 tation rates. Moreover, since we did not observe any false positive mutations, we are confident that 1229 mutation calling cannot generate spurious results in our case. We did observe slightly higher pro-1230 portions of false negative mutations in H3K9me3 regions. However, if this bias is true for real data, 1231 this would make our estimate of the elevated mutation rate in H3K9me3 regions more conservative.

1232
Robustness of relationship between θ W and predicted mutation rate 1233 We wanted to evaluate the robustness of the observed relationship between θ and the predicted 1234 mutation rate. One potential issue is that there are windows in the genome, especially for small 1235 window sizes, where the observed θ is zero. Since zero is the minimum value that θ can obtain, and there is a clumping of θ = 0 observations in the data, this violates the assumption that response is gaussian and could lead to biased estimates. However, since there so many data points, the model 1238 may be robust to observations where θ = 0. First, we tested the effect of window size, calculating 1239 θ over longer windows reduced the number of windows where θ = 0. Increasing window size 1240 slightly improves the amount of variation explained by the model (Figure S5). Thus, results are 1241 robust the to different window sizes.

1242
Then we tested whether the results were robust to different models. Data that can take zero or

1246
We used a conventional Tobit regression and robust Tobit regression, for both cases the results

1247
were very similar to an ordinary regression model ( Figure S6). Then, we tested a log-normal RIP that was independent of genetic effects. 1294 We calculated the mutation rate per meiosis for the euchromatic regions of the genome using a 1295 multilevel model with cross as a random factor. The model was where y i is the number of mutations in euchromatic regions in the ith tetrad,ᾱ is the average inter-1297 cept, α c is deviation from average intercept for each cross, and σ c is the cross standard deviation.

1298
Prior for σ c was the half-location scale version of Student's t-distribution, with 3 degrees of free-1299 dom, location 0, and scale 10. Based on posterior predictive checks, this model fitted the data.

1300
Mutation rate was calculated from posterior distribution ofᾱ as where N is the number of called nucleotides, and n t is the number of tetrads.
where y i is the number of mutations in the H3K9me3 domains in the ith tetrad, ϕ is the dispersion 1312 parameter, and other parameters were same as above. The prior for ϕ was a gamma distribution in mutation rate in the H3K9me3 regions, mutation rate those regions seems higher. 1320 We examined the spectrum of mutations that occurred in the euchromatic and the H3K9me3 re-1321 gions separately, in the same way we did for asexual mutations. We observed that in the H3K9me3

1322
regions there was a substantial over-representation of C:G → T:A transitions due to the action of 1323 RIP ( Figure S11C). However, the mutation spectra that occurred in euchromatic regions was much   Figure S2: H3K9me3 / euchromatin mutation rate ratio in the simulated data. Estimates calculated from called mutations from the simulated reads are in red, and estimates calculated using the true simulated mutations are in blue. The two different scenarios are: mutation rate was higher in regions of the genome marked by H3K9me3, and mutation rate was uniform across the genome.
Points are means and range shows the 95% HPD interval of the ratio. Interval estimates overlap in both scenarios.     Mutation rate relative to euchromatin Figure S10: Mutation rates for indels in the different domains relative to euchromatin. The dashed line shows the relative mutation rate of one. Facets show deletions and insertions for all indels, for indels that did not occur in repeats, and for indels that occurred in repeated sequences. Estimates are medians and ranges show 95% HPD intervals. Intervals that do not overlap with one, that is, those where mutation rate is higher than in euchromatin are colored red.  Figure S12: Ratios of the relative mutation rates during meiosis over mitosis. Points are medians and ranges show 95% HPD interval of the ratios. If the interval estimate is higher than one, mutation rate in meiosis is higher, if the interval estimate is lower than one, mutation rate in mitosis is higher.       Genotype quality of the mutation is 45 as the mutation is located in region with reduced mapping quality. Some reads that do not support the mutation map to this location. However, those reads also have other changes that are not supported by other reads. This suggest that reads not supporting the mutation are mapping errors. This mutation was confirmed by Sanger sequencing.   Table S2: Model comparisons among different models that predict the mutation rate by GC-content and chromatin modifications. Model terms are different linear model parts, α is the intercept, β GC is the slope effect of GC-content, β K9 is the effect of H3K9 domain, β K27 is the effect of H3K27 domain, β C is the effect of centromeric domain, β I is the interaction effect between GC-content and H3K9 domain, β I2 is the interaction effect between GC-content and centromeric domain, β I3 is the interaction effect between GC-content and H3K27 domain. d i , g i , and c i are indicator variables, and x i is GC-content in percentage points. WAIC = widely applicable information criterion, SE = standard error. Model terms WAIC diff (± SE) weight