Nascent RNA folding mitigates transcription-associated mutagenesis

  1. Jianzhi Zhang
  1. Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan 48109, USA
  1. Corresponding author: jianzhi{at}umich.edu

Abstract

Transcription is mutagenic, in part because the R-loop formed by the binding of the nascent RNA with its DNA template exposes the nontemplate DNA strand to mutagens and primes unscheduled error-prone DNA synthesis. We hypothesize that strong folding of nascent RNA weakens R-loops and hence decreases mutagenesis. By a yeast forward mutation assay, we show that strengthening RNA folding and reducing R-loop formation by synonymous changes in a reporter gene can lower mutation rate by >80%. This effect is diminished after the overexpression of the gene encoding RNase H1 that degrades the RNA in a DNA–RNA hybrid, indicating that the effect is R-loop-dependent. Analysis of genomic data of yeast mutation accumulation lines and human neutral polymorphisms confirms the generality of these findings. This mechanism for local protection of genome integrity is of special importance to highly expressed genes because of their frequent transcription and strong RNA folding, the latter also improves translational fidelity. As a result, strengthening RNA folding simultaneously curtails genotypic and phenotypic mutations.

Because mutation is the ultimate source of genetic variation and evolution, measuring the mutation rate and understanding various mutational mechanisms are of vital importance (Lynch 2010a; Hodgkinson and Eyre-Walker 2011). During transcription, nascent RNA has the potential to anneal back to its template DNA after they both exit the RNA polymerase (RNAP), creating a structure known as the R-loop, consisting of a stable RNA–DNA hybrid and a single-stranded DNA (Fig. 1A). Because the single-stranded DNA is naked, it is subject to increased mutagenesis induced by mutagens. R-loops can also prime unscheduled error-prone DNA synthesis (Aguilera and García-Muse 2012). These and other mechanisms cause transcription-associated mutagenesis (TAM) (Aguilera and García-Muse 2012; Kim and Jinks-Robertson 2012). Although genomic lesions can also be repaired by transcription-coupled repair (TCR) (Hanawalt and Spivak 2008), genome-wide analyses in the bacteria Escherichia coli and Salmonella typhimurium, the budding yeast Saccharomyces cerevisiae, and the human germline have demonstrated that the mutation rate of a gene tends to increase with its expression level, likely because TAM exceeds TCR (Lind and Andersson 2008; Park et al. 2012; Chen and Zhang 2013, 2014).

Figure 1.

A schematic diagram of the hypothesis that nascent RNA folding mitigates transcription-associated mutagenesis. (A) With weaker RNA folding, an R-loop accumulates, which increases the exposure time of the naked nontemplate DNA and error-prone DNA synthesis, leading to a higher mutation rate. (B) With stronger RNA folding, the R-loop is dissolved, which reduces the exposure time of the naked nontemplate DNA and error-prone DNA synthesis, resulting in a lowered mutation rate.

Because for a given sequence, RNA–RNA duplexes are energetically generally more stable than RNA–DNA hybrids (Lesnik and Freier 1995), it is possible for the nascent RNA to fold on itself, allowing the template DNA strand to anneal back with the nontemplate DNA strand, effectively dissolving the R-loop (Fig. 1B). Indeed, under thermodynamic equilibrium, the greater the RNA folding strength, the more likely that the R-loop is dissolved (Supplemental Fig. S1). Because dissolving the R-loop should reduce TAM, we hypothesize that increased nascent RNA folding decreases the mutation rate of a transcribed region (Fig. 1). We first provide experimental evidence for this hypothesis using a yeast forward mutation assay. We then demonstrate the generality of this hypothesis by genomic analyses of yeast mutation accumulation strains and human intronic DNA polymorphisms. Finally, we discuss the biological implications of this finding.

Results

Yeast forward mutation assay shows that nascent RNA folding mitigates mutagenesis

To test the hypothesis that increased nascent RNA folding reduces mutagenesis, we used the yeast CAN1 forward mutation assay (Lippert et al. 2011; Takahashi et al. 2011). In media containing the toxic arginine analog canavanine, having a functional CAN1 gene, which codes for arginine permease, is lethal; whereas CAN1 null mutants are viable. We synthesized two modified versions of CAN1 by altering 5% of synonymous sites of the wild-type CAN1 gene, one with increased and the other with reduced RNA folding, relative to the wild-type (Supplemental Table S1). To our knowledge, no existing experimental method can probe nascent RNA folding in vivo. We thus resorted to computational prediction. Specifically, we quantified the nascent RNA folding strength (FRNA) by the negative of the minimum free energy of the folded structure, estimated computationally by a sliding window approach with various window sizes (see Methods); the higher the FRNA value, the stronger the folding. The wild-type CAN1 and the two modified versions span a large range of FRNA of all yeast genes (Fig. 2A; Supplemental Fig. S2). The three CAN1 versions were each coupled with one of two promoters: the endogenous CAN1 promoter (pCAN) and a galactose-regulated GAL1 promoter (pGAL); the latter becomes constitutive and substantially stronger than the former in strains lacking the Gal80 repressor of pGAL (Lippert et al. 2011). Hereinafter, the pCAN-CAN1 gal80 strains and pGAL-CAN1 gal80 strains are referred to as low- and high-transcription strains, respectively.

Figure 2.

Predicted nascent RNA folding and measured R-loop signals for the three versions of CAN1. (A) Frequency distribution of the RNA folding strength (FRNA) of all yeast genes. The three versions of CAN1, with weak, intermediate (wild-type), and strong FRNA values, respectively, are indicated by arrows. FRNA is computationally predicted using sliding windows of 26 nt and then standardized to a per site value. Computational predictions based on other window sizes are shown in Supplemental Figure S2. (B) Experimentally determined R-loop signals, relative to that of ACT1, for the weak and strong FRNA versions of CAN1 in two probed segments. Error bars indicate standard error. P-values are based on two-tailed t-test.

To examine whether the weak and strong FRNA versions of CAN1 have different probabilities of R-loop formation, we used DNA:RNA immunoprecipitation (DRIP) followed by quantitative polymerase chain reaction (PCR). Taking advantage of the high specificity and affinity of the S9.6 monoclonal antibody toward DNA–RNA hybrids of various lengths, DRIP efficiently purifies R-loops. Quantitative PCR is then used to measure the DNA component in the R-loops formed in specific regions of the genome. We probed two nonoverlapping segments of CAN1 with 91 and 112 nt, respectively (Supplemental Table S1). For each of these segments, the strong FRNA version of CAN1 has stronger predicted nascent RNA folding than the weak FRNA version. We found that, in both segments, the relative R-loop concentration is significantly lower for the strong FRNA version than the weak FRNA version (P = 0.002) (Fig. 2B), as hypothesized (Fig. 1) and computationally predicted (Supplemental Fig. S1).

To compare the mutation rates among the three versions of CAN1 requires that they have similar mutational target sizes. We confirmed this by estimating the relative probability that a random point mutation is a missense (Supplemental Fig. S3A) or nonsense (Supplemental Fig. S3B) in each version. We also confirmed that these versions have similar numbers of AT/TA and TC/CT dinucleotide repeats (Supplemental Fig. S3C), which are major deletion hotspots (Lippert et al. 2011). Sequencing of canavanine-resistant (CANR) mutants detected no significant difference among the three versions of CAN1 in the fraction of mutations that are insertions/deletions (Supplemental Fig. S4).

We estimated the null mutation frequency of CAN1 by quantifying the fraction of CANR mutants in a cell population after three generations of growth in a nonselective medium, followed by a correction for potential false positives (see Methods). Among the low-transcription strains, the mutation frequency of the weak FRNA strain and that of the strong FRNA strain are 22% higher (P = 9 × 10−3, Mann-Whitney U test) and 17% lower (P = 0.02), respectively, compared with that of the wild-type strain, which has the intermediate FRNA (Fig. 3A). These mutation frequency differences are in the direction predicted by our hypothesis. Among the high-transcription strains, the mutation frequency of the weak FRNA strain and that of the strong FRNA strain are 36% higher (P = 6 × 10−5) and 76% lower (P = 6 × 10−14), respectively, compared with that of the wild-type strain (Fig. 3B). Thus, the strong FRNA version has a mutation frequency that is only 18% of that of the weak FRNA version (P = 6 × 10−14) (Fig. 3B). We estimated that the mutation frequency of each CAN1 version is 19–72 times higher in the high-transcription strain than in the low-transcription strain (see Methods), comparable to previous reports (Lippert et al. 2011; Takahashi et al. 2011). Using quantitative reverse transcription PCR (RT-PCR) (see Methods), we confirmed that, for each CAN1 version, the expression level in the high-transcription strain is 36–50 times that in the low-expression strain (Supplemental Fig. S5), as expected (Takahashi et al. 2011). Interestingly, for the three strains with the same promoter of either pCAN or pGAL, we observed a monotonic increase in expression level with FRNA (Supplemental Fig. S5). Because transcription is mutagenic, these expression differences render our estimate of the impact of FRNA on mutation frequency conservative. They also make the estimate of the difference in R-loop signal between weak and strong FRNA versions (Fig. 2B) conservative.

Figure 3.

Null mutation frequency at CAN1 decreases with its nascent RNA folding strength (FRNA). The mutation frequency of a strain is presented relative to that of the strain with the same promoter and wild-type (i.e., intermediate FRNA) CAN1 without overexpressed RNASEH1 (dotted line). (A) Relative mutation frequencies in low-transcription strains (carrying the promoter pCAN). (B) Relative mutation frequencies in high-transcription strains (carrying the promoter pGAL). (C) Relative mutation frequencies in low-transcription strains with overexpressed RNASEH1. (D) Relative mutation frequencies in high-transcription strains with overexpressed RNASEH1. In each panel, the left y-axis shows the mutation frequency relative to the dotted line, whereas the right y-axis shows the mutation frequency relative to the wild-type CAN1 in the same panel. The bottom and top of each box are the first and third quartiles, and the band inside the box shows the median. The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box edges. Circles show outliers, which lie outside the range shown by the whiskers. P-values are based on Mann-Whitney U tests.

TAM has the distinctive feature of higher frequencies at G/C sites than A/T sites (Lippert et al. 2011; Takahashi et al. 2011). If the above mutation rate difference between the strong and weak FRNA versions of CAN1 is due to TAM rather than other mechanisms, such as different levels of engagement of translesion DNA polymerases (Goodman and Woodgate 2013), the mutation rate per site at G/C sites relative to that at A/T sites (γ) should be higher for the weak FRNA version than the strong FRNA version. Because the exact size of the mutational target for generating CANR is unknown, we chose to focus on G/C and A/T sites where point mutations can be nonsense. We sequenced CANR mutants from the two versions of CAN1 under low transcription and found that γ for the weak FRNA version is more than twice that of the strong FRNA version (P = 0.023, simulation test) (Table 1), supporting the hypothesis that the higher mutation rate of the weak FRNA version relative to the strong FRNA version is owing to different levels of TAM.

Table 1.

Relative point mutation rates at nonsense target sites of low-transcription CAN1 without RNASEH1 overexpression

To verify the role of R-loop in the influence of nascent RNA folding on mutagenesis, we inserted into the yeast genome an RNase H1 gene controlled by a strong promoter (see Methods). RNase H1 hampers R-loop formation by degrading the RNA in an RNA–DNA hybrid (Wahba et al. 2011). Under our hypothesized mechanism of the influence of nascent RNA folding on mutagenesis, adding a highly expressed RNASEH1 should reduce not only the mutation rate but also the impact of nascent RNA folding on mutation rate. Indeed, introducing RNASEH1 decreased the CAN1 mutation frequency in all examined strains. Specifically, in the low CAN1 transcription strains, RNASEH1 reduced the mutation frequency by 20% (P = 2 × 10−4, Mann-Whitney U test), 19% (P = 9 × 10−3), and 14% (P = 0.02) in strains with the weak, intermediate, and strong FRNA, respectively (Fig. 3C). In the high CAN1 transcription strains, RNASEH1 reduced the mutation frequency by 46% (P = 1 × 10−11), 36% (P = 2 × 10−4), and 11% (P = 0.46) for the three versions, respectively (Fig. 3D). Further, after the introduction of RNASEH1, the mutation frequency is no longer significantly different between the weak and intermediate FRNA versions in both low-transcription (P = 0.13) (Fig. 3C) and high-transcription (P = 0.44) (Fig. 3D) strains. Similarly, the mutation frequency is no longer significantly different between the strong and intermediate FRNA versions in low-transcription strains (P = 0.12) (Fig. 3C). In the high-transcription strains, although the mutation frequency difference between the strong and intermediate FRNA versions remains significant (P = 8 × 10−8), the difference has shrunk from 4.2-fold (Fig. 3B) to 2.9-fold (Fig. 3D). Together, these experiments strongly suggest that nascent RNA folding reduces the mutability at the CAN1 locus by dissolving R-loops.

Yeast mutation accumulation genomic data support that nascent RNA folding mitigates mutagenesis

To confirm that the impact of FRNA on mutagenesis is not limited to CAN1, we analyzed a set of yeast mutation accumulation lines derived from a strain deficient in mismatch repair (Fares et al. 2013). However, it is unknown what window size is most relevant for folding nascent RNAs. Transcription by RNAP II is known to be intermittent (Churchman and Weissman 2011), with rapid elongations interrupted by long pauses that play roles in nascent RNA folding (Pan and Sosnick 2006). It is thus appropriate to fold nascent RNAs using windows corresponding to RNA segments between consecutive pauses. However, reliable information on individual pauses is lacking for many genes. As an approximation, we used yeast native elongating transcript sequencing (NET-seq) data (Churchman and Weissman 2011) to estimate the median distance between pauses in each gene and then estimate the median value across all genes, which should be insensitive to the imprecision of the pause data from individual genes. We found the median to be 26 bases (Supplemental Fig. S6A) and thus computationally estimated FRNA using sliding windows of 26 bases (see Methods). In support of our hypothesis, the mutation rate per site for a gene is significantly negatively correlated with the average FRNA of the gene (ρ = −0.047, P = 5 × 10−4) (Fig. 4A), and similar correlations were observed when window sizes of 10–40 bases were used in RNA folding (Supplemental Fig. S7). This correlation remains significant after the control of potential confounding factors such as gene expression level (Park et al. 2012) (partial correlation ρ = −0.043, P = 1 × 10−3) (see Fig. 4B for the comparison among a subset of genes with similar expression levels), nucleosome occupancy (Chen et al. 2012) (partial correlation ρ = −0.051, P = 1 × 10−4) (see also Fig. 4C), and replication timing (Stamatoyannopoulos et al. 2009; Koren et al. 2010; Lang and Murray 2011) (partial correlation ρ = −0.046, P = 6 × 10−4) (see also Fig. 4D).

Figure 4.

Genome-wide correlations among the mutation rate of a gene, its nascent RNA folding strength (FRNA), and R-loop score in yeast. For each gene, mutation rate per site is estimated from a set of mutation accumulation lines. (A) Mutation rate decreases with FRNA. (B) Mutation rate decreases with FRNA for the subset of genes whose expression levels are between 0.95 and 1.05 times the mean expression level of all genes. (C) Mutation rate decreases with FRNA for the subset of genes whose nucleosome occupancy levels are between 0.95 and 1.05 times the mean nucleosome occupancy level of all genes. (D) Mutation rate decreases with FRNA for the subset of genes whose replication timings are between −0.2 and 0.2 (in a standard normal distribution). (E) The among-gene average of the within-gene partial Pearson's correlation between FRNA and R-loop score after controlling for GC content is significantly more negative than the random expectation. The arrow indicates the actual observation, whereas the bars show the frequency distribution of the corresponding value derived from 1000 sets of genes with FRNA values randomly shuffled within genes. (F) Mutation rate increases with R-loop score for the subset of genes whose expression levels are between 0.95 and 1.05 times the mean expression level of all genes. In AD, dots from left to right, respectively, contain the 10%, 20%, 30%, …, and 100% of genes with the lowest FRNA values. In F, dots from left to right, respectively, contain the 10%, 20%, 30%, …, and 100% of genes with the lowest R-loop scores. In all panels, error bars show one standard error.

To verify that the genome-wide signal of the impact of FRNA on mutation rate is mediated by R-loops, we analyzed a microarray-based data set of RNA–DNA hybrid propensity (Chan et al. 2014). Specifically, for each gene, we calculated partial Pearson's correlation between the RNA–DNA hybrid signal (i.e., R-loop score) of a probe and the RNA folding strength of the probe (FRNA) among all probes, after controlling the GC content of the probe, a known confounding factor for microarray intensity signal (Xia 2010). We also calculated the same partial correlation after randomly shuffling the FRNA values of all probes within a gene. We found that the mean correlation for all genes was more negative from the actual data than from the shuffled data in each of 1000 sets of random shuffling (P < 10−3) (Fig. 4E), supporting our hypothesis that strong nascent RNA folding weakens R-loop formation.

To verify the mutagenic effect of R-loops at the genomic scale, we obtained the R-loop score for a gene by the intensity of each probe set (see Methods). Because gene expression level affects the microarray-based R-loop score, we correlated the R-loop score of a gene with its mutation rate for a set of genes whose expression levels are within 0.95 and 1.05 times the mean expression level of all genes. We found this correlation (ρ = 0.119) to be significantly positive (P = 0.02) (Fig. 4F), confirming the mutagenic effect of R-loops. This result was further validated (ρ = 0.070, P = 1 × 10−7) using a recently published data set of yeast R-loops based on RNase H targets (El Hage et al. 2014). Together, these analyses in yeast provide genome-wide evidence for the role of nascent RNA folding in reducing mutagenesis by weakening R-loops.

Human population genomic data support that nascent RNA folding mitigates mutagenesis

To examine if the effect of FRNA on mutation rate exists in multicellular organisms, especially humans, we studied human intronic single nucleotide polymorphisms (SNPs) in HapMap (The International HapMap 3 Consortium et al. 2010). We analyzed 542,575 SNPs in 7852 genes that passed various filters for data quality and selective neutrality (see Methods). For each SNP, two non-SNP flanking sites in the same intron were used for comparison, each randomly picked from each side of the SNP with a distance from the SNP between 100 and 200 nt. Because mutation rate is expected to be higher at SNP sites than at non-SNP control sites under the assumption that intron mutations are neutral, our hypothesis predicts that, within a gene, SNP sites have lower FRNA than non-SNP control sites. FRNA at each nucleotide was calculated with 48-base sliding windows, because human global run-on sequencing (GRO-seq) data (Core et al. 2008) showed a median distance of 48 bases between pauses of RNAP II in introns (Supplemental Fig. S6B). We found 1159 genes to support our prediction at the nominal P-value of 0.05 (Mann-Whitney U test), compared to 252 genes that are against our prediction; the former is significantly greater than the latter (P < 10−99, binomial test). To further evaluate the relationship between human nascent RNA folding and mutation rate, we constructed a 2 × 2 table for each gene by respectively classifying its SNPs and non-SNP control sites into two categories based on whether their FRNA values are higher or lower than the mean FRNA of all SNPs and non-SNP control sites of the gene that was analyzed. We then calculated an odds ratio (OR1) from the table (see Methods); a gene is supportive of our hypothesis if its OR1 is lower than 1. We combined the OR1s from all genes using the Mantel-Haenszel (MH) procedure and found the overall OR1 to be significantly smaller than 1 (P < 10−99) (Fig. 5A), supporting that strong nascent RNA folding reduces human germline mutation rate. A similar trend was observed when G/C and A/T sites were analyzed separately (Fig. 5A). Because mutation rates were compared between intronic SNPs and their flanking control sites, other potentially confounding factors such as the replication timing and expression level were automatically controlled.

Figure 5.

Human population genomic data of introns reveal the negative impact of nascent RNA folding strength (FRNA) on R-loop formation and mutation rate. SNPs are considered to have higher mutation rates than non-SNP control sites. (A) Evidence for a negative impact of FRNA on mutation rates at all sites, A/T sites, and G/C sites. Odds ratio <1 indicates that SNP sites have lower FRNA than non-SNP control sites within a gene. Shown are the results combined from all genes by the MH procedure. Error bars represent 95% confidence intervals estimated by bootstrapping the genes 1000 times. The dotted line indicates an odds ratio of 1. (B) The among-gene average of the within-gene Pearson's correlation between FRNA and R-loop score is significantly more negative than the random expectation. The arrow indicates the actual observation, whereas the bars show the frequency distribution of the corresponding value derived from 1000 sets of genes with FRNA values randomly shuffled within genes. (C) Evidence for a positive impact of R-loop on mutation rate. Odds ratio >1 indicates that SNP sites have higher R-loop scores than non-SNP control sites within a gene. Shown are the results combined from all genes by the MH procedure. Error bars represent 95% confidence intervals, estimated by bootstrapping the genes 1000 times. The dotted line indicates an odds ratio of 1.

To confirm that the impact of nascent RNA folding on human mutation rate is R-loop-dependent, we estimated human R-loop scores using DNA–RNA immunoprecipitation sequencing data (Ginno et al. 2013). We correlated FRNA with R-loop score across intronic SNPs and non-SNP control sites in each gene. The same correlation was also calculated after we randomly shuffled FRNA among all these sites within the gene. The mean correlation of all genes from the actual data is significantly more negative than the expectation derived from 1000 sets of randomly shuffled data (P = 0.037) (Fig. 5B), supporting that strong nascent RNA folding weakens R-loops.

To verify that weakening R-loops decreases mutation rate, we constructed a 2 × 2 table for each gene by respectively classifying the intronic SNPs and non-SNP control sites based on whether their R-loop scores exceed the mean value of all of these sites. An odds ratio (OR2) is then calculated (see Methods); the higher the OR2 relative to 1, the stronger the evidence for our hypothesis. Indeed, the combined OR2 from all genes significantly exceeds 1 (P = 3 × 10−3) (Fig. 5C), supporting that weakening R-loops reduces mutagenesis. Note that, due to the lack of R-loop data from the human germline, we used the R-loop data from a pluripotent human embryonal carcinoma cell line (Ntera2). However, the fact that a significant correlation still exists between this R-loop signal and the mutation rate estimated from polymorphisms suggests that the true correlation between mutation rate and germline R-loop signal is likely stronger.

Discussion

Taken together, our forward mutation assay in CAN1 and genome-wide analyses of yeast mutation accumulation lines and human intronic SNPs provide consistent evidence that strong nascent RNA folding during transcription reduces the mutation rate of the transcribed region. Thus, the mutagenesis of DNA is modulated locally, not only by the biochemical activities of DNA such as replication and transcription, but also those of its transcriptional product. Whether RNA-binding proteins also play a role in this process remains to be studied. Note that although strong nascent RNA folding would predict strong folding of the nontemplate ssDNA, our findings are not attributable to ssDNA folding because a previous study did not find ssDNA folding to reduce TAM in yeast (Park et al. 2012). Because of the difficulty in determining nascent RNA folding experimentally, we resorted to computational predictions with fixed window sizes, which are at best moderately accurate (Park et al. 2013). This inaccuracy explains at least in part why many of the genomic correlations detected are small in magnitude and suggests that the true impact of FRNA on mutation rate is greater, as demonstrated in the CAN1 case.

When comparing the nascent RNA folding strength among genes, we discovered that highly expressed genes tend to have strong folding in both yeast (ρ = 0.167, P = 1 × 10−36) and human (ρ = 0.261, P < 10−99). This correlation could have resulted from differential natural selection for minimized mutational load, because highly expressed genes are functionally more important (Zhang and He 2005) and subject to severer TAM (Park et al. 2012) than lowly expressed genes. Nonetheless, selection for reduced mutational load is extremely weak (Chen and Zhang 2013), and our calculation indicates that the selective advantage of a single-nucleotide change that reduces the mutation rate at a few sites by enhancing local nascent RNA folding is one to several orders of magnitude smaller than what natural selection can detect in yeast and human (see Methods). Strong nascent RNA folding is known to alleviate the backtracking of RNAP II during transcription and thus may enhance the speed of RNA synthesis (Zamft et al. 2012), but it may also decrease splicing efficiency (Braberg et al. 2013). Hence, these potential effects do not unambiguously explain why highly expressed genes should have stronger nascent RNA folding. Interestingly, it was recently discovered that strong mRNA folding is selected for in highly expressed genes (Zur and Tuller 2012; Park et al. 2013), likely because it enhances translational accuracy (Yang et al. 2014). Having a high translational accuracy could reduce the harm arising from translational error-associated protein misfolding, misinteraction, and energy waste (Drummond and Wilke 2008; Yang et al. 2012, 2014; Zhang and Yang 2015) and is of special importance to abundantly produced proteins simply due to their large amounts. Calculations suggested that differential selection for translational accuracy can lead to stronger mRNA folding for more highly expressed genes (Yang et al. 2014). Because mRNA folding strength and nascent RNA folding strength are positively correlated (yeast: ρ = 0.549, P < 10−99; human: ρ = 0.220, P < 10−99), it is likely that selection for translational accuracy results in enhanced mRNA folding, of which strengthened nascent RNA folding is a byproduct. Although the relatively strong nascent RNA folding of highly expressed genes may have been fortuitous, this property does enlarge the benefit of nascent RNA folding in lowering the mutational load. To the best of our knowledge, this is the first example in which a demand for accurate processing of genetic information (i.e., translational fidelity) also results in accurate transmission of the information (i.e., low mutation rate); and RNA folding is the first known mechanism that simultaneously and concordantly regulates the genotypic mutation rate and phenotypic mutation rate (Bürger et al. 2006). The full biological ramifications of the intriguing coupling between the processing and transmission accuracies of genetic information via RNA folding await further explorations.

Methods

Computational prediction of nascent RNA folding strength

The secondary structures of yeast RNAs were predicted using RNAfold (http://www.tbi.univie.ac.at/~ivo/RNA/) (Hofacker 2009) under 30°C, following a previous study (Park et al. 2013) with minor modifications. Our prediction used overlapping windows with a window size of L nucleotides and a step size of S = 1 nucleotide (Lange et al. 2012). The folding strength of a window is defined by −ΔG, the negative of the minimum free energy of the folded RNA. The folding strength for a nucleotide is calculated by the mean folding strength of all windows covering the nucleotide divided by L. The folding strength of an RNA is the mean folding strength of all nucleotides of the RNA. In the yeast RNA–DNA hybrid microarray, the folding strength of a probe is simply the negative of the minimum free energy of the folded 25-nt RNA sequence for the probe divided by 25. The secondary structures of human RNAs were predicted in the same way as yeast RNAs, except that a different L was used under 37°C. To predict nascent RNA folding, we used L = 26 for yeast, 48 for human introns, and 21 for human exons, unless otherwise noted. They represent the genome-wide median distances between RNAP II pauses in yeast, human introns, and human exons, respectively.

Preference of RNA folding over R-loop formation

To quantify the thermodynamic preference of nascent RNA folding over R-loop formation during transcription, we used a sliding window of L = 26 nt and S = 1 nt to scan each of the yeast intronless genes. Within each window, we used RNAfold (Hofacker 2009) to predict the minimum free energies of RNA folding (ΔG1) (Mathews et al. 2004), DNA duplex (ΔG2) (SantaLucia and Hicks 2004; Turner and Mathews 2010), RNA/DNA hybrid duplex (ΔG3) (Lorenz et al. 2012), and folding of the transcribed (nontemplate) ssDNA (ΔG4) (SantaLucia and Hicks 2004; Turner and Mathews 2010) under 30°C. The preference of RNA folding over R-loop is measured by ΔG3 + ΔG4 − ΔG1 − ΔG2, which is presented as per-site minimum free energy along with FRNA (= −ΔG1) in Supplemental Figure S1.

Modified versions of CAN1 and strain construction

We changed 5% of synonymous sites in the coding region of the wild-type CAN1 to create a strong FRNA version and a weak FRNA version of CAN1, respectively. The details of the design of these two sequences are provided in Supplemental Methods. The haploid S. cerevisiae strain BY4741 was used for CAN1 forward mutation assays. See Supplemental Methods for experimental details of strain construction.

DRIP followed by quantitative PCR

The DRIP experiment was performed following a published protocol (El Hage et al. 2014) with minor modifications. Quantitative PCR was performed on a 7500 Fast Real-Time PCR System (Applied Biosystems). The experimental protocol is detailed in Supplemental Methods.

Forward mutation assay

The assay follows the standard protocol (Lippert et al. 2011; Takahashi et al. 2011) with modifications. Experimental details are provided in Supplemental Methods.

Sequencing CANR colonies

Seven CANR colonies from each of three replicated cultures of each strain were isolated and directly amplified by PCR (Ex Taq, Takara). The amplified fragments were Sanger sequenced. Primer sequences are shown in Supplemental Table S2. Identical mutations observed more than once in a single culture were counted only once to avoid multiple counting of the same mutation when estimating the fraction of mutations that are insertions/deletions (indels). False positives are colonies in which the CAN1 sequence bears no mutation. We found no false positives among low-transcription colonies, but some among high-transcription colonies. The higher false positive rate in high-transcription strains was probably because these strains had a high concentration of CAN1 (see the next section) and thus accumulated a high concentration of cellular arginine before being transferred to the selective medium, where some cells with the functional CAN1 may still be able to grow into colonies owing to the availability of a large amount of cellular arginine (Gong 2008). The false positives do not affect our conclusion, because our conclusion is based on comparisons among strains carrying the same promoter. Given our observation that the expression level of CAN1 under the control of the same promoter increases slightly with FRNA (see the next section), our assumption of the same true positive rate across the three strains carrying pGAL probably causes an overestimation of the mutation frequency in the strain with strong FRNA, rendering our conclusion of the negative impact of nascent RNA folding on mutagenesis conservative. The absence of false positive colonies of low-transcription CAN1 strains verified the presumption that even low transcription of wild-type CAN1 is lethal.

To better estimate the fraction of mutations that are indels in the three FRNA versions of CAN1 in high-transcription strains, strains were grown in YPGE for seven generations to increase the number of mutations. Seven CANR colonies from each of three replicated cultures of each strain were isolated, directly amplified by PCR as above, and Sanger sequenced.

Quantifying CAN1 expression level by quantitative RT-PCR

The expression level of CAN1 was measured using quantitative RT-PCR, with experimental details provided in Supplemental Methods.

Relative mutation rates at G/C and A/T sites

To compare the mutation rate at G/C sites relative to that at A/T sites between the strong and weak FRNA versions of CAN1, we used low-transcription strains (due to their zero false positive rates and relatively low fractions of mutations that are indels) without overexpressing RNASEH1. Three CANR colonies from each of 112 replicated cultures of each strain were isolated, directly amplified by PCR as described above, and Sanger sequenced. Only nonsense point mutations were considered. Mutational target size for G/C sites (or A/T sites) was the number of G/C sites (or A/T sites) where a point mutation could be nonsense. We respectively calculated the mutability at G/C sites (rG/C) and A/T sites (rA/T) for each CAN1 version, and then computed their ratio γ = rG/C:rA/T.

To test the null hypothesis that the ratio in γ between the weak and strong versions of CAN1 is equal to or smaller than 1, we conducted a computer simulation. Let the mutational target size at G/C sites and A/T sites be a and b for the strong FRNA version, and c and d for the weak FRNA version, respectively. We first generated a binomial random number x following B(a, rG/C) and another random number y following B(b, rA/T), where rG/C and rA/T are both from the strong FRNA version. We then generated two binomial random numbers, z and w, following B(c, rG/C) and B(d, rA/T), respectively, where rG/C and rA/T are from the weak FRNA version. We then calculated g = [(z/c)/(w/d)]/[(x/a)/(y/b)] = (yzad)/(xwbc). We repeated this process 10,000 times and calculated the fraction of times (P) in which g ≤ 1. P is the probability that the null hypothesis is true.

Mutation rates in yeast mutation accumulation (MA) lines

Fares and colleagues accumulated mutations in several lines of a mismatch repair deficient S. cerevisiae strain (BY4741; Mata; his3D1; leu2D0; met15D0; ura3D0; msh2::kanMX4) for approximately 2200 generations by 100 plate-to-plate passages of single colonies (Fares et al. 2013). In the final generation, a total of 1003 base substitutions were detected in the genomes, of which 691 affected protein-coding regions. The identified mutations in the MA lines were previously mapped to the S. cerevisiae genome in Ensembl version 59 (Cunningham et al. 2015) by Fares and colleagues (Fares et al. 2013). The upstream 17 nt, the mutation itself, and the downstream 17 nt were extracted from Ensembl and remapped to the S288C genome at Saccharomyces Genome Database (SGD, February 2011) using SOAP (Li et al. 2008) with perfect matches. The mutation rate of a gene was estimated by the total number of mutations identified in the gene divided by the gene length.

Mutation rates in humans

We used HapMap release 27 (The International HapMap 3 Consortium et al. 2010), corresponding to Ensembl human genome version 54, to identify intronic SNPs. To ensure the data quality and selective neutrality of the genomic regions considered, we applied the following filters: First, SNPs should be in an intron sandwiched by two constitutive exons (based on the presence of the exons in all transcripts of the gene annotated in Ensembl), and should not be in any overlapping region of multiple genes; second, SNPs should not be within 200 nt from any other SNP; third, SNPs should not be within any 33-nt sequence whose genomic origin cannot be unambiguously determined by SOAP (Li et al. 2008) (i.e., repeats); and fourth, genes with less than 10 qualified SNPs were not considered. In the end, 7852 genes with 542,575 SNPs passed the preceding filters and were used in the analysis.

Distances between pauses of RNAP II

We used yeast native elongating transcript (NET-seq) data (Churchman and Weissman 2011) and human global run-on sequencing (GRO-seq) data (Core et al. 2008) to infer distances between pauses of RNAP II. Details of the computational analysis are provided in Supplemental Methods.

R-loop scores

Yeast R-loop scores were estimated using DNA–RNA immunoprecipitation tiling microarray (DRIP-chip) data (Chan et al. 2014) and R-loop data based on RNase H targets (El Hage et al. 2014). Human R-loop scores were estimated using human DNA–RNA immunoprecipitation sequencing (DRIP-seq) data (Ginno et al. 2013). Detailed computational analysis is provided in Supplemental Methods.

Nucleosome occupancy

We used the DNA micrococcal-nuclease-digested sequencing (MNase-seq) data (Weiner et al. 2010) to estimate nucleosome occupancy at each nucleotide in yeast. The protocol of our computational analysis is provided in Supplemental Methods.

Replication timing

DNA replication timing data from yeast (Koren et al. 2010) were analyzed. Details of the computational analysis are provided in Supplemental Methods.

Experimental mRNA folding strength data

The in vitro mRNA folding strength data (Kertesz et al. 2010) generated from S. cerevisiae strain S288C in YPD at 30°C were downloaded from NCBI (accession numbers: SRR066400–SRR066405, SRR066398–SRR066399, and SRR063372–SRR063374). The procedure of genome masking and short read alignment was the same as in the analysis of the NET-seq data. After a read is mapped to a transcript, the site in the transcript that is 1 nt upstream of the site aligned to the 5′ most nucleotide of the read is considered to have received a hit. Let the number of hits received by a nucleotide under RNase V1 treatment be NV and the corresponding number under RNase S1 treatment be NS. According to the original report (Kertesz et al. 2010), the mRNA folding strength of a site was defined by PARS = log2(NV/NS). The higher the PARS value, the stronger the mRNA folding for the site. The mRNA folding strength of a gene was defined by the mean PARS of all of its nucleotides. Overlapping regions between multiple genes were excluded.

Human in vitro mRNA folding data (Wan et al. 2014) generated from GM12878, GM12891, and GM12892 cell lines were download from NCBI (accession number: GSE50676). Let the total number of reads from these cell lines mapped to a nucleotide be NV and NS under RNase V1 and RNase S1 treatments, respectively. The mRNA folding strength of a site was defined by PARS = (NV + 1)/(NS + 1), because some sites have NS = 0. The site was ignored if NV + NS = 0. The overall folding strength for the mRNA is log2(mean PARS), where mean PARS is the average PARS of all considered sites. In each gene, only the longest transcript was considered.

To examine the relationship between mRNA folding strength and nascent RNA folding strength, we correlated mRNA folding strength with FRNA for yeast genes as well as human exons.

Yeast and human transcriptome data

Yeast RNA sequencing (RNA-seq) data (Nagalakshmi et al. 2008) were generated from S. cerevisiae strain BY4741 in YPD at 30°C. Human RNA-seq data (Mortazavi et al. 2008) were generated from the testis. In these two RNA-seq data sets, gene expression is measured in reads per kilobase of exon model per million mapped reads (RPKM).

Odds ratios

We defined and calculated two odds ratios. To calculate OR1, a 2 × 2 table was constructed for each gene by respectively classifying its SNPs and non-SNP control sites into two categories based on whether their FRNA values are higher or lower than the mean FRNA of all intronic SNPs and non-SNP control sites of the gene. Let the numbers of sites that fall into the four categories be: a1 (SNPs with FRNA > mean); b1 (SNPs with FRNA ≤ mean); c1 (non-SNP control sites with FRNA > mean); and d1 (non-SNP control sites with FRNA ≤ mean), respectively. OR1 = (a1d1)/(b1c1); thus, OR1 < 1 if strong nascent RNA folding reduces mutation rate.

To calculate OR2, a 2 × 2 table was constructed for each gene by respectively classifying its SNPs and non-SNP control sites into two categories based on whether their R-loop scores are higher or lower than the mean R-loop score of all SNPs and non-SNP control sites of the gene. Let the numbers of sites that fall in the four categories be: a2 (SNPs with R-loop scores > mean); b2 (SNPs with R-loop scores ≤ mean); c2 (non-SNP control sites with R-loop scores > mean); and d2 (non-SNP control sites with R-loop scores ≤mean), respectively. OR2 = (a2d2)/(b2c2); thus, OR2 > 1 if weak R-loops decrease mutation rate.

Selective advantage of nascent RNA folding for reducing mutational load

Because nascent RNA folding affects the mutation rate only locally, it is reasonable to assume no recombination between an antimutator and its target (i.e., sites where the mutation rate is reduced). Under no recombination, the fitness advantage (k) conferred by an antimutator approximates the reduction in deleterious mutation rate of its target (Δμd) (Kimura 1967; Lynch 2011). The per-generation mutation rate in yeast is on average 3.3 × 10−10 per nucleotide (Lynch et al. 2008). Assuming that (1) a nucleotide change that increases local nascent RNA folding can reduce the mutation rate of a target of 10 sites by 80%; (2) the fraction of mutation that is deleterious at the target is 75%; and (3) the mean mutation rate at the target is 10 times the genomic average, we can estimate that Δμd = 3.3 × 10−10 × 10 × 10 × 0.75 × 0.8 = 2.0 × 10−8. Note that because the assumptions made here are quite extreme, the actual benefit is likely to be smaller. Because Δμd is much smaller than 10−7, the inverse of the effective population size of yeast (Wagner 2005), the benefit is too small to be detectable by natural selection (Kimura 1983). For humans, the mutation rate is on average 1.3 × 10−8 per nucleotide per generation (Lynch 2010b). Using the same assumptions made for yeast, we can estimate that Δμd = 1.3 × 10−8 × 10 × 10 × 0.75 × 0.8 = 7.8 × 10−7. This tiny benefit is much smaller than 10−4, the inverse of the human effective population size (Takahata 1993). These analyses suggest that it is unlikely for natural selection to enhance nascent RNA folding for the benefit of reducing mutational load. Thus, selection for reduced mutational load is unable to generate differential nascent RNA folding strengths among genes with different expression levels.

Data access

The DNA sequences from this study have been submitted to NCBI GenBank (http://www.ncbi.nlm.nih.gov/genbank/) under accession numbers KT852337KT852338.

Acknowledgments

The RNASEH1 overexpression plasmid was kindly provided by the Douglas Koshland laboratory. We thank Calum Maclean, Wenfeng Qian, and three anonymous reviewers for constructive comments. This work was supported in part by a US National Institutes of Health grant R01GM103232 and a US National Science Foundation grant MCB-1329578 to J.Z.

Footnotes

  • Received May 28, 2015.
  • Accepted October 14, 2015.

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

References

| Table of Contents

Preprint Server