Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Gregor Diensthuber; Leszek P. Pryszcz; Laia Llovera; Morghan C. Lucas; Anna Delgado-Tejedor; Sonia Cruciani; Jean-Yves Roignant; Oguzhan Begik; Eva Maria Novoa

doi:10.1101/gr.278849.123

Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

¹Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain;
²Universitat Pompeu Fabra, Barcelona 08003, Spain;
³Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, 1015 Lausanne, Switzerland;
⁴Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-University Mainz, 55128 Mainz, Germany

↵5 These authors contributed equally to this work.

Corresponding authors: oguzhan.begik{at}crg.eu, eva.novoa{at}crg.eu

Next Section

Abstract

In recent years, nanopore direct RNA sequencing (DRS) became a valuable tool for studying the epitranscriptome, owing to its ability to detect multiple modifications within the same full-length native RNA molecules. Although RNA modifications can be identified in the form of systematic basecalling “errors” in DRS data sets, N6-methyladenosine (m⁶A) modifications produce relatively low “errors” compared with other RNA modifications, limiting the applicability of this approach to m⁶A sites that are modified at high stoichiometries. Here, we demonstrate that the use of alternative RNA basecalling models, trained with fully unmodified sequences, increases the “error” signal of m⁶A, leading to enhanced detection and improved sensitivity even at low stoichiometries. Moreover, we find that high-accuracy alternative RNA basecalling models can show up to 97% median basecalling accuracy, outperforming currently available RNA basecalling models, which show 91% median basecalling accuracy. Notably, the use of high-accuracy basecalling models is accompanied by a significant increase in the number of mapped reads—especially in shorter RNA fractions—and increased basecalling error signatures at pseudouridine (Ψ)- and N1-methylpseudouridine (m¹Ψ)-modified sites. Overall, our work demonstrates that alternative RNA basecalling models can be used to improve the detection of RNA modifications, read mappability, and basecalling accuracy in nanopore DRS data sets.

RNA modifications are chemical groups that decorate RNA molecules and are collectively referred to as the epitranscriptome (Saletore et al. 2012). They play a crucial role in a wide variety of molecular processes, including splicing, translation, and RNA stability (Frye et al. 2016; Roundtree et al. 2017; Gilbert and Nachtergaele 2023). Our understanding of the epitranscriptome has dramatically increased in the past years with the development of novel methods that can chart the modification landscape in a transcriptome-wide fashion (Helm and Motorin 2017; Wan et al. 2022; Lucas and Novoa 2023). High-throughput sequencing–based methods were the first to revolutionize the field of RNA modification detection by providing high-accuracy maps for N6-methyladenosine (m⁶A), 2′-O-methylation (Nm), N4-acetylcytosine (ac⁴C), and pseudouridine (Ψ), among others (Dominissini et al. 2012; Meyer et al. 2012; Carlile et al. 2014; Schwartz et al. 2014; Marchand et al. 2016, 2018; Sas-Chen et al. 2020). More recently, nanopore direct RNA sequencing (DRS) has emerged as a promising alternative for RNA modification detection owing to its ability to directly detect RNA modifications, without the need for prior chemical treatment, in native full-length RNA reads, thus permitting isoform-specific RNA modification analyses (Novoa et al. 2017; Garalde et al. 2018; Begik et al. 2022; Jain et al. 2022).

In nanopore DRS, single-stranded RNA molecules are threaded through protein pores, embedded within a nonconductive membrane, to which a voltage is applied. This causes the RNA molecules to migrate through the pores, leading to changes in the current intensity signal that are distinct for the four canonical nucleobases. More importantly, systematic changes to the current intensity signal and/or downstream signals are also caused by RNA modifications, providing a basis for modification detection with nanopore sequencing (Garalde et al. 2018; Workman et al. 2019). In the last few years, DRS has been successfully applied to detect diverse RNA modification types, including m⁶A (Liu et al. 2019; Lorenz et al. 2020; Gao et al. 2021; Leger et al. 2021; Pratanwanich et al. 2021; Hendra et al. 2022), Ψ (Begik et al. 2021; Huang et al. 2021; Tavakoli et al. 2023), Nm (Begik et al. 2021; Stephenson et al. 2022), N7-methylguanosine (m⁷G) (Smith et al. 2019; Delgado-Tejedor et al. 2023), and inosine (I) (Nguyen et al. 2022), and in a variety of RNA species, such as long noncoding RNAs, viral RNAs, mRNAs, and tRNAs (Liu et al. 2019; Workman et al. 2019; Kim et al. 2020; Leger et al. 2021; Lucas et al. 2024). Identification of modified RNA bases has been typically achieved either (1) by measuring changes in the current signal caused by the presence of modified nucleotides (Stoiber et al. 2017; Leger et al. 2021; Pratanwanich et al. 2021; Acera Mateos et al. 2024) and/or (2) by identifying nonrandom basecalling “errors” that occur at positions where the RNA is modified (Furlan et al. 2021; Jenjaroenpun et al. 2021; Liu et al. 2021; Piechotta et al. 2022).

A central step in nanopore sequencing is the basecalling process, which involves converting the raw electrical signal data generated by the nanopore into a string of nucleotide bases that represent the RNA sequence. To convert the current intensity signal into a nucleotide sequence, the basecalling algorithm requires a basecalling model. These models are typically trained with synthetic and/or in vivo native RNA reads and can predict the four canonical bases (A, C, G, and U in the case of RNA). Current basecalling models, available through the Guppy basecalling software, are not specifically trained to predict RNA modifications. Therefore, when the model encounters a modified RNA nucleotide, it may call the wrong base (mismatch error), not call any base (deletion error), or insert a small piece of sequence (insertion error) at the modified site (Liu et al. 2019; Begik et al. 2021; Jenjaroenpun et al. 2021). Several software exploit this feature and use these basecalling “errors” to identify RNA modifications in DRS data sets (Parker et al. 2020; Jenjaroenpun et al. 2021; Liu et al. 2021; Piechotta et al. 2022; Abebe et al. 2022b). The use of basecalling errors to detect RNA modifications is well suited to detect certain RNA modification types, such as Ψ, which causes a very strong U-to-C mismatch signature at the modified site (Begik et al. 2021; Huang et al. 2021; Tavakoli et al. 2023). In contrast, other RNA modification types, such as m⁶A, yield relatively modest basecalling “error” signatures (Begik et al. 2021), leading to a relatively high number of false positives and false negatives when predicting them in a transcriptome-wide fashion (Liu et al. 2019).

Basecalling “errors” in nanopore sequencing data can arise because of various factors, such as homopolymeric regions, RNA secondary structures, or helicase missteps (Noakes et al. 2019; Shaw et al. 2023; Liu-Wei et al. 2024). However, a major factor affecting the basecalling “error” patterns is the RNA basecalling model of choice and, more specifically, the data used to train this model. For example, the “default” RNA basecalling model (rna_r9.4.1_70bps_hac) used in Guppy was trained with native mRNA sequences from diverse species and thus contains a significant number of m⁶A-modified sites in the reads that were used to train the model. Consequently, this basecalling model learned to predict both “m⁶A” and “A” as “A,” leading to low frequencies in basecalling “errors” when an m⁶A modification is encountered compared with when the model encounters other RNA modification types that were not included in the training data.

Here, we hypothesized that the detection of RNA modifications could be improved at the stage of basecalling by using alternative basecalling models. We trained novel RNA basecalling models and examined their performance in terms of basecalling accuracy (i.e., sequence identity), read mappability, and RNA modification detection ability. In addition to their applicability for RNA modification detection, these alternative “higher-accuracy” basecalling models may be used for enhanced mRNA vaccine quality control as well as for improved SNP detection, identification of splice sites, and transcript annotation when using DRS data.

Previous Section Next Section

Results

RNA basecalling model choice affects both per-read accuracy and RNA modification “errors”

To test whether the use of alternative basecalling models could enhance the detection of RNA modifications using alignment feature-based tools, we trained novel RNA basecalling models using two different approaches (Supplemental Fig. S1A). The first approach consisted in training a model using fully unmodified RNA sequences generated via in vitro transcription (“IVT” model), which we reasoned should lead to increased “error” signatures at modified bases present in mRNAs, compared with the default basecalling model. The second approach consisted in generating a “superaccuracy” RNA basecalling model (“SUP” model) (see Supplemental Table S1), based on the latest CTC-CRF (bonito) DNA model architecture, achieving 96% accuracy for DNA basecalling, which we reasoned should show increased basecalling accuracy for RNA (see Methods) (Supplemental Fig. S1B) and, consequently, improved detection of RNA modifications owing to an improved signal-to-noise ratio. Next, we assessed their effect on both basecalling accuracy and on basecalling “error” patterns, which can be used as proxy for RNA modifications (Fig. 1A).

View larger version:

Download as PowerPoint Slide

Figure 1.

Benchmarking RNA basecalling models and their effect on basecalling accuracy and RNA modification detection. (A) Schematic overview of the in vitro “curlcakes” used in this work, depicting the seven different modification types that were sequenced (m⁶A, m⁵C, hm⁵C, ac⁴C, Ψ, m¹Ψ, and m⁵U) and the three RNA basecalling models benchmarked (default, IVT, and SUP). To assess the performance of tested RNA basecalling models, error signatures (mismatch, deletion, and insertion frequency) were used. (B) Comparison of basecalling accuracies obtained from each RNA basecalling model on human transcriptomic data (see Methods). The curve represents the density of the per-read identities for each model. Solid lines represent Guppy, and dashed lines represent Dorado as basecalling software. We would like to note that a subset of reads from the HEK293T samples used here was also used for training our “SUP” model (see Methods); however, see also Figure 4A, which shows similar accuracy for SUP model in other species/data sets not included in model training. (C) IGV snapshots showing the effect of using either default, IVT, or SUP models on unmodified (UNM) or modified (m⁶A, Ψ, and m¹Ψ) “curlcake” sequences for two positions each (top and bottom row). Asterisks indicate modified sites. Positions at which the mismatch frequency exceeds 0.2 are colored, and deletions are visualized as drop in coverage. (D) Comparison of the summed errors (mismatch, deletion, and, insertion frequency) for all 5-mers with a central modified base for the three tested basecalling models (default, IVT, and SUP). (E) Box plots showing the delta summed error for 5-mers of selected modifications (m⁶A, Ψ, and m¹Ψ) at the central (modified) position. For equivalent plots of the remaining RNA modifications tested (m⁵C, hm⁵C, ac⁴C, and m⁵U), see Supplemental Figure S2B. Statistical analysis was performed using a two-sided nonparametric Wilcoxon test with “default” as the reference group. Results were corrected for multiple-hypothesis testing using the Benjamini–Hochberg procedure to obtain adjusted P-values: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01, (***) P ≤ 0.001, (****) P ≤ 0.0001. The sample size reported by n represents the number of 5-mers contributing to each box plot per panel. For D and E, the box is limited by the lower-quartile Q1 (bottom) and upper-quartile Q3 (top). Whiskers are defined as 1.5 × IQR with outliers represented as individual dots. (F) Comparison of the median delta summed error (=SumErr_MOD − SumErr_UNM) for centrally modified (position 0) 5-mers. A Z-score normalization per modification (see Methods) was performed to highlight the effect that different basecalling models have on the obtained delta summed error. Unscaled results are reported in Supplemental Figure S2B.

We compared the basecalling accuracy of both models on human transcriptomic data, finding that the SUP model was the most accurate out of the three models tested, reaching a median read accuracy of 97% on human cells (Fig. 1B), whereas the default and IVT models only showed 91% and 88% accuracy, respectively. Additionally, we included the most accurate model for the newest Oxford Nanopore Technologies (ONT) basecaller Dorado (default model) provided for RNA002 chemistry. The Dorado model reached a median accuracy of 94%, suggesting that our SUP model using Guppy as basecalling software is currently the most accurate basecaller for data generated with RNA002 chemistry (Fig. 1B). Furthermore, we did not observe any significant differences in runtime or CPU usage when comparing the default model to our custom models (Supplemental Fig. S1C,D).

We then examined how basecalling “error” signatures produced by m⁶A, m⁵C, hm⁵C, ac⁴C, Ψ, m¹Ψ, and m⁵U in synthetic “curlcakes” (Liu et al. 2019) would vary depending on the RNA basecalling model choice (default, IVT, or SUP) (Fig. 1C; see also Supplemental Fig. S1E). Of note, “curlcakes” are synthetic constructs that contain all possible 5-mers, thus allowing the study of the effect in all possible sequence contexts that affect the basecalling process. To this end, we employed per-position mismatch, deletion, and insertion errors, which we consider jointly as summed errors (SumErr; see Methods). Our analyses revealed that SumErr values followed similar trends irrespective of the basecalling model of choice, with pseudouridine modifications (Ψ and m¹Ψ) giving the highest error values and cytidine modifications (m⁵C and hm⁵C) giving the lowest basecalling errors overall (Fig. 1D). Moreover, we observed that SumErr values in unmodified reads were significantly decreased when basecalled with the SUP model (Supplemental Fig. S2A), in agreement with the increased accuracy of the model.

We then assessed the ability of each model to detect RNA modifications in the form of basecalling errors. For this purpose, we computed the delta summed error (ΔSumErr = SumErr_MOD − SumErr_UNM) for each position, model, and modification type, which corrects for the different background error rates of each model on a per-5-mer basis. Our results showed that, for most modifications examined—and especially for m⁵U, Ψ, and m¹Ψ—the ΔSumErr was significantly increased when using the SUP model, relative to default (Fig. 1E,F; see also Supplemental Fig. S2B), indicating the signal-to-noise ratio was increased and thus confirming the hypothesis that the use of alternative basecalling models can improve our ability to detect RNA modifications in DRS data. In contrast, the use of the IVT model did not increase the ΔSumErr for most RNA modifications, with the exception of m⁶A basecalling errors, which were modestly, yet significantly, increased when using the IVT model, compared with the default (Fig. 1E,F), in agreement with our hypothesis.

In addition, we noticed that all RNA modifications, regardless of the basecalling model, showed increased basecalling errors at the modified site (position 0), with the exception of m⁵C and hm⁵C, which showed increased errors at position +1, in agreement with previous reports (Begik et al. 2021). Moreover, we observed that for m⁶A, the increase in “error” using the IVT model compared with the default model was mainly owing to high deletion signatures at the modified site (Supplemental Fig. S2C,D). On the other hand, the increased signal in the SUP model to detect Ψ and m¹Ψ modifications was caused by an increase in the global U > C mismatch frequency in SUP, compared with the default (Supplemental Fig. S2C,D; see also Supplemental Fig. S3). Taken together, these results show that the SUP and IVT basecalling models achieve a stronger signal-to-noise ratio by either changing the error signature (IVT model for m⁶A) or improving previously observed patterns (SUP model for Ψ and m¹Ψ).

IVT model outperforms other models in detecting m⁶A modifications in DRACH contexts

m⁶A is the most prevalent epitranscriptomic mark in mammalian mRNA molecules, with an average of one to three sites per transcript (Dominissini et al. 2012; Meyer et al. 2012). Thus, the use of native RNA molecules as a training data set for a basecaller (as is the case for both default and SUP models) (see Supplemental Fig. S1) will lead to basecalling models learning to interpret both “m⁶A” and “A” as “A,” because the training data contain large amounts of m⁶A-modified residues, whereas its output is limited to the four canonical nucleobases (A, C, G, and U). For this reason, we speculated that the relative basecalling “error” rates in our IVT model should be particularly improved in those sites that naturally contain m⁶A (DRACH, where D denotes A, G, or U; R denotes A and Gl; and H denotes A, C, or U) and are therefore masked in “default.”

To investigate this, we independently examined the performance of the three models in DRACH and non-DRACH motifs. Indeed, we observed that the modest increase in IVT model performance at m⁶A-containing sites (Fig. 1E) was significantly stronger when only DRACH sites were taken into account (Fig. 2A), in agreement with our hypothesis. A detailed examination of which individual 5-mers were contributing to this phenomenon revealed that certain DRACH 5-mers showed no or only mild improvement, whereas others were strongly enhanced in their signal-to-noise ratio (Fig. 2B; see also Supplemental Fig. S4A). Notably, the largest improvement in the signal difference was observed for GGACT (Fig. 2B; see also Supplemental Fig. S4B), which is the most prevalent m⁶A-motif observed in mammalian species (Liu et al. 2020), suggesting that mammalian native RNA was most likely used to train the “default” basecalling model.

View larger version:

Download as PowerPoint Slide

Figure 2.

The effect of custom basecalling models on biologically relevant m⁶A motifs and at different m⁶A stoichiometries. (A) Comparison of the delta summed errors (ΔSumErr = SumErr_MOD − SumErr_UNM) at the central modified position for 5-mers that fall within the DRACH motif (D = A, G, or U; R = G or A; H = A, C, or U) and others (non-DRACH). Statistical analysis was performed using a two-sided nonparametric Wilcoxon test with “default” as the reference group. (B) Comparison of the ΔSumErr for individual motifs that are DRACH and non-DRACH. Each dot represents a single observation of that 5-mer within the “curlcake” sequences. To compare the relationship between individual 5-mers, a paired two-sided Wilcoxon test was performed. (C) Schematic representation of the experimental approach to obtain “curlcakes” at different m⁶A stoichiometries. ATP and m⁶ATP were mixed at different ratios and used for the in vitro transcription of “curlcakes” with the remaining NTPs, followed by poly(A)-tailing, library preparation, and direct RNA sequencing (DRS). (D) Comparison of error-signature amplitude at different modification stoichiometries between 5-mers that fall within the DRACH motif. The corresponding plot for non-DRACH motifs is reported in Supplemental Figure S5B. Lines were fitted using locally estimated scatterplot smoothing (loess), and lightly shaded areas correspond to the 95% confidence interval. For A,B the box is limited by the lower-quartile Q1 (bottom) and upper-quartile Q3 (top). Whiskers are defined as 1.5 × IQR with outliers represented as individual dots. All statistical tests were corrected for multiple hypothesis testing using the Benjamini–Hochberg correction, and adjusted P-values are displayed: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01.

Finally, we examined the sensitivity of the different basecalling models with varying modification stoichiometries. To this end, we generated “curlcake” in vitro transcripts at varying modification levels by mixing ATP and m⁶ATP at different ratios (see also Methods) (Fig. 2C). Our results showed that even at the lowest modification level (12.5%), the IVT model was able to generate a significantly increased error signature at DRACH-bearing 5-mers compared with the default model, which was especially evident in the GGACT context (Fig. 2D; see also Supplemental Fig. S5A–C).

Improved in vivo detection of m⁶A-modified sites in human samples using the IVT model

Next, we examined whether the IVT model would improve the detection of modified sites in vivo using third-party tools for the detection of differentially modified m⁶A sites. Therefore, we used publicly available nanopore DRS data from HEK293T wild-type and Mettl3^−/− cells, which are incapable of placing the methyl mark (Pratanwanich et al. 2021; Liu et al. 2023). We basecalled the data using our three basecalling models, mapped the reads to the reference genome, and finally used ELIGOS2 (Jenjaroenpun et al. 2021), a tool developed to identify RNA modifications through pairwise comparison of error profiles, to detect m⁶A sites (Fig. 3A). Comparison of the control and Mettl3^−/− samples using ELIGOS2 revealed a similar number of sites detected when basecalling the data using the default (4859 m⁶A sites) and IVT model (4979 m⁶A sites) (Supplemental Table S3), showing that the use of the IVT model modestly increased the number of predicted m⁶A-modified sites. In agreement with our in vitro analyses, we observed a significant increase in the ΔSumErr for sites identified with the IVT model compared with the default when considering m⁶A sites within the consensus DRACH motif (Supplemental Fig. S6A). Moreover, as previously shown on the “curlcake” data, the largest difference between the IVT and default models was observed for sites falling into the GGm⁶ACT motif, the most common m⁶A-containing motif in humans, suggesting that IVT has a higher potential to report true m⁶A-modified sites than the default model (Supplemental Fig. S6B).

View larger version:

Download as PowerPoint Slide

Figure 3.

Detection of m⁶A-modified sites from HEK293T cells using a pairwise comparison approach and validation of identified sites with orthogonal methods. (A) Schematic representation of the workflow to obtain m⁶A-modified sites using DRS. Samples from wild-type HEK293T and METTL3^−/− knockout cells were sequenced using DRS. Raw FAST5 files were basecalled using one of the three models tested (default, IVT, and SUP) and mapped to the human reference genome (hg38). To obtain m⁶A-modified sites, pairwise comparisons within each model were performed using ELIGOS2 (Jenjaroenpun et al. 2021). (B) Venn diagrams showing the overlap in absolute numbers between the nanopore used with either the default or IVT model and m6ACE-seq, miCLIP, and GLORI-seq (Pratanwanich et al. 2021; Liu et al. 2023). Bar plots show the relative amount of sites supported. (C) Metagene plot of sites identified by either the default or IVT model show the known pattern of m⁶A distribution with a strong enrichment around the stop codon. Corresponding motifs identified de novo for default (motif AvRec = 0.35) and IVT (motif AvRec = 0.401), using BaMM. (D) Scatterplot depicting the delta summed error (=SumErr_WT − SumErr_KO) of m⁶A sites predicted with ELIGOS2 using default and IVT models, and their corresponding absolute m⁶A-levels as reported by GLORI-seq. Individual points represent single m⁶A sites. Lines represent models fit using a generalized additive model (GAM).

To further examine this point, we compared the set of ELIGOS2-predicted m⁶A sites to a list of m⁶A sites predicted in HEK293T cells using orthogonal methodologies (m6ACE-seq [Pratanwanich et al. 2021], GLORI-seq [Liu et al. 2023], and miCLIP [Cruciani et al. 2023]). This analysis revealed that 67% of the sites reported by ELIGOS2 on reads basecalled with the default model were supported by at least one orthogonal method, whereas 76% of the sites reported by the IVT model had orthogonal support (Fig. 3B; see also Supplemental Fig. S6C–E; Supplemental Table S4). Sites that were supported by at least one orthogonal method demonstrated the typical m⁶A distribution along the transcript, and motif enrichment of the identified m⁶A sites matched the mammalian m⁶A motif, suggesting that these sites are true m⁶A sites (Fig. 3C). Furthermore, we noted that m⁶A sites supported by at least one orthogonal method that were predicted with the IVT model captured a higher proportion of GGACT sites compared with those predicted when basecalling with the default model (Supplemental Fig. S6F,G).

Next, we examined the relationship between the reported ΔSumErr and m⁶A stoichiometry on in vivo sites. To this end, we focused on sites overlapping with those identified by GLORI-seq, which reports absolute quantitative m⁶A information per site. We found that our IVT model generated higher error-signatures than the default across the entire range of m⁶A-levels, suggesting that even though it has a higher overall noise, owing to reduced basecalling accuracy, it shows stronger discriminative power than the default model even at lowly modified sites (Fig. 3D). In addition, we found that the IVT model predicted a larger proportion of high-stoichiometry GLORI-seq sites, suggesting that sites predicted with IVT-basecalled data contain a higher proportion of true positives compared with the default (Supplemental Fig. S6H).

These results show that ELIGOS2-reported m⁶A sites on IVT basecalled data sets are more accurate than m⁶A sites compared with ELIGOS2-reported m⁶A sites on default basecalled data sets. In addition, the IVT model was capable of detecting m⁶A sites that were not reported by any of the orthogonal methods or the other two basecallers tested, but are supported by the most comprehensive m⁶A database available (Liang et al. 2023), suggesting that IVT can identify a unique subset of sites (Supplemental Fig. S6I). In contrast, the same analysis performed with SUP reported only a very small number of m⁶A sites (97), which lacked overlap with the ground-truth data sets (Supplemental Fig. S6D,E), suggesting either that SUP is not suited for the detection of m⁶A using the pairwise-comparison of error-profiles or that ELIGOS2 software, or the default parameters used, are not suited for models that do not generate strong “error” patterns.

A high-accuracy IVT_SUP model does not improve RNA modification detection

We then reasoned that a model trained with IVT data (fully unmodified) and also with the more accurate CTC-CRF architecture might allow for improved RNA modification detection while showing increased accuracy, compared with our “IVT” model (Fig. 1B). We found that this model, which we termed “IVT_SUP,” was able to reach similar basecalling accuracy (97%) to the previously trained “SUP” model (Supplemental Fig. S7A). Subsequently, we assessed its ability to detect m⁶A, Ψ, and m¹Ψ in the form of basecalling errors (Supplemental Fig. S7B,C). We observed a reduced ΔSumErr for pseudouridine modifications (Ψ and m¹Ψ) compared with SUP, suggesting that the IVT_SUP model was suboptimal for the detection of these modifications. In contrast, we observed an increase in ΔSumErr at m⁶A-modified sites compared with the IVT model. A detailed analysis revealed that the increase in overall signal at m⁶A-modified sites could be largely attributed to an increased signal at non-DRACH-containing motifs, whereas DRACH sites showed a similar ΔSumErr to that observed with the default model (Supplemental Fig. S7D). Altogether, our results show that the IVT_SUP model outperformed neither the IVT model (for m⁶A) nor the SUP model (for Ψ and m1Ψ) at detecting RNA modifications in the form of basecalling errors, at least when using ΔSumErr as a metric to identify modified sites. For this reason, subsequent analyses were only performed with the “IVT” and “SUP” models.

SUP model demonstrates 97% read accuracy and up to a 50% increase in mapped reads

Nanopore sequencing shows relatively high error rates compared with other sequencing platforms, especially in the case of DRS, with ∼12% per-read error rates (Jain et al. 2022). This high background limits the technology's applicability in several areas such as SNP detection, transcript annotation, and RNA modification detection. In addition to changes in pore chemistry that can mitigate sequencing errors, the introduction of basecalling models with super high accuracy represents an alternative approach to enhancing the technology's accuracy. To this end, we examined the per-read accuracy of the different models trained, across a wide range of model species, using previously published DRS data sets (Parker et al. 2020; Begik et al. 2021; Pratanwanich et al. 2021; Nguyen et al. 2022). Our results revealed a significant increase in median read accuracy of “SUP” compared with the default model, with the SUP model reaching a median accuracy of 97% in human samples (Fig. 4A; see also Supplemental Fig. S8A).

View larger version:

Download as PowerPoint Slide

Figure 4.

SUP is a super high accuracy model that leads to improved read accuracy and mapping. (A) Cross species comparison of read accuracies obtained from each model. Reads were aligned to the respective reference transcriptome (see Supplemental Table S5). To account for vastly different expression levels that would bias the analysis toward strongly expressed genes, we filtered for the longest reads of each transcript, keeping at most 10 reads per transcript. (B) Comparison of reads mapped to the human reference transcriptome (n = 3) across three bins of read lengths. Error bars indicate ± 1 SD. Statistical significance was determined using a two-sided t-test, corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P ≥ 0.05, (*) P < 0.05). (C) Effect of distinct basecalling models on the proportion of mapped reads obtained from standard poly(A)-selected human samples (median length ≅ 850 bp) and a synthetic RNA library consisting of three polyadenylated in vitro transcripts (median length ≅ 200 nt). Statistical significance was determined using a one-sided t-test, and results were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P ≥ 0.05, (*) P < 0.05, (**) P < 0.01. Box plots depict the underlying size distribution of the sequenced libraries. The box is limited by the lower-quartile Q1 (bottom) and upper-quartile Q3 (top). Whiskers are defined as 1.5 × IQR with outliers being removed.

We then examined whether the increase in read accuracy would be reflected in the form of improved read mapping during alignment. We hypothesized that improved mapping should particularly impact shorter reads, as errors in those will be penalized more by most mapping algorithms, and may lead to reads remaining unmapped. To test this, we examined the relative proportion of mapped reads compared with the default, binning the reads by their read length. Notably, we found a consistent and significant increase in the number of mapped reads with the SUP model across all bins, with the observed difference to the default model being the largest in the category containing the shortest reads (<750 bp; relative increase compared with default = 117%) (Fig. 4B; see also Supplemental Fig. S8B). We then wondered whether this phenomenon would become even more evident in very short RNAs (mRNA-short, median length ≅ 200 bp), which are frequently observed in highly fragmented samples, such as blood plasma or sperm (Roszkowski and Mansuy 2021; Reggiardo et al. 2023). To test this, we used a library consisting of three polyadenylated in vitro transcripts with an average length of ∼200 bp (see Methods). Indeed, this analysis revealed an even stronger increase in the amount of mapped reads of ∼50% compared with the default model (Fig. 4C; see also Supplemental Fig. S8C).

Taken together, these results demonstrate that our super high accuracy (SUP) model significantly increases the average read accuracy, leading to substantially more reads being mapped, which becomes particularly noticeable in short RNA libraries (<200 bp).

Enhanced U > C mismatch signature and reduced background error in a vaccine model using the SUP model

Recently, the need for an improved quality-control pipeline to characterize mRNA vaccine production has been pointed out (Gunter et al. 2023). In this regard, nanopore DRS offers a unique platform as it can simultaneously provide information about sequence identity, RNA integrity, poly(A)-tail length, and modification status of the RNA molecule. Here, we wondered whether the use of our SUP model, which produced high U > C conversion rates at m¹Ψ-modified sites (55%) in the “curlcake” sequences (Supplemental Fig. S9) and demonstrated superior basecalling accuracy (Fig. 4A), would be useful in the quality control of m¹Ψ-containing mRNA vaccines, such as those used against COVID-19 (Nance and Meier 2021). To this end, we re-examined publicly available m¹Ψ-modified and unmodified matched DRS data, which consisted of a synthetic mRNA vaccine that contained an alpha-globin 5′ UTR, an enhanced green fluorescent protein coding sequence, and an AES-mtRNR1 3′ UTR (Gunter et al. 2023). Our results revealed that all synthetic sequences showed a reduced background error when basecalled with our SUP model, in both the m¹Ψ-modified and unmodified variants (Fig. 5A; see also Supplemental Fig. S10A), while simultaneously demonstrating a significant increase in the mismatch errors observed at modified sites (Fig. 5B; see also Supplemental Fig. S10B,C). Indeed, out of a total of 170 m¹Ψ-modified sites, we observed an increased U > C error for 151 sites (≈80%) in SUP over the default (Fig. 5C). More importantly, for sites that previously showed a very poor signature in the default, we observed a global improvement in U > C conversion, suggesting that our SUP model is more sensitive toward m¹Ψ in this context (Fig. 5D; see also Supplemental Fig. S10D). Overall, our results suggest that our SUP model can be a preferable basecalling model for performing quality control on mRNA vaccines and testing the incorporation of modified nucleotides, owing to its reduced background error and increased signal-to-noise ratio at m¹Ψ-modified sites.

View larger version:

Download as PowerPoint Slide

Figure 5.

SUP improves the identification of m¹Ψ-modified residues in synthetic mRNA vaccines. (A) IGV screenshot showing the synthetic mRNA vaccine containing either unmodified uridine (U) or N1-methylpseudouridine (m¹Ψ). The 20,000 longest, uniquely mapped reads were selected for this analysis. Positions at which the mismatch frequency exceeds 0.2 are colored, and those with a mismatch frequency below 0.2 are shown in gray. (B) Comparison of mismatch frequency across all m¹Ψ sites (n = 170) found in the synthetic eGPF vaccine. To test for statistical significance, the nonparametric Wilcoxon-test was used, and values were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure: (ns) P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01, (***) P ≤ 0.001, (****) P ≤ 0.0001. (C) Comparison of U > C mismatch frequency for all 170 m¹Ψ sites between the default and SUP. Points located in the upper half of the dotted line represent sites with higher U > C frequency in SUP (n = 151), and points below the dotted line represent sites with higher U > C frequency in the default. (n = 19). Points are colored by the absolute difference between the default and SUP. (D) Comparison of U > C mismatch frequency across the most lowly modified sites in the default (median = 3.45) compared with the SUP (median = 19.43). A paired two-sided Wilcoxon test was performed to test for statistical significance. For B,D, the box is limited by the lower-quartile Q1 (bottom) and upper-quartile Q3 (top). Whiskers are defined as 1.5 × IQR with outliers removed.

Previous Section Next Section

Discussion

Recently, nanopore DRS has solidified its position as a promising tool for investigating the epitranscriptome, owing to its unique capability of sequencing native RNAs, thus making it possible to detect multiple modifications within the same native RNA molecule (Garalde et al. 2018; Smith et al. 2019; Begik et al. 2021). Indeed, multiple works have already employed this technology to characterize the epitranscriptomic landscape of mRNAs, rRNAs, and tRNAs (Parker et al. 2020; Delgado-Tejedor et al. 2023; Lucas et al. 2024). Early approaches for the detection of modifications using the nanopore platform have relied on the fact that modified residues will cause an error during the basecalling process (Begik et al. 2022). In this regard, several tools such as EpiNano (Liu et al. 2019), DiffErr (Parker et al. 2020), ELIGOS2 (Jenjaroenpun et al. 2021), DRUMMER (Abebe et al. 2022a), and JACUSA2 (Piechotta et al. 2022) have been developed to exploit this information and perform pairwise comparisons between a modified sample and an unmodified control to identify modified sites (Liu et al. 2019; Parker et al. 2020; Jenjaroenpun et al. 2021; Piechotta et al. 2022; Abebe et al. 2022a). In more recent years, current-based methods such as Tombo (Stoiber et al. 2017), xPore (Pratanwanich et al. 2021), nanoRMS (Begik et al. 2021), Nanocompore (Leger et al. 2021), or m6Anet (Hendra et al. 2022), which interrogate the raw signal space, and modification-aware basecallers such as m6ABasecaller (Cruciani et al. 2023) and Dorado (https://github.com/nanoporetech/dorado), have been developed. These tools show some key advantages compared with error-based methods, such as the possibility of achieving per-read modification resolution (e.g., in the case of m6Anet, m6ABasecaller, and Dorado). Nevertheless, error-based approaches still remain a better approach in specific biological scenarios in which RNA modifications are highly abundant, such as in tRNAs (Lucas et al. 2024). Moreover, training modification-aware basecallers for these RNA species would require the synthesis of RNAs that contain combinations of different modifications, which currently is not feasible. Therefore, error-based approaches provide a valuable alternative to current-based methods and modification-aware basecallers.

A critical component for error-based methods is the basecalling model, which is trained on a specific data set and is responsible for converting the raw current intensity signal that is output by the sequencing machine into a nucleotide sequence. The currently used basecalling model for RNA002 chemistries, which we refer to as the “default” (provided with Guppy 6.0.6 and upward), shows an overall per-read accuracy of 90% using ONT's proprietary basecaller Guppy (Liu-Wei et al. 2024). Previous efforts have been made to develop novel, more-accurate basecalling software, such as RODAN for RNA and Chiron and DeepNano for DNA (Boža et al. 2017; Teng et al. 2018; Neumann et al. 2022), among others. However, the effect of using alternative basecalling models on the detection of RNA modifications has so far not been explored.

Here we present two novel basecalling models, IVT and SUP (Supplemental Table S1), and benchmark their ability to detect a battery of RNA modification types, as well as their basecalling accuracy and effect on read mapping. We find that using a model that was trained on in vitro transcribed RNA (IVT model) significantly improves the detection of m⁶A modifications (Fig. 1E,F) by generating a strong deletion signature at the modified site (Supplemental Fig. S2C,D), improving the signal-to-noise ratio. This leads to a significant increase in the total number of high-confidence m⁶A sites recovered in in vivo samples compared with the default model (Fig. 4B). We argue that this improvement is because the default model was trained on native RNA from different organisms containing large amounts of m⁶A and therefore learned to recognize both unmodified A and m⁶A as A, especially at naturally modified 5-mers (Fig. 3A). We find evidence for this hypothesis in the fact that the most improved 5-mer coincides with the most highly modified 5-mer in humans GGm⁶ACT (Fig. 3B; see also Supplemental Fig. S4B).

Our second model, SUP, was trained with the intention of maximizing read accuracy. Therefore, we changed the model architecture from flip-flop, used in both default and IVT, to CTC-CRF (currently used in ONT DNA-seq; see Methods). We found that SUP reaches a per-read accuracy of 97% on human data compared with 91% in the default, which is a substantial improvement and, to our knowledge, the most accurately basecalled DRS run to date (Fig. 4A). We additionally found that this improved accuracy was accompanied by an increased number of mapped reads, becoming particularly apparent in short reads, in which the increase in read mappability reached ∼50% (Fig. 4B,C). Similar performance has been reported regarding ONT's newest direct RNA sequencing chemistry, SQK-RNA004 (96% accuracy). We should note that our improvements in basecalling accuracy were obtained using the SQK-RNA002 chemistry. Furthermore, our SUP model also increased basecalling error signatures at Ψ- and N1-methylpseudouridine (m¹Ψ)-modified sites, making it the preferred model for highly modified substrates such as mRNA vaccines (Fig. 5A).

Previous Section Next Section

Methods

In vitro transcription

This study used synthetic RNAs known as “curlcakes,” which were designed to include all possible 5-mer sequences while minimizing RNA secondary structures (Liu et al. 2019) and consist of four in vitro transcribed constructs: (1) Curlcake 1, 2244 bp; (2) Curlcake 2, 2459 bp; (3) Curlcake 3, 2595 bp; and (4) Curlcake 4, 2709 bp. To generate “curlcake” synthetic RNAs, the plasmids were digested with BamHI-HF (NEB R3136L) and EcoRV-HF (NEB R3195L) followed by a clean-up using phenol/chloroform/isoamyl-alcohol 25/24/1, v/v (pH 8.05; Sigma-Aldrich P3803). Linearized plasmids were used for in vitro transcription with the AmpliScribe T7-flash transcription kit (Lucigen ASF3507) using 5-methyluridine-5′-triphosphate (5-mUTP, Trilink N-1024-1) instead of UTP, N1-methylpseudouridine-5′-triphosphate (m1meΨTP, Jena Bioscience NU-890S) instead of UTP, or N4-acetyl-cytidine-5′-triphosphate (N4-acetyl-CTP, Jena Bioscience NU-988S) instead of CTP. Products were treated with Turbo DNase (Invitrogen AM2238) to remove the template plasmid and were cleaned up using the RNeasy MinElute kit (Qiagen 74204). All constructs were polyadenylated using Escherichia coli poly(A) polymerase (NEB M0276S) according to the manufacturer's instructions followed by a cleanup using the RNA Clean & Concentrator-5 kit (Zymo Research R1013). Concentrations were determined using the Qubit 2.0 fluorometer. Transcript integrity, as well as polyadenylation success, was determined using an Agilent 4200 TapeStation. For previously sequenced curlcakes used in this study, including unmodified NTPs (UNMs), N6-methyladenosine triphosphate (m⁶ATP), 5-methylcytosine triphosphate (m⁵CTP), 5-hydroxymethylcytosine triphosphate (hm⁵CTP), and pseudouridine triphosphate (ΨTP), publicly available raw FAST5 files were used as input for the analysis (see Supplemental Table S2; Liu et al. 2019; Begik et al. 2021).

DRS of in vitro transcribed RNA

The RNA libraries for DRS of ac⁴C, m⁵U, m¹Ψ-containing curlcake constructs, and UNM-S (which included Bacillus subtilis guanine riboswitch, B. subtilis lysine riboswitch, and Tetrahymena ribozyme) were prepared following the ONT DRS protocol version DRS_9080_v2_revR_14Aug2019-minion. Briefly, 800 ng of poly(A)-tailed RNA (200 ng per curlcake construct and 250 ng per riboswitch) was ligated to ONT RT adaptor (RTA) using concentrated T4 DNA ligase (NEB M0202T). The optional reverse transcription step was performed using SuperScript III (Thermo Fisher Scientific 18080044) in the case of ac⁴C, m⁵U-containing curlcake constructs, and UNM-S and using Superscript IV (Thermo Fisher Scientific 18090010) in the case of m¹Ψ. The products were purified using 1.8× Agencourt RNAClean XP beads (Thermo Fisher Scientific NC0068576) and washed with 70% freshly prepared ethanol. RNA adapter (RMX) was ligated onto the RNA:DNA hybrid, and the mix was purified using 1× Agencourt RNAClean XP beads, followed by washing with wash buffer (WSB) twice. The sample was then eluted in elution buffer (ELB) and mixed with RNA running buffer before loading onto a primed R9.4.1 flowcell. The samples were run on a MinION sequencer for 24–72 h. The sequencing runs were performed on independent days and using a different flowcell for each sample.

Training a high-accuracy RNA basecallers (SUP model)

We trained a new RNA basecalling model (SUP) using bonito v0.5.1 (https://github.com/nanoporetech/bonito). The SUP model is based on the latest CTC-CRF (bonito) DNA model architecture, which achieves 96% accuracy for DNA basecalling with chemistry R9.4.1. DNA CTC-CRF models use a window of 19 and stride of five. For our RNA models, we used a window of 31 and stride of 10 (similar to RNA flip-flop models).

Bonito model training requires chunks of fixed length (10,000 samples) and underlying sequences. We kept only chunks that corresponded to 30–270 bases. Chunks from each read were generated by stepping every 50 bases in the sequence space. Training was performed using bonito by keeping 3% of randomly selected chunks for validation. The SUP model was trained using publicly available human data sets containing triplicates of wild-type and Mettl3^−/− cells (Pratanwanich et al. 2021). We kept the five longest reads per transcript and ran them in order to balance highly and lowly expressed transcripts. In total, 158,413 reads were used for training. The training was stopped after five epochs. We selected the model with the lowest validation error as the final one from each training, corresponding to epoch 5 for the SUP model. The resulting model was calibrated using the calibrate_qscores_byread.py script from taiyaki and reached 97% median accuracy on native human RNA (Fig. 1B).

Training an RNA basecaller using in vitro sequences (IVT model)

The IVT model was trained using publicly available IVT direct RNA data obtained from CEPH1463 cells (UCSC_Run1_IVT_RNA). RNA model training procedure was performed for 50,000 iterations with following taiyaki hyperparameters: ‐‐size 256 ‐‐stride 10 ‐‐winlen 31 ‐‐chunk_len 4000 ‐‐batch_size 50. We selected up to 100 chunks from every read using a step of 25 bases. IVT model was trained using taiyaki v4.1.0 (https://github.com/nanoporetech/taiyaki).

Demultiplexing DRS

To evaluate the effect of differing m⁶A-stoichiometries on the sensitivity of each basecalling model, we used publicly available, multiplexed curlcake data (obtained from the NCBI BioProject database [https://www.ncbi.nlm.nih.gov/bioproject/] under accession number PRJEB61874) containing a range of stoichiometries (barcode1 = 12.5%, barcode 4 = 25%, barcode 3 = 50%, and barcode 2 = 75%; m⁶ATP). Demultiplexing of the barcoded DRS libraries was performed using DeePlexiCon (Smith et al. 2020), keeping reads with demultiplexing confidence scores above 0.80 (-f multi -m resnet20-final.h5 -s 0.8).

Basecalling and mapping of DRS data sets

Publicly available data sets for Homo sapiens (Pratanwanich et al. 2021), Mus musculus (Nguyen et al. 2022), Xenopus laevis (Nguyen et al. 2022), Saccharomyces cerevisiae (Liu et al. 2019), Arabidopsis thaliana (Parker et al. 2020), and the synthetic eGFP mRNA vaccine (Gunter et al. 2023) were downloaded from their respective data repositories reported in Supplemental Table S2. Raw FAST5 files were basecalled and mapped using the MasterOfPores pipeline (version 2.0) (Cozzuto et al. 2023). For basecalling, Guppy (version 6.0.6) was used with either the default, IVT, or SUP model, disabling the filtering step based on read quality (‐‐disable_qscore_filtering) (Supplemental Table S5). For the H. sapiens data set, we additionally examined the basecalling accuracy of Dorado (version 0.6.0) using the highest accuracy model available for the RNA002 chemistry (rna002_70bps_hac@v3) (Fig. 1B). For highly modified curlcake runs and the eGFP mRNA vaccine, basecalled FASTQ files were aligned to the curlcake reference using GraphMap (version 0.5.2) (Sović et al. 2016) with “sensitive” settings (‐‐rebuild-index -v 1 ‐‐double-index ‐‐mapq -1 -x sensitive -z 1 -K fastq ‐‐min-read-len 0 -A 7 -k 5) (Begik et al. 2021). For the in vivo m6A-analysis, basecalled FASTQ files were aligned to the soft-masked human genome (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa) using minimap2 (v2.17) with “default” settings (-uf -ax splice -k14). Uniquely mapped reads were reported with SAMtools (version 1.15.1) (Li et al. 2009) using samtools view -c -F 260 (Supplemental Table S5).

For the cross-species comparison, reads were mapped to the respective reference transcriptome (Supplemental Table S5) using minimap2 (v2.17) with “transcriptome” settings (-ax map-ont -k14). Aligned reads were first filtered for uniquely mapped reads with a mapQ > 15 using SAMtools (samtools view -q 15 -F 3840 <in.bam> > <out.bam>) (Li et al. 2009). To account for the heterogeneous expression profiles of each transcriptome and to prevent biasing our analysis toward highly expressed genes, we sampled the longest reads per transcript, keeping at most 10 reads per transcript (bam2select.py). Per read identities were calculated using a custom Python script (identity_density_v2.py). All scripts required to reproduce the results in this paper are provided on the project's GitHub page (https://github.com/novoalab/basecalling_models). Silhouettes of the model species were downloaded from PhyloPic (https://www.phylopic.org/).

Comparative analysis of system requirements

To compare the computational requirements of the default model with our custom IVT and SUP models, we subsampled 2 × 100,000 reads from a publicly available mouse brain DRS data set (Nguyen et al. 2022), which was also used in the cross-species comparison (Fig. 4A; Supplemental Table S2). Both replicates (100,000 reads each) were basecalled with Guppy (version 6.0.6), using the Master of Pores pipeline (‐‐granularity 25) on a single GPU (RTX A4000, CUDA11) and a single CPU with 12 GB of memory allocated. Computational requirements (runtime and %CPU usage) were extracted from the nextflow report (‐‐with-report) (Supplemental Fig. S1C,D).

Comparative analysis of basecalled features in curlcakes

To qualitatively assess the difference in basecalling errors introduced by default, IVT, and SUP upon RNA modifications, the sorted and indexed BAM files were visualized using the Integrative Genomics Viewer (IGV, version 2.13.1) (Robinson et al. 2011). Using EpinanoRMS (version 1.1) (Begik et al. 2021), basecalled features (mismatch frequency, deletion frequency, and insertion frequency) were extracted from the alignment files on a per-position basis. To obtain per 5-mer information, this output was converted using epinano_to_kmer.R script, which applies a sliding window of a given size to compile the per k-mer information (R Core Team 2023). These results were further processed to calculate the SumErr (Σ(mismatch frequency, deletion frequency, insertion frequency)) and ΔSumErr(SumErr_MOD − SumErr_UNM) on a per-position basis and visualized using in-house R scripts (Wickham 2016). To calculate the mismatch directionality, the number of reads corresponding to the correct reference base for that position was divided by the total number of reads obtained for that position. All scripts required to reproduce these results are available under the project's GitHub repository (https://github.com/novoalab/basecalling_models).

Detection of m⁶A-modified sites in HEK293T using comparative error analysis

To determine whether alternative basecalling models can improve the detection of m⁶A in vivo, the alignment files of individual replicates for the wild type and Mettl3^−/− were merged to obtain greater coverage. To reduce computational load, only protein-coding genes were considered in the comparison. For this reason, lines in which the gene biotype corresponds to protein coding were extracted from the genome annotation file and converted to BED format using bedops gtf2bed (version 2.4.41) (Neph et al. 2012). The resulting BED file was split into individual chromosomes that would serve as regions for testing in the pairwise comparison step. To perform pairwise comparisons, we ran ELIGOS2 (v2.1.0) (Jenjaroenpun et al. 2021) using the singularity image provided on the software's GitLab repository (https://gitlab.com/piroonj/eligos2) with a P-value cut-off of 1 × 10⁻⁵, an odds ratio of 2.5, and each of the individual chromosome files as regions for testing. Next, results obtained from individual chromosomes were merged using an in-house shell script and filtered to remove non-A bases as well as homopolymers to identify putative m⁶A sites (eligos2 filter -sb A ‐‐homopolymer). For comparative analysis of basecalling models, we extracted error-signatures of ELIGOS2-predicted m⁶A sites using EpiNano (v1.2.4; https://github.com/novoalab/EpiNano).

Metagene analysis and motif identification

To visualize the transcriptome-wide distribution of putative m⁶A sites identified with distinct basecalling models, we used the GuitaR package (version 2.12.0) (Cui et al. 2016). For de novo motif discovery, sites obtained from ELIGOS2 were converted into the FASTA format, which contained 6 bp upstream of and downstream from the modified site using the table2fa_eligos.mergeextend.sh script. Subsequently, those files were used for de novo motif prediction using BaMM motif (version 1.4.0) (Siebert and Söding 2016) using default settings and a Z-score threshold of five.

Comparison of identified m⁶A sites against m⁶ACE-seq, GLORI-seq, and miCLIP-seq

To identify high-confidence m⁶A sites, we intersected results obtained by each of the three models (default, IVT, and SUP) with data obtained from orthogonal methods. These data sets included m⁶ACE-seq data that were matched with the nanopore results, as well as GLORI-seq and miCLIP-seq results obtained from the same cell line (Linder et al. 2015; Pratanwanich et al. 2021; Liu et al. 2023). These data sets were filtered to only retain sites for which we had sufficient coverage in the nanopore runs (≥20×) within protein-coding genes (https://github.com/novoalab/basecalling_models). Overlaps between data sets were visualized using the ggvenn R package (version 0.1.9).

For stoichiometric analysis, sites overlapping with GLORI-seq were used as this data set provides per-nucleotide resolution information as well as an estimated stoichiometry for each identified m⁶A site (Liu et al. 2023). To this end, results obtained by GLORI-seq were first binned based on their reported stoichiometry into sites that were considered either HIGH (≥0.75 modified), MED-HI (≥0.50 and <0.75 modified), MED-LO (≥0.25 and <0.50 modified), or LOW (≤0.25 modified). Only those sites were retained in the final set in which both GLORI-seq replicates were reported within the same bin. This resulted in a total of 47,722 high-confidence sites split between HIGH (n = 8900), MED-HI (n = 7947), MED-LO (n = 12,697), and LOW (n = 18,178).

Processing of eGFP mRNA vaccine data and comparative analysis

Aligned reads were filtered to obtain the 20,000 longest reads and matched between default and SUP (bam2select.py). Mismatch, deletion, and insertion frequencies were calculated and extracted on a per nucleotide basis using EpinanoRMS (Liu et al. 2019). The per-base frequencies were calculated by dividing the number of each of the four canonical nucleobases (A, C, G, or T) for each position by the number of total observations for that position. To determine differences in sensitivity between the two models, we extracted the 20 most lowly U > C-modified sites in default and compared them to the same positions in SUP. All scripts available at GitHub (https://github.com/novoalab/basecalling_models).

Software availability

All code used to build figures shown in this work can be found on GitHub (https://github.com/novoalab/basecalling_models) and as Supplemental Code. All trained basecalling models used in this work are available in OSF (https://osf.io/2xgkp/). Processing of raw FAST5 files was performed using the MasterOfPores (version 2.0) nextflow pipeline (Di Tommaso et al. 2017; Cozzuto et al. 2023; https://github.com/biocorecrg/MOP2).

Previous Section Next Section

Data access

FAST5 DRS data sets basecalled with the default, corresponding to “curlcakes” containing ac⁴C, m⁵U, and m¹Y, as well as UNM-S rep2 and UNM-S rep3, generated in this study have been submitted to the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/home), under accession code PRJEB67632. All processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE246151. Further data sets used in this study are specified in Supplemental Table S2.

Previous Section Next Section

Competing interest statement

E.M.N. has received travel and accommodation expenses to speak at Oxford Nanopore Technologies (ONT) conferences. G.D. and O.B. have received travel bursaries from ONT to present their work at conferences. E.M.N. is a member of the scientific advisory board of IMMAGINA Biotech. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could otherwise be construed as a conflict of interest.

Previous Section Next Section

Acknowledgments

G.D. is part of the ROPES ITN, which received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement no. 956810. O.B. is supported by funds from Merck Innovation 2020. L.P.P. was supported by funding from the European Union's H2020 research and innovation program under Marie Sklodowska-Curie grant agreement no. 754422, and is currently supported by European Research Council funds (ERC-StG-2021 no. 101042103 to E.M.N.). J.-Y.R. is supported by funds from the Swiss National Science Foundation (310030_197906) and the Deutsche Forschungsgemeinschaft (RO 4681/4-2, RO 4681/6-1, and RO 4681/12-1, TRR319 RMaP). This work was supported by funds from the Spanish Ministry of Economy, Industry, and Competitiveness (MEIC; PID2021-128193NB-100 to E.M.N.) and the European Research Council (ERC-StG-2021 no. 101042103 to E.M.N.). We acknowledge the support of the MEIC to the EMBL partnership, Centro de Excelencia Severo Ochoa, and CERCA Programme/Generalitat de Catalunya.

Author contributions: G.D. performed the bioinformatic analyses included in this work, together with O.B., S.C., and A.D.-T. L.P.P. trained the RNA basecalling models used in this work. L.L., M.C.L., and O.B. generated curlcake data sets used in this work. S.C. and A.D.-T. contributed with bioinformatic analyses. O.B. and E.M.N. supervised the project, with the contribution of J.-Y.R. E.M.N. conceived the project and acquired funding. G.D. built the figures. G.D. and E.M.N. wrote the manuscript, with the contribution from all authors.

Previous Section Next Section

Footnotes

[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.278849.123.

Received December 12, 2023.
Accepted September 10, 2024.

© 2024 Diensthuber et al.; Published by Cold Spring Harbor Laboratory Press

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Previous Section

References

↵

Abebe JS, Price AM, Hayer KE, Mohr I, Weitzman MD, Wilson AC, Depledge DP. 2022a. DRUMMER-rapid detection of RNA modifications through comparative nanopore sequencing. Bioinformatics 38: 3113–3115. doi:10.1093/bioinformatics/btac274

CrossRef Medline Google Scholar
↵

Abebe JS, Verstraten R, Depledge DP. 2022b. Nanopore-based detection of viral RNA modifications. mBio 13: e0370221. doi:10.1128/mbio.03702-21

CrossRef Medline Google Scholar
↵

Acera Mateos P, Sethi AJ, Ravindran A, Srivastava A, Woodward K, Mahmud S, Kanchi M, Guarnacci M, Xu J, Yuen ZWS, et al. 2024. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun 15: 3899. doi:10.1038/s41467-024-47953-7

CrossRef Medline Google Scholar
↵

Begik O, Lucas MC, Pryszcz LP, Ramirez JM, Medina R, Milenkovic I, Cruciani S, Liu H, Vieira HGS, Sas-Chen A, et al. 2021. Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing. Nat Biotechnol 39:1278–1291. doi:10.1038/s41587-021-00915-6

CrossRef Medline Google Scholar
↵

Begik O, Mattick JS, Novoa EM. 2022. Exploring the epitranscriptome by native RNA sequencing. RNA 28: 1430–1439. doi:10.1261/rna.079404.122

Abstract/FREE Full Text
↵

Boža V, Brejová B, Vinař T. 2017. DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12: e0178751. doi:10.1371/journal.pone.0178751

CrossRef Medline Google Scholar
↵

Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. 2014. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature 515: 143–146. doi:10.1038/nature13802

CrossRef Medline Google Scholar
↵

Cozzuto L, Delgado-Tejedor A, Hermoso Pulido T, Novoa EM, Ponomarenko J. 2023. Nanopore direct RNA sequencing data processing and analysis using MasterOfPores. Methods Mol Biol 2624: 185–205. doi:10.1007/978-1-0716-2962-8_13

CrossRef Medline Google Scholar
↵

Cruciani S, Delgado-Tejedor A, Pryszcz LP, Medina R, Llovera L, Novoa EM. 2023. De novo basecalling of m6A modifications at single molecule and single nucleotide resolution. bioRxiv doi:10.1101/2023.11.13.566801

Abstract/FREE Full Text
↵

Cui X, Wei Z, Zhang L, Liu H, Sun L, Zhang S-W, Huang Y, Meng J. 2016. Guitar: an R/Bioconductor package for gene annotation guided transcriptomic analysis of RNA-related genomic features. Biomed Res Int 2016: 8367534. doi:10.1155/2016/8367534

CrossRef Google Scholar
↵

Delgado-Tejedor A, Medina R, Begik O, Cozzuto L, Ponomarenko J, Novoa EM. 2023. Native RNA nanopore sequencing reveals antibiotic-induced loss of rRNA modifications in the A- and P-sites. bioRxiv doi:10.1101/2023.03.21.533606

Abstract/FREE Full Text
↵

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. 2017. Nextflow enables reproducible computational workflows. Nat Biotechnol 35: 316–319. doi:10.1038/nbt.3820

CrossRef Medline Google Scholar
↵

Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, Cesarkas K, Jacob-Hirsch J, Amariglio N, Kupiec M, et al. 2012. Topology of the human and mouse m⁶A RNA methylomes revealed by m⁶A-seq. Nature 485: 201–206. doi:10.1038/nature11112

CrossRef Medline Google Scholar
↵

Frye M, Jaffrey SR, Pan T, Rechavi G, Suzuki T. 2016. RNA modifications: What have we learned and where are we headed? Nat Rev Genet 17: 365–372. doi:10.1038/nrg.2016.47

CrossRef Medline Google Scholar
↵

Furlan M, Delgado-Tejedor A, Mulroney L, Pelizzola M, Novoa EM, Leonardi T. 2021. Computational methods for RNA modification detection from nanopore direct RNA sequencing data. RNA Biol 18: 31–40. doi:10.1080/15476286.2021.1978215

CrossRef Medline Google Scholar
↵

Gao Y, Liu X, Wu B, Wang H, Xi F, Kohnen MV, Reddy ASN, Gu L. 2021. Quantitative profiling of N⁶-methyladenosine at single-base resolution in stem-differentiating xylem of Populus trichocarpa using Nanopore direct RNA sequencing. Genome Biol 22: 22. doi:10.1186/s13059-020-02241-7

CrossRef Medline Google Scholar
↵

Garalde DR, Snell EA, Jachimowicz D, Sipos B, Lloyd JH, Bruce M, Pantic N, Admassu T, James P, Warland A, et al. 2018. Highly parallel direct RNA sequencing on an array of nanopores. Nat Methods 15: 201–206. doi:10.1038/nmeth.4577

CrossRef Medline Google Scholar
↵

Gilbert WV, Nachtergaele S. 2023. mRNA regulation by RNA modifications. Annu Rev Biochem 92: 175–198. doi:10.1146/annurev-biochem-052521-035949

CrossRef Medline Google Scholar
↵

Gunter HM, Idrisoglu S, Singh S, Han DJ, Ariens E, Peters JR, Wong T, Cheetham SW, Xu J, Rai SK, et al. 2023. mRNA vaccine quality analysis using RNA sequencing. Nat Commun 14: 5663. doi:10.1038/s41467-023-41354-y

CrossRef Medline Google Scholar
↵

Helm M, Motorin Y. 2017. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat Rev Genet 18: 275–291. doi:10.1038/nrg.2016.169

CrossRef Medline Google Scholar
↵

Hendra C, Pratanwanich PN, Wan YK, Goh WSS, Thiery A, Göke J. 2022. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat Methods 19: 1590–1598. doi:10.1038/s41592-022-01666-1

CrossRef Google Scholar
↵

Huang S, Zhang W, Katanski CD, Dersh D, Dai Q, Lolans K, Yewdell J, Eren AM, Pan T. 2021. Interferon inducible pseudouridine modification in human mRNA by quantitative nanopore profiling. Genome Biol 22: 330. doi:10.1186/s13059-021-02557-y

CrossRef Medline Google Scholar
↵

Jain M, Abu-Shumays R, Olsen HE, Akeson M. 2022. Advances in nanopore direct RNA sequencing. Nat Methods 19: 1160–1164. doi:10.1038/s41592-022-01633-w

CrossRef Medline Google Scholar
↵

Jenjaroenpun P, Wongsurawat T, Wadley TD, Wassenaar TM, Liu J, Dai Q, Wanchai V, Akel NS, Jamshidi-Parsian A, Franco AT, et al. 2021. Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res 49: e7. doi:10.1093/nar/gkaa620

CrossRef Medline Google Scholar
↵

Kim D, Lee J-Y, Yang J-S, Kim JW, Kim VN, Chang H. 2020. The architecture of SARS-CoV-2 transcriptome. Cell 181: 914–921.e10. doi:10.1016/j.cell.2020.04.011

CrossRef Medline Google Scholar
↵

Leger A, Amaral PP, Pandolfini L, Capitanchik C, Capraro F, Miano V, Migliori V, Toolan-Kerr P, Sideri T, Enright AJ, et al. 2021. RNA modifications detection by comparative Nanopore direct RNA sequencing. Nat Commun 12: 7198. doi:10.1038/s41467-021-27393-3

CrossRef Medline Google Scholar
↵

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079. doi:10.1093/bioinformatics/btp352

CrossRef Medline Google Scholar
↵

Liang Z, Ye H, Ma J, Wei Z, Wang Y, Zhang Y, Huang D, Song B, Meng J, Rigden DJ, et al. 2023. m⁶A-Atlas v2.0: updated resources for unraveling the N⁶-methyladenosine (m⁶A) epitranscriptome among multiple species. Nucleic Acids Res 52: D194–D202. doi:10.1093/nar/gkad69

CrossRef Google Scholar
↵

Linder B, Grozhik AV, Olarerin-George AO, Meydan C, Mason CE, Jaffrey SR. 2015. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat Methods 12: 767–772. doi:10.1038/nmeth.3453

CrossRef Medline Google Scholar
↵

Liu H, Begik O, Lucas MC, Ramirez JM, Mason CE, Wiener D, Schwartz S, Mattick JS, Smith MA, Novoa EM. 2019. Accurate detection of m⁶A RNA modifications in native RNA sequences. Nat Commun 10: 4079. doi:10.1038/s41467-019-11713-9

CrossRef Medline Google Scholar
↵

Liu J, Li K, Cai J, Zhang M, Zhang X, Xiong X, Meng H, Xu X, Huang Z, Peng J, et al. 2020. Landscape and regulation of m⁶A and m⁶Am methylome across human and mouse tissues. Mol Cell 77: 426–440.e6. doi:10.1016/j.molcel.2019.09.032

CrossRef Medline Google Scholar
↵

Liu H, Begik O, Novoa EM. 2021. EpiNano: detection of m⁶A RNA modifications using Oxford nanopore direct RNA sequencing. Methods Mol Biol 2298: 31–52. doi:10.1007/978-1-0716-1374-0_3

CrossRef Medline Google Scholar
↵

Liu C, Sun H, Yi Y, Shen W, Li K, Xiao Y, Li F, Li Y, Hou Y, Lu B, et al. 2023. Absolute quantification of single-base m⁶A methylation in the mammalian transcriptome using GLORI. Nat Biotechnol 41: 355–366. doi:10.1038/s41587-022-01487-9

CrossRef Google Scholar
↵

Liu-Wei W, van der Toorn W, Bohn P, Hölzer M, Smyth RP, von Kleist M. 2024. Sequencing accuracy and systematic errors of nanopore direct RNA sequencing. BMC Genomics 25: 528. doi:10.1186/s12864-024-10440-w

CrossRef Medline Google Scholar
↵

Lorenz DA, Sathe S, Einstein JM, Yeo GW. 2020. Direct RNA sequencing enables m⁶A detection in endogenous transcript isoforms at base-specific resolution. RNA 26: 19–28. doi:10.1261/rna.072785.119

Abstract/FREE Full Text
↵

Lucas MC, Novoa EM. 2023. Long-read sequencing in the era of epigenomics and epitranscriptomics. Nat Methods 20: 25–29. doi:10.1038/s41592-022-01724-8

CrossRef Medline Google Scholar
↵

Lucas MC, Pryszcz LP, Medina R, Milenkovic I, Camacho N, Marchand V, Motorin Y, Ribas de Pouplana L, Novoa EM. 2024. Quantitative analysis of tRNA abundance and modifications by nanopore RNA sequencing. Nat Biotechnol 42: 72–86. doi:10.1038/s41587-023-01743-6

CrossRef Google Scholar
↵

Marchand V, Blanloeil-Oillo F, Helm M, Motorin Y. 2016. Illumina-based RiboMethSeq approach for mapping of 2′-O-Me residues in RNA. Nucleic Acids Res 44: e135. doi:10.1093/nar/gkw547

CrossRef Medline Google Scholar
↵

Marchand V, Ayadi L, Ernst FGM, Hertler J, Bourguignon-Igel V, Galvanin A, Kotter A, Helm M, Lafontaine DLJ, Motorin Y. 2018. AlkAniline-Seq: profiling of m⁷G and m³C RNA modifications at single nucleotide resolution. Angew Chem Int Ed 57: 16785–16790. doi:10.1002/anie.201810946

CrossRef Google Scholar
↵

Meyer KD, Saletore Y, Zumbo P, Elemento O, Mason CE, Jaffrey SR. 2012. Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell 149: 1635–1646. doi:10.1016/j.cell.2012.05.003

CrossRef Medline Google Scholar
↵

Nance KD, Meier JL. 2021. Modifications in an emergency: the role of N1-methylpseudouridine in COVID-19 vaccines. ACS Cent Sci 7: 748–756. doi:10.1021/acscentsci.1c00197

CrossRef Medline Google Scholar
↵

Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, et al. 2012. BEDOPS: high-performance genomic feature operations. Bioinformatics 28: 1919–1920. doi:10.1093/bioinformatics/bts277

CrossRef Medline Google Scholar
↵

Neumann D, Reddy ASN, Ben-Hur A. 2022. RODAN: a fully convolutional architecture for basecalling nanopore RNA sequencing data. BMC Bioinformatics 23: 142. doi:10.1186/s12859-022-04686-y

CrossRef Medline Google Scholar
↵

Nguyen TA, Heng JWJ, Kaewsapsak P, Kok EPL, Stanojević D, Liu H, Cardilla A, Praditya A, Yi Z, Lin M, et al. 2022. Direct identification of A-to-I editing sites with nanopore native RNA sequencing. Nat Methods 19: 833–844. doi:10.1038/s41592-022-01513-3

CrossRef Medline Google Scholar
↵

Noakes MT, Brinkerhoff H, Laszlo AH, Derrington IM, Langford KW, Mount JW, Bowman JL, Baker KS, Doering KM, Tickman BI, et al. 2019. Increasing the accuracy of nanopore DNA sequencing using a time-varying cross membrane voltage. Nat Biotechnol 37: 651–656. doi:10.1038/s41587-019-0096-0

CrossRef Medline Google Scholar
↵

Novoa EM, Mason CE, Mattick JS. 2017. Charting the unknown epitranscriptome. Nat Rev Mol Cell Biol 18: 339–340. doi:10.1038/nrm.2017.49

CrossRef Medline Google Scholar
↵

Parker MT, Knop K, Sherwood AV, Schurch NJ, Mackinnon K, Gould PD, Hall AJ, Barton GJ, Simpson GG. 2020. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification. eLife 9: e49658. doi:10.7554/eLife.49658

CrossRef Google Scholar
↵

Piechotta M, Naarmann-de Vries IS, Wang Q, Altmüller J, Dieterich C. 2022. RNA modification mapping with JACUSA2. Genome Biol 23: 115. doi:10.1186/s13059-022-02676-0

CrossRef Medline Google Scholar
↵

Pratanwanich PN, Yao F, Chen Y, Koh CWQ, Wan YK, Hendra C, Poon P, Goh YT, Yap PML, Chooi JY, et al. 2021. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. Nat Biotechnol 39: 1394–1402. doi:10.1038/s41587-021-00949-w

CrossRef Google Scholar
↵

R Core Team. 2023. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/.

Google Scholar
↵

Reggiardo RE, Maroli SV, Peddu V, Davidson AE, Hill A, LaMontagne E, Al Aaraj Y, Jain M, Chan SY, Kim DH. 2023. Profiling of repetitive RNA sequences in the blood plasma of patients with cancer. Nat Biomed Eng 7: 1627–1635. doi:10.1038/s41551-023-01081-7

CrossRef Google Scholar
↵

Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. 2011. Integrative genomics viewer. Nat Biotechnol 29: 24–26. doi:10.1038/nbt.1754

CrossRef Medline Google Scholar
↵

Roszkowski M, Mansuy IM. 2021. High efficiency RNA extraction from sperm cells using guanidinium thiocyanate supplemented with tris(2-carboxyethyl)phosphine. Front Cell Dev Biol 9: 648274. doi:10.3389/fcell.2021.648274

CrossRef Medline Google Scholar
↵

Roundtree IA, Evans ME, Pan T, He C. 2017. Dynamic RNA modifications in gene expression regulation. Cell 169: 1187–1200. doi:10.1016/j.cell.2017.05.045

CrossRef Medline Google Scholar
↵

Saletore Y, Meyer K, Korlach J, Vilfan ID, Jaffrey S, Mason CE. 2012. The birth of the epitranscriptome: deciphering the function of RNA modifications. Genome Biol 13: 175. doi:10.1186/gb-2012-13-10-175

CrossRef Medline Google Scholar
↵

Sas-Chen A, Thomas JM, Matzov D, Taoka M, Nance KD, Nir R, Bryson KM, Shachar R, Liman GLS, Burkhart BW, et al. 2020. Dynamic RNA acetylation revealed by quantitative cross-evolutionary mapping. Nature 583: 638–643. doi:10.1038/s41586-020-2418-2

CrossRef Medline Google Scholar
↵

Schwartz S, Bernstein DA, Mumbach MR, Jovanovic M, Herbst RH, León-Ricardo BX, Engreitz JM, Guttman M, Satija R, Lander ES, et al. 2014. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell 159: 148–162. doi:10.1016/j.cell.2014.08.028

CrossRef Medline Google Scholar
↵

Shaw A, Craig JM, Amiri H, Kim J, Upton HE, Pimentel SC, Huang JR, Marqusee S, Gundlach JH, Collins K, et al. 2023. Secondary structure detection through direct nanopore RNA sequencing. bioRxiv doi:10.1101/2023.04.05.535757

Abstract/FREE Full Text
↵

Siebert M, Söding J. 2016. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. Nucleic Acids Res 44: 6055–6069. doi:10.1093/nar/gkw521

CrossRef Medline Google Scholar
↵

Smith AM, Jain M, Mulroney L, Garalde DR, Akeson M. 2019. Reading canonical and modified nucleobases in 16S ribosomal RNA using nanopore native RNA sequencing. PLoS One 14: e0216709. doi:10.1371/journal.pone.0216709

CrossRef Medline Google Scholar
↵

Smith MA, Ersavas T, Ferguson JM, Liu H, Lucas MC, Begik O, Bojarski L, Barton K, Novoa EM. 2020. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res 30: 1345–1353. doi:10.1101/gr.260836.120

Abstract/FREE Full Text
↵

Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. 2016. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun 7: 11307. doi:10.1038/ncomms11307

CrossRef Medline Google Scholar
↵

Stephenson W, Razaghi R, Busan S, Weeks KM, Timp W, Smibert P. 2022. Direct detection of RNA modifications and structure using single-molecule nanopore sequencing. Cell Genom 2: 100097. doi:10.1016/j.xgen.2022.100097

CrossRef Google Scholar
↵

Stoiber M, Quick J, Egan R, Lee JE, Celniker S, Neely RK, Loman N, Pennacchio LA, Brown J. 2017. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv doi:10.1101/094672

Abstract/FREE Full Text
↵

Tavakoli S, Nabizadeh M, Makhamreh A, Gamper H, McCormick CA, Rezapour NK, Hou Y-M, Wanunu M, Rouhanifard SH. 2023. Semi-quantitative detection of pseudouridine modifications and type I/II hypermodifications in human mRNAs using direct long-read sequencing. Nat Commun 14: 334. doi:10.1038/s41467-023-35858-w

CrossRef Medline Google Scholar
↵

Teng H, Cao MD, Hall MB, Duarte T, Wang S, Coin LJM. 2018. Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. GigaScience 7: giy037. doi:10.1093/gigascience/giy037

CrossRef Google Scholar
↵

Wan YK, Hendra C, Pratanwanich PN, Göke J. 2022. Beyond sequencing: machine learning algorithms extract biology hidden in Nanopore signal data. Trends Genet 38: 246–257. doi:10.1016/j.tig.2021.09.001

CrossRef Medline Google Scholar
↵

Wickham H. 2016. ggplot2: elegant graphics for data analysis. Springer-Verlag, New York. https://ggplot2.tidyverse.org.

Google Scholar
↵

Workman RE, Tang AD, Tang PS, Jain M, Tyson JR, Razaghi R, Zuzarte PC, Gilpatrick T, Payne A, Quick J, et al. 2019. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat Methods 16: 1297–1305. doi:10.1038/s41592-019-0617-2

CrossRef Medline Google Scholar

[1] ↵

Abebe JS, Price AM, Hayer KE, Mohr I, Weitzman MD, Wilson AC, Depledge DP. 2022a. DRUMMER-rapid detection of RNA modifications through comparative nanopore sequencing. Bioinformatics 38: 3113–3115. doi:10.1093/bioinformatics/btac274

CrossRef Medline Google Scholar

[2] ↵

Abebe JS, Verstraten R, Depledge DP. 2022b. Nanopore-based detection of viral RNA modifications. mBio 13: e0370221. doi:10.1128/mbio.03702-21

CrossRef Medline Google Scholar

[3] ↵

Acera Mateos P, Sethi AJ, Ravindran A, Srivastava A, Woodward K, Mahmud S, Kanchi M, Guarnacci M, Xu J, Yuen ZWS, et al. 2024. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun 15: 3899. doi:10.1038/s41467-024-47953-7

CrossRef Medline Google Scholar

[4] ↵

Begik O, Lucas MC, Pryszcz LP, Ramirez JM, Medina R, Milenkovic I, Cruciani S, Liu H, Vieira HGS, Sas-Chen A, et al. 2021. Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing. Nat Biotechnol 39:1278–1291. doi:10.1038/s41587-021-00915-6

CrossRef Medline Google Scholar

[5] ↵

Begik O, Mattick JS, Novoa EM. 2022. Exploring the epitranscriptome by native RNA sequencing. RNA 28: 1430–1439. doi:10.1261/rna.079404.122

Abstract/FREE Full Text

[6] ↵

Boža V, Brejová B, Vinař T. 2017. DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12: e0178751. doi:10.1371/journal.pone.0178751

CrossRef Medline Google Scholar

[7] ↵

Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. 2014. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature 515: 143–146. doi:10.1038/nature13802

CrossRef Medline Google Scholar

[8] ↵

Cozzuto L, Delgado-Tejedor A, Hermoso Pulido T, Novoa EM, Ponomarenko J. 2023. Nanopore direct RNA sequencing data processing and analysis using MasterOfPores. Methods Mol Biol 2624: 185–205. doi:10.1007/978-1-0716-2962-8_13

CrossRef Medline Google Scholar

[9] ↵

Cruciani S, Delgado-Tejedor A, Pryszcz LP, Medina R, Llovera L, Novoa EM. 2023. De novo basecalling of m6A modifications at single molecule and single nucleotide resolution. bioRxiv doi:10.1101/2023.11.13.566801

Abstract/FREE Full Text

[10] ↵

Cui X, Wei Z, Zhang L, Liu H, Sun L, Zhang S-W, Huang Y, Meng J. 2016. Guitar: an R/Bioconductor package for gene annotation guided transcriptomic analysis of RNA-related genomic features. Biomed Res Int 2016: 8367534. doi:10.1155/2016/8367534

CrossRef Google Scholar

[11] ↵

Delgado-Tejedor A, Medina R, Begik O, Cozzuto L, Ponomarenko J, Novoa EM. 2023. Native RNA nanopore sequencing reveals antibiotic-induced loss of rRNA modifications in the A- and P-sites. bioRxiv doi:10.1101/2023.03.21.533606

Abstract/FREE Full Text

[12] ↵

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. 2017. Nextflow enables reproducible computational workflows. Nat Biotechnol 35: 316–319. doi:10.1038/nbt.3820

CrossRef Medline Google Scholar

[13] ↵

Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, Cesarkas K, Jacob-Hirsch J, Amariglio N, Kupiec M, et al. 2012. Topology of the human and mouse m⁶A RNA methylomes revealed by m⁶A-seq. Nature 485: 201–206. doi:10.1038/nature11112

CrossRef Medline Google Scholar

[14] ↵

Frye M, Jaffrey SR, Pan T, Rechavi G, Suzuki T. 2016. RNA modifications: What have we learned and where are we headed? Nat Rev Genet 17: 365–372. doi:10.1038/nrg.2016.47

CrossRef Medline Google Scholar

[15] ↵

Furlan M, Delgado-Tejedor A, Mulroney L, Pelizzola M, Novoa EM, Leonardi T. 2021. Computational methods for RNA modification detection from nanopore direct RNA sequencing data. RNA Biol 18: 31–40. doi:10.1080/15476286.2021.1978215

CrossRef Medline Google Scholar

[16] ↵

Gao Y, Liu X, Wu B, Wang H, Xi F, Kohnen MV, Reddy ASN, Gu L. 2021. Quantitative profiling of N⁶-methyladenosine at single-base resolution in stem-differentiating xylem of Populus trichocarpa using Nanopore direct RNA sequencing. Genome Biol 22: 22. doi:10.1186/s13059-020-02241-7

CrossRef Medline Google Scholar

[17] ↵

Garalde DR, Snell EA, Jachimowicz D, Sipos B, Lloyd JH, Bruce M, Pantic N, Admassu T, James P, Warland A, et al. 2018. Highly parallel direct RNA sequencing on an array of nanopores. Nat Methods 15: 201–206. doi:10.1038/nmeth.4577

CrossRef Medline Google Scholar

[18] ↵

Gilbert WV, Nachtergaele S. 2023. mRNA regulation by RNA modifications. Annu Rev Biochem 92: 175–198. doi:10.1146/annurev-biochem-052521-035949

CrossRef Medline Google Scholar

[19] ↵

Gunter HM, Idrisoglu S, Singh S, Han DJ, Ariens E, Peters JR, Wong T, Cheetham SW, Xu J, Rai SK, et al. 2023. mRNA vaccine quality analysis using RNA sequencing. Nat Commun 14: 5663. doi:10.1038/s41467-023-41354-y

CrossRef Medline Google Scholar

[20] ↵

Helm M, Motorin Y. 2017. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat Rev Genet 18: 275–291. doi:10.1038/nrg.2016.169

CrossRef Medline Google Scholar

[21] ↵

Hendra C, Pratanwanich PN, Wan YK, Goh WSS, Thiery A, Göke J. 2022. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat Methods 19: 1590–1598. doi:10.1038/s41592-022-01666-1

CrossRef Google Scholar

[22] ↵

Huang S, Zhang W, Katanski CD, Dersh D, Dai Q, Lolans K, Yewdell J, Eren AM, Pan T. 2021. Interferon inducible pseudouridine modification in human mRNA by quantitative nanopore profiling. Genome Biol 22: 330. doi:10.1186/s13059-021-02557-y

CrossRef Medline Google Scholar

[23] ↵

Jain M, Abu-Shumays R, Olsen HE, Akeson M. 2022. Advances in nanopore direct RNA sequencing. Nat Methods 19: 1160–1164. doi:10.1038/s41592-022-01633-w

CrossRef Medline Google Scholar

[24] ↵

Jenjaroenpun P, Wongsurawat T, Wadley TD, Wassenaar TM, Liu J, Dai Q, Wanchai V, Akel NS, Jamshidi-Parsian A, Franco AT, et al. 2021. Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res 49: e7. doi:10.1093/nar/gkaa620

CrossRef Medline Google Scholar

[25] ↵

Kim D, Lee J-Y, Yang J-S, Kim JW, Kim VN, Chang H. 2020. The architecture of SARS-CoV-2 transcriptome. Cell 181: 914–921.e10. doi:10.1016/j.cell.2020.04.011

CrossRef Medline Google Scholar

[26] ↵

Leger A, Amaral PP, Pandolfini L, Capitanchik C, Capraro F, Miano V, Migliori V, Toolan-Kerr P, Sideri T, Enright AJ, et al. 2021. RNA modifications detection by comparative Nanopore direct RNA sequencing. Nat Commun 12: 7198. doi:10.1038/s41467-021-27393-3

CrossRef Medline Google Scholar

[27] ↵

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079. doi:10.1093/bioinformatics/btp352

CrossRef Medline Google Scholar

[28] ↵

Liang Z, Ye H, Ma J, Wei Z, Wang Y, Zhang Y, Huang D, Song B, Meng J, Rigden DJ, et al. 2023. m⁶A-Atlas v2.0: updated resources for unraveling the N⁶-methyladenosine (m⁶A) epitranscriptome among multiple species. Nucleic Acids Res 52: D194–D202. doi:10.1093/nar/gkad69

CrossRef Google Scholar

[29] ↵

Linder B, Grozhik AV, Olarerin-George AO, Meydan C, Mason CE, Jaffrey SR. 2015. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat Methods 12: 767–772. doi:10.1038/nmeth.3453

CrossRef Medline Google Scholar

[30] ↵

Liu H, Begik O, Lucas MC, Ramirez JM, Mason CE, Wiener D, Schwartz S, Mattick JS, Smith MA, Novoa EM. 2019. Accurate detection of m⁶A RNA modifications in native RNA sequences. Nat Commun 10: 4079. doi:10.1038/s41467-019-11713-9

CrossRef Medline Google Scholar

[31] ↵

Liu J, Li K, Cai J, Zhang M, Zhang X, Xiong X, Meng H, Xu X, Huang Z, Peng J, et al. 2020. Landscape and regulation of m⁶A and m⁶Am methylome across human and mouse tissues. Mol Cell 77: 426–440.e6. doi:10.1016/j.molcel.2019.09.032

CrossRef Medline Google Scholar

[32] ↵

Liu H, Begik O, Novoa EM. 2021. EpiNano: detection of m⁶A RNA modifications using Oxford nanopore direct RNA sequencing. Methods Mol Biol 2298: 31–52. doi:10.1007/978-1-0716-1374-0_3

CrossRef Medline Google Scholar

[33] ↵

Liu C, Sun H, Yi Y, Shen W, Li K, Xiao Y, Li F, Li Y, Hou Y, Lu B, et al. 2023. Absolute quantification of single-base m⁶A methylation in the mammalian transcriptome using GLORI. Nat Biotechnol 41: 355–366. doi:10.1038/s41587-022-01487-9

CrossRef Google Scholar

[34] ↵

Liu-Wei W, van der Toorn W, Bohn P, Hölzer M, Smyth RP, von Kleist M. 2024. Sequencing accuracy and systematic errors of nanopore direct RNA sequencing. BMC Genomics 25: 528. doi:10.1186/s12864-024-10440-w

CrossRef Medline Google Scholar

[35] ↵

Lorenz DA, Sathe S, Einstein JM, Yeo GW. 2020. Direct RNA sequencing enables m⁶A detection in endogenous transcript isoforms at base-specific resolution. RNA 26: 19–28. doi:10.1261/rna.072785.119

Abstract/FREE Full Text

[36] ↵

Lucas MC, Novoa EM. 2023. Long-read sequencing in the era of epigenomics and epitranscriptomics. Nat Methods 20: 25–29. doi:10.1038/s41592-022-01724-8

CrossRef Medline Google Scholar

[37] ↵

Lucas MC, Pryszcz LP, Medina R, Milenkovic I, Camacho N, Marchand V, Motorin Y, Ribas de Pouplana L, Novoa EM. 2024. Quantitative analysis of tRNA abundance and modifications by nanopore RNA sequencing. Nat Biotechnol 42: 72–86. doi:10.1038/s41587-023-01743-6

CrossRef Google Scholar

[38] ↵

Marchand V, Blanloeil-Oillo F, Helm M, Motorin Y. 2016. Illumina-based RiboMethSeq approach for mapping of 2′-O-Me residues in RNA. Nucleic Acids Res 44: e135. doi:10.1093/nar/gkw547

CrossRef Medline Google Scholar

[39] ↵

Marchand V, Ayadi L, Ernst FGM, Hertler J, Bourguignon-Igel V, Galvanin A, Kotter A, Helm M, Lafontaine DLJ, Motorin Y. 2018. AlkAniline-Seq: profiling of m⁷G and m³C RNA modifications at single nucleotide resolution. Angew Chem Int Ed 57: 16785–16790. doi:10.1002/anie.201810946

CrossRef Google Scholar

[40] ↵

Meyer KD, Saletore Y, Zumbo P, Elemento O, Mason CE, Jaffrey SR. 2012. Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell 149: 1635–1646. doi:10.1016/j.cell.2012.05.003

CrossRef Medline Google Scholar

[41] ↵

Nance KD, Meier JL. 2021. Modifications in an emergency: the role of N1-methylpseudouridine in COVID-19 vaccines. ACS Cent Sci 7: 748–756. doi:10.1021/acscentsci.1c00197

CrossRef Medline Google Scholar

[42] ↵

Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, et al. 2012. BEDOPS: high-performance genomic feature operations. Bioinformatics 28: 1919–1920. doi:10.1093/bioinformatics/bts277

CrossRef Medline Google Scholar

[43] ↵

Neumann D, Reddy ASN, Ben-Hur A. 2022. RODAN: a fully convolutional architecture for basecalling nanopore RNA sequencing data. BMC Bioinformatics 23: 142. doi:10.1186/s12859-022-04686-y

CrossRef Medline Google Scholar

[44] ↵

Nguyen TA, Heng JWJ, Kaewsapsak P, Kok EPL, Stanojević D, Liu H, Cardilla A, Praditya A, Yi Z, Lin M, et al. 2022. Direct identification of A-to-I editing sites with nanopore native RNA sequencing. Nat Methods 19: 833–844. doi:10.1038/s41592-022-01513-3

CrossRef Medline Google Scholar

[45] ↵

Noakes MT, Brinkerhoff H, Laszlo AH, Derrington IM, Langford KW, Mount JW, Bowman JL, Baker KS, Doering KM, Tickman BI, et al. 2019. Increasing the accuracy of nanopore DNA sequencing using a time-varying cross membrane voltage. Nat Biotechnol 37: 651–656. doi:10.1038/s41587-019-0096-0

CrossRef Medline Google Scholar

[46] ↵

Novoa EM, Mason CE, Mattick JS. 2017. Charting the unknown epitranscriptome. Nat Rev Mol Cell Biol 18: 339–340. doi:10.1038/nrm.2017.49

CrossRef Medline Google Scholar

[47] ↵

Parker MT, Knop K, Sherwood AV, Schurch NJ, Mackinnon K, Gould PD, Hall AJ, Barton GJ, Simpson GG. 2020. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification. eLife 9: e49658. doi:10.7554/eLife.49658

CrossRef Google Scholar

[48] ↵

Piechotta M, Naarmann-de Vries IS, Wang Q, Altmüller J, Dieterich C. 2022. RNA modification mapping with JACUSA2. Genome Biol 23: 115. doi:10.1186/s13059-022-02676-0

CrossRef Medline Google Scholar

[49] ↵

Pratanwanich PN, Yao F, Chen Y, Koh CWQ, Wan YK, Hendra C, Poon P, Goh YT, Yap PML, Chooi JY, et al. 2021. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. Nat Biotechnol 39: 1394–1402. doi:10.1038/s41587-021-00949-w

CrossRef Google Scholar

[50] ↵

R Core Team. 2023. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/.

Google Scholar

[51] ↵

Reggiardo RE, Maroli SV, Peddu V, Davidson AE, Hill A, LaMontagne E, Al Aaraj Y, Jain M, Chan SY, Kim DH. 2023. Profiling of repetitive RNA sequences in the blood plasma of patients with cancer. Nat Biomed Eng 7: 1627–1635. doi:10.1038/s41551-023-01081-7

CrossRef Google Scholar

[52] ↵

Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. 2011. Integrative genomics viewer. Nat Biotechnol 29: 24–26. doi:10.1038/nbt.1754

CrossRef Medline Google Scholar

[53] ↵

Roszkowski M, Mansuy IM. 2021. High efficiency RNA extraction from sperm cells using guanidinium thiocyanate supplemented with tris(2-carboxyethyl)phosphine. Front Cell Dev Biol 9: 648274. doi:10.3389/fcell.2021.648274

CrossRef Medline Google Scholar

[54] ↵

Roundtree IA, Evans ME, Pan T, He C. 2017. Dynamic RNA modifications in gene expression regulation. Cell 169: 1187–1200. doi:10.1016/j.cell.2017.05.045

CrossRef Medline Google Scholar

[55] ↵

Saletore Y, Meyer K, Korlach J, Vilfan ID, Jaffrey S, Mason CE. 2012. The birth of the epitranscriptome: deciphering the function of RNA modifications. Genome Biol 13: 175. doi:10.1186/gb-2012-13-10-175

CrossRef Medline Google Scholar

[56] ↵

Sas-Chen A, Thomas JM, Matzov D, Taoka M, Nance KD, Nir R, Bryson KM, Shachar R, Liman GLS, Burkhart BW, et al. 2020. Dynamic RNA acetylation revealed by quantitative cross-evolutionary mapping. Nature 583: 638–643. doi:10.1038/s41586-020-2418-2

CrossRef Medline Google Scholar

[57] ↵

Schwartz S, Bernstein DA, Mumbach MR, Jovanovic M, Herbst RH, León-Ricardo BX, Engreitz JM, Guttman M, Satija R, Lander ES, et al. 2014. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell 159: 148–162. doi:10.1016/j.cell.2014.08.028

CrossRef Medline Google Scholar

[58] ↵

Shaw A, Craig JM, Amiri H, Kim J, Upton HE, Pimentel SC, Huang JR, Marqusee S, Gundlach JH, Collins K, et al. 2023. Secondary structure detection through direct nanopore RNA sequencing. bioRxiv doi:10.1101/2023.04.05.535757

Abstract/FREE Full Text

[59] ↵

Siebert M, Söding J. 2016. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. Nucleic Acids Res 44: 6055–6069. doi:10.1093/nar/gkw521

CrossRef Medline Google Scholar

[60] ↵

Smith AM, Jain M, Mulroney L, Garalde DR, Akeson M. 2019. Reading canonical and modified nucleobases in 16S ribosomal RNA using nanopore native RNA sequencing. PLoS One 14: e0216709. doi:10.1371/journal.pone.0216709

CrossRef Medline Google Scholar

[61] ↵

Smith MA, Ersavas T, Ferguson JM, Liu H, Lucas MC, Begik O, Bojarski L, Barton K, Novoa EM. 2020. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res 30: 1345–1353. doi:10.1101/gr.260836.120

Abstract/FREE Full Text

[62] ↵

Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. 2016. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun 7: 11307. doi:10.1038/ncomms11307

CrossRef Medline Google Scholar

[63] ↵

Stephenson W, Razaghi R, Busan S, Weeks KM, Timp W, Smibert P. 2022. Direct detection of RNA modifications and structure using single-molecule nanopore sequencing. Cell Genom 2: 100097. doi:10.1016/j.xgen.2022.100097

CrossRef Google Scholar

[64] ↵

Stoiber M, Quick J, Egan R, Lee JE, Celniker S, Neely RK, Loman N, Pennacchio LA, Brown J. 2017. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv doi:10.1101/094672

Abstract/FREE Full Text

[65] ↵

Tavakoli S, Nabizadeh M, Makhamreh A, Gamper H, McCormick CA, Rezapour NK, Hou Y-M, Wanunu M, Rouhanifard SH. 2023. Semi-quantitative detection of pseudouridine modifications and type I/II hypermodifications in human mRNAs using direct long-read sequencing. Nat Commun 14: 334. doi:10.1038/s41467-023-35858-w

CrossRef Medline Google Scholar

[66] ↵

Teng H, Cao MD, Hall MB, Duarte T, Wang S, Coin LJM. 2018. Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. GigaScience 7: giy037. doi:10.1093/gigascience/giy037

CrossRef Google Scholar

[67] ↵

Wan YK, Hendra C, Pratanwanich PN, Göke J. 2022. Beyond sequencing: machine learning algorithms extract biology hidden in Nanopore signal data. Trends Genet 38: 246–257. doi:10.1016/j.tig.2021.09.001

CrossRef Medline Google Scholar

[68] ↵

Wickham H. 2016. ggplot2: elegant graphics for data analysis. Springer-Verlag, New York. https://ggplot2.tidyverse.org.

Google Scholar

[69] ↵

Workman RE, Tang AD, Tang PS, Jain M, Tyson JR, Razaghi R, Zuzarte PC, Gilpatrick T, Payne A, Quick J, et al. 2019. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat Methods 16: 1297–1305. doi:10.1038/s41592-019-0617-2

CrossRef Medline Google Scholar

Enhanced detection of RNA modifications and read mapping with high-accuracy nanopore RNA basecalling models

Abstract

Results

RNA basecalling model choice affects both per-read accuracy and RNA modification “errors”

IVT model outperforms other models in detecting m6A modifications in DRACH contexts

Improved in vivo detection of m6A-modified sites in human samples using the IVT model

A high-accuracy IVT_SUP model does not improve RNA modification detection

SUP model demonstrates 97% read accuracy and up to a 50% increase in mapped reads

Enhanced U > C mismatch signature and reduced background error in a vaccine model using the SUP model

Discussion

Methods

In vitro transcription

DRS of in vitro transcribed RNA

Training a high-accuracy RNA basecallers (SUP model)

Training an RNA basecaller using in vitro sequences (IVT model)

Demultiplexing DRS

Basecalling and mapping of DRS data sets

Comparative analysis of system requirements

Comparative analysis of basecalled features in curlcakes

Detection of m6A-modified sites in HEK293T using comparative error analysis

Metagene analysis and motif identification

Comparison of identified m6A sites against m6ACE-seq, GLORI-seq, and miCLIP-seq

Processing of eGFP mRNA vaccine data and comparative analysis

Software availability

Data access

Competing interest statement

Acknowledgments

Footnotes

References

This Article

Article Category

Services

Citing Articles

Google Scholar

PubMed/NCBI

ORCID

Share

Preprint Server

Navigate This Article

Current Issue

In This Issue

IVT model outperforms other models in detecting m⁶A modifications in DRACH contexts

Improved in vivo detection of m⁶A-modified sites in human samples using the IVT model

Detection of m⁶A-modified sites in HEK293T using comparative error analysis

Comparison of identified m⁶A sites against m⁶ACE-seq, GLORI-seq, and miCLIP-seq