CodonBERT large language model for mRNA vaccines

Table 1.

Data sets used for method evaluation, with the mRNA source and the target property of each.

| Data set | Target | Category | No. of mRNAs | Sequence length (nt) |
|---|---|---|---|---|
| MLOS flu vaccines (Sanofi-Aventis) | Expression | Regression | 543 | 1698–1704 |
| mRFP expression (Nieuwkoop et al. 2023) | Expression | Regression | 1459 | 678–678 |
| Fungal expression (Wint et al. 2022) | Expression | Regression | 7056 | 150–3000 |
| E. coli proteins (Ding et al. 2022) | Expression | Classification | 6348 | 171–3000 |
| Tc-riboswitches (Groher et al. 2019) | Switching factor | Regression | 355 | 67–73 |
| mRNA stability (Diez et al. 2022) | Stability | Regression | 41,123 | 30–1497 |
| SARS-CoV-2 vaccine degradation (Wayment-Steele et al. 2022) | Degradation | Regression | 2400 | 81–81 |
  • Each data set was split into training, validation, and test sets in a 0.70:0.15:0.15 ratio. All methods were optimized on the same data split.
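
A 0.7/0.15/0.15 split that stays identical across methods can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the use of a plain random shuffle, and the fixed seed are all assumptions, since the table footnote does not say how the partition was drawn.

```python
import random


def split_dataset(records, ratios=(0.7, 0.15, 0.15), seed=0):
    """Hypothetical helper: shuffle once with a fixed seed, then cut the
    records into training, validation, and test partitions."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible and
    # independent of any other use of the global random state.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test partition takes the remainder, so no record is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test


# Example with as many records as the MLOS flu vaccine data set (543 mRNAs).
train, val, test = split_dataset(range(543))
print(len(train), len(val), len(test))
```

Reusing the same seed for every model is one simple way to ensure that all methods are tuned and compared on the same partition.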

This Article

  1. Genome Res. 34: 1027-1035
