Table 1.

Data sets used for method evaluation, with their mRNA source and target property

Data set	Target	Category	No. of mRNAs	Sequence length (nt)
MLOS flu vaccines (Sanofi-Aventis)	Expression	Regression	543	1698–1704
mRFP expression (Nieuwkoop et al. 2023)	Expression	Regression	1459	678–678
Fungal expression (Wint et al. 2022)	Expression	Regression	7056	150–3000
E. coli proteins (Ding et al. 2022)	Expression	Classification	6348	171–3000
Tc-riboswitches (Groher et al. 2019)	Switching factor	Regression	355	67–73
mRNA stability (Diez et al. 2022)	Stability	Regression	41123	30–1497
SARS-CoV-2 vaccine degradation (Wayment-Steele et al. 2022)	Degradation	Regression	2400	81–81

Each data set was split into training, validation, and test sets at a ratio of 0.70:0.15:0.15. All methods were optimized on the same data split.
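A minimal sketch of one way to produce such a 0.70:0.15:0.15 partition is shown here. It is an illustration only: the function name `split_dataset`, the fixed seed, and the use of plain random shuffling are assumptions, since the source does not state how sequences were assigned to each subset. Fixing the seed is one simple way to ensure every method sees the same split.

```python
import random


def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle items and partition them into train/validation/test subsets.

    Hypothetical helper: random shuffling with a fixed seed is an assumption,
    not the procedure documented in the source.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    # round() avoids floating-point truncation surprises (e.g. 0.7 * n);
    # the test set takes whatever remains so that no item is dropped.
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Example with 2400 placeholder sequence IDs (the size of the
    # SARS-CoV-2 vaccine degradation data set in Table 1).
    train, val, test = split_dataset(range(2400))
    print(len(train), len(val), len(test))
```

Calling the function again with the same seed returns the same three subsets, so each method can be trained, tuned, and evaluated on identical data.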

CodonBERT large language model for mRNA vaccines