RT Journal
A1 Li, Sizhen
A1 Moayedpour, Saeed
A1 Li, Ruijiang
A1 Bailey, Michael
A1 Riahi, Saleh
A1 Kogler-Anele, Lorenzo
A1 Miladi, Milad
A1 Miner, Jacob
A1 Pertuy, Fabien
A1 Zheng, Dinghai
A1 Wang, Jun
A1 Balsubramani, Akshay
A1 Tran, Khang
A1 Zacharia, Minnie
A1 Wu, Monica
A1 Gu, Xiaobo
A1 Clinton, Ryan
A1 Asquith, Carla
A1 Skaleski, Joseph
A1 Boeglin, Lianne
A1 Chivukula, Sudha
A1 Dias, Anusha
A1 Strugnell, Tod
A1 Montoya, Fernando Ulloa
A1 Agarwal, Vikram
A1 Bar-Joseph, Ziv
A1 Jager, Sven
T1 CodonBERT large language model for mRNA vaccines
JF Genome Research
JO Genome Research
YR 2024
FD July 01
VO 34
IS 7
SP 1027
OP 1035
DO 10.1101/gr.278870.123
UL http://genome.cshlp.org/content/34/7/1027.abstract
AB mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.