RT Journal
A1 Li, Sizhen
A1 Moayedpour, Saeed
A1 Li, Ruijiang
A1 Bailey, Michael
A1 Riahi, Saleh
A1 Kogler-Anele, Lorenzo
A1 Miladi, Milad
A1 Miner, Jacob
A1 Pertuy, Fabien
A1 Zheng, Dinghai
A1 Wang, Jun
A1 Balsubramani, Akshay
A1 Tran, Khang
A1 Zacharia, Minnie
A1 Wu, Monica
A1 Gu, Xiaobo
A1 Clinton, Ryan
A1 Asquith, Carla
A1 Skaleski, Joseph
A1 Boeglin, Lianne
A1 Chivukula, Sudha
A1 Dias, Anusha
A1 Strugnell, Tod
A1 Montoya, Fernando Ulloa
A1 Agarwal, Vikram
A1 Bar-Joseph, Ziv
A1 Jager, Sven
T1 CodonBERT large language model for mRNA vaccines
JF Genome Research
JO Genome Research
YR 2024
FD July 01
VO 34
IS 7
SP 1027
OP 1035
DO 10.1101/gr.278870.123
UL http://genome.cshlp.org/content/34/7/1027.abstract
AB mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.