TY  - JOUR
A1  - Li, Sizhen
A1  - Moayedpour, Saeed
A1  - Li, Ruijiang
A1  - Bailey, Michael
A1  - Riahi, Saleh
A1  - Kogler-Anele, Lorenzo
A1  - Miladi, Milad
A1  - Miner, Jacob
A1  - Pertuy, Fabien
A1  - Zheng, Dinghai
A1  - Wang, Jun
A1  - Balsubramani, Akshay
A1  - Tran, Khang
A1  - Zacharia, Minnie
A1  - Wu, Monica
A1  - Gu, Xiaobo
A1  - Clinton, Ryan
A1  - Asquith, Carla
A1  - Skaleski, Joseph
A1  - Boeglin, Lianne
A1  - Chivukula, Sudha
A1  - Dias, Anusha
A1  - Strugnell, Tod
A1  - Montoya, Fernando Ulloa
A1  - Agarwal, Vikram
A1  - Bar-Joseph, Ziv
A1  - Jager, Sven
T1  - CodonBERT large language model for mRNA vaccines
Y1  - 2024/07/01
JF  - Genome Research
JO  - Genome Research
SP  - 1027
EP  - 1035
DO  - 10.1101/gr.278870.123
VL  - 34
IS  - 7
UR  - http://genome.cshlp.org/content/34/7/1027.abstract
N2  - mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
ER  - 