Biosurfer for systematic tracking of regulatory mechanisms leading to protein isoform diversity

  1. Gloria M. Sheynkman (affiliations 2, 9, 10, 11)
  1. Broad Institute of MIT and Harvard University, Cambridge, Massachusetts 02142, USA;
  2. Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, Virginia 22903, USA;
  3. Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, USA;
  4. Computer Science Department, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, USA;
  5. Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA;
  6. Department of Biology, Boston University, Boston, Massachusetts 02215, USA;
  7. Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA;
  8. Division of General Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA;
  9. Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22903, USA;
  10. Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia 22903, USA;
  11. UVA Cancer Center, University of Virginia, Charlottesville, Virginia 22903, USA
  • Corresponding author: gs9yr{at}virginia.edu
  • Abstract

    Long-read RNA-seq has shed light on transcriptomic complexity, but questions remain about the functionality of downstream protein products. We introduce Biosurfer, a computational approach for comparing protein isoforms while systematically tracking the transcriptional, splicing, and translational variations that underlie differences in the sequences of the protein products. Using Biosurfer, we analyzed the differences in 35,082 pairs of GENCODE-annotated protein isoforms, finding that a majority (70%) of variable N-termini are due to alternative transcription start sites, whereas only 9% arise from 5′ UTR alternative splicing (AS). Biosurfer's detailed tracking of nucleotide-to-residue relationships helps reveal a rarely tracked source of single amino acid residue changes: codons that are split at junctions. For 17% of internal sequence changes, such split-codon patterns lead to single-residue differences, termed “ragged codons.” Of variable C-termini, 72% involve splice- or intron retention-induced reading frameshifts. We systematically characterize an unusual pattern of reading frame changes, in which a first frameshift is closely followed by a distinct second frameshift that restores the original frame, which we term a “snapback” frameshift. We analyze the long-read RNA-seq-predicted proteome of a human cell line and find trends similar to those in our GENCODE analysis, except for a higher proportion of transcripts predicted to undergo nonsense-mediated decay. Biosurfer's comprehensive characterization of long-read RNA-seq data sets should accelerate insights into the functional roles of protein isoforms, providing a mechanistic explanation of the origins of the proteomic diversity driven by AS. Biosurfer is available as a Python package.
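
    The two sequence-level ideas named in the abstract, codons split at junctions and a frameshift that is later restored, can be illustrated with simple modulo-3 bookkeeping. The following is a minimal sketch under stated assumptions, not Biosurfer's implementation or API: the function names (`junction_phases`, `frame_track`, `has_snapback`) and the input representation (lists of coding lengths and net nucleotide differences) are hypothetical, and the "closely followed" distance criterion for snapback frameshifts is not modeled.

```python
from typing import List


def junction_phases(exon_cds_lengths: List[int]) -> List[int]:
    """Phase (0, 1, or 2) at each internal exon-exon junction.

    A nonzero phase means the codon at that junction is split:
    `phase` nucleotides lie upstream and `3 - phase` lie downstream.
    Hypothetical helper for illustration only.
    """
    phases = []
    total = 0
    # The last exon has no downstream junction, so it is skipped.
    for length in exon_cds_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


def frame_track(deltas: List[int]) -> List[int]:
    """Cumulative reading-frame offset after each alternative event.

    `deltas` holds the net nucleotide length difference (alternative
    minus reference isoform) of each event, in 5' to 3' order.
    """
    offsets = []
    total = 0
    for delta in deltas:
        total += delta
        offsets.append(total % 3)  # Python's % is non-negative here
    return offsets


def has_snapback(deltas: List[int]) -> bool:
    """True if the frame is shifted and a later event restores it."""
    shifted = False
    for offset in frame_track(deltas):
        if offset != 0:
            shifted = True
        elif shifted:
            return True
    return False


if __name__ == "__main__":
    # Three coding exons of 10, 5, and 9 nt: the first junction splits a codon.
    print(junction_phases([10, 5, 9]))
    # A 1-nt gain followed by a 1-nt loss shifts, then restores, the frame.
    print(frame_track([1, -1]), has_snapback([1, -1]))
```

    In this sketch, an event whose length difference is a multiple of three leaves the frame unchanged, so it never counts as a frameshift on its own.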

    Footnotes

    • Received March 15, 2024.
    • Accepted January 6, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
