Transcriptome innovations in primates revealed by single-molecule long-read sequencing

Transcriptomic diversity greatly contributes to the fundamentals of disease, lineage-specific biology, and environmental adaptation. However, much of the actual isoform repertoire contributing to shaping primate evolution remains unknown. Here, we combined deep long- and short-read sequencing complemented with mass spectrometry proteomics in a panel of lymphoblastoid cell lines (LCLs) from human, three other great apes, and rhesus macaque, producing the largest full-length isoform catalog in primates to date. Around half of the captured isoforms are not annotated in their reference genomes, significantly expanding the gene models in primates. Furthermore, our comparative analyses unveil hundreds of transcriptomic innovations and isoform usage changes related to immune function and immunological disorders. The confluence of these evolutionary innovations with signals of positive selection and their limited impact in the proteome points to changes in alternative splicing in genes involved in immune response as an important target of recent regulatory divergence in primates.


Artifact filtering from Iso-Seq data
Isoform FASTA files were provided to SQANTI (Tardaguila et al. 2018) for quality control and artifact filtering. Here, the corresponding Ensembl annotation (V91) from each species was defined as the reference transcriptome. Furthermore, splice junction coverage by RNA-seq and isoform abundances (full-length reads from Iso-Seq and TPM calculated from RNA-seq) were used to train the SQANTI classifier.
Splice junction coverage in Iso-Seq isoforms across LCLs was obtained by mapping RNA-seq of the same LCLs used in Iso-Seq to the corresponding reference genome, as described in 'RNA-seq data production and processing'. Normalized quantification estimates (TPM) for Iso-Seq isoforms from each species were computed in the LCLs of that species using kallisto-quant (Bray et al. 2016) with --rfstranded argument (default settings), leveraging RNA-seq in the same LCLs for which we generated each long-read transcriptome. Here, the non-redundant set of Iso-Seq isoforms (after ToFU collapsing) obtained in each species was used as kallisto target transcriptome for RNA-seq pseudo-alignment (see 'Transcript expression calculation and reconstruction of transcript gains and losses' for the description of the quantification strategy in the comparative analyses).
SQANTI machine learning method (sqanti_filter.py, default parameters) was applied to the isoforms in each species to keep a set of highly reliable isoforms, discarding any potential sequencing artifacts. All subsequent analyses were performed using the set of isoforms passing SQANTI filtering in each species, and isoforms with a splice junction structure that is not annotated (SQANTI structural category was not FSM or ISM) in the corresponding reference transcriptome were defined as novel. Data manipulation was performed using SAMtools (Li et al. 2009), BEDTools (Quinlan and Hall 2010), and in-house scripts.

Classification of alternative splicing patterns
A classification of alternative splicing (AS) events was generated by SUPPA (version 2.3, parameters: -f ioe -e {SE,SS,MX,RI,FL}) (Alamancos et al. 2015) from the isoform GTF files resulting from SQANTI.
Five types of AS were identified: skipping exon (SE), alternative 5' or 3' splice sites (A5SS/A3SS), retained introns (RI), and mutually exclusive exons (MX). Splicing events were considered as known if both splice forms (inclusion and exclusion forms of the AS event) are found in Ensembl V91 transcriptomic annotations.

Chromatographic and mass spectrometric analysis
Samples were analyzed using an LTQ-Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific, San Jose, CA, USA) coupled to an EASY-nLC 1000 (Thermo Fisher Scientific (Proxeon), Odense, Denmark). Peptides were loaded directly onto the analytical column and were separated by reversed-phase chromatography using a 50-cm column with an inner diameter of 75 μm, packed with 2 μm C18 particles spectrometer (Thermo Fisher Scientific, San Jose, CA, USA).
Chromatographic gradients started at 95% buffer A and 5% buffer B with a flow rate of 300 nl/min for 5 minutes and gradually increased to 22% buffer B and 78% A in 79 min and then to 35% buffer B and 65% A in 11 min. After each analysis, the column was washed for 10 min with 10% buffer A and 90% buffer B. Buffer A: 0.1% formic acid in water. Buffer B: 0.1% formic acid in acetonitrile.
The mass spectrometer was operated in positive ionization mode with nano spray voltage set at 2.4 kV and source temperature at 275°C. Ultramark 1621 was used for external calibration of the FT mass analyzer prior to the analyses, and an internal calibration was performed using the background polysiloxane ion signal at m/z 445.12003. The dynamics exclusion duration was set at 60s, with a range in mass tolerance of ±10 ppm. Each analysis used the multinotch MS3-based TMT method (McAlister et al. 2014). The scan sequence began with an MS1 spectrum (Orbitrap analysis; resolution 120 000; mass range 375-1500 m/z; automatic gain control (AGC) target 4 × 10 5 , maximum injection time 50 ms). In each cycle of data-dependent acquisition analysis, the most intense ions above a threshold ion count of 5000 were selected for fragmentation after each survey scan. The number of selected precursor ions for fragmentation was determined by the "Top Speed" acquisition algorithm with a cycle time set at 3s. Fragment ion spectra were produced via collision-induced dissociation (CID) at a normalized collision energy of 35% and they were acquired in the ion trap mass analyzer in "Turbo" mode. AGC was set to 10 4 , and an isolation window of 0.7 m/z and a maximum injection time of 50 ms were used. Following the acquisition of each MS2 spectrum, MS3 spectra were collected, in which multiple MS2 fragment ions are captured in the MS3 precursor population using isolation waveforms with multiple frequency notches. MS3 precursors were fragmented by high energy collision-induced dissociation (HCD) at a normalized collision energy of 65% and acquired in the Orbitrap analyzer at 50000 resolution. AGC was set to 10 5 , and an isolation window of 2 m/z and a maximum injection time of 105 ms were used. All data were acquired with Xcalibur software v4.1.31.9.
Digested bovine serum albumin (New England Biolabs cat # P8108S) was analyzed between each sample to avoid sample carryover and to assure stability of the instrument, and QCloud (Chiva et al. 2018) has been used to control instrument longitudinal performance during the project.

Identification of novel peptides by mass spectrometry proteomics
Acquired spectra were analyzed using the Proteome Discoverer software suite (v2.3, Thermo Fisher Scientific) and the Mascot search engine (v2.6, Matrix Science (Perkins et al. 1999)). A UniProt-based database comprising all reference proteomes from all five primates was used to identify detectable genes and narrow down the search space. After removing transcripts classified as potential artifacts by SQANTI, Iso-Seq protein predictions for these genes in all species were incorporated into a customized multi-species database, together with the predicted proteins from intergenic and fusion loci. For this, NHP Iso-Seq isoforms were first lifted to the most recent assembly of the corresponding species (panTro6, gorGor6, ponAbe3, rheMac10) using UCSC liftOver with default parameters (Hinrichs et al. 2006), and protein predictions from Iso-Seq isoforms were performed by SQANTI using GeneMarkS-T algorithm (GMST) (Tang et al. 2014). All target databases included a list of common contaminants (Beer et al. 2017) and all the corresponding decoy entries. This strategy leverages the high-quality Iso-Seq isoforms detected in all species to build a comprehensive target database, which was used to identify peptides present across all samples, allowing peptide identifications from protein predictions derived from transcripts captured in a different species.
For peptide identification, a precursor ion mass tolerance of 7 ppm was used for MS1 level, trypsin was chosen as the enzyme, and up to three missed cleavages were allowed. The fragment ion mass tolerance was set to 0.5 Da for MS2 spectra. Oxidation of methionine and N-terminal protein acetylation were used as variable modifications, whereas carbamidomethylation on cysteines, TMT6plex in Lysines and TMT6plex in peptide N-terminal were set as a fixed modification. The false discovery rate (FDR) in peptide identification was set to a maximum of 5%, and peptides were quantified using the reporter ion intensities in MS3. According to manufacturer's specifications, reporter ion intensities were adjusted to correct for the isotopic impurities of the different TMT reagents. Reporter ion intensities from the samples were referenced to the common pool present in each of the three TMT mixes and they were used to estimate the peptide and protein fold-changes. The obtained fold-changes were then logtransformed and normalized between the channels by adjusting the mean log-fold-change to zero. We established further stringent criteria to filter for high-quality peptide identifications and ensure a good quantification signal: FDR<5%, Mascot IonScore > 20, reporter ion intensity signal (abundance) per sample > 50 in any sample of a given species, and species-wise median ratio of sample abundance to pool abundance > 0.6, after discarding contaminant matches and peptides arising from more than 1 tryptic miscleavage. As a quality control, we projected Iso-Seq transcripts from all species to each species' genome using UCSC liftOver (default parameters) (Hinrichs et al. 2006), and predicted the proteins for each species. We only kept the high-quality identified peptides in a given species that were tryptic in the projected proteome (see 'Detected peptides and plausible in genome' in Supplemental   Table S7).
Then, the resulting sets of peptides identified in each species were compared to the corresponding reference proteome from UniProt (based on hg38, panTro5, gorGor4, ponAbe2 and rheMac8) and RefSeq (CDS set from hg38, panTro6, gorGor6, ponAbe3 and rheMac10). To do so, we performed insilico tryptic digestion of UniProt and RefSeq proteomes (allowing a maximum of 1 miscleavage). Highquality identified peptides in a given species that are absent in the digestions of UniProt or RefSeq proteomes for the corresponding species were defined as novel.

Definition of orthology relationships
To define one-to-one orthologous genes in the five primate species, peptide sequences, gene annotations and genome assemblies were retrieved from Ensembl, using the same genomes as in Iso-Seq mapping (hg38, panTro5, gorGor4, ponAbe2, and rheMac8). Considering that one gene may contain multiple protein isoforms, the longest isoform with a complete ORF was selected as representative. Then, BLASTP (v2.2.26) (Altschul et al. 1990) was used for peptide sequence comparison between NHP and humans with an e-value limit of 1E-5 and SOLAR (v0.9.6) was used to combine local alignments (Almasy and Blangero 1998). Only reciprocal best hits (RBH) following synteny conservation were defined as one-to-one orthologous genes. One RBH gene pair (A1A2; 1 and 2 denote two different species) and its nearest RBH gene pair (B1B2) were considered syntenic if they met the following requirements: a) genes A1 and B1 are on same chromosome/scaffold; b) genes A2 and B2 are on same chromosome/scaffold; c) the number of genes located between A1 and B1 < 5; d) the number of genes between A2 and B2 < 5. We also retained RBH gene pairs if the corresponding scaffold has only one gene annotated. Then, we intersected these results with Ensembl Compara database (V91) (Yates et al. 2020), and kept the genes where one-to-one orthologues in the five species were consistent in both methods.
For isoforms produced by one-to-one orthologous genes, UCSC liftOver (Hinrichs et al. 2006) was used to identify the genome coordinates of orthologous transcripts using available whole-genome alignments for hg38, panTro6, gorGor6, ponAbe3 and rheMac10. Genomic coordinates of NHP isoforms (BED12) were mapped to the most recent assemblies (default parameters in UCSC liftOver), and then to hg38 (tolerating nucleotide differences with -minMatch=0.5 in UCSC liftOver). To reduce the mapping and quantification noise associated with very small differences in transcript sequences, the variability of internal exon ends and transcript extremes from each transcript model (derived from any species) was collapsed using TAMA merge in hg38 reference genome (parameters: -a 100 -m 100 -z 100 -e longest_ends) (Kuo et al. 2020). Then, these non-redundant transcript models were mapped to NHP (-minMatch=0.5 in UCSC liftOver).
For the comparative analyses (species-specific gains and losses, and DIU), we kept the transcript models effectively mapped to all five species. This strategy addresses the unbalanced Iso-Seq isoform repertoire across species (Supplemental Table S3) resulting from the lack of experimental saturation.
As explained below, the expression of orthologous transcripts/exons are calculated using RNA-seq data from each species (3 LCLs per species) mapped to the corresponding transcript/exons in that species.
Furthermore, for each species, we used SQANTI to characterize the orthologous transcript models in the context of the most recent genome assemblies and their reference annotations (Ensembl for hg38, RefSeq for panTro6, gorGor6, ponAbe3 and rheMac10).
We followed an analogous strategy without any transcript collapsing in hg38 to account for smaller splicing changes (e.g., alternative donor and acceptor sites) that were quantified by DEU analyses (TAMA merge parameters: -a 0 -m 0 -z 0).

Species-specific exon gains (genomic structure)
We restricted these analyses to the set of transcripts derived from one-to-one orthologous genes. NHP exons which failed to be mapped to the human reference genome (hg38) and human exons which failed to be mapped to all four NHP by UCSC liftOver (default settings) were chosen as species-specific exon candidates. Then, Liftoff (Shumate and Salzberg 2020) with default settings was used to perform a local alignment of exon candidates to the remaining four assemblies. Only exons with an alignment coverage (length of aligned exonic region / total exon length) lower than 50% against all the remaining four assemblies were selected as species-specific exonic structures. The species-specific exons and conserved exons were intersected with repetitive regions (UCSC RepeatMasker track files for hg38, panTro6, gorGor6, ponAbe3, and rheMac10) and with CDS/UTR transcript annotations provided by SQANTI.
To assess how these exons might impact protein function, we evaluated changes in the combination of protein domains in the encoded protein. To do so, the encoded proteins in transcripts with speciesspecific exons (SQANTI predictions) and all protein sequences from Ensembl (hg38, panTro5, gorGor4, ponAbe2, and rheMac8) were searched against Pfam database using InterPro (Blum et al. 2021) (default settings) to retrieve their protein domains. Protein domain structures (including number and specific order of domains) resulting from exonization events were compared to the set of protein domains found in the complete Ensembl proteomes.
Since some orthologous transcripts projected to a given species might not be supported for RNA-seq in their splicing structure (e.g., if there has been a change in splice sites between species), we evaluated the RNA-seq coverage in splice junctions (sample-wise) prior to kallisto quantification. For each LCL, transcripts that were unsupported by RNA-seq in their splice junctions were excluded from kallisto calculation and assigned 0 TPM (for subsequent analysis, we excluded mono-exonic transcripts). This strategy implies a very strict definition of orthologous transcripts, where small changes in splice sites across species are detected since they will affect the calculation of isoform abundances. Batch effects in transcript expression resulting from three different RNA-seq experimental batches were corrected using ComBat (default settings)   Wagner parsimony was used with a relative penalty of gain-to-loss equal to 1. We also retrieved the number of transcript gains and losses that can be explained by a unique gain or loss event in our phylogeny and compared them to the corresponding gains and losses inferred by Wagner parsimony reconstruction for the same branches by calculating the ratio between them (Supplemental Fig. S14).
To exclude intra-species variation and identify species differences in the detection of transcript gains and losses, we further asked for strict intra-species consistency in the presence/absence of RNA-seqbased transcript expression (short-read-based replicability). Hence, for each transcript we require that the 3 biological replicates (LCLs) from each species are coherent in the pattern of expression/absence of expression. Thus, species-specific transcripts (isoform innovations) are expressed in all LCLs from a given species (TPM>0) and absent (TPM=0) in all LCLs from the remaining species. These criteria increase the percentage of transcript gains and losses that are coherent with the phylogeny (i.e., being gained or lost just once in the phylogeny) in comparison to the total set of transcripts (Supplemental

Fig. S14).
To inspect the contribution of species-specific splicing events to species-specific transcripts, we retrieved the splice junctions conservation level (number of species in which they are supported) from the set of transcripts showing intra-species consistency in their expression (see above). Similar to transcript conservation analyses, we asked for intra-species consistency in the patterns of junction support by RNA-seq. In this way, species-specific junctions are supported in all LCLs from a given species (junction reads > 0), and unsupported (junction reads = 0) in all LCLs from the rest of species.
For these splice junctions (non-redundant set), RNA-seq support across GTEx v6 samples was retrieved using the snapcount package (Wilks et al. 2021) (hg38 exact coordinates query within tissue_specificity function). To be considered supported in a given human tissue, we require that the splice junction is used in at least 20% of the GTEx samples for the corresponding tissue.
The proportion of AS classes was compared between species-specific junctions and junctions used in all five primates (SUPPA definition, same parameters as in 'Classification of alternative splicing patterns'). Both inclusion and exclusion forms must exist in the transcriptome to define an AS event.
Splice junctions were associated with a given AS class if they matched either the inclusion or the exclusion form. We excluded RI from this comparison since our quantification strategy relies on RNAseq support in transcript splice junctions, without taking into account RNA-seq coverage in the boundary between an exon and a retained intron.

Differential gene expression analyses
To assess the effect of gene expression in the detection of species-specific transcripts, differential gene expression analyses were performed using DESeq2 (pairwise comparatives) (Love et al. 2014), including the 3 RNA-seq biological replicates per species. Genes with 10 or more RNA-seq reads accumulated across samples were retained. Species-specific up-regulated genes were defined as those showing significant overexpression (permissive padj<0.1 and log2FoldChange>0 were used to detect even subtle influences of gene expression changes in isoform detection) in a given species compared to all the rest, regardless of the possible gene expression differences among the remaining species. Then, genes up-regulated in each species were intersected with those expressing speciesspecific transcripts in the corresponding species.
We followed an analogous strategy to define species-specific down-regulated genes (padj<0.1, log2FoldChange<0). Species-specific up-and down-regulated genes were intersected with those displaying species-specific up-or down-regulated exon usage in the same species.

Differential isoform usage analyses
Kallisto transcript quantifications were used to evaluate interspecies isoform usage (IU) changes (Soneson et al. 2015). ComBat batch effect correction   To measure the most confident isoform usage changes, we first excluded genes in which the transcript models resulted from the collapsing of internal exon boundaries by TAMA merge, since this variability in internal exon ends can lead to artificial isoform usage differences considering our quantification strategy (see 'Transcript expression calculation and reconstruction of transcript gains and losses'). In addition, low expression genes/isoforms and genes with a single isoform were removed using IsoformSwitchAnalyzeR preFilter function (default settings: geneExpressionCutoff = 1 TPM, isoformExpressionCutoff = 0 TPM, IFcutoff = 0.01, removeSingleIsoformGenes = TRUE). Differential isoform usage (DIU) analyses were performed by establishing pairwise comparatives across the five species using DEXSeq method in IsoformSwitchAnalyzeR (Vitting-Seerup and Sandelin 2017; Ritchie et al. 2015;Anders et al. 2012). Species-specific DIU cases were defined as significant isoform usage changes (|dIU|>0.1 and isoform switch q-value<0.05) in a given species compared to the rest of primates (up-or down-regulated), while the remaining species showed non-significant differences among them. The same rationale was applied to classify isoform usage changes shared by groups of 2 or 3 species in comparison to the rest of them. IsoformSwitchAnalyzeR preFilter function (default settings) was also used for the evaluation of genes where orthologous splice junctions showed highimpact genetic changes across primates.

Differential exon usage analyses
The projected transcript models in each species (without any collapsing, see 'Definition of orthology relationships') were flattened to define orthologous exonic counting bins from all transcribed segments (i.e., exonic parts). RNA-seq read counts overlapping each exonic part in the 3 LCLs from each species were obtained using HTSeq-count (parameters: -p yes -r pos -s reverse -f sam) (Anders et al. 2015) after discarding multi-mapping reads. Differential exon usage (DEU) analyses were performed by establishing pairwise comparatives across the five species using DEXSeq (Anders et al. 2012;Reyes et al. 2013) while controlling for batch effects. Significant exon usage changes (exonBaseMean>10, |log2FC|>1.2 and padj<0.05) were retrieved from each pairwise comparative and used to classify the change pattern across species. Exonic parts showing significant difference in usage in a given species compared to the rest of primates (up-or down-regulated) were defined as species-specific as long as the remaining species did not display significant differences among them. The same rationale was used to classify changes shared by groups of 2 or 3 species compared to the remaining species. Exonic parts not classified as DEU in any pairwise comparative and displaying exonBaseMean>10 (average across all samples) were defined as conserved.
To quantify the frequency of exon inclusion, we calculated percent-spliced-in (PSI) estimates according to (Schafer et al. 2015) and computed the average across the 15 RNA-seq experiments (3 LCLs per species). Average PSI estimates were used to classify exonic parts into highly excluded [0-0.2), middle excluded [0.2-0.4), middle included [0.4-0.8) and highly included [0.8-1] (Fig. 5A). We classified the exonic parts into AS modes using SUPPA in the projected transcript models (see 'Classification of alternative splicing patterns'). The phastCons conservation scores across primates were retrieved from UCSC whole-genome alignments (17-way phastCons scores). Average phastCons scores per exonic part were computed using UCSC bigWigAverageOverBed tool (default settings), and then classified into low [0-0.3], middle (0.3-0.8) and high [0.8-1] (Fig. 5A). Nucleotide diversity (pi) in human populations was calculated from the data generated by the 1000 Genomes project (The 1000 Genomes Project Consortium 2015) mapped against hg38 using VCFtools (version 0.1.12, VCFtools --site-pi, default settings) (Danecek et al. 2011). The 1000 Genomes strict mask was used to select exonic parts with all their nucleotides classified as passed bases according to the strict mask BED file (hg38).
Average pi estimates were also computed for each exonic part.

Classification of genes according to their splicing and usage patterns
One-to-one orthologous genes expressed in LCLs were classified according to the splicing and usage changes detected in the analyses of transcript expression, DIU, and DEU.
To do so, isoforms and exonic parts were first classified as conserved (if the expression or usage is conserved in all species), species-specific gains (up-regulation in a single species in comparison to the rest), and other changes (i.e., the remaining cases resulting from expression or usage differences between groups of species, or species-specific losses). Isoforms and exonic parts showing usage patterns that are not consistent with differences between groups of species (e.g., differential usage in only one pairwise comparative) were not considered as they are more likely reflecting intra-species variability.
The above classification of isoforms and exonic parts resulting from the analyses of transcript expression, DIU, and DEU were combined to establish five gene classes. Each gene was classified according to the following rules: 1) 'all_conserved': genes where only conserved isoforms/exonic parts were detected. Thus, these genes produce isoforms expressed in all five species, and no isoform/exon usage differences were found across species.
2) 'only_human_sp': genes where human-specific gains were detected (and not species-specific gains from other species), allowing the presence of conserved or other isoforms/exonic parts.
Thus, these genes produce human-specific transcripts, or show human-specific isoform/exon usage gains.
3) 'only_NHP_sp': genes where only chimpanzee, gorilla, orangutan or macaque-specific gains were detected (for a single species, excluding human), allowing the presence of conserved or other isoforms/exonic parts. Thus, these genes produce species-specific transcripts in NHP, or show species-specific isoform/exon usage gains in NHP. 4) 'convergence_spsp': genes where independent events of species-specific gains were detected in multiple species, allowing the presence of conserved or other isoforms/exonic parts. Thus,

Supplemental Figures
Supplemental Figure S1. Isoform length distribution of Iso-Seq transcriptomes from each species after SQANTI filtering. Figure S2. Saturation analysis of Iso-Seq data for gene (a) and isoform (b) discovery rates across the 5 primate species. Saturation curves are based on random subsampling of incremental fractions of long-read sequencing data. Figure S3. Number (top) and percentage (bottom) of known vs novel genes (left), predicted protein sequences (middle) and exons (right) captured by Iso-Seq in each species. Figure S4. Percentage of human Iso-Seq transcripts belonging to each SQANTI structural category based on the Universal Human Reference RNA (UHRR) Iso-Seq dataset. Legend indicates if the transcript has been predicted to encode a protein ('coding') or not ('non_coding'). FSM: full splice match (complete match of junctions combination reported in UHRR Iso-Seq); ISM: incomplete splice match (partial match of junctions combination reported in UHRR Iso-Seq); NIC: novel in catalog (novel combination of reported splice sites in UHRR Iso-Seq); NNC: novel not in catalog (combination of splice sites not reported in UHRR Iso-Seq); Genic Genomic: isoform overlaps exons and introns in UHRR Iso-Seq; Antisense: isoform is anti-sense to a reported gene in UHRR Iso-Seq; Fusion: isoform simultaneously maps to two reported genes in UHRR Iso-Seq; Intergenic: isoform maps to intergenic region according to UHRR Iso-Seq; Genic Intron: isoform is fully contained in intronic region from UHRR Iso-Seq. A detailed description of SQANTI structural categories can be found in Tardaguila et al. (2018). Figure S5. Number of alternative splicing events (left) and percentage of known vs novel alternative splicing events captured by Iso-Seq in each species (right). SE: skipping exons; RI: retained introns; A5SS: alternative 5' splice sites; A3SS: alternative 3' splice sites; MX: mutually exclusive exons. Figure S6. Number of novel peptides detected in each species (compared to their RefSeq proteomes) by the SQANTI class of the transcript from which they were predicted. SQANTI classes are defined according to the reference transcriptome of the corresponding species (Ensembl for hg38, and RefSeq for panTro6, gorGor6, ponAbe3 and rheMac10). A detailed description of SQANTI structural categories can be found in Tardaguila et al. (2018). Figure S7. Number of species in which novel peptides to a given species were simultaneously detected by mass spectrometry experiments. Figure S8. Example of poorly annotated gene, SON (shown in ponAbe3 assembly), refined by Iso-Seq, RNA-seq and mass spectrometry in orangutan samples. Figure S9. Schematic workflow for comparative transcriptomics analyses based on Iso-Seq collapsed isoform models. Figure S10. Number of transcript models whose coordinates were projected to different species from all collapsed models in hg38 genome (N=55,888 models). Figure S11. Expression values for the exons present in a single species (Species-specific; yellow) and present across all species (Conserved; blue). Exon expression was normalized to Transcripts Per Million (TPM). Statistical significance of the difference across groups was assessed by Student t-test. P-values are shown for the comparison in each species. Figure S12. Percentage of species-specific gained (yellow) and conserved (blue) exons overlapping repeat elements (left) and number of species-specific gained exons overlapping different classes of repeat elements (right). Statistical significance of the difference across groups was assessed by Student t-test. Figure S13. Example of a gene, MRNIP, expressing a human-specific exon in 3' UTR. RefSeq MRNIP models in hg38 assembly (top track) are illustrated together with human PB.10057.1 Iso-Seq transcript (middle track). Human-specific insertions of transposable elements (SVA and LINEs) in 3' UTR are also depicted (bottom track). UTR and CDS are colored in blue and orange, respectively. CDS: coding sequence; UTR: untranslated region. Figure S14. Transcript gains and losses in the primate lineage in all projected transcripts ('All data', left) and high confidence transcripts ('High confidence dataset', corresponding to transcripts whose expression is consistent in all samples from each species, right). Numbers in blue indicate the gains and losses inferred from Wagner parsimony, whereas numbers in pink correspond to the number of transcript gains and losses that can be explained by a unique gain or loss event in our phylogeny. Supplemental Figure S17. Percentage of alternative splicing modes in junctions used in all five species ('Conserved', blue boxes) and junctions with species-specific usage ('Speciesspecific gain', red boxes). SE, skipping exons; A5SS, alternative 5' splice site; A3SS, alternative 3' splice site; MEX, mutually exclusive exons; Complex, any combination of the previous AS modes. Retained introns (RI) were excluded from this analysis (see Supplemental Methods). Statistical significance of the difference across groups was assessed by Student ttest. Shown p-values were corrected using the Holm method.

Supplemental
Supplemental Figure S18. Number of GTEx tissues supporting splice junction usage. The junctions used by each species' LCLs are segregated by conservation level (number of species in which they are simultaneously expressed). Only junctions showing intra-species consistent usage patterns in intra-species consistently expressed isoforms are shown. For a given splice junction to be supported in a given GTEx tissue, we require that at least 20% of all samples from that tissue display RNA-seq reads spanning the junction.
Supplemental Figure S19. Number of GTEx samples supporting splice junction usage. The junctions used by each species' LCLs are segregated by conservation level (number of species in which they are simultaneously expressed). Only junctions showing intra-species consistent usage patterns in intra-species consistently expressed isoforms are shown.
Supplemental Figure S20. Percentage of GTEx samples in each human tissue that show supporting reads for human-specific splice junctions. Human lymphoblastoid cell lines (LCLs) from GTEx are included as an independent group. Figure S21. Expression level (average TPM over 15 LCLs) for genes only expressing transcripts present in all 5 species, and for genes expressing species-specific transcripts. Statistical difference between groups was assessed using Student t-test (Twosided) and p-values were adjusted using Holm method.

Supplemental
Supplemental Figure S22. Number of projected Iso-Seq transcript models for genes only expressing transcripts present in all 5 species and for genes expressing species-specific transcripts. Statistical difference between groups was assessed using Student t-test (Twosided) and p-values were adjusted using Holm method.
Supplemental Figure S23. Proportion of transcript models that are annotated ('Known') or unannotated ('Novel') in each species' reference annotation by expression conservation levels (number of species in which the transcript is expressed). Only transcripts with intra-species consistent patterns of expression (presence/absence) are shown.
Supplemental Figure S24. Proportion of transcript models that are annotated in any species ('Known_in_any_sp') or unannotated in all five species ('Novel_in_all_sp') by expression conservation levels (number of species in which the transcript is expressed). Only transcripts with intra-species consistent patterns of expression (presence/absence) are shown.
Supplemental Figure S25. CYLD transcript models and their splice junction (SJ) combinations are represented (left) alongside the RNA-seq junction reads found in each species (right). Splice junctions are numbered sequentially according to their location in the genome. The heatmap displays a color-gradient for log2(RNA-seq junction reads +1) in each splice junction (average number of reads per species). Grey shadows in transcript structures indicate species-specific splice junctions. Untranslated regions (UTRs) and coding sequences (CDS) are colored in blue and orange, respectively.
Supplemental Figure S26. Example of a gene, IFI44L, expressing human-specific and conserved transcripts. Ensembl annotation in hg38 assembly (black) is displayed (top) together with IFI44L Iso-Seq transcript models (bottom) that are human-specific (green) or expressed in all species (blue). In Iso-Seq transcript models, narrow blocks represent the untranslated region (UTR) and thick blocks indicate the coding region (CDS). Figure S27. PhastCons conservation scores in 3' splice sites ('Start' of the exon) and 5' splice sites ('End' of the exon) and their surrounding regions (± 25 base pairs from exon boundaries). For every single position (nucleotide), median conservation scores were calculated by deeptools (Ramírez et al. 2016) over a set of non-redundant exons (different in their genomic coordinates) present in the collapsed projected Iso-Seq models (hg38) (top). PhastCons conservation scores for every single position across these exons are represented as a color gradient (bottom) according to the color legend (right). Only exons that are 100-500 nucleotides long were kept for this representation (N=84,785 exons). Conservation scores were retrieved from 17-way alignments (UCSC).

Supplemental
Supplemental Figure S28. Functional enrichment for genes expressing species-specific transcripts that arise from conserved splice site canonicity (left, N=627 genes) and not conserved splice site canonicity (right, N=280 genes) across primates. The enrichment was computed by over-representation analysis (ORA) using the Gene Ontology Biological Process database (affinity propagation clustering). False discovery rates (FDR) were adjusted using the Benjamini-Hochberg method (Benjamini et al. 2001). Color legend indicates the number of overlapping genes between the gene set under evaluation and the pathway gene set. Figure S29. Example of a gene, RRP15, with a human-specific splice site mutation altering the definition of the 5' boundary of RRP15 exon 2. Macaca mulatta (blue) and Pan troglodytes (pink) Iso-Seq projected isoforms are shown together with Homo sapiens Iso-Seq isoforms and RefSeq models in hg38 assembly (green).

Supplemental
Supplemental Figure S30. Example of a gene, NAP1L4, with a gorilla-specific splice site mutation resulting in an exonization event. RefSeq has already annotated this gorilla-specific 3' UTR exon in gorGor6 assembly.
Supplemental Figure S31. Relative isoform expression versus expression of the topexpressed isoform for a given gene. Known transcripts (blue) show higher relative expression than novel transcripts (orange) in all species. Only expressed transcripts in a given species are shown.
Supplemental Figure S32. Relative transcript expression versus expression of the topexpressed transcript for a given gene across conservation levels (number of species in which transcript is expressed). Only expressed transcripts in a given species are shown.
Supplemental Figure S33. Venn diagram showing the intersection of species-specific isoform usage changes detected by two different strategies. 'Grouped': isoform usage is compared in each species versus a group consisting of all the remaining species (yellow). 'Pairwise': isoform usage is compared in all pairwise combinations, where the isoform must display a significant usage change in a given species versus the others while the remaining species must show no significant usage differences between them (purple).
Supplemental Figure S34. Number of isoforms showing species-specific usage changes. Up-regulated (dark shadow) and down-regulated (light shadow) isoforms are classified into novel (orange) and known (blue) isoforms based on the human reference transcriptome (hg38 assembly).
Supplemental Figure S35. Proportion of exonic parts intersecting Alu elements (conserved usage: 'conserved'; human-specific down-regulated: 'human_down'; human-specific upregulated: 'human_up'). All exonic parts are shown to the left while skipping exons (SE) are displayed to the right. Alu elements were retrieved from hg38 repeatMasker track (UCSC). Statistical significance in the difference in proportions across groups was evaluated using Fisher's exact test. P-values were adjusted using the Bonferroni method.
Supplemental Figure S36. Average nucleotide diversity (pi) in human populations for exonic parts with conserved usage ('conserved', N=95,453; red), exonic parts with species-specific up-regulation ('spsp_up', N=941; blue) and down-regulation ('spsp_down', N=1,340; orange), and exonic parts showing other usage changes ('other', N=277; green). 'other' usage changes correspond to usage differences between groups of species (e.g., any 2 species show differential usage versus the remaining 3 species). Pi diversity estimates were calculated from the 1000 Genomes data mapped against hg38. Only exonic parts longer than 5 bases with all nucleotides included as passed bases according to 1000 Genomes strict mask are shown. Statistical significance of the difference across groups was assessed by the Wilcoxon test. Pvalues were adjusted using the Holm method.
Supplemental Figure S37. Proportion of exonic parts intersecting coding regions (CDS) predicted from Iso-Seq projected models (conserved usage: 'conserved'; human-specific down-regulated: 'human_down'; human-specific up-regulated: 'human_up'). Statistical significance in the difference in proportions across groups was evaluated using Fisher's exact test. P-values were adjusted using the Bonferroni method.
Supplemental Figure S39. Ratio of nonsynonymous to synonymous substitutions (dN/dS) in human populations for each of the gene classes according to their splicing and usage patterns. Statistical significance for the pairwise comparisons was obtained using the Dwass-Steel-Critchlow-Fligner all-pairs test. P-values were adjusted using the single-step method. Only adjusted p-values lower than 0.05 are shown.
Supplemental Figure S40. Length distribution of Isoseq3 transcript sequences combining all subreads from the same species ('Combined') versus independent processing of subreads from different libraries and subsequent isoform merging per species ('Not combined', alternative strategy).
Supplemental Figure S41. a) Percentage of Iso-Seq transcripts (SQANTI-filtered) that were supported in 2 LCLs, 1 LCL or unsupported after independent Iso-Seq re-processing for each LCL. b) Long read-based transcript expression for each Iso-Seq replicability class. c) Short read-based transcript expression for each Iso-Seq replicability class. d) Percentage of annotated transcripts in each Iso-Seq replicability class. e) Percentage of each Iso-Seq replicability class in annotated and novel transcripts.
Supplemental Figure S42. Isoform quality descriptors by Iso-Seq replicability classes ('both': isoform was independently identified in two LCLs for that species; 'single': isoform was independently identified in one LCL for that species; 'none': isoform was identified after combining Iso-Seq subreads from the two LCLs per species, but not in independent Iso-Seq reprocessing of each LCL). Transcript length ( Supplemental Figure S44. Functional enrichment for genes displaying species-specific isoform innovations (N=877 genes) considering transcripts with TPM<0.1 as non-expressed. Over-representation analysis (ORA) was conducted using the Gene Ontology Biological Process database (affinity propagation clustering). False discovery rates (FDR) were adjusted using the Benjamini-Hochberg method. Color legend indicates the number of overlapping genes between the gene set under evaluation and the pathway gene set. Supplemental Figure S46. Distribution of Iso-Seq replicability classes by conservation level in the expressed isoforms in each species. Only transcripts showing intra-species consistency in their patterns of expression presence/absence are shown. 'both': isoform was captured in 2 LCLs from a given species; 'single': isoform was captured in 1 LCL from a given species; 'none': isoform was captured combining subreads from the same species, but not in independent Iso-Seq replicate re-processing; 'absent': isoform was captured in any other species' Iso-Seq.
Supplemental Table S7. Number of detected peptides per species according to their presence in the simulated digestion of Iso-Seq isoforms projected to each genome (plausible in genome), UniProt and RefSeq reference proteomes. Detected peptides must satisfy FDR<5%, Mascot IonScore > 20, reporter ion intensity signal (abundance) per sample > 50 (in any sample of a given species), and species-wise median ratio of sample abundance to pool abundance > 0.6 (excluding matches to common contaminant sequences and peptides arising from more than 1 tryptic miscleavage). The numbers correspond to distinct tryptic peptides considering isoleucine and leucine (I/L) as indistinguishable by mass spectrometry.
Supplemental Data S2. All detected peptides by mass spectrometry experiments.
Supplemental Data S3. Detected novel peptides according to RefSeq annotations. Isoleucine and leucine amino acids are designed as 'Z' since they are indistinguishable by mass spectrometry experiments.
Supplemental Data S4. BED files containing the genomic coordinates of detected novel peptides according to RefSeq annotations (hg38, panTro6, gorGor6, ponAbe3 and rheMac10). Isoleucine and leucine amino acids are designed as 'Z' since they are indistinguishable by mass spectrometry experiments.
Supplemental Data S5. GFF file for the projected isoform models in hg38 coordinates.
Supplemental Data S6. SQANTI classification file for the projected isoform models in hg38.
Supplemental Data S8. Expression matrix for projected isoform models across samples (TPM, including batch effect correction and TMM normalization).
Supplemental Data S9. Binary expression matrix (1=presence, 0=absence) for projected isoform models across species. Only transcripts showing consistent expression in all samples from the same species are included (based on batch effect-corrected TPM values).
Supplemental Data S10. Classification of projected isoform models according to their isoform usage patterns.
Supplemental Data S11. Classification of exonic parts according to differential exon usage (DEU) results in hg38 coordinates.