RBFOX splicing factors contribute to a broad but selective recapitulation of peripheral tissue splicing patterns in the thymus

Thymic epithelial cells (TEC) control the selection of a T cell repertoire reactive to pathogens but tolerant of self. This process is known to involve the promiscuous expression of virtually the entire protein-coding gene repertoire, but the extent to which TEC recapitulate peripheral isoforms, and the mechanisms by which they do so, remain largely unknown. We performed the first assembly-based transcriptomic census of transcript structures and splicing factor (SF) expression in mouse medullary TEC (mTEC) and 21 peripheral tissues. Mature mTEC expressed 60.1% of all protein-coding transcripts, more than was detected in any of the peripheral tissues. However, for genes with tissue-restricted expression, mTEC produced fewer isoforms than did the relevant peripheral tissues. Analysis of exon inclusion revealed an absence of brain-specific microexons in mTEC. We did not find unusual numbers of novel transcripts in TEC, and we show that Aire, the facilitator of promiscuous gene expression, promotes the generation of long “classical” transcripts (with 5′ and 3′ UTRs) but has only a limited impact on alternative splicing in mTEC. Comprehensive assessment of SF expression in mTEC identified a small set of nonpromiscuously expressed SF genes, among which we confirmed RBFOX to be present with AIRE in mTEC nuclei. Using a conditional loss-of-function approach, we show that Rbfox2 promotes mTEC development and regulates the alternative splicing of promiscuously expressed genes. These data indicate that TEC recommission a small number of peripheral SFs, including members of the RBFOX family, to generate a broad but selective representation of the peripheral splice isoform repertoire.


A transcriptomic census reveals that Rbfox contributes to a broad but selective recapitulation of peripheral tissue splicing patterns in the thymus
Kathrin Jansen, Noriko Shikama-Dorn, Moustafa Attar, Stefano Maio, Maria Lopopolo, David Buck, Georg A. Holländer, Stephen N. Sansom

I.
Supplemental Methods

II. Supplemental Figure Legends
III. Supplemental Tables

Extraction of thymic epithelial cells and assessment by flow cytometry
Thymic epithelial cells were extracted as previously described . In short, thymic lobes were incubated with Liberase (Roche) and DNase (Roche) in PBS for 30 min at 37 °C. The cells were incubated with magnetic beads for 15 min at room temperature followed by enrichment of CD45-negative cells using the AutoMACS Pro Seperator (Miltenyl Biotech). Two independent experiments were performed using a total of six 4-6 weeks old mice for each experimental group. Three independent experiments were performed with the double knockout (Rbfox1 lox/lox : Rbfox2 lox/lox :cre+/-). For statistical analysis the resulting cell frequencies were combined and two-sided Welch Two Sample t-tests were performed between genotypes ( Fig. 6B- Supplemental Figures 13-15).

Isolation of thymocytes and analysis by flow cytometry
Thymi and spleens were dissected from knockout and wildtype mice and isolated by gently dissociating the tissues between two frosted glass slides. Cells were filtered and resuspended in PBS containing 2% FCS (Merck) and stained with a combination of the markers given in Supplemental Table 12. The staining for surface markers was performed for 20 min at 4 ºC in the dark. The cell viability was assessed using the LIVE/DEAD Fixable Aqua Dead Cell Stain Kit (ThermoFisher Scientific) according to the manufacturer's instructions. Cells were acquired using FACSAria III (BD Bioscience) and data was analyzed using FlowJo software (version 10.5.0). As described above, n=2 independent experiments were performed and statistical analysis were performed on combined data (Supplemental Fig. 16).

Preparation and staining of samples for confocal microscopy
Freshly dissected thymus lobes were frozen in OCT (Tissue Trek) and 8 μm tissue sections were cut using a Cryostat (Thermo Scientific CryoStar NX70 with MB DynaSharp Microtome Blade). The tissue sections were fixed for 20 min in 1.4 % PFA (Sigma, in 1xPBS) and for 10 min in Methanol (VWR). Sections were permeabilized for 10 min with 0.3 % Triton-X (Sigma, in 1xPBS). These steps were followed by one 5 min washing step in 1xPBS, marking of the individual sections by a hydrophobic PAP pen (Sigma) and two further 5 min washing steps of each individual section. The primary antibodies were diluted in 1xPBS containing 10% goat serum, 0.3% Triton-X and incubated for 45 min at 37 °C. Primary antibodies were directed against AIRE (5H12, eBioscience, 1:500) and the RRM domain (1:500). In addition, UEA-1 was used for identifying mTEC (1:150, Vector Laboratories, in-house labelled). After three washing steps, the secondary antibody was added (diluted 1:500 in 1xPBS, goat a rabbit-AF488 (Invitrogen), goat a rat-AF555 (Invitrogen)) sequencing from skin epithelial cells, mTEC and cTEC (GSE44945) (St-Pierre et al. 2013), cTEC (GSE53111) (Sansom et al. 2014), and single mature mTEC (GSE114713) (Handel et al. 2018) was obtained from the Gene Expression Omnibus (GEO). Independent RNA-sequencing datasets for peripheral mouse tissues (GSE41637, (Merkin et al. 2012)) as well as MHCII high and low adult mTEC (GSE68190, (Chuprin et al. 2015)) were downloaded from GEO.
To generate a single high-depth sample for each tissue and TEC population replicate samples were combined and downsampled to 200M reads (samtools (Li et al. 2009) v1.3.1; Supplemental Table 1).
Separate reference-guided assemblies were constructed for each ENCODE tissue or TEC population using the high-depth samples (StringTie (Pertea et al. 2015) v1.3.3b; Ensembl mm10-v91). To identify transcripts that were reproducibly detected we first prepared biological replicate sample pools for each tissue and TEC population (n=2; 60M reads/samples, Supplemental Table   1). Expression of the transcripts present in each of the assemblies was then quantified in the relevant tissue or TEC population using the two 60M replicates (Salmon (Patro et al. 2017), v0.11.3, with parameters: "--incompatPrior=0 --validateMappings --rangeFactorizationBins=4 --seqBias --gcBias -x 0.66"). We then implemented and applied a robust procedure based on computation of the non-parametric Irreproducibility Discovery Rate (npIDR) (Dobin et al. 2013;Pervouchine et al. 2015) (details below and Supplemental Fig. 1). Transcripts from each tissue or TEC population that were reproducibly detected according to this procedure (those expressed above a level determined to correspond to npIDR ≤ 0.1) were merged into a single unified assembly (StringTie, Ensembl reference annotation guided merge). Transcripts contained in reference introns, possible polymerase run-on fragments, repeats, transcripts overlapping opposite-strand exons or introns and possible pre-mRNA fragments were removed (gffcompare v.0.10.6, class codes 'irpxse'). To be included in the final mT&T assembly, the merged transcript models were additionally required to be reproducibly detected (i.e. expressed above a level determined to correspond to npIDR ≤ 0.1; npIDR was determined as described above) in at least one tissue or TEC population.
The final mT&T assembly was used as the annotation for all subsequent steps if not otherwise indicated.

Quantification of gene and transcript expression levels
For comparisons of gene and transcript expression between the ENCODE tissues and the TEC populations samples (generated for this study) the trimmed 76bp sequences were deduplicated (Picard v2.10.9), filtered to exclude unmapped reads, and down-sampled to common read depths.
For comparisons of gene and transcript expression between the Merkin et al. mouse tissues (Merkin et al. 2012) and the St-Pierre and Chuprin TEC samples (St-Pierre et al. 2013;Chuprin et al. 2015) sequence reads were trimmed to 50 bp, deduplicated, filtered to exclude unmapped reads, and down-sampled to common read depths. Transcripts-per-million (TPM) values were obtained using Salmon with an index created from the new mT&T assembly (k=31 for index, settings: "ISR --gcBias") and upper-quartile normalized. Reads were counted by featureCounts (Liao et al. 2014) (v1.6.0).
RNA-sequencing reads from the Rbfox1 and Rbfox2 tKO mice were not trimmed but were otherwise mapped, deduplicated, filtered to remove unmapped reads, downsampled and quantitated as described above (retaining 14.5M and 10M paired-end reads/replicate for mature and immature mTEC, respectively).

Procedure for identification of reproducibly detected transcripts
To identify reproducibly detected transcripts, we implemented a robust procedure based on the npIDR metric (Dobin et al. 2013;Pervouchine et al. 2015) because we noted very lowly expressed transcripts to frequently pass a naïve npIDR filter (data not shown). TPM values were first log10 transformed (after addition of a small pseudocount = 0.001) and extreme values (x < 0.5 or x > 0.95 expression quantile) were removed to avoid issues arising from data sparsity. Data were then binned (n=50 bins) and npIDR values for each bin computed as previously described (Pervouchine et al. 2015). To estimate the expression level above which transcripts could be reliably detected we modelled the TPM vs npIDR relationship by LOESS regression. The fitted curve was used to estimate the TPM value that corresponded to npIDR £ 0.1. The analysis was performed separately for each TEC and peripheral tissue (with n = 2 biological replicate sample pools). Determination of the TPM threshold above which transcripts were reproducibly detected in the adrenal samples is shown in Supplemental Fig 1A-B. The TPM thresholds determined for each of the TEC and tissue samples are shown in Supplemental Fig 1C. This procedure was more consistent and conservative for our datasets than the use of per-transcript npIDR values (data not shown).

Identification of novel tissue-restricted transcripts
Novel transcripts were defined as those without a match in the Ensembl annotation (as assessed with gffcompare). Tissue-restricted novel transcripts were identified as those with tau>0.99

Identification of tissue-restricted antigen and Aire-regulated genes
Tissue-restricted antigen (TRA) genes were defined as the set of protein-coding genes that showed evidence of tissue restricted expression amongst the ENCODE tissues and wildtype mTEC samples. To avoid representation bias when computing expression specificity, we first identified groups of similar peripheral tissues by hierarchically clustering the tissues according to their transcript expression profiles. One tissue was then selected to represent each of the groups identified (Supplemental Fig. 3A). The set of TRA genes was defined as the set of genes with tau values > 0.7 in the representative tissues and wildtype immature and mature mTEC samples (Supplemental Fig. 3B) (Kryuchkova-Mostacci and Robinson-Rechavi 2017). To associate TRA genes with individual tissues we constructed a family (F) of non-overlapping TRA gene subsets which we termed "iTRA" by assigning TRA genes to the tissue in which they were most highly expressed. In summary, FiTRA = {iTRAtissue-i, … , iTRAtissue-n} where iTRAtissue-i comprises the subset of TRA genes with highest expression in tissue i. As a concrete example, the adrenal iTRA subset (denoted iTRAadrenal or simply "adrenal iTRA") contains the subset of TRA genes that were more highly expressed in the adrenal tissue than in any other peripheral tissue. Aire-regulated genes were defined as those significantly downregulated more than 2-fold in the homozygous Aireknockout mTEC relative to heterozygous Aire-knockout mTEC (BH adjusted p < 0.05, DESeq2 (Love et al. 2014) analysis, n=2 biological replicates, Supplemental Fig. 3C). For the identification of Aire-regulated genes in mTEC untrimmed, full-depth sequence data was mapped (as above) and quantified using featureCounts (Ensembl v91). In total, we identified n=3,889 Aire-regulated tissue-restricted antigen genes (Aire-TRA), n=5,266 other tissue-restricted antigen genes (non-Aire TRA) and n=12,885 non-TRA genes (Supplemental Fig. 3D and Supplemental Table 2). normalised, log2(n+1) transformed TPM values). Subsets of iTRAs were then identified for these tissues following the approach described above for the ENCODE tissues.

Assessment of differential splicing and splice junctions
Differential splicing events were identified using rMATS (Shen et al. 2014 Splice junctions (SJs) were counted using SJcounts (v3.1; settings: "-maxnh 1 -read1 0 -read2 1") (Pervouchine et al. 2013) using 50M trimmed (76bp), deduplicated, mapped reads/sample. The results were post-processed to split the multi-junction counts into individual junction counts for final quantitation. SJs were assigned to genes by separately intersecting their start coordinates (±3bp) and end coordinates (±3bp) with exon coordinates extracted from protein-coding transcripts (Ensembl v91, bedtools window (v2.25.0)). Only SJs for which both the start and end coordinates intersected with exons from protein-coding transcripts from the same protein-coding gene were included in downstream analyses. For Fig. 2, and Supplemental Fig. 5 the numbers of unique SJs per gene were summarised. Significant differences in the number of unique junction counts were identified using edgeR (v3.32.0, with housekeeping genes used to estimate dispersion from the single replicates). For this analysis the set of mouse housekeeping genes was defined as n=473 genes which were 1:1 orthologs of a published set of human housekeeping genes (de Jonge et al. 2007). Differences in junction number were considered significant if they showed a two-fold change in number and an FDR < 0.05.

Definition of Aire-regulated transcripts
Differential transcript expression between Aire-knockout and Aire-positive mTEC was assessed using kallisto (Bray et al. 2016

Definition of tissue-restricted splicing-related factors
To assess the expression of splicing-related genes in ENCODE tissues and TEC populations, we protein genes. Genes were included from GO categories that matched the string 'splic' (but not 'tRNA' or 'protein splicing'). GO data was retrieved from AmiGO or the GO Online SQL Environment (GOOSE) database (Carbon et al. 2009) (downloaded 29 th May 2018). Tissuerestricted splicing-related genes were then identified as those with tau > 0.5 in the TEC and peripheral tissue samples.
Fractions of single cells expressing genes were determined based on counts from featureCounts.

Geneset over-representation and motif enrichment analysis
Geneset over-representation analyses were performed using one-sided Fisher's exact tests (FETs)  Table 4.
For the analysis of differentially spliced events in the Rbfox2 tKO dataset (Fig. 7B), the foreground geneset was comprised of genes with differentially spliced events (|delta (d)PSI| > 0.2 and FDR < 0.05) and the background geneset comprised of the set of genes that were tested for differential splicing events (rMATS analysis). The comparison of genes containing differential splicing events between Rbfox1 and Rbfox2 tKO was performed using a two-sided FET.
As input for the motif enrichment tool MATT, exons with enhanced inclusion or exclusion (|dPSI| > 0.2, FDR < 0.05) in the Rbfox2 tKO samples relative to their Cre-littermate controls were used.
Unregulated exons (|dPSI| < 0.05) were used to compute a control enrichment profile. Testing of linear model fits to different sample groups was performed by ANCOVA analysis (Fig.   1C, Supplemental Fig. 4). First, we tested for a difference in slope by testing for a significant interaction between the group and independent variables. If there was not a significant difference (p > 0.05), we concluded that there was no difference in slope and proceeded to test for a difference in intercept by fitting a second linear model without an interaction term. Model tests were performed using the R Anova function. Where a significant difference in intercept was observed (p <0.05) we used the adjusted mean values to summarise the locations of the groups. We also confirmed the patterns of tissue-restricted transcript (tau ≥ 0.9) representation in TEC The barplot shows that Aire-regulated transcripts were largely comprised of "classical" transcripts while the "other" (i.e. non-Aire-regulated) transcripts produced by Aire-regulated genes had "nonclassical" or "unannotated" structures. (F) The density plots show the length distributions of "classical" vs "non-classical" or "unannotated" transcripts. Relative to their non-Aire-regulated counterparts the mean length of Aire-regulated transcripts was slightly increased for the "classical" transcripts (283.9bp; 11%) and greatly increased for the "non-classical" or "annotated" transcripts  Fig. 6B. Data presented in bar graphs represents the combined results from a total of two independent experiments. A total of six 4-6 weeks old mice were used per experimental group. Significant differences of p-value < 0.05 between Rbfox2 lox/lox :cre-/-and Rbfox2 lox/lox :cre+/-using a two-sided Welch Two Sample t-test are indicated with * (mean ± SE shown in bar graphs). Frequencies of mTEC subpopulations. In panels G and I, the three independent experiments are indicated by different symbols. Data presented in bar graphs represents the combined results from a total of three independent experiments. A total of eight 4-6 weeks old mice were used per experimental group (sex-matched). Significant differences of p-value < 0.05 or < 0.01 between Rbfox1 lox/lox :Rbfox2 lox/lox :cre-/-and Rbfox1 lox/lox :Rbfox2 lox/lox :cre+/-using a two-sided Welch Two

Supplemental
Sample t-test are indicated with * or **, respectively (mean ± SE shown in bar graphs).
(A) Gating strategy to define developmental stages and selection of thymocytes within the thymus.  Table 4: Gene ontology categories enriched in protein-coding genes with TEC-specific novel transcripts (Fig. 3D).

Supplemental
The mean fraction of isoforms for each peripheral tissue and mature mTEC is reported in columns 5 and 6.