Population and single-cell genomics reveal the Aire dependency, relief from Polycomb silencing, and distribution of self-antigen expression in thymic epithelia

Promiscuous gene expression (PGE) by thymic epithelial cells (TEC) is essential for generating a diverse T cell antigen receptor repertoire tolerant to self-antigens, and thus for avoiding autoimmunity. Nevertheless, the extent and nature of this unusual expression program within TEC populations and single cells are unknown. Using deep transcriptome sequencing of carefully identified mouse TEC subpopulations, we discovered a program of PGE that is common between medullary (m) and cortical TEC, further elaborated in mTEC, and completed in mature mTEC expressing the autoimmune regulator gene (Aire). TEC populations are capable of expressing up to 19,293 protein-coding genes, the highest number of genes known to be expressed in any cell type. Remarkably, in mouse mTEC, Aire expression alone positively regulates 3980 tissue-restricted genes. Notably, the tissue specificities of these genes include known targets of autoimmunity in human AIRE deficiency. Led by the observation that genes induced by Aire expression are generally characterized by a repressive chromatin state in somatic tissues, we found these genes to be strongly associated with H3K27me3 marks in mTEC. Our findings are consistent with AIRE targeting and inducing the promiscuous expression of genes previously epigenetically silenced by Polycomb group proteins. Comparison of the transcriptomes of 174 single mTEC indicates that genes induced by Aire expression are transcribed stochastically at low cell frequency. Furthermore, when present, Aire expression-dependent transcript levels were 16-fold higher, on average, in individual TEC than in the mTEC population.

Cell lysis, cDNA synthesis and subsequent PCR amplification was performed using the SMARTer Ultra Low RNA Kit (Clontech) for the C1 platform as per the manufacturer's instructions (Clontech/Fluidigm). The cell lysis buffer (20 µL) was supplemented with 1 µL of a 1:400 dilution of the External RNA Controls Consortium (ERCC, (External 2005)) spike-in Mix 1 (Invitrogen).
Indexed Libraries from the 96 samples harvested from each chip were prepared using the Nextera XT kit (Illumina). Multiplexed libraries from each chip were sequenced over two lanes on a HiSeq 2500 (Illumina) at the Wellcome Trust Sanger Institute run in fast mode to generate 100bp paired end reads.

RNA-seq analysis of thymic epithelial cell populations
RNA-seq reads were mapped to the mouse genome (mm9) by providing TopHat (Trapnell et al. 2009) with a transcriptome index built from Ensembl protein coding gene models and a junctions file computed from Ensembl mRNAs. Genes refer to Ensembl version 67 protein coding genes on assembled chromosomes. For differential expression analysis the DESeq algorithm (Anders and Huber 2010) was used to test differences in the mean number of reads mapping to gene models in a biologically replicated design (n=2). Hierarchical clustering of TEC population correlation distances was performed using average linkage clustering in R. Leaf order was optimised using the R "cba" package and significant clusters identified using boot-strapping analysis as implemented in the R "pvclust" package (Suzuki and Shimodaira 2006). GO enrichments were identified using annotation from MGI and a hypergeometric test. 4 Tissue restricted genes were identified using data from the GNF Mouse GeneAtlas V3 (Lattin et al. 2008) (GEO accession GSE10246). Raw microarray expression data from the Atlas were normalised together using the R "rma" package and expression values for 64 non-thymic physiological samples were calculated (replicate values averaged). In order to avoid representation bias, GNF GeneAtlas samples were hierarchically clustered into 35 groups ( Supplementary Fig. 6A). To identify tissue-restricted expression, we used a novel dynamic step method. Because microarray background levels are probe-specific, we tested each probe separately for tissue-restricted expression. Guided by thresholds typically chosen for microarray data analysis, we define tissue restricted probes as those with a minimum normalised expression value of 50 that showed a moderated exponential step-up in expression, such that expression was substantially higher in 1-5 tissue groups than in the 6th highest tissue group (Supplementary Fig. 6B and 6C). Genes with unanimously tissue-restricted probe sets were designated as tissue-restricted (Fig. 2D, 3D, 5A-E, 6A-D, 7A and C, and Supplementary Table 2).

Calculation of per gene Local FDR
Per gene local FDRs were determined using FPKM values calculated directed from read counts (multi-mapping reads were assigned a fractional count value according to their number of alignment locations) because probabilistic methods such as Cuffdiff cannot provide expression estimates for lowly expressed genes. To calculate the null (background) distribution of FPKM values, collapsed gene models (introns shrunk to 50bp) were shifted into proximal, mappabilitymatched intergenic (>5kb from known Ensembl transcripts) space devoid of ESTs (UCSC annotation) ( Supplementary Fig. 3A). FPKM values were then quantitated on protein coding gene models and the matched background set ( Supplementary Fig. 3B). Local FDRs (Efron 2005) 5 (Supplementary Fig. 3E) were determined using sample specific mixing proportions estimated using the Qvality algorithm (Käll et al. 2009).
Copy numbers were then calculated for protein coding genes based on per-cell normalisation curves determined from known spike-in copy number and observed spike-in FPKMs. The curves were obtained from a first order polynomial linear model forced through the plot origin (good linearity observed, see Supplementary Fig. 11A).

ChIP-seq analysis
Reads were aligned to the mouse genome (mm9) using BWA (Langmead et al. 2009) ("bwa aln -l 25 -k 2 -n 0.1 -q 20"). Following de-duplication using Picard, ChIP enrichment over input was calculated using either (i) MACS2 ("bdgcmp -m FE") after preparing normalised bedgraphs for ChIP and input samples using MACS2 (Zhang et al. 2008) ("callpeak --SPMR") ( Fig. 6A,C) or by (ii) calculating normalized read counts for 1kb TSS windows computing the ratio of ChIP/input for 6 these windows (Fig. 6B,D-F). Meta-gene profile plots ( Fig. 6A & C) were prepared using CEAS (Shin et al. 2009) and data visualised using the UCSC Genome Browser (Kent et al. 2002).  (Ramsköld et al. 2009). Here we used a novel local FDR approach to estimate pergene false discovery rates (Supplementary Table 2 Table 2). These values were used because Cuffdiff does not report FPKM values for lowly expressed genes. (C) Foreground (red) and background (blue) FPKM distributions were determined for each TEC sample. (D) Here the foreground FPKMs were used to estimate the mixture density and the background distribution formed the null density in order to calculate local FDR scores as defined by Bradley Efron (Efron 2005). We used the non-parametric algorithm Qvality (Käll et al. 2009)  (B) After ranking groups by expression level, the dynamic step method calls tissue specific genes as those with expression levels in a few (j) groups that are exponentially higher than that in the next highest group (n-jth) group. (C) The dynamic step function sets a threshold based on a moderated exponential jump. (D) The overlap between tissue specific genes called using the dynamic step method and a previously used thresholding method (Gardner et al. 2008). The data indicate a higher sensitivity for the dynamic step method. (E) Comparison between genes called as tissue specific by only one of the two methods. Genes unique to the threshold method While AIRE is not thought to recognize H3K27me3 directly (Koh et al. 2008;Org et al. 2008), we propose that in addition to recognising unmethylated H3K4 via its PHD1 domain, AIRE may be recruited to sites of H3K27 trimethylation via CHD interaction partners, the most likely of which being CHD6 due to its demonstrated interaction with H3K27me3 (Gaetani et al. 2012;Yang et al. 2013). Upon recruitment to these locations, AIRE may override the repressive histone mark (left panel), or may act by locally reprogramming chromatin to a state permissive of transcription (right panel). If AIRE acts to override a repressive state we would predict this state to identify AIRE-targeted genes whether they are actively transcribed or silent in an individual cell. In contrast, should AIRE binding result in a reprogramming of the local chromatin state as a prerequisite for PGE, we anticipate the chromatin state of AIRE target loci to be heterogeneous at the population level. Both models are compatible with the notion that AIRE and its interaction partners target inactive promoters which have already bound the basal transcriptional machinery but where the RNA polymerase II is either stalled or produces only sterile and unstable transcripts (Sims et al. 2004;Oven et al. 2007;Abramson et al. 2010 Il10ra  Il21r  Il12rb1  Pik3cb  Il13  Il12a  Il23a  Csf2rb  Csf2rb2  Il4  Il10  Il23r  Stat5a  Il2rg  Tyk2  Stat5b  Il15  Il11ra1  Stat6  Ifngr2  Socs1  Il20rb  Stat1  Socs3  Lifr  Irf9  Il6st  Pik3r2  Il7  Stat3  Il4ra Ccnd2         (Trapnell et al. 2010).

Supplementary Figure Legends
Aire expression is not shown for the homozygous knockout mTEC populations as this is almost entirely limited to the unmodified 5'UTR ( Supplementary Fig. 2B).  Supplementary Fig. 6).

Supplementary
NRE: Non-restricted expression, TRE: tissue restricted expression (1-5 tissue groups, GNF GeneAtlas), TRE+NRE: Conflicting evidence for tissue specificity from the GNF GeneAtlas. The expression levels reported for Aire in the homozygous mutant mTEC population should be interpreted with caution as they largely arise from the unmodified 5'UTR. Local FDR scores associated with the calculated FPKMs are provided in a separate worksheet (Methods, Supplementary Fig. 3). worksheets are coloured to highlight low (blue) and high (red) expression values using arbitrary scales.

Supplementary
Supplementary Table 5: Genes whose expression is characteristic of three fundamental TEC types (cTEC, immature mTEC and mature mTEC). Here, Aire negative mature mTEC are used to represent mature mTEC in order to avoid the confounding effects of Aire expression in these cells (which are otherwise very similar, Supplementary Fig. 8D). Genes characteristic of each type were identified as those expressed at an FPKM of at least 10 (Cuffdiff) and at least two fold higher (at 5%FDR, DESeq analysis, Supplementary Tables 2 & 3) than in the other two TEC types.
Genes characteristic for each of the three TEC types are shown in separate worksheets as indicated.