The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements

The mammalian genome harbors up to one million regulatory elements often located at great distances from their target genes. Long-range elements control genes through physical contact with promoters and can be recognized by the presence of specific histone modifications and transcription factor binding. Linking regulatory elements to specific promoters genome-wide is currently impeded by the limited resolution of high-throughput chromatin interaction assays. Here we apply a sequence capture approach to enrich Hi-C libraries for >22,000 annotated mouse promoters to identify statistically significant, long-range interactions at restriction fragment resolution, assigning long-range interacting elements to their target genes genome-wide in embryonic stem cells and fetal liver cells. The distal sites contacting active genes are enriched in active histone modifications and transcription factor occupancy, whereas inactive genes contact distal sites with repressive histone marks, demonstrating the regulatory potential of the distal elements identified. Furthermore, we find that coregulated genes cluster nonrandomly in spatial interaction networks correlated with their biological function and expression level. Interestingly, we find the strongest gene clustering in ES cells between transcription factor genes that control key developmental processes in embryogenesis. The results provide the first genome-wide catalog linking gene promoters to their long-range interacting elements and highlight the complex spatial regulatory circuitry controlling mammalian gene expression.

"The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements" Table S1: Genomic coordinates of mouse Promoter Capture Hi-C baits Table S2: Processing of sequence reads Table S3: Publicly available datasets used Table S4: Enrichment of chromatin marks in promoter-interacting fragments  Figure S1: Promoter CHi-C recapitulates known long-range chromosomal interactions Figure S2: Validation of Promoter CHi-C by triple label DNA FISH Figure S3: Characterization of promoter-interacting regions in FLCs Figure S4: Spatial promoter-enhancer interactions uncovered by Promoter CHi-C Figure S5: Promoter interactions and chromosomal domains Figure S6: CHi-C uncovers non-random promoter-promoter interaction networks   primers (Illumina). The concentration and size distribution of Hi-C library DNA after PCR amplification was determined by Bioanalyzer profiles (Agilent Technologies).

Bait capture system design for Promoter Capture Hi-C
A list of transcripts was obtained from the Ensembl BioMart system, detailing the positions and assigned biotype for all transcripts in their gene build. The list was filtered to keep only transcripts with a biotype of protein-coding, non-coding (lincRNA), antisense, snRNA, miRNA or snoRNA. A genomic restriction map was then created to generate a list of all HindIII fragments in the genome, and this was subsequently filtered against the transcript list to keep only restriction fragments which contained the transcriptional start position for one or more transcripts. For each restriction fragment containing a transcription start site, two 120 bp capture probes were designed, one to each end of the fragment. Because of the size selection step following sonication, the probes had to fall entirely within a region no more than 500 bp from the end of the fragment. Each probe was required to have no more than 3 consecutive bases masked by repeatmasker, and they had to have a GC content of 25 to 65% to match the efficient capture range of the SureSelect target enrichment system (Agilent Technologies). Where multiple probes passed these criteria, the probe nearest the end of the restriction fragment was chosen.

Quantitative chromosome conformation capture (3C-qPCR)
Cells were fixed in 2% formaldehyde for 10 min, and 3C was performed essentially as previously described (Dekker et al. 2002). 3C DNA was purified using an Amicon Ultracel 0.5 ml column. For Promoter Capture Hi-C validation, long-range 3C PCR amplicons were designed by combining a 'bait' primer (located within a Promoter Capture target HindIII fragment) with primers to HindIII fragments uncovered by Promoter Capture Hi-C as either interacting or non-interacting (both Promoter Capture targeted and non-targeted HindIII fragments were tested). To generate a standard curve for 3C-qPCR, the corresponding ligation products were generated from a 3C template library and mixed in equimolar concentrations. In order to control for crosslinking and ligation efficiency within individual 3C libraries, each of the long-range ligation products was normalized against a short-range ligation product derived by PCR using the 'bait' primer in combination with a primer for adjacent HindIII fragment sequences. For both cell types, two biological replicates were analyzed by three technical replicates each. Quantitative PCR was performed using SYBR Green PCR master mix (Applied Biosystems) and a C1000 Thermal Cycler (Biorad). 3C primer sequences are available upon request.

DNA FISH
25 ng of each dye-coupled BAC DNA probe was precipitated with 8 µg of mouse Cot-1 DNA and 10 µg of salmon sperm DNA for each hybridization. Precipitated probes were dissolved in 10 µl of 50% formamide, 10% sodium dextran sulfate, 1xSSC and used for hybridization. Cells were suspended in 50 µl of ice-cold PBS and spotted onto poly-L-lysine slides (Sigma) for 4 min to let the cells settle. The slides were immersed in 4% formaldehyde in PBS for 10 min to fix. The fixation was quenched with 0.1 M Tris-HCl, pH7.4 for 10 min, and the cells were permeabilized with 0.1% saponin, 0.1% Triton X-100 in PBS for 10 min. After washing, the slides were immersed in 20% glycerol in PBS for 20 min and subjected to three cycles of freezing/thawing in liquid nitrogen, followed by washing in PBS. The slides were incubated in 0.1 N HCl for 30 min, washed in PBS, permeabilized again in 0.5% saponin, 0.5% Triton X-100 in PBS for 30 min, washed again, and immersed in 50% formamide in 2xSSC for 10 min. For hybridization, the probe mixtures (prepared as above) were applied onto the cells, covered with coverslips, heated at 78˚C for 2 min and incubated at 37˚C overnight. After hybridization, the slides were washed in 50% formamide in 2xSSC for 15 min at 45˚C, followed by 0.2xSSC at 63˚C for 15 min, 2xSSC at 45˚C for 5 min, and 2xSSC at room temperature for 5 min. The cells were finally counterstained with DAPI, and coverslips were mounted with Vectashield mounting medium (Vector Laboratories). Images of the DNA FISH signals were captured and analysed with the MetaCyte automated imaging system (MetaSystems).
3D distances between the specified genomic loci were measured for a minimum of 300 alleles per experiment, and used Fisher´s exact and Chi square tests to identify significant differences in the distance distributions.

Expression analysis
Genes/promoters were separated into five expression categories based on the RPKM values from GSM723776 (Shen et al. 2012) and ERR031629 (Ficz et al. 2011) mRNA-seq data for ESC (genes falling into different expression categories in the two data sets were assigned to NA), and GSM661638 mRNA-seq data from Ter119+ cells for FLC (Kowalczyk et al. 2012). RPKM values were obtained using TopHat with default settings (Trapnell et al. 2012). Genes with 0 RPKM formed a separate category (inactive) and all other genes (active) were divided into quartiles according to their RPKM. In analyses where active and inactive genes are compared, we considered 0 RPKM genes to be inactive. Promoter fragments containing multiple Ensembl (Flicek et al. 2014) annotated promoters were either assigned to the appropriate category if all genes on the fragments belonged to the same expression category, or were assigned NA if they differed.

Interaction symmetry
We aligned all promoter fragments which contained a single Ensembl promoter and plotted upstream and downstream interactions, using the midpoint of two interacting fragments as the interaction length. We calculated the proportion of interactions within 5 kb bins, up to a distance of 150 kb from the promoter. We separated promoters according to their expression activity, and compared the spread of interaction distances between active and inactive sets using the Ansari-Bradley test.

ChIP-seq processing
For all ChIP-seq datasets where the raw data were available, the raw reads were mapped to the mouse genome using Bowtie (Langmead et al. 2009), with a seed length of 25 bp, allowing reads that had at most only one mismatched nucleotide in the seed, returning only one possible mapping and with the remaining parameters set to default values. After mapping, MACS (Zhang et al. 2008) was used to call peaks using default parameters. When no raw data were available, the peaks called in the original publication were used.

Enrichment calculation in non-bait interacting fragments
Enrichment for chromatin marks or states and transcription factors in interacting fragments was calculated using the proportion of fragments in a certain group (e.g., fragments that interact with promoters of a certain expression class) that overlap with a peak for the mark, state or transcription factor being analyzed, divided by the proportion of all non-bait fragments that overlap with such a peak. The resulting value was then converted to its log2 value, so that positive values represent an enrichment compared with all non-bait fragments and a negative value represents depletion. An interacting fragment was assigned to an expression class if all the bait fragments it interacts with have been assigned to the same expression class. If it interacts with baits from different classes, it was excluded from the enrichment analyses.

Chromatin states definition
Chromatin states were defined using ChromHMM (Ernst and Kellis, 2012). Data from 30 histone modifications and transcription factors were used to identify eight chromatin states. For downstream analysis, states 2 and 3 were merged and named "Promoter-like" as they showed a high similarity. The enrichment of chromatin states in non-bait interacting fragments was calculated in the same way as the enrichment for individual transcription factors and histone modifications.

Defining enhancer regions
Predicted enhancers active in ESC or in E14.

Enhancer interactions with proximal or distal promoters
We calculated the proportion of enhancer-promoter interactions between the enhancer fragment and the nearest promoter (either upstream or downstream). Interactions in which the given fragment interacted with a distal promoter were separated into two categories: interactions with the adjacent promoter but not the closest one, and interactions skipping other promoter/s.

Promoter interactions with proximal or distal enhancers
We calculated the proportion of promoter-enhancer interactions that were between the promoter fragment and the nearest enhancer (either upstream or downstream).

Overlap of enhancer interactions between ESCs and FLCs
We calculated the percentage of ESC enhancer interactions per promoter that were present in FLC. For this comparison, only those enhancers that were predicted to be active in both cell types were considered. As controls we calculated the percentage of all ESC promoter-genome interactions per promoter present in FLCs as well as the percentage for a random subset of promoter-genome interactions involving the same number of non-promoter fragments as the shared active enhancer fragments.

Interactions and LADs
Promoter-genome interactions were separated according to where interacting

TADs and interaction directionality
The directionality measure of a bait fragment's interactions was obtained by calculating the proportion of interactions originating from a promoter fragment and contacting a fragment with higher genomic coordinates on the same chromosome, divided by the total number of interactions originating from that promoter fragment.
The resulting number was then normalized to a range of -1 to 1, so that promoter fragments with negative directionality ratios have more interactions with fragments placed before it in the genome (according to their genomic coordinates), and a promoter fragment with positive directionality ratios have more interactions with fragments placed after it. Three measures of directionality were calculated for each promoter fragment: one measuring the directionality of all its interactions, another the directionality of its interactions with non-promoter fragments and a third measuring the directionality of its interactions with other promoter fragments.

Promoter co-localization analysis
To measure the enrichment of interactions within a set of promoters, we generated one hundred random promoter sets, with comparable pair-wise distance distribution to the original set. The p-value was acquired by counting the number random control sets that have more interactions than the original set. As interaction counts in control experiments generally follow a near-normal distribution, a T test with 99 degrees of freedom was used to more accurately estimate low p-values. Fold change was derived by dividing the number of interactions in the original set by the average expected number of interactions in the control sets.

Promoter interaction networks
The enrichments of the promoter-promoter interactions between two target groups, namely TF binding or GO terms (Ashburner et al. 2000) x and y were calculated using a measure of how likely a bait fragment is to be part of y in one end, given that the other end is part of x. The resulting proportion is then divided by the likelihood of a bait fragment being part of x at any end of an interaction. Interaction networks were constructed using these enrichments represented as edge colours. GO terms containing at least 50 genes, and at least 20 interactions in the target cell type were selected for investigation. Most significant GO targets with p-value < 0.01, and fold change > 2 were selected for display. The GO terms in the figures were manually curated to exclude too generic or redundant categories. TFs with significant colocalization (p-value < 0.01) were considered regardless of effect size. The network was laid out using Cytoscape's Edge-Weighted Spring Embedded layout (Shannon et al. 2003).

Target enrichment controls
To further investigate the enrichment of interaction for a given transcription factor, we compared it to a set of control networks of the same size: a) a randomized network with baits in the same expression categories to control for gene expression effects, and b) a randomized network with bait-bait distances comparable to the original set to control for bait proximity. The networks were laid out using Cytoscape's Circle layout (Shannon et al. 2003).

Enrichment calculation in bait-bait interactions
The enrichments among the promoter-promoter interactions were calculated using a measure of how likely an interaction is to have a fragment belonging to expression class y in one end, given that it has a fragment belonging to expression class x in the other end. The resulting proportion is then divided by the likelihood of seeing fragments from expression class x at any end of an interaction and converted to its log2 value.