ChIP-exo signal associated with DNA-binding motifs provides insight into the genomic binding of the glucocorticoid receptor and cooperating transcription factors

The classical DNA recognition sequence of the glucocorticoid receptor (GR) appears to be present at only a fraction of bound genomic regions. To identify sequences responsible for recruitment of this transcription factor (TF) to individual loci, we turned to the high-resolution ChIP-exo approach. We exploited this signal by determining footprint profiles of TF binding at single-base-pair resolution using ExoProfiler, a computational pipeline based on DNA binding motifs. When applied to our GR and the few available public ChIP-exo data sets, we find that ChIP-exo footprints are protein- and recognition sequence-specific signatures of genomic TF association. Furthermore, we show that ChIP-exo captures information about TFs other than the one directly targeted by the antibody in the ChIP procedure. Consequently, the shape of the ChIP-exo footprint can be used to discriminate between direct and indirect (tethering to other DNA-bound proteins) DNA association of GR. Together, our findings indicate that the absence of classical recognition sequences can be explained by direct GR binding to a broader spectrum of sequences than previously known, either as a homodimer or as a heterodimer binding together with a member of the ETS or TEAD families of TFs, or alternatively by indirect recruitment via FOX or STAT proteins. ChIP-exo footprints also bring structural insights and locate DNA:protein cross-link points that are compatible with crystal structures of the studied TFs. Overall, our generically applicable footprint-based approach uncovers new structural and functional insights into the diverse ways of genomic cooperation and association of TFs.

. Fraction of GR-bound regions containing a GBS in different cell types. (a) Percentage of ChIP-seq peaks with at least one GBS motif (JASPAR MA0113.2) match is plotted against the p-value threshold used to scan for motif hits, for three cell lines. (b) Schematic diagram illustrating ExoProfiler's strand-sensitivity. For motifs matching on the negative strand (green shaded box), the whole region is reverse-complemented and the motif's center is aligned to the forward motifs. Reads initially mapped to the positive strand are thus treated as negative strand (here colored in red).   (a) EMSA showing left: GR DBD binding to combi sequence; middle: combi sequence with mutated TTCC and right: randomized sequence. Compared to the combi motif, the GBS shows a higher shifted band indicative of dimeric GR binding. (b) Genomic fragments near GR-target genes with sequences matching the combi motif, or a mutated version, were cloned upstream of a minimal promoter driving luciferase expression. Fold induction ± SEM upon treatment with 1 μM dexamethasone (dex) for wild-type and mutated reporters in U2OS cells is shown (n=3). (c) Transcriptional activity of luciferase reporters containing a minimal promoter together with three copies of the combi motif or mutant versions as indicated. Fold induction ± SEM (n≥3) in IMR90 cells upon treatment with 1 μM dex is shown. (d) Structural alignment of combined binding of a GR monomer and candidate "partnering" proteins ELK1 (left: PDB 1DUX), ETS1 (middle: PDB 1K79) and TEAD1 (right: PDB 2HZD). (e) Efficacy of siRNA knockdown of candidate partnering factors in U2OS cells. RNA levels two days after transfection with dsiRNAs as indicated were quantified by qPCR. Percentage relative to scramble control ± SEM (n≥3) is shown. (f) Effect of siRNA knockdown of genes as indicated on the activity of the combi motif. U2OS cells were transfected with dsiRNAs prior to transfection with the combi or the GBS reporter CGT (Meijsing et al. 2009), which contains three copies of a GBS motif driving expression of a luciferase reporter. Activities upon treatment with 1 μM dex are shown as percentage of that observed for the scramble control ± SEM (n≥3).  Correspondence between footprint profiles for FOXA1 and GBS and GR ChIP-exo reads indicate combinatorial binding of FOXA1 and GR at the FOX6 locus (hg19: Chr15:35600909-35601308). (b) Genomic fragments near GRtarget genes with sequences matching FOX motifs were cloned upstream of a minimal SV40 promoter driving luciferase expression. Fox mutated: FOX motif, TGTTTAT changed to AGCCTAT. Putative GBSs were mutated as described in the materials and methods section. Fold induction ± SEM upon treatment with 1 μM dexamethasone (dex) for wild-type and mutated reporters as indicated are shown (n≥3). (c) Transcriptional activity of empty pGL3 luciferase reporter compared to a reporter containing 3 copies of the FOX motif. Fold induction ± SEM (n≥3) in IMR90 cells upon treatment with 1 μM dexamethasone (dex) over EtOH as vehicle control is shown. (d) Average 5' coverage for each cluster as identified by K-means clustering of the 500 most occupied palindromic FOXA1 binding sites (Fig 6e).
3"       ChIP experiments were performed to monitor GR, STAT3 and non-specific binding (IgG) at GR-bound and unbound control regions (ABL1 and TAT) in IMR90 cells treated with EtOH as vehicle control or dexamethasone (dex). Percentage of input immunoprecipitated ± SEM (n=3) is shown. (c) Boxplot of log fold change for (left) all genes upon treatment for 4 hours with 1 μM dexamethasone and (right) for genes that are differentially expressed with a log fold change ≤ -0.5 or ≥ 0.5. Center lines show the medians; diamonds show the mean; box limits indicate the 25th and 75th percentiles as determined by R software; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles. Genes with ChIP-seq peaks in windows ± 20 kb around the TSS, harboring a specific motif are indicated by circles (Genes with peaks with STAT3 (p<0.0001) but not GBS are marked in green, GBS (p<0.0001) but not STAT3 in blue).

Quantitative Real Time PCR (qPCR)
RNA isolation, reverse transcription, qPCR and data analysis were performed as described previously (Meijsing et al. 2009). Primer pairs used are listed in Table  1.
GGGATAGAACATTCCACAGTAGGG TGCCCACGCACAAAAATGTG Electrophoretic Mobility Shift Assays (EMSAs) with and without formaldehyde cross--linking EMSAs were performed as described (Thomas--Chollier et al. 2013). Sequences of the 5' Cy--5 end--labeled oligos are as listed below (recognition sequence underlined): Cy5--TCGATACCAAAATATTTGAGTAC A modified version of the EMSA assay described above was used to determine the efficiency of in vitro cross--linking of wild type and mutant versions of the GR DBD. Modifications were as follows: No BSA was present in the reaction mixes and NaCl concentration was 100 mM. Further, after the reaction mixtures reached equilibrium, formaldehyde was added to a final concentration of 0.1%, before samples were incubated for an additional 10 minutes after which formaldehyde was quenched by adding glycine to a final concentration of 125 mM. Following a 5 minute incubation, samples were loaded onto pre--run denaturing gels containing 0.1% SDS. Purification of hGR--DBD (385--540) and mutant versions R510A and K514A was done essentially as described (Meijsing et al. 2009). Oligos used for theses assays were as follows (GBS underlined): GBS (8 bp flank) Cy5--GATCTCGAAGAACAAAATGTTCTGTACCTAT Random (8 bp flank) Cy5--GATCTCGATACCAAAATATTTGAGTACCTAT Plasmids Reporter plasmids with genomic fragments (approx. 400 bp centered around the summit of the ChIP--seq peak) containing a sequence matching the recognition sequence for motifs of interest that are near the TSS of GR--target genes in IMR90 were amplified by PCR and cloned into the pGL3--promoter plasmid (Promega). Genomic coordinates ( Chr1:27325909--27326321; combi--3: Chr1:61647749--61648161; combi--4: Chr20:49039019--49039431; combi--5: Chr3:140995189--140995601. Candidate GBS, combi or Fox sequences of these reporters were disrupted by site directed mutagenesis. Fox sequence TGTTTAT was mutated to AGCCTAT for each of the reporters. To disrupt putative GBSs of the reporters, the underlined bases of candidate GBSs were changed to an A: Fox1: CAGACGTACTGTTCC , Fox5: AGAGCATCCTGTACT, Fox6: AGATAAGGAAGTACT, Fox9: TGCTCAAAATGTTCT. Reporter plasmids containing three copies of either the FOX consensus sequence or the combi sequence were constructed using the following oligonucleotides (FOX: CCGGGAAATAAACAAAcgcgAAATAAACAAAcgcgAAATAAACAAAA combi: CCGGGAAAGAACATTCCAgcgAAAGAACATTCCAgcgAAAGAACATTCCAA recognition sequence underlined) with overhangs to facilitate direct cloning into the Xma1 and BglII sites of pGL3--promoter. Using the same approach, we constructed mutant versions of the combi reporter by using oligonucleotides with changes in the combi sequences as indicated in Fig. 5b and Fig S4c.

Transient transfections
To analyze GR--dependent regulation of luciferase reporter plasmids, U2OS cells were transiently transfected and treated overnight with 1 µM dexamethasone, harvested and luciferase activity was measured as described (Meijsing et al. 2009). For IMR90, approximately 50.000 cells in 500 µl EMEM/10% FBS were seeded per well of a 24--well plate. The following day, cells were transfected using 2 µl Lipofectamine 2000 (Invitrogen) per well according to the manufacturer instructions. Cells were transfected with 720 ng reporter plasmid and 8 ng pCMV--Renilla. After transfection (6 h), cells were re--fed with EMEM/10% FBS containing 1 µM dexamethasone or EtOH. 16--18 hours later, cells were lysed in 100 µl lysis buffer and luciferase activity was measured as described above for U2OS cells. For all experiments, at least three biological replicates were done. dsiRNA knockdown For dsiRNA knockdown experiments, approximately 20.000 U2OS cells were seeded per well of a 48--well plate and transfected the following day with 25 nM dsiRNAs (IDT, sequence listed in Table 1) using Lipofectamine 2000 (Invitrogen). After transfection (6 h), cells were washed once and re--fed with DMEM/5% FBS. To analyze knock--down efficiency, RNA was isolated 48 h past dsiRNA transfection using an RNeasy kit (Qiagen) and analyzed by Quantitative Real Time PCR. To measure the effect of the knockdown on luciferase reporter activity, 24 h after dsiRNA--treatment, cells were transiently transfected with luciferase reporter plasmids, treated with hormone and luciferase activity was measured as described (Meijsing et al. 2009). To measure the effect on GR-dependent regulation of endogenous target genes, U2OS cells were treated overnight with 1 µM dexamethasone or ethanol as vehicle control 48 h after dsiRNA transfection. RNA was isolated and analyzed by Quantitative Real Time PCR. Total RNA of vehicle control or hormone--treated (dexamethasone, 1 µM for 4h) IMR90 cells was purified using an RNeasy kit (Qiagen) or a NucleoSpin RNA kit (Macherey--Nagel). Biotin--labeled cRNA was synthesized from 500 ng total RNA for 4 biological replicates for each condition using the TotalPrep RNA amplification Kit (Ambion) according to the manufacturer's instructions. The labeled cDNA was hybridized to HumanHT--12 v3 BeadChip (Illumina). Following washing and staining, the BeadChip were scanned using the Illumina BeadStation 500. Pre--processing and differential expression analysis were done in R using the beadarray package (Dunning et al. 2007), using the "summarize" and "normaliseIllumina" functions and the quantile normalization method. Differentially expressed genes among different samples were identified based on the moderated t--test implemented in the limma package (Smyth 2004). The data were deposited in ArrayExpress (accession number E--MTAB--2954).

ChIP--exo processing
The uniquely mapped reads with BWA were directly obtained from the Peconic company as BAM files. Reads for the publicly available datasets were downloaded from ENA (www.ebi.ac.uk/ena/) (CTCF: SRA accession SRA044886, replicate 3 SRR346403 (Rhee and Pugh 2011), ESR1: ERP003828 (Serandour et al. 2013), FOXA1: ERR336963 (Serandour et al. 2013). ESR1 and FOXA1 reads were aligned with Bowtie 1 retaining only uniquely mapped reads (--m 1 -v 2). The CTCF dataset was processed similarly to the original study (Rhee and Pugh 2011): the reads were mapped with Bowtie 1 in colorspace, retaining uniquely mapped reads (--v 3 -m 1). Unmapped reads were trimmed by 6 bp from 3' end and were mapped again. ChIP--seq and ChIP--exo data were deposited in ArrayExpress and ENA (accession number E--MTAB--2956). Fraction of ChIP--seq peaks with GBS GR ChIP--seq peak sequences (+/--50 bp around the peak summit) were scanned with the JASPAR motif MA0113.2 (Mathelier et al. 2014) for GR, using the program RSAT matrix--scan (Turatsinze et al. 2008;Thomas--Chollier et al. 2011). The background model trained for each cell line on the corresponding peak sequences is a Markov chain of order 1, which accounts for the CpG depletion of vertebrate genomes. To determine which sequence segments are considered as match, we set the threshold on the p--value associated to the weight score. This threshold ranged from 10 --6 (very stringent) to 10 --1 (very loose). As control sequences, the coordinates of GR peaks from all cell lines were randomly shifted into the regions flanking the actual peaks. The flanking regions were defined as 2 kb on each side of the peak after extending the peaks by 200 bp on both sides. This was achieved with slop, flank and shuffle from the BEDTools suite (Quinlan and Hall 2010). As above, the background model was trained on this dataset.

ExoProfiler pipeline
To analyze the local 5' coverage distribution centered on TFBSs, we developed a computational pipeline called ExoProfiler (Fig. 1b), implemented in Python and R. This pipeline is composed of three tools aiming at scanning sequences for TFBS (matrixScanWS.py), performing profile computation (fivePrimeCounter), and finally plotting the computed footprint profiles (exoPlotter.R).

TFBS predictions (matrixScanWS.py):
First, a TFBS coordinates BED file must be obtained from a motif--scanning program. The motifs used for scanning (using matrix--scan) were obtained from a collection of reference motifs (JASPAR November 2013 (Mathelier et al. 2014), vertebrates only, 205 motifs) and from de novo motifs discovered on ChIP--seq peaks sequences with RSAT peak--motifs Thomas--Chollier et al. 2012b) (default parameters, using the four algorithms, 5 motifs per algorithm), both on the complete peak length or on ±30 bp around the peak summit to better benefit from the two algorithms based on positional bias. The results shown are for ± 30 bp around the peak summit. Each motif was given as input to RSAT matrix--scan (Turatsinze et al. 2008;Thomas--Chollier et al. 2011), as described above, with a stringent threshold set on the weight score p--value 10 --4 . For palindromic motifs reported at the same position on both strands, the match associated to the lowest p--value was retained. As control, each motif had its columns randomly permuted ten times independently using RSAT convertmatrix (Thomas--Chollier et al. 2011), which maintain the statistical properties of the original matrix, but not its biological significance. RSAT compare--matrices (Thomas--Chollier et al. 2011) was finally run to ensure that on the one hand the permuted matrices are distinct (--lth Ncor 0.99), and on the other hand not too similar to the original matrix (--lth Ncor 0.4). For the in silico mutated GBS consensus analysis, this TFBS prediction step was replaced by a pattern--matching approach using RSAT dna--pattern (Thomas--Chollier et al. 2011). The patterns were expressed with IUPAC code ; the "mutation" is achieved replacing a chosen letter (e.g. A) by "not this letter" (e.g. B coding for C or G or T). ChIP--exo 5' coverage (fivePrimeCounter.py): It takes as input the mapped reads from a ChIP--exo experiment (BAM format) and TFBS coordinates for a motif of interest (BED format). For each TFBS from the BED file, fivePrimeCounter defines a short region (e.g. +/--30 bp) centred on this TFBS. Within this region, the ChIP--exo coverage is computed as follows: starting from the observation that only the most 5' position of the reads is informative in ChIP--exo data, as it marks the boundary of protection from lambda exonuclease digestion provided by cross--linked proteins, fivePrimeCounter reduces all mapped ChIP--exo mapped reads to their 5'--most base position, generating a count profile for the selected region. The program is fully strand--sensitive, ensuring that forward and reverse read coverages are calculated with respect to the motif orientation: if the TFBS is located on the reverse strand, reads on the direct strand are counted as reverse and reads on the reverse strand as forward, and all counts are adjusted to the correct distance from the motif center (Fig. S1b). Optionally, the program also calculates the consensus sequence of all regions aligned by the motif midpoint, which necessitates as additional input the reference genome in FASTA format. Thanks to the python package HTSeq (Anders et al. 2014), fivePrimeCounter is computationally efficient, processing a typical dataset in a few minutes on a common desktop computer. Plotting footprint profiles (ExoPlotter.R): ExoPlotter first discards regions not covered by at least 5 ChIP--exo reads to offer a better visualisation. The pipeline outputs 4 plots of the short regions centered on motifs, with a companion R script: • A color chart representation, which mainly serves to control that the motifs are correctly aligned and that the regions are not shifted by one base pair, a relatively common error when working with genomic coordinate files. • A heatmap of the 5' coverage combining the forward (blue) and reverse (red) strand. The color intensities are log transformed after a pseudo.count of 1 is added to all 5' coverage counts. • A similar heatmap, ordered after a hierarchical clustering of the ChIP--exo 5'coverage at individual short regions. The distance between individual sites is calculated as follows: After adding a pseudo--count of 1, each 5'coverage count c is log normalized by log(c)/log(c_max)), where c_max is the maximal count for forward or reverse 5' coverage. For each individual site, the coverage count signal is then smoothed along the genomic positions using the 'smooth' function in R with default parameters. The Euclidean distances on the log--normalized and smoothed count vectors is used for hierarchical clustering. • A footprint profile, summing the coverage at each position for all regions, for the forward (blue) and reverse (red) strand. The raw sum is plotted unless the user chooses to add the permuted motif control. In this case, the values are normalized by dividing the counts by the number of motifs matches in the assay and in each permutation. A p--value, determining the significance of the enrichment of ChIP--exo reads around the motif, is calculated using a Wilcoxon rank--sum test. It tests if the total coverage on the short region is significantly higher than on the short regions extracted when using permutated motifs. For all these plots, there is no shift in the position of the reads.

Differences between subsampled profiles
To compare profiles between degenerated motifs (Fig. 3) or between cell lines (Fig S3), new profiles were produced with the same numbers of sites. Multiple random subsets were drawn out of all sites contributing to each profile to confirm similarity of subsampled profiles. For degenerated motifs, the difference between the full 8 constrained profile and the subsampled profile is represented as a plot over a heatmap, separating the forward and reverse strands. To calculate the difference, subsampled and full profiles were first divided by their respective maximum values for normalization, then the full profile is subtracted from the subsampled profile values. K--means clustering of ChIP--exo footprints Individual sites were clustered using K--means clustering with k=4 clusters and 100 restarts with the function 'kmeans' from the 'stats' package in R. The distance between individual sites is calculated as follows: After adding a pseudo--count of 1, each 5'coverage count value c at binding site s is log normalized by the total counts of all positions of site s log(c)/log(C_s), where C_s is the sum of all counts for each position in s for forward or reverse 5' coverage. For each site, the coverage count signal is then smoothed along the genomic positions using the 'smooth' function in R with default parameters. The Euclidean distances on the log--normalized and smoothed count vectors is used for K--means clustering.

Structural alignment
Structural alignments of protein and DNA complexes were obtained as follows: A structural model of a DNA hybrid sequence was generated using 3D--Dart (van Dijk and Bonvin 2009). The hybrid sequence always consisted of the GR half site AGAACA and the binding motif of the alignment partner. The latter was derived from the corresponding PDB file and comprised the JASPER consensus sequence (Mathelier et al. 2014). For instance, a hybrid sequence for the GR:ETS1--DNA complex consisted of the 5'--CAG ATT TCC GGC ACT--3' motif of the ETS1 structure (PDB entry 1K79) and the 5'--AGA ACA CCC TGT TCT--3' for GR (PBD entry 3G6U), comprising ETS1 binding site TTCC and GR half site AGAACA. An overview of all hybrid sequences used is given in table 3. GR and potential interaction partner binding motifs were aligned using the CE--align algorithm (Jia et al. 2004) to the 3D--DART DNA model of the hybrid sequence. A complete list of structures used for alignment is provided in table 3.