Extreme HOT regions are CpG-dense promoters in C. elegans and humans

  1. Julie Ahringer1
  1. The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge CB3 0DH, United Kingdom

    Abstract

    Most vertebrate promoters lie in unmethylated CpG-dense islands, whereas methylation of the more sparsely distributed CpGs in the remainder of the genome is thought to contribute to transcriptional repression. Nonmethylated CG dinucleotides are recognized by CXXC finger protein 1 (CXXC1, also known as CFP1), which recruits SETD1A (also known as Set1) methyltransferase for trimethylation of histone H3 lysine 4, an active promoter mark. Genomic regions enriched for CpGs are thought to be either absent or irrelevant in invertebrates that lack DNA methylation, such as C. elegans; however, a CXXC1 ortholog (CFP-1) is present. Here we demonstrate that C. elegans CFP-1 targets promoters with high CpG density, and these promoters are marked by high levels of H3K4me3. Furthermore, as for mammalian promoters, high CpG content is associated with nucleosome depletion irrespective of transcriptional activity. We further show that highly occupied target (HOT) regions identified by the binding of a large number of transcription factors are CpG-rich promoters in C. elegans and human genomes, suggesting that the unusually high factor association at HOT regions may be a consequence of CpG-linked chromatin accessibility. Our results indicate that nonmethylated CpG-dense sequence is a conserved genomic signal that promotes an open chromatin state, targeting by a CXXC1 ortholog, and H3K4me3 modification in both C. elegans and human genomes.

    Transcription is regulated through functional elements in the genome such as promoters and enhancers. Identifying these genomic elements, many of which are recognized by transcription factors (TFs), is a necessary first step in their functional analysis. Toward this goal, the modENCODE and ENCODE Projects have used chromatin immunoprecipitation (ChIP) to map the binding sites for a large number of TFs in C. elegans, Drosophila, and humans (The ENCODE Project Consortium 2007, 2012; Gerstein et al. 2010, 2012; The modENCODE Project Consortium 2010; Nègre et al. 2011; Niu et al. 2011; AP Boyle, CL Araya, C Brdlik, P Cayting, C Cheng, Y Cheng, K Gardner, L Hillier, J Janette, L Jiang, et al., in prep.).

    Most of the identified TF-binding regions are “low occupancy,” showing binding by one or a few different TFs. As expected of enhancers, low-occupancy sites are enriched for DNA sequence motifs recognized by the bound factors, suggesting direct DNA binding (Gerstein et al. 2010; The modENCODE Project Consortium 2010). In contrast, these mapping studies also identified an unusual class of sites bound by the majority of mapped TFs, termed HOT (Highly Occupied Target) regions. The large number of factors associated with HOT regions is incompatible with simultaneous occupancy, and the regions usually lack sequence-specific binding motifs of the associated factors, indicating that they are unlikely to be classical enhancers (Gerstein et al. 2010; The modENCODE Project Consortium 2010). Consistent with this idea, Drosophila HOT regions are depleted for annotated enhancers; however, a significant fraction displays enhancer activity in transgenic analyses (Kvon et al. 2012). Drosophila and human HOT regions have features of open chromatin, such as nucleosome depletion and high turnover, and in C. elegans they are usually located near genes with ubiquitous expression (Gerstein et al. 2010; The modENCODE Project Consortium 2010; Yip et al. 2012). These previous reports suggest that HOT regions are active genomic elements but fail to provide a clear picture of their function, possibly because of differences in the definition and characterization of these regions.

    Other DNA sequence features are also known to play regulatory roles. For example, a hallmark of many mammalian promoters is enrichment for nonmethylated CpG dinucleotides defined as a “CpG island (CGI)” (Bird et al. 1985; Bird 1986; Gardiner-Garden and Frommer 1987; Illingworth and Bird 2009; Deaton and Bird 2011). In contrast to these promoter regions, CpG dinucleotides are sparsely distributed and cytosine-methylated in most other regions in the genome, and differential DNA methylation is thought to be important for the recognition and function of CGIs (Jones et al. 1998; Nan et al. 1998; Cameron et al. 1999; Klose and Bird 2006; Joulie et al. 2010). Promoter-enriched CpGs have not been observed in invertebrates and are believed to be absent in organisms lacking DNA methylation. Indeed, it is widely considered that the enrichment of CpGs at promoters is a vertebrate-specific phenomenon that requires DNA methylation (Duncan and Miller 1980; Gardiner-Garden and Frommer 1987; Cooper and Krawczak 1989; Ehrlich et al. 1990; Antequera and Bird 1999; Caiafa and Zampieri 2005; Illingworth and Bird 2009; Turner et al. 2010; Deaton and Bird 2011). In mammals, nonmethylated CpGs are associated with active promoter chromatin features and bound by CXXC1 (CFP1), which is part of a SETD1A complex that catalyzes methylation of H3K4 (Lee and Skalnik 2005; Ansari et al. 2008; Butler et al. 2008; Tate et al. 2010; Thomson et al. 2010; Xu et al. 2011). Although C. elegans lacks DNA methylation, the C. elegans genome encodes a CXXC1 ortholog, CFP-1, required for global H3K4 trimethylation (Simonet et al. 2007; Li and Kelly 2011), suggesting a possible similarity of function.

    Here we directly compared HOT regions in C. elegans and humans to investigate their functions and ask if these are conserved. We discovered that HOT regions are active promoters enriched for CpG dinucleotides in both organisms. We further show that high CpG density is a feature of many C. elegans promoters, and such promoters are targeted by CFP-1. Our results support a conserved function for promoter-dense CpGs in C. elegans and human genomes.

    Results

    We collected modENCODE and ENCODE TF mapping data sets for 90 C. elegans factors (across developmental stages) and 159 human factors (from multiple cell types) and determined regions of factor overlap (Supplemental Fig. S1; Methods; The ENCODE Project Consortium 2007, 2012; Gerstein et al. 2010, 2012; The modENCODE Project Consortium 2010; Nègre et al. 2011; Niu et al. 2011; AP Boyle, CL Araya, C Brdlik, P Cayting, C Cheng, Y Cheng, K Gardner, L Hillier, J Janette, L Jiang, et al., in prep). Within each region of TF overlap, we further defined a minimal core region with the highest factor occupancy. This identified 35,062 C. elegans regions bound by 1–87 factors and 737,151 human regions bound by 1–138 factors.

    HOT regions are promoters

    We first examined chromatin modifications at TF binding regions. We ranked TF binding regions by factor occupancy (high to low) and plotted H3K4me3 and H3K4me1 as a heat map. As expected, in both organisms we find that the chromatin of low-occupancy regions has features typical of enhancers (H3K4me3low/H3K4me1high) (Fig. 1). In contrast, as the number of factors increases, the chromatin signature becomes more promoter-like (H3K4me3high/H3K4me1low). The majority of TF binding regions in the top 5% of occupancy have a promoter-like signature in both organisms, suggesting that they are located at promoters. Consistent with this idea, high-occupancy TF binding regions more often overlap with proximal promoter regions compared to low-occupancy regions (Fig. 1).

    Figure 1.

    HOT regions display promoter features. H3K4me3 and H3K4me1 signals plotted at the centers of core TF overlap regions ranked by the indicated percentile of TF occupancy in humans and C. elegans. (A) For human TF overlap regions, 7419 (top 1%), 30, 945 (top 2%–5%), a random selection of 100,000 of 241,975 (top 6%–30%), and 456,812 (bottom 70%) regions are plotted. (B) For C. elegans TF overlap regions, 376 (top 1%), 1429 (top 2%–5%), 9721 (top 6%–30%), and 23,536 (bottom 70%) regions are plotted. In each TF overlap band, the number of TFs and the percentage of TF core midpoints ±500 bp of a TSS are indicated. Scales show linear (human) or log2 (C. elegans) input normalized signal ranges.

    To increase the power of finding similarities between HOT regions within and between organisms, we analyzed HOT regions in the top 1% of occupancy (more than 64 factors in C. elegans and more than 57 factors in humans) since these display the strongest enrichment for promoter-like chromatin features in both organisms. For comparison, we define “COLD” low-occupancy regions as those with single-factor binding (representing 33.6% of regions in C. elegans and 30.8% in humans).

    The preceding findings suggest that HOT regions may be active promoters. In C. elegans, the annotated transcript start is often the start of the mature transcript, not the start of transcription, because the primary 5′ end is removed and degraded following trans-splicing (Bektesh and Hirsh 1988; Blumenthal 1995, 2012; Allen et al. 2011). To directly investigate whether core HOT regions are functional promoters, we chose 10 C. elegans core HOT regions of 241–525 bp located at various distances upstream of the nearest annotated transcript start (150 bp to 4.7 kb upstream) and cloned each one directly upstream of a GFP reporter gene (Fig. 2A). All 10 tested HOT regions drove GFP, with most showing widespread expression (Fig. 2B; Supplemental Table S1), indicating that the short tested regions are widely active promoters.

    Figure 2.

    C. elegans HOT regions are functional promoters. (A) The indicated HOT regions (orange) were cloned directly upstream of a histone∷GFP fusion gene and examined for the expression of GFP using transgenic reporter assays. (B) Ten of ten assayed regions drove GFP expression. The expression in representative larvae stage 4 worms is shown. Coordinates of cloned regions and further information on expression patterns are shown in Supplemental Table S1.

    We also found that genes proximal to HOT regions are usually widely expressed. In C. elegans, 78% of genes associated with HOT regions are expressed in all tissues assayed using gene expression profiling (Spencer et al. 2011). Similarly, most (91%) human genes associated with HOT regions are expressed in all examined cell types in gene expression analyses from the ENCODE Project (see Methods). Taken together, these findings support the view that that most C. elegans and human HOT regions are ubiquitously active core promoters.

    CpG dinucleotides are enriched in C. elegans and human HOT regions

    We next searched for common sequence characteristics of C. elegans and human HOT regions by analyzing their composition of mono- and dinucleotides. In both organisms, HOT regions have high GC content (Fig. 3A). Remarkably, C. elegans and human HOT regions show similar patterns of dinucleotide enrichment and depletion, with CG dinucleotides showing the highest enrichment in both organisms (Fig. 3A). To further investigate the distribution of CG dinucleotides at HOT and COLD regions, we plotted CpG density across these regions, which showed prominent peaks around the midpoints of HOT regions (Fig. 3B,C, upper panel). In contrast, COLD regions do not display these patterns (Fig. 3B,C, upper panel). In addition, we find that HOT regions but not COLD regions show strong nucleosome depletion (Fig. 3C, bottom panel).

    Figure 3.

    HOT regions are enriched for CpG dinucleotides and depleted for nucleosomes. (A) Frequency of the indicated mono- and dinucleotides in HOT (red) and COLD (blue) regions is shown relative to the genome-wide frequency scaled to one (black horizontal line). (B) Heat maps showing the distribution of CpG density in ranked TF overlap regions in human and C. elegans. The color scheme shows the scale (0 to 15) for CpG content in a 200-bp window. (C) The distribution of CpG density (top) and nucleosomes (bottom) was plotted for HOT (red) and COLD (blue) regions in the human and C. elegans genomes. Lines show mean signal, darker filled areas show standard error, and lighter filled areas are 95% confidence intervals. All plots show 2-kb regions centered at the midpoint of core regions.

    The finding that HOT regions are at core promoters raised the possibility that CpG-dense sequences may be a shared promoter feature in C. elegans and human genomes. As described above, CpG enrichment is a known property of the majority of human promoters (Gardiner-Garden and Frommer 1987; Illingworth and Bird 2009). In these “CpG islands,” the cytosines of CpGs are typically nonmethylated; whereas CpGs are broadly depleted from most regions of the human genome, and nonpromoter CpGs are usually cytosine methylated. Since DNA methylation differences are thought to underlie the recognition and function of CpGs as well as the mutation of cytosine that drives depletion of bulk CpGs, it was unexpected to observe enrichment of CpGs at promoters in the C. elegans genome, where DNA methylation and CpG depletion are both absent.

    The preceding analyses only examined CpG enrichment at narrowly defined HOT regions. To ask whether CpG dinucleotides might be more widely found at C. elegans promoters, we plotted CpG density and observed/expected CpG content (normalizing for local GC content) across protein-coding promoters and found clear CpG enrichment just upstream of the transcription start site (TSS) in many C. elegans promoters, with a narrower distribution in C. elegans than in human (Supplemental Figs. S2, S3; heat map in Fig. 5D, see below). In both humans and C. elegans, genes with CpG-rich promoters more frequently show ubiquitous expression compared to those with CpG-poor promoters (in human 68.6% of high CpG and 2.3% of low CpG; in C. elegans 56.3% of high CpG and 21.8% of low CpG; see Methods) (Schug et al. 2005; Ramskold et al. 2009). Taken together, these findings suggest that CpG dinucleotides are a feature of widely active C. elegans promoters, as they are in vertebrates.

    Promoter-dense CpG regions are associated with nucleosome depletion

    Mammalian CpG-rich promoters show strong nucleosome depletion that appears at least partially independent of transcriptional activity, suggesting that CpG density plays a role in promoter accessibility (Fenouil et al. 2012; Vavouri and Lehner 2012). We therefore investigated the relationship between CpG density and chromatin accessibility at C. elegans promoters.

    We first separated promoters of highly expressed genes into those with high and low CpG density and compared their nucleosome distribution patterns. For this analysis, we used ubiquitously active C. elegans promoters to avoid effects due to tissue-specific regulation. We found that among these highly active promoters, those with high CpG density are strongly nucleosome depleted compared to those with low CpG density (Fig. 4A). We obtained similar results after normalizing CpG content to local GC content (Supplemental Fig. S4A). Therefore, high CpG density but not high transcriptional activity is associated with promoter accessibility. To test whether high GC content rather than high CpG density was related to nucleosome depletion, we separated high GC content promoters into those with high or low CpG content and plotted their nucleosome distributions. Promoters with high CpG content show stronger nucleosome depletion than those with low CpG content, indicating that accessibility is linked with high CpG rather than generally high GC content (Supplemental Fig. S4B,C).

    Figure 4.

    A promoter harboring a CpG dense region is associated with nucleosome depletion in C. elegans. (A) Plots of mononucleosome and CpG distributions across promoters of ubiquitously expressed genes in the top 20% of expression band, separated by high CpG density (red, top 20%) or low CpG density (dark gray, bottom 20%). (B) Mononucleosome and CpG distributions were analyzed for ubiquitously active promoters within 20% of CpG density (protein coding promoters) and separated into the top 20% (red) or second 20% (blue) of expression. (C) The distribution of mononucleosomes and CpGs across promoters with the top 20% CpG density separated into those with high (top 20%, pink), medium (middle 20%, blue), or low/no expression (bottom 40%, orange). Lines show mean signal, darker filled areas show standard error, and lighter filled areas are 95% confidence intervals.

    To investigate whether the level of transcriptional activity affects the nucleosome distribution pattern at promoters, we separated ubiquitously active promoters with high CpG density into different gene expression bands. Ubiquitously active genes are relatively highly expressed, with most (94%) falling into the top 40% of expression when considering all genes. We separated these genes into those in the top and bottom of this range (top 20% and second 20% of expression) and compared their nucleosome distributions. The two gene expression classes both show strong nucleosome depletion (Fig. 4B), suggesting that among ubiquitously expressed genes, the level of transcriptional activity has little effect on the level of nucleosome depletion at promoters with high CpG density.

    Finally, to investigate whether high CpG density per se can drive nucleosome depletion, we separated all high CpG density promoters into different expression bands. Genes in the middle and bottom expression bands are enriched for tissue-specific expression, whereas those in the top expression band are enriched for being ubiquitously expressed. Promoters of all three expression bands display clear nucleosome depletion. However, we found that the level of depletion is lower at promoters of genes in the low/no and middle expression bands compared to those of high expression (Fig. 4C). This difference suggests that high CpG density alone is not sufficient for strong nucleosome depletion, and that additional regulation can counteract the effects of CpG-rich sequences in promoting chromatin accessibility.

    C. elegans CFP-1 is targeted to CpG-rich promoters marked by H3K4me3

    In higher vertebrates, nonmethylated CpGs are directly recognized by the CXXC domain of CXXC1 (CFP1), which leads to trimethylation of lysine 4 of histone H3 through recruitment of the SETD1A histone methyltransferase (Lee and Skalnik 2005; Tate et al. 2010; Thomson et al. 2010). Despite the lack of DNA methylation in C. elegans (Simpson et al. 1986), a CXXC1 ortholog (CFP-1) containing a conserved CXXC domain exists (highlighted in Supplemental Fig. S5). In addition, it was previously demonstrated that C. elegans CFP-1 is required for global H3K4me3 levels (Simonet et al. 2007; Li and Kelly 2011). These findings raise the possibility that C. elegans CFP-1 might target promoter regions of high CpG density, as in mammals.

    To test this hypothesis, we mapped the binding sites of CFP-1 by chromatin immunoprecipitation (ChIP) assays of GFP-tagged CFP-1. We find that CpG di-nucleotides are enriched at CFP-1 binding sites and are densely distributed at CFP-1 peaks (Supplemental Fig. S6). Most C. elegans CFP-1 binding sites are at promoters and overlap regions marked by H3K4me3 (Fig. 5A–C), consistent with the requirement for CFP-1 in generating H3K4me3. Furthermore, the signal intensities of both CFP-1 and H3K4me3 at promoters show a striking concordance with CpG density (Fig. 5D,E). The CFP-1 peaks not near annotated 5′ ends also harbor high CpG density and promoter-like chromatin signatures (H3K4me3high/H3K4me1low), suggesting that they might identify unknown promoters (Supplemental Fig. S7A). In contrast, little CFP-1 signal is found at COLD regions, which usually have an enhancer-like chromatin signature (Supplemental Fig. S7B). We also find that highly expressed genes with high levels of CFP-1 (top 20%) show H3K4me3 marking and nucleosome depletion, but highly expressed genes with little CFP1 (bottom 20%) do not, suggesting that CFP-1 binding is important for these patterns (Fig. 5E). Furthermore, as with high CpG density, we found that promoters with high CFP-1 levels show nucleosome depletion and high H3K4me3 marking, which was observed in promoters linked with genes in both the top 20 and second 20% expression bands (Fig. 5F). We conclude that targeting of CXXC1 orthologs to nonmethylated CpG-dense promoters and the relationship with H3K4me3 marking is conserved between C. elegans and humans.

    Figure 5.

    C. elegans CFP-1 is targeted to CpG-rich promoters marked by H3K4me3. (A) A representative screen shot for the distribution of CFP-1 (red) and H3K4me3 (blue) normalized ChIP signals in late embryos. (B) Venn diagram showing the overlap between CFP-1 and H3K4me3 ChIP ChIP-seq peaks. Numbers in the overlap region are not equal because single peaks in one data set may overlap more than one peak in the other data set. (C) Percentage of CFP-1 peaks overlapping promoter regions (orange; ±500 bp of coding TSSs) (Chen et al. 2013; Kruesi et al. 2013), gene bodies (green), and the remaining genomic regions (blue). (D) Signal distributions for CFP-1, H3K4me3, and CpG density plotted in 2-kb windows centered by coding TSSs identified in Chen et al. (2013) in heat maps ranked by the density of promoter CpGs. The color scheme shows the scale (0–15) for the indicated signal. (E) Plots of CFP-1 ChIP signal, CpG content, mononucleosome pattern, and H3K4me3 signal across promoters of active ubiquitous genes in the top 20% of expression with high (red, top 20%), or low (dark gray, bottom 20%) CFP-1 occupancy. (F) Plots of CFP-1 ChIP signal, CpG content, mononucleosome pattern, and H3K4me3 ChIP signal across ubiquitous promoters highly targeted by CFP-1 (top 20%) and with high (red, top 20%) or low (blue, second 20%) expression levels. All plots display the indicated features in 2-kb windows centered at TSSs. Lines show mean signal, darker filled areas show standard error, and lighter filled areas are 95% confidence intervals.

    Discussion

    Our results support the view that promoter CpG dinucleotides may function as an ancient conserved promoter signal. Many C. elegans and human promoters show local CpG enrichment and are targeted by a nonmethyl CpG binding protein 1 ortholog (Lee and Skalnik 2005 and this work), and extreme HOT regions in both organisms are CpG-rich promoters. Furthermore, C. elegans CpG-dense promoters are strongly nucleosome depleted in vivo, as in mammals (Fenouil et al. 2012; Vavouri and Lehner 2012).

    How CpG-rich sequences function in promoter accessibility is unclear. One possibility is that the DNA sequence itself initiates accessibility, as it has been observed that selected human CpG-rich promoter sequences poorly assemble nucleosomes in vitro (Ramirez-Carrozzi et al. 2009), and large CGIs show low in vitro occupancy (Fenouil et al. 2012). If CpG-dense sequence does intrinsically disfavor nucleosome assembly, this could make the regions more available for binding to factors such as CXXC1 (CFP1). However, in contrast to the findings of Ramirez-Carrozzi et al. (2009), global in vitro analyses that mapped intrinsically favored nucleosome positions in genomic DNA showed that human CpG-rich promoters are on average enriched for nucleosome occupancy (Valouev et al. 2011). Similarly, we also find that C. elegans CpG-rich promoters display strong nucleosome occupancy in in vitro assembly data sets (Supplemental Fig. S8; Locke et al. 2013).

    These conflicting in vitro results regarding CpG-rich sequences could be due to differences in the assay conditions used for nucleosome assembly in vitro. Ramirez-Carrozzi et al. (2009) directly tested nucleosome assembly on 300-bp fragments, whereas the global in vitro studies (Valouev et al. 2011; Locke et al. 2013) assembled nucleosomes on larger fragmented genomic DNAs before micrococcal nuclease treatment to isolate mononucleosomes. Another study (Zhang et al. 2011) showed that conditions used for nucleosome reconstitution can strongly affect the results obtained, as more in vivo-like nucleosome positions are observed when a salt dialysis method is applied during nucleosome reassembly. Interestingly, the addition of whole-cell extract and ATP in the absence of transcription nearly reconstituted in vivo patterns (Zhang et al. 2011), supporting a role for external factors in nucleosome positioning and occupancy. Further investigations will be required to elucidate the roles of CpG sequences and non-nucleosome factors in determining nucleosome patterns at promoters.

    Irrespective of such future studies, our findings suggest a paradigm whereby CpG-rich sequences promote chromatin accessibility, either intrinsically or through the activities of other factors, which facilitates the formation of an active promoter with an open chromatin state. Our results also provide a plausible explanation for the high occupancy of transcription factors at HOT regions. High CpG density at these regions might create and/or maintain highly accessible regions that are available for interactions with TFs and other factors.

    The finding that C. elegans and human HOT regions are active promoters may seem at odds with the finding that Drosophila HOT regions can function as enhancers in transgenic assays (Kvon et al. 2012). This variance may be partly explained by differences in how HOT regions were defined. We analyzed extreme HOT regions in the top 1% of occupancy, whereas Kvon et al. (2012) studied those in the top 5% of occupancy. Nevertheless, we find that Drosophila TF binding regions in the top 5% of occupancy are enriched for having promoter-like chromatin signatures and frequently overlap transcription start sites (Supplemental Fig. S9A), similar to C. elegans and human HOT regions. We also note that the HOT regions assayed in the Drosophila transgenic study (Kvon et al. 2012) were relatively large (a median length of 2.1 kb) and therefore may harbor multiple regulatory elements. As activity assays were not conducted in the absence of a basal promoter, it is possible that the tested Drosophila regions might contain active promoters as well as enhancers. Interestingly, in contrast to C. elegans and human, we observed only a very small enrichment for CpG dinucleotides in Drosophila HOT regions, with similar enrichment in COLD regions (Supplemental Fig. S9B), suggesting that CpG dinucleotides may not be relevant at Drosophila HOT regions.

    After CpG, the GC dinucleotide shows the second highest enrichment in human and C. elegans HOT regions (Fig. 3), suggesting that GpC might play an uncharacterized role at promoters. CpG shows much higher enrichment in humans than in C. elegans, whereas GpC enrichment is similar in the two organisms. This difference might be a consequence of the global depletion of CpG in the nonpromoter regions of the human genome, which is not seen in C. elegans. A possible role for GpC dinucleotides at promoters remains to be characterized.

    Computational CpG island predictions, which identify DNA sequences with the highest relative CpG density, do not highly overlap promoters in cold-blooded vertebrates (Long et al. 2013) or in C. elegans (Supplemental Table S4; Wu et al. 2010). Therefore, CpG density alone is unlikely to be the sole signal for an open chromatin state at promoters. Consistent with this, purification of nonmethylated DNA via CXXC binding identifies promoters across a range of vertebrate genomes (Long et al. 2013), suggesting that differential DNA methylation may be important for recognition of these regions. As C. elegans lacks DNA methylation, recognition of nonmethylated CpG-rich sequence must involve alternative mechanisms in this organism. An important goal for the future will be to understand how these regions are identified. Given the conservation of CpG promoter enrichment, it seems plausible that mechanisms not involving DNA methylation may also operate in animals harboring DNA methylation.

    Methods

    Defining the overlaps of transcription factor binding sites

    We used modENCODE and ENCODE collections of TF mapping data for 90 C. elegans factors, 54 D. melanogaster factors, and 159 human factors (The ENCODE Project Consortium 2007, 2012; Gerstein et al. 2010, 2012; The modENCODE Project Consortium 2010; Nègre et al. 2011; Niu et al. 2011; AP Boyle, CL Araya, C Brdlik, P Cayting, C Cheng, Y Cheng, K Gardner, L Hillier, J Janette, L Jiang, et al., in prep). Data sets for a given factor were merged into single files to prevent double counting of factors; then at every base, the number of factors where a peak was present was counted. The occupancy of each region is the number of unique factors found inside. For each region of TF overlap, the “core region” (Supplemental Fig. S1) was defined as the range with local maxima of peak call coverage. This process identified 35,062 C. elegans regions bound by 1–87 factors, 32,168 D. melanogaster by 1–37 factors, and 73,7151 human regions bound by 1–138 factors (Supplemental File S1). For all analyses, we defined HOT regions as those in the top 1% of occupancy (376 for C. elegans, 341 for D. melanogaster, and 7419 for humans) and “COLD” regions as single factor binding regions (13,425 for C. elegans, 18,177 for D. melanogaster, and 314,323 for humans) (Supplemental File S2). The following genome versions were used: hg19 for human, ce10/WS220 for C. elegans, and dm3 for Drosophila.

    HOT and COLD core regions were scored as overlapping a promoter defined by ±500 bp of a transcript start site. Coding transcript start sites were downloaded from Ensembl genes 71 (human (GRCh37.p10) and D. melanogaster (BDGP5). For C. elegans, we collected and pooled all coding TSSs recently identified based on capped RNA sequencing (Chen et al. 2013; Kruesi et al. 2013).

    Generation and analysis of HOT-region transgenic lines

    Ten HOT core sites were PCR amplified from N2 genomic DNA using Phusion polymerase (Finnzymes) and Gateway cloned (Invitrogen) into pDONR221 (regions given in Supplemental Table S1). To create transgenes to test whether HOT regions could function as promoters, MultiSite Gateway cloning (Invitrogen) was used to recombine the HOT regions upstream of his-58 (pJA272) and gfp∷tbb-2 3′UTR (pJA256) sequences on the MosSCI compatible vector pCFJ150, which targets Mos site Mos1(ttTi5605) chrII (Zeiser et al. 2011). C. elegans MosSCI lines were generated as described (Frøkjær-Jensen et al. 2008), injecting strain EG6699 with injection mixes that contained pCFJ103(40 ng/µL), pCFJ90(5 ng/µL), pCFJ104(2.5 ng/µL), and expression clones at 40 ng/µL. Supplemental Table S2 lists all strains generated in this study. All strains were used and cultured using standard methods (Brenner 1974). Transgenic strains were grown at 25°C prior to microscopic examinations. Young adult or L4 stage worms were sedated in 5 mM Tetramisole, aligned, and scanned in groups at controlled laser and scanning settings. A full list of primers used for generation of pDONR221 promoter constructs can be found in Supplemental Table S3.

    Generation of GFP-tagged CFP-1 (JA1597)

    The coding region of F52B11.1a (cfp-1) was PCR amplified from N2 genomic DNA using Phusion polymerase (Finnzymes) and Gateway cloned into pDONR221. The cfp-1 coding region was then recombined into the MosSCI compatible vector pCFJ201 (which targets Mos site Mos1[cxTi10882] chrIV) downstream from the dpy-30 promoter and upstream of gfp∷tbb-2 3′UTR (Zeiser et al. 2011).

    Chromatin immunoprecipitation (ChIP)

    Late embryos were obtained as in Vielle et al. (2012) by aging embryos collected by hypochlorite treatment 3.5 h prior to flash freezing in liquid nitrogen. Formaldehyde-fixed chromatin extracts and chromatin immunoprecipitations were as in Kolasinska-Zwierz et al. (2009) except that DNA was sonicated to a size range of 200–400 bp. ChIP assays were performed in 1 mL extract (1 mg protein) in FA buffer with 10 µg of anti-GFP rabbit serum (Abcam ab290) or anti-H3K4me3 (Abcam ab8580); two biological replicates were performed for each antibody. DNA sequencing libraries were constructed using the Illumina TruSeq sequencing kit and were sequenced on the Illumina platform.

    Data sets, processing, and visualization

    C. elegans ChIP-seq data H3K4me3 (modENCODE_5166) and H3K4me1 (modENCODE_5158) and Drosophila ChIP-seq data H3K4me3 (modENCODE_789) and H3K4me1 (modENCODE_777) were obtained from modENCODE (http://www.modencode.org/). MNase digested mononucleosome data for C. elegans embryos (GSM468574) (Ooi et al. 2010) and human K562 cells (GSM920557) (The ENCODE Project Consortium 2012) were downloaded from the Gene Expression Omnibus. H3K4me3 (wgEncodeBroadHistoneHepg2H3k4me3StdSig.bigWig) and H3K4me1 (wgEncodeBroadHistoneHepg2H3k04me1StdSig.bigWig) ChIP-seq data in HepG2 cells were obtained from the ENCODE Project (The ENCODE Project Consortium 2012; AP Boyle, CL Araya, C Brdlik, P Cayting, C Cheng, Y Cheng, K Gardner, L Hillier, J Janette, L Jiang, et al., in prep) (http://genome.ucsc.edu/ENCODE/downloads.html). TF ChIP-seq data sets used in the HOT region study are listed in Supplemental File S3 and can be downloaded from http://anshul.kundaje.net/projects/modencode (for C. elegans and Drosophila) and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/ (for human).

    CpG density, GC content, and observed over expected CpG ratio tracks were calculated in sliding windows of 200-bp size and 1-bp shift, reporting calculated values in the middle of the corresponding range. CpG promoter content was calculated at −200 bp to C. elegans TSSs and ±500 bp of human TSSs. These calculated CpG values for all protein coding promoters—n = 10,106 for C. elegans from Chen et al. (2013) and n = 129,604 for human from Ensembl GRCh37.p12—were ranked to extract promoters harboring the indicated percentile of CpG content.

    Ubiquitously expressed genes in C. elegans were defined as those showing expression (FDR <0.05) in embryonic gut, pan neuron, body wall muscle, germline, and hypodermis in tissue-specific RNA-seq profiling data sets (Spencer et al. 2011). Human ubiquitously expressed genes were defined as those detectably expressed in all examined cell lines (n = 34) in human RNA-seq data sets from the ENCODE Project (The ENCODE Project Consortium 2011) data from http://genome.crg.es/encode_RNA_dashboard/hg19/. Genes with high CpG promoters were those having a promoter in the top 20% CpG band but no promoter in the bottom 80% CpG band (n = 2215 for C. elegans and n = 7710 for human). Genes with low CpG promoters are those having a promoter in the bottom 20% CpG band but no promoter in the top 80% CpG band (n = 1764 for C. elegans and n = 1834 for human genes).

    CFP-1 and H3K4me3 ChIP-seq data sets used in Figure 5 were aligned using BWA with default settings (Li and Durbin 2009), normalized using BEADS (Cheung et al. 2011), then converted to ratios of BEADS scores (enrichment relative to input) and then Z-scored. Peaks were called using MACS v. 2.0.10 software (Feng et al. 2011) with 1 × 10−10 P-value cutoff. Peak calls from each replicate were intersected, and regions present in both were kept (Supplemental File S4). Heat maps were generated using the Cistrome heatmap function (Liu et al. 2011). The IGV Genome Browser was applied for visualization (Robinson et al. 2011).

    Data access

    The ChIP-seq data for C. elegans CFP-1 and H3K4me3 tracks used in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE49870.

    Acknowledgments

    We thank M. Chesney for helpful comments on the manuscript. R.A.-J.C., P.S., E.Z., and J.A. were supported by a Wellcome Trust Senior Research Fellowship to J.A. (054523) and T.A.D. by a Wellcome Trust Research Career Development Fellowship. S.K.F. was supported by a Gates Cambridge Scholarship. J.A. and T.A.D. also acknowledge support by core funding from the Wellcome Trust and Cancer Research, UK.

    Footnotes

    • 1 Corresponding author

      E-mail ja219{at}cam.ac.uk

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.161992.113.

      Freely available online through the Genome Research Open Access option.

    • Received June 13, 2013.
    • Accepted December 26, 2013.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server