Divergent functions of hematopoietic transcription factors in lineage priming and differentiation during erythro-megakaryopoiesis

Combinatorial actions of relatively few transcription factors control hematopoietic differentiation. To investigate this process in erythro-megakaryopoiesis, we correlated the genome-wide chromatin occupancy signatures of four master hematopoietic transcription factors (GATA1, GATA2, TAL1, and FLI1) and three diagnostic histone modification marks with the gene expression changes that occur during development of primary cultured megakaryocytes (MEG) and primary erythroblasts (ERY) from murine fetal liver hematopoietic stem/progenitor cells. We identified a robust, genome-wide mechanism of MEG-specific lineage priming by a previously described stem/progenitor cell-expressed transcription factor heptad (GATA2, LYL1, TAL1, FLI1, ERG, RUNX1, LMO2) binding to MEG-associated cis-regulatory modules (CRMs) in multipotential progenitors. This is followed by genome-wide GATA factor switching that mediates further induction of MEG-specific genes following lineage commitment. Interaction between GATA and ETS factors appears to be a key determinant of these processes. In contrast, ERY-specific lineage priming is biased toward GATA2-independent mechanisms. In addition to its role in MEG lineage priming, GATA2 plays an extensive role in late megakaryopoiesis as a transcriptional repressor at loci defined by a specific DNA signature. Our findings reveal important new insights into how ERY and MEG lineages arise from a common bipotential progenitor via overlapping and divergent functions of shared hematopoietic transcription factors.


Cell culture and purification
Fetal livers were harvested from E14.5 pregnant mice and used as a source of all cell populations in this study. For purification of primary mouse fetal erythroblasts, the fetal liver cell suspension was labeled with a PE-conjugated anti-Ter119 antibody. EasySep PE Selection Kit (#18554) was then used to purify Ter119-positive, PE-labeled erythroblasts.
For MEG production, we used the mouse CD117 (cKit) EasySep selection kit (#18757) to purify cKit-positive hematopoietic progenitors in full accordance with the manufacturer's recommendations. The cKit-positive cells were incubated for 4-6 days in a cell culture medium containing mouse stem cell factor (mSCF) and thrombopoietin (TPO) to promote expansion of MEG progenitors. The cells were then washed and incubated for 5 more days in a cell culture medium in the presence of TPO alone to promote terminal MEG differentiation. EasySep PE Selection Kit (#18554) was used to purify CD14-positive MEG labeled with a PE-conjugated anti-CD41 antibody.
For purification of HSPC, the E14.5 fetal liver cell suspension was enriched for hematopoietic progenitors using EasySep Hematopoietic Progenitor Cell Enrichment Kit (#19756). Briefly, fetal liver cell suspension was stained with biotinylated antibodies against lineage antigens (CD5, CD11b, CD19, CD45R, Ly-6G/C, Ter119, CD71), followed by removal of lineage-positive cells using a magnetic bead protocol. The resulting population was stained with an APC-conjugated anti-Sca-1 antibody, and the Sca1-positive hematopoietic progenitors (HSPC) were purified by FACS sorting. Nonviable cells and cellular debris were excluded based on forward and side scatter characteristics.

Transcriptome analysis
mRNA was extracted with phenol-chlorophorm and further purified using Qiagen RNeasy purification kit. GeneChip® Mouse Gene 1.0 ST Arrays were used to interrogate genome-wide mRNA expression of HSPC, MEG and ERY cells using 4 biological replicates for each cell type. The raw hybridization data were corrected for background, quantile-normalized and log2-transformed using RMA (Irizarry et al., 2003).
Probesets were then filtered to keep only those that had expression values more than 8.69 for at least 1 sample. This cutoff value was calculated as 95th percentile of the distribution of expression values for negative control probes across all samples. The filtering resulted in 8684 probesets. All probesets were then tested for differential expression between each pair of groups and corrected for multiple testing by the Benjamini-Hochberg approach. Then, we removed 1171 probesets to keep only those that represented unique transcripts, preferentially discarding redundant probesets with the smallest significant fold change between the three cell types. The filtering resulted in a final list of 7513 probesets significantly expressed in at least one sample of at least one cell type. We used the HSPC transcriptome as a reference point to define the developmental changes in gene expression during mono-lineage differentiation. For MEG vs. HSPC and ERY vs. HSPC comparisons, genes that significantly (FDR < 5%) changed more than 2-fold relative to the expression level in HSPC were considered to be developmentally up-or down-regulated. Genes whose expression was altered insignificantly (nominal p > 0.05) or less than 1.2-fold were considered to be unchanged ( Figure 2A). Genes that changed between 1.2 and 2-fold were considered indeterminable and not examined further in this study. Differential expression of selected signature genes was confirmed by TaqMan RT-PCR (Supplemental Figure7).
A heatmap for a list of genes was composed using hierarchical clustering with Spearman correlation distance to cluster genes (Supplemental Figure 7A), or using predefined gene clustering with genes separated into 9 groups according to the bilineage MEG/ERY pattern of expression relative to that of HPSC ( Figure 3A). Heatmap color intensities were proportional to the value calculated as a ratio between the gene expression in a single sample and the geometric mean expression of the gene across a set of samples.

Chromatin immunoprecipitation and massive parallel DNA sequencing
For each transcription factor-cell type combination, ChIP-seq was performed on two to four biological replicates. We used the approach of irreproducible discovery rate at a threshold of 0.02 to determine the number n of reproducible peaks in the replicate datasets (Landt et al., 2012;Li et al., 2011). Peaks were called on the combined reads from all replicates and the top n peaks were taken as the set of high confidence peaks. This is a very conservative method for thresholding, and we also generated a larger set of quality peaks, called the reduced stringency peaks, by applying a threshold based on the p-value of the least significant peak in the top 90% of reproducible peaks (the threshold p-values ranged from 10 -80 to 10 -160 ; a detailed description is presented below) The high stringency set of peaks was used for all analyses, and for cases in which the smaller number of peaks could affect the interpretation, the analysis was repeated for the larger set of reduced stringency peaks. For all samples except GATA2 in megakaryocytes two to four biological replicates were performed.

ChIP
ChIP assay was performed as previously described (Welch et al., 2004). Briefly, 75 million cells in PBS were crosslinked for 10 mins by adding formaldehyde at a final concentration of 0.4% and glycine was added at a final concentration of 125mM to quench cross-linking. For megakaryocytes, which are multiploid (6-64N), ~12 million cells in PBS were used for each ChIP assay. Cells were then lysed followed by nuclear lysis and sonication to shear the cross-linked chromatin. A Misonix S-4000 sonicator was used to shear samples in 8 repeats of 30 cycles of 1 sec. on, 1 sec. off sonication at output power 30. Fragments in the size range of 200-400bp were obtained. Sonicated chromatin was pre-cleared overnight at 4°C with 20 µg appropriate non-immune sera (IgG) on protein G agarose beads. 20 µg of the appropriate ChIP antibody were also pre-bound to protein G agarose beads overnight at 4°C. For binding, pre-cleared chromatin was added to the antibody:bead complex and incubated with rotation at 4°C for 2 -4 hours. 200 µL of pre-cleared chromatin was saved for use as input. After binding, the beads were washed with wash buffers, high-salt buffer and TE. DNA:protein complexes were eluted from beads into an elution buffer (1% SDS, 100mM NaHCO3).
After adding 5M NaCl to ChIP and input samples, they were incubated overnight at 65 °C with 1µg RNase A. To digest protein, each sample was treated with 60 µg Proteinase K for 2 hours at 45 °C and immunoprecipitated DNA was finally extracted using the Qiagen PCR Purification Kit.

Illumina Library Preparation
All samples including input were processed for library construction for Illumina sequencing using Illumina's ChIP-seq Sample Preparation Kit. DNA fragments were repaired to generate blunt ends and a single 'A' nucleotide was added to each end.
Double-stranded Illumina adaptors were ligated to the fragments. Ligation products were amplified by 18 cycles of PCR, and the DNA between 250-350 bp was gel purified.
Completed libraries were quantified with Quant-iT dsDNA HS Assay Kit. The DNA library was sequenced on the Illumina Genome Analyzer II sequencing system, and more recently on the HiSeq. Cluster generation, linearization, blocking and sequencing primer reagents were provided in the Illumina Cluster Amplification kits.

Data processing and peak calling
Raw ChIP-seq reads were first groomed using FASTQ Groomer on Galaxy (Blankenberg et al., 2010;Giardine et al., 2005;Goecks et al., 2010). This program verifies that each base call has a corresponding quality value, and that the quality value is in the Sanger Phred+33 format. Groomed reads were then mapped to mouse mm9 genome using Bowtie (Langmead et al., 2009) using the parameters -m = -1 (no limit),k = 1, and -best, thus allowing reads to map to multiple locations, but reporting only the single, best alignment. This option was chosen to allow reads to map in duplicated regions such as those containing the Hba genes. The mapped reads for a transcription factor as well as input reads for the appropriate cell line are then passed to MACS (Zhang et al., 2008) for peak calling using an mfold of 12, p-value threshold of 1e-05 and bw set to half the ChIP DNA fragment length as measured on an Agilent Bioanalyzer. To insure that all of the experiments were processed consistently, all of the above steps were performed as part of a Galaxy workflow, which can be found at This analysis uses the cross-correlation profile of the mapped reads from a given experiment to highlight the extent of enrichment achieved by immunoprecipitation. A well enriched sample that has been sequenced to sufficient depth should show a peak in the cross-correlation profile at a distance equivalent to the average fragment length of the sample, as positive and negative strand reads pile up on either side of the position occupied by a transcription factor. A sample that is poorly enriched will show a peak at approximately the length of the reads generated by the sequencer. A ratio of the heights of the peaks at these two positions, the rPhc, can be used as metric for identifying the extent of enrichment an experiment achieved. Results of this assessment are summarized in the

Consistency analysis for measuring reproducibility between replicates
For all ChIP-seq experiments other than GATA2 in megakaryocytes, a minimum of two biological replicates were sequenced. To assess the level of reproducibility between our replicates, we used a recently published method, called irreproducible discovery rate analysis (IDR), which was designed for analysis of consistency between replicates in high-throughput experiments (Li et al., 2011). This method not only reveals the degree of reproducibility between replicates, but also provides an objective criterion called the irreproducible discovery rate (IDR), to control or report the level of irreproducibility, in a fashion similar to the FDR. This method has been used extensively by the ENCODE and modENCODE Consortia to objectively identify high-confidence peaks from datasets generated by several labs (Mouse ENCODE Consortium et al., 2012).
We performed IDR analysis as described in the ENCODE standards manuscript, with a few modifications (Landt et al., 2012). The method is described in detail at https://sites.google.com/site/anshulkundaje/projects/idr, ("IDR on original replicates").
We used MACS to call peaks with a relaxed p-value threshold of 0.05 for individual replicates as well as the pooled data using the following workflows, respectively: https://main.g2.bx.psu.edu/u/csm165/w/adjustablepvaluehardisonidrpeakcalling https://main.g2.bx.psu.edu/u/csm165/w/adjustablepooledhardisonchipseqworkflow This resulted in a minimum of 100,000 peaks per replicate, and was expected to contain many false positives. The peaks for each pair of replicates were then run through the IDR pipeline using the p-values as a ranking measure ('ranking.measure' = p.value), with overlapping peaks defined as peaks with >=1bp overlap ('min.overlap.ratio' =0). As recommended on the above IDR webpage, IDR of 0.02 was used as a threshold to identify the number N of reproducible peaks from the pairs of replicates. For samples with more than two replicates, we performed pairwise consistency analysis for all the possible pairs of replicates the reproducible peaks from each pairwise comparison were concatenated and merged to give a set of N non-overlapping peaks. We obtained a final set of high-confidence peaks for each factor by selecting the top N peaks from the set of pooled peaks ranked by fold-enrichment. This is a very stringent set of peaks. We also obtained a larger set of less conservative set of peaks, called the confident peaks by selecting peaks from the pooled set whose p-value was more significant than TF-90, where TF-90 was defined as -10log10(p-value) of the least significant reproducible peak out of the top 90th percentile of reproducible peaks. For both sets, ChIP-seq peaks overlapping blacklisted regions (see below) were first removed from the set of pooled peaks prior to selection of final peaks. A summary of the IDR results can found in the Supplemental Methods Table 2 below.

Blacklist
As a final quality filter, we identified a set of genomic locations characterised by very high signal in our input or control tracks. We compiled a set of these locations by using MACS to identify peaks in the input tracks. To this list, we added additional locations, which showed elevated input signal over large areas (> 10,000 bp). We have identified the combination of these two groups of segments as a blacklist.

Peak intersections
Overlap between two peak regions was called between two transcription factors in the same lineage or between two lineages for the same transcription factors if the coordinates overlapped by at least 1 base pair. A gene was considered to be occupied in both lineages or by both transcription factors in two different ways: (1) if it had at least one peak overlapping the gene region where the "gene" includes 10 kb upstream of the TSS and 3 kb downstream of the polyA signal in both compared instances (2) if it had overlapped peak regions overlapping the gene region.

Motif enrichment
Enrichment of a motif was performed de novo, using HOMER algorithm (Heinz et al., 2010), testing for 8 to 16 base pair motif sequences. For enrichments among a set of peaks, a 200 bp window around the center of the peak was used as the target sequence. For motif enrichments within a set of genes we used the region from 2000 bp upstream to 500 bp downstream from the gene TSS. The canonical binding site motifs are enriched in the OSs from both lineages, and thus are not returned in the discriminative analysis.

Methylation profiles
The methylation profile for a set of transcription profile peaks was plotted for the region spanning -10 kb to +10 kb from the peak center. We used positive strands with a 200 bp window resolution. For heatmap generation, gray-scaled intensities were proportional to the significance (-log10 scale) of the methylation signal as compared to control with a maximum significance set at log10(P)=10 (black) and minimum log10(P)=0 (white).
Individual rows were sorted by the significance of transcription factor binding peak signal with the methylation profile.

Enrichment of transcription factor occupancy and histone methylation marks
The Fisher exact test was used to test if occupancy by a transcription factor or the presence of a methylation mark is over-or under-represented in a gene group (G) compared to all considered genes (A). For n being the number of genes occupied by a transcription factor in a gene set and N being a total number of genes in a set, the