Identifying Cis-Regulatory Regions in the AEP Genome Using ATAC-seq and CUT&Tag

This document covers the mapping, filtering, peak calling, and plotting of whole-animal ATAC-seq and CUT&Tag data to identify cis-regulatory elements in the genome of the AEP strain of H. vulgaris.

Mapping ATAC-seq data to the AEP Assembly

To analyze our whole-animal ATAC-seq data for the AEP strain of H. vulgaris, we began by aligning each of the three biological replicates to the AEP assembly. We first used Trimmomatic (v0.36) to remove low-quality base calls and contaminating adapter sequences. We also generated FASTQC (v0.11.4) reports before and after the Trimmomatic step to ensure the filtering was successful.

We next mapped the filtered reads to both the AEP assembly and the Hydra mitochondrial genome (downloaded from here and here) using Bowtie 2 (v2.2.6). The ATAC-seq protocol targets all DNA, including mitochondrial DNA, so a sizable portion of ATAC-seq reads are often mitochondrial reads that provide no insight into chromatin accessibility in the nuclear genome. We therefore removed all reads from our AEP assembly-mapped alignment file that aligned to the mitochondrial genome.
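The logic of the mitochondrial filter can be illustrated with a minimal Python sketch. It is not the pipeline itself (which works on alignment files produced by the mapping tools); the function name, the tuple layout, and the toy read and contig names are all hypothetical.

```python
def drop_mito_reads(nuclear_alignments, mito_read_names):
    """Keep only alignments whose read name did not align to the mitochondrial genome.

    nuclear_alignments: list of (read_name, contig, position) tuples (hypothetical layout)
    mito_read_names: iterable of read names that aligned to the mitochondrial genome
    """
    mito = set(mito_read_names)  # set lookup keeps the filter linear in the number of reads
    return [aln for aln in nuclear_alignments if aln[0] not in mito]


# Toy data, invented for illustration only
nuclear = [("r1", "chr1", 100), ("r2", "chr2", 500), ("r3", "chr1", 900)]
mito_hits = ["r2"]
kept = drop_mito_reads(nuclear, mito_hits)
```

Here `r2` is discarded because it also aligned to the mitochondrial reference, while `r1` and `r3` are retained.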

We then filtered the aligned reads to remove unmapped or ambiguously mapped reads using samtools (v1.12) and Picard Tools (v2.17.8). Finally, we removed PCR duplicates so that biases introduced when amplifying the ATAC-seq libraries would not interfere with the quantitative analysis of read counts.
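As a rough illustration of this kind of filtering, the sketch below decides whether a single alignment record should be kept, based on its SAM FLAG and mapping quality. The two flag bits come from the SAM specification. Using mapping quality as a proxy for ambiguous mapping, and the cutoff value `min_mapq`, are assumptions made for the example and are not taken from the pipeline script.

```python
UNMAPPED = 0x4     # SAM FLAG bit: the read is unmapped
DUPLICATE = 0x400  # SAM FLAG bit: the read is marked as a PCR or optical duplicate


def keep_alignment(flag, mapq, min_mapq=3):
    """Return True if an alignment passes the filters sketched here.

    min_mapq is an assumed cutoff for ambiguously mapped reads;
    the real pipeline may use a different value or criterion.
    """
    if flag & UNMAPPED:
        return False
    if flag & DUPLICATE:
        return False
    return mapq >= min_mapq
```

For example, `keep_alignment(99, 42)` keeps a properly paired, confidently mapped read, while the same read with the duplicate bit set (`99 | 0x400`) is rejected.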

The final output from this step was three bam files (AEP1_final.bam, AEP2_final.bam, and AEP3_final.bam), one for each biological replicate, containing all non-mitochondrial, deduplicated, uniquely mapped ATAC-seq reads.

Prior to running the mapping pipeline we first had to prepare bowtie references for both the mitochondrial and AEP assembly fasta files:

We then used the following script for the alignment pipeline:

(01_atacPeaks/ATAC_Mapping_Pipeline.sh)

We executed the above pipeline on a computing cluster using this script:

(01_atacPeaks/slurmMap.sh)

Calling ATAC-seq Peaks

To identify candidate CRE coordinates from our aligned ATAC-seq data, we used Macs2 (v2.2.7.1) to call peaks for each of our ATAC-seq biological replicates. We then used the IDR python package (v2.0.3) to make pairwise comparisons between the peak calling results of different biological replicates. This produced three sets of irreproducible discovery rate (IDR) scores, which reflect the likelihood that a given peak was reproducible across two biological replicates. We used these IDR scores in a subsequent step to generate our consensus peak set.
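Three replicates yield three IDR score sets because there are exactly three unordered pairs of replicates. A small sketch, using replicate labels taken from the bam file names above (the function name is hypothetical):

```python
from itertools import combinations


def replicate_pairs(replicates):
    """Return every unordered pair of replicates, one per IDR comparison."""
    return list(combinations(replicates, 2))


pairs = replicate_pairs(["AEP1", "AEP2", "AEP3"])
```

The result contains the pairs AEP1/AEP2, AEP1/AEP3, and AEP2/AEP3.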

As part of our peak calling pipeline, we also generated a merged version of the data that combined all three biological replicates. We called peaks on this merged dataset as well, which served as the starting point for generating our consensus peak set (see below). Finally, we used deepTools (v3.5.0) to generate bigwig files to facilitate viewing ATAC-seq read density in a genome browser.

(01_atacPeaks/ATAC_Peak_Pipeline.sh)

The peak pipeline script uses the following accessory R script to generate seed values for subsetting bam files:

(01_atacPeaks/generateSubsampleValue.R)

We executed the peak calling pipeline on a computing cluster using the following script:

(01_atacPeaks/slurmPeak.sh)

When calculating the IDR score, we removed any peaks that had a score > 0.1. To generate our consensus set of reproducible peaks, we used BedTools (v2.29.2) to identify peaks from the merged ATAC-seq data set (AEP_MG.narrowPeak) that overlapped with peaks from each of our pairwise IDR comparisons.
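The thresholding step can be sketched as follows. The example assumes that each peak's IDR score is already available as a plain value between 0 and 1. It does not parse the IDR package's output format, and the dictionary layout and peak names are hypothetical.

```python
def filter_by_idr(peaks, max_idr=0.1):
    """Drop peaks whose IDR score exceeds max_idr (peaks with score > 0.1 are removed)."""
    return [p for p in peaks if p["idr"] <= max_idr]


# Toy peaks with invented scores
peaks = [
    {"name": "peak_a", "idr": 0.01},
    {"name": "peak_b", "idr": 0.10},
    {"name": "peak_c", "idr": 0.35},
]
reproducible = filter_by_idr(peaks)
```

With these toy values, `peak_a` and `peak_b` are kept and `peak_c` is removed.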

We then used the following R script to subset the AEP_MG.narrowPeak peak list to include only those peaks that intersected with a peak from at least two of the IDR peak sets. The output from this script (consensusAEP.bed) was our consensus ATAC-seq peak set that we used for subsequent analyses.

(01_atacPeaks/getCon.R)
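The consensus rule described above (keep a merged-data peak if it intersects a peak from at least two of the three IDR peak sets) can be sketched in Python. This is an illustration of the rule, not a translation of getCon.R. It assumes BED-style half-open coordinates, and the function names and toy intervals are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base (half-open coordinates)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def consensus_peaks(merged_peaks, idr_sets, min_sets=2):
    """Keep merged peaks that overlap a peak in at least min_sets of the IDR peak sets."""
    consensus = []
    for peak in merged_peaks:
        # Count how many IDR sets contain at least one overlapping peak
        hits = sum(any(overlaps(peak, other) for other in idr_set) for idr_set in idr_sets)
        if hits >= min_sets:
            consensus.append(peak)
    return consensus


# Toy intervals, invented for illustration
merged = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
idr_12 = [("chr1", 150, 250), ("chr2", 60, 70)]
idr_13 = [("chr1", 90, 110)]
idr_23 = [("chr1", 550, 560)]
result = consensus_peaks(merged, [idr_12, idr_13, idr_23])
```

Only the first merged peak is supported by two IDR sets, so it is the only one retained.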

Mapping CUT&Tag data to the AEP Assembly

Our CUT&Tag mapping approach was quite similar to the one we used for our ATAC-seq data. We filtered reads using Trimmomatic and verified the output using FASTQC. We then mapped the filtered reads to the AEP assembly. Finally, we removed duplicate reads from the resulting alignment file.

(02_cutAndTagPeaks/mapPipe.sh)

We executed this mapping pipeline on a computing cluster using the following script:

(02_cutAndTagPeaks/slurmMapPipe.sh)

Calling CUT&Tag Peaks

Our peak calling approach with CUT&Tag differed from our ATAC-seq approach because our CUT&Tag data included negative controls that could be used as a baseline. Instead of Macs2, we opted to use SEACR (v1.3) to call peaks on our CUT&Tag data, as it was specifically designed to work well with CUT&Tag data.

SEACR requires read depth data in a bedgraph format. Because we needed to prep the negative control IgG samples before we could process any other samples, we used the following script to generate the necessary negative control files prior to running the main peak calling script.

(02_cutAndTagPeaks/iggProcessing.sh)
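For readers unfamiliar with the bedgraph format, the sketch below shows how per-base read depth collapses into bedgraph intervals, where each row holds a contig, a 0-based start, an exclusive end, and a value. It is an illustration of the format only. The function name and toy depths are hypothetical, and the pipeline generates these files with dedicated tools rather than code like this.

```python
def depth_to_bedgraph(chrom, depths):
    """Collapse per-base depths into (chrom, start, end, value) runs of equal depth."""
    intervals = []
    start = 0
    for pos in range(1, len(depths) + 1):
        # Close the current run at the end of the contig or when the depth changes
        if pos == len(depths) or depths[pos] != depths[start]:
            intervals.append((chrom, start, pos, depths[start]))
            start = pos
    return intervals


rows = depth_to_bedgraph("chr1", [0, 0, 3, 3, 3, 1])
```

The six per-base values become three rows: one covering positions 0 to 2 with depth 0, one covering 2 to 5 with depth 3, and one covering 5 to 6 with depth 1.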

For the peak calling, we used SEACR to call peaks on each bam file generated in the previous section, as well as on merged replicate bam files. Each SEACR command used the sample-matched negative control IgG data (e.g., IgG replicate 1 is derived from the same experiment as H3K4me1 replicate 1).

We opted to run SEACR in permissive mode, and then use the same IDR approach we used for our ATAC-seq data to ensure that the resulting consensus peak sets contained reproducible peaks.

(02_cutAndTagPeaks/peakPipe.sh)

Below is an auxiliary R script used within the peak calling pipeline. It reformats SEACR's output into a standard bed format.

(02_cutAndTagPeaks/refBed.R)
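A reformatting step of this kind might look like the Python sketch below. It assumes that each SEACR output row has six tab-separated fields (contig, start, end, total signal, maximum signal, and the region of maximum signal) and that the target is a six-column bed row. Both the assumed input layout and the choice to place total signal in the score column are illustrative assumptions; refBed.R may do something different.

```python
def seacr_to_bed(line, name):
    """Convert one assumed SEACR output row into a six-column bed row.

    Assumed input columns: chrom, start, end, total signal, max signal, max signal region.
    """
    chrom, start, end, total_signal, _max_signal, _max_region = line.rstrip("\n").split("\t")
    # Strand is unknown for peaks, so it is written as "."
    return "\t".join([chrom, start, end, name, total_signal, "."])


bed_row = seacr_to_bed("chr1\t100\t200\t55.5\t3.2\tchr1:120-150\n", "peak_1")
```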

Below is an auxiliary R script used within the peak calling pipeline. It generates a consensus peak set by determining which peaks in the merged replicate peak set passed the IDR threshold in at least two pairwise comparisons of biological replicates.

(02_cutAndTagPeaks/getCon.R)

Note that the CUT&Tag peak pipeline also uses the generateSubsampleValue.R script first mentioned in the ATAC-seq peak calling pipeline.

The peak pipeline was run on a computing cluster using this script:

(02_cutAndTagPeaks/slurmPeakPipe.sh)

Annotating Peaks by Their Nearest Gene

To link peaks to their potential target genes, we used UROPA (v4.0.2) to identify the nearest gene model to each peak in our H3K4me1, H3K4me3, and ATAC-seq peak sets (i.e., peaks associated with active regulatory elements), with a maximum allowed distance of 100 kb.
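The idea of nearest-gene assignment with a distance cap can be sketched as follows. This is a simplified illustration and not UROPA's algorithm: it measures distance from the peak center to the nearest edge of each gene, which is an assumption made for the example, and the function name, tuple layouts, and toy genes are hypothetical.

```python
def nearest_gene(peak, genes, max_distance=100_000):
    """Return (gene_id, distance) for the closest gene within max_distance, or None.

    peak: (chrom, start, end)
    genes: list of (gene_id, chrom, start, end)
    """
    chrom, start, end = peak
    center = (start + end) // 2
    best = None
    for gene_id, g_chrom, g_start, g_end in genes:
        if g_chrom != chrom:
            continue
        if g_start <= center <= g_end:
            dist = 0  # the peak center lies inside the gene
        else:
            dist = min(abs(center - g_start), abs(center - g_end))
        if dist <= max_distance and (best is None or dist < best[1]):
            best = (gene_id, dist)
    return best


# Toy gene models, invented for illustration
genes = [("g1", "chr1", 1000, 2000), ("g2", "chr1", 150000, 160000), ("g3", "chr2", 0, 100)]
hit = nearest_gene(("chr1", 2400, 2600), genes)
miss = nearest_gene(("chr1", 300000, 300100), genes)
```

The first peak is assigned to `g1` at a distance of 500 bp. The second has no gene within 100 kb, so it is left unannotated.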

UROPA uses a config file in JSON format to set relevant parameters. This was the parameter file for annotating our ATAC-seq peaks:

(03_characterizeCREs/peakAnnotATAC.json)

The parameter files for the other two peak sets were largely identical to the one above. For the H3K4me1 peak set annotation, we made the following changes (text is output from the diff command comparing the ATAC-seq and H3K4me1 config files):

For the H3K4me3 peak set annotation, we made the following changes (text is output from the diff command comparing the ATAC-seq and H3K4me3 config files):

We then ran UROPA using the following commands:

Calculating Correlation Scores for CUT&Tag and ATAC-seq Biological Replicates

In order to evaluate the reproducibility of our ATAC-seq and CUT&Tag data, and to determine if the different types of data had the expected distribution across the genome relative to each other (e.g., open chromatin should be positively correlated with activating histone modifications), we calculated global correlation scores amongst all of our ATAC-seq and CUT&Tag replicates.
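To make the notion of a global correlation score concrete, the sketch below computes a Pearson correlation between two vectors of binned signal, one value per genomic bin. Which correlation method the scripts used is not stated here, so Pearson is an assumption, and the bin values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of binned signal values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy signal over four bins: two tracks that rise and fall together
atac_bins = [5, 1, 0, 8]
h3k4me3_bins = [6, 0, 1, 9]
score = pearson(atac_bins, h3k4me3_bins)
```

Tracks that rise and fall together across bins, as in this toy example, give a score close to 1, whereas unrelated tracks give a score near 0.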

We used the deepTools multiBigwigSummary function to calculate the correlation scores:

(03_characterizeCREs/multiBWComp.sh)

We then plotted the results using the following script:

(03_characterizeCREs/plotCorrelation.sh)

bwCorPlot

The hierarchical clustering grouped together biological replicates, a good indication of reproducibility. We also saw positive correlation between datasets associated with activating marks (ATAC-seq, H3K4me1, and H3K4me3), whereas there was little to no correlation between activating marks and the repressive mark H3K27me3 or the negative controls.

Visualizing ATAC-seq and CUT&Tag Read Distribution Around Genes

Because chromatin accessibility and certain histone modifications have clear expected distribution patterns around transcription start sites, we visualized the read distribution of our CUT&Tag and ATAC-seq data around the AEP gene models to further validate the data.

H3K4me3 should be strongly enriched at transcription start sites. H3K4me1 should be found around expressed genes, but should not be strongly associated with transcription start sites. ATAC-seq is expected to be associated both with transcription start sites and with more distal regulatory elements. H3K27me3 should be depleted near actively transcribed genes, but it can be found in silenced genes.
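An average signal profile around transcription start sites, which is what these expectations refer to, can be sketched as below. The example is a simplified stand-in for a dedicated tool: it works on per-base signal held in memory, flips windows for minus-strand genes so that all profiles read in the direction of transcription, and skips windows that run off a contig. All names and values are hypothetical.

```python
def tss_profile(signal, tss_sites, flank):
    """Average signal in a window of +/- flank bases around each TSS.

    signal: dict mapping contig name to a list of per-base values
    tss_sites: list of (chrom, position, strand) tuples
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for chrom, pos, strand in tss_sites:
        track = signal[chrom]
        if pos - flank < 0 or pos + flank >= len(track):
            continue  # skip windows that extend past the contig ends
        window = track[pos - flank : pos + flank + 1]
        if strand == "-":
            window = window[::-1]  # orient minus-strand genes in the direction of transcription
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    return [t / used for t in totals] if used else totals


# Toy track with one plus-strand and one minus-strand TSS
signal = {"chr1": [0, 0, 1, 5, 1, 0, 0, 0, 2, 6, 1, 0]}
sites = [("chr1", 3, "+"), ("chr1", 9, "-")]
profile = tss_profile(signal, sites, flank=1)
```

A mark such as H3K4me3 would be expected to produce a profile that peaks at the central position, as this toy signal does.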

Systematically Characterizing Read Distribution Around Genes

We supplemented our analysis with RNA-seq data, allowing us to verify which genes were actively transcribed. For this we used the file NS_RNA.bw, which is a bigwig file containing read depth information for genome-mapped RNA-seq data from whole adult Hydra polyps. Specifically, the alignment data was generated by merging the bam files NS1_RNA.genome.bam, NS2_RNA.genome.bam, and NS3_RNA.genome.bam from the alignment benchmarking analysis described in 03_aepGenomeAnnotation.md.

We also included data on sequence conservation using the file aepCon.bw, which was generated by the analysis described in 07_genomeConservation.md.

We combined all of these data and calculated their average profiles across the AEP assembly gene models using the deepTools computeMatrix function:

(03_characterizeCREs/compGeneMatrix.sh)

We then plotted the results using this script:

(03_characterizeCREs/plotGeneHeatmap.sh)

gbHeat

Generating Representative Plots for ATAC-seq, CUT&Tag, and Sequence Conservation Centered on Individual Genes

To show individual examples of the correspondence between our various methods of identifying regulatory elements, we plotted each of the genome-wide data tracks we generated for ATAC-seq, CUT&Tag, and sequence conservation centered on specific genes in the genome using the R script below.

However, prior to running the plotting script, we had to fix an issue where CDS phases were improperly formatted in the AEP gene model gff3 file that prevented the gene models from being properly imported. We did this using the following command:

(03_characterizeCREs/plotTrackData.R)

bmp5-8cSample

brachyurySample

The following two plots were restricted to the viridissima and oligactis conservation tracks so that we could compare our systematic genome alignment to the results of manual alignments done by Vogg et al. (2019).

wnt3ConSample

sp5ConSample

When we tried to use the above script to generate example plots that showed the individual biological replicates, we found that Gviz didn't do a good job of handling the individual CUT&Tag replicates. It's unclear what caused the issue, but the resulting plots looked blocky and didn't accurately represent the data.

We therefore used pyGenomeTracks (v3.6) to plot the data instead. Below is the config file used to generate the plot of individual replicates for the CUT&Tag and ATAC-seq data centered around the brachyury1 gene:

(03_characterizeCREs/repTracks.ini)

The plot was generated using the script below:

(03_characterizeCREs/makeRepPlot.sh)

repPlot

Identifying Transcription Factor Footprints using ATAC-seq Data

In order to determine if our CUT&Tag data for H3K4me1 and H3K4me3 were indeed enriched at CREs, we performed transcription factor footprinting using our ATAC-seq data to look for an enrichment of predicted transcription factor binding sites in H3K4me1 and H3K4me3 peaks.

We used the TOBIAS pipeline to perform the footprinting analysis. We first used the ATACorrect function to pre-process the mapped ATAC-seq reads (AEP_MG_final.bam generated by the 01_atacPeaks/ATAC_Mapping_Pipeline.sh script). Input to this function also included the consensus ATAC-seq peak file (consensusATAC.bed, renamed from the file consensusAEP.bed generated by the 01_atacPeaks/getCon.R script).

Following pre-processing, we then generated TF footprint scores using the FootprintScores function.

(03_characterizeCREs/runToby.sh)

To visualize the distribution of the predicted TF footprints (stored in the AEP_footprints.bw file) around H3K4me1, H3K4me3, and H3K27me3 peaks, we used the deepTools computeMatrix and plotHeatmap functions. We included 5 kb of flanking sequence on either side of the predicted peaks in the final plots:

(03_characterizeCREs/hisMarksFootprint.sh)

(03_characterizeCREs/plotHisFootHeatmap.sh)

histoneFootprints

We observed a clear enrichment of predicted TF binding sites in the H3K4me1 and H3K4me3 peaks, while the H3K27me3 peaks showed no enrichment. This finding is consistent with H3K4me1 and H3K4me3 peaks occurring within active CREs and H3K27me3 peaks occurring within regions of repressed heterochromatin.

Files Associated with This Document