Characterizing Hydra Genome Sequence Conservation

This document covers our approach to analyzing sequence conservation in the strain AEP H. vulgaris genome assembly. This process entailed collecting and prepping five hydrozoan genomes, aligning them with the progressive cactus whole-genome alignment pipeline, generating sequence conservation data tracks, and identifying conserved transcription factor binding motifs and cis-regulatory elements.

Prepping Input Sequences for Alignment

Repetitive regions are intrinsically difficult to align. Therefore, softmasking repeats in the genomes being aligned can improve results. Softmasking prevents a region from seeding alignments, but allows alignments to extend into the masked region. We had already masked the AEP assembly as part of our annotation pipeline (see 02_repeatMasking.md for details). During that analysis we had also generated a repeat masked version of the strain 105 genome. In addition, softmasked versions of the Clytia and H. viridissima genomes were already available. For Clytia we used the file Clytia_hemisphaerica_gca902728285.GCA902728285v1.dna_sm.toplevel.fa.gz (available for download here), which we renamed to clytia.fa. For H. viridissima we used the file hvir_genome_hm2_250116_renamed.fa (available for download here), which we renamed to virid.fa.

Masking Repeats in the H. oligactis Assembly

When the whole-genome alignment was performed, we had not yet generated oligactis-specific repeat libraries for masking repeats (described in 02_repeatMasking.md). We therefore used a combination of the strain AEP H. vulgaris repeatmodeler library and the Dfam eumetazoa repeat family library. This approach left some repetitive regions unmasked. Nonetheless, we were still able to capture the majority of the repetitive sequence in the oligactis genome.

To identify and mask repeats in oligactis, we first ran repeatmasker using the repeat families identified for the AEP assembly:

(01_alignment/runOligMask.sh)

We then did another repeatmasker run using the Dfam eumetazoa repeat library:

(01_alignment/runOligMaskEuk.sh)

We then pooled those two repeat predictions:

zcat olig_AEPMAsk/olig_genome.fa.cat.gz oligEuk/olig_genome.fa.cat.gz > fullOligMask.cat

and generated a finalized set of repeat coordinates:

(01_alignment/runOligMaskEuk.sh)

This created the following repeat summary table:

(snippet from 01_alignment/fullOligMask.tbl)

We then used these repeat predictions to generate a soft masked version of the oligactis assembly that was used as input into the genome alignment pipeline:
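As an illustration of what soft masking does to a sequence, here is a minimal Python sketch. It is not the command used in this analysis; the function name `soft_mask` and the toy intervals are hypothetical, and the intervals are assumed to be 0-based and half-open, as in a bed file.

```python
def soft_mask(sequence, repeats):
    """Lower-case the bases inside each (start, end) repeat interval.

    Intervals are assumed to be 0-based and half-open (bed-style).
    Bases outside the intervals are left exactly as they were.
    """
    bases = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so out-of-range intervals cannot raise.
        start = max(start, 0)
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = bases[i].lower()
    return "".join(bases)


# Toy example: two repeat intervals on a 12 bp sequence.
print(soft_mask("ACGTACGTACGT", [(2, 5), (9, 12)]))
```

The masked bases remain readable by downstream tools, which is what lets an aligner extend into a repeat without seeding from it.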

Generating a Cross-Species Whole-Genome Alignment

After prepping the repeat masked genome sequences, we placed them in a subfolder called seqs, and renamed the fastas in the following way:

We then prepared a config file required by cactus (called evolverVulgaris.txt). The first line of this file is a species tree in newick format describing the relationships among the five genomes in the alignment. The distances were derived from a species tree generated by a preliminary Orthofinder analysis that included proteomes from each of the species in this alignment. The remaining lines specify prefixes for each genome and the path to the appropriate fasta file.

(01_alignment/evolverVulgaris.txt)

We then ran the progressive cactus aligner using the following script:

(01_alignment/vulgarisCactus.sh)

This generated the alignment file evolverHydra.hal, which was used for subsequent analysis.

Generating AEP Genome Conservation Data Tracks

One of the goals of generating a whole-genome alignment was to generate genome data tracks of sequence conservation. This would allow researchers to look at sequence conservation at their locus of interest and quickly identify candidate functional regions. We opted to use a fairly straightforward approach to generate these data tracks, which simply involved quantifying the amount of conserved sequence in aligned regions (i.e., which nucleotides were identical across different species). This functionality isn't built into the suite of tools provided with the cactus aligner or any other related software, so we implemented our own solution.

We first converted the hal alignment file into maf format, which is a more widely supported format for multiple sequence alignments. maf files are formatted to use one of the aligned sequences as a primary reference. We used the AEP assembly because this was the genome we were ultimately going to generate data tracks for. To aid in parallelization at later steps we generated a single maf file for each chromosome in the AEP assembly.

(02_conservationTracks/makeMafs.sh)

We then used the following custom python script to count the number of other hydrozoan genomes that had the same nucleotide as the AEP assembly at each position in the AEP genome. We then applied a 100bp moving window to smooth these values, and then exported the results to a bedgraph file.

(02_conservationTracks/mafWindows.py)
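The counting and smoothing steps described above can be sketched in plain Python as follows. This is an illustrative sketch, not a copy of mafWindows.py: the function names are hypothetical, the input is assumed to be the rows of a single alignment block (reference first, all rows the same length), and the smoothing is assumed to be a centered window that is truncated at the sequence ends.

```python
def identity_counts(ref, others):
    """Count, per reference base, the aligned sequences with the same base.

    Columns where the reference has a gap are skipped, so the result has
    one value per reference nucleotide. Comparison is case-insensitive so
    that soft-masked (lower-case) bases are still counted.
    """
    counts = []
    for col, ref_base in enumerate(ref):
        if ref_base == "-":
            continue
        ref_base = ref_base.upper()
        counts.append(sum(1 for seq in others if seq[col].upper() == ref_base))
    return counts


def moving_average(values, window=100):
    """Centered moving average over `window` positions (truncated at ends)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window // 2)
        hi = min(len(values), i - window // 2 + window)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def bedgraph_lines(chrom, values, start=0):
    """Format one bedgraph line per base: chrom, start, end, value."""
    return [f"{chrom}\t{start + i}\t{start + i + 1}\t{v}"
            for i, v in enumerate(values)]


counts = identity_counts("AC-GT", ["ACAGA", "A-TGT"])
print("\n".join(bedgraph_lines("chr-1", moving_average(counts, window=3))))
```

Writing one line per base keeps the sketch short; merging runs of identical values would give a smaller file with the same content.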

Note the commented line conBG.to_csv(chrName + '.cactus.bedgraph',sep='\t',header=False,index=False). Uncommenting this line enables the production of a bedgraph without the 100 bp smoothing window (just raw conservation counts, written to the file aepCon.bedgraph that was then converted to aepCon.bw). We used this version of the output for characterizing conservation patterns around genes (see next section).

We executed this python script on a computing cluster using the following script:

(02_conservationTracks/runWindows.sh)

We then pooled the bedgraph files from each chromosome into a single file and converted it into the more compact bigwig format.

(02_conservationTracks/runBW.sh)

Note the file aep.genome, which simply lists the chromosome sizes in the AEP assembly. This file was generated using the following command:
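A chromosome sizes file of this kind has two tab-separated columns: sequence name and sequence length. As a rough illustration of how such a table can be derived from a fasta file, here is a Python sketch (not the command used for the analysis; the function name and the toy sequence names are hypothetical).

```python
def chrom_sizes(fasta_lines):
    """Return {sequence name: length} for an iterable of fasta lines."""
    sizes = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


example = [">chr-1 toy description", "ACGT", "AC", ">chr-2", "GGG"]
for name, length in chrom_sizes(example).items():
    print(f"{name}\t{length}")
```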

Our initial python script counted the number of identical bases across all non-AEP genomes. We also generated additional modified versions of the mafWindows.py script to look only at pairwise alignments between the AEP assembly and each of the other genomes.

The following is the output of the command diff mafWindows.py mafWindows105.py, highlighting the differences between the original script and the version that only looked at pairwise conservation between the AEP and 105 assemblies.

The following is the output of the command diff mafWindows.py mafWindowsOlig.py, highlighting the differences between the original script and the version that only looked at pairwise conservation between the AEP and oligactis assemblies.

The following is the output of the command diff mafWindows.py mafWindowsVirid.py, highlighting the differences between the original script and the version that only looked at pairwise conservation between the AEP and viridissima assemblies.

The following is the output of the command diff mafWindows.py mafWindowsClytia.py, highlighting the differences between the original script and the version that only looked at pairwise conservation between the AEP and Clytia assemblies.

We executed all of these modified scripts on a computing cluster using the following script:

(02_conservationTracks/runSpecWindows.sh)

We then pooled and converted the output files from each of these scripts into bigwig files.

(02_conservationTracks/runBWbySpec.sh)

Calculating Conservation Patterns Near Genes

One question we wanted to explore using our sequence conservation data involved the size of promoter-proximal regulatory regions in Hydra. More specifically, we wanted to determine the typical distribution of conserved sequence in the regions upstream of gene transcription start sites to determine if 2 Kb upstream was sufficient to capture most promoter sequence in Hydra (as had been proposed by some researchers).

We used the computeMatrix function from the DeepTools package to quantify sequence conservation around each of the gene models in the AEP assembly. This function removed intronic sequences and scaled each gene to be equivalent to 750 bases. It also provided sequence conservation data on the 10 Kb up and downstream of each gene.

(compWideGeneCon.sh)

We then took the matrix generated by this command and used a custom R script to determine the size of the conservation footprint around genes. That is, we asked how far up and downstream of a gene you typically have to go before conservation rates fall back to baseline levels. We also used this script to determine how far up and downstream you have to go before you encompass 50% or 90% of the total conservation signal (i.e., the area under the curve from the TSS to the point where conservation returned to baseline levels). Finally, the script generated a plot to summarize these results.

(03_promConservation/conCutOffPlot.R)

(figure: conservationCutoffFilled)
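The footprint calculation can be sketched as below. This is a hedged Python illustration rather than a translation of conCutOffPlot.R: the function name is hypothetical, the profile is assumed to be an average conservation value per bin ordered by increasing distance from the TSS, the baseline is assumed to be supplied by the caller, and the area under the curve is assumed to be measured above that baseline.

```python
def footprint(profile, baseline, bin_size):
    """Summarize how far conservation extends away from the TSS.

    profile  -- mean conservation per bin, nearest bin first
    baseline -- background conservation level (assumed to be known)
    bin_size -- width of each bin in bp
    """
    # The footprint ends at the first bin that is back at (or below) baseline.
    end = len(profile)
    for i, value in enumerate(profile):
        if value <= baseline:
            end = i
            break

    signal = [value - baseline for value in profile[:end]]
    total = sum(signal)

    def distance_for(fraction):
        # Distance at which the cumulative signal reaches `fraction` of total.
        running = 0.0
        for i, s in enumerate(signal):
            running += s
            if running >= fraction * total:
                return (i + 1) * bin_size
        return end * bin_size

    return {
        "extent": end * bin_size,
        "auc50": distance_for(0.5),
        "auc90": distance_for(0.9),
    }


print(footprint([10, 6, 4, 2, 2, 2], baseline=2, bin_size=100))
```

Running the same function on the reversed downstream half of the matrix would give the corresponding downstream values.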

Identifying Conserved Transcription Factor Binding Sites in the AEP Genome

Predicting Conserved Binding Sites Using JASPAR Motif Sequences

We next used our genome alignments to identify putative conserved transcription factor binding sites (TFBS) in the AEP genome. One of the most useful aspects of a whole-genome alignment is that it allows you to convert coordinates from one genome in the alignment to another. We took advantage of this to identify conserved TFBS by separately identifying putative TFBS in each genome and then converting the coordinates to their equivalent coordinates in the AEP assembly. Then we could look for cases where the same TFBS was predicted in the same location in multiple genomes in the alignment.

To predict TFBS in our genomes, we needed a database of experimentally validated binding motifs. We used the JASPAR database for this purpose. Specifically, we used binding motifs from their non-redundant vertebrate (JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt), insect (JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt), and nematode (JASPAR2020_CORE_nematodes_non-redundant_pfms_jaspar.txt) databases (downloaded here). We pooled these three files to make a unified database:

cat *jaspar.txt > pooledJasparNR.txt

We then used FIMO from the meme suite of software tools to identify predicted binding sites for all motifs in our database in each of the Hydra genomes in our whole-genome alignment. We opted to exclude the Clytia genome because very little non-coding sequence is conserved from Hydra to Clytia. To prepare for running FIMO, we generated a markov model of base frequencies in Hydra. Because base composition is generally quite similar among Hydra genomes, we just used the AEP model for all four genomes.

fasta-get-markov aep.final.genome.fa > genome.markov.txt

To reduce the search space for TFBS, we hard masked repetitive regions so they wouldn't be considered as part of the analysis. We did this by simply converting lower case bases in the soft masked genome fasta files to Ns.
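A minimal Python sketch of this hard-masking step is shown below. It illustrates the idea rather than reproducing the exact command we ran; the function name is hypothetical. The one detail worth noting is that header lines have to be left untouched, because sequence names can contain lower-case letters.

```python
def hard_mask(fasta_lines):
    """Replace soft-masked (lower-case) bases with N, leaving headers as-is."""
    masked = []
    for line in fasta_lines:
        if line.startswith(">"):
            masked.append(line)
        else:
            masked.append(
                "".join("N" if base.islower() else base for base in line)
            )
    return masked


print("\n".join(hard_mask([">seq1 lower-case header", "ACgtNnAC"])))
```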

Finally, we had to convert the JASPAR-formatted motifs into MEME-formatted motifs using a utility script included as part of meme suite:

jaspar2meme -bundle pooledJasparNR.txt > pooledJasparNR.meme.txt

We then ran FIMO on each of the four Hydra genomes:

(04_motConservation/findHits.sh)

We then reformatted the tsv output from fimo into bed files. For non-AEP results, we used the liftover functionality provided by the progressive cactus alignment suite of software tools to convert the TFBS coordinates into their equivalent values in the AEP genome.

(04_motConservation/tsv2Bed.sh)
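The tsv-to-bed conversion mainly involves reordering columns and shifting the coordinate system: FIMO reports 1-based, inclusive start and stop positions, whereas bed files use 0-based, half-open intervals. The following Python sketch shows that conversion. It is an illustration only and is not taken from tsv2Bed.sh; the function name is hypothetical and the choice of bed columns (name = motif ID, score = FIMO score) is an assumption.

```python
import csv
import io


def fimo_to_bed(tsv_text):
    """Convert FIMO tsv text into a list of bed6-style tuples."""
    # FIMO appends blank lines and '#'-prefixed comments after the hits.
    rows = (
        line
        for line in io.StringIO(tsv_text)
        if line.strip() and not line.startswith("#")
    )
    reader = csv.DictReader(rows, delimiter="\t")
    bed = []
    for rec in reader:
        bed.append((
            rec["sequence_name"],
            int(rec["start"]) - 1,  # 1-based inclusive -> 0-based start
            int(rec["stop"]),       # inclusive stop == half-open end
            rec["motif_id"],
            rec["score"],
            rec["strand"],
        ))
    return bed


# Toy input with a made-up motif ID and sequence name.
example = (
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore"
    "\tp-value\tq-value\tmatched_sequence\n"
    "MA0000.1\ttoyTF\tchr-1\t101\t108\t+\t12.5\t1e-05\t0.01\tACGTACGT\n"
    "\n"
    "# toy footer comment\n"
)
for record in fimo_to_bed(example):
    print("\t".join(str(field) for field in record))
```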

After all TFBS predictions had been placed in the same coordinate space, we needed to find cases where a TFBS prediction from a non-AEP species overlapped a predicted TFBS in the AEP assembly. However, we wanted to first exclude TFBS that either fell outside of an ATAC-seq peak or that fell inside a protein coding region. Both of these exclusions were intended to increase the likelihood that the remaining TFBS predictions fell within functional cis regulatory elements. We had already generated our ATAC-seq peak coordinates (consensusAEP.bed), but we needed to also create a bed file with CDS coordinates. We generated this file from our AEP gene models gff3 file:

grep -P '\tCDS\t' HVAEP1.GeneModels.gff3 | gff2bed --do-not-sort - > HVAEP1.cds.bed

We then used the bedtools intersect function to eliminate TFBS predictions that did not overlap peaks in consensusAEP.bed or that did overlap features in HVAEP1.cds.bed. After this filtering, we found all instances where a non-AEP TFBS prediction overlapped an AEP TFBS prediction, again using bedtools intersect. These intersections were then filtered so that the final output included only intersections that occurred between two instances of the same binding motif.

(04_motConservation/filterHits.sh)
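The last filtering step (keeping only overlaps between two instances of the same motif) can be sketched in Python as follows. This is a simplified stand-in for the bedtools-based logic in filterHits.sh, not a reimplementation of it: the function name is hypothetical, hits are assumed to be (chromosome, start, end, motif) tuples already lifted over to AEP coordinates, and the ATAC-seq and CDS filters are assumed to have been applied beforehand.

```python
def same_motif_overlaps(aep_hits, other_hits):
    """Return non-AEP hits that overlap an AEP hit of the same motif.

    Each hit is a (chrom, start, end, motif) tuple with bed-style
    (0-based, half-open) coordinates.
    """
    # Index AEP hits by chromosome and motif so only comparable hits are scanned.
    aep_index = {}
    for chrom, start, end, motif in aep_hits:
        aep_index.setdefault((chrom, motif), []).append((start, end))

    conserved = []
    for chrom, start, end, motif in other_hits:
        for aep_start, aep_end in aep_index.get((chrom, motif), []):
            # Two half-open intervals overlap if each starts before the other ends.
            if start < aep_end and aep_start < end:
                conserved.append((chrom, start, end, motif))
                break
    return conserved


aep = [("chr-1", 100, 110, "motifA"), ("chr-1", 200, 208, "motifB")]
other = [
    ("chr-1", 105, 115, "motifA"),  # overlaps the same motif: kept
    ("chr-1", 203, 211, "motifA"),  # overlaps a different motif: dropped
    ("chr-1", 110, 120, "motifA"),  # adjacent but not overlapping: dropped
]
print(same_motif_overlaps(aep, other))
```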

Generating a Control Motif Dataset Using Shuffled JASPAR Motifs

In order to determine if a particular binding motif was conserved, we needed to demonstrate that the rate at which a particular sequence remained intact over the course of Hydra evolution was higher than would be expected by chance. To estimate the expected baseline rate of conservation for any given binding motif, we chose to characterize conservation frequencies for a shuffled version of that motif. The rationale was that shuffling a motif should disrupt its function without affecting any of its intrinsic sequence characteristics (mainly length and GC content).

To execute this approach, we needed to generate a shuffled version of each motif in our database. We wrote a custom R script for this purpose. This script iterated through each motif in our database and randomly reorganized the nucleotides that made up the motif. To make sure this reorganization didn't inadvertently create a motif that resembled some other functional motif, we used the meme suite tool tomtom to compare our shuffled motif to our collection of JASPAR motifs to make sure the shuffled motif had no significant similarity to any bona fide binding motifs. If the shuffled motif did resemble another motif, it was shuffled again, repeating the process until the shuffled motif was sufficiently dissimilar (E-value ≥ 5). In some cases, the motif being shuffled was too simple/short to get an E-value ≥ 5. Therefore, we wrote the script to halve the dissimilarity threshold after 20 consecutive unsuccessful shuffling attempts. The shuffled motifs were then written to the file shuffledJasparMotifs.txt.

(04_motConservation/motifShuffle.R)
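The retry logic described above can be sketched in Python as follows. This is an illustration of the control flow only and is not a translation of motifShuffle.R: the function name is hypothetical, the motif is assumed to be a list of per-position count columns, shuffling is assumed to reorder whole positions (which preserves length and base composition), and `evalue_of` is a placeholder for a call out to tomtom that returns the best-match E-value.

```python
import random


def shuffle_motif(columns, evalue_of, threshold=5.0, patience=20, rng=random):
    """Shuffle motif positions until the result is dissimilar to known motifs.

    columns   -- list of per-position count tuples, e.g. (A, C, G, T)
    evalue_of -- callable returning the best-match E-value for a candidate
                 (a stand-in for running tomtom; assumed, not part of meme)
    threshold -- minimum E-value required to accept a shuffled motif
    patience  -- consecutive failures allowed before the threshold is halved
    """
    failures = 0
    while True:
        candidate = list(columns)
        rng.shuffle(candidate)
        if evalue_of(candidate) >= threshold:
            return candidate, threshold
        failures += 1
        if failures == patience:
            # Short or simple motifs may never reach the original threshold.
            threshold /= 2
            failures = 0


toy_motif = [(10, 0, 0, 0), (0, 10, 0, 0), (0, 0, 10, 0), (0, 0, 0, 10)]
# Toy scorer that always reports an E-value of 1.5, forcing two halvings.
shuffled, final_threshold = shuffle_motif(toy_motif, lambda cols: 1.5)
print(final_threshold)
```

Returning the final threshold alongside the motif makes it easy to see which motifs needed a relaxed cutoff.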

The next steps were essentially identical to those in the previous section, except we used our shuffled motif file instead of the database of genuine motifs.

We first converted the JASPAR-formatted motif list to a MEME-formatted motif list:

jaspar2meme -bundle pooledJasparNR.txt > pooledJasparNR.meme.txt

We then predicted motifs across all four Hydra genomes:

(04_motConservation/findHitsShuf.sh)

We converted the FIMO output to bed, and converted non-AEP coordinates to AEP coordinates:

(04_motConservation/tsv2BedShuf.sh)

We filtered motif predictions so that they fell inside ATAC-seq peaks and didn't intersect coding sequence. Then we filtered non-AEP hits if they didn't overlap with an identical motif hit in the AEP genome:

(04_motConservation/filterHitsShuf.sh)

Identifying Conserved Motifs and Comparing Conservation Rates of Different Motif Sequences

To classify an individual TFBS prediction as conserved, we looked for instances where the TFBS was present in the AEP assembly, the 105 assembly, and at least one non-vulgaris assembly. That is, a TFBS needed to be present in 105AepOlapCon.bed and either oligAepOlapCon.bed or viridAepOlapCon.bed.
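Expressed as set logic, this rule is an intersection with a union. The short Python sketch below illustrates it; the function name and the toy site identifiers are hypothetical, and each input is assumed to be the set of AEP TFBS identifiers recovered from the corresponding overlap file.

```python
def conserved_sites(in_105, in_olig, in_virid):
    """Sites seen in strain 105 and in at least one non-vulgaris genome."""
    return in_105 & (in_olig | in_virid)


in_105 = {"site1", "site2", "site3"}
in_olig = {"site1", "site4"}
in_virid = {"site3", "site5"}
print(sorted(conserved_sites(in_105, in_olig, in_virid)))
```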

We wanted to then compare the frequency with which different motif sequences met these criteria to the frequency obtained when using a shuffled version of the same motif. This would give us some insight into whether a particular JASPAR motif was indeed functional in Hydra.

To perform these comparisons we used the following R script. This script also generated a plot showing the log odds ratio of the conservation rate of genuine motifs compared to their shuffled controls.

(04_motConservation/motifConservationAnalysis.R)

(figure: conMotPlot)
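For reference, a log odds ratio of this kind can be computed as in the Python sketch below. The details are assumptions rather than a description of motifConservationAnalysis.R: the function name is hypothetical, the logarithm is taken in base 2, and a pseudocount of 0.5 is added to every cell so that motifs with zero conserved (or zero non-conserved) sites do not produce a division by zero.

```python
import math


def log_odds_ratio(con_real, total_real, con_shuf, total_shuf, pseudocount=0.5):
    """log2 odds ratio of conservation for a genuine motif vs. its shuffle.

    con_*   -- number of conserved binding site predictions
    total_* -- total number of binding site predictions considered
    """
    real_odds = (con_real + pseudocount) / (total_real - con_real + pseudocount)
    shuf_odds = (con_shuf + pseudocount) / (total_shuf - con_shuf + pseudocount)
    return math.log2(real_odds / shuf_odds)


# Toy counts: 30 of 100 genuine sites conserved vs. 10 of 100 shuffled sites.
print(round(log_odds_ratio(30, 100, 10, 100), 3))
```

Positive values indicate that the genuine motif is conserved more often than its shuffled control, and values near zero indicate no difference.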

Note: the generation of the file motifInfo.csv used in the above script is described in 10_hydraRegulators.md

Identifying Conserved Cis-Regulatory Elements

We next wanted to use our sequence conservation data to identify putative AEP regulatory elements that were conserved in other hydrozoan species. To do this we first used the computeMatrix function from DeepTools to generate a matrix of conservation scores for all Cut&Tag and ATAC-seq peaks, with each pairwise alignment between the AEP genome and all non-AEP references in our alignment being given its own set of conservation scores in the matrix.

Note: Generation of the Cut&Tag and ATAC-seq peak sets is described in 08_creIdentification.md

Calculating H3K27me3 peak conservation scores:

(05_creConservation/conBySpecH273.sh)

Calculating H3K4me3 peak conservation scores:

(05_creConservation/conBySpecH43)

Calculating H3K4me1 peak conservation scores:

(05_creConservation/conBySpecH41)

Calculating ATAC-seq peak conservation scores:

(05_creConservation/conBySpecATAC.sh)

To translate these scores into predictions of cross-species conservation, we made use of the observation that the distribution of conservation scores for peaks often appeared to be bimodal. For example, here is a conservation score distribution plot for H3K4me3 peaks generated as part of the R script included below (x-axis indicates % identity between AEP peak and aligned non-AEP sequence):

(figure: h43ConDist)

We interpreted this as an indication that the conservation score distribution was capturing two populations: conserved and non-conserved peaks. We therefore performed k-means clustering to partition conservation scores into high and low populations. We defined a peak as conserved if it belonged to the high-scoring conserved population for at least two pairwise inter-species comparisons. The script below performs this analysis and outputs the conserved peaks as new subsetted bed files (ATAC.conPeaks.bed, h41.conPeaks.bed, H43.conPeaks.bed, and H273.conPeaks.bed).

(05_creConservation/conClassify.R)
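The classification logic can be sketched in Python as below. This is an illustration under stated assumptions, not a port of conClassify.R: the function names are hypothetical, the clustering is a hand-written one-dimensional two-cluster k-means seeded at the minimum and maximum scores, and the input is assumed to be a mapping from each non-AEP species to per-peak conservation scores.

```python
def two_means_cutoff(scores, iterations=100):
    """Split 1D scores into low/high clusters; return the boundary between them."""
    low_center, high_center = min(scores), max(scores)
    for _ in range(iterations):
        low = [s for s in scores if abs(s - low_center) <= abs(s - high_center)]
        high = [s for s in scores if abs(s - low_center) > abs(s - high_center)]
        if not high:
            break  # all scores identical: no high cluster exists
        new_centers = (sum(low) / len(low), sum(high) / len(high))
        if new_centers == (low_center, high_center):
            break  # assignments have stopped changing
        low_center, high_center = new_centers
    return (low_center + high_center) / 2


def conserved_peaks(scores_by_species, min_species=2):
    """Peaks in the high-scoring cluster for at least `min_species` comparisons."""
    votes = {}
    for scores in scores_by_species.values():
        cutoff = two_means_cutoff(list(scores.values()))
        for peak, score in scores.items():
            if score > cutoff:
                votes[peak] = votes.get(peak, 0) + 1
    return sorted(peak for peak, n in votes.items() if n >= min_species)


# Toy scores (fraction identity) for three peaks in three pairwise comparisons.
toy_scores = {
    "speciesA": {"peak1": 0.90, "peak2": 0.10, "peak3": 0.80},
    "speciesB": {"peak1": 0.85, "peak2": 0.20, "peak3": 0.15},
    "speciesC": {"peak1": 0.10, "peak2": 0.05, "peak3": 0.90},
}
print(conserved_peaks(toy_scores))
```

Clustering each pairwise comparison separately lets the cutoff adapt to how diverged each species is from AEP.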

These are the remaining sequence conservation score distributions:

(H3K4me1)

(figure: h41ConDist)

(H3K27me3)

(figure: h273ConDist)

(ATAC-seq)

(figure: atacConDist)

After we identified conserved peaks, we wanted to visualize their distribution around genes. Specifically, we wanted to determine the portion of putative conserved enhancer-like regions that were likely to engage in long-distance (≥ 10 Kb) interactions with their target promoter.

To identify candidate enhancer-like regions, we used our conserved H3K4me1 and ATAC-seq peak sets. We used UROPA to calculate the distance from each conserved peak in our H3K4me1 and ATAC-seq data to the nearest TSS.

We specified the parameters of the UROPA run for the conserved H3K4me1 peak set using the following configuration file:

(05_creConservation/conPeakAnnotH41.json)

We also generated an equivalent configuration file for the conserved ATAC-seq peak set, which was identical to the text above except for the following changes (output generated by diff conPeakAnnotH41.json conPeakAnnotATAC.json)

We then ran the UROPA annotation pipeline for both peak sets:

To mitigate the risk of inadvertently including core-promoters in our putative enhancer peaks, we filtered both the ATAC-seq and the H3K4me1 data to remove any peaks that intersected with an H3K4me3 peak (a mark enriched specifically in core promoters).

We then used the following R script to plot the distribution of conserved enhancer-like peaks around TSS. This revealed a sizable portion of peaks located ≥ 10 Kb from the nearest TSS.

(05_creConservation/conEnPlot.R)

(figure: enDist)

We also calculated some basic summary statistics for distances from peaks to TSS, which revealed that a sizable portion of peaks were located > 2 Kb from the nearest TSS.

 

Files Associated with This Document