Characterizing Chromatin Interaction Domains in Hydra

This document provides details on our analysis of 3D genome architecture in the AEP genome assembly. This entailed aligning the raw Hi-C reads to the finalized genome assembly, normalizing contact frequencies, and predicting and characterizing chromatin contact domains.

Re-Mapping the AEP Hi-C Data to the Finalized Genome Assembly

Because we made additional modifications to the AEP genome assembly after the Hi-C scaffolding step (described in 01_aepGenomeAssembly.md), we had to re-map our Hi-C reads to the finalized assembly before we could further characterize chromatin interactions. We used the same mapping approach as when we performed the initial Hi-C scaffolding.

We first predicted restriction enzyme cutsites in the AEP assembly using the generate_site_positions.py script included as part of the Juicer pipeline.

(01_mapping/getCutsFinal.sh)

We then mapped the Hi-C reads using a slightly modified Juicer pipeline script (modifications described in 01_aepGenomeAssembly.md).

(01_mapping/runJuicerfinal.sh)

Note that the above script had to be run twice because of a batch scheduling error in the Juicer pipeline script. Re-executing the pipeline after the initial run failed allowed us to successfully recover the analysis and generate the necessary mapped read files.

 

Identifying and Visualizing Chromatin Contact Domains Using Hi-C Data

Converting and Normalizing the Hi-C Contact Frequency Data

HiCExplorer, the toolkit we used for downstream analysis of our mapped Hi-C reads, uses the .cool format rather than the .hic format used by Juicer. The tools that generate .cool files also cannot read .hic files. We therefore used the merged_nodup2pairs.pl utility script from pairix to convert the mapped read output from Juicer (merged_nodups.txt) to a read pairs file.

(01_mapping/juiceOut2Pairs.sh)

The pairs file was then used to generate .cool files. Generating a .cool file requires specifying the resolution of the data by choosing the bin size used to pool contact data. We used a bin size of 8 kb for visualizing the data and a bin size of 16 kb for domain calling.
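To make the effect of the bin size concrete, here is a minimal Python sketch of pooling read pairs into fixed-size bins. It is an illustration only, not the tool used in this analysis; the function name `bin_contacts`, the tuple layout of the input, and the toy coordinates are all hypothetical.

```python
from collections import Counter


def bin_contacts(pairs, bin_size):
    """Pool read pairs into fixed-size bins and count contacts per bin pair.

    `pairs` is an iterable of (chrom1, pos1, chrom2, pos2) tuples with
    1-based positions (a hypothetical layout chosen for this example).
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, (pos1 - 1) // bin_size)
        b = (chrom2, (pos2 - 1) // bin_size)
        # Store each contact under one key regardless of read-pair order.
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts


# Toy read pairs (made-up coordinates, not real data).
toy_pairs = [
    ("chr-1", 1_200, "chr-1", 9_500),
    ("chr-1", 7_900, "chr-1", 15_000),
    ("chr-1", 20_000, "chr-2", 500),
]
contacts_8k = bin_contacts(toy_pairs, 8_000)
contacts_16k = bin_contacts(toy_pairs, 16_000)
```

With the larger bin size, the first two toy pairs collapse onto the diagonal bin, which shows how a coarser resolution trades positional detail for higher counts per bin.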

We first generated the 8 kb bin .cool file from the read pairs file:

(01_mapping/pair2Cool8k.sh)

We then normalized the contact frequency data using hicCorrectMatrix from HiCExplorer. This involved generating a diagnostic plot showing the distribution of contact frequencies for all bins in the genome at the specified bin size:

(01_mapping/correctionPlot8k.sh)

correctionPlot8k

This plot provides guidance for selecting the lower and upper cutoff thresholds used to remove low and high contact frequency outliers that could skew the normalization. We selected cutoffs of -1.75 and 2.5 for the 8 kb bin size.

(01_mapping/runCorrection8k.sh)
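The following Python sketch illustrates the general idea of threshold-based outlier filtering. It assumes the cutoffs are applied to a modified z-score (based on the median and median absolute deviation) of per-bin contact totals; that assumption is not stated in this document and should be checked against the HiCExplorer documentation. The function names and toy values are hypothetical.

```python
from statistics import median


def modified_z_scores(values):
    """Median/MAD-based z-scores.

    The 0.6745 constant rescales the MAD so that scores are roughly
    comparable to ordinary z-scores for normally distributed data.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


def bins_to_keep(bin_totals, lower, upper):
    """Return indices of bins whose score lies within [lower, upper]."""
    scores = modified_z_scores(bin_totals)
    return [i for i, z in enumerate(scores) if lower <= z <= upper]


# Toy per-bin contact totals (made up): one very sparse and one very dense bin.
toy_totals = [100, 110, 90, 105, 95, 5, 400]
kept = bins_to_keep(toy_totals, lower=-1.75, upper=2.5)
```

In this toy case the sparse bin (5) and the dense bin (400) fall outside the cutoffs and are dropped, while the five typical bins are retained.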

We then performed similar conversion and normalization steps for a 16 kb bin size.

We first converted the read pairs data to contact frequency data with a 16 kb resolution:

(01_mapping/pair2Cool16k.sh)

We next generated a diagnostic contact frequency distribution plot:

(01_mapping/correctionPlot16k.sh)

correctionPlot16k

We normalized the data using cutoff values of -2 and 4:

(01_mapping/runCorrection16k.sh)

Predicting and Visualizing Chromatin Contact Domains

We predicted chromatin contact domains using the 16 kb bin size contact frequency data. We used the HiCExplorer hicFindTADs function to predict domain boundaries, setting --minDepth to 3x the bin size and --maxDepth to 10x the bin size as per the recommendations in the function's documentation. Domain boundaries were identified using an FDR threshold of 0.05.
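As a worked example of the depth settings, the short Python snippet below computes the window sizes implied by the 3x and 10x rules. It assumes that a 16 kb bin means exactly 16,000 bp; the variable names are illustrative and are not taken from the actual script.

```python
# Assumption: the 16 kb bins are exactly 16,000 bp wide.
BIN_SIZE = 16_000

min_depth = 3 * BIN_SIZE   # value corresponding to --minDepth
max_depth = 10 * BIN_SIZE  # value corresponding to --maxDepth

print(f"--minDepth {min_depth} --maxDepth {max_depth}")
```

Under that assumption, the windows used for domain calling span 48,000 to 160,000 bp on either side of each bin.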

(02_domains/findTadsAep16k.sh)

This generated several output files, most notably aep16k_domains.bed, which contains coordinates for the predicted contact domains in the AEP genome; aep16k_boundaries.bed, which contains coordinates for the boundaries of the predicted contact domains; and aep16k_score.bedgraph, which contains insulation scores as a data track for the AEP genome. The insulation score is the basis for predicting domain boundaries, which are marked by rapid shifts in local chromatin contact frequency.

We visualized contact frequency and domain prediction results using Juicebox.

Characterizing Genomic Features at Contact Domain Boundaries

Bilaterian TAD boundaries are typically located in conserved regions of euchromatin. We therefore sought to determine if this was also the case for the domain boundaries we predicted in the AEP assembly using our Hi-C data. To do this, we characterized the distribution of both repressive (H3K27me3) and activating (H3K4me1 and H3K4me3) histone marks, chromatin accessibility, sequence conservation, and repetitive elements relative to predicted chromatin domain boundaries using the deeptools function computeMatrix. We also included flanking regions 100 kb up- and downstream of the domain boundaries.

The CUT&Tag and ATAC-seq bigwigs used for these plots (e.g., AEP_MG_final_shift.bw, H41_MG.bw, etc.) were generated in 08_creIdentification.md. The sequence conservation bigwig (aepCon.bw) was generated in 07_genomeConservation.md. The repeat density bigwig (repDensity.bw) was generated in 02_repeatMasking.md.

(02_domains/calcTADMat_hetCon.sh)

We visualized the results using the deeptools plotHeatmap function:

(02_domains/plotBoundHeat_hetCon.sh)

tadHeat

 

Investigating a Role for Contact Domains in Transcriptional Regulation

We next wanted to determine if chromatin contact domains influence transcriptional regulation in Hydra. Our approach for testing this was to use the single-cell Hydra atlas to determine if genes that fell within the same contact domain tended to have more similar expression patterns than genes that were not within the same contact domain.

To explore this question, we needed to assign each AEP gene model to a chromatin contact domain. We extracted gene coordinates from the HVAEP1.GeneModels.gtf file and converted them into a bed file.
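A minimal Python sketch of this kind of GTF-to-BED conversion is shown below. It is not the command used in this analysis. It assumes the GTF contains rows with a `gene` feature type and a `gene_id` attribute, and the function name and toy records are hypothetical.

```python
import re

# Assumption: gene identifiers are stored as gene_id "..." in column 9.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_genes_to_bed(gtf_lines):
    """Yield BED6 tuples (chrom, start, end, name, score, strand) for gene rows."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF is 1-based and end-inclusive; BED is 0-based and end-exclusive,
        # so only the start coordinate shifts by one.
        start = int(fields[3]) - 1
        end = int(fields[4])
        yield (fields[0], start, end, match.group(1), ".", fields[6])


# Toy GTF records (made up for illustration).
toy_gtf = [
    'chr-1\tsrc\tgene\t101\t500\t.\t+\t.\tgene_id "G1";\n',
    'chr-1\tsrc\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "G1.t1";\n',
    'chr-1\tsrc\tgene\t901\t1500\t.\t-\t.\tgene_id "G2";\n',
]
bed_rows = list(gtf_genes_to_bed(toy_gtf))
```

The one detail that is easy to get wrong is the coordinate shift: skipping it moves every gene start by one base, which can matter when genes sit immediately next to a boundary.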

We then used the bedtools closest function to find the contact domain boundary that was closest to each AEP gene model. The output genesCloseTads.bed included the name and coordinates of the closest boundary as well as its distance to the target gene.

(02_domains/getCloseTads.sh)

The output from bedtools closest was then used as input for a custom R script. This script identified sets of three consecutive genes that spanned a domain boundary. We excluded all triplets where the central gene fell within the predicted boundary coordinates, which left triplets where two of the genes were in the same domain and one of the genes was in a different domain. We could then generate two different consecutive gene pairs: an inter-domain pair and an intra-domain pair:

gPairTypes
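The triplet logic can be sketched in Python as follows. This is a simplified illustration rather than a translation of the R script: it assumes each gene has already been labeled with a domain identifier (hypothetical labels such as `"d1"`), uses `None` for genes that overlap a boundary, and skips any triplet containing such a gene, which is stricter than excluding only triplets whose central gene overlaps a boundary.

```python
def classify_triplets(genes):
    """Split boundary-spanning gene triplets into intra- and inter-domain pairs.

    `genes` is a list of (gene_id, domain_id) tuples in chromosome order.
    A domain_id of None marks a gene that overlaps a boundary (assumption).
    """
    results = []
    for (g1, d1), (g2, d2), (g3, d3) in zip(genes, genes[1:], genes[2:]):
        if None in (d1, d2, d3):
            continue  # simplification: drop triplets touching a boundary gene
        if d1 == d2 and d2 != d3:
            # Boundary lies between the second and third gene.
            results.append({"intra": (g1, g2), "inter": (g2, g3)})
        elif d1 != d2 and d2 == d3:
            # Boundary lies between the first and second gene.
            results.append({"inter": (g1, g2), "intra": (g2, g3)})
    return results


# Toy gene order along one chromosome (made-up identifiers).
toy_genes = [("A", "d1"), ("B", "d1"), ("C", "d2"),
             ("D", "d2"), ("E", None), ("F", "d3")]
triplets = classify_triplets(toy_genes)
```

Because both pairs in a triplet share the central gene, the intra- and inter-domain pairs are matched for genomic neighborhood, so a difference in expression similarity is less likely to reflect regional effects alone.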

We then imported the NMF gene scores for the Hydra single-cell atlas (generated in 05_hydraAtlasReMap.md) and used them to determine if inter-domain gene pairs had more or less correlated expression patterns than intra-domain pairs.

(02_domains/boundaryCor.R)

tadExpCor

The distribution of correlation scores suggested that consecutive gene pairs that fell within the same contact domain had more similar expression patterns than consecutive gene pairs that spanned a contact domain boundary. To determine if this difference was significant, we used a standard Student's t-test.
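For reference, a comparison of this kind can be sketched in Python as below. The sketch computes a Pearson correlation between two genes' score vectors and a pooled-variance two-sample t statistic, which is one common reading of "standard Student's t-test" (R's `t.test` uses Welch's variant by default, so the exact variant used here is an assumption). The function names and all numbers are made up, and the p-value is omitted because it requires a t-distribution CDF that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def student_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Toy per-gene score vectors (made up, one value per hypothetical metagene).
r_same = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = pearson([1, 2, 3, 4], [4, 3, 2, 1])

# Toy correlation scores for intra- and inter-domain pairs (made up).
intra_cors = [0.8, 0.6, 0.7]
inter_cors = [0.2, 0.4, 0.3]
t_stat, dof = student_t(intra_cors, inter_cors)
```

A positive t statistic in this setup means the intra-domain pairs have the higher mean correlation.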

 

Comparative analysis of chromosome-level 3D genome architecture in cnidarians

Compiling and re-analyzing previously published cnidarian Hi-C data

To contextualize the 3D organization of the Hydra genome relative to other cnidarian genomes, we downloaded previously released Hi-C data for six other cnidarian chromosome-level genome assemblies. The table below provides information on the specific SRA datasets we accessed.

| Species | SRA Accession #'s | Genome Reference | Restriction Enzyme |
|---|---|---|---|
| A. millepora | SRR13361157 SRR13361158 SRR13361159 SRR13361156 SRR13361160 SRR13361155 SRR13361154 SRR13361162 SRR13361163 | GSM5182734 (GEO Accession) | MboI |
| D. lineata | ERR6688655 | GCA_918843875.1 (GenBank Accession) | Arima |
| H. octoradiatus | ERR6745733 | GCA_916610825.1 (GenBank Accession) | Arima |
| N. vectensis | SRR12775957 | SIMRBASE Link | DpnII |
| R. esculentum | SRR11649085 | GCA_013076305.1 (GenBank Accession) | DpnII |

We downloaded the raw read files for these datasets from SRA using the SRA Tools fasterq-dump function.

fasterq-dump <list of accession #'s for a dataset of interest>

As with the AEP Hi-C data, we then used the Juicer pipeline to align the Hi-C data and generate contact maps for each genome. This entailed first predicting restriction enzyme cutsites using the generate_site_positions.py script included as part of the Juicer pipeline. Note that for the next several scripts, a single species is used as an example. The scripts for the other species use the same commands, but with the relevant changes in files and restriction enzymes.

(03_compare/getCutsAmil.sh)

We then generated a .genome file of contig sizes.
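The command used for this step is not shown here. As a hedged illustration, the Python sketch below derives sequence lengths from a FASTA file and formats them as a two-column, tab-delimited table of sequence name and length. That layout is an assumption about what the .genome file contains, and the function name and toy sequences are hypothetical.

```python
def fasta_lengths(lines):
    """Return [(name, length)] in file order.

    The name is taken as the first whitespace-delimited word of each header.
    """
    sizes = []
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sizes.append((name, length))
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        sizes.append((name, length))
    return sizes


# Toy FASTA content (made up); a real run would iterate over an open file.
toy_fasta = [">chr-1 example\n", "ACGTACGT\n", "ACGT\n", ">chr-2\n", "GGCC\n"]
genome_rows = fasta_lengths(toy_fasta)
genome_text = "".join(f"{name}\t{length}\n" for name, length in genome_rows)
```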

The genome was also indexed for mapping using bwa.

bwa index amil.fa

Finally, we ran the modified Juicer pipeline (described in 01_aepGenomeAssembly.md) to generate the contact frequency maps for each species.

(03_compare/runJuicerAmil.sh)

Quantifying telomere interaction frequencies using Aggregate Chromosome Analysis (ACA)

A previous publication (Hoencamp et al., 2021) established an unbiased quantitative framework, called aggregate chromosome analysis (ACA), for systematically comparing inter- and intra-chromosomal interactions across different species. ACA generates a representative chromosome interaction profile for a given species by averaging length-normalized interaction maps of individual chromosomes. A number of metrics are then calculated from this representative profile to characterize the rate at which different chromosome regions interact both in cis and in trans. Specifically, ACA calculates metrics for telomere-to-telomere, telomere-to-centromere, and centromere-to-centromere interactions.

Quantifying 3D chromatin interactions at centromeres requires knowing the centromere coordinates. Apart from Hydra, such information is not currently available for cnidarian genome assemblies. Thus, we were unable to use ACA for quantifying centromere interactions. However, the telomere-to-telomere metric does not depend on having accurate centromere coordinates, so we ran the ACA using 'dummy' centromere coordinates and used only the telomere-to-telomere interaction quantification results.

To perform the ACA, we first filtered out any non-chromosome scaffolds from the assembly using seqkit.

(The code examples below show only a single species, but the same basic steps were applied to all species in the analysis.)

seqkit sort -l -r amil.fa | seqkit head -n 14 - > amil.chroms.fa

This required that we regenerate the .genome file for each assembly.

We then used the build-aca-hic.sh script from the 3d-dna package to perform the ACA. Prior to running this script, we generated dummy coordinates for 10 kb centromeres in the center of each pseudo-chromosome in the assembly. The input for this script also included the merged_nodups.txt file generated for each genome by the Juicer mapping pipeline.

(03_compare/pseudoACA.sh)
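The placement of the dummy centromeres can be illustrated with the Python sketch below, which centers a 10 kb window on each chromosome. The exact file layout expected by build-aca-hic.sh is not shown here, so the BED-like (name, start, end) tuples, the function name, and the toy chromosome sizes are assumptions made for illustration.

```python
def dummy_centromeres(chrom_sizes, width=10_000):
    """Place a fixed-width placeholder interval at the midpoint of each chromosome.

    `chrom_sizes` is a list of (name, length) tuples. Coordinates are
    0-based and end-exclusive (an assumption for this example).
    """
    rows = []
    for name, length in chrom_sizes:
        start = max(0, length // 2 - width // 2)
        end = min(length, start + width)
        rows.append((name, start, end))
    return rows


# Toy chromosome sizes (made up).
toy_sizes = [("chr-1", 1_000_000), ("chr-2", 250_001)]
centromeres = dummy_centromeres(toy_sizes)
```

Because only the telomere-to-telomere metric is used downstream, the exact position of these placeholder intervals should not affect the reported results.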

To run the ACA across each genome in our compiled dataset, we used the following wrapper script:

(03_compare/runPseudoACA.sh)

The build-aca-hic.sh script produced a .hic file containing the representative chromosome interaction profile. We used the score-aca.sh script from the 3d-dna pipeline to calculate the telomere-to-telomere interaction score (among other metrics) for each species:

(03_compare/getPseudoAcaScores.sh)

This produced the following output:

(03_compare/ACAscore.out)

The telomere-to-telomere interaction metric is the fourth and last number output for each species. These values were used for generating plots in the section below.

Quantifying centromere interaction frequencies

Because we were unable to use the centromere interaction metrics from the ACA pipeline, we developed a novel method for quantifying centromere-to-centromere interactions that did not rely on previous knowledge of centromere coordinates. The concept behind this approach was that strong inter-centromeric interactions should be discernible as highly localized regions with elevated rates of inter-chromosomal interactions compared to other regions within a given chromosome; however, such localized enrichment should be absent in species with low levels of inter-centromeric interactivity.

To calculate a metric that captures this localized enrichment signature in an unbiased fashion, we first used Juicer Tools to output Knight-Ruiz (KR) normalized interaction matrices with a 100 kb bin size for all inter-chromosomal scaffold pairs (i.e., chr-1 interactions with chr-2 but not chr-1 interactions with chr-1) using the Juicer-derived .hic files we generated above for each cnidarian species of interest. This required the .genome files for the chromosome-only version of each assembly that were generated as part of the ACA (see above).

(03_compare/crossChromDump.sh)

The above script was executed for each species as follows:

After generating the interaction matrices, we quantified the extent to which each chromosome in the assembly possessed a localized region with greatly elevated rates of inter-chromosomal interaction. To do this, we first calculated the median number of normalized inter-chromosomal contacts for each 100 kb bin along the length of each scaffold across all possible inter-chromosomal scaffold pairs. We then removed the bins in the first and last tenth of each chromosome (to remove the telomere interaction signal) and converted the remaining values into z-scores. The inter-centromeric interaction score for each chromosome was defined as the highest z-score value along the length of a given scaffold. We then plotted the distribution of this score across all chromosomes for all species.

(03_compare/chromCalcs.R)
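The scoring step can be sketched in Python as follows. This is an illustration of the described calculation, not a translation of chromCalcs.R: it starts from precomputed per-bin medians, assumes the z-scores use the sample standard deviation, and uses a hypothetical function name and made-up values.

```python
from statistics import mean, stdev


def inter_centromeric_score(bin_medians, trim_fraction=0.1):
    """Highest z-score among bins after trimming both chromosome ends.

    `bin_medians` holds the median inter-chromosomal contact value for each
    bin along one chromosome, in positional order.
    """
    n_trim = int(len(bin_medians) * trim_fraction)
    core = bin_medians[n_trim:len(bin_medians) - n_trim]
    mu = mean(core)
    sd = stdev(core)  # assumption: sample standard deviation (n - 1)
    if sd == 0:
        return 0.0
    return max((v - mu) / sd for v in core)


# Toy chromosome with strong telomere signal at both ends and one sharp
# internal peak (made-up values).
peaked = [50, 40] + [1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1] + [45, 60]
# Toy chromosome with the same telomere signal but no localized peak.
flat = [50, 40] + [1, 2] * 8 + [45, 60]

peaked_score = inter_centromeric_score(peaked)
flat_score = inter_centromeric_score(flat)
```

In these toy inputs the high values at the chromosome ends are trimmed away before scoring, so only the internal peak drives the difference between the two scores.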

interCentScores

This analysis suggested that the H. vulgaris assembly has markedly higher levels of inter-centromeric interaction than other cnidarian genomes. To determine if this difference was significant, we used Tukey’s Honest Significant Difference method to perform a post-hoc significance test on an ANOVA calculated on all inter-centromeric contact scores for all species.

(03_compare/chromCalcs.R)

This generated the following result:

This indicates that the AEP inter-centromeric interaction scores were significantly higher than those of all other cnidarian genomes considered in the analysis.

In this script, we also generated plots for the inter-telomeric interaction scores generated using the ACA method.

(03_compare/chromCalcs.R)

interTelScores

These results suggest that there is little variation in the levels of inter-telomeric interactions among cnidarians, and that H. vulgaris is not markedly different from the other species considered in this analysis.

Files Associated with This Document