This document provides details on our analysis of 3D genome architecture in the AEP genome assembly. This entailed aligning the raw Hi-C reads to the finalized genome assembly, normalizing contact frequencies, and predicting and characterizing chromatin contact domains.
Characterizing Chromatin Interaction Domains in Hydra

- Re-Mapping the AEP Hi-C Data to the Finalized Genome Assembly
- Identifying and Visualizing Chromatin Contact Domains Using Hi-C Data
- Converting and Normalizing the Hi-C Contact Frequency Data
- Predicting and Visualizing Chromatin Contact Domains
- Characterizing Genomic Features at Contact Domain Boundaries
- Investigating a Role for Contact Domains in Transcriptional Regulation
- Comparative analysis of chromosome-level 3D genome architecture in cnidarians
- Compiling and re-analyzing previously published cnidarian Hi-C data
- Quantifying telomere interaction frequencies using Aggregate Chromosome Analysis (ACA)
- Quantifying centromere interaction frequencies
- Files Associated with This Document
Because we made additional modifications to the AEP genome assembly after the Hi-C scaffolding step (described in 01_aepGenomeAssembly.md), we had to re-map our Hi-C reads to the finalized assembly before we could further characterize chromatin interactions. We used the same mapping approach as when we performed the initial Hi-C scaffolding.
We first predicted restriction enzyme cutsites in the AEP assembly using the generate_site_positions.py script included as part of the Juicer pipeline.
(01_mapping/getCutsFinal.sh)
    #SBATCH -p med
    #SBATCH --job-name=cutS
    #SBATCH -c 1
    #SBATCH -t 60-0
    #SBATCH --mem=16G
    #SBATCH --error=cutS_%j.err
    #SBATCH --output=cutS_%j.out

    python ../juicer/misc/generate_site_positions.py Arima final \
        ../../aep.final.genome.fa

We then mapped the Hi-C reads using a slightly modified Juicer pipeline script (modifications described in 01_aepGenomeAssembly.md).
(01_mapping/runJuicerfinal.sh)
    #SBATCH -p med
    #SBATCH --job-name=jLaunch
    #SBATCH -c 1
    #SBATCH -t 60-0
    #SBATCH --mem=8G
    #SBATCH --error=jLaunch_%j.err
    #SBATCH --output=jLaunch_%j.out

    ./scripts/juicerMod.sh \
        -g final -z ../resources/references/final/aep.final.genome.fa \
        -p ../aep.final.genome \
        -q med -Q 60-0 -l med -L 60-0 -t 8 \
        -D /home/jacazet/reference/aepAssembly/06_HiC \
        -d /home/jacazet/reference/aepAssembly/06_HiC/work

Note that the above script had to be run twice because of a batch scheduling error in the Juicer pipeline script. Re-executing the pipeline after the initial run failed allowed us to recover the analysis and generate the necessary mapped read files.
The tools we used for downstream analysis of our mapped Hi-C reads (HiCExplorer) use the .cool format, as opposed to the .hic format used by Juicer. In addition, the tools that generate .cool files are also not compatible with .hic files. We therefore used the merged_nodup2pairs.pl utility script from pairix to convert the mapped read output from Juicer (merged_nodups.txt) to a read pairs file.
(01_mapping/juiceOut2Pairs.sh)
    #SBATCH -p med
    #SBATCH --job-name=con
    #SBATCH -c 8
    #SBATCH -t 60-0
    #SBATCH --mem=16G
    #SBATCH --error=con.err
    #SBATCH --output=con.out

    merged_nodup2pairs.pl -m 29 -s 6 work/aligned/merged_nodups.txt aep.genome aep

The pairs file was then used to generate .cool files. When generating these files, you need to specify the resolution of the data by picking the bin size used to pool contact data. We used a bin size of 8 Kb for visualizing the data, and a bin size of 16 Kb for domain calling.
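To make the role of the bin size concrete, the sketch below shows how a genomic position maps to the start coordinate of its bin at the two resolutions used here. The `bin_start` helper is hypothetical and is not part of cooler or HiCExplorer; it only illustrates the idea of flooring a position to a multiple of the bin size.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cooler): return the start coordinate of the
# fixed-width bin that contains a 0-based genomic position.
bin_start() {
    local pos="$1" bin_size="$2"
    # Integer division floors the position to a multiple of the bin size
    echo $(( pos / bin_size * bin_size ))
}

bin_start 30000 8000    # prints 24000
bin_start 30000 16000   # prints 16000
```

Larger bins pool more read pairs per bin, which reduces noise at the cost of resolution.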
We first generated the 8 kb bin .cool file from the read pairs file:
(01_mapping/pair2Cool8k.sh)
    cooler cload pairix -p 6 aep.genome:8000 aep.bsorted.pairs.gz aepHic.8k.cool

We then normalized the contact frequency data using hicCorrectMatrix from HiCExplorer. This involved generating a diagnostic plot showing the distribution of contact frequencies for all bins in the genome at the specified bin size:
(01_mapping/correctionPlot8k.sh)
    hicCorrectMatrix diagnostic_plot -m aepHic.8k.cool -o correctionPlot.png
This plot provides guidance for selecting cutoff thresholds for removing low and high contact frequency outliers that could skew the normalization. We selected a cutoff of -1.75 and 2.5 for the 8 Kb bin size.
(01_mapping/runCorrection8k.sh)
    hicCorrectMatrix correct -m aepHic.8k.cool --filterThreshold -1.75 2.5 -o hic_corrected8k.cool

We then performed similar conversion and normalization steps for a 16 Kb bin size.
We first converted the read pairs data to contact frequency data with a 16 Kb resolution:
(01_mapping/pair2Cool16k.sh)
    cooler cload pairix -p 6 aep.genome:16000 aep.bsorted.pairs.gz aepHic.16k.cool

We next generated a diagnostic contact frequency distribution plot:
(01_mapping/correctionPlot16k.sh)
    hicCorrectMatrix diagnostic_plot -m aepHic.16k.cool -o correctionPlot.png
We normalized the data using cutoff values of -2 and 4:
(01_mapping/runCorrection16k.sh)
    hicCorrectMatrix correct -m aepHic.16k.cool --filterThreshold -2 4 -o hic_corrected16k.cool

We predicted chromatin contact domains using the 16 Kb bin size contact frequency data. We used the HiCExplorer hicFindTADs function to predict domain boundaries. --minDepth was set to 3x the bin size and --maxDepth to 10x the bin size as per the recommendations in the function's documentation. Domain boundaries were identified using an FDR threshold of 0.05.
(02_domains/findTadsAep16k.sh)
    hicFindTADs -m hic_corrected16k.cool --outPrefix aep16k --correctForMultipleTesting fdr \
        --minDepth 48000 --maxDepth 160000 --step 16000 -p 4 --thresholdComparisons 0.05

This generated several output files, most notably aep16k_domains.bed, which contains coordinates for the predicted contact domains in the AEP genome; aep16k_boundaries.bed, which contains coordinates for the boundaries of the predicted contact domains; and aep16k_score.bedgraph, which contains insulation scores as a data track for the AEP genome. The insulation score is the basis for predicting domain boundaries, which are marked by rapid shifts in local chromatin contact frequency.
We visualized contact frequency and domain prediction results using JuiceBox.
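As a small worked example of the depth settings described above, the sketch below derives the `--minDepth`, `--maxDepth`, and `--step` values passed to hicFindTADs from a bin size, using the 3x and 10x multipliers stated in the text. The `tad_depth_args` function is a hypothetical convenience wrapper and is not part of HiCExplorer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the depth-related hicFindTADs arguments from a
# bin size, using minDepth = 3x and maxDepth = 10x the bin size.
tad_depth_args() {
    local bin_size="$1"
    local min_depth=$(( bin_size * 3 ))
    local max_depth=$(( bin_size * 10 ))
    # "--" stops printf from parsing the format string as an option
    printf -- '--minDepth %d --maxDepth %d --step %d\n' \
        "$min_depth" "$max_depth" "$bin_size"
}

tad_depth_args 16000   # prints --minDepth 48000 --maxDepth 160000 --step 16000
```

For the 16 Kb matrix this reproduces the values used in findTadsAep16k.sh.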
Bilaterian TAD boundaries are typically located in conserved regions of euchromatin. We therefore sought to determine if this was also the case for the domain boundaries we predicted in the AEP assembly using our Hi-C data. To do this, we characterized the distribution of both repressive (H3K27me3) and activating (H3K4me1 and H3K4me3) histone marks, chromatin accessibility, sequence conservation, and repetitive elements relative to predicted chromatin domain boundaries using the deeptools function computeMatrix. We also included flanking regions 100 kb up- and downstream of the domain boundaries.
The CUT&Tag and ATAC-seq bigwigs used for these plots (e.g., AEP_MG_final_shift.bw, H41_MG.bw, etc.) were generated in 08_creIdentification.md. The sequence conservation bigwig (aepCon.bw) was generated in 07_genomeConservation.md. The repeat density bigwig (repDensity.bw) was generated in 02_repeatMasking.md.
(02_domains/calcTADMat_hetCon.sh)
    computeMatrix scale-regions -S '../../../Cut&Stuff/CnT/H273_MG.bw' \
        ../../alignment_conservation/windows/aepCon.bw \
        ../../repeats/repDensity.bw \
        '../../../Cut&Stuff/CnT/H41_MG.bw' \
        '../../../Cut&Stuff/CnT/H43_MG.bw' \
        '../../../Cut&Stuff/ATAC/AEP_MG_final_shift.bw' \
        -R /Volumes/Data/genome/hic/aep16k_domains.bed \
        -o tadMat_hetCon.gz \
        -m 100000 -b 100000 -a 100000 \
        --averageTypeBins median \
        -bs 1000 \
        --missingDataAsZero -p 6

We visualized the results using the deeptools plotHeatmap function:
(02_domains/plotBoundHeat_hetCon.sh)
    plotHeatmap -m tadMat_hetCon.gz -o tadHeat_hetCon.pdf \
        --colorList "white,darkblue" \
        --heatmapHeight 5 \
        --yMax 0.1 0.4 0.9 0.07 0.07 0.08 \
        --yMin 0 0.1 0.7 0 0 0.04 \
        --zMax 1 1 1 0.08 0.2 0.2
We next wanted to determine if chromatin contact domains influence transcriptional regulation in Hydra. Our approach for testing this was to use the single-cell Hydra atlas to determine if genes that fell within the same contact domain tended to have more similar expression patterns than genes that were not within the same contact domain.
To explore this question, we needed to assign each AEP gene model to a chromatin contact domain. We extracted gene coordinates from the HVAEP1.GeneModels.gtf file and converted them into a bed file.
    awk 'BEGIN { OFS = "\t" ; FS = "\t" } ; $3 ~ /gene/ {print $1,$4-1,$5,$9,0,$7}' HVAEP1.GeneModels.gtf | sed 's/ID "//g;s/";.*\t0/\t0/g' | gsort -k1,1 -k2,2n > HVAEP1.genes.sorted.bed

    gsort -k1,1 -k2,2n aep16k_boundaries.bed > aep16k_boundaries.sorted.bed

We then used the bedtools closest function to find the contact domain boundary that was closest to each AEP gene model. The output genesCloseTads.bed included the name and coordinates of the closest boundary as well as its distance to the target gene.
(02_domains/getCloseTads.sh)
    bedtools closest -D ref -a HVAEP1.genes.sorted.bed -b aep16k_boundaries.sorted.bed > genesCloseTads.bed

The output from bedtools closest was then used as input for a custom R script. This script identified sets of three consecutive genes that spanned a domain boundary. We excluded all triplets where the central gene fell within the predicted boundary coordinates, which left triplets where two of the genes were in the same domain and one of the genes was in a different domain. From each triplet we could then generate two different consecutive gene pairs: an inter-domain pair and an intra-domain pair.

We then imported the NMF gene scores for the Hydra single-cell atlas (generated in 05_hydraAtlasReMap.md) and used them to determine if inter-domain gene pairs had more or less correlated expression patterns than intra-domain pairs.
(02_domains/boundaryCor.R)
    library(rstudioapi)
    library(ggplot2)

    setwd(dirname(getActiveDocumentContext()$path))

    #import information on the nearest HIC boundary for each gene
    tadLink <- read.delim('genesCloseTads.bed', header = F)

    #drop any genes that fall within a boundary
    tadLink <- tadLink[tadLink$V13 != 0,]

    #save original df to use later
    tadLink.orig <- tadLink

    #split genes by their nearest TAD
    tadLink <- split(tadLink,tadLink$V10)

    crossPairs <- lapply(tadLink, function(x){
      #get all genes that lie downstream of the domain boundary
      #weirdly, bedtools gave downstream genes negative distance values
      rightG <- x[x$V13 < 0,]
      #pick the downstream gene that is closest to the domain boundary
      rightG <- rightG[which.max(rightG$V13),'V4']
      #get numeric equivalent of gene ID (to check if left and right genes are consecutive)
      rightG.n <- as.numeric(gsub('HVAEP1_G','',rightG))
      #get all genes that lie upstream of the domain boundary
      leftG <- x[x$V13 > 0,]
      #pick the upstream gene that is closest to the domain boundary
      leftG <- leftG[which.min(leftG$V13),'V4']
      #get numeric equivalent of gene ID
      leftG.n <- as.numeric(gsub('HVAEP1_G','',leftG))
      #if the two genes flanking a boundary are consecutive, return the gene pair
      #otherwise do nothing
      if(length(rightG.n) > 0 && length(leftG.n) > 0){
        if((rightG.n - leftG.n) == 1){
          return(c(leftG,rightG))
        }
      }
    })

    #drop empty results
    crossPairs <- crossPairs[sapply(crossPairs,length) > 0]

    #collapse into table
    crossPairs <- do.call(rbind,crossPairs)

    #import NMF gene scores
    gScore <- read.delim('../../ds/nmf/final/whole_unfilt_fine_broad.gene_spectra_score.k_28.dt_0_2.txt',row.names = 1)

    gScore <- t(gScore)

    #fix gene name formatting
    rownames(gScore) <- gsub('[.]','_',rownames(gScore))

    #drop gene pairs that don't have gene scores
    crossPairs <- as.data.frame(crossPairs[crossPairs[,1] %in% rownames(gScore) & crossPairs[,2] %in% rownames(gScore),])

    #generate gene ID for genes that are downstream of the righthand genes in the crosspairs list
    #these genes will be in the same domain as the righthand crosspairs gene
    cisPairs <- as.numeric(substr(crossPairs[,2],9,14)) + 1

    cisPairs <- formatC(cisPairs, width = 6, format = "d", flag = "0")

    cisPairs <- paste0('HVAEP1_G',cisPairs)

    cisPairs <- data.frame(g1 = crossPairs[,2],g2=cisPairs)

    #make sure cispairs both have gscores
    cisPairs <- as.data.frame(cisPairs[cisPairs[,1] %in% rownames(gScore) & cisPairs[,2] %in% rownames(gScore),])

    #check and make sure the two cispair genes are indeed in the same domain
    cisPairs <- cisPairs[cisPairs$g2 %in% tadLink.orig$V4,]

    cisPairs$g1Bid <- tadLink.orig[match(cisPairs$g1, tadLink.orig$V4),'V10']
    cisPairs$g2Bid <- tadLink.orig[match(cisPairs$g2, tadLink.orig$V4),'V10']

    cisPairs <- cisPairs[cisPairs$g1Bid == cisPairs$g2Bid,]

    #limit the crosspairs to only those genes that also had a valid cis pair
    crossPairs <- crossPairs[crossPairs$V2 %in% cisPairs$g1,]

    #compare gene scores across metagenes for crosspairs
    xPairCor <- apply(crossPairs,1,function(x){
      cor(gScore[x[1],],gScore[x[2],])
    })

    #compare gene scores across metagenes for cispairs
    cisPairCor <- apply(cisPairs,1,function(x){
      cor(gScore[x[1],],gScore[x[2],])
    })

    #generate dataframe of correlation scores for plotting
    #(each label is repeated once per correlation value in its own group)
    plotDF <- data.frame(corVal = c(cisPairCor,xPairCor), lab = as.factor(rep(c('cis','cross'),c(nrow(cisPairs),nrow(crossPairs)))))

    #define cross as first level of factors (specifying plotting order)
    plotDF$lab <- relevel(plotDF$lab, "cross")

    ggplot(plotDF,aes(x=lab,y=corVal,fill=lab)) + geom_boxplot()
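The script above finds the intra-domain partner of each gene by incrementing the numeric part of its gene ID, which relies on consecutive genes carrying consecutive, six-digit, zero-padded IDs (the `substr(...,9,14)` and `formatC(..., width = 6, flag = "0")` calls). The shell sketch below reproduces only that ID arithmetic; `next_gene_id` is a hypothetical helper written for illustration and is not part of the analysis scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return the ID of the gene that follows a given
# HVAEP1 gene ID, assuming six-digit zero-padded numbering.
next_gene_id() {
    local gene_id="$1"
    # Strip the fixed prefix to leave the zero-padded number
    local num="${gene_id#HVAEP1_G}"
    # 10# forces base-10 so leading zeros are not parsed as octal
    printf 'HVAEP1_G%06d\n' "$(( 10#$num + 1 ))"
}

next_gene_id HVAEP1_G000099   # prints HVAEP1_G000100
```

Whether the resulting ID corresponds to a real gene in the same domain still has to be checked against the bedtools output, as the R script does.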
The distribution of correlation scores suggested that consecutive gene pairs that fell within the same contact domain had more similar expression patterns than consecutive gene pairs that spanned a contact domain boundary. To determine if this difference was significant, we used a two-sided Welch's t-test, which does not assume equal variances between the two groups:
    #use t-test to see if cispairs have significantly higher correlation scores than crosspairs
    t.test(x=cisPairCor,y=xPairCor,alternative = 't',var.equal = F)

    	Welch Two Sample t-test

    data:  cisPairCor and xPairCor
    t = 4.0072, df = 591.5, p-value = 6.929e-05
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     0.05699933 0.16657721
    sample estimates:
    mean of x mean of y
    0.2223149 0.1105267
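For readers unfamiliar with the test reported above, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two small samples using awk. The `welch_t` function and the toy numbers are illustrative assumptions only; they are not the correlation values from the analysis, and the sketch does not compute a p-value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: Welch's t statistic and degrees of freedom for two
# samples, each given as a space-separated string of numbers.
welch_t() {
    awk -v xs="$1" -v ys="$2" '
        # Fill res with the mean, sample variance (n - 1), and size of str
        function mean_var(str, res,    vals, n, i, sum, ss) {
            n = split(str, vals, " ")
            sum = 0; ss = 0
            for (i = 1; i <= n; i++) sum += vals[i]
            res["mean"] = sum / n
            for (i = 1; i <= n; i++) ss += (vals[i] - res["mean"]) ^ 2
            res["var"] = ss / (n - 1)
            res["n"] = n
        }
        BEGIN {
            mean_var(xs, x); mean_var(ys, y)
            sx = x["var"] / x["n"]   # squared standard error of mean x
            sy = y["var"] / y["n"]   # squared standard error of mean y
            t = (x["mean"] - y["mean"]) / sqrt(sx + sy)
            df = (sx + sy) ^ 2 / (sx ^ 2 / (x["n"] - 1) + sy ^ 2 / (y["n"] - 1))
            printf "t=%.3f df=%.3f\n", t, df
        }'
}

welch_t "1 2 3 4 5" "2 4 6 8 10"   # prints t=-1.897 df=5.882
```

The non-integer degrees of freedom in the R output (df = 591.5) arise from the same Welch-Satterthwaite approximation.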
To contextualize the 3D organization of the Hydra genome relative to other cnidarian genomes, we downloaded previously released Hi-C data for six other cnidarian chromosome-level genome assemblies. The table below provides information on the specific SRA datasets we accessed.
| Species | SRA Accession #'s | Genome Reference | Restriction Enzyme |
|---|---|---|---|
| A. millepora | SRR13361157 SRR13361158 SRR13361159 SRR13361156 SRR13361160 SRR13361155 SRR13361154 SRR13361162 SRR13361163 | GSM5182734 (GEO Accession) | MboI |
| D. lineata | ERR6688655 | GCA_918843875.1 (GenBank Accession) | Arima |
| H. octoradiatus | ERR6745733 | GCA_916610825.1 (GenBank Accession) | Arima |
| N. vectensis | SRR12775957 | SIMRBASE Link | DpnII |
| R. esculentum | SRR11649085 | GCA_013076305.1 (GenBank Accession) | DpnII |
We downloaded the raw read files for these datasets from SRA using the SRA Tools fasterq-dump function.
fasterq-dump <list of accession #'s for a dataset of interest>
As with the AEP Hi-C data, we then used the Juicer pipeline to align the Hi-C data and generate contact maps for each genome. This entailed first predicting restriction enzyme cutsites using the generate_site_positions.py script included as part of the Juicer pipeline. Note that for the next several scripts, a single species is used as an example. The scripts for the other species use the same commands, but with the relevant changes in files and restriction enzymes.
(03_compare/getCutsAmil.sh)
    #SBATCH -p med
    #SBATCH --job-name=cutS
    #SBATCH -c 1
    #SBATCH -t 60-0
    #SBATCH --mem=16G
    #SBATCH --error=cutS_%j.err
    #SBATCH --output=cutS_%j.out

    python ../juicer/misc/generate_site_positions.py MboI amil \
        ../amil.fa

We then generated a .genome file of contig sizes.

    samtools faidx amil.fa
    cut -f 1,2 amil.fa.fai > amil.genome

The genome was also indexed for mapping using bwa.
bwa index amil.fa
Finally, we ran the modified Juicer pipeline (described in 01_aepGenomeAssembly.md) to generate the contact frequency maps for each species.
(03_compare/runJuicerAmil.sh)
    #SBATCH -p med
    #SBATCH --job-name=jLaunch
    #SBATCH -c 1
    #SBATCH -t 30-0
    #SBATCH --mem=8G
    #SBATCH --error=jLaunch_%j.err
    #SBATCH --output=jLaunch_%j.out

    ./scripts/juicerMod.sh \
        -g amil -z amil.fa \
        -p amil.genome \
        -s MboI \
        -A jacazet -q med -Q 30-0 -l med -L 30-0 -t 8 \
        -D /home/jacazet/reference/revision/hic \
        -d /home/jacazet/reference/revision/hic/work

A previous publication (Hoencamp et al., 2021) established an unbiased quantitative framework, called aggregate chromosome analysis (ACA), for systematically comparing inter- and intra-chromosomal interactions across different species. ACA is based around generating a representative chromosome interaction profile for a given species by averaging length-normalized interaction maps of individual chromosomes. After this representative profile is generated, a number of metrics are calculated in order to characterize the rate at which different chromosome regions interact both in cis and in trans. Specifically, ACA calculates metrics for telomere-to-telomere, telomere-to-centromere, and centromere-to-centromere interactions.
Quantifying 3D chromatin interactions at centromeres requires knowing the centromere coordinates. Apart from Hydra, such information is not currently available for cnidarian genome assemblies. Thus, we were unable to use ACA for quantifying centromere interactions. However, the telomere-to-telomere metric does not depend on having accurate centromere coordinates, so we ran the ACA using 'dummy' centromere coordinates and used only the telomere-to-telomere interaction quantification results.
To perform the ACA, we first filtered out any non-chromosome scaffolds from each assembly using seqkit, keeping only the longest sequences (14 in the A. millepora example shown here).
(The code examples below show only a single species, but the same basic steps were applied to all species in the analysis.)
seqkit sort -l -r amil.fa | seqkit head -n 14 - > amil.chroms.fa
This required that we regenerate the .genome file for each assembly.
    samtools faidx amil.chroms.fa
    cut -f 1,2 amil.chroms.fa.fai > amil.chroms.genome

We then used the build-aca-hic.sh script from the 3d-dna package to perform the ACA. Prior to running this script, we generated dummy coordinates for 10 kb centromeres in the center of each pseudo-chromosome in the assembly. The input for this script also included the merged_nodups.txt file generated for each genome by the Juicer mapping pipeline.
(03_compare/pseudoACA.sh)
    specUse="$1"

    specDir="$2"

    awk -v OFMT='%f' -F '\t' '{print $1, int($2/2), int($2/2)+10000}' "$specUse".chroms.genome > "$specUse".pseudochroms.bed

    ./aidenlab-3d-dna-cb63403/supp/build-aca-hic.sh \
        "$specUse".chroms.genome \
        "$specUse".pseudochroms.bed \
        "$specDir"/aligned/merged_nodups.txt

To run the ACA across each genome in our compiled dataset, we used the following wrapper script:
(03_compare/runPseudoACA.sh)
    #SBATCH -p med
    #SBATCH --job-name=ACA
    #SBATCH --exclusive
    #SBATCH -t 30-0
    #SBATCH --mem=0
    #SBATCH --error=ACA.err
    #SBATCH --output=ACA.out

    dirAr=( workAEP workAmil workHoct workResc workDili workNvec200 )
    specAr=( aep amil hoct resc dili Nvec200 )

    #note: this index range assumes 1-based arrays (e.g. zsh);
    #bash arrays are 0-based and would need {0..5} instead
    for i in {1..6..1}; do
        dirUse=${dirAr[$i]}
        specUse=${specAr[$i]}
        echo "$dirUse" "$specUse"
        ./pseudoACA.sh "$specUse" "$dirUse"
    done

The build-aca-hic.sh script produced a .hic file containing the representative chromosome interaction profile. We used the score-aca.sh script from the 3d-dna pipeline to calculate the telomere-to-telomere interaction score (among other metrics) for each species:
(03_compare/getPseudoAcaScores.sh)
    #SBATCH -p med
    #SBATCH --job-name=ACAscores
    #SBATCH -c 1
    #SBATCH -t 30-0
    #SBATCH --mem=20G
    #SBATCH --error=ACAscore.err
    #SBATCH --output=ACAscore.out

    for i in *chroms.genome.aca.hic; do
        echo "$i"
        ./aidenlab-3d-dna-cb63403/supp/score-aca.sh "$i"
    done

This produced the following output:
(03_compare/ACAscore.out)
    aep.chroms.genome.aca.hic
    4.294 1.012 0.988 1.618
    amil.chroms.genome.aca.hic
    1.939 1.027 1.011 1.588
    dili.chroms.genome.aca.hic
    2.795 0.988 0.961 1.283
    hoct.chroms.genome.aca.hic
    2.593 0.992 1.080 1.809
    nem.chroms.genome.aca.hic
    4.861 1.041 0.982 0.923
    Nvec200.chroms.genome.aca.hic
    6.434 0.998 1.006 1.141
    resc.chroms.genome.aca.hic
    4.489 1.013 1.019 1.183

The telomere-to-telomere interaction metric is the fourth and last number output for each species. These values were used for generating plots in the section below.
Because we were unable to use the centromere interaction metrics from the ACA pipeline, we developed a novel method for quantifying centromere-to-centromere interactions that did not rely on previous knowledge of centromere coordinates. The concept behind this approach was that strong inter-centromeric interactions should be discernible as highly localized regions with elevated rates of inter-chromosomal interactions compared to other regions within a given chromosome; however, such localized enrichment should be absent in species with low levels of inter-centromeric interactivity.
To calculate a metric that captures this localized enrichment signature in an unbiased fashion, we first used Juicer Tools to output Knight and Ruiz normalized interaction matrices with a 100 kb bin-size for all inter-chromosomal scaffold pairs (i.e., chr-1 interactions with chr-2 but not chr-1 interactions with chr-1) using the Juicer-derived .hic files we generated above for each cnidarian species of interest. This required the .genome files for the chromosome-only version of each assembly that were generated as part of the ACA (see above).
(03_compare/crossChromDump.sh)
    cd "$1"

    while read -ra array; do
        ar1+=("${array[0]}")
        ar2+=("${array[1]}")
    done < "$1".genome

    for i in "${ar1[@]}"; do
        for j in "${ar1[@]}"; do
            if [[ "$i" != "$j" ]]; then

                java -Xms512m -Xmx2048m -jar ../juicer_tools.jar \
                    dump observed KR inter.*hic "$i" "$j" \
                    BP 100000 > "$i"_"$j".txt
            fi
        done
    done

The above script was executed for each species as follows:

    ./crossChromDump.sh aep
    ./crossChromDump.sh amil
    ./crossChromDump.sh resc
    ./crossChromDump.sh dili
    ./crossChromDump.sh nvec200
    ./crossChromDump.sh hoct

After generating the interaction matrices, we then quantified the extent to which each chromosome in the assembly possessed a localized region with greatly elevated rates of inter-chromosomal interaction. To do this, we first calculated the median number of normalized inter-chromosomal contacts for each 100 kb bin along the length of each scaffold across all possible inter-chromosomal scaffold pairs. We then removed the top and bottom tenth of each chromosome (to remove the telomere interaction signal), and converted the remaining values into z-scores. The inter-centromeric interaction score for each chromosome was defined as the highest z-score value along the length of a given scaffold. We then plotted the distribution for this score across all species for all chromosomes.
(03_compare/chromCalcs.R)
    library(ggplot2)
    library(zoo)

    #####cross chrom cent search#####

    centCrossCheck <- function(specCheck,chrCheck){
      print(chrCheck)
      chrSize <- cSizes[cSizes$V1 == chrCheck,'V2']
      #load all inter-chromosomal matrices that involve the chromosome of interest
      crossChrs <- list.files(path=specCheck,pattern = paste0(chrCheck,'_.*txt'), full.names = T)
      crossChrs <- lapply(crossChrs, read.delim, header=F, skip=1)
      names(crossChrs) <- list.files(path=specCheck,pattern = paste0(chrCheck,'_.*txt'))
      #find the column holding bin coordinates for the chromosome of interest
      #(the column whose maximum value is closest to the chromosome length)
      colUse <- lapply(crossChrs, function(x) sapply(x, function(y) abs(max(y) - chrSize)))
      colUse <- sapply(colUse, function(x) which.min(x))
      #sum contacts within each bin of the chromosome of interest
      crossChrs.sp <- lapply(1:length(crossChrs), function(x) {
        split(crossChrs[[x]],crossChrs[[x]][,colUse[x]])
      })
      crossChrs.sp <- lapply(crossChrs.sp, function(x) sapply(x, function(y) sum(y$V3)))
      print(sapply(crossChrs.sp,length))
      crossChrs.df <- do.call(cbind,crossChrs.sp)
      crossChrs.df <- crossChrs.df[rowSums(is.na(crossChrs.df)) != ncol(crossChrs.df),]
      #median contact count per bin across all partner chromosomes
      crossChrs.ave <- apply(crossChrs.df,1,median, na.rm = T)
      #drop the first and last tenth of bins to exclude the telomere signal
      crossChrs.ave.trim <- crossChrs.ave[floor(length(crossChrs.ave)/10):floor(length(crossChrs.ave) - length(crossChrs.ave)/10)]
      #convert to z-scores and report the maximum value
      crossChrs.zs <- (crossChrs.ave.trim - mean(crossChrs.ave.trim))/sd(crossChrs.ave.trim)
      plot(1:length(crossChrs.ave.trim), crossChrs.zs, type = 'l')
      print(max(crossChrs.zs))
    }
specCheck <- 'aep'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
aepScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
specCheck <- 'dili'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
diliScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
specCheck <- 'hoct'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
hoctScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
specCheck <- 'nvec200'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
nemScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
specCheck <- 'resc'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
rescScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
specCheck <- 'amil'
cSizes <- read.delim(paste0(specCheck,'/', specCheck,'.genome'), header=F)
amilScores <- sapply(cSizes$V1, function(x) centCrossCheck(specCheck,x))
    plotDF <- data.frame(spec = rep(c('aep','dili','hoct','nem','resc','amil'),c(15,16,9,15,21,14)),
                         scores = c(aepScores,diliScores,hoctScores,nemScores,rescScores,amilScores))

    plotDF$spec <- factor(plotDF$spec, levels=c('aep', 'resc', 'hoct', 'nem', 'dili','amil'))

    ggplot(plotDF, aes(x = spec, y = scores, fill=spec)) + geom_violin() + geom_jitter(width = 0.2) + theme_bw()
    ggsave('interCentScores.pdf',width = 8, height = 4)
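The scoring step in chromCalcs.R can be illustrated on toy data. The awk sketch below takes one per-bin median contact value per line, drops roughly the first and last tenth of bins using the same index arithmetic as the R code, converts the remaining values to z-scores (sample standard deviation), and prints the maximum. The `max_trimmed_z` function and the input values are hypothetical; the sketch assumes at least two retained bins with non-identical values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one per-bin value per line from stdin and print
# the maximum z-score after trimming the outer tenths of the chromosome.
max_trimmed_z() {
    awk '
        { v[NR] = $1 }
        END {
            n = NR
            first = int(n / 10); if (first < 1) first = 1
            last = int(n - n / 10)
            m = 0; sum = 0; ss = 0
            for (i = first; i <= last; i++) { sum += v[i]; m++ }
            mean = sum / m
            for (i = first; i <= last; i++) ss += (v[i] - mean) ^ 2
            sd = sqrt(ss / (m - 1))          # sample SD, as in R sd()
            best = (v[first] - mean) / sd
            for (i = first; i <= last; i++) {
                z = (v[i] - mean) / sd
                if (z > best) best = z
            }
            printf "%.3f\n", best
        }'
}

# 20 bins with a single elevated bin in the chromosome interior
printf '%s\n' 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 | max_trimmed_z   # prints 3.881
```

Because the outer bins are discarded before scoring, a strong signal confined to the chromosome ends does not affect the result.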
This analysis suggested that the H. vulgaris assembly has markedly higher levels of inter-centromeric interaction than other cnidarian genomes. To determine if this difference was significant, we used Tukey's Honest Significant Difference method to perform a post-hoc significance test on an ANOVA calculated on all inter-centromeric contact scores for all species.
(03_compare/chromCalcs.R)
    #significance test
    cent.lm <- lm(scores ~ spec, data = plotDF)

    cent.av <- aov(cent.lm)

    tukey.test <- TukeyHSD(cent.av)

    tukey.test

This generated the following result:

      Tukey multiple comparisons of means
        95% family-wise confidence level

    Fit: aov(formula = cent.lm)

    $spec
                     diff         lwr       upr     p adj
    resc-aep  -10.1648942 -13.1385824 -7.191206 0.0000000
    hoct-aep   -9.4723729 -13.1812137 -5.763532 0.0000000
    nem-aep    -8.6903094 -11.9022598 -5.478359 0.0000000
    dili-aep   -9.5561526 -12.7175179 -6.394787 0.0000000
    amil-aep   -5.2912110  -8.5600144 -2.022408 0.0001324
    hoct-resc   0.6925213  -2.8120038  4.197046 0.9923268
    nem-resc    1.4745848  -1.4991034  4.448273 0.6988178
    dili-resc   0.6087416  -2.3102354  3.527719 0.9901689
    amil-resc   4.8736833   1.8386755  7.908691 0.0001529
    nem-hoct    0.7820635  -2.9267773  4.490904 0.9896589
    dili-hoct  -0.0837797  -3.7488998  3.581340 0.9999998
    amil-hoct   4.1811620   0.4229774  7.939346 0.0202573
    dili-nem   -0.8658432  -4.0272085  2.295522 0.9669727
    amil-nem    3.3990985   0.1302950  6.667902 0.0367072
    amil-dili   4.2649417   1.0458298  7.484054 0.0029111

This indicates that the AEP inter-centromeric interaction scores were significantly higher than those of all other cnidarian genomes considered in the analysis.
In this script, we also generated plots for the inter-telomeric interaction scores generated using the ACA method.
(03_compare/chromCalcs.R)
    #ACA telomere scores

    tScores <- data.frame(spec = c('aep','amil','dili','hoct','nem','resc'), score = c(1.618,1.588,1.283,1.809,1.141,1.183))

    tScores$spec <- factor(tScores$spec, levels=c('aep', 'resc', 'hoct', 'nem', 'dili','amil'))

    ggplot(tScores, aes(x=spec,y=score, fill=spec)) + geom_col() + theme_bw()
These results suggest that there is little variation in the levels of inter-telomeric interactions among cnidarians, and that H. vulgaris is not markedly different from the other species considered in this analysis.
xxxxxxxxxx09_3dChromatin/├── 01_mapping│ ├── aepHic.16k.coolcooler-formatted file containing chromatin contact frequency data for theAEP genome. Uses a 16 Kb bin size.│ ├── aepHic.8k.coolcooler-formatted file containing chromatin contact frequency data for theAEP genome. Uses an 8 Kb bin size.│ ├── correctionPlot16k.shShell script that generates a plot of the contact frequency distributionfor the AEP Hi-C data using a 16 Kb bin size. Used for tuning parameterswhen performing normalization.│ ├── correctionPlot8k.shShell script that generates a plot of the contact frequency distributionfor the AEP Hi-C data using a 8 Kb bin size. Used for tuning parameterswhen performing normalization.│ ├── getCutsFinal.shShell script that predicts cut sites in the AEP assembly for the restrictionenzymes used to generate the AEP Hi-C library, which is required for runningthe Juicer mapping pipeline.│ ├── hic_corrected16k.coolcooler-formatted file containing normalized chromatin contact frequency datafor the AEP genome. Uses a 16 Kb bin size.│ ├── hic_corrected8k.coolcooler-formatted file containing normalized chromatin contact frequency datafor the AEP genome. Uses an 8 Kb bin size.│ ├── inter_30.hicJuicer-formatted file containing chromatin contact frequency data for theAEP genome. This file caculates contact frequency only using reads with aMAPQ of 30 or greater.│ ├── inter.hicJuicer-formatted file containing chromatin contact frequency data for theAEP genome.│ ├── juiceOut2Pairs.shShell script that converts the Juicer-formatted Hi-C mapped read file intothe pairix format. 
Needed to use Juicer-mapped data with HiCExplorer.│ ├── pair2Cool16k.shShell script that generates a .cool file of chromatin contact frequencyfrom a pairix file using a 16 Kb bin size.│ ├── pair2Cool8k.shShell script that generates a .cool file of chromatin contact frequencyfrom a pairix file using a 8 Kb bin size.│ ├── runCorrection16k.shShell script that normalizes the chromatin contact frequency data inaepHic.16k.cool. Generates hic_corrected16k.cool.│ ├── runCorrection8k.shShell script that normalizes the chromatin contact frequency data inaepHic.8k.cool. Generates hic_corrected8k.cool.│ └── runJuicerfinal.shShell script that uses the Juicer alignment pipeline to map the AEP Hi-Cdata to the final AEP genome assembly.├── 02_domains│ ├── aep16k_boundaries.sorted.bedCoordinate sorted bed file containing the locations of all chromatincontact domain boundaries predicted by findTadsAep16k.sh.│ ├── aep16k_domains.bedBed file containing the locations of all chromatin contact domainspredicted by findTadsAep16k.sh.│ ├── aep16k_score.bedgraphBedgraph file containing insulation scores for the AEP genome. The insulationscore is a sliding window measure of chromatin contact frequency. 
Low/negativeinsulation scores are distinctive of contact domain boundaries.│ ├── boundaryCor.RR script that uses the Hydra single cell atlas to compare the expression patternsof consecutive gene pairs in the AEP genome that either fall within a singlechromatin contact domain (intra-domain pairs) or span two contact domains (inter-domain pairs) to determine if contact domains in Hydra influence gene expression.│ ├── calcTADMat_hetCon.shShell script that uses the deeptools computeMatrix function to characterizesequence conservation, repeat density, ATAC-seq, H3K4me1, H3K4me3, and H3K27me3distribution around chromatin contact domains.│ ├── findTadsAep16k.shShell script that uses the HiCExplorer hicFindTADs function to predict chromatincontact domain boundaries using a 16 Kb bin size.│ ├── genesCloseTads.bedBed genome coordinate file that includes the nearest chromatin contact domainboundary for each AEP gene model.│ ├── getCloseTads.shShell script that uses bedtools to identify the nearest chromatin contact domainboundary to each AEP gene model.│ ├── HVAEP1.genes.sorted.bedBed genome coordinate file that includes the coordinates for all AEP gene models.│ ├── makeTadPlot.shShell script that generates a reprentative plot of chromatin contact frequency,domain predictions, and insulation scores.│ └── ploTADHeat_hetCon.shShell script that plots the results of calcBoundMat_hetCon.sh using the deeptoolsplotHeatmap function to show trends in sequence conservation, repeat density,ATAC-seq, H3K4me1, H3K4me3, and H3K27me3 around predicted chromatin contact domains.└── 03_compare├── ACAscore.outOutput file containing the four ACA scores for six different cnidarian species.Generated by getPseudoAcaScores.sh. Note that the two metrics quantifyingcentromeric interactions (the second and third values on each line) are invalid,as false centromere coordinates were used for this analysis.├── Nvec200.chroms.genome.aca.hicACA interaction map for N. vectensis. Generated by runPseudoACA.sh. 
                In part, serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── aep
                Directory used for generating inter-chromosomal interaction matrices for the
                strain AEP H. vulgaris genome assembly.
    │   ├── aep.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    │   └── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies. Generated
                by runJuicerfinal.sh.
    ├── aep.chroms.genome.aca.hic
                ACA interaction map for H. vulgaris, strain AEP. Generated by runPseudoACA.sh.
                In part, serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── amil
                Directory used for generating inter-chromosomal interaction matrices for the
                A. millepora genome assembly.
    │   ├── amil.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    │   └── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies. Generated
                by runJuicerAmil.sh.
    ├── amil.chroms.genome.aca.hic
                ACA interaction map for A. millepora. Generated by runPseudoACA.sh. In part,
                serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── chromCalcs.R
                R script that uses the normalized interaction frequency matrices generated by
                crossChromDump.sh to calculate inter-centromeric interaction metrics. Also
                includes a statistical comparison of the results across different cnidarian
                species and generates plots.
    ├── crossChromDump.sh
                Shell script that outputs Knight and Ruiz normalized interaction matrices
                for all inter-chromosomal scaffold pairs for a species of interest using a
                100 kb window size. The matrix for each chromosome pair is written as a
                separate text file.
    ├── dili
                Directory used for generating inter-chromosomal interaction matrices for the
                D. lineata genome assembly.
    │   ├── dili.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    │   └── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies.
                Generated by runJuicerDili.sh.
    ├── dili.chroms.genome.aca.hic
                ACA interaction map for D. lineata. Generated by runPseudoACA.sh. In part,
                serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── getCutsAmil.sh
                Shell script that uses the generate_site_positions.py script provided with
                the Juicer pipeline to predict MboI cut sites in the A. millepora genome
                assembly.
    ├── getCutsDili.sh
                Shell script that uses the generate_site_positions.py script provided with
                the Juicer pipeline to predict cut sites generated by the Arima Hi-C kit in
                the D. lineata genome assembly.
    ├── getCutsHoct.sh
                Shell script that uses the generate_site_positions.py script provided with
                the Juicer pipeline to predict cut sites generated by the Arima Hi-C kit in
                the H. octoradiatus genome assembly.
    ├── getCutsNem.sh
                Shell script that uses the generate_site_positions.py script provided with
                the Juicer pipeline to predict DpnII cut sites in the N. vectensis genome
                assembly.
    ├── getCutsResc.sh
                Shell script that uses the generate_site_positions.py script provided with
                the Juicer pipeline to predict DpnII cut sites in the R. esculentum genome
                assembly.
    ├── getPseudoAcaScores.sh
                Shell script that runs the score-aca.sh script provided with the 3d-dna package
                on the .hic files generated by the runPseudoACA.sh script. ACA scores are written
                to the ACAscore.out file.
    ├── hoct
                Directory used for generating inter-chromosomal interaction matrices for the
                H. octoradiatus genome assembly.
    │   ├── hoct.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    │   └── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies. Generated
                by runJuicerHoct.sh.
    ├── hoct.chroms.genome.aca.hic
                ACA interaction map for H. octoradiatus. Generated by runPseudoACA.sh. In part,
                serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── nvec200
                Directory used for generating inter-chromosomal interaction matrices for the N.
                vectensis genome assembly.
    │   ├── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies. Generated
                by runJuicerNem.sh.
    │   └── nvec200.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    ├── pseudoACA.sh
                Shell script that first generates pseudo-centromere coordinates and then uses
                those coordinates for ACA by calling the build-aca-hic.sh script provided with
                the 3d-dna package. Uses the merged_nodups.txt file generated by the Juicer
                pipeline.
    ├── resc
                Directory used for generating inter-chromosomal interaction matrices for the
                R. esculentum genome assembly.
    │   ├── inter.hic
                Binary file containing the Juicer-derived Hi-C interaction frequencies. Generated
                by runJuicerResc.sh.
    │   └── resc.genome
                Text file containing the lengths of each chromosomal scaffold in the reference
                assembly.
    ├── resc.chroms.genome.aca.hic
                ACA interaction map for R. esculentum. Generated by runPseudoACA.sh. In part,
                serves as the basis for the ACA metrics generated by getPseudoAcaScores.sh.
    ├── runJuicerAmil.sh
                Shell script that runs the Juicer pipeline on the A. millepora genome assembly.
    ├── runJuicerDili.sh
                Shell script that runs the Juicer pipeline on the D. lineata genome assembly.
    ├── runJuicerHoct.sh
                Shell script that runs the Juicer pipeline on the H. octoradiatus genome assembly.
    ├── runJuicerNem.sh
                Shell script that runs the Juicer pipeline on the N. vectensis genome assembly.
    ├── runJuicerResc.sh
                Shell script that runs the Juicer pipeline on the R. esculentum genome assembly.
    └── runPseudoACA.sh
                Wrapper shell script that runs the pseudoACA.sh script on six different cnidarian
                genome assemblies.