Identifying Conserved Regulators in Clytia and Hydra

This document covers our analysis aimed at identifying conserved regulators of cell-type-specific transcription in Clytia and Hydra. This required that we first perform a motif enrichment analysis on co-expressed genes in the Clytia atlas. For this part of the analysis, we used largely the same approach that we used for the Hydra single-cell atlas, although we had to make some modifications due to the lack of cis-regulatory element annotations for Clytia. After characterizing motif enrichment patterns in the Clytia atlas, we compared results across the Clytia and Hydra single-cell datasets to identify instances where both the expression of a transcription factor and the enrichment pattern of its motif were conserved in the two species.

Gene Co-Expression Analysis Using Non-Negative Matrix Factorization (NMF)

Our first goal was to identify gene co-expression modules in the Clytia single-cell atlas using non-negative matrix factorization (NMF). For a description of the basic principles of NMF and our approach to applying it to single-cell expression data, see 05_hydraAtlasReMap.md.

We first imported the Seurat object containing the re-mapped Clytia single-cell data (generated in 11_clytiaAtlasReMap.md) into R and then exported both the raw and normalized read count matrices to tsv files, along with a list of the variable genes used for the Seurat clustering.

(01_clNMF/clNMF.R)

We then set up a cNMF run to do a parameter sweep to determine the number of metagenes (k) to use:

(01_clNMF/runPrep.sh)

We then ran the primary factorization step of the cNMF analysis:

(01_clNMF/runFactorize.sh)

We pooled the resulting analysis files:

(01_clNMF/runCombine.sh)

We then visualized stability and error metrics to select the approximate k value to use:

(01_clNMF/runKselect.sh)

clCourseKselect

Based on this plot, we selected a k range of 35 to 45. We then performed a finer-grained parameter sweep to identify the 'best' k value within that range.

We re-prepped the analysis for the new range of k values:

(01_clNMF/runPrep2.sh)

We then ran the factorization step:

(01_clNMF/runFactorize2.sh)

We combined the resulting files:

(01_clNMF/runCombine2.sh)

We then visualized the stability and error metrics:

(01_clNMF/runKselect2.sh)

clFineKselect

Based on these results, we selected 37 as the number of metagenes to use in downstream analyses.

We then had to generate consensus results from the 200 replicates run for the 37-metagene analysis. We initially screened the similarity of the different runs using a permissive threshold setting (we retained runs with distances < 2.0) for the consensus function.

(01_clNMF/runConseunsus2.sh)

consensusInit

Based on the distance distribution, we set the cutoff to 0.13 to remove outliers:

(01_clNMF/runConseunsus2.sh)

clFinalConsensus
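The outlier filtering step can be illustrated with a minimal sketch. It assumes that each replicate spectrum is scored by its mean distance to its nearest neighbors and dropped when that score exceeds the cutoff; this is our understanding of the general idea and not a reproduction of the cNMF code. The function name, the `n_neighbors` parameter, and the toy data are hypothetical.

```python
import numpy as np

def filter_outlier_spectra(spectra, n_neighbors, density_threshold):
    """Keep rows of `spectra` whose mean distance to their nearest
    neighbors is below `density_threshold` (hypothetical helper)."""
    # L2-normalize each spectrum so distances are comparable across replicates.
    unit = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    # All pairwise Euclidean distances (fine for toy data, memory-hungry at scale).
    diffs = unit[:, None, :] - unit[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Sort each row; column 0 is the zero self-distance, so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:n_neighbors + 1]
    local_density = nearest.mean(axis=1)
    keep = local_density < density_threshold
    return spectra[keep], local_density

# Toy data: 20 near-identical spectra plus one clear outlier.
rng = np.random.default_rng(0)
base = np.array([1.0, 0.0, 0.0, 0.0])
cluster = np.abs(base + rng.normal(scale=0.01, size=(20, 4)))
outlier = np.array([[0.0, 0.0, 1.0, 1.0]])
toy = np.vstack([cluster, outlier])

kept, density = filter_outlier_spectra(toy, n_neighbors=5, density_threshold=0.13)
print(kept.shape[0], "spectra retained out of", toy.shape[0])
```

With a permissive threshold such as 2.0 every spectrum passes, which is why the initial run is useful only for inspecting the distance distribution before choosing a stricter cutoff.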

To link the gene co-expression programs from NMF to potential cell types and functions, we plotted the cell scores for each metagene on the UMAP from the original Clytia atlas publication (using a Seurat object generated in 11_clytiaAtlasReMap.md).

(01_clNMF/visNMF.R)

clMgUsage

Analysis of Motif Enrichment

To use the NMF results for an enrichment analysis, we needed a way of isolating the regulatory sequences associated with the genes belonging to each metagene; however, we did not have access to ATAC-seq or histone modification data for the Clytia genome. In addition, because there was little to no conservation in non-coding regions when comparing the Clytia genome to various Hydra genomes, we couldn't use phylogenetic footprinting to identify putative transcription factor binding sites. We therefore selected putative regulatory sequences by simply using the 1000 bp of sequence upstream of each gene's transcription start site (TSS), since this region is very likely to contain at least some sequence with regulatory function.

Our approach for identifying enriched motifs in Hydra was based on a gene set enrichment analysis framework; however, this method relies on relatively low false positive rates for binding site predictions, which we didn't have for the Clytia genome. We therefore used the Analysis of Motif Enrichment (AME) pipeline from the MEME Suite of software tools to perform a more conventional motif enrichment analysis.

AME requires a list of sequences that have been scored in a way that assigns 'positive' sequences (e.g., genes strongly associated with a particular metagene) a low score and 'negative' sequences a high score. To generate scores that fit this criterion, we simply reversed the sign of the metagene gene scores, which use positive values to indicate a strong association between a gene and a metagene.

In the following R script, we used the Clytia gene models (generated in 11_clytiaAtlasReMap.md) to determine the coordinates of the 1000 bp regions just upstream of each TSS. Then, for each metagene in the Clytia NMF analysis, we assigned each putative promoter a score based on how strongly its nearby gene was associated with the metagene of interest. These results were then exported as bed files (titled mg#_ScoredProms.bed, where the number corresponds to the metagene used to assign the gene scores for that particular bed file).

(02_clEnrichment/enrichPrep.R)
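As an illustration of the two ideas in this step (strand-aware promoter coordinates and sign-reversed scores), here is a minimal sketch. It is not the code in enrichPrep.R; the function name, its arguments, and the six-column BED layout are assumptions made for the example.

```python
def promoter_bed_line(chrom, start, end, strand, gene_id, score,
                      chrom_len, width=1000):
    """Return a BED line for the `width` bp upstream of a gene's TSS.

    `start`/`end` are 0-based, half-open gene coordinates (BED convention).
    Hypothetical helper for illustration only.
    """
    if strand == "+":
        # The TSS is at `start`, so the promoter ends where the gene begins.
        p_start, p_end = max(0, start - width), start
    else:
        # On the minus strand the TSS is at `end`; upstream lies to the right.
        p_start, p_end = end, min(chrom_len, end + width)
    # AME treats low scores as 'positive', so flip the sign of the NMF gene score.
    ame_score = -score
    return f"{chrom}\t{p_start}\t{p_end}\t{gene_id}\t{ame_score}\t{strand}"

plus = promoter_bed_line("chr1", 5000, 8000, "+", "geneA", 2.5, 100000)
minus = promoter_bed_line("chr1", 5000, 8000, "-", "geneB", 0.5, 8400)
print(plus)
print(minus)
```

Note that promoters near the end of a contig are truncated by the clamping (the second example yields a 400 bp region), so a real pipeline may want to drop very short intervals.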

The bed files were then used to extract FASTA sequences from the Clytia genome for all putative promoter sequences. The headers of these FASTA files also contained the metagene scores for the corresponding gene models.

(02_clEnrichment/bedToFasta.sh)

In addition to the FASTA sequences generated above, we also needed a background file that specified the nucleotide frequencies in the Clytia genome. We generated it using the following command:

fasta-get-markov clytiaG.fa > clytiaBG.txt
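To make clear what kind of information such a background file holds, the sketch below computes simple single-nucleotide frequencies from FASTA text. It is only a rough, zero-order analogue written for illustration; the actual tool may treat strands, ambiguous bases, and model order differently. The function name and the toy input are hypothetical.

```python
from collections import Counter
import io

def nucleotide_frequencies(fasta_handle):
    """Return the fraction of A, C, G, and T across all sequences,
    ignoring header lines and ambiguous bases such as N."""
    counts = Counter()
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA headers
        counts.update(base for base in line.strip().upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

toy_fasta = io.StringIO(">contig1\nACGTAC\n>contig2\nGGNNTA\n")
freqs = nucleotide_frequencies(toy_fasta)
print(freqs)
```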

We then ran AME on each of our metagene-specific promoter FASTA files to get motif enrichment results for each metagene. We used the pooledJasparNR.meme.txt motif database generated as part of the analysis described in 07_genomeConservation.md.

(02_clEnrichment/runAme.sh)

Using the following R script, we pooled all the individual AME output files into a single enrichment results table. We then calculated a fold-enrichment score for each significant enrichment result (we used an E-value threshold of 10) by dividing the percentage of positive (i.e., strongly metagene-associated) genes that contained the target motif by the percentage of negative genes that contained the target motif. We then mapped these fold-enrichment scores onto the Clytia single-cell atlas by calculating, for each motif in each cell, a weighted average of the fold-enrichment values, using the NMF metagene cell scores as weights. These single-cell enrichment scores were then used in a subsequent analysis to identify conserved motif enrichment patterns in Clytia and Hydra. Finally, we generated a motif-by-metagene heat map of enrichment scores to summarize the results.

(02_clEnrichment/visClEn.R)

clEnHM
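The fold-enrichment calculation and the projection onto single cells can be sketched as follows. This is not the code in visClEn.R: the function names are hypothetical, and two details are assumptions that the text does not specify, namely that non-significant motif/metagene pairs are set to 0 and that each cell's metagene scores are normalized to sum to 1 before weighting.

```python
import numpy as np

def fold_enrichment(pos_with_motif, n_pos, neg_with_motif, n_neg):
    """Percentage of positive sequences with the motif divided by the
    percentage of negative sequences with the motif."""
    pct_pos = 100.0 * pos_with_motif / n_pos
    pct_neg = 100.0 * neg_with_motif / n_neg
    return pct_pos / pct_neg

def cell_enrichment(fold_by_metagene, cell_scores):
    """Weighted average of fold-enrichment values for every cell.

    fold_by_metagene: motifs x metagenes array of fold-enrichment values.
    cell_scores:      cells x metagenes array of NMF cell scores.
    Returns a cells x motifs array.
    """
    # Assumption: weights are each cell's metagene scores rescaled to sum to 1.
    weights = cell_scores / cell_scores.sum(axis=1, keepdims=True)
    return weights @ fold_by_metagene.T

# 30 of 100 positive promoters vs. 50 of 1000 negative promoters carry the motif.
fe = fold_enrichment(30, 100, 50, 1000)

# Two motifs, two metagenes; zeros stand in for non-significant results (assumption).
fold = np.array([[fe, 0.0],
                 [0.0, 2.0]])
# Cell 1 uses only metagene 1; cell 2 uses both metagenes equally.
scores = np.array([[1.0, 0.0],
                   [1.0, 1.0]])
per_cell = cell_enrichment(fold, scores)
print(fe)
print(per_cell)
```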

Comparing Motif Enrichment and Transcription Factor Expression Conservation in Clytia and Hydra

Analysis Using Bona Fide Motif Sequences

We next wanted to compare the motif enrichment results for Clytia and Hydra using our aligned single-cell atlas. To do this, we adopted a strategy that was similar to the method we used to identify genes with conserved expression patterns (described in 12_crossSpeciesAtlasAlignment.md).

We had already generated single-cell motif enrichment scores for each species, so to identify the motifs that had similar enrichment patterns in the aligned principal component space, we generated pseudo-cells using a high-resolution Louvain clustering analysis. We could then use these pseudo-cells to group small sets of cells from each species together and thus identify motifs with similar pseudo-cell enrichment patterns in the two species.

To begin, we imported the single-cell enrichment scores for Clytia and Hydra along with the aligned cross-species atlas. We then generated pseudo-cells using the Seurat implementation of the Louvain clustering algorithm.

(03_crossCompare/enrichComp.R)

motCorPseudoCell

We next calculated the average enrichment score for each motif in our results matrices, grouping cells first by species then by pseudo-cell. We then calculated a correlation score to compare enrichment patterns across pseudo-cells in the two species.
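A minimal sketch of this averaging and comparison step is given below. It is not the code in enrichComp.R. It assumes that every pseudo-cell contains cells from both species, that both score matrices share the same motif columns, and that the correlation score is a Pearson correlation (the text does not name the statistic). All function names and toy values are hypothetical.

```python
import numpy as np

def pseudocell_means(scores, labels, all_labels):
    """Average a cells x motifs matrix within each pseudo-cell.
    Returns a pseudo-cells x motifs array, ordered as in `all_labels`."""
    return np.array([scores[labels == lab].mean(axis=0) for lab in all_labels])

def motif_correlations(means_a, means_b):
    """Pearson correlation across pseudo-cells, computed per motif."""
    return np.array([
        np.corrcoef(means_a[:, j], means_b[:, j])[0, 1]
        for j in range(means_a.shape[1])
    ])

# Toy enrichment scores: rows are cells, columns are two motifs.
hydra_scores = np.array([[1, 5], [3, 5], [4, 2], [6, 2], [9, 1], [9, 3]], dtype=float)
hydra_labels = np.array([0, 0, 1, 1, 2, 2])
clytia_scores = np.array([[1, 0], [2, 3], [4, 5], [6, 9], [8, 9], [10, 9]], dtype=float)
clytia_labels = np.array([0, 1, 1, 2, 2, 2])

pseudocells = np.array([0, 1, 2])
hydra_means = pseudocell_means(hydra_scores, hydra_labels, pseudocells)
clytia_means = pseudocell_means(clytia_scores, clytia_labels, pseudocells)
cors = motif_correlations(hydra_means, clytia_means)
print(cors)
```

In this toy case the first motif has a similar pattern across pseudo-cells in both species (high positive correlation), whereas the second motif's pattern is reversed (negative correlation).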

In order to identify putative conserved regulators, we manually reviewed the results contained in the crossMotCor object from this analysis (which contained correlation values indicating the degree of similarity in motif enrichment patterns in the two species), as well as the transcription factor expression conservation results described in 12_crossSpeciesAtlasAlignment.md. This revealed the conserved regulators presented in figure 4. We used the following code to visualize these conserved enrichment and expression patterns in the aligned cross-species atlas (here we show example plots for ebf):

(EBF3 motif enrichment for Hydra)

enPlot_EBF3_aep

(EBF3 motif enrichment for Clytia)

enPlot_EBF3_cl

(ebf expression in Hydra)

ebfTran_aep

(ebf expression in Clytia)

ebfTran_cl

Analysis Using Shuffled Motif Sequences

To determine if the similarities in motif enrichment patterns we observed in the two species were greater than would be expected based on chance, we repeated our enrichment comparison analysis using results based on shuffled versions of each transcription factor binding motif.

This required that we re-run the AME analysis using shuffled versions of each motif in our database (the generation of these shuffled motifs is described in 07_genomeConservation.md).

(02_clEnrichment/runAmeShuf.sh)

We then transferred these enrichment results onto the Clytia single-cell atlas by generating single-cell enrichment scores for each motif. As with the Hydra enrichment analysis (described in 10_hydraRegulators.md), we found that shuffling the motifs completely changed the resulting enrichment patterns, demonstrating that our enrichment results were not primarily being driven by sequence bias artifacts.

clMotifHeatmapShuf

We then looked for similarities in the enrichment patterns of these shuffled motifs across the Hydra and Clytia atlases. We found that no shuffled motifs showed signs of conservation (defined as having a correlation score > 0.5). This suggests that the similarities in motif enrichment across homologous cell types that we observed in our analysis using bona fide motif sequences are unlikely to have been driven purely by chance.

(03_crossCompare/enrichCompShuf.R)

Files Associated with This Document