Identifying Candidate Regulators of Cell-Type-Specific Transcription in Hydra

This document covers our analysis aimed at identifying transcriptional regulators of gene co-expression modules in the Hydra single-cell atlas. This entailed integrating our non-negative matrix factorization (NMF) results (described in 05_hydraAtlasReMap.md) with our transcription factor binding site conservation analysis (described in 07_genomeConservation.md) to identify motifs enriched in the putative regulatory sequences of sets of co-expressed genes (i.e., metagenes). Following the motif enrichment analysis, we then looked for transcription factors whose expression patterns and sequence binding preferences correlated with motif enrichment patterns.

Preparatory Motif and Transcription Factor Analyses

Prior to the main analysis, several supplemental files needed to be generated. These included annotations linking conserved motifs to their most proximal gene, information on motif sequence similarity (to reduce redundancy in the enrichment results), the names of the transcription factors associated with each binding motif, and predictions of which gene models in the AEP assembly are transcription factors based on their Pfam domain annotations.

Linking Predicted Transcription Factor Sites to Their Nearest Genes

First, we used UROPA to link conserved transcription factor binding sites to their putative target genes based on proximity. Specifically, we linked a motif to the nearest transcription start site as long as it fell within 30 Kb. We did this for both the binding site predictions based on the bona fide JASPAR motif sequences as well as the predictions based on shuffled motif sequences (both generated in 07_genomeConservation.md).
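As a rough illustration of the proximity rule described above (this is not UROPA itself, and the function name, coordinates, and gene IDs are hypothetical), the following Python sketch assigns a motif to the nearest transcription start site on a single chromosome and discards the link if the distance exceeds 30 Kb. It assumes that distance is measured from the motif midpoint and that the cutoff is inclusive; neither detail is specified here.

```python
from bisect import bisect_left

MAX_DIST = 30_000  # 30 Kb cutoff from the text


def nearest_gene(motif_mid, tss_positions, tss_genes, max_dist=MAX_DIST):
    """Return the gene whose TSS is closest to motif_mid, or None if too far.

    tss_positions must be sorted ascending; tss_genes is in the same order.
    """
    i = bisect_left(tss_positions, motif_mid)
    # Only the TSS just left and just right of the motif can be the nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(tss_positions[j] - motif_mid))
    if abs(tss_positions[best] - motif_mid) <= max_dist:
        return tss_genes[best]
    return None


# Toy single-chromosome example (hypothetical coordinates and gene IDs).
tss_positions = [10_000, 80_000, 200_000]
tss_genes = ["geneA", "geneB", "geneC"]
print(nearest_gene(35_000, tss_positions, tss_genes))   # 25 Kb from geneA
print(nearest_gene(140_000, tss_positions, tss_genes))  # 60 Kb from both neighbors
```

In this toy case the first motif is linked to geneA, while the second is left unassigned because no TSS lies within the cutoff.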

We used the following config file to annotate the binding site predictions for the bona fide motif sequences (conMotsATAC.bed):

(01_motifPrep/conMotATACAnnot.json)

We used the following config file to annotate the binding site predictions for the shuffled motif sequences (conShufMotsATAC.bed):

(01_motifPrep/conShufMotATACAnnot.json)

We then ran the two annotation analyses with the following commands:

Clustering Motifs to Reduce Redundancy

Closely related transcription factors that use the same DNA binding domain tend to have highly similar (or virtually identical) binding preferences. Thus, motif enrichment analyses tend to have many redundant motifs in their results. To identify redundant motifs in our compiled JASPAR database, we used two clustering analyses: one that grouped motifs purely based on their sequence composition and another that grouped motifs based on their enrichment patterns in the single-cell atlas. In this section, we describe the former clustering approach (the latter approach is described later).

To perform the sequence-based clustering, we used the compareMotifs.pl utility script provided as part of the HOMER suite of motif enrichment tools. In order to use this script, we needed to reformat our JASPAR-formatted motifs into the HOMER motif format. We did this using a utility script also provided as part of HOMER:

We then used compareMotifs.pl to generate a matrix of pairwise similarity scores for all motifs in our custom JASPAR database:

(01_motifPrep/homerMotCompare.sh)

Within a custom R script, we used these similarity scores for a hierarchical clustering analysis that grouped together motifs with similar sequence composition. These results were later combined with a second hierarchical clustering analysis (described in the "Single-Cell Motif Enrichment Analysis" section) to arrive at the final motif cluster assignments.
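To make the idea of clustering on pairwise similarity scores concrete, here is a minimal Python sketch. It is not the code in clusterMots.R: the average-linkage rule, the similarity cutoff, and the toy motif names and scores are all assumptions chosen for illustration.

```python
def cluster_by_similarity(names, sim, min_sim):
    """Greedy average-linkage agglomerative clustering.

    Repeatedly merges the two clusters with the highest mean pairwise
    similarity until no pair of clusters reaches min_sim.
    """
    clusters = [[i] for i in range(len(names))]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = max(
            (avg_sim(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_score < min_sim:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(names[i] for i in c) for c in clusters]


# Toy symmetric similarity matrix (hypothetical motifs and scores).
names = ["m1", "m2", "m3", "m4"]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
motif_clusters = cluster_by_similarity(names, sim, min_sim=0.6)
print(motif_clusters)
```

With these toy scores, m1 and m2 end up in one cluster and m3 and m4 in another.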

(01_motifPrep/clusterMots.R)

Below is the tree generated by the hierarchical clustering analysis used to generate the motif_clusters.csv file:

motClust

Annotate Transcription Factors in the Genome Gene Models

Ultimately, the goal of the analysis was to assign transcription factors as candidate regulators of co-expressed genes. This required that we determine which gene models in the AEP assembly are likely to be transcription factors. To do this, we made use of our InterProScan results (described in 03_aepGenomeAnnotation.md). Specifically, we used a manually curated list of protein domains and gene ontology terms to generate a candidate list of transcription factors.
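The filtering logic can be sketched as follows in Python. The curated lists used for the actual analysis are not reproduced here; the domain names, GO terms, and gene IDs below are placeholders, and treating a match to either list as sufficient is an assumption.

```python
# Hypothetical curated lists, for illustration only.
TF_DOMAINS = {"Homeodomain", "HLH", "zf-C2H2"}
TF_GO_TERMS = {"GO:0003700"}  # DNA-binding transcription factor activity


def is_candidate_tf(domains, go_terms):
    """Flag a gene model if it carries any curated domain or GO term."""
    return bool(TF_DOMAINS & set(domains)) or bool(TF_GO_TERMS & set(go_terms))


# Toy InterProScan-style annotations: gene -> (domains, GO terms).
annotations = {
    "gene1": (["Homeodomain"], []),
    "gene2": (["Pkinase"], ["GO:0004672"]),
    "gene3": ([], ["GO:0003700"]),
}
candidate_tfs = sorted(
    gene for gene, (doms, gos) in annotations.items() if is_candidate_tf(doms, gos)
)
print(candidate_tfs)
```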

(01_motifPrep/tfList.R)

Based on this analysis, we identified a total of 811 candidate transcription factors in the AEP gene models.

Download Motif Metadata

In order to determine which transcription factors could bind which JASPAR motifs, we made use of the UniProt protein domain annotations available for each JASPAR motif. JASPAR provides UniProt IDs for each motif in its database, and each UniProt entry contains Pfam domain annotations. This allowed us to determine the Pfam domains associated with each motif in our motif database. These domains could then be linked to the domain predictions from our InterProScan analysis to associate specific gene models with JASPAR motifs. These gene/motif links provided the basis for predicting regulators of gene co-expression.

To access the metadata for each JASPAR motif, we first had to compile a list of all JASPAR motif IDs of interest:

We then used both the JASPAR and UniProt REST APIs to download the Pfam domains associated with each JASPAR motif. We then cross-referenced these domains with the domain composition of our list of predicted transcription factor gene models to generate lists of genes that could plausibly bind to each binding motif.
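The cross-referencing step amounts to a set intersection between the domains attached to each motif and the domains found in each gene model. The Python sketch below illustrates this with made-up inputs; it skips the REST API calls entirely, and the rule that a single shared domain is enough to link a gene to a motif is an assumption.

```python
def link_motifs_to_genes(motif_domains, gene_domains):
    """Map each motif to the genes that share at least one Pfam domain with it."""
    links = {}
    for motif, m_doms in motif_domains.items():
        links[motif] = sorted(
            gene for gene, g_doms in gene_domains.items() if m_doms & g_doms
        )
    return links


# Toy inputs (hypothetical motif IDs, gene IDs, and domain assignments).
motif_domains = {"m1": {"Homeodomain"}, "m2": {"HLH"}}
gene_domains = {
    "gene1": {"Homeodomain"},
    "gene2": {"HLH", "PAS"},
    "gene3": {"Pkinase"},
}
links = link_motifs_to_genes(motif_domains, gene_domains)
print(links)
```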

(01_motifPrep/protDBLink.R)

Single-Cell Motif Enrichment Analysis

Analysis Using Bona Fide Binding Motifs

We performed the main motif enrichment and gene co-expression regulator predictions within a single script (hydraRegulators.R). We have opted to break this script into smaller code blocks in this document to facilitate explanation.

Performing Gene Set Enrichment Analyses

After the initial steps of setting up the environment for the analysis, our first step was to generate gene sets. These gene sets, which group genes by the motifs present in their presumptive regulatory regions, served as the basis for a gene set enrichment analysis (GSEA; described below). Essentially, the GSEA determined whether the presence of a particular binding motif in a gene's regulatory sequence was positively correlated with the gene having a higher NMF gene score for a particular metagene (the metagene gene score reflects how well a gene's expression pattern mimics the metagene expression pattern).

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

To perform the motif enrichment analysis, we used the fgseaMultilevel function from the fgsea package. This function requires a list of gene sets and a named vector of scores. For our analysis, the gene sets list was made up of vectors of gene names, with the name of each vector being the motif ID associated with the gene IDs contained in that vector. The named vector of scores corresponds to the gene scores for a specific metagene.
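The two inputs described above can be pictured with plain Python data structures. This is only an illustration of their shape, not the R code from hydraRegulators.R; the motif IDs, gene IDs, and scores are invented.

```python
from collections import defaultdict


def build_gene_sets(motif_gene_pairs):
    """Group target genes by the motif found in their regulatory region."""
    gene_sets = defaultdict(set)
    for motif_id, gene_id in motif_gene_pairs:
        gene_sets[motif_id].add(gene_id)
    return {motif: sorted(genes) for motif, genes in gene_sets.items()}


# Toy motif-to-gene links; duplicate links collapse to a single entry.
pairs = [("motifA", "g1"), ("motifA", "g2"), ("motifB", "g2"), ("motifA", "g1")]
gene_sets = build_gene_sets(pairs)

# The equivalent of the named score vector: gene ID -> gene score for one metagene.
metagene_scores = {"g1": 0.8, "g2": 0.1, "g3": 0.0}
print(gene_sets)
```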

For our analysis, we performed GSEA iteratively for each of the 56 metagenes we identified in the Hydra single-cell atlas. We then dropped (i.e., converted to 0) any enrichment score that failed to pass our significance cutoff (adjusted p-value ≤ 0.01). We then combined all the enrichment results into a single data frame.
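The thresholding and merging step can be sketched in Python as below. The nested-dictionary layout, the motif and metagene labels, and the scores are hypothetical, and the sketch does not say which fgsea output column served as the enrichment score.

```python
PADJ_CUTOFF = 0.01  # significance cutoff from the text


def combine_enrichment(results, cutoff=PADJ_CUTOFF):
    """Merge per-metagene results into one motif-by-metagene table.

    results: {metagene: {motif: (enrichment_score, adjusted_p)}}
    Scores whose adjusted p-value exceeds the cutoff are set to 0.
    """
    table = {}
    for metagene, motif_stats in results.items():
        for motif, (score, padj) in motif_stats.items():
            table.setdefault(motif, {})[metagene] = score if padj <= cutoff else 0.0
    return table


# Toy results for two metagenes and two motifs (made-up values).
results = {
    "mg1": {"motifA": (2.1, 0.001), "motifB": (1.4, 0.20)},
    "mg2": {"motifA": (-0.3, 0.90), "motifB": (1.9, 0.01)},
}
combined = combine_enrichment(results)
print(combined)
```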

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

Generating Single-Cell Motif Enrichment Scores

To map our enrichment results onto our Seurat object (for visualization and to correlate the results with gene expression data), we used the NMF cell scores. NMF cell scores are weights that specify how strongly each metagene contributes to a cell's overall transcriptional profile. We therefore used these cell scores to generate a weighted average of enrichment scores for each motif at a single-cell level, such that the enrichment score from a high scoring metagene contributed more strongly to a cell's enrichment score than a lowly scoring metagene.
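A minimal Python sketch of this weighting for a single cell and a single motif follows. Dividing by the sum of the cell scores is an assumption about how the weighted average was normalized, and the numbers are made up.

```python
def cell_enrichment(cell_scores, motif_scores):
    """Weighted average of one motif's metagene enrichment scores for one cell.

    cell_scores:  NMF cell scores, one weight per metagene.
    motif_scores: enrichment score of the motif in each metagene (same order).
    """
    total = sum(cell_scores)
    if total == 0:
        return 0.0
    return sum(w * e for w, e in zip(cell_scores, motif_scores)) / total


# A cell dominated by the first metagene, where the motif is enriched.
print(cell_enrichment([3.0, 1.0], [2.0, 0.0]))  # (3*2 + 1*0) / 4 = 1.5
# A cell dominated by the second metagene, where the motif is not enriched.
print(cell_enrichment([1.0, 3.0], [2.0, 0.0]))  # (1*2 + 3*0) / 4 = 0.5
```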

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

We then used the enrichment scores we calculated for our Hydra atlas to generate a heat map of single-cell enrichment scores averaged across different cell types. For this plot, and for other downstream analyses, we dropped any motifs that were not found to show signs of conservation in our cross-species whole-genome alignment (specified in the motifConservationStats.csv file generated in 07_genomeConservation.md). This section of code also makes use of the file k56_mg_annot.csv, which was manually generated and includes descriptive names for each metagene. It also uses the file mg_order.csv, which specifies the order in which metagenes appear in the heat map.

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

motifHeat

Removing Redundant Motifs from the Enrichment Results

This initial heat map had over 300 rows, making it difficult to read. To reduce the number of rows in the plot, we removed redundant motif entries, which we defined as motifs with highly similar sequence composition and enrichment patterns. We had already grouped motifs according to sequence similarity. In the following section of the analysis, we performed another clustering analysis where we clustered motifs based on correlation scores calculated by comparing motif enrichment results.

We grouped the motifs in a way that integrated both the sequence-based and the enrichment-based clustering analyses, such that motifs were only considered redundant if they had similar sequence composition and enrichment results. To collapse a group of motifs that were flagged as redundant, we presented the averaged enrichment profile of all motifs within that group.
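One way to picture this integration is to treat each motif's pair of cluster labels (sequence-based, enrichment-based) as a joint key, so that only motifs sharing both labels are merged. The Python sketch below does this with hypothetical cluster labels and profiles; the "/"-joined group names are an illustrative convention, not the naming used in the actual analysis.

```python
from collections import defaultdict


def collapse_redundant(seq_cluster, enr_cluster, profiles):
    """Average the enrichment profiles of motifs sharing both cluster labels."""
    groups = defaultdict(list)
    for motif in profiles:
        groups[(seq_cluster[motif], enr_cluster[motif])].append(motif)
    collapsed = {}
    for members in groups.values():
        n = len(members)
        mean_profile = [
            sum(vals) / n for vals in zip(*(profiles[m] for m in members))
        ]
        collapsed["/".join(sorted(members))] = mean_profile
    return collapsed


# m1, m2, and m3 have similar sequences, but only m1 and m2 share an enrichment pattern.
seq_cluster = {"m1": 1, "m2": 1, "m3": 1}
enr_cluster = {"m1": "a", "m2": "a", "m3": "b"}
profiles = {"m1": [1.0, 0.0], "m2": [3.0, 0.0], "m3": [0.0, 2.0]}
low_redundancy = collapse_redundant(seq_cluster, enr_cluster, profiles)
print(low_redundancy)
```

Here m1 and m2 collapse into a single averaged row, while m3 stays separate despite its sequence similarity.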

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

motDistClustEn

We then generated a new, 'low-redundancy' version of the heat map.

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

motHeatLr

Identifying Candidate Regulators of Enriched Motifs

We next linked enriched motifs to putative regulators. To do this we needed to correlate a motif's enrichment pattern in the single-cell atlas to the expression patterns of predicted transcription factors. One potential challenge for performing this type of correlation analysis involving single-cell gene expression data is that the low depth of scRNA-seq makes it susceptible to frequent 'drop-outs', where a gene with moderate expression has zero counts in a non-trivial number of cells. To mitigate the issues this might cause for the correlation analysis between motif enrichment and gene expression, we generated a matrix of imputed read counts based on our NMF analysis.

The goal of NMF is to generate two matrices (the cell score matrix and the gene score matrix) that, when multiplied together, create an approximation of the original data matrix (the single-cell gene expression matrix). We took advantage of this to create an expression matrix that roughly recapitulated the original expression data, but without any drop-outs, smoothing out the data and making it more suitable for correlation analyses. This was the matrix that we used to determine the correlation score between predicted transcription factors and enriched motifs.
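The reconstruction is simply the product of the two NMF factor matrices. A small Python sketch with invented values shows the idea; the matrix orientation (cells by metagenes, metagenes by genes) is an assumption made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Cell score matrix: 2 cells x 2 metagenes (hypothetical values).
cell_scores = [
    [1.0, 0.0],
    [0.5, 2.0],
]
# Gene score matrix: 2 metagenes x 3 genes (hypothetical values).
gene_scores = [
    [2.0, 0.0, 4.0],
    [0.0, 1.0, 1.0],
]
# Imputed expression: 2 cells x 3 genes.
imputed = matmul(cell_scores, gene_scores)
print(imputed)
```

Because every imputed value is a sum over metagenes, a gene receives a non-zero value in any cell that scores on a metagene the gene contributes to, even if the raw count in that cell was zero.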

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

From this motif/transcription factor correlation matrix, we generated a summary table of all putative co-expression regulators. A candidate regulator was defined as a transcription factor that had a correlation score of 0.5 or greater with a motif that it could plausibly bind based on its Pfam domain content. This table included the gene ID of the candidate regulator as well as a list of all the enriched motifs that transcription factor could potentially be regulating. We also added some functional annotation data for the candidate regulator, such as predicted orthologs and Pfam domains.
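The selection rule can be sketched in Python as follows. Pearson correlation is assumed here because the correlation measure is not named above, and all gene IDs, motif IDs, and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def candidate_regulators(tf_expr, motif_enr, plausible, min_cor=0.5):
    """Return {gene: [motifs]} for plausible gene/motif pairs with correlation >= min_cor."""
    hits = {}
    for motif, genes in plausible.items():
        for gene in genes:
            if pearson(tf_expr[gene], motif_enr[motif]) >= min_cor:
                hits.setdefault(gene, []).append(motif)
    return hits


# Toy per-cell values: imputed TF expression and motif enrichment scores.
tf_expr = {"tfA": [0.0, 1.0, 2.0, 3.0], "tfB": [3.0, 2.0, 1.0, 0.0]}
motif_enr = {"m1": [0.0, 2.0, 4.0, 6.0]}
plausible = {"m1": ["tfA", "tfB"]}  # genes whose Pfam domains match the motif
regulators = candidate_regulators(tf_expr, motif_enr, plausible)
print(regulators)
```

Only tfA is retained in this toy case, since tfB can plausibly bind the motif but is anti-correlated with its enrichment pattern.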

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

To generate the motif enrichment and gene expression plots provided in the paper we used the template below. The motif IDs and gene IDs were swapped to create the other plots presented throughout the text.

(snippet from 02_enrichmentAnalysis/hydraRegulators.R)

ebfM

ebfGexp

Analysis Using Shuffled Binding Motifs

We next wanted to determine the extent to which the enrichment patterns we observed above were due to true biological signal, as opposed to being artifacts driven by chance. To test this, we repeated the single-cell motif enrichment analysis using shuffled instead of bona fide transcription factor binding motifs. We had already generated a list of shuffled motifs for the analysis described in 07_genomeConservation.md. In addition we had subjected predicted instances of these shuffled motifs to the same filtering criteria (i.e., requiring conservation across multiple Hydra genomes and localization to open chromatin) we used for generating binding site predictions for our database of bona fide binding sites.

Because the shuffled motifs should be random, non-functional sequences, any enrichment patterns we identify using the shuffled motifs will likely be artifacts. If we get similar enrichment results using both the shuffled and bona fide versions of a motif, we can conclude that our enrichment results are likely not biologically meaningful; however, if the bona fide motif has a dramatically different enrichment pattern than the shuffled motif, then we can conclude that the enrichment results are not purely the result of sequence bias.

Performing Gene Set Enrichment Analyses

To repeat our enrichment analysis using shuffled motifs, we used the conShufMotsATAC_finalhits.txt table to generate gene sets. We then performed GSEA for each metagene in the Hydra single-cell atlas using these shuffled motif gene sets:

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

Generating Single-Cell Motif Enrichment Scores

We then used NMF cell scores to translate the metagene enrichment results into single-cell enrichment scores for each motif:

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

We then visualized these motif enrichment scores using a motif by cell type heat map:

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

motifHeatmap_shuf

This heat map revealed a large number of enriched motifs when using the shuffled motif set, although markedly fewer metagenes had enriched motifs when compared to the bona fide motif set.

Comparing Enrichment Results for Genuine and Shuffled Motifs

To determine whether the shuffled and bona fide versions of each motif had similar enrichment patterns (which would indicate that the patterns were driven by sequence bias), we calculated a correlation score between the bona fide and shuffled enrichment results for each motif. We found that the enrichment results differed markedly for the vast majority of motifs, indicating that the enrichment patterns we observed were not primarily driven by sequence bias artifacts.

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

enCorScor

One of the criteria we used for identifying transcriptional regulators was that their transcriptional expression pattern should overlap with the enrichment pattern of their target motif. We therefore sought to determine if the enrichment patterns we calculated using bona fide motifs tended to correlate more closely with the expression patterns of their candidate regulators than the enrichment patterns calculated using shuffled motifs.

To do this, we calculated a correlation score for each possible gene/motif pair (determined above based on Pfam domains) by comparing the NMF-imputed gene expression values for the gene and the corresponding motif enrichment scores. We calculated this correlation score for both the bona fide and shuffled versions of the motif. We then compared the two correlation scores (one for bona fide and one for shuffled) using a scatter plot, including only those gene/motif pairs that had a high correlation score (above 0.4) for either motif version.
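The filtering rule for the scatter plot can be written compactly in Python. The gene/motif pairs and correlation values below are hypothetical; each retained point is returned as (pair, shuffled score, bona fide score).

```python
def pairs_to_plot(cor_real, cor_shuf, min_cor=0.4):
    """Keep gene/motif pairs whose correlation exceeds min_cor for either motif version."""
    return [
        (pair, cor_shuf[pair], cor_real[pair])
        for pair in cor_real
        if max(cor_real[pair], cor_shuf[pair]) > min_cor
    ]


# Toy correlation scores for three gene/motif pairs.
cor_real = {("tfA", "m1"): 0.7, ("tfB", "m2"): 0.1, ("tfC", "m3"): 0.2}
cor_shuf = {("tfA", "m1"): 0.05, ("tfB", "m2"): 0.45, ("tfC", "m3"): 0.3}
points = pairs_to_plot(cor_real, cor_shuf)
print(points)
```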

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

motEnCorComp

(Note: the x-axis corresponds to scores calculated using shuffled motif enrichment patterns and the y-axis corresponds to scores calculated using bona fide motif enrichment patterns)

The plot above shows that for any given gene/motif pair, the degree of correspondence between enrichment and expression for the shuffled motif has virtually no correlation to the degree of correspondence for the bona fide motif. This can also be demonstrated more concisely by just calculating the correlation between the two scores:

In addition, the plot reveals a trend where unshuffled motifs tend to have higher correlation scores than the shuffled motifs, as there are more points with high values on the y-axis than the x-axis. This can also be demonstrated using a box plot showing the distribution of correlation scores for bona fide and shuffled motifs:

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

geneMotCorComp

Although most gene/motif pairs showed no correlation (even for bona fide motif sequences), the overall correspondence between motif enrichment and gene expression was significantly higher for bona fide motifs than for shuffled motifs.

(snippet from 02_enrichmentAnalysis/hydraRegulatorsShuf.R)

Overall, these results strongly suggest that the enrichment patterns we observed using our bona fide motif set are not purely driven by sequence bias artifacts. In addition, the enrichment patterns of unshuffled motifs more closely matched the expression patterns of their putative regulators than would be expected by chance.

Files Associated with This Document