Aligning, Processing, and Analyzing Hydra Drop-seq Data

This document covers the re-mapping and re-analysis of the Hydra single-cell atlas using the strain AEP H. vulgaris genome as a reference. It describes initial read mapping, cell filtering and QC, clustering, plotting, co-expression analysis, doublet removal, and final annotation of the Hydra Drop-seq dataset originally published in Siebert et al. (2019) (raw data available here).

Mapping Drop-seq Data to the AEP Assembly Transcriptome

To map our raw Drop-seq reads to the AEP assembly, we made use of the Drop-seq Tools toolkit (v2.4) provided by the Broad Institute.

First we prepped the reference files for mapping. We used the transcript sequences for the longest isoform of each gene (HVAEP1.tran.longestIso.fa). We opted to use transcriptomic sequence as opposed to genomic sequence because it eliminated the possibility of mapping to off-target intronic and intergenic regions. We also supplemented these sequences with mitochondrial genes, whose expression levels can provide a useful readout of stress in a cell. The final file was called HVAEP1.tran.final.mito.fasta.

The Drop-seq Tools pipeline requires a gtf file for the reference sequence, even if it's not genomic sequence. This required the generation of a custom gtf file. We generated a gtf file from our gff3 reference file containing only the longest isoform of each gene (generated in the AEP Genome Annotation step) using AGAT:

We then modified the resulting gtf using the following R script:

(01_mapping/makeTranGtf.R)

We then appended mitochondrial genes to both the transcriptome fasta file (creating HVAEP1.tran.final.mito.fasta) and the transcriptome gtf file (creating HVAEP1.transcriptome.mito.gtf).

We used the create_Drop-seq_reference_metadata.sh script from Drop-seq Tools to prep the reference files for mapping. We had to make one minor change because of an issue with the STAR indexing step. We named the modified script create_Drop-seq_reference_metadata_mod.sh. We made the following change (output from diff create_Drop-seq_reference_metadata.sh create_Drop-seq_reference_metadata_mod.sh):

We ran the modified pipeline using the following script:

(01_mapping/slurm_metadata_mod.sh)

We next prepared the fastq files for mapping. The Drop-seq Tools alignment pipeline requires bam files instead of fastq files, so we used the following script to call the Picard FastqToSam function to pool read 1 and read 2 from each sample into a single bam file.

(01_mapping/dropseq_FastqtoSam.sh)

We then used the Drop-seq_alignment.sh script included with Drop-seq Tools to map the bam files for each of our drop-seq libraries using the following script:

(01_mapping/slurm_Drop-seq_alignment.sh)

This created subdirectories in the out directory for each library that was mapped, containing, among other things, the final bam file of mapped reads (final.bam).

Filtering Cell Barcodes by Read Depth

Only a small minority of beads in a Drop-seq run will end up in a droplet that also contains a cell. This means that a huge portion of cell barcodes in a Drop-seq library will come from beads that were only exposed to 'ambient' contaminating RNA and will contain no useful information. To filter out these 'junk' cell barcodes, we needed to generate tables for each library that ranked cell barcodes by the number of reads they received, from most reads to least reads. To do this, we used the BamTagHistogram command from Drop-seq Tools:

(01_mapping/getCellCounts.sh)

Plotting the cumulative sum of reads from these tables reveals a curve with a distinct "elbow", an inflection point where the curve rapidly plateaus. This elbow is the transition from 'real' cell barcodes to 'junk' barcodes. To identify the inflection point in a reproducible and relatively unbiased way, we used a geometric approach. We plotted the read count tables to create an elbow plot:

justElbowPlot

We then drew a line from the origin to the point at which the curve reached 85% of total reads on the y-axis:

ElbowWithEightyLine

We then found the point on the elbow plot that was farthest from this diagonal. The X-coordinate of this point determined the total number of 'real' cell barcodes in the library:

fullElbow

This was done using the following R script:

(01_mapping/makeElbowPlots.R)
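As an illustration, the geometric elbow calculation described above can be sketched in Python (the actual analysis used the R script referenced above; the function name and simulated counts here are purely illustrative):

```python
import numpy as np

def elbow_cell_count(read_counts, y_frac=0.85):
    """Estimate the number of 'real' cell barcodes using the geometric
    elbow method: find the point on the cumulative read curve farthest
    from a line drawn from the origin to the y_frac point."""
    counts = np.sort(np.asarray(read_counts))[::-1]  # most to least reads
    cum = np.cumsum(counts)
    # index where the cumulative curve reaches y_frac of total reads
    end = int(np.searchsorted(cum, y_frac * cum[-1]))
    x = np.arange(end + 1)
    y = cum[: end + 1]
    # perpendicular distance of each curve point from the origin-to-end line
    dist = np.abs(cum[end] * x - end * y) / np.hypot(float(cum[end]), end)
    # the farthest point from the diagonal marks the elbow
    return int(np.argmax(dist)) + 1

# simulated library: 500 high-depth cells plus 50,000 ambient barcodes
rng = np.random.default_rng(0)
counts = np.concatenate([rng.integers(8000, 12000, 500),
                         rng.integers(1, 40, 50000)])
print(elbow_cell_count(counts))  # approximately 500
```

On this simulated input the estimate lands near the true cell number (500), since the cumulative curve bends sharply at the transition from cell-containing to ambient barcodes.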

We then passed the estimated cell number determined above to the DigitalExpression function from Drop-seq Tools. This function outputs a digital gene expression (DGE) matrix that contains read counts for the specified number of cells. The specific cell barcodes exported into this matrix are determined by read depth, with the highest depth cells being used first, and lowest depth cells being used last.

(01_mapping/makeDGE.sh)

The resulting DGE files are formatted as gene-by-cell matrices. These were used as the initial input into the Seurat Analysis below.

Initial Clustering

We imported the Drop-seq data formatted as DGE matrices into Seurat (v4.1.0), performed additional filtering to remove low-quality cells, removed batch effects across the different libraries, performed Louvain clustering, and generated an initial UMAP plot. All of this (as well as the contents of the next section) was done within one large R script (initGenomeDsSeurat.R). For the purposes of this document, we have broken this script into chunks to simplify the explanation.

After some initial setup for the R session, our first step was to import the DGE matrices from each Drop-seq library as individual Seurat objects. During this step, we also performed some additional filtering to remove low quality cells by removing barcodes with fewer than 300 or greater than 7500 unique molecular identifiers (UMIs), or fewer than 500 or greater than 75000 reads.
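The UMI and read-depth filter described above amounts to a simple windowing rule on per-barcode totals; a minimal Python sketch (the actual filtering was done in Seurat, and the function name here is hypothetical):

```python
import numpy as np

def filter_barcodes(umi_counts, read_counts,
                    min_umi=300, max_umi=7500,
                    min_reads=500, max_reads=75000):
    """Keep barcodes that fall within both the UMI window and the
    read-depth window; everything else is treated as low quality."""
    umi = np.asarray(umi_counts)
    reads = np.asarray(read_counts)
    keep = ((umi >= min_umi) & (umi <= max_umi) &
            (reads >= min_reads) & (reads <= max_reads))
    return keep

# three toy barcodes: too few UMIs, acceptable, too many UMIs/reads
keep = filter_barcodes([100, 1000, 9000], [400, 5000, 80000])
print(keep.tolist())  # [False, True, False]
```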

(snippet from 02_initClust/initGenomeDsSeurat.R)

We then integrated these separate Seurat objects using a reciprocal PCA analysis. This removed most batch effects that could cause cells to cluster because of technical reasons (e.g., differential expression of stress genes).

(snippet from 02_initClust/initGenomeDsSeurat.R)

We next calculated the first 80 principal components of the integrated dataset and plotted the variance explained by each component. This allowed us to estimate the approximate dimensionality of the dataset.

(snippet from 02_initClust/initGenomeDsSeurat.R)


elbowWithDubs

We opted to use a relatively high number of principal components (60) for our initial clustering. We used these first 60 PCs to generate a UMAP plot and to find clusters using the Louvain algorithm:

(snippet from 02_initClust/initGenomeDsSeurat.R)

umapWithDub

To get a sense of the cellular identity of these clusters, we used a panel of cell type markers (listed in markerPanel.csv) that were validated when the Hydra atlas was first published.

(snippet from 02_initClust/initGenomeDsSeurat.R)

markerPanelWithDubs

Based on the expression patterns of these markers, we then annotated this UMAP by cell type:

labeledDubUmap


Identifying Gene Co-Expression Programs Using Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) is a dimensionality reduction technique that breaks large and complex datasets into a relatively small number of 'parts' that can be combined in an additive fashion to represent any of the individual samples within the original dataset. In the context of gene expression data, NMF describes the transcriptomes of individual samples as a mixture of metagenes (i.e., a group of genes with correlated expression) that, when combined in a particular way, can be used to construct a given sample's specific transcriptional profile.

To generate this parts-based representation, NMF assumes that a data matrix (in this case a digital gene expression matrix) can be represented as the product of two matrices, called W and H, that are constrained to only contain positive values. For single-cell RNA expression data, W encodes how strongly each gene contributes to each metagene (i.e., gene weights), and H encodes how much each metagene contributes to each cell transcriptome (i.e., cell weights).

If the original dataset is a matrix of n rows (genes) and m columns (cells), W will be a n by k matrix and H will be a k by m matrix. k is the rank value, which determines how many parts the original dataset will be broken into. There is no objectively optimal value of k, so k needs to be estimated empirically for every dataset by evaluating results across a range of values.

All that is required as input into the NMF algorithm is a gene expression matrix and a value for k. NMF then initializes W and H matrices and iteratively adjusts the values within each matrix until the product of W x H produces as close an approximation of the original expression matrix as possible.
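As an illustration of the factorization itself, the following Python sketch implements NMF with Lee-Seung multiplicative updates on a toy matrix (cNMF's actual implementation differs in many details; this sketch is purely conceptual):

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0):
    """Factor a non-negative matrix V (genes x cells) into W (genes x k)
    and H (k x cells) using Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    eps = 1e-10  # guard against division by zero
    for _ in range(n_iter):
        # each update keeps all entries non-negative and reduces ||V - WH||
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy dataset: a 20-gene x 30-cell matrix built from 3 'metagenes'
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(V, k=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 4))  # relative reconstruction error, near zero
```

Because the toy matrix is exactly rank 3 and k matches that rank, the product W x H recovers it almost perfectly; on real expression data the approximation is far looser and depends heavily on the choice of k.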

To generate the initial input into the cNMF algorithm, we exported the raw and normalized gene expression matrices from the Hydra atlas Seurat object. We also exported the list of variable genes identified by Seurat. The variable gene list is needed because cNMF restricts the actual NMF analysis to only variable genes to reduce computation time, and then extrapolates the results to the remaining genes.

(snippet from 02_initClust/initGenomeDsSeurat.R)

Performing a Coarse Sweep of k Values

As k needs to be determined empirically for each dataset, we performed a broad sweep of k values from 15 to 90 in steps of five so we could identify the general range of k values that gave good results. We used the prepare function within the cnmf.py script to set up this initial run:

(03_cnmf/runPrep.sh)

We then ran the analysis using the following script:

(03_cnmf/runFactorize.sh)

The results of NMF are sensitive to how the W and H matrices are initialized. To ensure reproducibility, cNMF performs the NMF analysis for each k value multiple times (in the case of this analysis, we set this to 200 runs). This means that the files from each of these independent runs need to be combined for further analysis, which is done with the combine function:

(03_cnmf/runCombine.sh)

The pooled results can then be evaluated for how well and how reliably each k value approximated the original expression data. Plots of such metrics are generated with the k_selection_plot function:

(03_cnmf/runKselect.sh)

courseNMFk

The blue line represents the stability of the results, with higher values indicating more consistent results from run to run for a given k value. The red line indicates the reconstruction error, a measure of how well the NMF results recapitulated the original expression matrix. Generally, lower error values and higher stability values are desirable.

Often with NMF there are multiple local maxima for the stability metric, essentially representing different possible resolutions for looking at the data. In the case of this analysis, we saw that there were two stability maxima, one for a k value of 25 and another for a k value of 55. We opted to use the higher resolution analysis for downstream analysis.

Performing a Fine Sweep of k Values

Because our initial k sweep moved in steps of five, we needed to perform a second, higher-resolution sweep to identify the exact k value that gave optimal results. We therefore repeated the cNMF analysis using k values from 50 to 60.

Prepping the analysis:

(03_cnmf/runPrep2.sh)

Running the NMF analysis:

(03_cnmf/runFactorize2.sh)

Combining separate run files:

(03_cnmf/runCombine2.sh)

Generating the error and stability plot:

(03_cnmf/runKselect2.sh)

fineNMFk

Based on this plot, we selected a k value of 56.

The final step of the cNMF analysis is to combine the results from all 200 runs for a desired k value to generate consensus results matrices. This step involves filtering out individual outlier results that don't closely resemble the overall consensus. To select the cutoff for removing outliers, we inspected the distribution of a score that measured the distance of the results from one run to the other runs that were most similar to it. We generated a plot to examine this distribution using the consensus function, selecting an initial threshold of 2.00 (an arbitrarily high value that wouldn't exclude anything):

(03_cnmf/runConseunsus2.sh)

initNMFconsensus

Based on the distribution of distances, we selected a cutoff of 0.13, which captured the bulk of results while excluding the long tail of outliers:

(03_cnmf/runConseunsus2.sh)

nmfConsensus
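Conceptually, this outlier score resembles each run's mean distance to its nearest neighbors among the other runs; a rough Python sketch of that kind of filtering (cNMF's actual consensus procedure is more involved, and the components here are simulated):

```python
import numpy as np

def outlier_scores(components, n_neighbors=3):
    """Mean Euclidean distance from each run's component vector to its
    nearest neighbors across all runs (higher = more outlying)."""
    # full pairwise distance matrix between runs
    d = np.linalg.norm(components[:, None, :] - components[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    nearest = np.sort(d, axis=1)[:, :n_neighbors]
    return nearest.mean(axis=1)

# 20 runs that agree closely, plus 2 runs that landed far away
rng = np.random.default_rng(2)
comps = np.vstack([rng.normal(0, 0.01, (20, 5)),
                   rng.normal(3, 0.01, (2, 5))])
scores = outlier_scores(comps)
keep = scores < 0.13  # the same style of cutoff used above
print(int(keep.sum()))  # 20 consistent runs survive the filter
```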

This script generated a number of results files. These included the W matrix whole_unfilt_fine_narrow.gene_spectra_tpm.k_56.dt_0_13.txt and the H matrix whole_unfilt_fine_narrow.usages.k_56.dt_0_13.consensus.txt. The other notable output is the file whole_unfilt_fine_narrow.gene_spectra_score.k_56.dt_0_13.txt, which assigns genes Z scores based on how strongly they're associated with each metagene.

Identifying Doublets

During our initial annotation of the re-mapped Drop-seq data, we identified several clusters that expressed transcripts known to have mutually exclusive expression patterns. For example, the EN_NC_Dubs cluster expresses both nas14, which is expressed in endodermal tentacle cells, and nematocilin a, which is expressed in nematocytes. This dual expression is a hallmark of doublets, a phenomenon that arises when transcripts from two cells end up associated with a single bead. This can happen for either biological or technical reasons. Typically, doublets are removed during standard scRNA-seq data processing; however, in Hydra, the inclusion of doublets can in some ways provide greater accuracy. This is because certain cell types, specifically tentacle battery cells, are actually multi-cell complexes, where multiple interstitial cells (both neurons and nematocytes) are embedded within an epithelial cell. For this reason, we provide the Seurat object used to make the above plots for those researchers particularly interested in these naturally occurring multi-cell complexes (found in the file 02_initClust/labeledDubDs.rds).

However, for the sake of simplicity and clarity, we opted to remove doublets in the main published version of our AEP-mapped atlas. To identify doublets that needed to be removed, we scored cells based on how highly they expressed different sets of cell-type-specific markers. We then defined a cell as a doublet if it scored highly for two or more of these cell-type-specific gene modules.

To do this, we first used the Seurat FindMarkers function to identify markers for ectodermal, endodermal, nematocyte, neuronal, germline, and gland cells.

(snippet from 02_initClust/initGenomeDsSeurat.R)

We then used the Seurat AddModuleScore function to calculate a holistic score for each set of markers.

(snippet from 02_initClust/initGenomeDsSeurat.R)

dubScoresUmap

We were primarily concerned with epithelial/interstitial doublets, as interstitial/interstitial doublets are comparatively rare. We therefore defined a doublet as any cell with a module score > 0.2 for both an epithelial module and any other gene module.

(snippet from 02_initClust/initGenomeDsSeurat.R)

dubsTest
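The doublet rule described above can be sketched in Python (the actual calls used Seurat's module scores in R; the module names and score values here are hypothetical):

```python
import numpy as np

def call_doublets(scores, epithelial_mods=("ecto", "endo"), cutoff=0.2):
    """Flag cells that score above the cutoff for an epithelial module
    AND at least one non-epithelial module.

    scores: dict mapping module name -> per-cell score array."""
    names = list(scores)
    mat = np.vstack([scores[n] for n in names])  # modules x cells
    high = mat > cutoff
    epi = [i for i, n in enumerate(names) if n in epithelial_mods]
    other = [i for i, n in enumerate(names) if n not in epithelial_mods]
    epi_high = high[epi].any(axis=0)
    other_high = high[other].any(axis=0)
    return epi_high & other_high

# three toy cells: epithelial+neuronal doublet, pure neuron, pure ectoderm
scores = {"ecto": np.array([0.5, 0.1, 0.3]),
          "endo": np.array([0.0, 0.0, 0.0]),
          "neuro": np.array([0.4, 0.5, 0.1])}
print(call_doublets(scores).tolist())  # [True, False, False]
```

Note that, as described above, a cell high for two interstitial modules only (e.g., neuron plus nematocyte) would not be flagged under this rule, since interstitial/interstitial doublets are comparatively rare.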

We exported a list of all non-doublet cell IDs for the next data processing step.

(snippet from 02_initClust/initGenomeDsSeurat.R)

Reclustering the Doublet-Free Dataset

Removing Some Lingering Problematic Clusters

After removing (most) doublets, we repeated the clustering process from the beginning, using essentially the exact same steps as before; however, we included a step where we removed any previously identified doublet cells from the imported DGE matrices before building the initial Seurat objects.

(snippet from 04_finalize/nonDub.R)

almostNoDubUMAP

We then plotted our standard panel of markers:

(snippet from 04_finalize/nonDub.R)

almostNonDubMarks

We noticed one small cluster of endodermal cells (cluster 41) branching off the main body column/SC cluster that appeared to have high expression of a number of interstitial genes (e.g., periculin and dkk1/2/4a), suggesting it might contain some residual interstitial doublets. To explore this further, we used the FindMarkers function to find genes that distinguish cluster 41 from other body column stem cells.

(snippet from 04_finalize/nonDub.R)

c41Marks

The top nine markers we found were all primarily expressed in interstitial cells, strongly suggesting this cluster is made up of endoderm/interstitial doublets.

The identity of cluster 37 was less clear, but when we identified markers for the cluster (using the same approach as above), caspase-like and heat shock-like genes were among the output. We therefore hypothesized that this cluster reflected stress-related batch effects.

Based on this exploration, we decided to drop clusters 37 and 41 from the final object:

(snippet from 04_finalize/nonDub.R)

Final Clustering and Annotation

After arriving at the final set of cells, we performed one last clustering and UMAP analysis:

(snippet from 04_finalize/nonDub.R)

unlabeledNonDubUMAP

We again plotted our panel of marker genes:

(snippet from 04_finalize/nonDub.R)

nonDubMarks

Based on these plots, we annotated the clusters by cell type:

(snippet from 04_finalize/nonDub.R)

nonDubLabeledUMAP

We generated marker lists for these clusters using the following command:

(snippet from 04_finalize/nonDub.R)

Plotting Metagene Usage in the Hydra Atlas

In order to understand the biological function associated with the metagenes from our NMF analysis, we plotted cell score values (from the H matrix) on our finalized Hydra atlas UMAP.

(03_cnmf/visNMF.R)

nmfUMAPs

Files Associated with This Document