Aligning and Processing Clytia Single Cell Data

This document covers our re-mapping of the Clytia hemisphaerica medusa single-cell atlas. This entailed generating a new set of gene models for the most recent Clytia genome assembly, mapping the raw reads to the new gene models, generating new cell-type clusters and a new UMAP plot, and visualizing the expression patterns of Clytia genes that were lost in the Hydra lineage.

Generating a New Set of Clytia Gene Models

The original Clytia single cell atlas publication created a new set of gene models for mapping their scRNA-seq data by generating a de novo transcriptome using Trinity, merging those transcripts with transcripts from the original Clytia genome annotation, and then mapping the merged transcripts to the v1 Clytia genome with GMAP. However, we used an updated, more contiguous version of the Clytia genome for our whole-genome alignments (described in 07_genomeConservation.md). Because we wanted to use the same Clytia reference genome throughout the study, we needed gene models for this updated assembly. Gene models are available for the updated genome, but they have relatively low BUSCO scores (81.1%). We therefore wanted to combine multiple sources of gene predictions to maximize gene model completeness.

To do this, we started with the Trinity transcriptome from the Clytia single-cell paper (called transcripts_TRINITY_20201021.fa, downloaded here) and the v1 transcript models from the initial Clytia genome annotation (called transcripts.fa, downloaded here). We combined these two reference files:
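As a minimal illustration of the combining step, the Python sketch below concatenates FASTA files while guarding against a missing trailing newline in any input. The function name and the output filename in the usage note are hypothetical; the command actually used is not reproduced here.

```python
def combine_fasta(inputs, output):
    """Concatenate FASTA files into one file and return the number of records written."""
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    n_records += line.startswith(">")  # count header lines
                    out.write(line)
                    last = line
            if not last.endswith("\n"):
                out.write("\n")  # avoid gluing the next header onto this sequence
    return n_records
```

For example, `combine_fasta(["transcripts_TRINITY_20201021.fa", "transcripts.fa"], "combined.fa")` would write both transcript sets to a single file (`combined.fa` is an assumed name).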

We then aligned the resulting fasta file to the updated Clytia genome (file renamed to clytiaG.fa, downloaded from here) using PASA. This required that we first prep the sequence file using the PASA seqclean command:

(01_newGenes/runCleanup.sh)

We then ran the standard PASA annotation pipeline:

(01_newGenes/runPipeline.sh)

The PASA pipeline mainly just aligned the transcripts we provided to the genome and tried to resolve those aligned transcripts into model 'assemblies' of transcribed regions across the genome. To get candidate ORFs from these assemblies we ran the following command:

(01_newGenes/runCDS.sh)

The output from this step was a bit messy. The problem appeared to arise from the fact that in some cases PASA predicted multiple overlapping ORFs from a single assembly and, instead of treating them as different isoforms of the same gene, labeled them as separate genes in the gff. This resulted in cases where two genes would have the same root name but different suffixes (e.g., asmbl_100.p1 and asmbl_100.p2). For the sake of simplicity, we just wanted to collapse these alternate ORFs into a single gene with a single unique identifier.

We attempted to clean up this problem using AGAT:

However, we found that this approach didn't work very well and still left us with multiple redundant, overlapping genes. We therefore created a custom R script to address the issue:

(01_newGenes/pasaGffFix.R)
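The collapsing idea can be illustrated with a short Python sketch. This is not the logic of pasaGffFix.R, which is not reproduced here. It only shows how ORF identifiers that share a root name (such as asmbl_100.p1 and asmbl_100.p2) could be grouped under one gene identifier, assuming the suffix always has the form `.pN`.

```python
import re
from collections import defaultdict

# Assumed suffix pattern: a trailing ".p" followed by digits
ORF_SUFFIX = re.compile(r"\.p\d+$")

def gene_root(orf_id):
    """Strip a trailing '.pN' ORF suffix, e.g. 'asmbl_100.p2' -> 'asmbl_100'."""
    return ORF_SUFFIX.sub("", orf_id)

def collapse_orfs(orf_ids):
    """Group ORF identifiers under a single gene ID per root name."""
    genes = defaultdict(list)
    for orf_id in orf_ids:
        genes[gene_root(orf_id)].append(orf_id)
    return dict(genes)
```

Each key of the returned dictionary would then serve as the unique gene identifier, with the grouped ORFs treated as alternate transcripts of that gene.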

Finally, we also brought in the publicly available gene models for the updated assembly (called Clytia_hemisphaerica_gca902728285.GCA902728285v1.51.gff3, available here) to try and maximize the completeness of our hybrid gene models.

We then used a custom R script to drop non-coding features from the merged gtf file (mainly from the Ensembl GFF) and simplify the gene names:

(01_newGenes/clGffFix.R)
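A rough Python sketch of the feature-filtering idea follows. It keeps only rows whose feature type (column 3 of a GFF/GTF row) is in an allowed set. The set of types treated as coding is an assumption made for illustration, and the gene-renaming step performed by clGffFix.R is not shown.

```python
# Assumed set of feature types to keep; the real script may use different criteria
CODING_TYPES = {"gene", "mRNA", "exon", "CDS"}

def keep_coding_features(gff_lines, keep_types=CODING_TYPES):
    """Yield comment lines and the rows whose type (column 3) is in keep_types."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A well-formed GFF/GTF row has exactly nine tab-separated columns
        if len(fields) == 9 and fields[2] in keep_types:
            yield line
```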

We then exported the transcript and protein sequences from these gene models:

The resulting BUSCO stats of these gene models were pretty good, about 6% more complete BUSCOs than the currently available gene models for the updated genome:

Mapping scRNA-seq Data to the Updated Clytia Genome

As we did for our H. vulgaris scRNA-seq analysis, we opted to use transcriptome sequence for mapping the scRNA-seq data. This required that we prep a modified gtf file for the mapping pipeline (this file treats each transcript in the transcriptome as a contig that has a single gene on it that spans the entire sequence). We did this with the following R script:

(02_remapping/makeTranGtf.R)
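The structure of such a gtf file can be sketched in Python as follows. This is an illustration, not a port of makeTranGtf.R: it assumes that one full-length exon row per transcript is sufficient and that the gene ID is simply the transcript ID, neither of which is confirmed by the script itself. The `source` label is also made up.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID is the header text up to the first space
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

def transcript_gtf(lengths, source="transcriptome"):
    """Build one exon row per transcript, spanning the whole transcript 'contig'."""
    rows = []
    for seq_id, length in lengths.items():
        # Assumption: gene_id and transcript_id are both set to the sequence ID
        attrs = f'gene_id "{seq_id}"; transcript_id "{seq_id}";'
        rows.append("\t".join(
            [seq_id, source, "exon", "1", str(length), ".", "+", ".", attrs]
        ))
    return rows
```

Each row places the feature on the plus strand from position 1 to the transcript length, so reads aligned anywhere on the transcript are assigned to that gene.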

For mapping, we used the cell ranger pipeline (v6.0.2). First, we prepped the reference metadata prior to running the actual mapping pipeline:

(02_remapping/runMakeRef.sh)

The Clytia data were sequenced over two lanes, and each lane needed to be initially mapped separately. Lane 1 consisted of samples FT-SA16888, FT-SA16889, FT-SA16890, and FT-SA16891, and lane 2 consisted of samples FT-SA16892, FT-SA16893, FT-SA16894, and FT-SA16895. Each sample has a read1 and a read2 file.

We used this script to map lane 1 reads:

(02_remapping/runCount1.sh)

And this script to map lane 2 reads:

(02_remapping/runCount2.sh)

To pool the samples after mapping, we prepared the file aggr.csv, which included the paths to the read counts produced by the initial mapping steps:

(02_remapping/aggr.csv)
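For orientation, a file of this kind could be generated with the Python sketch below. The column names `sample_id` and `molecule_h5` are what we understand Cell Ranger v6 to expect, but treat them as an assumption and check the Cell Ranger documentation. The sample names and paths in the usage note are hypothetical and do not reflect the contents of the actual aggr.csv.

```python
import csv
import io

def aggr_csv(runs):
    """Return CSV text for a list of (sample_id, path_to_molecule_info_h5) pairs."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample_id", "molecule_h5"])  # assumed header names
    writer.writerows(runs)
    return buf.getvalue()
```

For example, `aggr_csv([("lane1", "lane1/outs/molecule_info.h5"), ("lane2", "lane2/outs/molecule_info.h5")])` yields a header row followed by one row per mapped lane.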

We then aggregated the single cell expression matrices:

(02_remapping/runAggr.sh)

This produced the file raw_feature_bc_matrix.h5 containing the gene-by-cell read count matrix that was used for subsequent analyses.

Initial Clustering Analysis

In the original Clytia scRNA-seq publication, the authors conducted an extensive analysis to identify and exclude low quality cells. We opted to reuse the results from that analysis by simply retaining the cells that were present in the processed dataset from the original publication (available here) in our re-mapped data. After dropping the previously identified problematic cells, we initialized a Seurat object and performed some additional filtering:

(snippet from 03_clustering/initClRemapSeurat.R)
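The cell-retention step amounts to intersecting cell barcodes between the two datasets. A minimal Python sketch, assuming the barcodes in both datasets have already been brought into the same naming scheme (for example, matching lane suffixes), might look like this. The function name is hypothetical.

```python
def retain_known_cells(remapped_barcodes, original_barcodes):
    """Keep re-mapped barcodes that also occur in the original processed dataset."""
    keep = set(original_barcodes)
    # Preserve the order of the re-mapped data so matrix columns stay aligned
    return [bc for bc in remapped_barcodes if bc in keep]
```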

(figure: clytiaVln)

Based on this distribution we dropped cells with more than 4000 UMIs, 500 or fewer reads, or greater than 100000 reads.

(snippet from 03_clustering/initClRemapSeurat.R)
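The thresholds can be expressed as a single predicate. The Python sketch below takes the cutoffs exactly as worded above. Which Seurat metadata columns they correspond to is not stated here, so the argument names `n_umi` and `n_reads` are hypothetical.

```python
def passes_qc(n_umi, n_reads, max_umi=4000, min_reads=500, max_reads=100000):
    """Return True if a cell survives the filtering thresholds described in the text."""
    # Drop cells with more than max_umi UMIs, min_reads or fewer reads,
    # or more than max_reads reads
    return n_umi <= max_umi and min_reads < n_reads <= max_reads
```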

We then ran the standard normalization, clustering, and plotting steps:

(snippet from 03_clustering/initClRemapSeurat.R)

(figure: clytiaElbow)

We used 45 PCs for clustering and plotting:

(snippet from 03_clustering/initClRemapSeurat.R)

(figure: clytiaRemapUMAP)

We wanted to validate that our re-mapping and re-clustering recapitulated the results from the original atlas publication, so we imported the fully processed and annotated single cell data from the original publication (available here) and converted it to a Seurat object (the original analysis was done in Python, but a Seurat object is easier to work with in R).

(snippet from 03_clustering/initClRemapSeurat.R)

(figure: clOrigUmap)

We propagated the cell cluster labels from the original publication clusters to our newly clustered data:

(snippet from 03_clustering/initClRemapSeurat.R)
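Label propagation of this kind is a lookup by cell barcode. The Python sketch below illustrates it, together with a small cross-tabulation that can be used to check how well new clusters line up with the original labels. Function names and the placeholder label are hypothetical, and the sketch assumes barcodes are directly comparable between the two datasets.

```python
from collections import Counter

def propagate_labels(remapped_barcodes, original_labels, missing="unlabeled"):
    """Map each re-mapped barcode to its original cluster label, if it has one."""
    return {bc: original_labels.get(bc, missing) for bc in remapped_barcodes}

def cluster_label_table(new_clusters, labels):
    """Count cells for every (new cluster, original label) pair."""
    return Counter((new_clusters[bc], labels[bc]) for bc in new_clusters)
```

A table dominated by one original label per new cluster would indicate that the two clusterings agree.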

(figure: clRemapUmapOrigAnnot)

The original cell type labels largely coincided with our own clusters, indicating that our re-mapping and re-clustering reproduced the results from the original publication.

Visualizing Expression of Genes Lost in Hydra

The Hydra genus is notable for its simplified life cycle, lacking the planula and medusa stages found in other hydrozoans such as Clytia. This has been correlated with substantial gene loss in the Hydra lineage, but the function of these lost genes is not well understood. We combined our OrthoFinder analysis (described in 03_aepGenomeAnnotation.md) with the Clytia medusa single-cell atlas to determine where genes lost in Hydra are expressed.

We first identified genes that were lost in the Hydra genus using two criteria: 1) the genes needed to be absent from all Hydra proteomes in our OrthoFinder analysis (H. viridissima, H. circumcincta, H. oligactis, H. vulgaris strain 105, and H. vulgaris strain AEP), and 2) the genes needed to be present in both the Hydractinia echinata and Clytia hemisphaerica proteomes.

(snippet from 04_geneLoss/geneLoss.R)
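The two criteria translate directly into a filter over per-species orthogroup gene counts. The Python sketch below shows the logic. It is not the code in geneLoss.R, and the species column names and orthogroup IDs are hypothetical stand-ins for whatever labels the OrthoFinder output actually uses.

```python
# Hypothetical species labels; substitute the column names from the OrthoFinder output
HYDRA = ["H_viridissima", "H_circumcincta", "H_oligactis", "H_vulgaris_105", "H_vulgaris_AEP"]
REQUIRED = ["H_echinata", "C_hemisphaerica"]

def lost_in_hydra(orthogroup_counts, hydra=HYDRA, required=REQUIRED):
    """Return orthogroups absent from every Hydra proteome but present in all required species."""
    lost = []
    for orthogroup, counts in orthogroup_counts.items():
        absent = all(counts.get(sp, 0) == 0 for sp in hydra)
        present = all(counts.get(sp, 0) > 0 for sp in required)
        if absent and present:
            lost.append(orthogroup)
    return lost
```

The Clytia genes belonging to the returned orthogroups would then form the list of genes lost in Hydra.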

After we generated our list of lost genes, we wanted to determine where they were expressed in Clytia medusae. Because we used our new Clytia gene models for the OrthoFinder analysis, we needed to use our re-mapped version of the Clytia atlas, but we also wanted to make use of the original atlas publication's UMAP and cell type annotations. To do this, we incorporated the UMAP and cluster annotations from the original publication into our re-mapped Seurat object. To have a more resolved view of the neuronal subpopulation, we also incorporated the cluster annotations from the neuronal sub-clustering analysis in the original publication (downloaded here).

(snippet from 04_geneLoss/geneLoss.R)

(figure: fullLabClUMAP)

We then passed the list of genes lost in Hydra to the Seurat AddModuleScore function to calculate an aggregate score representing how highly these lost genes are expressed in each Clytia cell transcriptome.

(snippet from 04_geneLoss/geneLoss.R)
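To give a sense of what a module score measures, the Python sketch below implements a simplified version of the idea: for each cell, the mean expression of the gene set minus the mean expression of control genes drawn from the same average-expression bins. It is not Seurat's implementation. The bin count, control size, and the exclusion of module genes from the control pool are simplifying assumptions.

```python
import random
from statistics import mean

def module_scores(expr, module_genes, n_bins=4, n_ctrl=2, seed=0):
    """expr maps gene -> list of per-cell normalized expression values."""
    rng = random.Random(seed)
    # Rank genes by average expression and split the ranking into bins
    genes = sorted(expr, key=lambda g: mean(expr[g]))
    bin_of = {g: i * n_bins // len(genes) for i, g in enumerate(genes)}
    module = [g for g in module_genes if g in expr]
    if not module:
        raise ValueError("none of the module genes are in the expression data")
    # Draw control genes from the same expression bin as each module gene
    controls = []
    for g in module:
        pool = [c for c in genes if bin_of[c] == bin_of[g] and c not in module]
        controls.extend(rng.sample(pool, min(n_ctrl, len(pool))))
    n_cells = len(expr[genes[0]])
    scores = []
    for i in range(n_cells):
        module_mean = mean(expr[g][i] for g in module)
        control_mean = mean(expr[g][i] for g in controls) if controls else 0.0
        scores.append(module_mean - control_mean)
    return scores
```

Subtracting the control mean keeps cells with globally high expression from scoring high by default.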

(figure: clLostGeneScores)

To more clearly visualize how the lost gene module score varied across different cell types, we also generated a box plot that grouped module scores by cell type.

(snippet from 04_geneLoss/geneLoss.R)
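The summary a box plot draws for each cell type (quartiles and median) can be computed with the Python sketch below. It only illustrates the grouping and is not the plotting code in geneLoss.R. The function name is hypothetical, and the quartile method is one of several common conventions.

```python
from collections import defaultdict
from statistics import quantiles

def box_stats(scores, cell_types):
    """Return the quartiles of the module scores for each cell type."""
    groups = defaultdict(list)
    for score, cell_type in zip(scores, cell_types):
        groups[cell_type].append(score)
    stats = {}
    for cell_type, values in groups.items():
        if len(values) < 2:
            q1 = med = q3 = values[0]  # quantiles() needs at least two values
        else:
            q1, med, q3 = quantiles(values, n=4, method="inclusive")
        stats[cell_type] = {"q1": q1, "median": med, "q3": q3}
    return stats
```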

(figure: lostScores)

 

Files Associated with This Document