Characterizing Cell-Type-Specific Variations in the Ages of Transcribed Genes

This document outlines our approach for characterizing the ages of protein-coding genes in the AEP genome, with the goal of exploring how the unique transcriptional signatures of different cell types reflect evolutionary history. The analysis involved three steps: estimating the age of all AEP genes using the output from an Orthofinder analysis (described in 03_aepGenomeAnnotation.md), characterizing the differences in the relative proportions of gene ages among the cell types in our single-cell atlas, and calculating a holistic score of transcriptome age at single-cell resolution.

Estimating Gene Age

Our first step was to estimate when each gene in the Hydra genome originated. To do this, we needed to identify the most recent phylogenetic clade that contained all orthologs of each Hydra gene. The age of the gene in question was then taken to be the time elapsed since the most recent common ancestor of that clade.

We had previously performed an Orthofinder analysis that included 44 different proteomes spanning diverse metazoan clades. This analysis grouped all the protein sequences in the analysis into orthogroups, which are sets of protein sequences that are all predicted to have originated from a single ancestral gene.

Our goal was to use the orthogroup assignments to identify all species whose proteomes included at least one ortholog of each Hydra gene. We could then identify the smallest clade (i.e., the least speciose node in the Orthofinder species tree) that encompassed all orthologs. The node name corresponding to that clade (formatted N#, with smaller numbers indicating more basal nodes) was then used to represent gene age.

To determine which nodes contained which orthologs, we needed to generate a list of species contained within each node. This is not information provided explicitly in the Orthofinder output, so we needed to extract this information ourselves. Orthofinder generates tables of 'Phylogenetic Hierarchical Orthogroups', which convey orthology relationships relative to a specific node. These tables implicitly specify the species contained within the node they are describing, so we used them to generate node species lists for all nodes that contained Hydra (11 nodes in total).
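The idea of reading a node's species list off its table header can be sketched as follows. This is a minimal Python illustration rather than the R code in geneAge.R. It assumes each HOG table is tab-delimited, with three leading metadata columns (`HOG`, `OG`, `Gene Tree Parent Clade`) followed by one column per species; the species names and the helper `species_in_node` are hypothetical.

```python
import csv
import io

# Columns that precede the per-species columns in a HOG table (assumed layout).
META_COLUMNS = {"HOG", "OG", "Gene Tree Parent Clade"}

def species_in_node(hog_table):
    """Return the species named in the header row of one HOG table."""
    header = next(csv.reader(hog_table, delimiter="\t"))
    return [column for column in header if column not in META_COLUMNS]

# Toy stand-in for a per-node table; species names are hypothetical.
toy_table = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\tH_vulgarisAEP\tH_oligactis\tC_hemisphaerica\n"
    "N5.HOG0000001\tOG0000001\tn12\tg1.t1\tg7.t1, g8.t1\t\n"
)
node_species = species_in_node(toy_table)
```

Only the header row is needed, so the same function could be applied to each node's table in turn to build a node-to-species lookup.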

(snippet from 06_geneAge/geneAge.R)

In the above code, we used an accessory script to fix a formatting issue in some of the Orthofinder output that caused parsing errors:

(06_geneAge/fixOTab.sh)

Once we had defined all the species included at each level of the phylogenetic hierarchy in our Orthofinder analysis, we could then isolate all orthogroups that included a gene from the AEP assembly and identify the youngest node that contained all of that orthogroup's member genes.

We did this by iteratively subsetting the genes in an orthogroup by species, starting with the species belonging to the most recent node and then moving outward. Each time we expanded the set of species, we checked whether the total number of orthologs also increased. The node of origin was identified as the last step in this outward expansion that added new orthologs. This node of origin was then assigned to all AEP genes belonging to the orthogroup in question.
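The outward expansion described above can be sketched in a few lines. This is a minimal Python illustration, not the R implementation in geneAge.R. The function name `node_of_origin`, the toy species names, and the input shapes (per-species member counts for one orthogroup, plus nodes ordered from most recent to most basal) are assumptions made for the example.

```python
def node_of_origin(orthogroup_species, nodes_young_to_old):
    """Return the last node, moving outward, whose species added new orthologs.

    orthogroup_species: dict of species -> number of orthogroup members.
    nodes_young_to_old: list of (node_name, species_set) pairs ordered from
        the most recent node to the most basal one (each set is assumed to
        contain the species of the preceding, younger node).
    """
    origin = None
    previous_count = 0
    for node_name, species in nodes_young_to_old:
        # Count orthogroup members belonging to species inside this node.
        count = sum(n for sp, n in orthogroup_species.items() if sp in species)
        if count > previous_count:
            origin = node_name
            previous_count = count
    return origin

# Toy hierarchy; node labels follow the N# convention, species are hypothetical.
nodes = [
    ("N10", {"hydraA"}),
    ("N8", {"hydraA", "hydraB"}),
    ("N3", {"hydraA", "hydraB", "coral"}),
    ("N0", {"hydraA", "hydraB", "coral", "sponge"}),
]
example_origin = node_of_origin({"hydraA": 2, "coral": 1}, nodes)
```

In this toy case the orthogroup gains a member when the coral is added at N3 and none afterwards, so N3 is the node assigned to the Hydra genes in the orthogroup.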

(snippet from 06_geneAge/geneAge.R)

Visualizing Variability in Gene Age Distribution in Different Cell Types

Once we had determined an age (i.e., node of origin) for all the AEP genes that were assigned to an orthogroup in our Orthofinder analysis, we investigated the relationship between gene age and cell-type-specific transcription. Specifically, we wanted to know whether genes of a particular age were overrepresented in certain cell types, possibly indicating a period of innovation during which new genes specific to that cell type arose. To do this, we characterized the distribution of gene ages in the transcriptomes of each cell type in our single-cell atlas.

First, we identified the genes that were expressed in each cell type by calculating the average transcriptional profile for a cell type and selecting genes that were above an arbitrary, low threshold. We then subset these cell type gene lists to only include genes that were assigned an age through our Orthofinder analysis. We also excluded genes that were ubiquitously expressed, as these would not provide any insight into the transcriptional signatures that make cell types unique. We did this by using the FindVariableFeatures function from Seurat, which is designed to find the top N most variable genes in a single-cell data set. We used the function to identify a relatively large number of variable genes (N=7500) in order to retain both highly and moderately variable genes in our downstream analysis.
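The gene selection step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Seurat-based R code in geneAge.R. The function `expressed_genes`, the toy expression values, and the threshold of 0.1 are all hypothetical, and the set of retained genes (those with an assigned age that are also variable) is assumed to have been computed beforehand.

```python
from statistics import mean

def expressed_genes(expression, cell_types, threshold, keep_genes):
    """Map each cell type to the retained genes whose mean expression exceeds threshold.

    expression: dict of gene -> list of per-cell expression values.
    cell_types: list of cell type labels, one per cell, in the same order.
    keep_genes: set of genes that have an assigned age and are variable.
    """
    result = {}
    for cell_type in sorted(set(cell_types)):
        cells = [i for i, label in enumerate(cell_types) if label == cell_type]
        result[cell_type] = {
            gene
            for gene, values in expression.items()
            if gene in keep_genes and mean(values[i] for i in cells) > threshold
        }
    return result

# Toy data: g3 is expressed everywhere and is therefore left out of keep_genes.
toy_expression = {
    "g1": [2.0, 3.0, 0.0, 0.0],
    "g2": [0.0, 0.0, 1.0, 2.0],
    "g3": [5.0, 5.0, 5.0, 5.0],
}
toy_labels = ["neuron", "neuron", "gland", "gland"]
genes_by_type = expressed_genes(toy_expression, toy_labels, 0.1, {"g1", "g2"})
```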

After identifying the genes expressed in each cell type, we calculated a frequency table of gene ages to determine the odds that a gene expressed in a certain cell type will be of a certain age. We then visualized the differences in these odds across different cell types using a heat map.
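A frequency table of this kind can be sketched as below. This is a minimal Python illustration rather than the R code in geneAge.R. The function `age_proportions` and the toy gene and node names are hypothetical, and every cell type is assumed to have at least one expressed gene.

```python
from collections import Counter

def age_proportions(genes_by_cell_type, gene_age):
    """For each cell type, return the fraction of its expressed genes at each age."""
    table = {}
    for cell_type, genes in genes_by_cell_type.items():
        counts = Counter(gene_age[gene] for gene in genes)
        total = sum(counts.values())  # assumed non-zero for every cell type
        table[cell_type] = {age: n / total for age, n in counts.items()}
    return table

# Toy input: three of the four genes expressed in neurons date to node N0.
toy_ages = {"g1": "N0", "g2": "N0", "g3": "N0", "g4": "N8"}
toy_table = age_proportions({"neuron": {"g1", "g2", "g3", "g4"}}, toy_ages)
```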

When generating the heat map, we noted that if we didn't normalize the odds for a given gene age across all the different cell types, all cell types appeared to have nearly identical distributions that heavily favored ancient genes that predate Metazoa. This likely reflects the essential and deeply conserved functions of ancient genes. It was only after we normalized the data to account for this general trend that cell-type-specific patterns emerged.
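To illustrate why normalizing within each gene age exposes differences between cell types, the sketch below rescales each age's values across cell types. The text does not state which normalization was used, so the z-score shown here (with a population standard deviation) is an assumption, and the function `scale_across_cell_types` and the toy values are hypothetical. It is written in Python rather than the R of geneAge.R.

```python
from statistics import mean, pstdev

def scale_across_cell_types(table, ages):
    """Z-score each gene age across cell types (assumed normalization).

    table: dict of cell type -> dict of age -> proportion of expressed genes.
    ages: list of age labels to scale.
    """
    cell_types = list(table)
    scaled = {cell_type: {} for cell_type in cell_types}
    for age in ages:
        values = [table[cell_type].get(age, 0.0) for cell_type in cell_types]
        centre, spread = mean(values), pstdev(values)
        for cell_type, value in zip(cell_types, values):
            # An age with identical values in every cell type is set to 0.
            scaled[cell_type][age] = (value - centre) / spread if spread else 0.0
    return scaled

# Both toy cell types are dominated by the oldest age before scaling.
raw = {
    "neuron": {"N0": 0.8, "N8": 0.2},
    "gland": {"N0": 0.9, "N8": 0.1},
}
scaled = scale_across_cell_types(raw, ["N0", "N8"])
```

Before scaling, both toy cell types look alike because the oldest age dominates. After scaling, the neuron is the cell type relatively enriched for the younger age.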

(snippet from 06_geneAge/geneAge.R)

(figure: geneAgeTimelineMatrixNonNorm)

(snippet from 06_geneAge/geneAge.R)

(figure: geneAgeTimelineMatrix)

Calculating a Holistic Score for a Cell Type's Transcriptome

As an alternative way of looking at cell-type-specific differences in gene age distribution, we calculated single-cell transcriptomic age index (TAI) scores. The TAI is a weighted average of gene age for a transcriptome, in which each gene's contribution to the average is weighted by its expression level (i.e., more highly expressed genes contribute more). The TAI metric is intended to be a holistic measure of the age of a transcriptome, with lower TAI values indicating that the transcripts in a sample skew more ancient and higher TAI values indicating that they skew younger.

In the analysis from the previous section, we described gene age using a node label (N#). Calculating TAI scores required that we convert these labels to numeric values, which we did by simply numbering the nodes 1 through 11, with 1 being the most basal node and 11 being the youngest. We then applied the TAI formula to each single-cell transcriptome. The formula, as described here, is:

$$\mathrm{TAI}_s = \frac{\sum_{i=1}^{n} ps_i \, e_i}{\sum_{i=1}^{n} e_i}$$

"...where $ps_i$ is an integer that represents the phylostratum of the gene $i$ (for example, 1, the oldest; 14, the youngest), $e_i$ is the [level of gene expression] of the gene $i$ that acts as weight factor and $n$ is the total number of genes analysed"

We then visualized our single-cell TAI scores using the atlas UMAP.
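The per-cell TAI calculation described above can be sketched as follows. This is a minimal Python illustration, not the R implementation in geneAge.R. The function `tai`, the toy gene names, and the choice to return NaN for a cell with no expression of any aged gene are assumptions made for the example.

```python
def tai(expression, phylostratum):
    """Expression-weighted mean phylostratum for one cell.

    expression: dict of gene -> expression level in this cell.
    phylostratum: dict of gene -> numeric age (1 = most basal node).
    """
    shared = [gene for gene in expression if gene in phylostratum]
    total = sum(expression[gene] for gene in shared)
    if total == 0:
        return float("nan")  # no expressed genes with an assigned age
    weighted = sum(phylostratum[gene] * expression[gene] for gene in shared)
    return weighted / total

# Toy ages using the 1 to 11 numbering of nodes.
ages = {"old_gene": 1, "young_gene": 11}
tai_mostly_old = tai({"old_gene": 9.0, "young_gene": 1.0}, ages)
tai_mostly_young = tai({"old_gene": 1.0, "young_gene": 9.0}, ages)
```

The first toy cell expresses mostly the old gene and scores (1·9 + 11·1)/10 = 2, while the second expresses mostly the young gene and scores (1·1 + 11·9)/10 = 10.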

(snippet from 06_geneAge/geneAge.R)

(figure: tai)

Finally, we grouped cells by cell type and visualized the distribution of TAI scores across the different clusters in the atlas.

(snippet from 06_geneAge/geneAge.R)

(figure: taiBox)

Files Associated with This Document