A unified analysis of atlas single-cell data

Hao Chen; Nam D. Nguyen; Matthew Ruffalo; Ziv Bar-Joseph

doi:10.1101/gr.279631.124

A unified analysis of atlas single-cell data

¹Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;
²Department of Computer Science, University of Illinois Chicago, Chicago, Illinois 60607, USA;
³Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA

Corresponding author: zivbj{at}cs.cmu.edu

Next Section

Abstract

Recent efforts to generate atlas-scale single-cell data provide opportunities for joint analysis across tissues and modalities. Existing methods use cells as the reference unit, hindering downstream gene-based analysis and removing genuine biological variation. Here we present GIANT, an integration method designed for atlas-scale gene analysis across cell types and tissues. GIANT converts data sets into gene graphs and recursively embeds genes without additional alignment. Applying GIANT to two recent atlas data sets yields unified gene-embedding spaces across human tissues and data modalities. Further evaluations demonstrate GIANT's usefulness in discovering diverse gene functions and underlying gene regulation in cells from different tissues.

Several recent efforts have attempted to generate atlas-scale single-cell data. Examples include The Human BioMolecular Atlas Program (HuBMAP) (HuBMAP Consortium 2019; Jain et al. 2023), the Human Cell Atlas (HCA) (Rozenblatt-Rosen et al. 2017), the Cellular Senescence Network (SenNet) (Lee et al. 2022), the Human Lung Cell Atlas (HLCA) (Sikkema et al. 2023), and many more. In all of these efforts, researchers aim to collect and analyze data from multiple tissues, modalities, conditions, and more. Although the initial phases of the programs often focused on the coverage of different tissues and on the separate, modality-specific analysis of each data set, we are now at a point at which, for the first time, we can perform joint cross-tissue and cross-modality analysis, on single-cell data at the atlas scale.

The joint analysis is challenging for many reasons. First, one needs to decide on a common reference unit for the analysis. Although most of the prior methods for atlas analysis focused on cells (Butler et al. 2018; Stuart et al. 2019; Hao et al. 2021; Domínguez Conde et al. 2022), such analyses often fail to combine data across modalities. In many cases, the large technical variation between data modalities can lead to inappropriate matching cells being selected between data sets, which results in ineffective integration (Haghverdi et al. 2018). In addition, cell-based analysis makes it more difficult to identify pathways and genes that are active across multiple cell types, in part because it can severely distort or lose the original gene space (Argelaguet et al. 2021). Another issue is the type of information the method can handle. Cell embedding and clustering based on gene expression similarity, which has been commonly used (Kiselev et al. 2017; Hasanaj et al. 2022; Quah and Hemberg 2022), may not be appropriate for spatial technologies, which provide information on cell relationships in space (Yuan and Bar-Joseph 2020; Li et al. 2021; Teng et al. 2022). Finally, different signal-to-noise levels across data sets (Argelaguet et al. 2021) and the large volume of data in atlas studies create considerable challenges for joint analysis.

To date, there have been some attempts at a large-scale joint analysis of data from different modalities (not just in biology). The main approach used for such integration utilizes alignment-based multiview learning (Welch et al. 2017; Butler et al. 2018; Hie et al. 2019; Nguyen and Wang 2020). The goal is to transform multiview (multi–data set) inputs, represented as a multimodal distribution, into a homogeneous unimodal distribution for downstream learning tasks. The key idea of these methods is to project the manifolds of multiple data sets into a common space such that both the locality of each data set is preserved and the distance between the same entities in different data sets (e.g., the same individuals in different social networks) is minimized. Although this may work for some biological data sets, we often observe overalignment problems with this approach (Argelaguet et al. 2021). Some other methods require paired data (multimodal measurements derived from the same set of samples) to learn the mapping between different modalities (Argelaguet et al. 2020; Ashuach et al. 2023), which are usually not available in real cases.

To improve on these issues and to enable large-scale joint analyses of atlas-level data, we developed a new method, gene-based data integration and analysis technique (GIANT), that focuses on genes rather than on cells. Genes provide a level of granularity, which, on the one hand, can serve as anchor points common to all modalities but, on the other hand, make it easy to perform higher-level analysis. We first convert cell clusters from each data set and each modality into gene graphs based on the specific characteristics of that modality. Genes are more basic building blocks for joint analysis than cells or pathways, and as we show, such graphs can be extracted from different high-throughput methods that are being profiled in large atlas studies. To integrate these graphs across modalities and tissues, we project genes from all the graphs into a latent space (embedding) using recursive projections. In recursive projection (Gopal and Yang 2013; Zitnik and Leskovec 2017), a dendrogram is used to enforce similarity constraints across graphs while still allowing genes with multiple functions to be projected to different locations in the embedding space. In other words, the recursive method can be seen as a hierarchical alignment along embedding, in which graphs nearby in the tree structure are embedded closely in the latent space.

We applied the method to the HuBMAP data from 10 different tissues, three different major modalities, and nine different single-cell platforms. We also validated it on another human development cell atlas data set (Cao et al. 2020; Domcke et al. 2020). As we show, for both data sets, despite the considerable heterogeneity, our joint processing and analysis method was able to successfully group genes based on their functions and pathways, to associate specific pathways with tissues, to discover new functions for a large number of genes, and to identify tissue-specific regulations of genes based on their cross-modality clustering.

Previous Section Next Section

Results

We developed GIANT to construct a unified embedding for genes from 10 different tissues, including the heart, kidney, large intestine, liver, lung, lymph node, pancreas, small intestine, spleen, and thymus, and three different data modalities, which include single-cell RNA-seq (scRNA-seq), single-cell ATAC-seq (scATAC-seq), and spatial transcriptomics (Slide-seq). Figure 1, A through C, presents an overview of GIANT. We first construct gene graphs for cell clusters from each tissue and data modality based on expression or epigenetic correlations (Methods). Next, all gene graphs are connected in a cell cluster dendrogram. A recursive graph-embedding method that takes into account both the dendrogram and gene identity connections is applied. The final result is an embedding (a location in a latent space) for each gene in each of the graphs. For details, see Methods.

View larger version:

Download as PowerPoint Slide

Figure 1.

Schematic illustration of the method. (A) Cell clusters are identified from each tissue of each data modality. For each scRNA-seq cell cluster, a gene coexpression matrix is constructed for all the gene pairs in the cell cluster. For each Slide-seq cell cluster, the same gene coexpression matrix is also constructed, and another spatial coexpression matrix is built for gene pairs between the cells of this cell cluster and their neighboring cells. For each scATAC-seq cell cluster, identified peaks from the bin matrix are associated with TFs and genes, which forms a gene–TF matrix. (B) We obtain gene coexpression or coregulation graphs for each of these modalities based on the specific matrices computed in A. (C) A dendrogram for graphs in B is constructed by hierarchical clustering of cell clusters, and a multilayer graph learning algorithm is applied using the tree structure. Each gene in each graph is then encoded into an embedding space, where genes with similar functions are embedded together.

GIANT integrates multimodality and multitissue data

The gene-embedding space produced by GIANT enables joint functional analyses across data modalities, tissues, and cell clusters. Therefore, the location of a gene in the space is expected to reflect the function the gene performs (which could be the same across several different cell types and tissues) instead of the specific data source that generates the value for that gene. For example, genes from different tissues can be closely embedded in space as long as they perform similar functions in these tissues. To assess the integration of various data sources, we visualized the entire embedding space and highlighted the locations of genes based on their respective data sources. Figure 2 presents the distribution of genes in the map based on their modalities (Fig. 2A), tissues (Fig. 2B), and cell clusters (Fig. 2C), respectively.

View larger version:

Download as PowerPoint Slide

Figure 2.

Distributions of data modalities, tissues, and cell clusters in the entire embedding space. (A) Distributions of data modalities in the embedding space; genes are colored by data modalities. (B) Distributions of tissues in the embedding space; genes are colored by tissues. (C) Distributions of cell clusters in the embedding space; genes are colored by cell clusters. (D) The embedding space is split into components; genes in different components are shown in different colors. (E) Cross-clustering between data modalities. (F) Cross-clustering between tissues. As can be seen, genes from several different tissues, and from multiple modalities, are clustered together, allowing for a joint analysis of this large data set.

As can be seen, the locations of genes are not determined by their data modalities or tissues, allowing us to use the embeddings to investigate shared functionalities across tissues and mechanisms being studied by each of the data modalities we used. This observation is confirmed by our quantitative evaluation. GIANT obtains a very low silhouette coefficient (0.016) for data modalities (Supplemental Note 1). We next compared GIANT's gene embedding with the embeddings of six cell-based data integration methods, which, similar to GIANT, can be used to integrate different modalities and tissues (Supplemental Note 2; Supplemental Figs. S1, S2). GIANT achieves a substantially lower silhouette coefficient of data modality compared with Harmony (Korsunsky et al. 2019), LIGER (Liu et al. 2020), Scanorama (Hie et al. 2019), scVI (Lopez et al. 2018), and GLUE (Cao and Gao 2022) and achieves a similar silhouette coefficient to that of Seurat (Supplemental Fig. S1H; Hao et al. 2021). In addition, for the scRNA-seq data used, which were obtained from five different platforms, we see very good integration by GIANT (silhouette coefficient −0.03) (Supplemental Fig. S3).

To further investigate which data sources tend to cocluster, we clustered the entire embedding space (Fig. 2D) and then looked at the data sources assigned to each of the resulting clusters (embedding components). Specifically, we applied the Leiden algorithm with a resolution of one (Traag et al. 2019) to cluster all genes in the embedded space into 172 components (Methods). We then determined the data modality or tissue for each gene in each component. Figure 2E shows the mapping from data modalities to components. Genes from Slide-seq expression and Slide-seq spatial graphs are always assigned to the same components, as these represent different measurements for the same gene. For the other modalities, we found that 14.5% (25 out of 172) of the components contain genes from at least two of the three modalities, scRNA-seq, scATAC-seq, and Slide-seq. To obtain baselines for demonstrating the integration of different modalities, we applied node2vec (Grover and Leskovec 2016) and Gene2vec (Supplemental Note 3; Du et al. 2019) on the same gene graphs used as input for GIANT and generated gene embeddings. We performed Leiden clustering on their embeddings, using a resolution parameter of one (same as we used for GIANT), to identify embedding components and assess how many contained genes from at least two modalities. In both node2vec and Gene2vec embeddings, the majority of components consisted of gene nodes from a single graph (Supplemental Fig. S4). Specifically, only 0.18% (one out of 554) of node2vec components and 0.91% (five out of 548) of Gene2vec components included genes from at least two of the three modalities. The results indicate that GIANT's embedding algorithm with the hierarchical alignment leads to better integration of data from different modalities compared with the baseline methods. Figure 2F shows the mapping from tissues to components. Genes from the heart (blue) and lung (orange), both closely related to the circulatory and the respiratory systems (Fessler 1997), frequently appear in the same components. Tissues from the immune system, including the lymph node (purple), thymus (pink), and spleen (brown) (Felten and Felten 1994), are also frequently coclustering. To quantify the clustering of functionally related tissues in our gene embeddings, we computed the tissue mean average precision (MAP) score (Cao and Gao 2022). In this analysis, the heart and lung were considered as one class and the lymph node, thymus, and spleen as another class, and each of the other tissues was treated as a separate class. MAP calculates for each gene the average precision within its K ordered nearest neighbors (Supplemental Note 4). We obtained a tissue average MAP score of 0.71 for GIANT's embeddings. The tissue average MAP scores derived from the other embeddings (Harmony, LIGER, Scanorama, scVI, Seurat, and GLUE) are much lower than that of GIANT (0.59 for Scanorama, which is the best among compared methods) (Supplemental Fig. S5). These results demonstrate that GIANT effectively integrated data across tissues and cell clusters and is able to group genes based on shared functions. However, gene nodes from graphs of different modalities (e.g., scATAC-seq, scRNA-seq) are not always merged in this data set. This is likely because of incomplete coverage of tissues for different modalities in the HuBMAP data set, leading to unmatched cell clusters across modalities. Additionally, biological variations in cell clusters collected in different modalities of the same tissues further contribute to this limitation (for details, see Discussion).

To further test the ability of GIANT on integrating data from different modalities, we collected single-cell mouse brain data from the Allen Brain Cell Atlas (Yao et al. 2023), which contains 2,341,350 cells from scRNA-seq and 3,938,808 cells from MERFISH spatial transcriptomics (Chen et al. 2015). Both modalities were derived from entire adult mouse brains, encompassing the same tissue regions and cell populations, serving as a good data set for testing the integration of different data modalities. In each modality, cells were grouped into 34 clusters, with each cell cluster annotated as a cell type. We built a gene coexpression graph for each scRNA-seq cell cluster and MERFISH cell cluster. We then ran GIANT to embed genes from the gene graphs (Supplemental Note 5). The embeddings demonstrated clear preservation of cell types, with genes from the same cell types clustered and genes from different modalities well mixed in the embedding space (Supplemental Figs. S6, S7). To compare GIANT with other methods, we again ran the cell embedding methods (Harmony, LIGER, scVI, Scanorama, Seurat, and GLUE) on this data set. We assessed cell-type preservation using cell-type average MAP scores (Supplemental Note 4) and modality mixing using the local inverse Simpson's index (LISI) (Supplemental Note 6; Korsunsky et al. 2019). LISI measures the diversity of neighboring nodes, with a value of two indicating perfect mixing for two modalities. GIANT achieves a cell-type average MAP score of 0.999, outperforming all other methods (Harmony: 0.913; LIGER: 0.864; scVI: 0.962; Scanorama: 0.935; Seurat: 0.542; and GLUE: 0.932). GIANT also obtains a LISI score of 1.91 for the mixing results, closer to the ideal value of two than those of competing methods (Harmony: 1.60; LIGER: 1.19; scVI: 1.38; Scanorama: 1.52; Seurat: 1.32; and GLUE: 1.63).

Genes closely embedded in the space show tissue-specific GO enrichment

We next tested the functional enrichment of nearby genes in the embedding space (Methods). Because of the large number cells of different tissues we analyzed, such analysis allowed us to identify functional groups spanning a large portion of the Gene Ontology (GO) categories in the embedding components. Specifically, this analysis identified 272,848 enriched biological process (BP) GO terms, 44,952 enriched cellular component (CC) GO terms, and 56,199 enriched molecular function (MF) GO terms (adjusted P ≤ 0.01) in the 1199 embedding components identified by the Leiden algorithm with a resolution of 50 (Supplemental Table S1).

We next analyzed the GO terms discovered in the embedding components associated with different tissues. Specifically, for each component, we looked at the top three tissues for genes assigned to that component. Next, we selected the top 5% enriched BP GO terms with the largest fold changes (ratio of the observed proportion of genes associated with a GO term in the embedding component to the expected proportion of genes associated with that GO term in the population) and adjusted P ≤ 1 × 10⁻⁵, and associated these GO terms with each of the top three tissues. To summarize the enriched GO terms for each tissue, we propagated these GO terms to higher levels along the GO hierarchies. The top GO categories enriched in each tissue are shown in Figure 3A, in which the size of each dot indicates the ratio of child terms of the noted GO category in all the enriched terms of the tissue, and the color of each dot indicates the fold change of the ratio over the average value of the ratios across tissues.

View larger version:

Download as PowerPoint Slide

Figure 3.

GO terms enriched in different embedding components. (A) The top categories of GO terms enriched in the embedding components associated with different tissues. The size of each dot indicates the ratio of child terms of the GO category in all the enriched terms of the tissue. The color of each dot indicates the fold change of the ratio over the average value of ratios in the column. (B) GO enrichment in different embedding components. Components with the GO term enriched (adjusted P < 0.01) are colored by the enrichment levels of the GO term, and the other components are shown in gray. Genes from tissues related to the GO term are highlighted by dark gray, and genes in the other tissues are colored by light gray.

We found that most of the top GO categories for each tissue (Fig. 3A, dark red dots) are strongly consistent with the known cellular functions of that tissue. For example, muscle-related GO categories (e.g., GO:0003012 muscle system process, GO:0006936 muscle contraction) are prominent for the heart, lung, and large intestine, which contain abundant cells of cardiac muscles, respiratory muscles, and smooth muscles (Severs 2000; Ratnovsky et al. 2008; Sanders et al. 2012). For immune tissues (lymph node, thymus, and spleen), immunity-related GO categories dominated the assignments, including “GO: 0002376 immune system process” and “GO:0006955 immune response.” Lipid metabolism-related GO categories (e.g., GO:0010876 lipid localization) are prominently discovered for the liver (Nguyen et al. 2008). In addition, metabolic process–related GO categories are significantly found for the pancreas, which could be explained by its key metabolic functions in the regulation of blood glucose levels (Röder et al. 2016). Examples of some tissue-specific GO terms are shown in Figure 3B. Each of them is enriched in different components in the embedding space (reddish regions), whereas they are more likely found in the components in which the related tissues reside (dark gray regions). In addition, we identified a list of GO terms enriched in tissue-specific embedding components, which are not directly related to the primary functions of the tissues but are likely associated with specific cell populations supporting the diverse activities of the tissues. These examples are listed in Supplemental Table S2. For example, “GO:0031344 regulation of cell projection organization” was found in a large intestine–specific embedding component. Although cell projection is commonly associated with neuronal cell types, this enrichment may relate to the formation of microvilli in absorptive cells of the large intestine (Walton et al. 2016). In another example, “GO:0019226 transmission of nerve impulse” was enriched in a kidney-specific embedding component. Although the kidney is not primarily a nerve-driven organ like muscle tissues or neurons, it does receive regulation from nervous systems that help control its functions (Neyra et al. 2021).

To confirm the GO findings, we further performed the same enrichment analysis using reactome pathways (Supplemental Note 7; Gillespie et al. 2022). Similar to what we observed for GO, pathways such as “R-HSA-1280218 adaptive immune system” in the lymph node, “R-HSA-1280215 cytokine signaling in immune system” in the thymus and spleen, and “R-HSA-390522 striated muscle contraction” in the heart were identified. In addition, we also observed “R-HSA-5663205 infectious disease” for both the thymus and lymph node (Supplemental Fig. S8).

We again compared GIANT with the six aforementioned cell-based embedding methods. The WGCNA module detection method (Langfelder and Horvath 2008) was applied in each cell cluster identified from each cell-based embedding (Supplemental Note 2). We found that compared with GIANT, there were much fewer unique GO terms enriched in the gene modules discovered from cell-based embeddings (Supplemental Fig. S1H). We also compared GIANT to a general method for clustering genes using all data. For this, we performed hierarchical clustering on the concatenated data matrix from different data modalities for each tissue (Supplemental Note 8). We found much fewer enriched unique GO terms in the identified gene clusters (2467 vs. 6972 of GIANT, P-value < 0.05). In addition, we observed fewer tissue-specific GO categories enriched in gene clusters associated with different tissues (Supplemental Fig. S9). Finally, we also compared GIANT with Limma (Ritchie et al. 2015), which is usually applied to correct for batch effects and so can be used for data integration. We performed the same cell clustering and gene clustering methods as used for GIANT to identify gene modules from the Limma-corrected data matrix (Supplemental Note 9). Again, there are much fewer enriched unique GO terms in the identified gene modules (3939 vs. 6972 of GIANT, P-value < 0.05). Additionally, there are fewer tissue-specific GO categories or GO categories not typically associated with tissue-specific functions found in gene modules associated with different tissues (Supplemental Fig. S10).

Testing GIANT integration on a developmental atlas data set

To further test the ability of GIANT to integrate data from different modalities, we used a human fetal cell atlas data set (Cao et al. 2020; Domcke et al. 2020). This data set contains single-cell data from 15 tissues and two data modalities, including the adrenal, cerebellum, cerebrum, eye, heart, intestine, kidney, liver, lung, muscle, pancreas, placenta, spleen, stomach, and thymus of scRNA-seq and scATAC-seq. Unlike the HuBMAP data, the fetal atlas provides cell-type annotations for cell clusters, allowing us to validate our embeddings at a more refined level. Results are presented in Figure 4, A and B (Supplemental Note 10). As the figure shows, GIANT is able to merge genes from different modalities and tissues. Specifically, several components in the embedding space are dominated by one cell type (Fig. 4C), although genes in these components are often from several tissues and both modalities. To quantify the conservation of cell-type information in our gene embeddings, we computed the cell-type average MAP score (Supplemental Note 4). We obtained a MAP score of 0.72 for GIANT's embeddings. For comparative analysis, we again applied the aforementioned cell-based embedding methods (i.e., Harmony, LIGER, Scanorama, scVI, Seurat, and GLUE) to this data set (Supplemental Figs. S11–S13). The cell-type average MAP scores derived from the embedding of these methods (0.39 of scVI, which is the best among compared methods) are much lower than that of GIANT (Supplemental Fig. S13G). Analysis using only scRNA-seq data from different tissues further supports our findings (Supplemental Note 10). GIANT achieved the highest cell-type MAP score of 0.73, outperforming all other methods (Harmony: 0.63; LIGER: 0.49; scVI: 0.65; Scanorama: 0.65; Seurat: 0.31, and GLUE: 0.62), demonstrating better preservation of cell-type information (Supplemental Fig. S14).

View larger version:

Download as PowerPoint Slide

Figure 4.

Application of GIANT on the human fetal atlas. (A) Distributions of data modalities in the embedding space. (B) Distribution of tissues in the embedding space. (C) Distribution of cell types in the embedding space. Each point in the plots represents a gene that is colored by its data source. (D) Top GO categories for different tissues. The size of each dot indicates the ratio of child terms of the GO category in all the enriched terms of the tissue. The color of each dot indicates the fold change of the ratio over the average value of ratios in the column. Shown cell types are those with at least 50 GO terms. (E) The enrichment level of different developmental GO terms in the embedding components associated with different cell types shown in D. Colors show the negative logarithm of the corrected P-values.

We also performed the same GO enrichment analysis in the embedding components, as done for Figure 3A, this time using cell types instead of tissues. Our analysis revealed a strong consistency between the enriched GO terms found within the embedding components of each cell type and the well-established cellular functions associated with those specific cell types (Fig. 4D; for the P-value for each GO term, see Supplemental Table S3). For example, signal transduction–related GO categories (e.g., GO:0007267 cell–cell signaling, GO:0009966 regulation of signal transduction) are mainly assigned to nerve cells, such as inhibitory interneurons (Roux and Buzsáki 2015), and enteric nervous system (ENS) neurons (Furness 2012), whereas immune-related GO categories (e.g., GO:0002682 regulation of immune system process, GO:0046649 lymphocyte activation) are mostly associated with immune cells, such as myeloid cells (Gupta et al. 2014), lymphoid cells (Blom and Spits 2006), and microglia (Graeber and Streit 2010). We also found many development-related GO terms enriched in embedding components for this fetal atlas data set. A closer examination revealed that the cell-type-specific developmental pathways are highly enriched in the embedding components associated with the corresponding cell types (Fig. 4E; for gene ratios and fold changes of these GO terms, see Supplemental Fig. S15), such as “neuron projection development” (GO:0031175) for inhibitory interneurons, “hematopoietic or lymphoid organ development” (GO:0048534) for lymphoid cells, and “erythrocyte development” (GO:0048821) for erythroblasts.

In addition, we found that 88.5% (1061 out of 1199) of the HuBMAP embedding gene components are enriched in the embedding components of the human fetal data set (Supplemental Fig. S16), indicating the ability of large-scale atlas data sets to reveal most of the significant biological processes.

To further explore the different gene–gene relationships profiled in the HuBMAP (adult human) and human fetal atlases, for each tissue that is common between the two atlases, we identified unique gene modules in the gene-embedding components of each atlas (Supplemental Note 11). Notably, we observed enrichment of genes associated with the function “regulation of cell differentiation (GO:0045595)” (P-value < 0.05) in fetal-specific gene modules across six out of seven tissues, compared with two tissues in adult-specific gene modules (Supplemental Fig. S17). In addition, genes associated with the function of “regulation of cell cycle (GO:0051726)” were also enriched in six out of seven tissues for fetal-specific gene modules but were only enriched in a single tissue for adult-specific gene modules. This suggests that, as expected, fetal cells are more proliferative compared with adult cells (Bunis et al. 2021). Conversely, GO terms “metabolic process (GO:0008152)” and “biosynthetic process (GO:0009058)” usually have higher enrichment levels in the adult-specific gene modules compared with fetal-specific gene modules across all tissues (Supplemental Fig. S17), suggesting varied metabolism of some substances at different developmental stages (Piquereau and Ventura-Clapier 2018). These findings imply distinct gene regulations depicted in the two atlases, which are captured by our gene embeddings.

Inferring gene functions from multimodality multitissue embeddings

The precise function of a specific protein (and the genes that code for it) may differ between tissues and cell types (Yeger-Lotem and Sharan 2015). Given the observations above that GIANT can group functionally similar genes together across data sources, we used the embeddings to infer novel gene functions for all the genes we embedded. We first clustered each gene's embeddings and then inferred the gene's functions in each cluster based on other coclustered genes with known functions in the graphs of this cluster (Methods).

We tested the coverage and precision of the functions that we can assign using GO. We found that 47.4% of all known GO functions for the genes are identified in at least one of their clusters, which means that our predictions cover roughly half of all known functions for these genes (recall). This number is much higher than the expected number given the number of GO assignments (compared with 6.9% expected for random assignments of 671,215 GO terms for the 1822 genes analyzed here). On the other hand, the ratio of known functions in our GO assignments for genes (precision) is 25.1%, which is also much higher compared with the ratio of known functions in the whole prediction space (1.5%) (Fig. 5A). Precision and recall of our predictions considering only child terms of other GO categories are shown in Figure 5A. We note that the data sets we used were healthy controls. Thus, they did not contain information about stress response, diseases, etc., which explains why several GO functions were not recovered for many genes. In addition, our gene function assignments rely on GO enrichment in neighboring genes within the embedding space. GO terms with very few genes do not reach the significance threshold to be called in such setups. For the precision, we considered the broadest predictions with a P-value threshold of 0.05, encompassing a substantial portion of novel predictions.

View larger version:

Download as PowerPoint Slide

Figure 5.

Inference of gene functions in cell clusters. (A) Blue circles show the ratios of known gene functions that are covered by our predictions (recall), and their outside circles indicate the ratios of our predictions in the whole prediction space (all the possible GO assignments). Green circles show the ratios of known gene functions in our predictions (precision), and their outside circles indicate the ratios of known functions in the whole prediction space. Titles indicate child GO terms of which categories are considered. Besides the whole GO space and three main GO branches, some representative GO categories in Figure 3A are also shown here. (B) The number of bimodal genes and the number of bimodal genes for which one or both functions are confirmed. (C) Inferred functions of two genes in different cell clusters.

Given this large coverage and precision, we next investigated genes that are assigned very different functions in at least two different clusters, which are termed as “bimodal genes” here. In particular, genes are selected as bimodal that met the following criteria: Among all the gene's clusters with at least five assigned GO terms, which are BP terms associated with no more than 200 genes to focus on specific GO terms that may show tissue specificity, at least two clusters do not have any common GO terms assigned. If a gene has more than two such clusters, the two with the most GO terms assigned are considered in this analysis. As a result, we identified 1063 bimodal genes, and 487 (45.8%) among them have known functions assigned in at least one cluster. For 121 (11.4%) of them, both clusters have functions that are already known even though they can be very different (Fig. 5B).

Figure 5C presents examples of bimodal genes with different known functions in different clusters. MEF2C is essential in B lymphopoiesis, as well as for B cell survival and proliferation in response to stimulation (Khiem et al. 2008; Wilker et al. 2008). On the other hand, it is known to be involved in vascular development and controls cardiac morphogenesis (Lin et al. 1997, 1998). Our method correctly assigns MEF2C the functions of B cell activation, lymphocyte activation, and regulation of immune response in the group of lymph node and thymus cell clusters, but it assigns this gene the functions of positive regulation of developmental process, as well as tube morphogenesis in a group of heart and lung cell clusters. MEF2C is also known for its important role in neuronal cell types (Li et al. 2008). Even though the HuBMAP data set does not contain brain tissue, which impacts the ability of the method to assign neurological functions to the gene, we did find a cluster of MEF2C that is associated with the function “GO:0010975 regulation of neuron projection development” in cell clusters of the large intestine. It was previously reported (Li et al. 2008) that MEF2C plays a role in the development of neuronal cell types in the ENS (located in the gut wall). However, further experimental validation is required to confirm this prediction. Similarly, RYR2 has been found to play an important role in triggering cardiac muscle contraction and to be related to cardiac arrhythmia (Bround et al. 2012), whereas it also mediates the flow of calcium ions across cell membranes in the large intestine (O'Donnell et al. 2019). Again, RYR2 is assigned the functions of regulation of heart rate, muscle contraction, and regulation of blood circulation in the group containing a heart cell cluster, whereas in the large intestine cell cluster, it is assigned to the ion transmembrane transport function. A complete list of predicted bimodal genes and their predicted functions can be found in Supplemental Table S4.

Gene embeddings enable the discovery of tissue-specific gene regulations

Identifying the regulatory programs of cells helps elucidate their diverse gene expression patterns. We further demonstrate the ability of our gene embeddings to discover gene regulations in cell clusters of different tissues. Similar to the previous subsection, we calculated the enrichment of genes regulated by the same transcription factor (i.e., TF regulon) in each embedding component (Methods). Specifically, this analysis identified 5815 enriched TF regulons (adjusted P ≤ 0.001) in the 1199 components. To compare the TFs identified in cells of different tissues, we associated discovered TFs with tissues using the top three tissues for each component as discussed above. TF overlaps among different subsets of tissues are shown in Figure 6A. Although large numbers of TFs (104 out of 435, 23.9%) were identified as significant for all the tissues, most TFs show specificity for different tissues. These observations are consistent with previous findings in the literature (Sonawane et al. 2017; Steuernagel et al. 2019). The full list of TFs identified in different tissues can be found in Supplemental Table S5.

View larger version:

Download as PowerPoint Slide

Figure 6.

TFs discovered in different tissues. (A) TF intersections among different tissues. Bars on the left show the number of TFs identified in each tissue, and each bar on the top shows the number of TFs shared among tissues marked by the dots below it. Tissues with fewer than 20 TFs identified are omitted in this figure. The figure was drawn using UpSetR (Conway et al. 2017). (B) The percentage of tissue-specific TF markers in the corresponding tissue (blue) compared with their percentage in the other tissues (light yellow). Significant enrichment in blue groups: (*) P ≤ 0.05, (**) P ≤ 0.01; P-values are evaluated using one-sided Fisher's exact tests.

To further validate our discoveries, we examine whether TFs found in each tissue are enriched with TFs known to be specific to this tissue. For this, we used a prior list of tissue marker TFs (Xu et al. 2022). We find enrichment for tissue markers in all tissues and significant enrichment for three of the eight tissues (Fig. 6B).

Previous Section Next Section

Discussion

Recent efforts to generate atlas-scale single-cell data provide opportunities for joint analysis across tissues and modalities, which can enable new insights into cell functional plasticity, gene function, cell regulation, and disease. A key challenge for such analysis is how to effectively integrate data from different sources, such as tissues, modalities, and conditions, while enabling the discovery of system-level organizations and regulations.

We developed a new method that is the first to fully integrate atlas-level data from multiple tissues and modalities in a single unified embedding. For this, we use genes as the anchor points common to all data sets, modalities, and tissues. Our method first converts cell clusters in different tissues from different data modalities to gene graphs and then recursively embeds these graphs such that connected genes within a single graph are encouraged to have similar embeddings, whereas genes from graphs generated from similar cells are projected closely in the embedding space. Finally, a unified embedding space is produced in which the locations of genes reflect the functions they perform in the cells.

We applied our method to HuBMAP database with 10 different tissues, three different major modalities, and nine different single-cell platforms. As we have shown, despite the considerable heterogeneity of the input data, our method was able to successfully group genes based on their function and pathways. We further used the method to assign new functions to previously characterized genes and to explore tissue similarities and regulations. We compared GIANT with six cell-based embedding methods. Note that these cell-based methods are not expected to focus on genes, and these comparisons are mainly to emphasize the advantage of gene-based analysis.

An interesting finding is the number of genes that have at least two sets of functions (termed “bimodal genes”). Even with only 10 tissues, we identified more than 1000 genes with two or more diverse functions depending on the tissues. Although several studies have pointed out genes with multiple functions (Pritykin et al. 2015) mostly in different cell types or tissues (Khiem et al. 2008; Wilker et al. 2008), these prior findings were anecdotal and focused on very few genes. By analyzing much larger atlas data, we were able to determine that such bimodal genes are actually quite common. It is likely that the number of genes with multiple functions is actually much larger given the large number of tissues that were not included in our study. These findings indicate that even though the majority of genes have at least one associated function in large annotation databases (e.g., GO), there is still much work needed to fully characterize the functions of human genes.

For HuBMAP, we have mainly focused on tissue-level analysis as the first step for atlas-level integration. Although we used cell clustering (which likely represents cell types) as input, these clusters were not assigned to specific cell types, and so, our method can currently only provide information on tissue-level functions and regulations. However, as we showed with the fetal atlas data set, GIANT can be easily applied to integrate atlas data at the cell-type level as well.

Although GIANT relies on cell-based clustering for the initial processing (creating data set–specific graphs), it integrates data in the gene space rather than at the cell level. This provides several advantages as we discussed above and leads to better results as we showed. Although we used cell-type information for validation, the method does not assume such information is provided. It is therefore robust to different cell cluster assignments between modalities. The method performs clustering of cells within each modality, and so, the main requirement is that cells from the same type are similar in each of the different modalities, which is usually the underlying assumption of these technologies.

The rapid expansion of single-cell technologies has introduced various sequencing platforms that produce data with different qualities. For example, in contrast to transcript end sequencing technologies, full-length scRNA-seq covers the entire sequence of RNA molecules, enabling the detection of splice isoforms and genetic variants (Hagemann-Jensen et al. 2020; Hahaut et al. 2022). Applications of GIANT on such data, which offers richer information, could be tested in the future to determine if it yields novel biological insights.

GIANT constructs gene–TF graphs for scATAC-seq cell clusters, which connect genes to TFs who are known to bind to their promoter regions while the binding sites are open. This reduces the scATAC-seq data to the gene space, which, on the one hand, leads to information loss but, on the other, enables easier integration of data from different modalities.

GIANT successfully integrates data from three different modalities (the silhouette coefficient for data modality is 0.016, indicating very little impact of data modalities) (Supplemental Fig. S1H). This can also be observed when visualizing the gene embeddings of the kidney from both scRNA-seq and Slide-seq (Supplemental Fig. S18), in which Slide-seq data are distributed in the embedding among several different scRNA-seq gene clusters. However, gene nodes from individual modality graphs (scATAC-seq, scRNA-seq, etc.) are usually not merged because they are embedded together as part of their input graph. This is likely because graphs for different modalities have very different structures that capture gene functions from different perspectives (coexpression for scRNA-seq, spatial coexpression for Slide-seq, and c-regulation for scATAC-seq). In addition, for the HuBMAP data set, we do not have all modalities for all tissues. For example, the lymph node and thymus have only scRNA-seq data, but for the pancreas and small intestine, we only have scATAC-seq data. Slide-seq is only available for kidney data. Combined, this leads to a number of cell clusters from a modality that do not have a corresponding cluster in another modality, which limits the overall integration. Moreover, we observed that within each tissue, cell clusters between different modalities generally exhibit less overlap in their marker genes (determined by differential expression as described in Methods) compared with those from the same modality. Consistently, GO terms enriched in embedding components associated with cell clusters from different modalities show lower overlap compared with those from the same modality (Supplemental Fig. S19). The results suggest the existence of variations in cell states for data from different modalities even when they come from the same tissues. Still, the ability to integrate gene modules (or graphs in our representation) of GIANT is important and, as we show, can lead to relevant and novel biological findings. For example, for the gene function prediction task, among the 664 genes that appear in at least two of the three data modalities, 332 (50.0%) genes have predicted functions arising out of neighbors from more than one modality. Among the 1924 genes that appear in at least two tissues, 1606 (83.5%) genes have predicted functions arising out of neighbors from more than one tissue (Supplemental Fig. S20). We also note that GIANT can enforce a higher level of modality integration by increasing the regulation strength (λ); however, in such cases we observed lower-quality gene embeddings (Supplemental Fig. S21). Similarly, we found that cell-based embeddings, such as Seurat and GLUE, achieve higher levels of modality integration, but the resulting gene modules are of lower quality (Supplemental Fig. S1H).

GIANT uses a multistep approach to overcome batch effects. Specifically, rather than embedding the specific expression levels, which are batch sensitive, GIANT embeds the gene interaction graphs. These are more robust because they do not depend on specific values and combine information from multiple cells (Stuart et al. 2003). Next, GIANT projects genes from each gene graph into an embedding space, in which the positions of genes are meant to reflect their connections within the gene graph and therefore their functions. The embeddings are constrained by relationships between graphs from different data sets in the dendrogram, which is a regularizer that also reduces batch effects. This approach can be seen as a hierarchical alignment along the embedding.

There are a number of other methods that can also generate gene embeddings from single-cell data. These include GLUE (Cao and Gao 2022), DeepMAPS (Ma et al. 2023), SIMBA (Chen et al. 2024), and GSDensity (Liang et al. 2023). For example, GSDensity can associate genes with cells based on their distances within the same embedding space and therefore enables the analysis of pathway activities in cells. However, unlike GIANT, these methods only provide a single embedding for each gene, limiting their ability to capture the rewiring of gene networks under different conditions and cell types. In contrast, GIANT generates a unique embedding for each gene in each cell cluster, allowing users to study multiple functions of the same gene in different tissues or cell types, as demonstrated in our study.

GIANT generates gene embeddings for cell clusters from different tissues measured under various modalities within a unified space, providing a foundation for gene- or pathway-centric analyses in multimodal, multitissue, single-cell data. GIANT is not designed for cell embeddings. The embedding is encouraged to place gene nodes representing the same gene together. This cannot be directly applied to cells because there is no concept of the same cell in different data sets. Given this, GIANT cannot be applied to cell-level analysis. As discussed in this study, the GIANT embedding allows users to explore the diverse functional roles of genes across different conditions, cell types, and tissues. We demonstrated how GIANT embeddings captured developmental pathway differences in tissue development. Other potential usages are comparisons between healthy and disease tissues, using the GIANT embeddings generated for different cell types and cellular states, and multimodal comparisons between large-scale single-cell data sets across different species and groups.

Our embedding method is implemented in Python and is available along with the generated gene embeddings at GitHub (https://github.com/chenhcs/GIANT). We hope that as the analysis of atlas-level data becomes more common, our method would be able to be applied to additional atlases and data sets to integrate large-scale single-cell data.

Previous Section Next Section

Methods

Data sets

We integrated single-cell data from more than 60 individuals profiling 10 different tissues and three major data modalities from the HuBMAP consortium (HuBMAP Consortium 2019).

scRNA-seq data sets

scRNA-seq data sets were obtained for eight tissues: heart, kidney, large intestine, liver, lung, lymph node, spleen, and thymus. We obtained 14 data sets and a total of 30,241 cells for the heart, 15 data sets and 109,055 cells for the kidney, six data sets and 13,128 cells for the large intestine, five data sets and 7916 cells for the liver, nine data sets and 68,390 cells for the lung, four data sets and 40,449 cells for the lymph node, 10 data sets and 93,737 cells for the spleen, and three data sets and 10,838 cells for the thymus. These data were obtained from five platforms (Supplemental Table S6) and were uniformly processed. The uniform processing used the Salmon's Alevin method (Patro et al. 2017) for sequencing read mapping and abundance quantification, leading to a cell-by-gene count matrix for each data set.

scATAC-seq data sets

scATAC-seq data sets were obtained for eight tissues, including the heart, kidney, large intestine, liver, lung, pancreas, small intestine, and spleen. There are 14 data sets and in total 312,579 cells for the heart, 19 data sets and 272,526 cells for the kidney, six data sets and 8368 cells for the large intestine, two data sets and 50,748 cells for the liver, nine data sets and 205,048 cells for the lung, three data sets and 68,294 cells for the pancreas, three data sets and 7774 cells for the small intestine, and four data sets and 16,260 cells for the spleen. Again, data from three platforms (Supplemental Table S6) were combined using a uniform processing pipeline. For this, we used SnapTools (Fang et al. 2021) for read mapping and abundance quantification. SnapTools segments the genome into uniform-sized bins of 5000 bp, and a cell-by-bin matrix is created for each data set with each element indicating the count of sequencing fragments overlapping with a given bin in a certain cell.

Slide-seq data sets

Slide-seq (Rodriques et al. 2019), a spatial transcriptomics method, was used to profile kidney samples. There are in total 551,342 cells (DNA-barcoded beads) included in 19 Slide-seq data sets. Slide-seq generates short-read sequencing data that are equivalent to scRNA-seq data and were processed in the same way we processed all scRNA-seq data. In addition, the spatial location associated with each bead is provided.

The HuBMAP data from each modality (even from different platforms) were processed using the same pipeline, and the reads of different data sets were aligned to the same reference genome. It is highly recommended that other large-scale atlas analyses follow a similar procedure. Accession numbers for the HuBMAP data sets used in this study are listed in Supplemental Table S6. Data can be obtained from the HuBMAP data portal (https://portal.hubmapconsortium.org) via accession numbers.

The scRNA-seq data and the scATAC-seq data of the human fetal atlas were obtained from the Descartes interactive website (Supplemental Note 10; https://descartes.brotmanbaty.org/).

The scRNA-seq data and the MERFISH data of the Allen Brain Cell Atlas were obtained from the Allen Brain Map portal (Supplemental Note 5; https://portal.brain-map.org/atlases-and-data/bkp/abc-atlas).

Clustering cells with the HuBMAP modality-specific pipelines

For each tissue in each data modality, cells are merged by joining the data matrices of different data sets along the cell axis. The HuBMAP modality-specific data analysis pipelines, described below, are then used to cluster cells in each tissue of each data modality.

The HuBMAP scRNA-seq pipeline uses the SCANPY (Wolf et al. 2018) package to perform cell clustering. SCANPY first filters cells that express fewer than 200 genes and genes that are expressed in less than three cells. Each cell is then normalized by its total counts over all genes. The produced count matrix is log-transformed and scaled so that each gene has unit variance and zero mean, after which it is used as input to dimensionality reduction via PCA, computation of a k-nearest-neighbor (k-NN) graph in PCA space, and finally clustering with the Leiden algorithm (Traag et al. 2019). We further identify differentially expressed genes for each cell cluster against all the other clusters in the tissue using the Wilcoxon rank-sum test, allowing for estimating similarities between cell clusters.

The scATAC-seq processing pipeline uses the SnapATAC package (Fang et al. 2021) for data processing, which starts from each cell-by-bin count matrix. High-quality cells are identified for subsequent analysis based on three criteria: (1) total number of unique fragment count (>2000); (2) UMI count (>1000); and (3) mitochondrial read ratio (<50%). Bins are then filtered out if they overlap with the ENCODE blacklist (downloaded from https://mitra.stanford.edu/kundaje/akundaje/release/blacklists/), if they are in mitochondrial chromosomes, or if they belong to the top 5% invariant bins ranked by the coverage. From the filtered cell-by-bin matrix, we create a gene activity matrix. Gene coordinates are extracted and extended to include the 2000 bp upstream region (Hao et al. 2021). Gene activities are then computed based on the number of fragments that map to each of these regions, using the “calGmatFromMat” function in the SnapATAC package, and saved as a cell-by-gene matrix. Cell clusters are identified from the cell-by-gene matrix by using it as the input of the same scRNA-seq cell clustering workflow with the steps of normalization and rescaling, dimensionality reduction, k-NN graph computation, and clustering with the Leiden algorithm. After cell clustering, MACS2 (Feng et al. 2012) is used to call accessibility peaks over all cells for each cluster with at least 100 cells. Genes that are differentially active for each cell cluster against all the other clusters in the tissue are again calculated using the Wilcoxon rank-sum test from the gene activity matrix.

Because Slide-seq yields gene count matrices that are equivalent to those of scRNA-seq, the same cell clustering pipeline that was used for scRNA-seq is applied to Slide-seq data. The spatial information of cells is used to construct the gene spatial graphs as discussed below.

Constructing gene graphs for cell clusters

Gene graphs are constructed for cell clusters in all tissues and modalities. To focus on cell-type-specific genes, we filter genes based on their variance in the data sets. Specifically, the union of the 500 most variable genes from each tissue are selected, using the SCANPY package with the criteria defined in Seurat (Satija et al. 2015). The variable genes are identified from gene expression matrices for scRNA-seq and Slide-seq and from gene activity matrices for scATAC-seq. The union of these sets results in 7000 genes that are selected and used for building gene graphs and further analyses.

An unweighted undirected k-NN gene graph is computed for each scRNA-seq cell cluster from its normalized and log-transformed expression matrix. For each pair of genes, the Pearson's correlation of their expression values across cells in the data sets is computed. Each gene is then connected to the top 10 other genes with the strongest correlation.

A gene–transcription factor (TF) hypergraph is constructed for each scATAC-seq cell cluster by associating the peaks identified in each cell cluster with TFs and target genes. We identify peaks that contain TF motifs using the motifmatchr package, which provides information on 633 human TF binding motifs collected from the JASPAR 2020 (Fornes et al. 2020) database. A binary peaks-by-TF matrix is generated, in which an element “1” indicates that the TF's motifs are found in the region covered by the corresponding peak. A complimentary gene-by-peak matrix is generated, in which elements are set to “1” if the corresponding peaks are within the promoter regions of the corresponding genes. Here the promoter region is defined as the 500 bp upstream region of the transcription start site, which considers core and proximal promoters, as was previously done by Aibar et al. (2017). The dot product of the two aforementioned matrices results in a gene-by-TF matrix. We then construct a hypergraph by treating the gene-by-TF matrix as the incidence matrix, in which each column that corresponds to a TF is treated as a hyperedge, and genes with elements of “1” in this column are nodes connected by this hyperedge.

The expression matrix of Slide-seq is equivalent to that of scRNA-seq data. Therefore, for each Slide-seq cell cluster, we construct an unweighted undirected coexpression k-NN graph using the same method used for scRNA-seq data. In addition, we construct another spatial coexpression graph for each bead cluster that links coexpressed genes in neighboring beads. For this, we identify the neighbors of each bead using the Delaunay triangulation of the bead's spatial coordinates without using a distance threshold, which avoids the need to specify the parameters from users and is commonly used in other studies (Dries et al. 2021; Zhang et al. 2022). Then for each cluster, we compute the correlation of each pair of genes, in which one side is the expression of a gene in this cluster and another side is the mean expression of a gene in the neighboring clusters. Again, each gene is connected to the top 10 most coexpressed genes, and the graph is converted to a hypergraph in which each neighboring-cell gene is treated as a hyperedge.

Finally, we constructed 701 graphs with 1,318,103 nodes in total for 10 tissues and three data modalities.

Connecting gene graphs through a cell cluster dendrogram

We build a cell cluster dendrogram using hierarchical clustering (Pedregosa et al. 2011), in which cell clusters that are close together in the dendrogram are likely functionally similar. For this, we first construct a distance matrix between cell clusters using the obtained top 50 differentially expressed or active genes of each cell cluster. Each Slide-seq cell cluster has two copies, one for the expression graph and one for the spatial graph. Given two sets of differentially expressed or active genes S_A and S_B of cell clusters A and cell cluster B, their distance is defined as $\text{[math]}$ (1) where $\text{[math]}$ is the Jaccard index of the two gene sets. That is, if two cell clusters have totally different differentially expressed or active gene sets, their distance will be one, but if they have the same gene sets, their distance will be zero. We then build the dendrogram from the distance matrix using the agglomerative clustering algorithm with the criterion of minimizing the average of distances when merging pairs of branches. The clustering process results in a dendrogram in which each leaf is associated with a gene graph of a cell cluster.

Learning gene embeddings from gene graphs and the cell cluster dendrogram

To identify genes with similar functions or genes participating in the same pathway, we aim to embed each gene in each gene graph into a d-dimensional feature space in an unsupervised fashion. For this, we use the multilayer graph learning objective proposed by Zitnik and Leskovec (2017), which relies on connections between genes within a single graph and combines this with inter-graph relationships between genes in different graphs. Formally, let V be a given set of N genes {v₁, v₂, …, v_N} and G_i = (V_i, E_i) be one of the K gene graphs in the data set (i = 1, 2, …, K), where E_i denotes the edges between nodes V_i ⊆ V within the graph. The dendrogram is a full binary tree T defined over a set of 2K − 1 elements, where each of the K leaf elements is a gene graph. We define π(i) as the parent of the element i in the tree and, reversely, c(i) as the children of the element i in the tree. The problem is thus to learn mapping functions f₁, f₂, …, f_K, each attached with one graph, such that function f_i maps nodes in V_i to feature representations in ℝ^d. f_i is equivalent to a matrix of |V_i| × d parameters.

We started from the assumption that genes with similar neighborhoods in each single graph perform similar functions and therefore should be embedded closely together. Particularly, for the coexpression graph, in many cases there is a strong relationship between coexpression and functions. The idea has been widely used in the literature to study gene functions using scRNA-seq data (McKenzie et al. 2018; Cha et al. 2023). GIANT does not just look for coexpression in a single cell or tissue but rather focuses on coexpression across several tissues and modalities, which is more likely to result from real functional implications. The following objective function is used to achieve this goal: $\text{[math]}$ (2) where N_i(v) are the neighbors of node v in graph G_i, and w_v = f_i(v), which is the embedding of node v in this graph. The conditional likelihood p(N_i(v)|v) is equivalent to the product of a series of softmax functions; maximization of p(N_i(v)|v) therefore tries to maximize the classification probabilities of v’s neighbors based on learned node embeddings. With that, the objective for each graph i is set to $\text{[math]}$ (3)

We assume that nearby cell clusters in the dendrogram are functionally similar, which implies that the same gene should perform similar functions in these cell clusters, and therefore, its embeddings in the corresponding graphs should be similar. For this, we introduce a mapping function f_i (where i = K + 1, …, 2K − 1) for each internal (i.e., nonleaf) element, similar to those of leaf elements, which map nodes to feature representations in ℝ^d. Recursively, the node set in an internal element is the union of the nodes in its child elements, $\text{[math]}$ (where i = K + 1, …, 2K − 1). The following regularization term is then introduced for nodes in each element i of T: $\text{[math]}$ (4) where node v in the child elements is encouraged to share similar embeddings with the node v in the parent element under the Euclidean norm. We obtain the regularization for an element by summing over the terms of all its nodes: $\text{[math]}$ . For an internal element i, its mapping function f_i in the optimization iteration t can be solved by the following closed-form solution given the mapping functions of all the other elements in T computed from the previous iteration t − 1, which minimizes the regularization terms involving f_i: $\text{[math]}$ (5)

With the objective function and regularization term defined, the model tries to solve the following optimization problem to find the best mapping functions f₁, f₂, …, f_K of leaf elements: $\text{[math]}$ (6) where λ is the hyperparameter of the regularization strength. To optimize this function, an iterative learning process is performed, in which leaf mapping functions and internal mapping functions are updated alternately. The leaf mapping functions f₁, f₂, …, f_K are updated by performing one epoch of stochastic gradient descent that optimizes Equation 6, whereas the internal mapping functions f_K+1, f_K+2, …, f_2K−1 are updated using Equation 5. At each iteration t, the optimization of a mapping function depends on the results of its parent or child elements from the previous iteration, t − 1.

In practice, we set the dimension hyperparameter d as 128 and the regularization strength λ as 1.0, which makes the model achieve the best performance. When using a λ in the range between 1.0 to 10.0, the model performs better by merging information between gene graphs instead of focusing on single gene graphs (λ in the range of 0.0 to 0.1). In addition, we observed that the results stay about the same across a range of choices of hyperparameter values for d and λ (Supplemental Fig. S22). The neighborhood N(v) for each node in each graph is constructed using Node2vec's random walk algorithm (Grover and Leskovec 2016). We implemented distributed training that applies gradients on each node parallelly based on the Gensim package (Rehurek and Sojka 2010) written in Cython. With this, the algorithm is scalable to hundreds of graphs and millions of nodes. The runtime and memory usage for running GIANT on both the HuBMAP and human fetal data sets, along with the comparison to other methods, are reported in Supplemental Table S7. The runtime and memory usage on querying the nearest neighbors from the GIANT embedding are reported in Supplemental Table S8.

Splitting embedding space into components and identifying enriched GOs and TF regulons in them

The learned mapping functions map each gene in each cell cluster to a point in the embedding space. To identify overlapping data sources and discover functions and pathways enriched in different regions of the space, we split the embedding space into embedding components by clustering gene points in it. To do the clustering, we generate another graph in which each point in the space (gene) is connected to its 30 nearest neighbors. The Leiden algorithm (Traag et al. 2019) with resolutions of one (for identifying overlapping data sources) or 50 (for discovering enriched functions) is then applied to cluster the genes. GO enrichment is performed for genes in each component against all the genes clustered using the GOATOOLS package (Klopfenstein et al. 2018), in which a gene from a cell cluster is considered as an object, and all the genes from all the cell clusters form the population. P-values from randomization experiments follow a uniform distribution, indicating the validity of this statistical test (Supplemental Fig. S23). GO annotations with the evidence code “IEA,” namely, electronic annotation without manual review, are ignored in the analysis to avoid circular reasoning. GO annotations are propagated through the GO hierarchies with “is_a” and “part_of” relationships. False-discovery rate (FDR) correction with Benjamini–Hochberg is applied to adjust the enrichment P-values.

We also download manually curated TF–target regulatory relationships from the TRUST v2 (Han et al. 2018) database. Enrichment of TF regulons (genes regulated by the same TF) in each component is computed using the same method described above.

Inferring multicellular functions of genes

To identify new functions for genes, we relied on genes with known functions that are coembedded with them. Specifically, for one gene, we calculate the Euclidean distance between each pair of the gene's embeddings of different graphs and then grouped all its graphs into clusters using agglomerative clustering, in which two clusters will stop being merged if the distance between them is above 1.5. Then, for each graph in a cluster, the gene's 50 nearest neighbors in the graph are identified based on the Euclidean distance between gene embeddings. A “common neighbor” set for the gene in this cluster is finally selected based on two criteria: (1) the neighbor gene must appear in the nearest neighbors of at least half of the graphs in the cluster and (2) the neighbor gene is in the 50 most frequent neighbors based on the number of its appearance in nearest neighbors. Next, GO enrichment analysis is performed using the common neighbors. GO terms that are enriched in the neighbors (P < 0.05) are assigned to the gene as predicted functions. In this analysis, we only predict functions of genes expressed in at least 30% of the cells in each cell cluster. Using “common neighbors” for performing this analysis as opposed to analyzing individual cell clusters increases robustness. By first performing the clustering of gene graphs and then the identifying common neighbors and enriched functions, we reduce the amount of false positives.

Software availability

The source code of GIANT and the gene embeddings generated from GIANT for both the HuBMAP data set and the human fetal atlas data set are available at GitHub (https://github.com/ chenhcs/GIANT), at Zenodo (https://doi.org/10.5281/zenodo.10515930), and as Supplemental Code.

Previous Section Next Section

Competing interest statement

The authors declare no competing interests.

Previous Section Next Section

Acknowledgments

The work in this article was partially supported by National Institutes of Health (NIH) grants OT2OD026682, 1U54AG075931, and 1U24CA268108 to Z.B.-J.

Author contributions: H.C., N.D.N., and Z.B.-J. conceptualized the study. H.C., N.D.N., and Z.B.-J. designed the algorithm and methodology. M.R. and H.C. performed data preprocessing. H.C. and N.D.N. developed the software. H.C. and Z.B.-J. performed the analysis. H.C., N.D.N., and Z.B.-J. wrote the manuscript. All authors read and approved the final manuscript.

Previous Section Next Section

Footnotes

[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279631.124.
Freely available online through the Genome Research Open Access option.

Received May 28, 2024.
Accepted February 3, 2025.

© 2025 Chen et al.; Published by Cold Spring Harbor Laboratory Press

This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Previous Section

References

↵

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, et al. 2017. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 14: 1083–1086. doi:10.1038/nmeth.4463

CrossRef Medline Google Scholar
↵

Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, Stegle O. 2020. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol 21: 111. doi:10.1186/s13059-020-02015-1

CrossRef Medline Google Scholar
↵

Argelaguet R, Cuomo AS, Stegle O, Marioni JC. 2021. Computational principles and challenges in single-cell data integration. Nat Biotechnol 39: 1202–1215. doi:10.1038/s41587-021-00895-7

CrossRef Google Scholar
↵

Ashuach T, Gabitto MI, Koodli RV, Saldi GA, Jordan MI, Yosef N. 2023. MultiVI: deep generative model for the integration of multimodal data. Nat Methods 20: 1222–1231. doi:10.1038/s41592-023-01909-9

CrossRef Google Scholar
↵

Blom B, Spits H. 2006. Development of human lymphoid cells. Annu Rev Immunol 24: 287–320. doi:10.1146/annurev.immunol.24.021605.090612

CrossRef Medline Google Scholar
↵

Bround MJ, Asghari P, Wambolt RB, Bohunek L, Smits C, Philit M, Kieffer TJ, Lakatta EG, Boheler KR, Moore ED, et al. 2012. Cardiac ryanodine receptors control heart rate and rhythmicity in adult mice. Cardiovasc Res 96: 372–380. doi:10.1093/cvr/cvs260

CrossRef Medline Google Scholar
↵

Bunis DG, Bronevetsky Y, Krow-Lucal E, Bhakta NR, Kim CC, Nerella S, Jones N, Mendoza VF, Bryson YJ, Gern JE, et al. 2021. Single-cell mapping of progressive fetal-to-adult transition in human naive t cells. Cell Rep 34: 108573. doi:10.1016/j.celrep.2020.108573

CrossRef Medline Google Scholar
↵

Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. 2018. Integrating single-cell transcriptomic data across different conditions technologies and species. Nat Biotechnol 36: 411–420. doi:10.1038/nbt.4096

CrossRef Medline Google Scholar
↵

Cao ZJ, Gao G. 2022. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat Biotechnol 40: 1458–1466. doi:10.1038/s41587-022-01284-4

CrossRef Medline Google Scholar
↵

Cao J, O'Day DR, Pliner HA, Kingsley PD, Deng M, Daza RM, Zager MA, Aldinger KA, Blecher-Gonen R, Zhang F, et al. 2020. A human cell atlas of fetal gene expression. Science 370: eaba7721. doi:10.1126/science.aba7721

Abstract/FREE Full Text
↵

Cha J, Yu J, Cho JW, Hemberg M, Lee I. 2023. scHumanNet: a single-cell network analysis platform for the study of cell-type specificity of disease genes. Nucleic Acids Res 51: e8. doi:10.1093/nar/gkac1042

CrossRef Medline Google Scholar
↵

Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. 2015. Spatially resolved highly multiplexed RNA profiling in single cells. Science 348: aaa6090. doi:10.1126/science.aaa6090

Abstract/FREE Full Text
↵

Chen H, Ryu J, Vinyard ME, Lerer A, Pinello L. 2024. SIMBA: single-cell embedding along with features. Nat Methods 21: 1003–1013. doi:10.1038/s41592-023-01899-8

CrossRef Medline Google Scholar
↵

Conway JR, Lex A, Gehlenborg N. 2017. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33: 2938–2940. doi:10.1093/bioinformatics/btx364

CrossRef Medline Google Scholar
↵

Domcke S, Hill AJ, Daza RM, Cao J, O'Day DR, Pliner HA, Aldinger KA, Pokholok D, Zhang F, Milbank JH, et al. 2020. A human cell atlas of fetal chromatin accessibility. Science 370: eaba7612. doi:10.1126/science.aba7612

Abstract/FREE Full Text
↵

Domínguez Conde C, Xu C, Jarvis L, Rainbow D, Wells S, Gomes T, Howlett S, Suchanek O, Polanski K, King H, et al. 2022. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376: eabl5197. doi:10.1126/science.abl5197

CrossRef Medline Google Scholar
↵

Dries R, Chen J, del Rossi N, Khan MM, Sistig A, Yuan GC. 2021. Advances in spatial transcriptomic data analysis. Genome Res 31: 1706–1718. doi:10.1101/gr.275224.121

Abstract/FREE Full Text
↵

Du J, Jia P, Dai Y, Tao C, Zhao Z, Zhi D. 2019. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20: 7–15. doi:10.1186/s12864-018-5370-x

CrossRef Medline Google Scholar
↵

Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, Motamedi A, Shiau AK, Zhou X, Xie F, et al. 2021. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12: 1337. doi:10.1038/s41467-021-21583-9

CrossRef Medline Google Scholar
↵

Felten SY, Felten DL. 1994. Neural-immune interactions. Prog Brain Res 100: 157–162. doi:10.1016/S0079-6123(08)60781-5

CrossRef Medline Google Scholar
↵

Feng J, Liu T, Qin B, Zhang Y, Liu XS. 2012. Identifying chip-seq enrichment using MACS. Nat Protoc 7: 1728–1740. doi:10.1038/nprot.2012.101

CrossRef Medline Google Scholar
↵

Fessler H. 1997. Heart-lung interactions: applications in the critically ill. Eur Respir J 10: 226–237. doi:10.1183/09031936.97.10010226

Abstract/FREE Full Text
↵

Fornes O, Castro-Mondragon JA, Khan A, Van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, et al. 2020. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 48: D87–D92. doi:10.1093/nar/gkz1001

CrossRef Medline Google Scholar
↵

Furness JB. 2012. The enteric nervous system and neurogastroenterology. Nat Rev Gastroenterol Hepatol 9: 286–294. doi:10.1038/nrgastro.2012.32

CrossRef Medline Google Scholar
↵

Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss J, Sevilla C, Matthews L, Gong C, et al. 2022. The reactome pathway knowledgebase 2022. Nucleic Acids Res 50: D687–D692. doi:10.1093/nar/gkab1028

CrossRef Medline Google Scholar
↵

Gopal S, Yang Y. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, pp. 257–265. Association for Computing Machinery, New York. doi:10.1145/2487575.2487644

CrossRef Google Scholar
↵

Graeber MB, Streit WJ. 2010. Microglia: biology and pathology. Acta Neuropathol 119: 89–105. doi:10.1007/s00401-009-0622-0

CrossRef Medline Google Scholar
↵

Grover A, Leskovec J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, pp. 855–864. Association for Computing Machinery, New York. doi:10.1145/2939672.2939754

CrossRef Medline Google Scholar
↵

Gupta D, Shah HP, Malu K, Berliner N, Gaines P. 2014. Differentiation and characterization of myeloid cells. Curr Protoc Immunol 104: 22F.5.1–22F.5.28. doi:10.1002/0471142735.im22f05s104

CrossRef Medline Google Scholar
↵

Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks GJ, Larsson AJ, Faridani OR, Sandberg R. 2020. Single-cell RNA counting at allele and isoform resolution using smart-seq3. Nat Biotechnol 38: 708–714. doi:10.1038/s41587-020-0497-0

CrossRef Medline Google Scholar
↵

Haghverdi L, Lun AT, Morgan MD, Marioni JC. 2018. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36: 421–427. doi:10.1038/nbt.4091

CrossRef Medline Google Scholar
↵

Hahaut V, Pavlinic D, Carbone W, Schuierer S, Balmer P, Quinodoz M, Renner M, Roma G, Cowan CS, Picelli S. 2022. Fast and highly sensitive full-length single-cell RNA sequencing using flash-seq. Nat Biotechnol 40: 1447–1451. doi:10.1038/s41587-022-01312-3

CrossRef Medline Google Scholar
↵

Han H, Cho JW, Lee S, Yun A, Kim H, Bae D, Yang S, Kim CY, Lee M, Kim E, et al. 2018. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res 46: D380–D386. doi:10.1093/nar/gkx1013

CrossRef Medline Google Scholar
↵

Hao Y, Hao S, Andersen-Nissen E, Mauck WM III., Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zager M, et al. 2021. Integrated analysis of multimodal single-cell data. Cell 184: 3573–3587.e29. doi:10.1016/j.cell.2021.04.048

CrossRef Medline Google Scholar
↵

Hasanaj E, Wang J, Sarathi A, Ding J, Bar-Joseph Z. 2022. Interactive single-cell data analysis using Cellar. Nat Commun 13: 1998. doi:10.1038/s41467-022-29744-0

CrossRef Medline Google Scholar
↵

Hie B, Bryson B, Berger B. 2019. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat Biotechnol 37: 685–691. doi:10.1038/s41587-019-0113-3

CrossRef Medline Google Scholar
↵

HuBMAP Consortium. 2019. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574: 187–192. doi:10.1038/s41586-019-1629-x

CrossRef Medline Google Scholar
↵

Jain S, Pei L, Spraggins JM, Angelo M, Carson JP, Gehlenborg N, Ginty F, Gonçalves JP, Hagood JS, Hickey JW, et al. 2023. Advances and prospects for the human biomolecular atlas program (HuBMAP). Nat Cell Biol 25: 1089–1100. doi:10.1038/s41556-023-01194-w

CrossRef Medline Google Scholar
↵

Khiem D, Cyster JG, Schwarz JJ, Black BL. 2008. A p38 MAPK-MEF2C pathway regulates B-cell proliferation. Proc Natl Acad Sci 105: 17067–17072. doi:10.1073/pnas.0804868105

Abstract/FREE Full Text
↵

Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. 2017. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14: 483–486. doi:10.1038/nmeth.4236

CrossRef Medline Google Scholar
↵

Klopfenstein D, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, Mungall CJ, Yunes JM, Botvinnik O, Weigel M, et al. 2018. GOATOOLS: a Python library for Gene Ontology analyses. Sci Rep 8: 10872. doi:10.1038/s41598-018-28948-z

CrossRef Medline Google Scholar
↵

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh P, Raychaudhuri S. 2019. Fast sensitive and accurate integration of single-cell data with harmony. Nat Methods 16: 1289–1296. doi:10.1038/s41592-019-0619-0

CrossRef Medline Google Scholar
↵

Langfelder P, Horvath S. 2008. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559. doi:10.1186/1471-2105-9-559

CrossRef Medline Google Scholar
↵

Lee P, Blood P, Börner K, Campisi J, Chen F, Daldrup-Link H, Jager PD, Ding L, et al. 2022. NIH SenNet Consortium to map senescent cells throughout the human lifespan to understand physiological health. Nat Aging 2: 1090–1100. doi:10.1038/s43587-022-00326-5

CrossRef Medline Google Scholar
↵

Li H, Radford JC, Ragusa MJ, Shea KL, McKercher SR, Zaremba JD, Soussou W, Nie Z, Kang YJ, Nakanishi N, et al. 2008. Transcription factor MEF2C influences neural stem/progenitor cell differentiation and maturation in vivo. Proc Natl Acad Sci 105: 9397–9402. doi:10.1073/pnas.0802876105

Abstract/FREE Full Text
↵

Li D, Ding J, Bar-Joseph Z. 2021. Identifying signaling genes in spatial single-cell expression data. Bioinformatics 37: 968–975. doi:10.1093/bioinformatics/btaa769

CrossRef Medline Google Scholar
↵

Liang Q, Huang Y, He S, Chen K. 2023. Pathway centric analysis for single-cell RNA-seq and spatial transcriptomics data with GSDensity. Nat Commun 14: 8416. doi:10.1038/s41467-023-44206-x

CrossRef Medline Google Scholar
↵

Lin Q, Schwarz J, Bucana C, Olson EN. 1997. Control of mouse cardiac morphogenesis and myogenesis by transcription factor MEF2C. Science 276: 1404–1407. doi:10.1126/science.276.5317.1404

Abstract/FREE Full Text
↵

Lin Q, Lu J, Yanagisawa H, Webb R, Lyons GE, Richardson JA, Olson EN. 1998. Requirement of the MADS-box transcription factor MEF2C for vascular development. Development 125: 4565–4574. doi:10.1242/dev.125.22.4565

Abstract
↵

Liu J, Gao C, Sodicoff J, Kozareva V, Macosko EZ, Welch JD. 2020. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat Protoc 15: 3632–3662. doi:10.1038/s41596-020-0391-8

CrossRef Google Scholar
↵

Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. 2018. Deep generative modeling for single-cell transcriptomics. Nat Methods 15: 1053–1058. doi:10.1038/s41592-018-0229-2

CrossRef Medline Google Scholar
↵

Ma A, Wang X, Li J, Wang C, Xiao T, Liu Y, Cheng H, Wang J, Li Y, Chang Y, et al. 2023. Single-cell biological network inference using a heterogeneous graph transformer. Nat Commun 14: 964. doi:10.1038/s41467-023-36559-0

CrossRef Medline Google Scholar
↵

McKenzie AT, Wang M, Hauberg ME, Fullard JF, Kozlenkov A, Keenan A, Hurd YL, Dracheva S, Casaccia P, Roussos P, et al. 2018. Brain cell type specific gene expression and co-expression network architectures. Sci Rep 8: 8868. doi:10.1038/s41598-018-27293-5

CrossRef Medline Google Scholar
↵

Neyra JA, Hu MC, Moe OW. 2021. Klotho in clinical nephrology: diagnostic and therapeutic implications. Clin J Am Soc Nephrol 16: 162–176. doi:10.2215/CJN.02840320

Abstract/FREE Full Text
↵

Nguyen ND, Wang D. 2020. Multiview learning for understanding functional multiomics. PLoS Comput Biol 16: e1007677. doi:10.1371/journal.pcbi.1007677

CrossRef Medline Google Scholar
↵

Nguyen P, Leray V, Diez M, Serisier S, Bloc'h JL, Siliart B, Dumon H. 2008. Liver lipid metabolism. J Anim Physiol Anim Nutr (Berl) 92: 272–283. doi:10.1111/j.1439-0396.2007.00752.x

CrossRef Medline Google Scholar
↵

O'Donnell AM, Nakamura H, Puri P. 2019. Altered ryanodine receptor gene expression in Hirschsprung's disease. Pediatr Surg Int 35: 923–927. doi:10.1007/s00383-019-04504-2

CrossRef Medline Google Scholar
↵

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14: 417–419. doi:10.1038/nmeth.4197

CrossRef Medline Google Scholar
↵

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. 2011. Scikit-learn: machine learning in Python. J Mach Learn Res 12: 2825–2830.

CrossRef Medline Google Scholar
↵

Piquereau J, Ventura-Clapier R. 2018. Maturation of cardiac energy metabolism during perinatal development. Front Physiol 9: 959. doi:10.3389/fphys.2018.00959

CrossRef Google Scholar
↵

Pritykin Y, Ghersi D, Singh M. 2015. Genome-wide detection and analysis of multifunctional genes. PLoS Comput Biol 11: e1004467. doi:10.1371/journal.pcbi.1004467

CrossRef Medline Google Scholar
↵

Quah FX, Hemberg M. 2022. SC3s: efficient scaling of single cell consensus clustering to millions of cells. BMC Bioinformatics 23: 536. doi:10.1186/s12859-022-05085-z

CrossRef Medline Google Scholar
↵

Ratnovsky A, Elad D, Halpern P. 2008. Mechanics of respiratory muscles. Respir Physiol Neurobiol 163: 82–89. doi:10.1016/j.resp.2008.04.019

CrossRef Medline Google Scholar
↵

Rehurek R, Sojka P. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, Valletta, Malta, pp. 45–50. European Language Resources Association, Paris. doi:10.13140/2.1.2393.1847

CrossRef Google Scholar
↵

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43: e47. doi:10.1093/nar/gkv007

CrossRef Medline Google Scholar
↵

Röder PV, Wu B, Liu Y, Han W. 2016. Pancreatic regulation of glucose homeostasis. Exp Mol Med 48: e219. doi:10.1038/emm.2016.6

CrossRef Medline Google Scholar
↵

Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ. 2019. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363: 1463–1467. doi:10.1126/science.aaw1219

Abstract/FREE Full Text
↵

Roux L, Buzsáki G. 2015. Tasks for inhibitory interneurons in intact brain circuits. Neuropharmacology 88: 10–23. doi:10.1016/j.neuropharm.2014.09.011

CrossRef Medline Google Scholar
↵

Rozenblatt-Rosen O, Stubbington MJ, Regev A, Teichmann SA. 2017. The human cell atlas: from vision to reality. Nature 550: 451–453. doi:10.1038/550451a

CrossRef Medline Google Scholar
↵

Sanders KM, Koh SD, Ro S, Ward SM. 2012. Regulation of gastrointestinal motility—insights from smooth muscle biology. Nat Rev Gastroenterol Hepatol 9: 633–645. doi:10.1038/nrgastro.2012.168

CrossRef Medline Google Scholar
↵

Satija R, Farrell JA, Gennert D, Schier AF, Regev A. 2015. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33: 495–502. doi:10.1038/nbt.3192

CrossRef Medline Google Scholar
↵

Severs NJ. 2000. The cardiac muscle cell. Bioessays 22: 188–199. doi:10.1002/(SICI)1521-1878(200002)22:2<188::AID-BIES10>3.0.CO;2-T

CrossRef Medline Google Scholar
↵

Sikkema L, Ramírez-Suástegui C, Strobl DC, Gillett TE, Zappia L, Madissoon E, Markov NS, Zaragosi LE, Ji Y, Ansari M, et al. 2023. An integrated cell atlas of the lung in health and disease. Nat Med 29: 1563–1577. doi:10.1038/s41591-023-02327-2

CrossRef Medline Google Scholar
↵

Sonawane AR, Platig J, Fagny M, Chen CY, Paulson JN, Lopes-Ramos CM, DeMeo DL, Quackenbush J, Glass K, Kuijjer ML. 2017. Understanding tissue-specific gene regulation. Cell Rep 21: 1077–1088. doi:10.1016/j.celrep.2017.10.001

CrossRef Medline Google Scholar
↵

Steuernagel L, Meckbach C, Heinrich F, Zeidler S, Schmitt AO, Gültas M. 2019. Computational identification of tissue-specific transcription factor cooperation in ten cattle tissues. PLoS One 14: e0216475. doi:10.1371/journal.pone.0216475

CrossRef Medline Google Scholar
↵

Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302: 249–255. doi:10.1126/science.1087447

Abstract/FREE Full Text
↵

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III., Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive integration of single-cell data. Cell 177: 1888–1902.e21. doi:10.1016/j.cell.2019.05.031

CrossRef Medline Google Scholar
↵

Teng H, Yuan Y, Bar-Joseph Z. 2022. Clustering spatial transcriptomics data. Bioinformatics 38: 997–1004. doi:10.1093/bioinformatics/btab704

CrossRef Google Scholar
↵

Traag VA, Waltman L, Van Eck NJ. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9: 5233. doi:10.1038/s41598-019-41695-z

CrossRef Medline Google Scholar
↵

Walton KD, Freddo AM, Wang S, Gumucio DL. 2016. Generation of intestinal surface: an absorbing tale. Development 143: 2261–2272. doi:10.1242/dev.135400

Abstract/FREE Full Text
↵

Welch JD, Hartemink AJ, Prins JF. 2017. MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol 18: 138. doi:10.1186/s13059-017-1269-0

CrossRef Medline Google Scholar
↵

Wilker PR, Kohyama M, Sandau MM, Albring JC, Nakagawa O, Schwarz JJ, Murphy KM. 2008. Transcription factor Mef2c is required for B cell proliferation and survival after antigen receptor stimulation. Nat Immunol 9: 603–612. doi:10.1038/ni.1609

CrossRef Medline Google Scholar
↵

Wolf FA, Angerer P, Theis FJ. 2018. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19: 15. doi:10.1186/s13059-017-1382-0

CrossRef Medline Google Scholar
↵

Xu M, Bai X, Ai B, Zhang G, Song C, Zhao J, Wang Y, Wei L, Qian F, Li Y, et al. 2022. TF-Marker: a comprehensive manually curated database for transcription factors and related markers in specific cell and tissue types in human. Nucleic Acids Res 50: D402–D412. doi:10.1093/nar/gkab1114

CrossRef Medline Google Scholar
↵

Yao Z, van Velthoven CT, Kunst M, Zhang M, McMillen D, Lee C, Jung W, Goldy J, Abdelhak A, Aitken M, et al. 2023. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. Nature 624: 317–332. doi:10.1038/s41586-023-06812-z

CrossRef Medline Google Scholar
↵

Yeger-Lotem E, Sharan R. 2015. Human protein interaction networks across tissues and diseases. Front Genet 6: 257. doi:10.3389/fgene.2015.00257

CrossRef Google Scholar
↵

Yuan Y, Bar-Joseph Z. 2020. GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data. Genome Biol 21: 300. doi:10.1186/s13059-020-02214-w

CrossRef Medline Google Scholar
↵

Zhang K, Feng W, Wang P. 2022. Identification of spatially variable genes with graph cuts. Nat Commun 13: 5488. doi:10.1038/s41467-022-33182-3

CrossRef Google Scholar
↵

Zitnik M, Leskovec J. 2017. Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33: i190–i198. doi:10.1093/bioinformatics/btx252

CrossRef Medline Google Scholar

[1] ↵

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, et al. 2017. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 14: 1083–1086. doi:10.1038/nmeth.4463

CrossRef Medline Google Scholar

[2] ↵

Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, Stegle O. 2020. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol 21: 111. doi:10.1186/s13059-020-02015-1

CrossRef Medline Google Scholar

[3] ↵

Argelaguet R, Cuomo AS, Stegle O, Marioni JC. 2021. Computational principles and challenges in single-cell data integration. Nat Biotechnol 39: 1202–1215. doi:10.1038/s41587-021-00895-7

CrossRef Google Scholar

[4] ↵

Ashuach T, Gabitto MI, Koodli RV, Saldi GA, Jordan MI, Yosef N. 2023. MultiVI: deep generative model for the integration of multimodal data. Nat Methods 20: 1222–1231. doi:10.1038/s41592-023-01909-9

CrossRef Google Scholar

[5] ↵

Blom B, Spits H. 2006. Development of human lymphoid cells. Annu Rev Immunol 24: 287–320. doi:10.1146/annurev.immunol.24.021605.090612

CrossRef Medline Google Scholar

[6] ↵

Bround MJ, Asghari P, Wambolt RB, Bohunek L, Smits C, Philit M, Kieffer TJ, Lakatta EG, Boheler KR, Moore ED, et al. 2012. Cardiac ryanodine receptors control heart rate and rhythmicity in adult mice. Cardiovasc Res 96: 372–380. doi:10.1093/cvr/cvs260

CrossRef Medline Google Scholar

[7] ↵

Bunis DG, Bronevetsky Y, Krow-Lucal E, Bhakta NR, Kim CC, Nerella S, Jones N, Mendoza VF, Bryson YJ, Gern JE, et al. 2021. Single-cell mapping of progressive fetal-to-adult transition in human naive t cells. Cell Rep 34: 108573. doi:10.1016/j.celrep.2020.108573

CrossRef Medline Google Scholar

[8] ↵

Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. 2018. Integrating single-cell transcriptomic data across different conditions technologies and species. Nat Biotechnol 36: 411–420. doi:10.1038/nbt.4096

CrossRef Medline Google Scholar

[9] ↵

Cao ZJ, Gao G. 2022. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat Biotechnol 40: 1458–1466. doi:10.1038/s41587-022-01284-4

CrossRef Medline Google Scholar

[10] ↵

Cao J, O'Day DR, Pliner HA, Kingsley PD, Deng M, Daza RM, Zager MA, Aldinger KA, Blecher-Gonen R, Zhang F, et al. 2020. A human cell atlas of fetal gene expression. Science 370: eaba7721. doi:10.1126/science.aba7721

Abstract/FREE Full Text

[11] ↵

Cha J, Yu J, Cho JW, Hemberg M, Lee I. 2023. scHumanNet: a single-cell network analysis platform for the study of cell-type specificity of disease genes. Nucleic Acids Res 51: e8. doi:10.1093/nar/gkac1042

CrossRef Medline Google Scholar

[12] ↵

Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. 2015. Spatially resolved highly multiplexed RNA profiling in single cells. Science 348: aaa6090. doi:10.1126/science.aaa6090

Abstract/FREE Full Text

[13] ↵

Chen H, Ryu J, Vinyard ME, Lerer A, Pinello L. 2024. SIMBA: single-cell embedding along with features. Nat Methods 21: 1003–1013. doi:10.1038/s41592-023-01899-8

CrossRef Medline Google Scholar

[14] ↵

Conway JR, Lex A, Gehlenborg N. 2017. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33: 2938–2940. doi:10.1093/bioinformatics/btx364

CrossRef Medline Google Scholar

[15] ↵

Domcke S, Hill AJ, Daza RM, Cao J, O'Day DR, Pliner HA, Aldinger KA, Pokholok D, Zhang F, Milbank JH, et al. 2020. A human cell atlas of fetal chromatin accessibility. Science 370: eaba7612. doi:10.1126/science.aba7612

Abstract/FREE Full Text

[16] ↵

Domínguez Conde C, Xu C, Jarvis L, Rainbow D, Wells S, Gomes T, Howlett S, Suchanek O, Polanski K, King H, et al. 2022. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376: eabl5197. doi:10.1126/science.abl5197

CrossRef Medline Google Scholar

[17] ↵

Dries R, Chen J, del Rossi N, Khan MM, Sistig A, Yuan GC. 2021. Advances in spatial transcriptomic data analysis. Genome Res 31: 1706–1718. doi:10.1101/gr.275224.121

Abstract/FREE Full Text

[18] ↵

Du J, Jia P, Dai Y, Tao C, Zhao Z, Zhi D. 2019. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20: 7–15. doi:10.1186/s12864-018-5370-x

CrossRef Medline Google Scholar

[19] ↵

Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, Motamedi A, Shiau AK, Zhou X, Xie F, et al. 2021. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12: 1337. doi:10.1038/s41467-021-21583-9

CrossRef Medline Google Scholar

[20] ↵

Felten SY, Felten DL. 1994. Neural-immune interactions. Prog Brain Res 100: 157–162. doi:10.1016/S0079-6123(08)60781-5

CrossRef Medline Google Scholar

[21] ↵

Feng J, Liu T, Qin B, Zhang Y, Liu XS. 2012. Identifying chip-seq enrichment using MACS. Nat Protoc 7: 1728–1740. doi:10.1038/nprot.2012.101

CrossRef Medline Google Scholar

[22] ↵

Fessler H. 1997. Heart-lung interactions: applications in the critically ill. Eur Respir J 10: 226–237. doi:10.1183/09031936.97.10010226

Abstract/FREE Full Text

[23] ↵

Fornes O, Castro-Mondragon JA, Khan A, Van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, et al. 2020. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 48: D87–D92. doi:10.1093/nar/gkz1001

CrossRef Medline Google Scholar

[24] ↵

Furness JB. 2012. The enteric nervous system and neurogastroenterology. Nat Rev Gastroenterol Hepatol 9: 286–294. doi:10.1038/nrgastro.2012.32

CrossRef Medline Google Scholar

[25] ↵

Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss J, Sevilla C, Matthews L, Gong C, et al. 2022. The reactome pathway knowledgebase 2022. Nucleic Acids Res 50: D687–D692. doi:10.1093/nar/gkab1028

CrossRef Medline Google Scholar

[26] ↵

Gopal S, Yang Y. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, pp. 257–265. Association for Computing Machinery, New York. doi:10.1145/2487575.2487644

CrossRef Google Scholar

[27] ↵

Graeber MB, Streit WJ. 2010. Microglia: biology and pathology. Acta Neuropathol 119: 89–105. doi:10.1007/s00401-009-0622-0

CrossRef Medline Google Scholar

[28] ↵

Grover A, Leskovec J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, pp. 855–864. Association for Computing Machinery, New York. doi:10.1145/2939672.2939754

CrossRef Medline Google Scholar

[29] ↵

Gupta D, Shah HP, Malu K, Berliner N, Gaines P. 2014. Differentiation and characterization of myeloid cells. Curr Protoc Immunol 104: 22F.5.1–22F.5.28. doi:10.1002/0471142735.im22f05s104

CrossRef Medline Google Scholar

[30] ↵

Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks GJ, Larsson AJ, Faridani OR, Sandberg R. 2020. Single-cell RNA counting at allele and isoform resolution using smart-seq3. Nat Biotechnol 38: 708–714. doi:10.1038/s41587-020-0497-0

CrossRef Medline Google Scholar

[31] ↵

Haghverdi L, Lun AT, Morgan MD, Marioni JC. 2018. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36: 421–427. doi:10.1038/nbt.4091

CrossRef Medline Google Scholar

[32] ↵

Hahaut V, Pavlinic D, Carbone W, Schuierer S, Balmer P, Quinodoz M, Renner M, Roma G, Cowan CS, Picelli S. 2022. Fast and highly sensitive full-length single-cell RNA sequencing using flash-seq. Nat Biotechnol 40: 1447–1451. doi:10.1038/s41587-022-01312-3

CrossRef Medline Google Scholar

[33] ↵

Han H, Cho JW, Lee S, Yun A, Kim H, Bae D, Yang S, Kim CY, Lee M, Kim E, et al. 2018. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res 46: D380–D386. doi:10.1093/nar/gkx1013

CrossRef Medline Google Scholar

[34] ↵

Hao Y, Hao S, Andersen-Nissen E, Mauck WM III., Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zager M, et al. 2021. Integrated analysis of multimodal single-cell data. Cell 184: 3573–3587.e29. doi:10.1016/j.cell.2021.04.048

CrossRef Medline Google Scholar

[35] ↵

Hasanaj E, Wang J, Sarathi A, Ding J, Bar-Joseph Z. 2022. Interactive single-cell data analysis using Cellar. Nat Commun 13: 1998. doi:10.1038/s41467-022-29744-0

CrossRef Medline Google Scholar

[36] ↵

Hie B, Bryson B, Berger B. 2019. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat Biotechnol 37: 685–691. doi:10.1038/s41587-019-0113-3

CrossRef Medline Google Scholar

[37] ↵

HuBMAP Consortium. 2019. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574: 187–192. doi:10.1038/s41586-019-1629-x

CrossRef Medline Google Scholar

[38] ↵

Jain S, Pei L, Spraggins JM, Angelo M, Carson JP, Gehlenborg N, Ginty F, Gonçalves JP, Hagood JS, Hickey JW, et al. 2023. Advances and prospects for the human biomolecular atlas program (HuBMAP). Nat Cell Biol 25: 1089–1100. doi:10.1038/s41556-023-01194-w

CrossRef Medline Google Scholar

[39] ↵

Khiem D, Cyster JG, Schwarz JJ, Black BL. 2008. A p38 MAPK-MEF2C pathway regulates B-cell proliferation. Proc Natl Acad Sci 105: 17067–17072. doi:10.1073/pnas.0804868105

Abstract/FREE Full Text

[40] ↵

Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. 2017. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14: 483–486. doi:10.1038/nmeth.4236

CrossRef Medline Google Scholar

[41] ↵

Klopfenstein D, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, Mungall CJ, Yunes JM, Botvinnik O, Weigel M, et al. 2018. GOATOOLS: a Python library for Gene Ontology analyses. Sci Rep 8: 10872. doi:10.1038/s41598-018-28948-z

CrossRef Medline Google Scholar

[42] ↵

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh P, Raychaudhuri S. 2019. Fast sensitive and accurate integration of single-cell data with harmony. Nat Methods 16: 1289–1296. doi:10.1038/s41592-019-0619-0

CrossRef Medline Google Scholar

[43] ↵

Langfelder P, Horvath S. 2008. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559. doi:10.1186/1471-2105-9-559

CrossRef Medline Google Scholar

[44] ↵

Lee P, Blood P, Börner K, Campisi J, Chen F, Daldrup-Link H, Jager PD, Ding L, et al. 2022. NIH SenNet Consortium to map senescent cells throughout the human lifespan to understand physiological health. Nat Aging 2: 1090–1100. doi:10.1038/s43587-022-00326-5

CrossRef Medline Google Scholar

[45] ↵

Li H, Radford JC, Ragusa MJ, Shea KL, McKercher SR, Zaremba JD, Soussou W, Nie Z, Kang YJ, Nakanishi N, et al. 2008. Transcription factor MEF2C influences neural stem/progenitor cell differentiation and maturation in vivo. Proc Natl Acad Sci 105: 9397–9402. doi:10.1073/pnas.0802876105

Abstract/FREE Full Text

[46] ↵

Li D, Ding J, Bar-Joseph Z. 2021. Identifying signaling genes in spatial single-cell expression data. Bioinformatics 37: 968–975. doi:10.1093/bioinformatics/btaa769

CrossRef Medline Google Scholar

[47] ↵

Liang Q, Huang Y, He S, Chen K. 2023. Pathway centric analysis for single-cell RNA-seq and spatial transcriptomics data with GSDensity. Nat Commun 14: 8416. doi:10.1038/s41467-023-44206-x

CrossRef Medline Google Scholar

[48] ↵

Lin Q, Schwarz J, Bucana C, Olson EN. 1997. Control of mouse cardiac morphogenesis and myogenesis by transcription factor MEF2C. Science 276: 1404–1407. doi:10.1126/science.276.5317.1404

Abstract/FREE Full Text

[49] ↵

Lin Q, Lu J, Yanagisawa H, Webb R, Lyons GE, Richardson JA, Olson EN. 1998. Requirement of the MADS-box transcription factor MEF2C for vascular development. Development 125: 4565–4574. doi:10.1242/dev.125.22.4565

Abstract

[50] ↵

Liu J, Gao C, Sodicoff J, Kozareva V, Macosko EZ, Welch JD. 2020. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat Protoc 15: 3632–3662. doi:10.1038/s41596-020-0391-8

CrossRef Google Scholar

[51] ↵

Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. 2018. Deep generative modeling for single-cell transcriptomics. Nat Methods 15: 1053–1058. doi:10.1038/s41592-018-0229-2

CrossRef Medline Google Scholar

[52] ↵

Ma A, Wang X, Li J, Wang C, Xiao T, Liu Y, Cheng H, Wang J, Li Y, Chang Y, et al. 2023. Single-cell biological network inference using a heterogeneous graph transformer. Nat Commun 14: 964. doi:10.1038/s41467-023-36559-0

CrossRef Medline Google Scholar

[53] ↵

McKenzie AT, Wang M, Hauberg ME, Fullard JF, Kozlenkov A, Keenan A, Hurd YL, Dracheva S, Casaccia P, Roussos P, et al. 2018. Brain cell type specific gene expression and co-expression network architectures. Sci Rep 8: 8868. doi:10.1038/s41598-018-27293-5

CrossRef Medline Google Scholar

[54] ↵

Neyra JA, Hu MC, Moe OW. 2021. Klotho in clinical nephrology: diagnostic and therapeutic implications. Clin J Am Soc Nephrol 16: 162–176. doi:10.2215/CJN.02840320

Abstract/FREE Full Text

[55] ↵

Nguyen ND, Wang D. 2020. Multiview learning for understanding functional multiomics. PLoS Comput Biol 16: e1007677. doi:10.1371/journal.pcbi.1007677

CrossRef Medline Google Scholar

[56] ↵

Nguyen P, Leray V, Diez M, Serisier S, Bloc'h JL, Siliart B, Dumon H. 2008. Liver lipid metabolism. J Anim Physiol Anim Nutr (Berl) 92: 272–283. doi:10.1111/j.1439-0396.2007.00752.x

CrossRef Medline Google Scholar

[57] ↵

O'Donnell AM, Nakamura H, Puri P. 2019. Altered ryanodine receptor gene expression in Hirschsprung's disease. Pediatr Surg Int 35: 923–927. doi:10.1007/s00383-019-04504-2

CrossRef Medline Google Scholar

[58] ↵

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14: 417–419. doi:10.1038/nmeth.4197

CrossRef Medline Google Scholar

[59] ↵

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. 2011. Scikit-learn: machine learning in Python. J Mach Learn Res 12: 2825–2830.

CrossRef Medline Google Scholar

[60] ↵

Piquereau J, Ventura-Clapier R. 2018. Maturation of cardiac energy metabolism during perinatal development. Front Physiol 9: 959. doi:10.3389/fphys.2018.00959

CrossRef Google Scholar

[61] ↵

Pritykin Y, Ghersi D, Singh M. 2015. Genome-wide detection and analysis of multifunctional genes. PLoS Comput Biol 11: e1004467. doi:10.1371/journal.pcbi.1004467

CrossRef Medline Google Scholar

[62] ↵

Quah FX, Hemberg M. 2022. SC3s: efficient scaling of single cell consensus clustering to millions of cells. BMC Bioinformatics 23: 536. doi:10.1186/s12859-022-05085-z

CrossRef Medline Google Scholar

[63] ↵

Ratnovsky A, Elad D, Halpern P. 2008. Mechanics of respiratory muscles. Respir Physiol Neurobiol 163: 82–89. doi:10.1016/j.resp.2008.04.019

CrossRef Medline Google Scholar

[64] ↵

Rehurek R, Sojka P. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, Valletta, Malta, pp. 45–50. European Language Resources Association, Paris. doi:10.13140/2.1.2393.1847

CrossRef Google Scholar

[65] ↵

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43: e47. doi:10.1093/nar/gkv007

CrossRef Medline Google Scholar

[66] ↵

Röder PV, Wu B, Liu Y, Han W. 2016. Pancreatic regulation of glucose homeostasis. Exp Mol Med 48: e219. doi:10.1038/emm.2016.6

CrossRef Medline Google Scholar

[67] ↵

Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ. 2019. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363: 1463–1467. doi:10.1126/science.aaw1219

Abstract/FREE Full Text

[68] ↵

Roux L, Buzsáki G. 2015. Tasks for inhibitory interneurons in intact brain circuits. Neuropharmacology 88: 10–23. doi:10.1016/j.neuropharm.2014.09.011

CrossRef Medline Google Scholar

[69] ↵

Rozenblatt-Rosen O, Stubbington MJ, Regev A, Teichmann SA. 2017. The human cell atlas: from vision to reality. Nature 550: 451–453. doi:10.1038/550451a

CrossRef Medline Google Scholar

[70] ↵

Sanders KM, Koh SD, Ro S, Ward SM. 2012. Regulation of gastrointestinal motility—insights from smooth muscle biology. Nat Rev Gastroenterol Hepatol 9: 633–645. doi:10.1038/nrgastro.2012.168

CrossRef Medline Google Scholar

[71] ↵

Satija R, Farrell JA, Gennert D, Schier AF, Regev A. 2015. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33: 495–502. doi:10.1038/nbt.3192

CrossRef Medline Google Scholar

[72] ↵

Severs NJ. 2000. The cardiac muscle cell. Bioessays 22: 188–199. doi:10.1002/(SICI)1521-1878(200002)22:2<188::AID-BIES10>3.0.CO;2-T

CrossRef Medline Google Scholar

[73] ↵

Sikkema L, Ramírez-Suástegui C, Strobl DC, Gillett TE, Zappia L, Madissoon E, Markov NS, Zaragosi LE, Ji Y, Ansari M, et al. 2023. An integrated cell atlas of the lung in health and disease. Nat Med 29: 1563–1577. doi:10.1038/s41591-023-02327-2

CrossRef Medline Google Scholar

[74] ↵

Sonawane AR, Platig J, Fagny M, Chen CY, Paulson JN, Lopes-Ramos CM, DeMeo DL, Quackenbush J, Glass K, Kuijjer ML. 2017. Understanding tissue-specific gene regulation. Cell Rep 21: 1077–1088. doi:10.1016/j.celrep.2017.10.001

CrossRef Medline Google Scholar

[75] ↵

Steuernagel L, Meckbach C, Heinrich F, Zeidler S, Schmitt AO, Gültas M. 2019. Computational identification of tissue-specific transcription factor cooperation in ten cattle tissues. PLoS One 14: e0216475. doi:10.1371/journal.pone.0216475

CrossRef Medline Google Scholar

[76] ↵

Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302: 249–255. doi:10.1126/science.1087447

Abstract/FREE Full Text

[77] ↵

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III., Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive integration of single-cell data. Cell 177: 1888–1902.e21. doi:10.1016/j.cell.2019.05.031

CrossRef Medline Google Scholar

[78] ↵

Teng H, Yuan Y, Bar-Joseph Z. 2022. Clustering spatial transcriptomics data. Bioinformatics 38: 997–1004. doi:10.1093/bioinformatics/btab704

CrossRef Google Scholar

[79] ↵

Traag VA, Waltman L, Van Eck NJ. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9: 5233. doi:10.1038/s41598-019-41695-z

CrossRef Medline Google Scholar

[80] ↵

Walton KD, Freddo AM, Wang S, Gumucio DL. 2016. Generation of intestinal surface: an absorbing tale. Development 143: 2261–2272. doi:10.1242/dev.135400

Abstract/FREE Full Text

[81] ↵

Welch JD, Hartemink AJ, Prins JF. 2017. MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol 18: 138. doi:10.1186/s13059-017-1269-0

CrossRef Medline Google Scholar

[82] ↵

Wilker PR, Kohyama M, Sandau MM, Albring JC, Nakagawa O, Schwarz JJ, Murphy KM. 2008. Transcription factor Mef2c is required for B cell proliferation and survival after antigen receptor stimulation. Nat Immunol 9: 603–612. doi:10.1038/ni.1609

CrossRef Medline Google Scholar

[83] ↵

Wolf FA, Angerer P, Theis FJ. 2018. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19: 15. doi:10.1186/s13059-017-1382-0

CrossRef Medline Google Scholar

[84] ↵

Xu M, Bai X, Ai B, Zhang G, Song C, Zhao J, Wang Y, Wei L, Qian F, Li Y, et al. 2022. TF-Marker: a comprehensive manually curated database for transcription factors and related markers in specific cell and tissue types in human. Nucleic Acids Res 50: D402–D412. doi:10.1093/nar/gkab1114

CrossRef Medline Google Scholar

[85] ↵

Yao Z, van Velthoven CT, Kunst M, Zhang M, McMillen D, Lee C, Jung W, Goldy J, Abdelhak A, Aitken M, et al. 2023. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. Nature 624: 317–332. doi:10.1038/s41586-023-06812-z

CrossRef Medline Google Scholar

[86] ↵

Yeger-Lotem E, Sharan R. 2015. Human protein interaction networks across tissues and diseases. Front Genet 6: 257. doi:10.3389/fgene.2015.00257

CrossRef Google Scholar

[87] ↵

Yuan Y, Bar-Joseph Z. 2020. GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data. Genome Biol 21: 300. doi:10.1186/s13059-020-02214-w

CrossRef Medline Google Scholar

[88] ↵

Zhang K, Feng W, Wang P. 2022. Identification of spatially variable genes with graph cuts. Nat Commun 13: 5488. doi:10.1038/s41467-022-33182-3

CrossRef Google Scholar

[89] ↵

Zitnik M, Leskovec J. 2017. Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33: i190–i198. doi:10.1093/bioinformatics/btx252

CrossRef Medline Google Scholar

A unified analysis of atlas single-cell data

Abstract

Results

GIANT integrates multimodality and multitissue data

Genes closely embedded in the space show tissue-specific GO enrichment

Testing GIANT integration on a developmental atlas data set

Inferring gene functions from multimodality multitissue embeddings

Gene embeddings enable the discovery of tissue-specific gene regulations

Discussion

Methods

Data sets

scRNA-seq data sets

scATAC-seq data sets

Slide-seq data sets

Clustering cells with the HuBMAP modality-specific pipelines

Constructing gene graphs for cell clusters

Connecting gene graphs through a cell cluster dendrogram

Learning gene embeddings from gene graphs and the cell cluster dendrogram

Splitting embedding space into components and identifying enriched GOs and TF regulons in them

Inferring multicellular functions of genes

Software availability

Competing interest statement

Acknowledgments

Footnotes

References

This article has not yet been cited by other articles.

This Article

Article Category

Services

Citing Articles

Google Scholar

PubMed/NCBI

ORCID

Share

Preprint Server

Navigate This Article

Current Issue

In This Issue