Comprehensive characterization of tissue-specific chromatin accessibility in L2 Caenorhabditis elegans nematodes

Recently developed single-cell technologies allow researchers to characterize cell states at ever greater resolution and scale. Caenorhabditis elegans is a particularly tractable system for studying development, and recent single-cell RNA-seq studies characterized the gene expression patterns for nearly every cell type in the embryo and at the second larval stage (L2). Gene expression patterns give insight about gene function and into the biochemical state of different cell types; recent advances in other single-cell genomics technologies can now also characterize the regulatory context of the genome that gives rise to these gene expression levels at a single-cell resolution. To explore the regulatory DNA of individual cell types in C. elegans, we collected single-cell chromatin accessibility data using the sci-ATAC-seq assay in L2 larvae to match the available single-cell RNA-seq data set. By using a novel implementation of the latent Dirichlet allocation algorithm, we identify 37 clusters of cells that correspond to different cell types in the worm, providing new maps of putative cell type–specific gene regulatory sites, with promise for better understanding of cellular differentiation and gene regulation.


Supplementary Files
1. Supplementary Code.tgz -This file contains all of the code for running data and analysis pipelines, along with Jupyter notebooks for reproducing paper figures. It can also be found at https://github.com/tdurham86/L2 sci-ATAC-seq.
2. Supplementary LDA Code.tgz -This file contains the latent Dirichlet allocation implementation. It can also be found at https://github.com/gevirl/LDA.  (Kudron et al., 2018) from different developmental stages, and compared this to a randomlydrawn null distribution of overlaps across developmental stages using the log 2 ratio. Error bars show the 95% confidence interval from comparison to 100 random samples.

Supplementary Figures
Supplementary Figure S3: sci-ATAC-seq peaks tend to be over regions with histone marks associated with gene activation. For each sci-ATAC-seq peak, we computed the average ChIPseq signal from four histone modification data sets collected in L2 worms Jänes et al. (2018), took the log 2 ratio of each peak's signal to the mean ChIP-seq signal across the whole chromosome, and then plotted the distribution of these ratios (blue side of each violin plot). We then randomly shuffled the sci-ATAC-seq peak regions and re-computed the same peak-mean to chromosome-mean ratios (orange side of each violin plot). The horizontal dotted lines in each violin plot show the quartiles of the distributions. Compared to randomly-shuffled locations, sci-ATAC-seq peaks are enriched for H3K4me3 and H3K4me1 histone marks, which are associated with active regulatory regions; uncorrelated with the H3K36me3 histone mark, which is found over actively transcribed gene bodies; and depleted for signal from the repressive H3K27me3 histone mark.

A B
Supplementary Figure S4: Tuning the number of topics using 5-fold cross validation. Models were trained on 4 folds and tested on a held-out fold for varying numbers of topics. The average minimum topic number (solid line) was calculated, and used as the basis to pick a number of topics 1.5 times greater for use in training the full LDA model ( 5   24  51  47  13  23  1  17  43  30  9  41  36  46  10  25  31  48  53  12  21  27  7  18  37  40  4  35  19  6  16  15  33  0  45  14  32 Figure S6: Sci-ATAC-seq peaks called from the whole-worm refinement LDA analysis show enrichment and depletion for overlaps with TF ChIP-seq peaks that is consistent with the TF tissue-specificity. We extended the analysis shown in Fig. 4 to all TFs with ChIP-seq data. The heatmap shows the mean log 2 ratio of the overlap counts for the peaks called based on each LDA topic to 500 randomly sampled sets of matched size from each topic. Each row displays the results for a single TF across all 37 topics used for clustering in the refinement LDA analysis. The rows and columns are hierarchically clustered based on Euclidean distance, using sklearn.metrics.pairwise.nan_euclidean_distances to ignore any TF/topic pairs with no overlaps (grey in the heatmap). TFs with similar tissue-specific expression patterns tend to show similar enrichment and depletion patterns for overlaps with peaks from different topics, and these patterns are consistent with our inferred tissue identities for the topics (Figs. 5, S9).
Supplementary Figure S7: Selecting the top 250 most specific peaks for each topic. For each topic, ranking peaks by the fraction of cut sites from each peak that LDA assigned to that topic reveals that a small proportion of peaks are highly-specific to each topic. The purple vertical lines show the 250 peak threshold that we used for selecting peaks to cross-reference with the scRNA-seq data for associating cell types with the different topics. The red vertical lines show where the weights of the peaks for each topic reach zero; any peaks to the right of the red line have no cut sites assigned that topic. Note that this figure includes all topics from the whole worm refinement LDA model, regardless of whether they were later used for clustering or not.  Fig. 5, we computed the tissue enrichment values by normalizing the tissue expression values for each gene to sum to one, then calculating the log 2 ratio of the mean expression distribution for the top 250 genes by peak topic-specificity to the mean tissue expression distribution of 250 randomly-selected genes. This was repeated for 100 random samples of 250 genes, and the mean log 2 ratio was plotted with error bars indicating the 95% confidence interval for the enrichment of each tissue.  Figure S10: The most topic-specific peaks tend to be depleted of overlaps from other data sets. We counted the fraction of the top 250 most specific genes (Fig. S7) that overlap a TF ChIP-seq site (A) or a bulk ATAC-seq site (Jänes et al., 2018) (B) and compared these fractions to the total fraction of sci-ATAC-seq peaks with overlaps ( Fig. 2A) using the log 2 ratio. For most topics, the highly specific peaks are depleted for overlaps relative to all sci-ATAC-seq peaks. The topics are ordered on the x-axis by their cluster density from Fig. S5.
Supplementary Figure S11: Proximal chromatin accessibility patterns suggest tissuespecific gene regulation of broadly expressed genes. We calculated the entropy of the tissue expression distribution for each gene in the scRNA-seq, and the entropy of the number of overlaps with peaks from different tissues for the peaks within 1200 bp of each gene. Lower entropy scores indicate more tissue-specificity. We visualized the data for 13,111 genes as two-dimensional kernel density estimate plot. We colored the scatter plot by the peak diversity score of each gene, which we define as the number of unique combinations of overlapping peaks from different tissues that are associated with each gene. Genes with high diversity scores tend to have many different peak arrangements involving multiple tissue types, suggesting a more complex regulatory environment at that gene.  Figure S12: Some genes with multiple start sites show promoter accessibility that suggests tissue-specific isoform usage. (A) Gene T23E7.2 is predominantly expressed in muscle and pharynx according to scRNA-seq. There is a large peak of accessibility in muscle cells (red) at the start of the long isoform, and a smaller peak of pharynx accessibility (blue) at the start of the medium/short isoform. The FACS embryo RNA-seq Warner et al. (2019) shows elevated pharyngeal contribution to the global expression in the last two exons. (B) Gene lin-2 shows a broader expression pattern across tissues, with the accessibility pattern suggesting the long isoform is expressed in muscle (red), intestine (yellow), and pharynx (blue), while the shorter isoform shows gonad (pink) and neuron (purple) accessibility. Again, the embryo RNA-seq supports the sci-ATAC-seq data, with greater neuron signal over the short isoform exons. (C ) By scRNA-seq, gene C30F2.2 is mostly expressed in pharynx (blue) and neurons(purple), with some expression in intestine (yellow). There is prominent pharyngeal and intestine accessibility over the start of the long isoform, and a neuron peak over the short isoform. By scRNA-seq, gene C30F2.3 on the opposite strand is expressed in pharynx, with very little intestine or neuron expression. There is also an intriguing complex pattern of accessibility and RNA-seq signal downstream of C30F2.3, giving one example of the complexity captured in these data.  unc-62, lin-39, unc-129, unc-3, unc-4, unc-5 unc-25, unc-30, unc-46, unc-47 unc-17, cho-1, cha-1, ceh-24, ceh-17, acr-15, pdf-1, lim-4, glr-4, glr-5, acr-15  . Each scatter plot dot represents a cell, and the number of genes with nearby accessibility in a given cell is shown by the color and size of its dot on the scatter plot, with cells showing accessibility near more marker genes having dots that are larger and more yellow. Information below each plot details the names of the neurons being highlighted, the type of neuron, and the marker genes used to generate the plot.
Supplementary Figure S17: Topic probabilities for neuron subclustering LDA analysis. Part 1. UMAP plots displaying the results of performing our iterative LDA procedure on only neuron cells (topics 0, 6, 14, 15, 16, 18, 19, 32, 33, 38, and 45 in the whole worm refinement LDA, see Fig. S9). Each dot in the scatterplot represents one cell, and in each plot the cells are colored by their probability for each LDA topic.
Supplementary Figure S18: Topic probabilities for neuron subclustering LDA analysis. Part 2. UMAP plots displaying the results of performing our iterative LDA procedure on only neuron cells (topics 0, 6, 14, 15, 16, 18, 19, 32, 33, 38, and 45 in the whole worm refinement LDA, see Fig. S9). Each dot in the scatterplot represents one cell, and in each plot the cells are colored by their probability for each LDA topic.
Supplementary Figure S19: Topic probabilities for neuron subclustering LDA analysis. Part 3. UMAP plots displaying the results of performing our iterative LDA procedure on only neuron cells (topics 0, 6, 14, 15, 16, 18, 19, 32, 33, 38, and 45 in the whole worm refinement LDA, see Fig. S9). Each dot in the scatterplot represents one cell, and in each plot the cells are colored by their probability for each LDA topic. Markers: emb-9, dig-1, let-2, mig-6, snf-11, glb-26, aex-2, bgal-1, hex-1 Supplementary Figure  Here, the gonad LDA analysis largely separates the cells with accessibility near these two genes (A), suggesting that the cells with high topic 0 probability (Fig. S25) are candidate distal tip cells, while most of the others are candidate germline cells. This observation is also supported by looking for accessibility near the fbf-1 and fbf-2 genes (B), which encode RNA binding proteins that function downstream of GLP-1 to maintain germ cells in the mitotic state. The candidate distal tip cells also show coaccessibility near other distal tip cell marker genes identified from the single cell RNA-seq data (C ), and similarly, additional germline marker genes show nearby coaccessibility in the same cells that have accessible sites near the fbf genes (D).
Supplementary Figure S27: Topic probabilities for hypodermis subclustering LDA analysis. UMAP plots displaying the results of performing our iterative LDA procedure on only hypodermal cells (topics 1, 9, 10, 17, 25, 30, 41, 43, and 46 in the whole worm refinement LDA, see Fig. S9). Each dot in the scatterplot represents one cell, and in each plot the cells are colored by their probability for each LDA topic. Supplementary Figure S30: Marker genes identify pharynx subclusters. To identify pharynx subclusters, we assessed co-accessibility of sites near genes enriched in expression in the pharyngeal tissues identified in Cao et al. 2017(Cao et al., 2017. The genes that we selected have greater than five-fold enrichment in the specified pharyngeal tissue compared to all other tissues, as reported by the GExplore website (http://genome.sfu.ca/gexplore/gexplore search tissues.html). Note that a relatively small subset of the genes matching the expression criteria in these tissues have nearby peaks, probably because our experiment recovered relatively few pharyngeal cells, reducing our power to detect pharynx-specific peaks. Nevertheless, we find that the genes with enriched expression in different pharyngeal tissues show nearby co-accessibility in cells with high probability in different topics. In particular, cells with high probability for topics 3 and 4 ( Fig. S29) have more accessibility near genes expressed in pharyngeal neurons (A), cells with high probability in topic 1 (Fig. S29) have more accessibility near genes expressed in pharyngeal gland (B), cells with high probability in topic 2 (Fig. S29) have more accessibility near genes expressed in pharyngeal epithelium (C ), and cells with high probability in topic 0 (Fig. S29) have more accessibility near genes expressed in pharyngeal muscle (D).

A B
Ranked cells  Figure S32: Filtering peaks found in too many or too few cells. Peaks were ranked by the fraction of cells in which they were detected (blue line), and outlier peaks were filtered out. The thresholds (red vertical lines) were determined by automatically finding the inflection points in the ranking curve (orange line). (A) Filtering peaks before the whole-worm primary LDA iteration. (B) Filtering peaks before the whole-worm refinement LDA iteration.