Single-molecule long-read sequencing reveals the chromatin basis of gene expression

Genome-wide chromatin accessibility and nucleosome occupancy profiles have been widely investigated, while the long-range dynamics remain poorly studied at the single-cell level. Here, we present a new experimental approach, methyltransferase treatment followed by single-molecule long-read sequencing (MeSMLR-seq), for long-range mapping of nucleosomes and chromatin accessibility at single DNA molecules and thus achieve comprehensive-coverage characterization of the corresponding heterogeneity. MeSMLR-seq offers direct measurements of both nucleosome-occupied and nucleosome-evicted regions on a single DNA molecule, which is challenging for many existing methods. We applied MeSMLR-seq to haploid yeast, where single DNA molecules represent single cells, and thus we could investigate the combinatorics of many (up to 356) nucleosomes at long range in single cells. We illustrated the differential organization principles of nucleosomes surrounding the transcription start site for silent and actively transcribed genes, at the single-cell level and in the long-range scale. The heterogeneous patterns of chromatin status spanning multiple genes were phased. Together with single-cell RNA-seq data, we quantitatively revealed how chromatin accessibility correlated with gene transcription positively in a highly heterogeneous scenario. Moreover, we quantified the openness of promoters and investigated the coupled chromatin changes of adjacent genes at single DNA molecules during transcription reprogramming. In addition, we revealed the coupled changes of chromatin accessibility for two neighboring glucose transporter genes in response to changes in glucose concentration.


Supplemental Notes Note 1: Promoter openness and gene transcription
Using the MeSMLR-seq data, we generated the nucleosome occupancy profiles surrounding the TSSs of all protein-coding genes. Consistent with previous studies (Yuan et al. 2005;Hughes and Rando 2014), MeSMLR-seq data showed that highly-expressed genes had more pronounced nucleosome-depletion region in the upstream of TSS and well-positioned nucleosome array across gene body (Supplemental Fig. S5A, B). Nucleosome occupancy of the genes with high expression levels showed an obvious drop at TSS and distinct peaks within gene body, while such tendency was mild for the genes with the lower 25 th percentile expression level (Supplemental Fig. S5B).
In addition to nucleosome occupancy, the chromatin accessibility profiles by MeSMLRseq showed that the promoter regions of the highly-expressed genes were more accessible than the lowly-expressed genes (Supplemental Fig. S5C). It indicates the critical role of promoter accessibility on gene transcription regulation. We further examined the chromatin statuses of the binding regions of several important transcriptional regulators, including RNA polymerase II (Pol2), five general regulatory factors (Abf1, Cbf1, Mcm1, Rap1 and Reb1) and two mediators (Med8 and Med17) (Supplemental Methods) (Park et al. 2013;Grunberg et al. 2016;Rossi et al. 2018). The enrichment signal of Pol2 in gene body was positively correlated with chromatin accessibility of gene promoter (Supplemental Fig. S6A). The binding regions of the other regulatory factors and mediators were relatively accessible and nucleosome-evicted, which allows the assembly of transcription initiation complex (Supplemental Fig. S6B-E).

Note 2: Dynamic change of chromatin status in response to different carbon sources
We next sought to investigate the dynamics of chromatin status during transcription changes in response to different nutrition conditions. Carbon source is the basic nutrition and is essential for yeast growth (Paulo et al. 2015). In addition to glucose (Glu), which is the preferred carbon source for S. cerevisiae, we grew yeast cells separately using galactose (Gal) and raffinose (Raf) carbon sources, and generated both MeSMLR-seq and RNA-seq data. Compared to those under Gal and Raf conditions, yeast cells under Glu showed more accessible promoter (Supplemental Fig. S7A). 21.62% (1,384 of 6,713) of protein-coding genes were differentially expressed between Glu and Gal, and 20% (1,332 of 6,713) between Glu and Raf, which indicated significant transcription reprogramming in response to different carbon sources (Supplemental Fig. S7B). The upregulated genes in Glu compared to Gal or Raf were mainly located in cytoplasm and involved in the biogenesis of ribosomes (Supplemental Fig. S7C). In contrast, the upregulated genes in both Gal and Raf conditions compared to Glu were significantly related to the oxidation-reduction process and carbon metabolism, and were located in mitochondrion. Those significantly up-regulated genes in Glu underwent more difference of chromatin accessibility in their promoters (P-value=1.2 × 10 -14 for Glu vs. Gal, P-value=3.6 × 10 -11 for Glu vs. Raf, Wilcoxon rank sum test, Supplemental Fig. S7D), which contributed the overall high chromatin accessibility in the preferred carbon source (Glu) over Gal and Raf (Supplemental Fig. S7A).

Analyses of ATAC-seq, DNase-seq, MNase-seq, ChIP-seq, ChIP-exo and ChEC-seq data
The information (including yeast strain, growth condition, GEO accession number, data format and reference) of public sequencing data used in this study was summarized in Supplemental Table S6.
Quality control of raw sequencing data (FASTQ format) was performed using FastQC and cutadapt; and alignment was performed using Bowtie2 software (version 2.2.5) (Langmead and Salzberg 2012) with default parameters.
For MNase-seq data (Weiner et al. 2015), iNPS (Chen et al. 2014)with default parameters was used for nucleosome calling.

Correlation and overlapping analyses between MeSMLR-seq and MNase-seq
For correlation analysis of the bulk-cell level nucleosome occupancy results, we used iNPS to generate nucleosome occupancy profiles (BigWig format) for MNase-seq and MeSMLR-seq, respectively. Pearson correlation coefficient of nucleosome occupancy profiles (across whole genome and bin size as 10 bp) was calculated between two methods (Fig. 3A).
For overlapping analysis of nucleosomes, we only considered the two nucleosome peaks (from MeSMLR-seq and MNase-seq, respectively) as overlapped if ≥50% region of one peak was covered by another peak (Fig. 3C).

Correlation and overlapping analyses among MeSMLR-seq, ATAC-seq and DNaseseq
For correlation analysis of the bulk-cell level chromatin accessibility results, we generated genome-wide chromatin accessibility profiles (BigWig format) for three methods, separately. Pearson correlation coefficients of chromatin accessibility profiles (across the whole genome and bin size of 10 bp) were calculated among three methods (Fig. 5A).
For MeSMLR-seq data, we separately called significantly-enriched peaks for molecules aligned to forward and reverse strands. Only the overlapped peaks between the forward and reverse strands for MeSMLR-seq data, and the overlapped peaks between two biological replicates for ATAC-seq and DNase-seq were used for overlapping analysis (Fig.  5C).

Single-cell RNA-seq experiment and data analysis
Yeast cells growing in YPD (1% yeast extract, 2% peptone and 2% glucose) medium were collected and spheroplasts were prepared as described above. Cell viability was measured using Trypan blue exclusion method and cell number was counted by hemocytometer. Of note, considering the fragility of spheroplasts, we modified the loading strategy of buffer before running the 10X ChromiumTM Controler (10X Genomics). Firstly, Single Cell Master Mix (10X Single Cell 3' Reagent Kit v2) was prepared and added into Single Cell A Chip. Next, instead of nuclease-free water, sorbitol was added (final conc. = 1 M) and mixed well. Finally, spheroplasts suspended in 1 M sorbitol were added. In total, 318 million read pairs (2 x 150 bp) were generated by Illumina HiSeq 4000 platform.
The quality of single-cell RNA-seq (scRNA-seq) data was evaluated by FastQC software. Cellranger software (version 2.1.1) with default parameters was used to process scRNAseq data and generate the gene-cell matrix. For quality control of scRNA-seq data, we excluded the cells with >10,000 UMI (unique molecular identifier) counts as they were potentially from artificial cell or cell duplets (Stegle et al. 2015). After quality control, 2,812 single cells with 4,335 UMI counts (median value) per cell and 103,002 read pairs (median value) per cell were used in the following analyses. The number of expressed genes (≥1 UMI) per cell was 1,572 (median value). DESeq2 package (Anders and Huber 2010) was used to normalize scRNA-seq UMI count data for 2,812 cells.

Bulk-cell RNA-seq experiment and data analysis
Total RNA was extracted using Quick-RNA Fungal/Bacterial Miniprep Kit (Zymo Research). Sequencing library was prepared using TruSeq Stranded mRNA Library Prep Kit and 10 million read pairs (2 x 150 bp) on average per sample were generated using Illumina HiSeq 4000 platform. Three biological replicates per biological condition were performed.
The quality of bulk-cell RNA-seq data was evaluated by FastQC software (version 0.11.3, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and sequencing adaptors were trimmed by Cutadapt software (version 1.8.1) (Marcel 2011). Processed reads were aligned to reference genome (version UCSC sacCer3) by Hisat2 software (version 2.0.0beta) (Kim et al. 2015) with default parameters. Cufflinks (version 2.2.1) (Trapnell et al. 2010) with default settings were separately used for quantifying gene expression, normalizing gene expression and analyzing differential gene expression. The FPKM (Fragments Per Kilobase Million) value was calculated as the expression level of genes. The cutoff of statistical significance for differential gene expression analysis was q-value < 0.01.
The bulk-cell RNA-seq data was summarized in the Supplemental Table S7.

Supplemental Figures
Figure S1