SLIC-CAGE: high-resolution transcription start site mapping using nanogram-levels of total RNA

Cap analysis of gene expression (CAGE) is a methodology for genome-wide quantitative mapping of mRNA 5′ ends to precisely capture transcription start sites at a single nucleotide resolution. In combination with high-throughput sequencing, CAGE has revolutionized our understanding of the rules of transcription initiation, led to discovery of new core promoter sequence features, and discovered transcription initiation at enhancers genome-wide. The biggest limitation of CAGE is that even the most recently improved version (nAnT-iCAGE) still requires large amounts of total cellular RNA (5 µg), preventing its application to scarce biological samples such as those from early embryonic development or rare cell types. Here, we present SLIC-CAGE, a Super-Low Input Carrier-CAGE approach to capture 5′ ends of RNA polymerase II transcripts from as little as 5–10 ng of total RNA. This dramatic increase in sensitivity is achieved by specially designed, selectively degradable carrier RNA. We demonstrate the ability of SLIC-CAGE to generate data for genome-wide promoterome with 1000-fold less material than required by existing CAGE methods, by generating a complex, high-quality library from mouse embryonic day 11.5 primordial germ cells.

M. musculus HiSeq2500 SLIC-CAGE 10 4 4 a samples with L in the name denote the same library sequenced on multiple lanes, r in the name stands for replicate, samples with R1 and R2 denote the same library -sequenced in paired end mode, Read1 and Read2 b indicates which samples were prepared at the exact same time (in parallel). Same index denotes the same time of the experiment. c indicates if the samples were sequenced on the same lane in the same run d indicates which reads/samples were merged prior to analysis in R (merge option of samples within CAGEr)   Table S2). The forward primer contains the T7-promoter sequence, and a GN5 sequence (N -random nucleotide). The reverse primer dictates the length of the final carrier and introduces random nucleotides at the 3'end of carrier molecules (N6). After PCR-amplification, the templates are gelpurified, and the carrier molecules synthesised using run-off in vitro transcription. Carriers are then purified and a portion of it capped, followed by purification. Capped carriers are necessary to ensure that there is carrier left after the cap-trapping step, otherwise all carrier molecules would be eliminated from downstream steps. (

Sample collection and nucleic acid extraction
S. cerevisiae BY4741 strain was grown in YPD media, and when the cells reached the exponential phase, collection was done by centrifugation. The cell material was stored at -80 degrees C prior to RNA isolation.
S. cerevisiae total RNA was extracted from using the standard hot-phenol procedure. The extracted RNA was additionally purified using the Qiagen RNeasy kit (clean-up protocol, according to manufacturer's instructions). Isolated RNA was quantified using NanoDrop1000 and the quality of RNA assessed on the bioanalyzer (Agilent). The RNA samples were of high quality (RIN > 9).
Total RNA from mouse embryonic stem cells (E14 cell line) was extracted using Qiagen RNeasy kit (according to manufacturer's instructions). Isolated RNA was quantified using NanoDrop1000 and the quality of RNA assessed on the bioanalyzer (Agilent). The RNA samples were of high quality (RIN > 9).

Clustering of CTSSs into tag clusters and identification of dominant CTSS
CTSSs that pass the threshold of 1 TPM in at least one of the samples were clustered using a distancebased method implemented in the CAGEr package with a maximum allowed distance of 20 bp between the neighbouring CTSS.
For each tag cluster, a cumulative distribution of signal was calculated, and the boundaries of the tag cluster calculated using the 10th and 90th percentile of its signal. The distance between these boundaries represents the interquantile width of a tag cluster. The CTSS with the highest TPM value within a tag cluster is identified as the dominant CTSS (as implemented within CAGEr).
Tag clusters were annotated with their corresponding genomic locations using the ChIPseeker package (Yu et al. 2015). In S. cerevisiae libraries, promoters were defined as 1 kb windows centred on Ensembl (Aken et al. 2016) annotated transcriptions start sites (annotations imported from SGD) and in M. musculus libraries, promoters were defined as <= 1 kb or 1-3 kb from the UCSC annotated transcription start site.

Nucleotide and dinucleotide composition of CTSSs
CTSSs from each library were filtered prior to analysis to include only CTSS with at least 1 TPM. In each library the number of A, C, G or T-containing CTSS was counted, divided by the total number of filtered CTSSs and converted to a percentage. The same analysis was performed using only dominant TSS (identified using the CAGEr package as a CTSS with highest expression within a tag cluster).
For dinucleotide analysis, identified filtered CTSSs were extended to include one upstream nucleotide ([-1, +1] dinucleotides where +1 represents the identified CTSS) and the same analysis as described above repeated for 16 possible dinucleotides.

ROC curves
To assess accuracy of TSS identification for SLIC-CAGE and nanoCAGE libraries, we used nAnT-iCAGE libraries to define the set of true CTSSs and tag clusters. A true positive CTSS or a tag cluster corresponds to the CTSS or tag cluster in the nAnT-iCAGE library, while a false positive CTSS or a tag cluster exists only in the nanoCAGE or SLIC-CAGE library. ROC curves were generated in dependence of the CTSS or tag cluster TPM threshold in nanoCAGE or SLIC-CAGE libraries.

Heatmaps Bioconductor package (Perry M (18). heatmaps: Flexible Heatmaps for Functional
Genomics and Sequence Features. R package version 1.2.0) was used to visualize dinucleotide patterns (TA and GC) across sequences centred on the dominant TSS. Sequences were ordered by interquantile width of the containing tag cluster, with the sharpest on top and broadest tag cluster on the bottom of the heatmap. Raw data with the exact matching for TA or GC was smoothed prior to plotting using kernel smoothing within the heatmaps package. Each heatmap was divided into two sections based on tag cluster's IQ-widths. Empirical boundary (Supplemental Fig. S17A) was set to separate sharp (IQ-width <= 3 bp) and broad (IQ-width > 3) tag clusters identified in M. musculus libraries. The horizontal line/boundary was implemented using heatmaps options to partition heatmaps/rows of an image. Similarity of patterns between libraries was assessed by calculating the Jaccard distance between vectorized image matrices of smoothed heatmaps. Background similarity was assessed through calculation of Jaccard distance between vectorized image matrices of columnrandomized, smoothed heatmaps. Column randomization was performed 10000 times, and the distribution of Jaccard distances calculated per each permutation was plotted and compared to the true Jaccard distance.

TATA-box motif analysis in M. musculus libraries
SeqPattern package was used to scan the sequences for the occurrence of the TATA-box motif using a threshold of 80th percentile match to the TATA-box PWM (imported from the seqPattern package).
We further smoothed the obtained results using the kernel smoothing (heatmaps package) and plotted the results with sequences ordered by interquantile width of the containing tag cluster (sharpest on top and broadest on bottom of the tag cluster) and centred on the dominant TSS. The horizontal line in each heatmap represents the empirical boundary that separates sharp (IQ-width <= 3) and broad tag clusters (IQ-width > 3). Similarity between heatmaps was assessed as described above.
TATA-box metaplots (average signal/profile) were produced separately for sharp and broad tag clusters (see definition above). SeqPattern was used for scanning sequences using TATA-box PWM to identify 80% matches. The results were converted to the average signal using the heatmaps package with a 2 bp bin size. The final data was plotted using the ggplot2 package (Wickham 2009).

Nucleosome positioning signal in in M. musculus libraries -WW periodicity
WW dinucleotide (AA/AT/TA/TT) occurrence (average relative signal) was obtained using the heatmaps package separately for sharp and broad tag clusters (see definition above). A 2 bp bin size was used and the sequences were centred on the dominant TSS. As a control for the importance of centring the sequences on the dominant TSS, WW dinucleotide (AA/AT/TA/TT) occurrence was obtained as an average relative signal from sequences where each sequence is centred on a randomly chosen CTSS within a tag cluster. The final data was plotted using the ggplot2 package (Wickham 2009).

H3K4me3 signal around M. musculus tag clusters
H3K4me3 data for E14 cell line, mapped to mm10 was downloaded from ENCODE experiment ENCSR000CGO. Bam files for two replicates (accession numbers ENCFF997CAQ and ENCFF425ZMWO) were merged using samtools (Li et al. 2009) and the merged bam file was imported to R environment using the rtracklayer package (Lawrence et al. 2009) H3K4me3 coverage was calculated separately for reads mapping to minus or plus strand and minus strand reads subtracted from plus strand reads to get the subtracted H3K4me3 coverage.
Subtracted H3K4me3 coverage was visualized using heatmaps package centred on the dominant TSSs with sequences ordered by IQ-width of the containing tag clusters (sharpest on top, and broadest at the bottom of the heatmap). Each heatmap was divided into two sections based on tag cluster's IQ-widths as described above. Similarity between heatmaps was assessed as described above.
H3K4me3 coverage metaplots were produced separately for sharp and broad tag clusters (see definition above, only strongly supported dominant CTSSs with at least 5 TPM were used) using heatmaps package with a 3 bp bin size The final data was plotted using the ggplot2 package (Wickham 2009).

M. musculus tag cluster overlap with CpG islands
The CpG island track for mm10 was downloaded from the UCSC Genome Browser. Overlap with M. musculus tag clusters was visualized as a coverage heatmap using heatmaps package, centred on the dominant TSS with sequences ordered by IQ-width of the containing tag clusters (sharpest on top, and broadest at the bottom of the heatmap). Each heatmap was divided into two sections based on tag cluster's IQ-widths as described above.
CpG coverage metaplots were produced separately for sharp and broad tag clusters (see definition above) using heatmaps package with a 3 bp bin size. The final data was plotted using the ggplot2 package (Wickham 2009).

SLIC-CAGE mapping efficiency
When only 1 ng of total RNA is used with a 5000-fold more carrier (5 μg), 25% of the sequenced reads are uniquely mapped to the target organism, while the rest corresponds to the leftover carrier (27%), short amplified linkers or multimappers, commonly discarded from TSS analyses (Supplemental Tables S10 and S11). This amount of leftover carrier is minor and does not significantly compromise sequencing depth (10% or less when 10 ng of total RNA are used). We expect that with additional rounds of degradation and purification, the leftover carrier could be further reduced, although with a risk of sample loss, and we found it unnecessary.

Analysis of nanoCAGE XL data
The nanoCAGE XL library exhibited low correlation with the nAnT-iCAGE library at individual CTSSs and tag clusters expression level (Supplemental Fig. S14A and B). Although distribution of interquantile widths suggests that libraries are not of low complexity (Supplemental Fig. S14D), this may be a consequence of contamination with non-capped captured RNA, as nanoCAGE XL poorly captures promoter regions ( Supplementary Fig. S14E). In addition, only 7% of the dominant CTSSs identified in nanoCAGE XL libraries matched nAnT-iCAGE identified dominant CTSSs, while CTSS and initiator biases were even more prominent (Supplemental Fig. S14C and G) than in our dataset.

Additional comparison of PGC E11.5 replicate 1 and 2
Comparison of TATA-box and CpG heatmaps in Fig. 5 and Supplemental Fig. S18G shows that more TATA-box promoters and fewer CpG island-associated promoters are identified in PGC E11.5 replicate 2. Replicate 2 is derived from degraded RNA (see Supplemental Fig. S18A) and the higher background noise (Supplemental Fig. S24A) may increase the total signal level of the identified tag clusters. This is expected to have a greater influence on TATA-box promoters as they typically have precise transcription initiation resulting with sharp tag clusters (Haberle and Lenhard 2016). Lowly expressed CTSSs and tag clusters are typically excluded from analysis by applying a filtering threshold within the standard CAGEr pipeline (Haberle et al. 2015). Higher background noise in the data would increase the width and the total signal level of tag clusters, allowing a subset of lowly expressed tag clusters to pass the threshold and stay included in the dataset, while without the noise, they would be excluded. To see if this is the case with TATA-box promoters detected only in replicate 2, we compared expression levels and IQ-widths of PGC E11.5 replicate 1 and 2 (Supplemental Fig.   S24B,C, TATA-box promoters are identified as those that pass the threshold of minimum 80 th percentile of maximum TATA pwm score at -40-20 bp positions). Indeed, TATA-box promoters are broader and of higher expression levels in replicate 2 than in replicate 1. When dominant TSSs are compared (Supplemental Fig. S24D), the difference in expression levels is much smaller, indicating that the background noise is raising expression levels and IQ-width of TATA-box promoters in replicate 2. Further, we classified TATA-box promoters into those that are common to both replicate 1 and 2 and found in only in replicate 1 or 2. We compared IQ-width, total expression levels and expression levels of dominant TSSs identified in those TATA-box classes (Supplemental Fig. S24E-G). Indeed, TATA-box promoters specific for replicate 2 are broader and more expressed than replicate 1-specific TATA-box promoters, while dominant TSSs are highly similar. Further, genomic locations of replicate 2 tag clusters with identified TATA-boxes are in high percentage in distal intergenic regions, demonstrating again higher noise and lower specificity. GO enrichment analysis showed significant terms only for TATA-box promoters specific for PGC E11.5 replicate 1, while there was no enrichment for TATA-box promoters specific for replicate 2, again showing biological relevance of replicate 1 and noise in replicate 2.