Unsupervised contrastive peak caller for ATAC-seq

The assay for transposase-accessible chromatin with sequencing (ATAC-seq) is a common assay to identify chromatin accessible regions by using a Tn5 transposase that can access, cut, and ligate adapters to DNA fragments for subsequent amplification and sequencing. These sequenced regions are quantified and tested for enrichment in a process referred to as “peak calling.” Most unsupervised peak calling methods are based on simple statistical models and suffer from elevated false positive rates. Newly developed supervised deep learning methods can be successful, but they rely on high quality labeled data for training, which can be difficult to obtain. Moreover, though biological replicates are recognized to be important, there are no established approaches for using replicates in the deep learning tools, and the approaches available for traditional methods either cannot be applied to ATAC-seq, where control samples may be unavailable, or are post hoc and do not capitalize on potentially complex, but reproducible signal in the read enrichment data. Here, we propose a novel peak caller that uses unsupervised contrastive learning to extract shared signals from multiple replicates. Raw coverage data are encoded to obtain low-dimensional embeddings and optimized to minimize a contrastive loss over biological replicates. These embeddings are passed to another contrastive loss for learning and predicting peaks and decoded to denoised data under an autoencoder loss. We compared our replicative contrastive learner (RCL) method with other existing methods on ATAC-seq data, using annotations from ChromHMM genomic labels and transcription factor ChIP-seq as noisy truth. RCL consistently achieved the best performance.

Figures S3-S7 show a more detailed Gene Ontology enrichment analysis accompanying Results Section "Gene Ontology analysis".ChIP−R HMMRATAC LanceOtron RCL-C2 0.6 0.9 1.2 1.5 0.6 0.9 1.2 1.5 0.6 0.9 1.2 1.5 0.6 0.9 1. Figure S3: Gene Ontology analysis of MCF-7 data.Peaks included in analysis are (A) randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool; (C) randomly downsampled from unique peaks to the second smallest number of unique peaks found by any method.LanceOtron unique peaks were not included in (C) due to its small number of peaks (see Supplementary Table S1).Only relevant terms enriched in at least one peak set are plotted.Colors correspond to − log 10 (Binomial Adjusted P-value) where the adjustment was done following the Benjamini-Hochberg (Benjamini and Hochberg 1995) procedure; dot sizes correspond to the observed number of genes associated with the term; x-axis corresponds to log 2 (Fold change+0.1)and vertical line corresponds to fold change of two, which was used to indicate biological significance.Abbreviations: reg., regulation; pos., positive; neg., negative; sig., signaling; morph., morphogenesis; dev., development; mesen., mesenchymal; prol., proliferation.Figure S4: Gene Ontology analysis of A549 data.Peaks included in analysis are (A) randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool; (C) randomly downsampled from unique peaks to the second smallest number of unique peaks found by any method.LanceOtron unique peaks were not included in (C) due to its small number of peaks (see Supplementary Table S1).Only relevant terms enriched in at least one peak set are plotted.See Figure S3 legend for details.: Gene Ontology analysis of GM12878 data.Peaks included in analysis are (A) randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool; (C) randomly downsampled from unique peaks to the third smallest number of unique peaks found by any method.ChIP-R and LanceOtron unique peaks were not included in (C) due to their small numbers of peaks (see Supplementary Table S1).Only relevant terms enriched in at least one peak set are plotted.See Figure S3 legend for details.randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool; (C) randomly downsampled from unique peaks to the third smallest number of unique peaks found by any method.HMMRATAC and LanceOtron unique peaks were not included in (C) due to their small numbers of peaks (see Supplementary Table S1).Only relevant terms enriched in at least one peak set are plotted.See Figure S3 legend for details.

S2 Supplementary Tables
Supplementary Table S1 reports the number of peaks called by each method, including unique peaks not overlapping with peaks called by any other method.Table 1 reports the number of called peaks overlapping with ChromHMM labels.All supplementary tables, including S1, are also available as excel spreadsheets.

Number of Peaks Dataset
Method Total Unique

MCF-7
MACS2 (q = 0.05) 68,694 NC ChIP-R (q = 0.05) 69,864 5,201 MACS2 (q = 0.5) 73,817 NC ChIP-R (q = 0.5) Table S2: Full Gene Ontology (GO) analysis results on MCF-7 data.To carry out analyses for each tool, three sets of peaks were used: all called peaks, randomly downsampled peak sets from all called peaks, and unique called peaks.All biological process terms, including those that are not significantly enriched, are reported.A biological process term is considered enriched if its binomial q-value ≤ 0.05, binomial fold change ≥ 2, and the observed number of associated genes was ≥ 5. See excel file Supplementary Table S2.xlsx.
Table S3: Full Gene Ontology (GO) analysis results on mouse placenta data.See Table S2 legend for more details.
Table S4: Full Gene Ontology (GO) analysis results on GM12878 data.See Table S2 legend for more details.

S3.1 Model Development
Data from the MCF-7 cell line (see Methods Section "ATAC-seq data acquisition") was used to develop the RCL method.During this process, we used different truth labels and procedures for calling peaks than those used in final method assessment.In particular, we used RCL to identify candidate peaks as those α bp segments i with average peak probability R r=1 q ri1 /R > 0.5, where q ri1 is the probability segment i is a peak in replicate r as defined in Results Section "The RCL algorithm".Then, we defined candidate peak segments to be true open segments if they overlapped (≥ 1 bp) regions marked by H3K27ac histone modification in relevant ChIP-seq data (peaks from ENCODE experiment ID ENCFF491LQY, called by MACS2).This histone modification was chosen since it marks active promoters and active enhancers across the genome (Creyghton et al. 2010).Here, only autosomal chromosomes were used.The H3K27ac histone mark was just one of at least six marks available to ChromHMM during annotation (Kundaje et al. 2015), so the labels used in RCL model development are related but not identical to the labels used in method assessment.

S3.1.1 RCL hyperparameter tuning
RCL involves several hyperparameters.The backbone model is composed of multiple CNN layers, with dilation and kernel size hyperparameters that we wanted to optimize.Other hyperparameters are set to standard default values (number of epochs = 25, batch size = 256, learning rate = 10 −4 and temperature τ 1 = τ 2 = 0.5).Using MACS2 peaks called from H3K27ac ChIP-seq marks as true labels, we explored dilations 1, 4, and 8 and kernel sizes 31, 41, 51, and 61 in the MCF-7 cell line data, evaluating precision, recall, and F1 across 22, excluding sex, chromosomes.We observed that kernel size has little effect on precision and recall values, while dilation is influential.In particular, larger dilation improves the precision but lowers the recall, leaving F1 scores roughly unchanged (Supplementary Figure S8).Abbreviations: "aug", method that contrasts the coverage vector with the reversed coverage vector, "rep", method that contrasts the coverage vector with the coverage vector of a second replicate, "self", method that contrasts the coverage vector with itself.

S3.1.2 Ablation study: contrasting biological replicates improves method performance
To study the value of replicates, we contrasted the coverage vector from the first replicate in the MCF-7 data with entities other than the coverage vector of the second replicate.Specifically, in equation ( 1), x r i is replaced with x ri (self) or x ri in reversed order (augmentation).Contrasting to the coverage vector of the second replicate gives the best performance overall, but contrasting with the augmentation formed by reversing the first coverage vector has only slightly worse performance (Supplementary Figure S8).Thus, a valid approach for using RCL when no biological replicates are available might be to create viable augmentations of the single coverage vector.On the other hand, contrasting the coverage vector of the first replicate to itself resulted in the worst performance and notably increased the variation in performance across the different chromosomes.Therefore, contrasting coverage vectors from different biological replicates during learning helps the model retain essential peak information and exclude distracting noise information to improve overall performance.

S3.1.3 Training vs. testing performance: little evidence of overfitting
RCL is an unsupervised method, so we fit the model to all the provided data, or a random subset if the dataset is large, then predict on all the provided data.While we are concerned that developing the model on MCF-7 data may limit the generalizability of RCL to other datasets, we do not need to separate training and testing data within a given dataset to guarantee no model overfitting.Nevertheless, RCL is sufficiently complex that developing or tuning it on set of segments may cause it to overfit that set of segments.We can use training and testing datasets to assess whether the RCL model is overfit if it has decreased performance on testing data.We randomly sampled 80% as the training data and predicted on the remaining "test data" in 10 replicated experiments.The performance is assessed across all chromosomes.The average precision and recall for the training set are 0.741 and 0.593, while the average precision and recall for the testing set are 0.731 and 0.602, representing a change of less than 2% (average absolute change for precision and recall are 0.01 and 0.02).There is no evidence of overfitting in the MCF-7 data.Further testing on four independent datasets confirms that the RCL advantage over other methods is robust across datasets, despite the model development done on MCF-7.

S3.2 Score distribution analysis shows clear separation between peak and nonpeak regions predicted by RCL
To understand the connection between peak predictions and the scores assigned by each method, we analyzed the score distribution of peaks called within the ChromHMM labeled regions.In particular, we examined scores from MACS (-q 0.05), ChIP-R (-a 0.05), LanceOtron, HMMRATAC and RCL-MED, for MCF-7, A549, and K652, or RCL-C2 for the GM12878 and mouse placenta data using pairs plots (Figure S9).In general, we observed that MACS and ChIP-R scores are highly correlated, likely because MACS and ChIP-R scores were computed from the same input regions called on individual replicates by MACS (see Methods).RCL scores tend to be least correlated with scores of the other methods, suggesting RCL is detecting distinct signals of peaks.Overall, the separation between peak and non-peak scores is more obvious for RCL than other methods, though the deep learner LanceOtron also shows strong separation.For example, some predictions are exceptionally confident for RCL-C2 (absolute logit-transformed scores > 10) and LanceOtron (absolute logit-transformed scores > 4) on the MCF-7 data (Figure S9).Interestingly, RCL also has many intermediate scores (logit transformed scores near 0), where the separation between peaks and non-peaks is less obvious.We calculated PRAUCs using just these intermediate-scoring regions for MCF-7, and RCL-C2 still achieves better separation (PRAUCs: 0.718, 0.483, 0.492 and 0.480 for RCL-C2, MACS, HMMRATAC and CHIP-R, respectively), indicating it makes better predictions even in challenging regions.

S3.3 RCL predictions are stable with more replicates and high coverage replicates
To assess how the number of replicates affects RCL performance, we supplied increasing numbers of biological replicates to RCL in the A549 (three total available replicates) and GM12878 (four replicates) data and used ChromHMM labels to evaluate the performance.Of note, one GM12878 replicate has higher library size than the others (64.8 million reads compared to an average 20.2 million reads in the other three replicates).When adding GM12878 replicates, we added the high library replicate last.RCL performance hardly changed with additional replicates, even the replicate with high coverage (Figure S10).It is possible that in these data, replicates are so similar that additional replicates provide no new information to the contrastive learner.Since the datasets we investigated were of high quality, it remains to be seen if particularly noisy replicates can affect RCL performance.Based on our experiments, however, we recommend using all available replicates.It does not hurt, hardly takes more computational time (see Discussion), and could help RCL call peaks.

S3.4 Extent of candidate peak regions A are crucial to RCL success
Peak calling involves identifying likely peaks and their boundaries.Methods are indirectly judged on their ability to identify region boundaries since evaluations typically check if the predicted peaks overlap with labeled regions (see Methods in Tarbell and Liu (2019) and Hentges et al. (2021)).To analyze the role of the candidate regions in A we used as prediction regions for RCL (see Results "Prediction region selection"), we conducted several experiments on the MCF-7 data (here, using t = 2, RCL-C2) and A549 (here, using default t, RCL-MED).The first set of experiments explored the role of the RCL score in its performance on MCF-7.We tried scoring the RCL candidate regions with the mean coverage across replicates (denoted as PRE in Figure S11A), and we mapped other methods' predictions to the RCL candidate peak regions in A, as we had for RCL segment scores: if multiple peaks overlap one candidate region, the region was scored with the mean score of the overlapping peaks (see Methods).For ChIP-R and MACS, we used the predictions and scores from runs with q-value = 0.5.The second set of experiments explored the role of the candidate regions A in RCL performance on A549.Instead of A prediction regions, we used the 1,000 bp segments as the prediction regions.We also attempted to use the decoder output mri to identify peak centers.Specifically, we identified all sites with denoised coverage above the 0.5 quantile within the 1,000 bp segment, merged adjacent sites, and used the largest merged region as the prediction region.
Candidate peak regions in A scored simply by average coverage performed better than all other competing methods except RCL-C2, where scores came from the contrastive learner (Figure S11A).These experiments confirm that high coverage is highly predictive of true peaks, and our method to choose candidate peak regions is effective.However, the experiments also show that other signals extracted by the contrastive learner contribut substantively to the best peak predictions.However, using these superior RCL scores on other prediction regions, either the 1,000 bp segments themselves or an approximation of the peak center estimated from the decoded signal, performed abysmally (Figure S11B).Further investigation (data not shown) indicated that signal from strong, narrow peaks in 1,000 bp regions leaked false positive calls into nearby non-peak regions.However, our attempt to shrink down 1,000 bp segments to peak centers performed even less well.The difficulties revealed by these experiments stem from the fact that RCL is not determining the peak extent.Further discussion on how an unsupervised learning method could accomplish both facets of peak calling, presence and extent, are mentioned in the main text Discussion.

Figure S1 :
Figure S1: Distribution of default threshold (cut-off) values for extracting candidate peak regions (see main text section Prediction region selection) across chromosomes.The GM12878 and day 9.5 mouse placenta datasets are not shown since their median coverage (among all covered sites) is one for all chromosomes.

Figure S2 :
Figure S2: Precision-Recall (PR) curve for mouse placenta data.See main text Figure 2 for PR curves for other datasets and more information.ChIP-R and MACS2 multiQ points at the upper, left have been overplotted by PR curves.
an epithelial sheet neg.reg. of epithelial cell differentiation neg.reg. of epithelial to mesen.transition neg.reg. of morph. of an epithelium neg.reg. of sprouting angiogenesis pos.reg. of epithelial to mesen.transition pos.reg. of morph. of an epithelium prostate gland epithelium morph.reg. of blood vessel endothelial cell prol.involved in sprouting angiogenesis reg. of epithelial cell prol.involved in lung morph.reg. of epithelial to mesen.transition reg. of epithelial to mesen.transition involved in endocardial cushion formation reg. of morph. of an epithelium reg. of sprouting angiogenesis log 2 (Fold change + 0 involved in determination of left/right asymmetry establishment of planar polarity of embryonic epithelium neg.reg. of epithelial cell proliferation involved in prostate gland development pos.reg. of epithelial to mesenchymal transition involved in endocardial cushion formation log 2 (Fold change + 0 FigureS5: Gene Ontology analysis of GM12878 data.Peaks included in analysis are (A) randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool; (C) randomly downsampled from unique peaks to the third smallest number of unique peaks found by any method.ChIP-R and LanceOtron unique peaks were not included in (C) due to their small numbers of peaks (see Supplementary TableS1).Only relevant terms enriched in at least one peak set are plotted.See FigureS3legend for details.
FigureS6: Gene Ontology analysis of K562 data.Peaks included in analysis are (A) randomly downsampled from all peaks to the smallest number of peaks found by any method; (B) all unique peaks found by each tool.Only relevant terms enriched in at least one peak set are plotted.See FigureS3legend for details.

Figure S8 :
Figure S8: Varying RCL hyperparameters and inputs.(A) Change in precision, recall and F1 score across 22 chromosomes at different choices of dilation in {1, 4, 8} for the MCF-7 cell line data.dn means dilation is n, k31 means kernel size is fixed at 31.(B) Change in precision, recall and F1 score for different choices of the contrasted data across chromosomes.Abbreviations: "aug", method that contrasts the coverage vector with the reversed coverage vector, "rep", method that contrasts the coverage vector with the coverage vector of a second replicate, "self", method that contrasts the coverage vector with itself.
Figure S11: PR curve when varying prediction region or prediction score.(A) We vary the score assigned to MCF-7 candidate regions in A. PRE represents the candidate peak regions scored with mean coverage across replicates; ChIP-R, MACS, HMMRATAC, and LanceOtron are the respective method predictions mapped to candidate peak regions in A; RCL-C2 are the peak predictions using the full RCL framework, duplicate of the same curve in Results Figure 2A.(B) We substitute prediction regions other than the candidate regions in A in A549.RCL-1000 uses 1,000 bp segments as prediction regions.RCL-pre uses the candidate regions defined in A, duplicate of the RCL-med curve in Results Figure 2D.RCL-decoder uses subregions within the 1,000 bp segments obtained from the decoded coverage (see text).

Table S1 :
Number of called peaks and uniquely called peaks by each method for each dataset.A uniquely called peak is one that does not overlap with a peak called by any other method.NC: not calculated.

Table S5 :
Full Gene Ontology (GO) analysis results on A549 data.See TableS2legend for more details.See Excel file Supplementary TableS5.xlsx.

Table S6 :
Full Gene Ontology (GO) analysis results on K562 data.See Table S2 legend for more details.See Excel file Supplementary Table S6.xlsx.

Table S7 :
Definitions of ChromHMM states and lists of transcription factor ChIP-seq data used for performance evaluation.See Excel file Supplementary TableS7.xlsx.