Dual threshold optimization and network inference reveal convergent evidence from TF binding locations and TF perturbation responses

A high-confidence map of the direct, functional targets of each transcription factor (TF) requires convergent evidence from independent sources. Two significant sources of evidence are TF binding locations and the transcriptional responses to direct TF perturbations. Systematic data sets of both types exist for yeast and human, but they rarely converge on a common set of direct, functional targets for a TF. Even the few genes that are both bound and responsive may not be direct functional targets. Our analysis shows that when there are many nonfunctional binding sites and many indirect targets, nonfunctional sites are expected to occur in the cis-regulatory DNA of indirect targets by chance. To address this problem, we introduce dual threshold optimization (DTO), a new method for setting significance thresholds on binding and perturbation-response data, and show that it improves convergence. It also enables comparison of binding data to perturbation-response data that have been processed by network inference algorithms, which further improves convergence. The combination of dual threshold optimization and network inference greatly expands the high-confidence TF network map in both yeast and human. Next, we analyze a comprehensive new data set measuring the transcriptional response shortly after inducing overexpression of a yeast TF. We also present a new yeast binding location data set obtained by transposon calling cards and compare it to recent ChIP-exo data. These new data sets improve convergence and expand the high-confidence network synergistically.

increases the size of the networks over fixed thresholds (red bars). This is true for both the intersection of the TFKO and ZEV networks (left bar grouping) and their union (right bar grouping). Post processing expression data with NetProphet 2.0 (green bars) further increases the size of the networks. See Fig. 2 for the numbers of TFs showing acceptable convergence in these analyses. (C) Comparison of yeast networks derived from DTO on ChIP data and response data processed using several network inference algorithms (D) Comparison of human K562 networks derived from DTO on ChIP-seq and raw response data or network inference processed response data. Bar height is the fraction of TFs showing acceptable convergence divided by the number of TFs that were ChIPped and perturbed by either TFKD or CRISPR. Network inference postprocessing improves convergence over unprocessed response data. (E) Similar to (D), but considering only TFs that were not directly perturbed. These TFs cannot be analyzed for convergence without network inference. Figure S5. (A) Among 16 TFs for which we have data in TFKO and ChIP-exo in four nutrient limited conditions, the number of TFs that show convergence between ChIP-exo data and TFKO response data (dotted blue) or network inference processed response data (other colored lines). (B) Same as (A) except that ZEV15 data replaces TFKO data. (C) Comparison of networks derived from DTO on calling cards data and response data processed through several network inference algorithms. For each TF that has any enriched GO term, the five most enriched terms when the targets of each TF are chosen by using fixed thresholds on Harbison ChIP and TFKO data. In most cases these same terms are even more significantly enriched when the targets are chosen by using a different method --dual threshold optimization comparing calling cards data to output from NetProphet 2.0 run on the TFKO and ZEV expression data (red bars). The numbers to the right of the bars indicate the number of genes with a given GO term among the targets of the TF. (B) For each TF that has any enriched GO term, the five most significantly enriched terms chosen by using dual threshold optimization comparing calling cards data to output from NetProphet 2.0 run on the TFKO and ZEV expression data. In many cases, terms with one or two fewer genes are more familiar than the terms with the most genes. For example, Gcr2 has 12 target genes annotated with "glycolytic process", corresponding to its accepted function, but the most significant term is "ADP metabolic process", which contains those 12 glycolytic genes plus one additional gene. Figure S7. Same as Fig. 5A. The fraction of most strongly TF-bound genes that are responsive to the perturbation of that TF, as a function of the number of most-strongly bound genes considered. Shown are the other 11 TFs, in addition to Leu3, that had data available in Harbison ChIP, transposon calling cards, TFKO, and ZEV15.

DATA PREPARATION Yeast gene and TF definitions
For all yeast analyses, we considered the 5,887 genes labeled as "ORF verified" or "uncharacterized" in the Saccharomyces Genome Database (SGD), discarding the 1,127 labeled as "dubious", "ncRNA", "rRNA", "snoRNA", "snRNA", or "tRNA". We only considered TFs with evidence of direct DNA binding via a DNA binding domain. To identify these, we compared multiple lists, including those that had been ChIPped by Harbison et al., those that were over-expressed in the ZEV data, and those that had DNA binding specificity models in the CIS-BP database (Weirauch et al. 2014). In cases of disagreement, we curated the list manually by consulting data in SGD, focusing primarily on domain analysis of the protein and on gene ontology categories assigned via high-throughput experiments such as protein-binding microarrays. In most cases the judgment is clear but there are some borderline cases that require a best guess. The complete list of 183 yeast proteins that we treated as TFs is provided as Supplemental File S9.
Yeast ChIP-chip data sets The Harbison ChIP-chip binding location data was published in (Harbison et al. 2004). We downloaded the p-values that represent the significance of TF binding within the intergenic regions from http://younglab.wi.mit.edu/regulatory_code/GWLD.html. Following the authors' recommendation, targets were considered significantly bound if their p-value was less than or equal to 0.001. TFs with no significantly bound targets were eliminated from further analysis. The Venters ChIP-chip data were published in (Venters et al. 2011). We downloaded the occupancy-level profiles for 200 transcription-related proteins from Table S4a in (Venters, et al. 2011). The log2 fold change of experimental signal over background signal within each promoter was used as the binding signal strength. The probes covered a distal region (260-320 bp upstream of ATG) and a proximal region (30-90 bp upstream of ATG). The downloaded occupancy level took the maximal level from either regulatory region. The authors used an FDR threshold of 5%, but we used 1% in order to make the data more comparable to those from the Harbison data set. For each TF, an FDR cutoff was calculated by searching for an occupancy level such that the ratio of number of targets in the mock IP control over the number in the experimental sample reaches the desired FDR. The "25&37C merged MockIP controls" file was obtained directly from the authors as it was unpublished. TFs with no significantly bound target were eliminated from further analysis.
Yeast ChIP-exo data sets The ChIP-exo data for 26 TFs were compiled from four resources (Bergenholm et al. 2018;Holland et al. 2019;Rhee and Pugh 2011;Rossi et al. 2018aRossi et al. , 2018b. We downloaded genomic coordinates of the ChIP peaks for Reb1, Gal4, Phd1 and Rap1 published in ref. (Rhee and Pugh 2011) from https://ars.els-cdn.com/content/image/1-s2.0-S0092867411013511-mmc2.xls. We mapped the peaks to genes using the coordinates of each gene's promoter region (700 bp upstream to ATG) in reference genome S288C-R55, which was the last release prior to the date cited in the paper, "(build: 19-Jan-2007)" (ref. (Engel et al. 2014) lists all releases). We then calculated each TF's binding strength at each promoter as the sum of all the TF's in that promoter. We downloaded peaks for Abf1 and Ume6 generated using a newer protocol, ChIPexo 5.0, described in ref. (Rossi, et al. 2018b), from GEO Series GSE110681. We also downloaded peaks for Cbf1 from GEO Series GSE93662 (see ref. (Rossi, et al. 2018a)). The assignment of peak-promoter and binding strength at promoter were calculated as for the data from Rhee, et al., 2011, except that both peak and promoter coordinates were based on gene annotation from reference genome S288C-R64. Lastly, we obtained ChIP-exo data for 20 TFs (Cat8, Cbf1, Ert1, Gcn4, Gcr1, Gcr2, Hap1, Ino2, Ino4, Leu3, Oaf1, Pip2, Rds2, Rgt1, Rtg1, Rtg3, Sip4, Stb5, Sut1, Tye7) directly from the authors of (Holland, et al. 2019) and (Bergenholm, et al. 2018). Each TF was assayed in four different environmental conditions, but we focused on the glucose limited chemostat data as that gave the best agreement with both TFKO and ZEV15 response data. This data set contains scores for each promoter that had at least one peak assigned to that promoter. We directly used the highest score at each promoter as the binding strength, after removing any peak that was > 700 bp upstream from ATG. Any promoter without a score in the file was assigned a score of zero.
Yeast transposon calling cards data We combined calling cards data from (Wang et al. 2011) and (Shively et al. 2019) on Cbf1, Cst6, Gal4, Gcr1, Gcr2, Rgm1, and Tye7 with new data on Eds1, Gcn4, Ino4, Leu3, Lys14, Rgt1, Sfp1, and Zap1 (File S10). For ease of access, we provide P-values for promoter binding from the combined data sets as Supplemental File S10. For each TF, all data from all replicates were combined. Transpositions within the promoters of yeast genes (700 bp upstream to ATG, reference genome S288C-R61) were used for calculating the significance of TF binding. For each promoter, a Poisson p-value was calculated by comparing the experiment sample with a no-TF control sample as described (Wang et al. 2012). To obtain a ranking by calling cards signal strength for dual threshold optimization, we used the normalized transposition count of the experimental samples minus that of the control samples to break ties when promoters had identical p-values.
Yeast TFKO data The microarray expression data on gene knockout strains was published in ref (Kemmeren et al. 2014). The gene expression profiles of 1,484 single gene deletion strains and 3 wild type replicates were downloaded from http://deleteome.holstegelab.nl/data/downloads/deleteome_all_mutants_controls.txt. In addition, we downloaded the gene expression profiles after removal of the slow growth signature removed http://deleteome.holstegelab.nl/data/downloads/deleteome_all_mutants_svd_transformed.txt. This transformed data set did not contain new p-values and analyzing it did not produce better results than the untransformed data, so we focused on the untransformed data.
Yeast ZEV induction data The ZEV induction system was described in ref. (Hackett et al. 2019). The shrunken expression profiles were used as the quantitative responsiveness of target genes after TF induction. Specifically, the file "Raw & processed gene expression data" was downloaded from https://idea.research.calicolabs.com/data and the column labeled log2_shrunken_timecourses was used. The responsive set contains all targets with nonzero expression levels. We systematically analyzed ZEV expression profiles measured at all time points (5,10,15,20,30,45, and 90 minutes) after TF induction. To make different time points comparable, we only focused on 103 TFs that were available in the Harbison ChIP-chip data and each time point of the ZEV data. The maximal number of acceptable TFs was obtained at 15 min, so we chose to move forward with this time point for all subsequent analyses except those that involve network inference. For network inference, we used the 15, 45, and 90 minute samples.
Human ChIP-seq data Two human ChIP-seq data sets were analyzed in this work: ChIP-seq in K562 cell line published by ENCODE, and ChIP-seq in HEK293 cell line published in ref. (Schmitges et al. 2016). All ENCODE data were downloaded from www.encodeproject.org as of January 21st, 2019. We focused on the data on K562 cells because it had by far the most TFs with both ChIP-seq and perturbation response data. We downloaded the "conservative" ChIP-seq peaks mapped to GRCh38 as called by the ENCODE pipeline, which uses the Irreproducible Discovery Rate (IDR) analysis of biological replicates with 2% IDR cutoff. Using the ENCODE definition of transcription factor, there was ChIP-seq data for 261 TFs in K562. To quantify the significance of each TF-target binding interaction, we summed the log10 q-values of significant peaks that were within the regulatory regions of each gene (defined below). For the HEK293 cell line, ChIP-seq was carried out using an antibody against GFP. We downloaded the combined summits for 131 ChIPped zinc finger proteins from GEO Series GSE76494 (see ref. (Schmitges, et al. 2016) for details). The binding strength within the regulatory regions of a gene was the summed scores of all summits assigned to those regions. We tried two definitions of regulatory region: (1) a single long promoter extending from 10 Kb upstream of the 5'-most transcription start site (TSS) to 2Kb downstream (Ensembl Release 92), or (2) a core promoter extending from 500 bp upstream of the TSS to 500 bp downstream combined with the gene's enhancers from the GeneHancer database V4.8 (Fishilevich et al. 2017). We used only the "double elite" enhancers, for which both the existence of the enhancer and the gene-enhancer association are supported by at least two evidence sources. This double-elite list was obtained by emailing the authors of the paper. In order to properly use the ChIP summits in HEK293 whose coordinates were based on GRCh37, we used the LiftOver tool in UCSC genome browser to lift over the coordinates of regulatory regions from GRCh38 to GRCh37.
Human ChIP-exo data A collection of ChIP-exo data of 221 KRAB zinc-finger proteins in HEK293T cell lines was downloaded from GSE78099 (Imbeault et al. 2017). We mapped the MACS peaks obtained from the supplemental files to the regulatory regions defined above (GRCh37). We then summed the scores of peaks for each gene to represent the binding strength between each protein and its target.
Human TFKD, CRISPRi and TF-induction data We considered three human perturbation response data sets: TF knockdown (TFKD) in K562, CRISPRi in K562, and TF-induction in HEK293 (Schmitges, et al. 2016). The RNA-seq expression profiles of wild-type controls, TFKDs and CRISPRi were downloaded from the ENCODE web site. Knockdowns using small-interfering RNA (siRNA) or small-hairpin RNA (shRNA) were combined in the data set we referred to as TFKD while the CRISPRi and CRISPR TF-disablement data were combined in the data set we referred to as CRISPR. For K562 cells, there were TFKD experiments targeting 261 different proteins and CRISPRi experiments targeting 96 different proteins. The expected counts were reported by the RSEM program in the ENCODE RNA-seq processing pipeline using gene annotation from GENCODE V24 (GRCh38). Differentially expressed genes in each perturbed TF strain were processed by comparing the experimental replicate set to the control set using DESeq2 (V1.10.1). On the TFinduction for HEK293, RNA-seq was carried out 24 hours after overexpressing the TF from a tetracycline-inducible plasmid. For the majority of TFs there was only a single replicate of the RNA-seq experiment, which prevents the calculation of statistical significance by traditional methods. The processed RNA-seq expression profiles (after lowly-expressed gene removal and batch normalization) for 80 induced zinc finger proteins were downloaded from GEO Series GSE76495. Since there were no control replicates, we used the expression levels in each profile, normalized to the medians of the respective batches, as the response strength (Schmitges, et al. 2016).
NETPROPHET ANALYSIS NetProphet 2.0 is a TF network inference algorithm that exploits gene expression data under genetic or environmental perturbation and genome sequences with annotations. The algorithm is described in detail in ref. (Kang et al. 2017). Here, two yeast TF networks were mapped, one using the Kemmeren TFKO gene expression data and one using the ZEV data. Three human TF networks were mapped using the three perturbation response data sets.
Yeast NP networks NetProphet 2.0 requires gene expression data in the form of a gene expression matrix and differential expression matrix. The Kemmeren gene expression matrix was represented as the log2 fold-change (logFC) values of strains with gene deletions over wild-type strains. The ZEV gene expression matrix was represented as the logFC values of the levels measured at a certain time point after the TF induction relative to time 0. For the differential expression (DE) module of the algorithm, we used Kemmeren samples in which a TF-encoding gene was knocked out, not those in which some other type of gene was knocked out, and ZEV samples from 15 min after TF induction. For the co-expression module of the algorithm, we used the complete set of 1,484 Kemmeren expression profiles from strains lacking one gene (not necessarily encoding a TF) or 590 ZEV expression profiles from 15 minutes, 45 minutes, or 90 minutes post-induction. The other two inputs were DNA sequences of yeast promoters and amino acid sequences of TFs' DNA binding domains (DBDs), as described in ref. (Kang, et al. 2017). PWM models of TFs' binding specificity were not used. Each output network is an adjacency matrix, where the rows represent TFs, the columns represent genes, and the entries are NetProphet scores representing the aggregate strength of evidence that the gene is a direct functional target of the TF.
Human NP networks For K562 data, we calculated differential expression (DE) p-values for TFKD and CRISPRi independently using DESeq2. The DE matrix input to NetProphet 2.0 contained the -log pvalues, with a negative sign for apparent repression (the knockdown of the TF makes the target gene go up). For HEK293 data, we directly used the logFC values because there were no replicates, hence no p-values, for most TFs. The co-expression matrix contained the logFC of individual mutant strain replicates over the median expression level of control replicates (K562) or the median expression level of the gene across all perturbations (HEK293). There were 765 expression profiles (including replicates) for K562 TFKD, 252 for K562 CRISPRi, and 107 for HEK293 TF inductions. We obtained DNA sequences of the regulatory regions based on their coordinates in GRCh38 using our definition (2) of regulatory regions. We concatenated the enhancers and promoter of each gene into a single sequence for the purpose of motif inference in NetProphet 2.0. Each pair of concatenated regions was separated by 50 N's to ensure that no inferred motif instances crossed between one enhancer and another. We also queried the CIS-BP (Weirauch, et al. 2014) database for the amino acid sequences of human TFs' DBDs. The details of DBD preprocessing are described in (Kang, et al. 2017 (Greenfield et al. 2013) was downloaded from https://github.com/ChristophH/Inferelator. No prior network was used because our intention is to infer networks from perturbation response data without any influence from data on TF binding locations. Otherwise, default parameters were used. MERLIN (Roy et al. 2013) was downloaded from https://github.com/marbach/gpdream and used with default parameters. We input the same gene expression matrix, TF list and gene list to NetProphet 2.0, GENIE3, Inferelator, and MERLIN.

ACCEPTABLE TFS
For each pair of binding and expression data on a given TF, the positive gene sets (bound genes or responsive genes) were compared. The TF was deemed acceptable if they met two criteria.
(1) The lower bound on the expected FDR had to be less than or equal to 20% when the sensitivity was fixed at 80%, as calculated by using the formula derived in Box S1.
(2) The p-value for the significance of the overlap between the bound and responsive gene sets had to be <= 0.01. For fixed threshold analysis, the p-value was calculated using the hypergeometric null distribution. For dual threshold analysis, it was calculated using the randomization-based null distribution, not the nominal p-value (see description of dual threshold optimization).

DUAL THRESHOLD OPTIMIZATION
Software availability Software implementing dual threshold optimization and instructions can be found at https://github.com/BrentLab/Dual_Threshold_Optimization.

DTO algorithm
For each TF, dual threshold optimization (DTO) uses one binding location data set and one gene expression data set. The genes in each data set are ranked by the strength of their binding or expression signal. By default, the signal strength is the negative log p-value, but different experiments and methods may require different calculations of signal strength (see sections below). For each data set, DTO chooses a threshold on the ranks such that genes ranking above the threshold are considered positives for binding or response (see Fig. 2C). A series of rank-threshold combinations (places where the gray lines cross in Fig. 2C) are used to generate positive subsets of genes in each dataset. The series of thresholds for each data set, T1, T2, …, were generated using the recurrence: This formula produces a fine spacing among smaller subsets that becomes coarser as the subsets grow. If a threshold would split a group of genes that all have the same score that threshold is skipped. For each pair of subsets, a hypergeometric p-value was computed using a hypergeometric survival function (Scipy's hypergeom.sf) with the following parameters: k = # of genes in the intersection of the subsets -1 M = # of genes in the universe of assayed genes n = # of genes in the expression subset N = # of genes in the bound subset This hypergeometric p-value is the probability of an intersection as large as, or larger than, the observed intersection, when choosing random subsets of genes, with the number of genes in each random subset equal to the number of genes in the positives defined by the rank threshold pair. We refer to this as the nominal p-value because it is only used for selecting the best pair of thresholds, not for determining whether the resulting overlap is significantly larger than would be expected by running DTO on random rankings. DTO returns the threshold combination that minimizes the nominal p-value.
Randomization-based p-values for overlaps identified by DTO To produce a null distribution for testing the significance of the overlap chosen by DTO, a randomization procedure was used. A new set of data was generated using random assignment of signal strength scores to genes and then DTO was run. The best nominal p-value for each randomized data set was used to calculate a null distribution of nominal p-values. This was done by running the randomized DTO procedure 1000 times for each TF in each analysis and the distribution of nominal (hypergeometric) p-values enabled us to determine a P<0.01 significance threshold on the nominal p-value that is specific to that TF. When the nominal pvalue of the rank pair chosen by DTO using the true data was below the threshold defined by the randomizations, the overlap was considered significant.
Application of DTO to yeast data In the analysis of yeast data, the universe was defined as the set of all genes assayed in either of the two datasets being compared. For data sets that do not have p-values, the signal strength is the log fold change (ZEV) or score (NetProphet 2.0, GENIE3, Inferelator or MERLIN). For Calling cards, where many p-values were identical, ties were broken by the difference between the number of insertions in the experimental sample and the number of insertions in the control sample. The number of insertions was normalized to the total insertion count in each sample.
Occasionally, DTO can produce implausible results, such as concluding that all genes are responsive to a perturbation. To prevent this, we set very relaxed limits on the bound or responsive genes in certain data sets. For Harbison ChIP, we required P<0.1. For each inferred network output we required that the score of the TF-target relationship be among the top 150,000 scores. These were sufficient to eliminate any anomalous results; no constraints on the TFKO or ZEV data were required.
Application of DTO to human ENCODE data In the analysis of human data on K562 cells, the universe was defined as the set of all genes detected in the gene expression dataset. Response signal strength for DTO was the absolute value of the log fold change, relative to non-perturbed control samples. DTO was limited to choosing bound or responsive genes with P<=0.1. For each inferred network output, the score of the TF-target relationship was required to be among the top 500,000 scores.
Application of DTO to human HEK293 data In the analysis of human data on HEK293 cells, the universe was defined as the set of all genes detected in the gene expression dataset. Response signal strength for DTO was the absolute value of the log fold change, relative to non-perturbed control samples. No replicates or pvalues were available for most TFs. For both raw perturbation-response data and inferred network scores, the score of the TF-target relationship was required to be among the top 300,000 scores.

COMPARISONS AMONG BINDING DATA SETS
Harbison ChIP-chip compared to Venters ChIP-chip The TFs used in the comparison shown in Figure 4A are: Ash1, Cha4, Cin5, Fkh1, Fkh2, Gal4, Gcn4, Gln3, Ino4, Leu3, Msn2, Pho2, Rfx1, Rph1, Sfp1, Skn7, Stp1, Swi5, Uga3, Wtm1, Wtm2, Xbp1, Yap5, Yap6, Zap1, Zms1 Harbison ChIP-chip compared to ChIP-exo The TFs used in the comparison shown in Figure 4B are: Cat8, Cbf1, Ert1, Gal4, Gcn4, Gcr2, Hap1, Ino4, Leu3, Oaf1, Phd1, Pip2, Rds2, Rgt1, Rtg1, Rtg3, Sip4, Stb5, Sut1, Tye7 Harbison ChIP-chip compared to transposon calling cards The TFs used in the comparison shown in Figure 4C are: Cbf1, Cst6, Gal4, Gcn4, Gcr2, Ino4, Leu3, Rgm1, Rgt1, Sfp1, Tye7, Zap1 RANK RESPONSE PLOTS To create the lines in the rank response plots such Fig. 5A, we first determined the minimum of the number of responsive genes in the TFKO and the ZEV15 data --call it n. A gene was considered responsive in the TFKO data if it had adjusted P<0.05 and in the ZEV15 data if it had shrunken absolute log fold change > 0. We then labeled the top n most strongly responsive genes in the TFKO and ZEV15 data as responsive for purposes of this plot (see "DTO algorithm" above for definitions of signal strength). This equalized the number of ZEV15responsive and TFKO-responsive genes for each TF. We then sorted genes by the strength of their binding signal for the TF in question. Next, we considered the top 1, 2, 3, 4, etc. most strongly bound genes. For each such group, we calculated and plotted the fraction of genes that were responsive. For the mean rank-response plot (Fig. 5B) we simply averaged the response rates across the 12 TFs. In comparison of ChIP-chip, ChIP-exo and calling cards (Fig. 5C), we averaged the response rates across the 8 TFs that are present in all 3 binding data sets.
GO ENRICHMENT ANALYSIS Gene ontology (GO) enrichments for each TF were analyzed using two networks mapped using different methods: (1) fixed threshold on Harbison ChIP data and TFKO data; (2) DTO on calling cards data and output from NetProphet 2.0 run on the TFKO and ZEV response data. For each TF, its target set in network 2 was the union of the output of DTO applied to NetProphet scores from ZEV expression data and DTO applied to NetProphet scores for TFKO expression data. The mapping of GO term to gene for Saccharomyces cerevisiae was queried using R Bioconductor library org.Sc.sgd.db (V3.5.0). Any GO terms annotated with less than 3 or greater than 300 genes were eliminated. Using the target genes of a TF identified from network (1) or (2), the GO enrichment in biological process was analyzed using the hypergeometric test implemented in R Bioconductor library GOstats (V2.44.0). The output p-values were used for ranking the enriched terms, from most significant to the least significant. When plotting the top GO term shown in Figure 4D, for each TF, we combined all terms from both networks into a single rank list. If multiple terms were enriched by the same set of targets, only the most specific term was retained, based on the GO hierarchical structure, i.e. redundant ancestral terms were removed from the rank list. Subsequently, the top GO term was chosen for the corresponding TF. When plotting the top GO terms shown in Figure S6, we used the GO terms enriched in one network and chose the top five (if available) as described above.