library('gbm')
library('ROCR')
library("kohonen")
library('gplots')
library('RColorBrewer')
library("biomaRt")
library("ChIPpeakAnno")
library("org.Hs.eg.db")
In this vignette, we provide an example dataset for the ZNF143 transcription factor. The data are stored as a list with one element for each of four cell types: Gm12878, H1hesc, Helas3, and K562. For each cell type, the list includes four objects: trainseq, testseq, training, and testing. The training and testing matrices each store sequences in rows and features in columns, with an additional column of sequence labels.
Load EMT functions.
source('R/emt.r')
source('R/clustering.r')
source('R/coherence.r')
Load ZNF143 data.
tf <- 'ZNF143'
data("ZNF143")
cells <- names(datalist)
print(cells)
## [1] "Gm12878" "H1hesc" "Helas3" "K562"
The trainseq columns are:
sequence: 100 base pair sequence.
seqnames: Name of the chromosome.
start: The starting position of the binding site in the chromosome.
end: The ending position of the binding site in the chromosome.
score: Signal value.
value: The overall measurement of the enrichment in the region; for non-binding sites it is the value of associated binding site.
peakid: Id.
label: Indicator of binding or non-binding.
head(datalist$Gm12878$trainseq)
## sequence
## 1 TGACAGCCTCAGCCAACCCCACAGGGAGCTCTGAAGCTAGAATGGCCCTTCAGAGTTGACCCAAAATAAGCTAAGAAGGCCAGGCCGTTACACACCTGTT
## 2 TTGCAGCAGGTGCCTGGGAAGCCAGCTTAACATAAGCTGGCTTTGGGCTGTCCTGGCCCAGGCCTGGCCCTGCAGGGTGACTGGACCCTGCCCAGACTTG
## 3 CGCGACCAATGGGCCCCCGCCGCCGGGAAGCCGCGCCCGCCCCCTGGCGGTGGAGGACCAAGCGGGCGCCCGGGCCGGCCAGAGGGAAGGGCCGGAGAGC
## 4 GGCCCCTGTCGGCCGCCAAGCCCCTCCGCCCCTCACAGCGCCCAGGTCCGCGGCCGGGCCTTGATTTTTTGGCGGGGACCGTCATGGCGTCGCAGCCAAA
## 5 GAAACTCAGATCTTTTTGAAGAGGATGCAGCTGTCACAGAAACATGCAGCTGCTGCTGGCAGAGTGCATGGGTCAGAGTGGGCCACCAGGAGCTGTCTGC
## 6 ACCACACAGACCTCCCCCTCCCCACCCCCAGCCCCGCCTGCCCTAGCCCCGCCGCCGCCGCCGCCGAAACTCTTGGGCCTCTGGCCGCCCAGACCCCTCA
## seqnames start end score value peakid label
## 1 chr1 36192886 36192985 623 97.53589 4476 0
## 2 chr1 26872569 26872668 1000 158.16834 2305 0
## 3 chr7 121036413 121036512 833 130.41128 2805 1
## 4 chr10 94353050 94353149 1000 437.76702 406 0
## 5 chr6 2932247 2932346 673 105.41407 3818 0
## 6 chr12 46123024 46123123 831 130.07808 2815 0
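As a quick sanity check on the columns described above, the sketch below rebuilds the first three rows of the printed output (coordinates and labels only) in a small data frame and confirms that each region spans 100 base pairs, assuming start and end are inclusive coordinates. The name `toy` is illustrative and not part of the package.

```r
# First three rows of head(datalist$Gm12878$trainseq), without the sequence column
toy <- data.frame(
  seqnames = c("chr1", "chr1", "chr7"),
  start    = c(36192886, 26872569, 121036413),
  end      = c(36192985, 26872668, 121036512),
  label    = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Width of each region, assuming inclusive start and end coordinates
toy$width <- toy$end - toy$start + 1
print(toy$width)

# Number of sequences carrying each label value
label_counts <- table(toy$label)
print(label_counts)
```

The same two lines can be applied to the full `trainseq` data frame to check that all sequences have the expected length and to see how the labels are balanced.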
The train columns are:
label: Indicator of binding or non-binding.
M00001: TRANSFAC id (i.e. feature 1)
M00002: TRANSFAC id (i.e. feature 2)
M00003: TRANSFAC id (i.e. feature 3)
M00004: TRANSFAC id (i.e. feature 4)
datalist$Gm12878$train[1:5, 1:5]
## label M00001 M00002 M00003 M00004
## 1 0 0.000000 0.000000 2.327241 2.188066
## 2 0 2.402537 2.236245 0.000000 0.000000
## 3 1 0.000000 0.000000 0.000000 2.116407
## 4 0 0.000000 0.000000 0.000000 0.000000
## 5 0 3.376568 2.854730 0.000000 0.000000
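Most model-fitting functions expect the response and the predictors as separate objects. The following is a minimal sketch, using the five rows printed above, of splitting such a matrix into a label vector and a feature matrix. The names `toy_train`, `y`, and `X` are illustrative, and whether `build_emt` performs this split internally is not shown in this vignette.

```r
# The 5 x 5 corner of the training matrix printed above
toy_train <- data.frame(
  label  = c(0, 0, 1, 0, 0),
  M00001 = c(0, 2.402537, 0, 0, 3.376568),
  M00002 = c(0, 2.236245, 0, 0, 2.854730),
  M00003 = c(2.327241, 0, 0, 0, 0),
  M00004 = c(2.188066, 0, 2.116407, 0, 0)
)

# Response vector: one label per sequence
y <- toy_train$label

# Feature matrix: every column except the label
X <- as.matrix(toy_train[, setdiff(names(toy_train), "label")])

# Number of sequences with a non-zero value for each feature
nonzero <- colSums(X != 0)
print(nonzero)
```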
Build an EMT model for each cell type. Note that this step takes a considerable amount of time unless multiple cores are used.
models <- lapply(datalist, function(x) build_emt(x))
names(models) <- cells
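Because each cell type is fitted independently, the loop lends itself to parallel execution. The sketch below shows the pattern with `parallel::mclapply` and a hypothetical stand-in function `slow_fit`; in practice one would presumably substitute `build_emt` and `datalist`, although whether `build_emt` behaves well in forked workers is an assumption that has not been verified here. Forking is unavailable on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-cell-type model fit
slow_fit <- function(x) {
  Sys.sleep(0.1)
  mean(x)
}

# Toy input with one element per cell type
toy_list <- list(Gm12878 = 1:10, H1hesc = 11:20, Helas3 = 21:30, K562 = 31:40)

# mclapply forks worker processes; mc.cores > 1 is not supported on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
toy_models <- mclapply(toy_list, slow_fit, mc.cores = n_cores)

# Names of the input list are carried over to the result
print(names(toy_models))
```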
For the purpose of this vignette, we precomputed the models and stored the results in an R object, which can be loaded as follows.
data("models")
sapply(cells, function(x) models[[x]]$auc)
## Gm12878 H1hesc Helas3 K562
## 0.896 0.891 0.847 0.902
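The AUC can be read as the probability that a randomly chosen sequence labeled 1 receives a higher score than a randomly chosen sequence labeled 0. The vignette does not show how `build_emt` computes `auc`, so the following is only a base R sketch of the quantity itself, using the rank-sum identity on made-up scores. The function name `auc_rank` and the toy vectors are illustrative.

```r
# AUC via the rank-sum (Mann-Whitney) identity; labels must be coded 0/1
auc_rank <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks are used for ties
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up prediction scores and labels for six sequences
scores <- c(0.9, 0.8, 0.3, 0.2, 0.6, 0.4)
labels <- c(1, 1, 0, 0, 0, 1)

auc <- auc_rank(scores, labels)
print(round(auc, 3))
```

In this toy example, 8 of the 9 possible (label 1, label 0) pairs are ordered correctly, so the AUC is 8/9.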
The number of sub-models for each of the four cell types:
sapply(cells, function(x) models[[x]]$model$n.trees)
## Gm12878 H1hesc Helas3 K562
## 25 23 29 32
Let’s create an object that stores the sub-model results only.
submodels <- lapply(models, function(x) x$model)
Summarize the cluster membership matrix and plot the constituent cell-specific sub-model counts. In the figure, each row denotes a cluster number and each column denotes a cell line. Each value of the matrix gives the number of sub-models coming from a cell line and belonging to a cluster.
ld <- get_cluster_membership(submodels, clen = 16)
fit <- ld$fit
cluster.membership <- ld$cluster.membership
plot_cluster_membership(cluster.membership, tf)
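The structure of a cluster membership matrix can be illustrated with base R alone. The sketch below cross-tabulates hypothetical sub-models by cluster and by cell line of origin; the assignments are made up, and `get_cluster_membership` may build its matrix differently.

```r
# Hypothetical sub-models: cell line of origin and assigned cluster
cell    <- c("Gm12878", "Gm12878", "H1hesc", "Helas3",
             "Helas3", "K562", "K562", "K562")
cluster <- c(1, 2, 1, 2, 3, 1, 3, 3)

# Rows are clusters, columns are cell lines, entries are sub-model counts
membership <- table(
  cluster = factor(cluster, levels = 1:3),
  cell    = factor(cell, levels = c("Gm12878", "H1hesc", "Helas3", "K562"))
)
print(membership)

# Total number of sub-models in each cluster
cluster_sizes <- rowSums(membership)
print(cluster_sizes)
```

Using factors with explicit levels keeps empty clusters and cell lines in the table as rows or columns of zeros instead of dropping them.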
Make a new ensemble object for each cluster.
newEnsembles <- make_cluster_ensembles(submodels, fit, clen = 16)
head(sapply(newEnsembles, class))
## [1] "gbm" "gbm" "gbm" "gbm" "gbm" "gbm"
Get the target genes of each cell type from each cluster. A target gene is the gene nearest to a binding site that belongs to the cluster.
targets <- get_targets(datalist, newEnsembles, clen = 16)
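The nearest-gene rule can be sketched in base R. The example below assigns each binding site to the gene whose transcription start site is closest on a single chromosome. The gene names and coordinates are made up, strand and chromosome are ignored, and the implementation inside `get_targets` is not shown in this vignette and may differ.

```r
# Hypothetical genes on one chromosome with their transcription start sites
genes <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  tss  = c(1000, 5000, 9000),
  stringsAsFactors = FALSE
)

# Hypothetical binding-site midpoints on the same chromosome
sites <- c(1200, 4100, 8800, 6900)

# For each site, pick the gene with the smallest absolute distance to its TSS
nearest <- vapply(
  sites,
  function(pos) genes$gene[which.min(abs(genes$tss - pos))],
  character(1)
)
print(nearest)

# The target gene set is the unique set of nearest genes
target_genes <- unique(nearest)
print(target_genes)
```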
How many clusters have expression coherence?
data("expression")
coherence <- setup_exp_coherence(targets, exprsn, exptheK = 1, verbose = FALSE)
df <- exp_coherence(coherence)
sum(df$odds > 1 & df$p.value < 0.05, na.rm = TRUE)
## [1] 5
head(df) # clusters with a single cell type and/or insufficient data have NA in the corresponding row
## odds p.value
## 1 NA NA
## 2 0.615 8.126557e-20
## 3 NA NA
## 4 NA NA
## 5 0.911 1.911765e-01
## 6 2.206 9.288746e-85
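The counting step relies on how R treats missing values in logical expressions. The sketch below applies the same filter to the six rows printed above and shows that rows with `NA` yield `NA` in the comparison and are dropped by `na.rm = TRUE`. The name `toy_df` is illustrative.

```r
# The six rows shown by head(df) above
toy_df <- data.frame(
  odds    = c(NA, 0.615, NA, NA, 0.911, 2.206),
  p.value = c(NA, 8.126557e-20, NA, NA, 1.911765e-01, 9.288746e-85)
)

# NA rows propagate as NA; other rows evaluate to TRUE or FALSE
coherent <- toy_df$odds > 1 & toy_df$p.value < 0.05
print(coherent)

# Count coherent clusters, ignoring rows with missing values
n_coherent <- sum(coherent, na.rm = TRUE)
print(n_coherent)

# Row indices of the coherent clusters; which() skips NA automatically
coherent_ids <- which(coherent)
print(coherent_ids)
```

Among these six rows only cluster 6 passes both conditions. Cluster 2 has a very small p-value but odds below 1, so it is significant in the opposite direction and is not counted.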
How many clusters have pathway coherence?
data("keggmat")
coherence2 <- setup_pathway_coherence(targets$entrezmat, pathwaymat, verbose = FALSE)
df <- pathway_coherence(coherence2)
sum(df$odds > 1 & df$p.value < 0.05, na.rm = TRUE)
## [1] 1
head(df) # clusters with a single cell type and/or insufficient data have NA in the corresponding row
## odds p.value
## 1 NA NA
## 2 5.012 7.866594e-07
## 3 NA NA
## 4 NA NA
## 5 1.194 6.862095e-01
## 6 1.450 2.455643e-01
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 grid parallel splines stats graphics grDevices
## [8] utils datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.2.3 AnnotationDbi_1.32.3 Biobase_2.30.0
## [4] ChIPpeakAnno_3.4.6 RSQLite_1.0.0 DBI_0.3.1
## [7] VennDiagram_1.6.16 futile.logger_1.4.1 GenomicRanges_1.22.4
## [10] GenomeInfoDb_1.6.3 Biostrings_2.38.4 XVector_0.10.0
## [13] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1
## [16] biomaRt_2.26.1 RColorBrewer_1.1-2 kohonen_2.0.19
## [19] MASS_7.3-45 class_7.3-14 ROCR_1.0-7
## [22] gplots_2.17.0 gbm_2.1.1 lattice_0.20-33
## [25] survival_2.38-3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 GO.db_3.2.2
## [3] Rsamtools_1.22.0 gtools_3.5.0
## [5] digest_0.6.9 mime_0.4
## [7] R6_2.1.2 futile.options_1.0.0
## [9] evaluate_0.8.3 httr_1.1.0
## [11] BiocInstaller_1.20.1 zlibbioc_1.16.0
## [13] GenomicFeatures_1.22.13 gdata_2.17.0
## [15] rmarkdown_0.9.5 BiocParallel_1.4.3
## [17] AnnotationHub_2.2.5 stringr_1.0.0
## [19] RCurl_1.95-4.8 shiny_0.13.1
## [21] httpuv_1.3.3 rtracklayer_1.30.2
## [23] multtest_2.26.0 htmltools_0.3
## [25] SummarizedExperiment_1.0.2 interactiveDisplayBase_1.8.0
## [27] matrixStats_0.50.1 XML_3.98-1.4
## [29] GenomicAlignments_1.6.3 bitops_1.0-6
## [31] RBGL_1.46.0 xtable_1.8-2
## [33] magrittr_1.5 formatR_1.3
## [35] graph_1.48.0 KernSmooth_2.23-15
## [37] stringi_1.0-1 limma_3.26.8
## [39] lambda.r_1.1.7 ensembldb_1.2.2
## [41] tools_3.2.3 BSgenome_1.38.0
## [43] yaml_2.1.13 regioneR_1.2.3
## [45] caTools_1.17.1 memoise_1.0.0
## [47] knitr_1.12.3