Set up

library('gbm')
library('ROCR')
library("kohonen")
library('gplots')
library('RColorBrewer')
library("biomaRt")
library("ChIPpeakAnno")
library("org.Hs.eg.db")

Loading source files and data

In this vignette, we provide an example dataset for the ZNF143 transcription factor in the form of a list with one entry per cell type: Gm12878, H1hesc, Helas3, and K562. For each cell type, the list includes four objects: trainseq, testseq, training, and testing. The training and testing matrices each store sequences in the rows and features in the columns, with an additional column of sequence labels.

Load EMT functions.

source('R/emt.r') 
source('R/clustering.r')
source('R/coherence.r')

Load ZNF143 data.

tf <- 'ZNF143'
data("ZNF143")
cells <- names(datalist)
print(cells)
## [1] "Gm12878" "H1hesc"  "Helas3"  "K562"
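
The nested layout described above can be mimicked with a small toy list, which makes it easier to see how the pieces are addressed. This is only a sketch: the values are made up, `toy_cell` and `toy_datalist` are hypothetical names, and only two of the four objects per cell type are included.

```r
# Toy stand-in for the nested layout (values are made up, not ZNF143 data).
toy_cell <- function() {
  list(
    trainseq = data.frame(sequence = c("ACGT", "TTGA"), label = c(1, 0),
                          stringsAsFactors = FALSE),
    train    = data.frame(label = c(1, 0), M00001 = c(2.4, 0))
  )
}
toy_datalist <- list(Gm12878 = toy_cell(), K562 = toy_cell())

names(toy_datalist)               # cell types
sapply(toy_datalist, names)       # objects stored for each cell type
dim(toy_datalist$Gm12878$train)   # sequences x (label + features)
```

The same three calls can be applied to the real `datalist` to confirm what each cell type contains.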

The trainseq columns are:

  1. sequence: 100 base pair sequence.
  2. seqnames: Name of the chromosome.
  3. start: The starting position of the binding site in the chromosome.
  4. end: The ending position of the binding site in the chromosome.
  5. score: Signal value.
  6. value: The overall measurement of the enrichment in the region; for non-binding sites it is the value of the associated binding site.
  7. peakid: Peak identifier.
  8. label: Indicator of binding or non-binding.

head(datalist$Gm12878$trainseq)
##                                                                                               sequence
## 1 TGACAGCCTCAGCCAACCCCACAGGGAGCTCTGAAGCTAGAATGGCCCTTCAGAGTTGACCCAAAATAAGCTAAGAAGGCCAGGCCGTTACACACCTGTT
## 2 TTGCAGCAGGTGCCTGGGAAGCCAGCTTAACATAAGCTGGCTTTGGGCTGTCCTGGCCCAGGCCTGGCCCTGCAGGGTGACTGGACCCTGCCCAGACTTG
## 3 CGCGACCAATGGGCCCCCGCCGCCGGGAAGCCGCGCCCGCCCCCTGGCGGTGGAGGACCAAGCGGGCGCCCGGGCCGGCCAGAGGGAAGGGCCGGAGAGC
## 4 GGCCCCTGTCGGCCGCCAAGCCCCTCCGCCCCTCACAGCGCCCAGGTCCGCGGCCGGGCCTTGATTTTTTGGCGGGGACCGTCATGGCGTCGCAGCCAAA
## 5 GAAACTCAGATCTTTTTGAAGAGGATGCAGCTGTCACAGAAACATGCAGCTGCTGCTGGCAGAGTGCATGGGTCAGAGTGGGCCACCAGGAGCTGTCTGC
## 6 ACCACACAGACCTCCCCCTCCCCACCCCCAGCCCCGCCTGCCCTAGCCCCGCCGCCGCCGCCGCCGAAACTCTTGGGCCTCTGGCCGCCCAGACCCCTCA
##   seqnames     start       end score     value peakid label
## 1     chr1  36192886  36192985   623  97.53589   4476     0
## 2     chr1  26872569  26872668  1000 158.16834   2305     0
## 3     chr7 121036413 121036512   833 130.41128   2805     1
## 4    chr10  94353050  94353149  1000 437.76702    406     0
## 5     chr6   2932247   2932346   673 105.41407   3818     0
## 6    chr12  46123024  46123123   831 130.07808   2815     0
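
As a quick sanity check on the column descriptions, the coordinates printed above can be compared against the stated 100 base pair length. The sketch below copies the six printed rows by hand (the name `coords` is introduced here for illustration) and assumes that `start` and `end` are both inclusive.

```r
# First six rows of the trainseq printout, copied from the output above.
coords <- data.frame(
  seqnames = c("chr1", "chr1", "chr7", "chr10", "chr6", "chr12"),
  start = c(36192886, 26872569, 121036413, 94353050, 2932247, 46123024),
  end   = c(36192985, 26872668, 121036512, 94353149, 2932346, 46123123),
  label = c(0, 0, 1, 0, 0, 0)
)

# With inclusive coordinates, width = end - start + 1.
coords$width <- coords$end - coords$start + 1
all(coords$width == 100)

# Number of binding (1) and non-binding (0) sequences among these rows.
table(coords$label)
```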

The train columns are:

  1. label: Indicator of binding or non-binding.
  2. M00001: TRANSFAC id (i.e. feature 1)
  3. M00002: TRANSFAC id (i.e. feature 2)
  4. M00003: TRANSFAC id (i.e. feature 3)
  5. M00004: TRANSFAC id (i.e. feature 4)

datalist$Gm12878$train[1:5, 1:5]
##   label   M00001   M00002   M00003   M00004
## 1     0 0.000000 0.000000 2.327241 2.188066
## 2     0 2.402537 2.236245 0.000000 0.000000
## 3     1 0.000000 0.000000 0.000000 2.116407
## 4     0 0.000000 0.000000 0.000000 0.000000
## 5     0 3.376568 2.854730 0.000000 0.000000

Building a model for each cell type

Note that this step takes a considerable amount of time unless multiple cores are used.

models <- lapply(datalist, function(x) build_emt(x))
names(models) <- cells
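
One way to spread the per-cell work over several cores is `mclapply()` from the base `parallel` package. The sketch below is an assumption about how this could be done, not a description of what `build_emt` does internally: `slow_fit` is a hypothetical stand-in for an expensive fit and `toy_inputs` is made-up data.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-cell fit; NOT the real build_emt().
slow_fit <- function(x) {
  Sys.sleep(0.1)
  mean(x)
}
toy_inputs <- list(Gm12878 = 1:10, H1hesc = 11:20, Helas3 = 21:30, K562 = 31:40)

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
toy_models <- mclapply(toy_inputs, slow_fit, mc.cores = n_cores)

# Like lapply(), mclapply() keeps the names of its input list.
names(toy_models)
```

Under that assumption, replacing `slow_fit` with `build_emt` and `toy_inputs` with `datalist` would fit the cell types concurrently, at the cost of holding several models in memory at once.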

For the purpose of this vignette, we pre-computed the models and stored the results in an R object, which can be loaded as follows.

data("models")
sapply(cells, function(x) models[[x]]$auc) 
## Gm12878  H1hesc  Helas3    K562 
##   0.896   0.891   0.847   0.902
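
The vignette does not show how the `auc` value is computed. For readers who want to see what the number measures, here is a minimal base-R sketch that uses the rank-sum identity on made-up scores; `auc_rank`, `toy_labels`, and `toy_scores` are names introduced only for this illustration.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a randomly
# chosen positive is scored above a randomly chosen negative.
auc_rank <- function(scores, labels) {
  r <- rank(scores)               # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_labels <- c(1, 1, 1, 0, 0, 0)
toy_scores <- c(0.9, 0.8, 0.4, 0.5, 0.3, 0.1)

# 8 of the 9 positive/negative pairs are ordered correctly, so AUC = 8/9.
auc_rank(toy_scores, toy_labels)
```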

The number of sub-models for each of the four cell types:

sapply(cells, function(x) models[[x]]$model$n.tree)
## Gm12878  H1hesc  Helas3    K562 
##      25      23      29      32

Let’s create an object that stores the sub-model results only.

submodels <- lapply(models, function(x) x$model)

Clustering the submodels across cells

Summarize the cluster membership matrix and plot the constituent cell-specific sub-model counts. In the figure, each row denotes a cluster number and each column denotes a cell line. Each value of the matrix gives the number of sub-models coming from a cell line and belonging to a cluster.

ld <- get_cluster_membership(submodels, clen = 16)
fit <- ld$fit
cluster.membership <- ld$cluster.membership
plot_cluster_membership(cluster.membership, tf)
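
To make the shape of the membership matrix concrete, the sketch below builds a matrix with the same layout from made-up cluster assignments using `table()`. The data frame `toy` and its values are hypothetical, and this is not how `get_cluster_membership` is implemented.

```r
# Made-up assignments: one row per sub-model, with its cell line and cluster.
toy <- data.frame(
  cell    = c("Gm12878", "Gm12878", "H1hesc", "K562", "K562", "K562"),
  cluster = c(1, 2, 2, 1, 3, 3)
)

# Rows = clusters, columns = cell lines, entries = sub-model counts.
toy_membership <- table(cluster = toy$cluster, cell = toy$cell)
toy_membership

# Clusters that draw sub-models from more than one cell line.
rowSums(toy_membership > 0) > 1
```

In this toy example, clusters 1 and 2 are shared between two cell lines, while cluster 3 contains sub-models from K562 only.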

Make a new ensemble object for each cluster.

newEnsembles <- make_cluster_ensembles(submodels, fit, clen = 16)
head(sapply(newEnsembles, class))
## [1] "gbm" "gbm" "gbm" "gbm" "gbm" "gbm"

Functional assessment of the target genes

Get the target genes of each cell type from each cluster. A target gene is the gene nearest to a binding site that belongs to the cluster.

targets <- get_targets(datalist, newEnsembles, clen = 16)
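
The nearest-gene rule itself is simple to illustrate. The sketch below uses made-up positions on a single chromosome and placeholder gene names; how `get_targets` performs the lookup is not shown in this vignette, so treat this only as an illustration of the idea.

```r
# Made-up transcription start sites on one chromosome; gene names are placeholders.
tss <- c(geneA = 1000, geneB = 5000, geneC = 9000)

# Midpoints of made-up binding sites on the same chromosome.
site_mid <- c(1200, 4800, 7500)

# For each site, pick the gene whose TSS is closest in absolute distance.
nearest <- sapply(site_mid, function(p) names(tss)[which.min(abs(tss - p))])
data.frame(site_mid, nearest)
```

A real annotation also has to handle multiple chromosomes, strand, and ties, which this toy version ignores.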

How many clusters have expression coherence?

data("expression")
coherence <- setup_exp_coherence(targets, exprsn, exptheK = 1, verbose = FALSE)
df <- exp_coherence(coherence)
sum(df$odds > 1 & df$p.value < 0.05, na.rm = TRUE)
## [1] 5
head(df) # clusters with a single cell type and/or insufficient data have NA in the corresponding row
##    odds      p.value
## 1    NA           NA
## 2 0.615 8.126557e-20
## 3    NA           NA
## 4    NA           NA
## 5 0.911 1.911765e-01
## 6 2.206 9.288746e-85
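
The counting rule used above can be traced on the six rows just printed. The sketch copies those rows by hand into a data frame (`df_head` and `keep` are names introduced here) to show how the `NA` rows are handled.

```r
# The six rows printed by head(df) above.
df_head <- data.frame(
  odds    = c(NA, 0.615, NA, NA, 0.911, 2.206),
  p.value = c(NA, 8.126557e-20, NA, NA, 1.911765e-01, 9.288746e-85)
)

keep <- df_head$odds > 1 & df_head$p.value < 0.05
keep                      # NA where the cluster had no test result
sum(keep, na.rm = TRUE)   # counts only clusters that pass both thresholds
which(keep)               # row indices of those clusters
```

Among these six rows only cluster 6 is counted. Cluster 2 has a very small p-value, but its odds are below 1, so it fails the first condition.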

How many clusters have pathway coherence?

data("keggmat")
coherence2 <- setup_pathway_coherence(targets$entrezmat, pathwaymat, verbose = FALSE)
df <- pathway_coherence(coherence2)
sum(df$odds > 1 & df$p.value < 0.05, na.rm = TRUE)
## [1] 1
head(df) # clusters with a single cell type and/or insufficient data have NA in the corresponding row
##    odds      p.value
## 1    NA           NA
## 2 5.012 7.866594e-07
## 3    NA           NA
## 4    NA           NA
## 5 1.194 6.862095e-01
## 6 1.450 2.455643e-01

Session information

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    grid      parallel  splines   stats     graphics  grDevices
##  [8] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.2.3   AnnotationDbi_1.32.3 Biobase_2.30.0      
##  [4] ChIPpeakAnno_3.4.6   RSQLite_1.0.0        DBI_0.3.1           
##  [7] VennDiagram_1.6.16   futile.logger_1.4.1  GenomicRanges_1.22.4
## [10] GenomeInfoDb_1.6.3   Biostrings_2.38.4    XVector_0.10.0      
## [13] IRanges_2.4.8        S4Vectors_0.8.11     BiocGenerics_0.16.1 
## [16] biomaRt_2.26.1       RColorBrewer_1.1-2   kohonen_2.0.19      
## [19] MASS_7.3-45          class_7.3-14         ROCR_1.0-7          
## [22] gplots_2.17.0        gbm_2.1.1            lattice_0.20-33     
## [25] survival_2.38-3     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3                  GO.db_3.2.2                 
##  [3] Rsamtools_1.22.0             gtools_3.5.0                
##  [5] digest_0.6.9                 mime_0.4                    
##  [7] R6_2.1.2                     futile.options_1.0.0        
##  [9] evaluate_0.8.3               httr_1.1.0                  
## [11] BiocInstaller_1.20.1         zlibbioc_1.16.0             
## [13] GenomicFeatures_1.22.13      gdata_2.17.0                
## [15] rmarkdown_0.9.5              BiocParallel_1.4.3          
## [17] AnnotationHub_2.2.5          stringr_1.0.0               
## [19] RCurl_1.95-4.8               shiny_0.13.1                
## [21] httpuv_1.3.3                 rtracklayer_1.30.2          
## [23] multtest_2.26.0              htmltools_0.3               
## [25] SummarizedExperiment_1.0.2   interactiveDisplayBase_1.8.0
## [27] matrixStats_0.50.1           XML_3.98-1.4                
## [29] GenomicAlignments_1.6.3      bitops_1.0-6                
## [31] RBGL_1.46.0                  xtable_1.8-2                
## [33] magrittr_1.5                 formatR_1.3                 
## [35] graph_1.48.0                 KernSmooth_2.23-15          
## [37] stringi_1.0-1                limma_3.26.8                
## [39] lambda.r_1.1.7               ensembldb_1.2.2             
## [41] tools_3.2.3                  BSgenome_1.38.0             
## [43] yaml_2.1.13                  regioneR_1.2.3              
## [45] caTools_1.17.1               memoise_1.0.0               
## [47] knitr_1.12.3