These scripts were written in R markdown and the HTML files were generated using the knitr package; R files were converted into HTML reports using the "spin()" function.
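As a rough illustration of the "spin()" conversion mentioned above, here is a minimal sketch. It assumes the knitr package is installed. The script contents are made up for illustration and are not taken from the project's scripts.

```r
library(knitr)

# Hypothetical script contents: lines starting with #' become prose,
# all other lines become R code chunks.
script_lines <- c(
  "#' # Example report",
  "#' This text is rendered as markdown.",
  "x <- 1:10",
  "summary(x)"
)

# knit = FALSE stops after the conversion to R markdown, so no code is run.
rmd <- spin(text = script_lines, knit = FALSE)
cat(rmd, sep = "\n")
```

Passing a file path instead, for example spin("download_tcga_data.R"), with the default knit = TRUE should also run the script and build the report.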
Note: Most scripts contain a variable "theRootDir", which specifies the root directory that contains the data. This is assumed to be located at "/mnt/data_scratch/finalData". Either update this variable to match the location you have used on your machine, or create this directory.
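One way to set up the root directory from the R prompt is sketched below. The helper name "ensure_dir" is hypothetical and does not appear in the scripts; only "theRootDir" and its default path come from the note above.

```r
# Hypothetical helper: create a directory (and its parents) if it is missing,
# and report whether it exists afterwards.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  dir.exists(path)
}

# Default location assumed by the scripts; change it to suit your machine.
theRootDir <- "/mnt/data_scratch/finalData"
if (!ensure_dir(theRootDir)) {
  message("Could not create ", theRootDir,
          "; set theRootDir to a writable location instead.")
}
```

If you choose a different location, the same value of "theRootDir" would need to be set in each script that defines it.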
Note: These scripts need to be run in the sequence below: the early scripts create / download data that is used by the subsequent scripts!
Note: Substantial computing resources are required to run some of these scripts. The scripts were tested on the Bionimbus Protected Data Cloud, using Ubuntu Linux 14.04.3 LTS, under a configuration with 32 cores and 128 GB of RAM. R version 3.2.2 was used.
To install the package, first install the dependencies and required packages. From the R prompt:
> source("http://bioconductor.org/biocLite.R")
> biocLite(c("car", "ridge", "preprocessCore", "genefilter", "sva")) # pRRophetic dependencies
> biocLite(c("parallel", "GenomicFeatures", "ggplot2", "org.Hs.eg.db", "TxDb.Hsapiens.UCSC.hg19.knownGene", "glmnet", "gdata", "knitr")) # other required packages
Then download and install the pRRophetic package, which is used to impute drug sensitivity in the TCGA samples:
> download.file("http://128.135.165.197/pRRophetic/pRRophetic_0.5.tar.gz", "pRRophetic.tar.gz")
> install.packages("pRRophetic.tar.gz", repos=NULL, type="source")
Note that the pRRophetic package is also available for download from the Open Science Framework: https://osf.io/5xvsg/
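After installation, a quick check such as the following sketch can confirm that everything is in place. The package names are copied from the commands above; the helper "find_missing" is hypothetical and is not part of the project.

```r
# Packages named in the installation steps above.
required <- c("car", "ridge", "preprocessCore", "genefilter", "sva",
              "parallel", "GenomicFeatures", "ggplot2", "org.Hs.eg.db",
              "TxDb.Hsapiens.UCSC.hg19.knownGene", "glmnet", "gdata",
              "knitr", "pRRophetic")

# Hypothetical helper: return the packages that are not installed.
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(required)
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

On recent R versions, biocLite() has been superseded by BiocManager::install(); that route is outside the R 3.2.2 setup the scripts were tested on.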
download_tcga_data.R [R SCRIPT] [HTML Report]
This script will allow one to download the TCGA data used in this project. Data is downloaded from firebrowse.org.
batch_correct_tcga_data.R [R SCRIPT] [HTML Report]
This script will allow one to batch correct the data downloaded above. We base our batch correction method on the RUV method described here.
map_cnvs_to_genes.R [R SCRIPT]
We used this script to map CNV regions to genes in the TCGA data.
getPredsOnAllTCGA_batchCorrData.R [R SCRIPT] [HTML Report]
This script will create a drug response prediction for every sample in TCGA, using the RNA-seq gene expression data.
classify_Type_Models_reproduce.R [R SCRIPT] [HTML Report]
This script will allow one to fit logistic regression models on the CGP cell line data and attempt to classify the tissue-of-origin of TCGA samples using the models fit on cell lines. (Supplementary Figure 1).
breast_cancer_analysis.R [R SCRIPT] [HTML Report]
This script will show that our method reproduces the expected association between lapatinib and ERBB2 status in breast cancer. We also show that this association is drug specific and that it can be recovered when we generate predictions on only breast cancer samples or across all of TCGA. (Figure 2).
breast_cancer_cnv_analysis.R [R SCRIPT] [HTML Report]
This script will perform an integrative analysis of the CNV and predicted drug response data in TCGA breast cancer samples. The script shows that ERBB2 amplification and lapatinib response are strongly associated, and that ERBB2 can be identified from these data as the causative gene. This script also identifies the association between ERLIN2 amplification and vinorelbine resistance. (Figure 3 a and b, Figure 4 a and b).
erbb2_cnv_in_cgp.R [R SCRIPT] [HTML Report]
In this script we will investigate the association between ERBB2 amplification and lapatinib response in the GDSC cell lines. (Figure 3 c and d)
erlin2_cnv_in_cgp.R [R SCRIPT] [HTML Report]
In this script we will investigate the association between ERLIN2 amplification and vinorelbine response in the GDSC cell lines. (Supplementary Figure 6 and 7)
loocv_gdsc_allDrugs.R [R SCRIPT] [HTML Report]
This script will predict drug response from gene expression using 10-fold cross-validation on the GDSC cell lines. This is to estimate the prediction accuracy for each of the drugs. (Supplementary Table 6)
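To illustrate the idea of 10-fold cross-validation described above, here is a minimal sketch on simulated data. It uses ordinary linear regression from base R as a stand-in; it is not the pRRophetic model or the GDSC data, and all variable names are hypothetical.

```r
set.seed(1)

# Simulated stand-ins: 100 "cell lines", 5 "genes", one "drug response".
n <- 100
p <- 5
x <- matrix(rnorm(n * p), nrow = n)
y <- as.vector(x %*% c(1, -1, 0.5, 0, 0) + rnorm(n, sd = 0.5))

# Randomly assign each sample to one of k folds of equal size.
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# For each fold, fit on the other nine folds and predict the held-out one.
preds <- numeric(n)
for (i in seq_len(k)) {
  test_idx <- which(folds == i)
  train <- data.frame(y = y[-test_idx], x[-test_idx, , drop = FALSE])
  fit <- lm(y ~ ., data = train)
  newdata <- data.frame(x[test_idx, , drop = FALSE])
  preds[test_idx] <- predict(fit, newdata = newdata)
}

# Summarise accuracy as the correlation between held-out predictions
# and the observed values.
cv_cor <- cor(preds, y, method = "spearman")
print(cv_cor)
```

Every prediction here comes from a model that never saw that sample, which is what makes the resulting correlation an estimate of out-of-sample accuracy.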
models_on_all_TCGA.R [R SCRIPT] [HTML Report]
This script will run IDWAS against all somatic mutations called from the exome sequence data in TCGA. (Figure 5 and the supplementary tables containing these results with and without GLDS correction; also the supplementary tables containing the imputed drug information for TCGA, corrected for both cancer type and GLDS.)
models_on_all_TCGA_againstCNVs.R [R SCRIPT] [HTML Report]
This script will run IDWAS against all CNAs called from the exome sequence data in TCGA. (Supplementary Figure 9 and 10)
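The run order required by the note near the top can be sketched as follows. The file names are those listed above; the helper "run_in_order", and the assumption that all scripts sit in a single directory, are hypothetical. Each script is sourced in a fresh environment here, which is a design choice for this sketch and may not match how the scripts were originally run.

```r
# Scripts in the order they are listed above.
script_order <- c(
  "download_tcga_data.R",
  "batch_correct_tcga_data.R",
  "map_cnvs_to_genes.R",
  "getPredsOnAllTCGA_batchCorrData.R",
  "classify_Type_Models_reproduce.R",
  "breast_cancer_analysis.R",
  "breast_cancer_cnv_analysis.R",
  "erbb2_cnv_in_cgp.R",
  "erlin2_cnv_in_cgp.R",
  "loocv_gdsc_allDrugs.R",
  "models_on_all_TCGA.R",
  "models_on_all_TCGA_againstCNVs.R"
)

# Hypothetical helper: source each script in turn, stopping at the first
# script that cannot be found.
run_in_order <- function(scripts, script_dir = ".") {
  for (s in scripts) {
    path <- file.path(script_dir, s)
    if (!file.exists(path)) stop("Script not found: ", path)
    message("Running ", s)
    source(path, local = new.env())
  }
  invisible(scripts)
}

# Example call (not run here); the directory is a placeholder:
# run_in_order(script_order, script_dir = "path/to/scripts")
```

Given the computing requirements noted above, running the scripts one at a time and checking each report may be more practical than a single unattended run.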