Supplement II: Studying differential usage of exons using RNA-seq data
Table of Contents
1 Summary
The purpose of this document is to provide a reproducible workflow for the analysis and comparisons made in the paper "Studying differential usage of exons using RNA-seq data". The starting point of the workflow are the files with the read alignments. All the data files and all scripts needed to run this analysis are provided in http://www-huber.embl.de/pub/DEXSeq/analysis/. Environment variables that point to the location of these files are defined in the code. In order to reproduce the code, set these variables to reflect the location their local copies on your system.
2 System requirements to run all code
a) Unix environment with bash shell b) R and attached packages (check below) c) python 2.6.6 or higher e) htseq-0.5.3 python library f) cuffdiff 1.3.0, 1.1.0, 1.2.0
> sessionInfo() R Under development (unstable) (2012-03-13 r58728) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] gplots_2.10.1 KernSmooth_2.23-7 caTools_1.12 [4] bitops_1.0-4.1 gdata_2.8.2 gtools_2.6.2 [7] lattice_0.20-6 GenomicFeatures_1.7.31 AnnotationDbi_1.17.27 [10] BSgenome_1.23.4 GenomicRanges_1.7.36 Biostrings_2.23.6 [13] IRanges_1.13.30 multicore_0.1-7 DEXSeq_1.1.10 [16] Biobase_2.15.4 BiocGenerics_0.1.14 loaded via a namespace (and not attached): [1] biomaRt_2.11.1 DBI_0.2-5 hwriter_1.3 plyr_1.7.1 [5] RCurl_1.91-1 Rsamtools_1.7.40 RSQLite_0.11.1 rtracklayer_1.15.7 [9] statmod_1.4.14 stats4_2.16.0 stringr_0.6 tools_2.16.0 [13] XML_3.9-4 zlibbioc_1.1.1
3 Setting the enviroment variables
exportANALYSIS=/g/huber/users/reyes/Graveley/doc exportSCRIPTS=$ANALYSIS/scripts
The variable "ANALYSIS" should point to a local copy of the directory http://www-huber.embl.de/pub/DEXSeq/analysis/.
4 Brooks et al
Brooks et al generated data using S2 cell lines from 3 biological replicates of a siRNA knockdown of the splicing factor pasilla and 4 biological replicates of an untreated control. These reads were downloaded from the SRA archive accession number GSM18508 and aligned using Tophat version 1.2.0 against the Drosophila Melanogaster genome.
4.1 DEXSeq analysis
Brooks et al annotation
We start by downloading the annotation file from the supplementary material of the Brooks et al paper (this step is not shown here). The script finaledit.pl adds columns with empty values to this gtf file in order to make it compatible with the standard gtf format. Then, we define the exon counting bins, as described in the paper, using the script dexseq_prepare_annotation.py
cd $ANALYSIS/brooksetal perl $SCRIPTS/finaledit.pl SG_r5_11_MB5_annotations.gtf \ > graveley_annotation.gtf python $SCRIPTS/dexseq_prepare_annotation.py graveley_annotation.gtf \ graveley_DEXSeq.gtf
Now we can count the number of reads whose alingnment overlaps with any of these defined exon bins.
python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ graveley_DEXSeq.gtf bam/treated1.sam treated1gr.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ graveley_DEXSeq.gtf bam/treated2.sam treated2gr.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ graveley_DEXSeq.gtf bam/treated3.sam treated3gr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ graveley_DEXSeq.gtf bam/untreated1.sam untreated1gr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ graveley_DEXSeq.gtf bam/untreated2.sam untreated2gr.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ graveley_DEXSeq.gtf bam/untreated3.sam untreated3gr.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ graveley_DEXSeq.gtf bam/untreated4.sam untreated4gr.counts
We perform differential exon usage analysis using DEXSeq.
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") countfiles <- list.files("../brooksetal", full.names=TRUE, pattern="en.counts") samplenames <- sapply( strsplit(sapply( strsplit(countfiles, "gr"), "[[", 1), "\\/"), "[[", 3) conditions <- substr(samplenames, 1, c(7, 7, 7, 9,9,9,9)) type <- c("single", "paired", "paired", "single", "single", "paired", "paired") des <- data.frame(condition=conditions, type=type) library(DEXSeq) ecs <- read.HTSeqCounts(countfiles, design=des, flattenedfile="../brooksetal/ENSEMBLannotation.DEXSeq.gtf") sampleNames(ecs) <- samplenames library(multicore) formuladispersion <- count ~ sample + (condition + type) * exon formula0 <- count ~ sample + type * exon + condition formula1 <- count ~ sample + type * exon + condition * I(exon == exonID) ecs <- makeCompleteDEUAnalysis(ecs, minCount=0, nCores=10, formulaDispersion=formuladispersion, formula0=formula0, formula1=formula1, maxExon=10000) save(ecs, file="brooksannotationecs.RData")
We observe 259 exon usage differences in 159 genes.
suppressMessages(library(DEXSeq)) setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("brooksannotationecs.RData") table( fData(ecs)$padjust < 0.1 ) length(unique(geneIDs(ecs)[which( fData(ecs)$padjust < 0.1 )]))
FALSE TRUE 41914 259 [1] 159
ENSEMBL annotation (used for comparisons with cuffdiff)
The differential exon usage analysis in the publication was also done using a different annotation, ENSEMBL (Flybase). To do this, we begin by downloading the annotation files from ENSEMBL from the release 62.
wget ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz gunzip Drosophila_melanogaster.BDGP5.25.62.gtf.gz
We run a preprocesing script edit_gtf_forcuff.pl to remove the "Het" chromosomes from the annotation file, and change the chromosome names to match them with the chromosome names from the alignment indexes. (eg. change 2L to chr2L).
perl $SCRIPTS/edit_gtf_forcuff.pl Drosophila_melanogaster.BDGP5.25.62.gtf \ > Drosophila_melanogaster.BDGP5.25.62.mychr.gtf
We run the preprocessing scripts for DEXSeq.
python $SCRIPTS/dexseq_prepare_annotation.py \ Drosophila_melanogaster.BDGP5.25.62.mychr.gtf \ ENSEMBLannotation.DEXSeq.gtf
We proceed to count how many read alignments overlap with our defined counting bins, using the script dexseq_count.py:
python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/treated1.sam treated1en.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/treated2.sam treated2en.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/treated3.sam treated3en.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/untreated1.sam untreated1en.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/untreated2.sam untreated2en.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/untreated3.sam untreated3en.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 10 \ ENSEMBLannotation.DEXSeq.gtf bam/untreated4.sam untreated4en.counts
Once finished with these preprocessing steps, we can continue our analysis using the Bioconductor package DEXSeq. The code is displayed below. If you would like to know details of the code, check the DEXSeq vignette and user manual.
We begin by generating an ExonCountSet object.
setwd(Sys.getenv("ANALYSIS")) setwd("org") countfiles <- list.files("../brooksetal", full.names=TRUE, pattern="en.counts") samplenames <- sapply( strsplit(sapply( strsplit(countfiles, "en"), "[[", 1), "\\/"), "[[", 3) conditions <- substr(samplenames, 1, c(7, 7, 7, 9,9,9,9)) type <- c("single", "paired", "paired", "single", "single", "paired", "paired") des <- data.frame(condition=conditions, type=type) library(DEXSeq) ecs <- read.HTSeqCounts(countfiles, design=des, flattenedfile="../brooksetal/ENSEMBLannotation.DEXSeq.gtf") sampleNames(ecs) <- samplenames
We define the formulas used for the analysis and make a complete differential exon usage analysis using 10 CPU cores, taking all the samples and testing treated vs untreated. Comparison Number 1
library(multicore) formuladispersion <- count ~ sample + (condition + type) * exon formula0 <- count ~ sample + type * exon + condition formula1 <- count ~ sample + type * exon + condition * I(exon == exonID) allENSecs <- list() allENSecs[[1]] <- makeCompleteDEUAnalysis(ecs, minCount=0, nCores=10, formulaDispersion=formuladispersion, formula0=formula0, formula1=formula1, maxExon=10000)
Testing just in the paired end samples, taking the library type as a covariate, e.g. using the default formulas. Comparison number 2.
allENSecs[[2]] <- makeCompleteDEUAnalysis(ecs[,design(ecs)$type == "paired"], minCount=0, nCores=10, maxExon=100000)
Mixing paired-end and single-end libraries and testing treated vs untreated. Comparison number 3, treated 1-2 vs untreated 1-3.
allENSecs[[3]] <- makeCompleteDEUAnalysis(ecs[,c(1,2,4,6)],
minCount=0,
nCores=10,
maxExon=100000)
Making the first "two-by-two" comparison, untreated 1-3 vs untreated 2-4, comparison number 4:
ecsun <- ecs[,design(ecs)$condition == "untreated"] design(ecsun)$condition <- factor(c("un-A", "un-B", "un-A", "un-B")) allENSecs[[4]] <- makeCompleteDEUAnalysis(ecsun, minCount=0, nCores=10, maxExon=100000)
Making the second "two-by-two" comparison untreated 1-4 vs untreated 2-3, comparison number 5:
design(ecsun)$condition <- factor(c("un-A", "un-B", "un-B", "un-A")) allENSecs[[5]] <- makeCompleteDEUAnalysis(ecsun, minCount=0, nCores=10, maxExon=100000) save(allENSecs, file="../brooksetal/allENSecs.RData")
Cuffdiff 1.3.0 analysis
The exact same tests done with DEXSeq are also performed with cuffdiff. We begin by modifying the annotation files to add the attributes that cuffdiff requires to test for splicing regulation as its manual indicates. According to this manual, the output file with name cuffcmp.combined.gtf should have now the appropiate attributes for cuffdiff.
cd $ANALYSIS/brooksetal exportCUFFDIFF=/g/huber/users/reyes/software/opt/cufflinks-1.3.0.Linux_x86_64/ $CUFFDIFF/cuffcompare -s d_melanogaster_dm3.fa -CG -p Drosophilacuffdiffattr -r \ Drosophila_melanogaster.BDGP5.25.62.mychr.gtf \ Drosophila_melanogaster.BDGP5.25.62.mychr.gtf
Now we run cuffdiff, testing all treated vs untreated samples, comparison number 1.
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/all-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam,bam/treated3.bam \ bam/untreated1.bam,bam/untreated2.bam,bam/untreated3.bam,bam/untreated4.bam
Run cuffdiff comparing treated vs untreated just with the paired samples, comparison number 2:
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/paired-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated2.bam,bam/treated3.bam \ bam/untreated3.bam,bam/untreated4.bam
Run cuffdiff for comparison number 3:
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/mixed-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam \ bam/untreated1.bam,bam/untreated3.bam
Run cuffdiff testing untreated 1-3 vs untreated 2-4, comparison number 4:
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/untreated13-vs-untreated24 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated3.bam \ bam/untreated2.bam,bam/untreated4.bam
Run cuffdiff testing untreated 1-4 vs untreated 2-3, comparison number 5:
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/untreated14-vs-untreated23 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated4.bam \ bam/untreated2.bam,bam/untreated3.bam
Cuffdiff 1.2.0 analysis
We perform exactly the same comparisons, now using cuffdiff version 1.2.0:
exportCUFFDIFF102=/g/huber/users/reyes/software/opt/cufflinks-1.2.0.Linux_x86_64/ $CUFFDIFF102/cuffdiff --FDR 0.1 -o cuffdiff-outputs-102/all-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam,bam/treated3.bam \ bam/untreated1.bam,bam/untreated2.bam,bam/untreated3.bam,bam/untreated4.bam $CUFFDIFF102/cuffdiff --FDR 0.1 -o cuffdiff-outputs-102/paired-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated2.bam,bam/treated3.bam \ bam/untreated3.bam,bam/untreated4.bam $CUFFDIFF102/cuffdiff --FDR 0.1 -o cuffdiff-outputs-102/mixed-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam \ bam/untreated1.bam,bam/untreated3.bam $CUFFDIFF102/cuffdiff --FDR 0.1 -o cuffdiff-outputs-102/untreated13-vs-untreated24 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated3.bam \ bam/untreated2.bam,bam/untreated4.bam $CUFFDIFF102/cuffdiff --FDR 0.1 -o cuffdiff-outputs-102/untreated14-vs-untreated23 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated4.bam \ bam/untreated2.bam,bam/untreated3.bam
Cuffdiff 1.1.0 analysis
Now using cuffdiff version 1.1.0:
exportCUFFDIFF110=/g/huber/users/reyes/software/opt/cufflinks-1.1.0.Linux_x86_64 $CUFFDIFF110/cuffdiff --FDR 0.1 -o cuffdiff-outputs-110/all-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam,bam/treated3.bam \ bam/untreated1.bam,bam/untreated2.bam,bam/untreated3.bam,bam/untreated4.bam $CUFFDIFF110/cuffdiff --FDR 0.1 -o cuffdiff-outputs-110/paired-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated2.bam,bam/treated3.bam \ bam/untreated3.bam,bam/untreated4.bam $CUFFDIFF110/cuffdiff --FDR 0.1 -o cuffdiff-outputs-110/mixed-treated-vs-untreated \ cuffcmp.combined.gtf \ bam/treated1.bam,bam/treated2.bam \ bam/untreated1.bam,bam/untreated3.bam $CUFFDIFF110/cuffdiff --FDR 0.1 -o cuffdiff-outputs-110/untreated13-vs-untreated24 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated3.bam \ bam/untreated2.bam,bam/untreated4.bam $CUFFDIFF110/cuffdiff --FDR 0.1 -o cuffdiff-outputs-110/untreated14-vs-untreated23 \ cuffcmp.combined.gtf \ bam/untreated1.bam,bam/untreated4.bam \ bam/untreated2.bam,bam/untreated3.bam
DEXSeq vs cuffdiff
These are the numbers summarizing the comparisons which each software. In comparison 1, we made a test, treated vs untreated, using all the 7 samples.
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("allENSecs.RData") sapply(allENSecs, function(x){ length(unique( geneIDs(x)[which(fData(x)$padjust < 0.1)] )) } ) system("grep -c yes cuffdiff-outputs*/*/splicing*")
[1] 159 154 52 8 7 cuffdiff-outputs-102/all-treated-vs-untreated/splicing.diff:69 cuffdiff-outputs-102/mixed-treated-vs-untreated/splicing.diff:120 cuffdiff-outputs-102/paired-treated-vs-untreated/splicing.diff:735 cuffdiff-outputs-102/untreated13-vs-untreated24/splicing.diff:650 cuffdiff-outputs-102/untreated14-vs-untreated23/splicing.diff:724 cuffdiff-outputs-110/all-treated-vs-untreated/splicing.diff:145 cuffdiff-outputs-110/mixed-treated-vs-untreated/splicing.diff:323 cuffdiff-outputs-110/paired-treated-vs-untreated/splicing.diff:310 cuffdiff-outputs-110/untreated13-vs-untreated24/splicing.diff:314 cuffdiff-outputs-110/untreated14-vs-untreated23/splicing.diff:392 cuffdiff-outputs/all-treated-vs-untreated/splicing.diff:50 cuffdiff-outputs/mixed-treated-vs-untreated/splicing.diff:578 cuffdiff-outputs/paired-treated-vs-untreated/splicing.diff:754 cuffdiff-outputs/untreated13-vs-untreated24/splicing.diff:639 cuffdiff-outputs/untreated14-vs-untreated23/splicing.diff:728
Making the HTML
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("allENSecs.RData") ecs <- allENSecs[[1]] plus <- fData(ecs)$strand == "+" minus <- unique(geneIDs(ecs)[which(!plus)]) exonID <- exonIDs(ecs) geneID <- geneIDs(ecs) for(gn in minus){ exids <- exonID[geneID %in% gn] exonID[geneID %in% gn] <- exids[length(exids):1] } exonIDs(ecs) <- exonID save(ecs, file="ecs-plots.RData") color.samples=c("red1", "red4", "red4", "blue1", "blue1", "#2171B5", "#2171B5") mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl") attributes <- c("external_gene_id", "description") filter="ensembl_gene_id" DEXSeqHTML(ecs, FDR=0.1, color.samples=color.samples, path="ens-rep", filter=filter, mart=mart, attributes=attributes)
5 Braward et al
To demonstrate that our method is also applicable to other types of experimental designs and different library preparations, we performed additional analysis taking the chimpanzee brain and cerebellum samples from the dataset by Braward et al. In this case, our aim was to find exon usage differences between this tissues. The dataset contains 5 brain samples, one female and 4 males, and 2 cerebellum samples from one female and one male. As all the samples from the male brain samples are paired-end and the rest are single read, we decided to trim all the reads to make them look the same, e.g. single read 76 bp, to avoid confounding factors. The short reads can be downloaded from the Gene Expression Omnibus under the accession code GSE30352.
These alignments were done using GSNAP 2012-01-11 against Pan troglodytes genomic DNA of the ENSEMBL release 64.
DEXSeq analysis
We download the chimp annotation from the ENSEMBL release 64 and define our counting bins.
cd $ANALYSIS/brawardetal/ exportSCRIPTS=../scripts wget ftp://ftp.ensembl.org/pub/release-64/gtf/pan_troglodytes/Pan_troglodytes.CHIMP2.1.64.gtf.gz gunzip Pan_troglodytes.CHIMP2.1.64.gtf.gz python $SCRIPTS/dexseq_prepare_annotation.py Pan_troglodytes.CHIMP2.1.64.gtf \ ptr_DEXSeq.gtf
Count the number of reads falling into our defined exon bins.
python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-F-1-ptr.unpaired_uniq\ ptr-br-F-1-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-M-1-ptr.unpaired_uniq\ ptr-br-M-1-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-M-2-ptr.unpaired_uniq\ ptr-br-M-2-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-M-3-ptr.unpaired_uniq\ ptr-br-M-3-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-M-4-ptr.unpaired_uniq\ ptr-br-M-4-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-br-M-5-ptr.unpaired_uniq\ ptr-br-M-5-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-cb-F-1-ptr.unpaired_uniq\ ptr-cb-F-1-ptr.counts python $SCRIPTS/dexseq_count.py -p no -s no -a 0\ ptr_DEXSeq.gtf\ aln/ptr-cb-M-1-ptr.unpaired_uniq\ ptr-cb-M-1-ptr.counts
We remove the sex chromosomes to avoid biases due to the sex of the individuals. DEXSeq analysis testing brain against cerebellum:
setwd(Sys.getenv("ANALYSIS")) setwd("brawardetal") library(DEXSeq) library(multicore) filecounts <- dir(".", pattern="counts") des <- sapply(strsplit(filecounts, "-"), "[[", 2) ecs <- read.HTSeqCounts(filecounts, design=des, flattenedfile="ptr_DEXSeq.gtf") ecs <- ecs[-which(fData(ecs)$chr == "X"),] ecs <- makeCompleteDEUAnalysis( ecs, minCount=0, maxExon=150, nCores=10) plus <- fData(ecs)$strand == "+" minus <- unique(geneIDs(ecs)[which(!plus)]) exonID <- exonIDs(ecs) geneID <- geneIDs(ecs) for(gn in minus){ exids <- exonID[geneID %in% gn] exonID[geneID %in% gn] <- exids[length(exids):1] } ## exonIDs(ecs) <- exonID mart <- useMart("ensembl", dataset="ptroglodytes_gene_ensembl") attributes <- c("external_gene_id", "description") filter="ensembl_gene_id" DEXSeqHTML(ecs, FDR=0.1, path="brawardetal-rep", mart=mart, attributes=attributes, filter=filter) save(ecs, file="br-vs-cb.RData")
We take all the male brain samples, make all possible tests of 2 male brains against 2 male brains.
ecsMbr <- ecs[,c(2, 3, 4, 5, 6)] posconditions <- cbind(expand.grid(c("br-a", "br-b"), c("br-a", "br-b")), Var3=c("br-a", "br-b", "br-b", "br-a")) comboffor <- combn(1:5, 4) listecs <- list() cont=1 for(i in 1:ncol(comboffor)){ for(j in 1:ncol(posconditions)){ ecsn <- newExonCountSet(counts(ecsMbr[,comboffor[,i]]), design=posconditions[,j], geneIDs=as.character(geneIDs(ecs)), exonIDs=exonIDs(ecs)) ecsn <- estimateSizeFactors(ecsn) ecsn <- estimateDispersions(ecsn, minCount=0, maxExon=150, nCores=10) ecsn <- fitDispersionFunction(ecsn) ecsn <- testForDEU(ecsn) listecs[cont] <- ecsn cont=cont+1 } } save(listecs, file="br-ecs.RData")
Now we make all possible comparisons of 2 brains vs 2 cerebellums. Always taking one male and one female from each tissue. (Note that in the Brawand et al. data, samples from primates denoted brain are actually only prefrontal cortex.)
ecscbbr <- list() for(i in 2:6){ ecsn <- newExonCountSet(counts(ecs)[,c(1, i, 7, 8)], design(ecs)[c(1, i, 7, 8)], geneIDs=as.character(geneIDs(ecs)), exonIDs=exonIDs(ecs)) ecsn <- estimateSizeFactors(ecsn) ecsn <- estimateDispersions(ecsn, minCount=0, maxExon=150, nCores=10) ecsn <- fitDispersionFunction(ecsn) ecsn <- testForDEU(ecsn) ecscbbr[i-1] <- ecsn } save(ecscbbr, file="ecscbbr.RData")
Cuffdiff
We perform the same comparisons using cuffdiff version 1.3.0.
exportCUFFDIFF=/g/huber/users/reyes/software/opt/cufflinks-1.3.0.Linux_x86_64 cd $ANALYSIS/brawardetal $CUFFDIFF/cuffcompare -s Pan_troglodytes.CHIMP2.1.64.dna.toplevel.fa -CG -r\ Pan_troglodytes.CHIMP2.1.64.gtf\ Pan_troglodytes.CHIMP2.1.64.gtf grep -v ^X cuffcmp.combined.gtf > cuffcmp.combined.noX.gtf $CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/all-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-1-ptr-sorted.bam,aln/ptr-br-M-2-ptr-sorted.bam,aln/ptr-br-M-3-ptr-sorted.bam,aln/ptr-br-M-4-ptr-sorted.bam,aln/ptr-br-M-5-ptr-sorted.bam
$CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/two-1-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-1-ptr-sorted.bam $CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/two-2-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-2-ptr-sorted.bam $CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/two-3-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-3-ptr-sorted.bam $CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/two-4-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-4-ptr-sorted.bam $CUFFDIFF/cuffdiff --FDR 0.1 -o cuffdiff-outputs/two-5-br-vs-cb \ cuffcmp.combined.noX.gtf \ aln/ptr-cb-F-1-ptr-sorted.bam,aln/ptr-cb-M-1-ptr-sorted.bam \ aln/ptr-br-F-1-ptr-sorted.bam,aln/ptr-br-M-5-ptr-sorted.bam
For the mock comparisons we take all male brain samples (5), and make all the possible 2 vs 2 comparisons, calling them from R using system calls.
setwd(Sys.getenv("ANALYSIS")) setwd("brawardetal") comboffor <- combn(1:5, 4) posconditions <- cbind(expand.grid(c("br-a", "br-b"), c("br-a", "br-b")), Var3=c("br-a", "br-b", "br-b", "br-a")) bamfiles <- dir("aln", pattern="*sorted.bam")[c(2, 3, 4, 5, 6)] cont <- .01 for( i in 1:ncol(comboffor)){ for( j in 1:ncol(posconditions)){ totest <- bamfiles[comboffor[,i]] names(totest) <- posconditions[,j] gr1 <- paste("aln/", totest[which(names(totest) == "br-a")], sep="") gr2 <- paste("aln/", totest[which(names(totest) == "br-b")], sep="") gr1 <- paste(gr1[1], gr1[2], sep=",") gr2 <- paste(gr2[1], gr2[2], sep=",") nm1 <- sapply(strsplit(gr1, "-"), function(x){x[c(4,9)]}) nm2 <- sapply(strsplit(gr2, "-"), function(x){x[c(4,9)]}) nm1 <- paste(nm1[1,], nm1[2,], sep="") nm2 <- paste(nm2[1,], nm2[2,], sep="") dirname <- paste("cuffdiff-outputs/br", cont, nm1, nm2, sep="-") system(paste("mkdir", dirname)) que <- paste("$CUFFDIFF/cuffdiff --FDR 0.1 -o", dirname, "cuffcmp.combined.noX.gtf", gr1, gr2, sep=" ") print(que) system(que) cont=cont+.01 } }
DEXSeq vs cuffdiff
Summary of the results of the comparison brain vs cerebellum from the chimpanzee data from Braward et al:
suppressMessages(library(DEXSeq)) setwd(Sys.getenv("ANALYSIS")) setwd("brawardetal") load("br-vs-cb.RData") table( fData(ecs)$padjust < 0.1 ) length(unique(geneIDs(ecs)[which( fData(ecs)$padjust < 0.1 )])) system("grep -c yes cuffdiff-outputs/all-br-vs-cb/splicing.diff")
FALSE TRUE 153103 866 [1] 650 114
Results of doing the mock comparisons 2 brains vs 2 brains,
load("br-ecs.RData") sapply(listecs, function(x){ length(unique( geneIDs(x)[which(fData(x)$padjust < 0.1)] )) } ) system("grep -c yes cuffdiff-outputs/br-0.*/splicing*")
[1] 3 0 244 2 1 2 2 2 2 10 5 0 1 19 0 cuffdiff-outputs/br-0.01-13-24/splicing.diff:405 cuffdiff-outputs/br-0.02-12-34/splicing.diff:399 cuffdiff-outputs/br-0.03-14-23/splicing.diff:590 cuffdiff-outputs/br-0.04-13-25/splicing.diff:628 cuffdiff-outputs/br-0.05-12-35/splicing.diff:499 cuffdiff-outputs/br-0.06-15-23/splicing.diff:555 cuffdiff-outputs/br-0.07-14-25/splicing.diff:460 cuffdiff-outputs/br-0.08-12-45/splicing.diff:504 cuffdiff-outputs/br-0.09-15-24/splicing.diff:308 cuffdiff-outputs/br-0.1-14-35/splicing.diff:497 cuffdiff-outputs/br-0.11-13-45/splicing.diff:554 cuffdiff-outputs/br-0.12-15-34/splicing.diff:353 cuffdiff-outputs/br-0.13-24-35/splicing.diff:476 cuffdiff-outputs/br-0.14-23-45/splicing.diff:823 cuffdiff-outputs/br-0.15-25-34/splicing.diff:526
Results of doing all possible comparisons, 2 cerebellums (male and female) vs 2 brains (male and female).
load("ecscbbr.RData") sapply(ecscbbr, function(x){ length(unique( geneIDs(x)[which(fData(x)$padjust < 0.1)] )) } ) system("grep -c yes cuffdiff-outputs/two-*/splicing*")
[1] 56 18 26 32 27 cuffdiff-outputs/two-1-br-vs-cb/splicing.diff:230 cuffdiff-outputs/two-2-br-vs-cb/splicing.diff:361 cuffdiff-outputs/two-3-br-vs-cb/splicing.diff:370 cuffdiff-outputs/two-4-br-vs-cb/splicing.diff:215 cuffdiff-outputs/two-5-br-vs-cb/splicing.diff:380
6 ENCODE dataset
We now apply DEXSeq to an additional dataset from the ENCODE project. We chose the h1-hesc and Huvec cell lines because they had 2 biological replicates each one. They are 76 bp paired end reads.
exportENCODE=$ANALYSIS/encode cd $ENCODE/fastq wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd1Rep1.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep1.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd1Rep2.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep2.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd1Rep1.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd2Rep1.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd1Rep2.fastq.gz wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd2Rep2.fastq.gz gunzip *gz exportGSNAP=/g/huber/users/reyes/software/opt/gmap-2012-01-11-schroedinger/bin1/ exportFASTQ=$ANALYSIS/encode/fastq cd $ENCODE/aln/ $GSNAP/gsnap -D /g/huber/users/reyes/Mammals/genome-indexes-sch-2012-11-01/ -d hsa --split-output=HuvecR1 -N 1 -A sam -t 4 $FASTQ/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd1Rep1.fastq $FASTQ/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep1.fastq $GSNAP/gsnap -D /g/huber/users/reyes/Mammals/genome-indexes-sch-2012-11-01/ -d hsa --split-output=HuvecR2 -N 1 -A sam -t 4 $FASTQ/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd1Rep2.fastq $FASTQ/wgEncodeCshlLongRnaSeqHuvecCellLongnonpolyaFastqRd2Rep2.fastq $GSNAP/gsnap -D /g/huber/users/reyes/Mammals/genome-indexes-sch-2012-11-01/ -d hsa --split-output=H1hescR1 -N 1 -A sam -t 4 $FASTQ/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd1Rep1.fastq $FASTQ/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd2Rep1.fastq $GSNAP/gsnap -D /g/huber/users/reyes/Mammals/genome-indexes-sch-2012-11-01/ -d hsa --split-output=H1hescR2 -N 1 -A sam -t 4 $FASTQ/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd1Rep2.fastq $FASTQ/wgEncodeCshlLongRnaSeqH1hescCellLongnonpolyaFastqRd2Rep2.fastq python $SCRIPTS/dexseq_count.py -p yes -s no -a 0 hsa.DEXSeq.gtf aln/HuvecR1.concordant_uniq HuvecR1.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 0 hsa.DEXSeq.gtf aln/HuvecR2.concordant_uniq HuvecR2.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 0 hsa.DEXSeq.gtf aln/H1hescR1.concordant_uniq H1hescR1.counts python $SCRIPTS/dexseq_count.py -p yes -s no -a 0 hsa.DEXSeq.gtf aln/H1hescR2.concordant_uniq H1hescR2.counts
Run a standard differential exon usage analysis,
library(DEXSeq) library(multicore) load("ecs.RData") ecs <- ecs[,3:6] ecs <- estimateSizeFactors( ecs ) ecs <- estimateDispersions( ecs, nCores=15, maxExon=150, minCount=30) ecs <- fitDispersionFunction( ecs ) ecs <- testForDEU( ecs, nCores=15 ) save(ecs, file="ecs-sub.RData") ### flip exon labels of the minus strand plus <- fData(ecs)$strand == "+" minus <- unique(geneIDs(ecs)[which(!plus)]) exonID <- exonIDs(ecs) geneID <- geneIDs(ecs) for(gn in minus){ exids <- exonID[geneID %in% gn] exonID[geneID %in% gn] <- exids[length(exids):1] } ## exonIDs(ecs) <- exonID ecs <- estimatelogfold2changes(ecs, nCores=35) save(ecs, file="ecs-sub-plots.RData") mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl") attributes <- c("external_gene_id", "description") filter="ensembl_gene_id" DEXSeqHTML(ecs, FDR=0.1, path="cell-lines-report", mart=mart, attributes=attributes, filter=filter)
setwd(Sys.getenv("ANALYSIS")) setwd("encode/") load("ecs-sub.RData") table(fData(ecs)$padjust < 0.1) length(unique(geneIDs(ecs)[which(fData(ecs)$padjust < 0.1)]))
FALSE TRUE 253369 30339 [1] 7795
7 Figures and numbers of the paper
Numbers in the first paragraph
tx <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "knownGene") pr2 <- exonsBy(tx, "tx") sums <- sapply(pr2, function(x){rn <- ranges(x);sum(abs(start(rn)-end(rn)))}) median(sums) max(sums)
Figure 2
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("allENSecs.RData") ecs <- allENSecs[[1]] png( "dispests.png", width=1400, height=1400, pointsize=30 ) plot( rowMeans( counts(ecs,normalized=TRUE) ), fData(ecs)$dispBeforeSharing, log="xy", pch=20, cex=.2, col="#000000B0", xlim=c( 1, 3e5), ylim=c(1e-6,1e3), xaxs="i", yaxs="i", xlab="average normalized count value", ylab="Cox-Reid dispersion estimate", cex.axis=1.4, cex.lab=1.4) segments( rowMeans( counts(ecs,normalized=TRUE) )[ fData(ecs)$dispBeforeSharing < 1e-6 ], 1.5e-6, rowMeans( counts(ecs,normalized=TRUE) )[ fData(ecs)$dispBeforeSharing < 1e-6 ], 1e-6, col="#00000040" ) xg <- 10^seq( 0, 6, length.out=300 ) yg <- ecs@dispFitCoefs[1] + ecs@dispFitCoefs[2] / xg lines( xg, yg, col="red", lwd=3 ) for( q in c( .01, .05, .95, .99 ) ) lines( xg, yg * qchisq( q, 4 )/4, col="red", lwd=3, lty="dashed" ) dev.off()
Figure 3
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") pdf("Ten-m.pdf") plotDEXSeq(ecs, "FBgn0004449", legend=TRUE, norCounts=TRUE, color=c("red1", "blue1"), lwd=1.8) dev.off()
Figure 4
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") png( "MvA.png", width=1400, height=1400, pointsize=30 ) ylim <- 5 means <- rowMeans(counts(ecs,normalized=TRUE)) lfc <- fData(ecs)$`log2fold(untreated/treated)` cols <- ifelse( fData(ecs)$padj<.1, "#A00000A0", "#00000040" ) plot( means, lfc, col=cols, las=1, log="x", pch=20, cex=.4, ylim=c(-ylim,ylim), yaxs="i", yaxt="n", xlim = range( means[fData(ecs)$testable] ), xlab="averaged normalized count value", ylab="fold change in exon usage", cex.axis=1.4, cex.lab=1.4 ) axis( 2, -4:4, c( "1/16", "1/8", "1/4", "1/2", "1", "2", "4", "8", "16" ), las=1, cex.axis=1.4) segments( means[ lfc < -ylim ], -ylim*.97, means[ lfc < -ylim ], -ylim, col = cols[ lfc < -ylim ], lwd=3 ) segments( means[ lfc > ylim ], ylim*.97, means[ lfc > ylim ], ylim, col = cols[ lfc > ylim ], lwd=3 ) dev.off()
Figure 5
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") pdf("Ten-m.pdf") plotDEXSeq(ecs, "FBgn0004449", legend=TRUE, norCounts=TRUE, color=c("red1", "blue1"), lwd=1.8) dev.off()
Figure 6
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") pdf("RpS14b.pdf") plotDEXSeq(ecs, "FBgn0004404", legend=TRUE, norCounts=TRUE, color=c("red1", "blue1"), lwd=1.8) dev.off()
Figure S1
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") pdf("ps.pdf") plotDEXSeq(ecs, "FBgn0261552", legend=TRUE, norCounts=TRUE, color=c("red1", "blue1"), splicing=TRUE, lwd=1.8) dev.off()
Figure S2
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") ecsPois <- ecs fData(ecsPois)$dispersion <- 1e-10 ecsPois <- testForDEU( ecsPois, formula0 = count ~ sample + type * exon + condition, formula1 = count ~ sample + type * exon + condition * I(exon == exonID), nCores = 30 ) ecsPois <- estimatelog2FoldChanges( ecsPois, nCores=40 ) png( "MvAZero.png", width=1400, height=1400, pointsize=30 ) ylim <- 5 means <- rowMeans(counts(ecsPois,normalized=TRUE)) lfc <- fData(ecsPois)$`log2fold(untreated/treated)` cols <- ifelse( fData(ecsPois)$padj<.1, "#A00000A0", "#00000040" ) plot( means, lfc, col=cols, las=1, log="x", pch=20, cex=.4, ylim=c(-ylim,ylim), yaxs="i", yaxt="n", xlim = range( means[fData(ecsPois)$testable] ), xlab="averaged normalized count value", ylab="fold change in exon usage", cex.axis=1.4, cex.lab=1.4 ) axis( 2, -4:4, c( "1/16", "1/8", "1/4", "1/2", "1", "2", "4", "8", "16" ), las=1, cex.axis=1.4) segments( means[ lfc < -ylim ], -ylim*.97, means[ lfc < -ylim ], -ylim, col = cols[ lfc < -ylim ], lwd=3 ) segments( means[ lfc > ylim ], ylim*.97, means[ lfc > ylim ], ylim, col = cols[ lfc > ylim ], lwd=3 ) dev.off()
Figure S3
setwd(Sys.getenv("ANALYSIS")) setwd("brooksetal/") load("ecs-plots.RData") pdf("FBgn0017581.pdf") plotDEXSeq(ecs, "FBgn0017581", legend=TRUE, norCounts=TRUE, color=c("red1", "blue1"), splicing=TRUE, cex.axis=1, lwd=1.8) dev.off()
Figure S4
setwd(Sys.getenv("ANALYSIS")) setwd("brawardetal") load("br-vs-cb.RData") gene="ENSPTRG00000000042" fData(ecs)[geneIDs(ecs) %in% gene,]$transcripts <- c("PB1", "PB1", "PB1", "PB1", "PB1", "", "", "", "", "Y-kinase;S/T-kinase", "Y-kinase;S/T-kinase","Y-kinase;S/T-kinase","Y-kinase;S/T-kinase","Y-kinase;S/T-kinase","Y-kinase;S/T-kinase","Y-kinase;S/T-kinase;AGC-kinase", "AGC-kinase", "AGC-kinase", "AGC-kinase") fData(ecs)[geneIDs(ecs) %in% gene,]$transcript pdf("PRKCZ.pdf", width=8) plotDEXSeq(ecs, gene, displayTranscripts=TRUE, names=TRUE, legend=TRUE, color=c("red1", "blue1"), lwd=1.8) dev.off()
Figure S5
setwd(Sys.getenv("ANALYSIS")) setwd("encode") load("ecs-sub-plots.RData") head(fData(ecs)) ylim <- 5 means <- rowMeans(counts(ecs,normalized=TRUE)) lfc <- fData(ecs)$`log2fold(HuvecR/H1hescR)` cols <- ifelse( fData(ecs)$padj<.1, "#A0000010", "#00000010" ) png( "MvAencode.png", width=1400, height=1400, pointsize=30 ) plot( means, lfc, col=cols, las=1, log="x", pch=20, cex=.4, ylim=c(-ylim,ylim), yaxs="i", yaxt="n", xlim = range( means[fData(ecs)$testable] ), xlab="averaged normalized count value", ylab="fold change in exon usage", cex.axis=1.4, cex.lab=1.4 ) axis( 2, -4:4, c( "1/16", "1/8", "1/4", "1/2", "1", "2", "4", "8", "16" ), las=1, cex.axis=1.4) segments( means[ lfc < -ylim ], -ylim*.97, means[ lfc < -ylim ], -ylim, col = cols[ lfc < -ylim ], lwd=3 ) segments( means[ lfc > ylim ], ylim*.97, means[ lfc > ylim ], ylim, col = cols[ lfc > ylim ], lwd=3 ) dev.off()
Figure S6
setwd(Sys.getenv("ANALYSIS")) setwd("encode") load("ecs-sub-plots.RData") design(ecs) <- as.factor(gsub("R", "", design(ecs))) pdf("NFYA.pdf", width=8) plotDEXSeq(ecs, "ENSG00000001167", displayTranscripts=TRUE, names=FALSE, legend=TRUE, color=c("red1", "blue1"), lwd=1.8, norCounts=TRUE) dev.off()




