import gzip, os, sys
from IPython.display import HTML
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
import scipy as sp
import numpy as np
from collections import defaultdict
from scipy.signal import argrelmin, argrelmax
import math
import matplotlib.patches as patches

# import custom functions for displaying tables, bash commands
sys.path.append(os.path.abspath("/home/damian/"))
from dk_ipython import *

%matplotlib inline
HTML(addToggle())

Pre-processing: generation of coverage matrices
For specific questions: anish.dattani@zoo.ox.ac.uk
ChIP-seq libraries are available from the Aboobaker lab and the Sanchez lab. Both labs used Drosophila spike-in controls for downstream normalization. The Aboobaker lab libraries are paired-end 40 bp; the Sanchez lab libraries are single-end 50 bp. The general steps of this process are:
Trim reads for quality
Map reads to the combined asexual S.mediterranea and D.melanogaster genome
Filter out ambiguous reads (low mapping quality, discordant pairs, PCR duplicates)
Separate out S.med / D.mel reads
Convert to .bed file
Read Trimming
java -jar trimmomatic-0.36.jar PE -threads 32 -phred33 libA libB libA.paired libA.unpaired libB.paired libB.unpaired LEADING:20 TRAILING:20 MINLEN:30 ILLUMINACLIP:/data3/bin/Trimmomatic-0.36/adapters/adapters.fa:2:30:12:1:false TOPHRED33 2> lib.log
The S.mediterranea asexual genome (SmedAsxl_genome_v1.1.nt) was downloaded from the SmedGD database (http://smedgd.stowers.org/).
The D. melanogaster genome (r6.10) was downloaded from FlyBase. The species names (smed, dmel) were prepended to each FASTA header for easier species identification downstream. Both FASTA files were concatenated and indexed with BWA.
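The header-tagging step can be sketched as a small helper. This is a hypothetical illustration (the original renaming could equally have been a one-liner); `tag_fasta_headers` is not part of the pipeline's scripts:

```python
# Hypothetical helper: prepend a species tag to every FASTA header so that
# reads can be assigned to a species after mapping to the combined genome.
def tag_fasta_headers(lines, tag):
    """Yield FASTA lines with '<tag>_' prepended to each header name."""
    for line in lines:
        if line.startswith('>'):
            yield '>' + tag + '_' + line[1:].lstrip()
        else:
            yield line

smed = ['>scaffold1', 'ACGT']
dmel = ['>2L', 'TTAA']
combined = list(tag_fasta_headers(smed, 'smed')) + list(tag_fasta_headers(dmel, 'dmel'))
# headers in `combined` now read '>smed_scaffold1' and '>dmel_2L'
```

The lowercase tags match the 'dmel' check used by the filtering scripts below, which later strip the prefix again.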
The trimmed reads were mapped to this combined 'smed_dmel.fa' genome index with BWA. The BWA results were piped to samtools to filter out unmapped reads and reads with mapping quality below 10:
# for single-end reads
bwa mem -t 32 smed_dmel.fasta lib | samtools view -b -F 4 -q 10 - > lib.filtered.bam
# for paired-end reads, use a secondary filter script to filter by pairs.
# discard pairs where both reads have mapping quality less than 10.
# also discard pairs where reads map to separate reference contigs
import sys

pair = []
for line in sys.stdin:
    if line[0] == '@':
        print line.strip()
    else:
        meta = line.strip().split()
        pair.append(meta)
        if len(pair) == 2:
            if int(pair[0][4]) >= 10 or int(pair[1][4]) >= 10:
                if pair[0][6] == '=' and pair[1][6] == '=':
                    print '\t'.join(pair[0])
                    print '\t'.join(pair[1])
            pair = []
bwa mem -t 32 smed_dmel.fasta libA.paired libB.paired | samtools view -h -f 2 - | python pe_filterMQ.py | samtools view -b - > lib.filtered.bam
The mapping results were sorted with samtools:
samtools sort -m 500M -T lib.tmp -O bam -@ 32 lib.filtered.bam > lib.filtered.sorted.bam
java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES=100 INPUT=lib.filtered.sorted.bam OUTPUT=lib.filtered.sorted.markdup.bam METRICS_FILE=lib.filtered.sorted.log REMOVE_DUPLICATES=true
Reads that mapped to S.mediterranea were extracted from each BAM file. Only reads where the forward and reverse both map to the same reference contig were kept. The prepended species names are removed during this step:
import sys

for line in sys.stdin:
    if line[0] == '@':
        if line[:3] == '@SQ':
            meta = line.strip().split()
            ref = meta[1].split(':')[1]
            if ref.split('_')[0] != 'dmel':
                meta[1] = 'SN:' + ref.split('_')[1]
                print '\t'.join(meta)
    else:
        meta = line.strip().split('\t')
        if meta[2].split('_')[0] != 'dmel':
            if meta[6] == '=' or meta[6] == '*':
                meta[2] = meta[2].split('_')[1]
                print '\t'.join(meta)
samtools view -h lib.filtered.sorted.markdup.bam | python filterDmel.py | samtools view -b > lib.filtered.sorted.markdup.smed.bam
Sanchez's ChIP-seq libraries were single-end. Use MACS2's predictd function to predict fragment sizes:
macs2 predictd -i file.bam 2> log
Each BAM file was converted to BED format. The BED file reflects the DNA fragment that was sequenced, not the sequenced read. For single-end reads, an extension by the mean fragment size is therefore needed to obtain the end coordinate. For paired-end reads, the interval from the start of the forward read to the end of the reverse read is recorded in the BED file.
For single-end files, each read was extended in its 5'→3' direction by the fragment size estimated previously, and the resulting genomic interval reported in the BED file. Intervals are clipped to lie between 1 and the length of the reference contig in cases where they would extend beyond the contig or below 1. For paired-end files, the span from the start of the forward read to the end of the reverse read is designated the fragment interval. Fragment intervals shorter than the read length (meaning the reads are in reverse-forward rather than forward-reverse orientation) are discarded as ambiguous.
The following script was used to generate the BED file. It requires as input the BAM file, a size file giving the length of each reference contig, and the average fragment size.
To get the size file for each reference contig, you can simply run samtools faidx on the genome FASTA file, then run
cut -f 1,2 genome.fasta.fai > size.file
to generate the size file.
For single-end files, use the fragment size predicted previously. For paired-end files, use 0 as the average fragment size to skip extension. For paired-end data, this script can potentially use a lot of memory.
import sys
from collections import defaultdict

fragSize = int(sys.argv[1])
refLengths = dict(x.strip().split() for x in open(sys.argv[2]))

if fragSize == 0:
    # paired-end mode: pair up reads by name, report the fragment interval
    pairs = defaultdict(list)
    ref = ''
    for line in sys.stdin:
        meta = line.strip().split()
        if meta[2] != ref:
            ref = meta[2]
            pairs = defaultdict(list)
        pairs[meta[0]].append(meta)
        if len(pairs[meta[0]]) == 2:
            a = pairs[meta[0]][0]
            b = pairs[meta[0]][1]
            ref = a[2]
            refLength = int(refLengths[ref])
            aPos = int(a[3])
            aLength = len(a[9])
            aFlag = int(a[1])
            bPos = int(b[3])
            bLength = len(b[9])
            bFlag = int(b[1])
            if aFlag & 16:
                aPos += aLength
            else:
                bPos += bLength
            left = min(aPos, bPos)
            right = max(aPos, bPos)
            left = max(0, left)
            right = min(refLength, right)
            fragLength = right - left
            if fragLength > aLength and fragLength > bLength:
                print '\t'.join([ref, str(left), str(right), meta[0], str(fragLength)])
            del pairs[meta[0]]
else:
    # single-end mode: extend each read to fragSize in its 5'->3' direction
    for line in sys.stdin:
        meta = line.strip().split()
        flag = int(meta[1])
        ref = meta[2]
        pos = int(meta[3])
        length = len(meta[9])
        refLength = int(refLengths[ref])
        left = pos
        right = pos + fragSize
        if flag & 16:
            left = pos + length - fragSize
            right = pos + length
        left = max(left, 0)
        right = min(refLength, right)
        print '\t'.join([ref, str(left), str(right), meta[0]])
Run the above script with:
samtools view lib.bam | python bam2bed.py fragLength sizes.file > lib.bed
Using the fragment .bed file, we also generate a "mid" file storing the position of each fragment's center. Conceptually, the center of the fragment should be where the protein of interest is bound. In ChIP-seq experiments, we are really only concerned with point-based data of the binding position; extending the fragment from the center just smooths out the binding density data.
import sys

inFile = open(sys.argv[1], 'r')
for line in inFile:
    meta = line.strip().split('\t')
    ref, start, end = meta[:3]
    start = int(start)
    end = int(end)
    mid = start + ((end - start + 1) / 2)
    print ref + '\t' + str(mid)
Run the above script with:
python getMid.py lib.bed | sort -k 1,1 -k2,2n > lib.m
With this "mid" format, we can regenerate fragment BAM/BED-formatted data at whatever fragment length we want, using this script to generate a BED file with a user-defined fragment size:
import sys

inFile = open(sys.argv[1], 'r')
chrSize = dict([x.strip().split() for x in open(sys.argv[2], 'r').read().strip().split('\n')])
fragSize = int(sys.argv[3])
halfSize = fragSize / 2
count = 0
for line in inFile:
    ref, mid = line.strip().split()
    refLength = int(chrSize[ref])
    mid = int(mid)
    start = max(0, mid - halfSize)
    end = min(refLength, mid + halfSize)
    print '\t'.join(map(str, [ref, start, end, 'frag' + str(count)]))
    count += 1
Run the above script with:
python mid2fragment.py lib.m chromosomeSizes fragmentSize > lib.fragmentSize.bed
The BED file can then optionally be converted to BAM with bedtools' bedToBam command.
The .bam/.bed file can be used to generate a bedgraph track using bedtools genomecov.
samtools view -b lib.bam | genomeCoverageBed -ibam stdin -bga -g SmedAsxl.size.file > chipfile.smed.bdg
The resulting .bdg file can be converted to bigwig using UCSC's bedGraphToBigWig:
bedGraphToBigWig chipfile.smed.bdg SmedAsxl.size.file chipfile.smed.bw
DeepTools' computeMatrix command can be used on multiple bigwig files, together with the annotations, to extract coverage of regions around features. Below is an example producing a matrix describing coverage at 50 bp resolution for the input and ChIP samples with respect to the new genome annotation BED file.
computeMatrix reference-point -S input.bw chipfile.smed.bw -R ox_smed_asx1.1_1.0.tss.bed -a 2500 -b 2500 -bs 50 -out input.chip.matrix.gz -p max --quiet --missingDataAsZero &
We then proceed to calculate normalization factors (to input and between samples). Additionally, we subtract input coverage from the ChIP sample coverage to account for background signal. The methodology is covered in the following section.
There are two general types of normalization that can be performed on ChIP-seq data.
1. A normalization to input to remove read mappability biases or sequencing biases.
2. A normalization across samples for comparison of regions among samples.
There doesn't seem to be a standard way of incorporating both normalization methods into a single output track. The strategy for producing the final signal track in this analysis is to subtract the input track from the sample track after normalizing coverage across all samples and inputs.
Normalization to input transforms the sample track to account for biases from mapping, sequencing, and chemistry. To obtain a comparable input track, the sample and input need to be scaled correctly. Most ChIP-seq analyses scale sample to input by library size. The optimal way is to scale by comparing the background in both sample and input. Here we use the SES method described in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342857/ DeepTools' API was used to calculate the SES scaling factors. This requires the BAM files for the input and ChIP sample, and prints two scale factors.
from deeptools import SES_scaleFactor

result = SES_scaleFactor.estimateScaleFactor(['samplePath','inputPath'],500,200000,1,numberOfProcessors=32)
print str(result['size_factors_SES'][0]) + '\t' + str(result['size_factors_SES'][1])
One caveat about input samples: there is anecdotal evidence from other bioinformaticians suggesting input libraries might be biased to also include the IP signal. Since the input also goes through the cross-linking and sonication process, protein-bound fragments will tend not to be fragmented, potentially resulting in a signal at any protein-bound chromatin. The Sono-Seq paper also seems to suggest this:
Mapping accessible chromatin regions using Sono-Seq (http://www.pnas.org/content/106/35/14926.short)
If the signal from input samples is inclusive of the IP signal, then normalization to input will potentially bury real signal.
Theory behind chip-seq normalization using spike-in samples can be found in this paper: Bin Hu, et al. Biological chromodynamics: a general method for measuring protein occupancy across the genome by calibrating ChIP-seq. Nucl. Acids Res. (2015) doi: 10.1093/nar/gkv670. http://nar.oxfordjournals.org/content/early/2015/06/30/nar.gkv670.full
Cells for each sample were FACS-sorted for the X1 population and spiked with Drosophila S2 cells. Alternatively, following X1 chromatin preparation and quantification by Qubit, we added 3% of a commercially available Drosophila spike-in (Active Motif).
Each sequenced sample contains reads representing occupancy of histone marker on the S.med and D.mel genome:
Reads smed = # reads for S.med
Reads dmel = # reads for D.mel
For S.med and D.mel, the number of reads is a function of the number of input cells, the histone occupancy rate, and technical residuals.
N = number of cells
O = occupancy rate
R = residuals
Reads = N * O * R
Reads smed = N smed * O smed * R smed
Reads dmel = N dmel * O dmel * R dmel
Occupancy is defined as the probability that a position on the genome is bound by the protein of interest. Occupancy rate is the average across all the positions on the genome.
For each sample, the proportion of S.med reads to D.mel reads can be written as:
Reads smed / Reads dmel = (N smed * O smed * R smed) / (N dmel * O dmel * R dmel)
What we are interested in is the histone occupancy rate of S.med (O smed). The above equation can be rearranged to solve for O smed:
O smed = (Reads smed * N dmel * O dmel * R dmel) / (Reads dmel * N smed * R smed)
The histone occupancy rate of D.mel (O dmel) is a constant since we are spiking in the same S2 cells in all samples. The residuals (R) include variances from sonication, IP, library prep, sequencing. Residuals do NOT include histone occupancy, which can be different among S.med samples.
While the residuals for each sample may be different, the ratio of S.med residuals (R smed) to D.mel residuals (R dmel) can be assumed to be constant among all samples undergoing the same immunoprecipitation. Our four samples (GFP01, GFP02, LPT01, LPT02), under the same immunoprecipitation, should have a constant O dmel and a constant R dmel / R smed.
O dmel * R dmel / R smed = constant = Alpha
Substituting for Alpha in previous equation:
O smed = ((Reads smed * N dmel) / (Reads dmel * N smed)) * Alpha
Designate ((Reads smed * N dmel) / (Reads dmel * N smed)) as the Scale Factor:
Scale Factor = ((Reads smed * N dmel) / (Reads dmel * N smed))
O smed = Scale Factor * Alpha
The number of S.med and D.mel cells (N smed and N dmel) mixed in each sample can be obtained by the initial measurement of cell concentration after FACS and spike-in aliquot. Alternatively, we can estimate N smed and N dmel using S.med / D.mel read proportions of our input samples.
According to the paper, this scaling factor is then applied to the density distribution of the coverage (i.e. reads per million). Conversion of raw read coverage to a density distribution is done by dividing each coverage value per base by the total number of reads, then multiplying by one million. The multiplication by a million makes the numbers larger and easier to work with.
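The RPM conversion above is a one-line transformation; a minimal sketch (the function name `to_rpm` is ours, not part of the pipeline):

```python
# Sketch: convert raw per-base coverage to reads-per-million (RPM).
def to_rpm(coverage, total_reads):
    """Scale each per-base coverage value by 1e6 / total mapped reads."""
    scale = 1e6 / total_reads
    return [c * scale for c in coverage]

to_rpm([2, 4, 0], 2000000)  # -> [1.0, 2.0, 0.0]
```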
Putting the RPM conversion and the scaling factor together, we are essentially multiplying each coverage value per base by:
(1e6 / Reads smed) * ((Reads smed * N dmel) / (Reads dmel * N smed))
We can drop the constant 1e6 and cancel Reads smed. We are left with:
Scaling factor = N dmel / (Reads dmel * N smed)
This scaling factor can then be used on the raw read coverage (bedgraph) to normalize samples.
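With hypothetical read and cell counts, the simplified scaling factor can be computed per sample and normalized so that the largest factor is 1.0, matching the convention in the tables below. All numbers and the helper name `spikein_scale_factors` are made up for illustration:

```python
# Sketch with made-up numbers: spike-in scaling factor per sample,
# normalized so the largest factor is 1.0.
def spikein_scale_factors(samples):
    """samples: name -> (reads_dmel, n_smed, n_dmel).
    Scaling factor = N_dmel / (Reads_dmel * N_smed)."""
    raw = dict((k, float(nd) / (rd * ns)) for k, (rd, ns, nd) in samples.items())
    top = max(raw.values())
    return dict((k, v / top) for k, v in raw.items())

factors = spikein_scale_factors({
    'sampleA': (1000000, 5000000, 200000),  # hypothetical counts
    'sampleB': (2000000, 5000000, 200000),
})
# sampleA gets factor 1.0; sampleB, with twice the D.mel reads, gets 0.5
```

If N smed and N dmel are estimated from input read proportions rather than cell counts, only the tuple values change; the formula is the same.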
We can calculate a scaling factor for each sample and its respective input using the SES method. In addition, we also calculate a scaling factor among samples using the spike-in method. To consolidate these two scaling factors, we simply scale all samples according to the spike-in normalization, and then scale each respective input according to its SES factor with the sample set to 1.0.
For example:
SES scaling factor:
h3k4me3 gfp01 : input gfp01 = 1 : 0.6
h3k4me3 gfp02 : input gfp02 = 1 : 0.4
h3k4me3 lpt01 : input rnai01 = 1 : 0.7
h3k4me3 lpt02 : input rnai02 = 0.8 : 1 or 1 : 1.25
Spike-in scaling factor:
h3k4me3 gfp01 1
h3k4me3 gfp02 0.6
h3k4me3 rnai01 0.9
h3k4me3 rnai02 0.7
Final scaling factor:
h3k4me3 gfp01 1
h3k4me3 gfp02 0.6
h3k4me3 rnai01 0.9
h3k4me3 rnai02 0.7
input gfp01 0.6 (1 * 0.6)
input gfp02 0.24 (0.6 * 0.4)
input rnai01 0.63 (0.9 * 0.7)
input rnai02 0.875 (1.25 * 0.7)
This final scaling will produce coverage tracks that are comparable among the same IPs.
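The consolidation in the worked example above can be sketched in a few lines (the dictionaries mirror the example tables; numbers are the illustrative ones, not real factors):

```python
# Sketch: consolidate spike-in (between-sample) and SES (sample-vs-input)
# factors into one final factor per track, as in the example above.
spikein = {'gfp01': 1.0, 'gfp02': 0.6, 'rnai01': 0.9, 'rnai02': 0.7}
ses_input = {'gfp01': 0.6, 'gfp02': 0.4, 'rnai01': 0.7, 'rnai02': 1.25}

final = {}
for name in spikein:
    final['h3k4me3_' + name] = spikein[name]                  # sample: spike-in factor only
    final['input_' + name] = spikein[name] * ses_input[name]  # input: spike-in * SES
# e.g. input_gfp02 = 0.6 * 0.4 ~ 0.24, input_rnai02 = 0.7 * 1.25 ~ 0.875
```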
Subtracting input background signal from sample signal in the DeepTools matrix, taking into account the scale factors calculated in the previous section.
The following python script was utilized for matrix subtraction:
import gzip
import numpy as np

inFile = gzip.open('input.chip.matrix.gz')
inpNorm = 1.0  # input scale factor (calculated in previous section)
samNorm = 1.0  # sample scale factor (calculated in previous section)
inFile.next()  # skip header line
for line in inFile:
    cols = line.strip().replace('nan', '0').split('\t')
    data = map(float, cols[6:])
    inputVals = [x * inpNorm for x in data[:100]]
    ChIPVals = [x * samNorm for x in data[100:]]
    norm = [max(0, ChIPVals[i] - inputVals[i]) for i in range(len(inputVals))]
    print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str, norm))
DeepTools was used to generate a data matrix of profiles based on annotated loci. This matrix file was parsed into a data file consisting of coverage for each sample/IP type. This data file represents coverage in 50 bp bins across a 5 kb region around each locus (2.5 kb up- and downstream of each locus start position).
Below we document the plots made in this paper, utilizing the DeepTools-generated matrix for each ChIP.
ChIP-seq library Sample SF Input SF
aa_x1_h3k4me3_gfp_01 0.88218894621 0.600914311794
aa_x1_h3k4me3_gfp_02 0.95966677872 0.674130090697
aa_x1_h3k4me3_gfp_03 0.552394968171 0.558729702141
aa_x1_h3k4me3_lpt_01 0.915808259834 0.381598670565
aa_x1_h3k4me3_lpt_02 1.0 0.613979197163
aa_x1_h3k4me3_lpt_03 0.452963962102 0.445215653049
aa_x1_h3k4me1_gfp_01 0.96459032622 0.951071177767
aa_x1_h3k4me1_gfp_02 0.929322002571 1.08916554374
aa_x1_h3k4me1_gfp_03 0.353849481684 0.538814176571
aa_x1_h3k4me1_lpt_01 1.0 0.630755202802
aa_x1_h3k4me1_lpt_02 0.7515479874 0.890616388552
aa_x1_h3k4me1_lpt_03 0.385392012257 0.565965518655
aa_x1_h3k27me3_gfp_01 0.670212238126 0.474703809016
aa_x1_h3k27me3_gfp_02 1.0 0.614514716306
aa_x1_h3k27me3_gfp_03 0.348596453072 0.281044204574
aa_x1_h3k27me3_lpt_01 0.720752482921 0.603051423894
aa_x1_h3k27me3_lpt_02 0.733248317316 0.488233050969
aa_x1_h3k27me3_lpt_03 0.397413123449 0.262846984653
aa_x1_k36_01 0.303077006 1.0
aa_x1_k36_02 0.056651272 0.13916828
aa_x1_RNA_Pol_Ser5P_01 6.30521E-05 0.000401732
aa_x1_RNA_Pol_Ser5P_02 0.338095714 1.0
aa_x1_RNA_Pol_Ser5P_03 0.129391907 0.399284173
prefix = '/home/share/projects/smed_lpt/'

# load coverage data for H3K4me3, H3K4me1, and H3K27me3 for AAA GFP and LPT RNAi samples.
# NB. LPT RNAi samples pertain to Mihaylova et al (2018) and the same GFP data was used
# for the analyses in this paper

def smooth(x, window_len=5, window='hanning'):
    if window_len < 3:
        return x
    if not window in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError, "Window is one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'"
    s = np.r_[x[window_len-1:0:-1], x, x[-1:-window_len:-1]]
    if window == 'flat':
        w = np.ones(window_len, 'd')
    else:
        w = eval('np.' + window + '(window_len)')
    y = np.convolve(w/w.sum(), s, mode='valid')
    return y

def getFC(control, treatment, maxThresh):
    if control == 0 and treatment == 0:
        return 0
    elif control == 0 and treatment > 0:
        return maxThresh
    elif control > 0 and treatment == 0:
        return -maxThresh
    else:
        return min(maxThresh, max(-maxThresh, math.log(treatment/control, 2)))

data_gfp = defaultdict(lambda: defaultdict())
data_lpt = defaultdict(lambda: defaultdict())
data_delta = defaultdict(lambda: defaultdict())
data_count = defaultdict(lambda: defaultdict())
data_fc = defaultdict(lambda: defaultdict())
df_data = []

# Following coverage data, append to dataframe max coverage for each locus,
# log-fold change between GFP and LPT RNAi, max, and min count coverage values
dataFile = open(prefix + 'coverage/aa.data')
lids = []
for line in dataFile:
    lid, mark, gfp, lpt, delta, count = line.strip().split()
    lids.append(lid)
    gfp = smooth(map(float, gfp.split(',')), 10)
    lpt = smooth(map(float, lpt.split(',')), 10)
    delta = lpt - gfp
    signal_max = max(gfp.max(), lpt.max())
    gfp_norm = [0.0]*109
    lpt_norm = [0.0]*109
    delta_norm = [0.0]*109
    fc = [0.0]*109
    if signal_max > 0:
        gfp_norm = np.array([x/signal_max for x in gfp])
        lpt_norm = np.array([x/signal_max for x in lpt])
        delta_norm = lpt_norm - gfp_norm
        fc = [getFC(gfp_norm[i], lpt_norm[i], 5) if gfp_norm[i] != lpt_norm[i] \
              else 0 for i in range(len(gfp_norm))]
    count = map(float, count.split(','))
    count_max = max(count)
    count_min = min(count)
    count_sum = sum(count)
    count_mean = count_sum/float(len(count))
    df_data.append([mark, gfp, lpt, delta, gfp_norm, lpt_norm, delta_norm, fc,
                    count_max, count_min, count_sum, count_mean])
dataFile.close()

df_data = pd.DataFrame(df_data,
                       columns=['mark','gfp','lpt','delta','gfp_norm',
                                'lpt_norm','delta_norm','fc','count_max','count_min',
                                'count_sum','count_mean'], index=lids)

# load blast annotations
blastFile = open(prefix + 'genome/ox_smed_asx1.1_1.0.genes.blast.tab')
df_blast = []
for line in blastFile:
    meta = line.strip().split('\t')
    df_blast.append(meta[:3] + [meta[-2]])
df_blast = pd.DataFrame(df_blast, columns=['lid','descr','spe','eval'])
blastFile.close()

blastFile = open(prefix + 'genome/ox_smed_asx1.1_1.0.transdecoder.protein.top10.blast')
df_coding_blast = []
for line in blastFile:
    meta = line.strip().split('\t')
    df_coding_blast.append(meta[:3] + [meta[-2]])
df_coding_blast = pd.DataFrame(df_coding_blast, columns=['lid','descr','spe','eval'])
blastFile.close()

# load categories
all_lids = set(open(prefix + 'figure/all.lids').read().strip().split('\n'))
coding_lids = set(open(prefix + 'figure/coding.lids').read().strip().split('\n'))
tf_lids = set(open(prefix + 'figure/tf.lids').read().strip().split('\n'))
lncrna_lids = set(open(prefix + 'figure/lncrna.lids').read().strip().split('\n'))

propFile = open(prefix + 'figure/facs_proportions.tab')
lids = []
df_props = []
for line in propFile:
    meta = line.strip().split()
    lids.append(meta[0])
    df_props.append(map(float, meta[1:]))
df_props = pd.DataFrame(df_props, index=lids, columns=['x1','x2','xins'])
df_props['x1_2'] = df_props['x1'] + df_props['x2']
df_props['x2_ins'] = df_props['x2'] + df_props['xins']
propFile.close()

x1_lids = set(df_props[df_props['x1'] >= 50.0].index)
x2_lids = set(df_props[df_props['x2'] >= 50.0].index)
xins_lids = set(df_props[df_props['xins'] >= 50.0].index)
x1_2_lids = set(df_props[df_props[['x1','x2']].sum(axis=1) >= 75].index) - x1_lids - x2_lids
x2_ins_lids = set(df_props[df_props[['x2','xins']].sum(axis=1) >= 75].index) - x2_lids - xins_lids
global_lids = set(df_props.index) - x1_lids - x2_lids - xins_lids - x1_2_lids - x2_ins_lids
unclassified_lids = all_lids - x1_lids - x2_lids - xins_lids - x1_2_lids - x2_ins_lids - global_lids

# define enriched loci categories
lids_categories = {}
lids_categories['x1'] = x1_lids
lids_categories['x2'] = x2_lids
lids_categories['xins'] = xins_lids
lids_categories['x1_2'] = x1_2_lids
lids_categories['x2_ins'] = x2_ins_lids
lids_categories['x0'] = global_lids

def getCellCategory(lid):
    if lid in x1_lids:
        return 'x1'
    elif lid in x2_lids:
        return 'x2'
    elif lid in xins_lids:
        return 'xins'
    elif lid in x1_2_lids:
        return 'x12'
    elif lid in x2_ins_lids:
        return 'x2ins'
    elif lid in global_lids:
        return 'global'
    else:
        return 'unclassified'

def plotProfile(mark, gids, signal, count_thresh, c, ax):
    window = int(math.ceil(5000.0/109))
    xticks = range(-2500, 2500, window)
    sns.set_style('darkgrid')
    sns.tsplot(list(df_data[(df_data['mark'] == mark) &
                            (df_data['count_max'] >= count_thresh)].ix[gids].dropna()[signal]),
               ax=ax, color=c, time=xticks)

# Load K36 coverage
chipFile = open('/home/anish/k36.rep12.matrix.gz')
x1_k36 = {}
for line in chipFile:
    try:
        cols = line.strip().split('\t')
        vals = smooth(map(float, cols[6:]), 10)
        x1_k36[cols[3]] = vals
    except KeyError:
        pass
chipFile.close()

# Produce K36 dataframe with coverage matrix for each locus, FACS proportion for
# said locus, as well as 'max' coverage value
a = pd.DataFrame(x1_k36)
L = [(k, v) for k, v in x1_k36.items()]
a = pd.DataFrame(L, columns=['', 'Values'])
u = a.set_index('')
k36_df_props = u.join(df_props)
b = [k for k in all_lids]
df_b = pd.DataFrame(b)
ids = df_b.set_index(0)
a['max'] = pd.DataFrame(k36_df_props['Values'].values.tolist()).max(axis=1)
u = a.set_index('')
k36_df = u.join(df_props)
k36_df_props_max = a.join(df_props)
u = a.set_index('')
k36_df_props_max = u.join(df_props)

# Load Pol II Ser5P merged replicate matrix
chipFile = open('/home/anish/RNAPolII.matrix.gz')
x1_ps5 = {}
for line in chipFile:
    try:
        cols = line.strip().split('\t')
        vals = smooth(map(float, cols[6:]), 10)
        x1_ps5[cols[3]] = vals
    except KeyError:
        pass
chipFile.close()

# Produce Ser5P dataframe with coverage matrix for each locus, FACS proportion for
# said locus, as well as 'max' coverage value
a_ps5 = pd.DataFrame(x1_ps5)
L_ps5 = [(k, v) for k, v in x1_ps5.items()]
a_ps5 = pd.DataFrame(L_ps5, columns=['', 'Values'])
u_ps5 = a_ps5.set_index('')
ps5_df_props = u_ps5.join(df_props)
df_b = pd.DataFrame(b)
ids = df_b.set_index(0)
a_ps5['max'] = pd.DataFrame(ps5_df_props['Values'].values.tolist()).max(axis=1)
u_ps5 = a_ps5.set_index('')
ps5_df = u_ps5.join(df_props)
ps5_df_a = a_ps5.join(df_props)
u_ps5 = a_ps5.set_index('')
ps5_df_a = u_ps5.join(df_props)

def maxRegion(x):
    # -500 to +800 around tss
    return max(x[44:73])

h3k4me3_gfp_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k4me3']['gfp'].apply(np.max))
h3k4me1_gfp_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k4me1']['gfp'].apply(np.max))
h3k27me3_gfp_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k27me3']['gfp'].apply(np.max))
h3k4me3_lpt_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k4me3']['lpt'].apply(np.max))
h3k4me1_lpt_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k4me1']['lpt'].apply(np.max))
h3k27me3_lpt_regionMag = pd.Series(df_data[df_data['mark'] == 'h3k27me3']['lpt'].apply(np.max))

fig, ax = plt.subplots(5, figsize=[10,5])
h3k4me3_gfp_regionMag[h3k4me3_gfp_regionMag < 100].hist(bins=100, ax=ax[0], alpha=1)
#h3k4me3_gfp_regionMag[h3k4me3_lpt_regionMag < 100].hist(bins=100, ax=ax[0], alpha=0.5)
h3k4me1_gfp_regionMag[h3k4me1_gfp_regionMag < 50].hist(bins=100, ax=ax[1], alpha=1)
#h3k4me1_gfp_regionMag[h3k4me1_lpt_regionMag < 50].hist(bins=100, ax=ax[1], alpha=0.5)
h3k27me3_gfp_regionMag[h3k27me3_gfp_regionMag < 50].hist(bins=100, ax=ax[2], alpha=1)
#h3k27me3_gfp_regionMag[h3k27me3_lpt_regionMag < 50].hist(bins=100, ax=ax[2], alpha=0.5)
df36 = k36_df_props_max[['max']]
df36[df36 < 50].hist(bins=100, ax=ax[3], alpha=1)
dfps5 = ps5_df_a[['max']]
dfps5[dfps5 < 50].hist(bins=100, ax=ax[4], alpha=1)
ax[0].set_title('h3k4me3', fontsize=15)
ax[0].set_xlim(0, 100)
ax[1].set_title('h3k4me1', fontsize=15)
ax[1].set_xlim(0, 50)
ax[2].set_title('h3k27me3', fontsize=15)
ax[2].set_xlim(0, 50)
ax[3].set_title('h3k36me3', fontsize=15)
ax[3].set_xlim(0, 50)
ax[4].set_title('Pol II Ser5p', fontsize=15)
ax[4].set_xlim(0, 50)
fig.tight_layout()
fig.suptitle('Distribution of Magnitudes', y=1.05, fontsize=18)
fig.tight_layout()

# Find RNA Pol II (Ser5P) outliers and remove
ps5_outlier = dfps5[dfps5['max'] > 10].index.tolist()
df = pd.DataFrame(x1_ps5)
x1_ps5df = pd.DataFrame(ps5_df_props)
ps5_df_a.index.name = 'ID'
k36_outlier_new = pd.DataFrame(ps5_outlier)
k36_outlier_new2 = k36_outlier_new.set_index(0)
ps5removeoutlier = ps5_df_props[~ps5_df_props.index.isin(k36_outlier_new.index)]
ps5removeoutlier.index.name = 'ID'
ps5_outlier_removal = u_ps5.loc[~u_ps5.index.isin(ps5_outlier)]
k36removeoutlier2 = k36_df_props[k36_df_props.index.isin(ids.index)]

# Remove common outliers for 5 ChIP-seq marks
h3k4me3_magOutlier = set(h3k4me3_gfp_regionMag[h3k4me3_gfp_regionMag > 80].index) \
    | set(h3k4me3_lpt_regionMag[h3k4me3_lpt_regionMag > 80].index)
h3k4me1_magOutlier = set(h3k4me1_gfp_regionMag[h3k4me1_gfp_regionMag > 20].index) \
    | set(h3k4me1_lpt_regionMag[h3k4me1_lpt_regionMag > 20].index)
h3k27me3_magOutlier = set(h3k27me3_gfp_regionMag[h3k27me3_gfp_regionMag > 10].index) \
    | set(h3k27me3_lpt_regionMag[h3k27me3_lpt_regionMag > 10].index)
k36_outlier = set(df36[df36['max'] > 10].index.tolist())
ps5_outlier = set(dfps5[dfps5['max'] > 10].index.tolist())
mark_outlier = h3k4me3_magOutlier | h3k4me1_magOutlier | h3k27me3_magOutlier | k36_outlier | ps5_outlier
all_lids = set(df_data.index) - mark_outlier

# Following outlier removal, define max for h3k4me3, h3k4me1, and h3k27me3.
# Turn each locus' coverage matrix into percentage.
h3k4me3_max = max(df_data[df_data['mark'] == 'h3k4me3'].ix[all_lids]['gfp'].apply(np.max).max(),
                  df_data[df_data['mark'] == 'h3k4me3'].ix[all_lids]['lpt'].apply(np.max).max())
h3k4me1_max = max(df_data[df_data['mark'] == 'h3k4me1'].ix[all_lids]['gfp'].apply(np.max).max(),
                  df_data[df_data['mark'] == 'h3k4me1'].ix[all_lids]['lpt'].apply(np.max).max())
h3k27me3_max = max(df_data[df_data['mark'] == 'h3k27me3'].ix[all_lids]['gfp'].apply(np.max).max(),
                   df_data[df_data['mark'] == 'h3k27me3'].ix[all_lids]['lpt'].apply(np.max).max())

def h3k4me3_div(x):
    return [y/h3k4me3_max*100 for y in x]
def h3k4me1_div(x):
    return [y/h3k4me1_max*100 for y in x]
def h3k27me3_div(x):
    return [y/h3k27me3_max*100 for y in x]

h3k4me3_gfp = df_data[df_data['mark'] == 'h3k4me3'].ix[all_lids]['gfp'].apply(h3k4me3_div)
h3k4me1_gfp = df_data[df_data['mark'] == 'h3k4me1'].ix[all_lids]['gfp'].apply(h3k4me1_div)
h3k27me3_gfp = df_data[df_data['mark'] == 'h3k27me3'].ix[all_lids]['gfp'].apply(h3k27me3_div)
df_data['h3k4me3_perCov'] = df_data[df_data['mark'] == 'h3k4me3'].ix[all_lids]['gfp'].apply(h3k4me3_div)
df_data['h3k4me1_perCov'] = df_data[df_data['mark'] == 'h3k4me1'].ix[all_lids]['gfp'].apply(h3k4me1_div)
df_data['h3k27me3_perCov'] = df_data[df_data['mark'] == 'h3k27me3'].ix[all_lids]['gfp'].apply(h3k27me3_div)

# Following outlier removal, define max for h3k36me3
df = pd.DataFrame(x1_k36)
k36df = pd.DataFrame(k36_df_props)
k36df.index.name = 'ID'
k36_outlier_new = pd.DataFrame(list(k36_outlier))
k36_outlier_new2 = k36_outlier_new.set_index(0)
k36removeoutlier = k36_df_props[~k36_df_props.index.isin(k36_outlier_new.index)]
k36_outlier_new.index.name = 'ID'
k36_outlier_removal = u.loc[~u.index.isin(k36_outlier)]
K36_outlierremove_array = k36_outlier_removal.Values
k36_df = pd.DataFrame(k36_outlier_removal['max'])
k36_df[k36_df['max'] == k36_df['max'].max()]

# Convert h3k36me3 coverage to percentage
k36_gfp = K36_outlierremove_array.apply(lambda x: [(y/9.99854)*100 for y in x])

# define max for RNA Pol II Ser5P
ps5_df = pd.DataFrame(ps5_outlier_removal['max'])
ps5_df[ps5_df['max'] == ps5_df['max'].max()]

# Convert RNA Pol II Ser5P coverage to percentage
ps5_perc = ps5_outlier_removal.Values.apply(lambda x: [(y/9.97421)*100 for y in x])

def getPropCorr2(pop):
    df2 = k36_df_props_max[(k36_df_props_max['max'] >= 3)]
    l2 = len(df2['Values'][0])
    c2 = [ss.spearmanr([x2[i] for x2 in df2['Values']], df2[pop]) for i in range(l2)]
    return c2

def getPropCorr(mark, pop, countThresh=10):
    df = pd.DataFrame(df_data[(df_data['mark'] == mark) &
                              (df_data['count_max'] >= countThresh)]['gfp']).join(df_props, how='inner')
    l = len(df['gfp'][0])
    c = [ss.spearmanr([x[i] for x in df['gfp']], df[pop]) for i in range(l)]
    return c

window = int(math.ceil(5000.0/109))
idx = range(-2500, 2500, window)
fig, ax = plt.subplots(4, figsize=[10,8])
colors = ['#1C637A', '#81CCE6', '#FA8D20', '#4AB545']
pd.Series([x2[0] for x2 in getPropCorr2('x1')], index=idx).plot(ax=ax[0], color=colors[0])
pd.Series([x2[0] for x2 in getPropCorr2('x2')], index=idx).plot(ax=ax[0], color=colors[1])
pd.Series([x2[0] for x2 in getPropCorr2('xins')], index=idx).plot(ax=ax[0], color=colors[2])
pd.Series([x[0] for x in getPropCorr('h3k4me3', 'x1', 3)], index=idx).plot(ax=ax[1], color=colors[0])
pd.Series([x[0] for x in getPropCorr('h3k4me3', 'x2', 3)], index=idx).plot(ax=ax[1], color=colors[1])
pd.Series([x[0] for x in getPropCorr('h3k4me3', 'xins', 3)], index=idx).plot(ax=ax[1], color=colors[2])
pd.Series([x[0] for x in getPropCorr('h3k4me1', 'x1', 3)], index=idx).plot(ax=ax[2], color=colors[0])
pd.Series([x[0] for x in getPropCorr('h3k4me1', 'x2', 3)], index=idx).plot(ax=ax[2], color=colors[1])
pd.Series([x[0] for x in getPropCorr('h3k4me1', 'xins', 3)], index=idx).plot(ax=ax[2], color=colors[2])
pd.Series([x[0] for x in getPropCorr('h3k27me3', 'x1', 3)], index=idx).plot(ax=ax[3], color=colors[0])
pd.Series([x[0] for x in getPropCorr('h3k27me3', 'x2', 3)], index=idx).plot(ax=ax[3], color=colors[1])
pd.Series([x[0] for x in getPropCorr('h3k27me3', 'xins', 3)], index=idx).plot(ax=ax[3], color=colors[2])
ax[0].set_ylabel('Spearman\ncorrelation\nH3K36me3')
ax[1].set_ylabel('Spearman\ncorrelation\nH3K4me3')
ax[2].set_ylabel('Spearman\ncorrelation\nH3K4me1')
ax[3].set_ylabel('Spearman\ncorrelation\nH3K27me3')
fig.tight_layout()

# list of X1, X2, Xins minus common outliers
x1_minus_outlier = (x1_lids - mark_outlier)
x2_minus_outlier = (x2_lids - mark_outlier)
xins_minus_outlier = (xins_lids - mark_outlier)

# k4me3, k4me1, k27me3 overall gene profiles, for all genes following the removal of outliers
def plotProfile(mark, gids, signal, count_thresh, c, ax):
    window = int(math.ceil(5000.0/109))
    xticks = range(-2500, 2500, window)
    sns.tsplot(list(df_data[(df_data['mark'] == mark) &
                            (df_data['count_max'] >= count_thresh)].ix[gids].dropna()[signal]),
               ax=ax, color=c, time=xticks)

def plotProfileGroups(groups, leg, signal='_perCov', colors=['#1C637A','#81CCE6','#FA8D20','#4AB545']):
    fig, ax = plt.subplots(3, figsize=[10,8])
    for i, group in enumerate(groups):
        plotProfile('h3k4me3', group, 'gfp' + signal, 0, colors[i], ax[0])
    for i, group in enumerate(groups):
        plotProfile('h3k27me3', group, 'gfp' + signal, 0, colors[i], ax[1])
    for i, group in enumerate(groups):
        plotProfile('h3k4me1', group, 'gfp' + signal, 0, colors[i], ax[2])
    ax[0].set_ylabel('H3K4me3', fontsize=20)
    ax[1].set_ylabel('H3K27me3', fontsize=20)
    ax[2].set_ylabel('H3K4me1', fontsize=20)
    ax[0].legend(leg)
    ax[1].legend(leg)
    ax[2].legend(leg)

plotProfileGroups([x1_lids - mark_outlier,
                   x2_lids - mark_outlier,
                   xins_lids - mark_outlier],
                  ['X1', 'X2', 'Xins'], signal='')

# k36me3 overall gene profiles, for all genes following the removal of outliers
x1_lids_n1 = []
for k in x1_minus_outlier:
    try:
        x1_lids_n1.append(x1_k36[k])
    except KeyError:
        pass
x2_lids_n1 = []
for k in x2_minus_outlier:
    try:
        x2_lids_n1.append(x1_k36[k])
    except KeyError:
        pass
xins_lids_n1 = []
for k in xins_minus_outlier:
    try:
        xins_lids_n1.append(x1_k36[k])
    except KeyError:
        pass
window = int(math.ceil(5000.0/109))
xticks = range(-2500, 2500, window)
sns.tsplot((x1_lids_n1), time=xticks, color="#1C637A")
sns.tsplot((x2_lids_n1), time=xticks, color="#81CCE6")
sns.tsplot((xins_lids_n1), time=xticks, color="#FA8D20")
plt.legend(['X1', 'X2', 'Xins'])
plt.title('H3K36me3', fontsize=20)

# RNA Pol II Ser5p overall gene profiles, for all genes following the removal of outliers
x1_lids_n2 = []
for k in x1_minus_outlier:
    try:
        x1_lids_n2.append(x1_ps5[k])
    except KeyError:
        pass
x2_lids_n2 = []
for k in x2_minus_outlier:
    try:
        x2_lids_n2.append(x1_ps5[k])
    except KeyError:
        pass
xins_lids_n2 = []
for k in xins_minus_outlier:
    try:
        xins_lids_n2.append(x1_ps5[k])
    except KeyError:
        pass
window = int(math.ceil(5000.0/109))
xticks = range(-2500, 2500, window)
sns.tsplot((x1_lids_n2), time=xticks, color="#1C637A")
sns.tsplot((x2_lids_n2), time=xticks, color="#81CCE6")
plt.legend(['X1', 'X2', 'Xins'])
plt.title('Ser5p', fontsize=20)

# Produce DFs with appended FACS categorizations for each mark following outlier removal.
# The IDs per ChIP X2 plot given common outlier removal
ak4me1 = pd.DataFrame(h3k4me1_gfp)
k4me1_minusoutlier = ak4me1.join(df_props)
ak4me3 = pd.DataFrame(h3k4me3_gfp)
k4me3_minusoutlier = ak4me3.join(df_props)
ak27me3 = pd.DataFrame(h3k27me3_gfp)
k27me3_minusoutlier = ak27me3.join(df_props)

# h3k4me3
x_k4me3 = k4me3_minusoutlier[k4me3_minusoutlier['x2'] > 50]['x2']
x_k4me3 = x_k4me3.sort_values(ascending=False)
fig, ax = plt.subplots(figsize=[10,5])
c = sns.color_palette("Set1", n_colors=9, desat=.5)
for n, i in enumerate(range(0, len(x_k4me3), 1500)):
    gids = x_k4me3[i:i+1500].index
    plotProfile('h3k4me3', gids, 'gfp', 0, c[n], ax)
plt.legend(['1-1500','1500-3000','3000-4500','4500-6000','6000-7500','7500-8161'])
sns.set_style('darkgrid')

# h3k36me3
window = int(math.ceil(5000.0/109))
xticks = range(-2500, 2500, window)
k36_new = k4me3_minusoutlier[k4me3_minusoutlier['x2'] > 50]['x2']
k36_new = k36_new.sort_values(ascending=False)
fig, ax = plt.subplots(figsize=[10,5])
c = sns.color_palette("Set1", n_colors=9, desat=.5)
sns.set_style('darkgrid')
for n, i in enumerate(range(0, len(k36_new), 1500)):
    gids = k36_new[i:i+1500].index
    sns.tsplot([K36_outlierremove_array[k] for k in gids], time=xticks, color=c[n])
sns.set_style('darkgrid')
plt.legend(['1-1500','1500-3000','3000-4500','4500-6000','6000-7500','7500-8161'])

# h3k4me1
k4me1 = k4me1_minusoutlier[k4me1_minusoutlier['x2'] > 50]['x2']
k4me1 = k4me1.sort_values(ascending=False)
fig, ax = plt.subplots(figsize=[10,5])
c = sns.color_palette("Set1", n_colors=9, desat=.5)
for n, i in enumerate(range(0, len(k4me1), 1500)):
    gids = k4me1[i:i+1500].index - mark_outlier
    plotProfile('h3k4me1', gids, 'gfp', 0, c[n], ax)
plt.legend(['1-1500','1500-3000','3000-4500','4500-6000','6000-7500','7500-8161'])

# h3k27me3
x_k27me3 = k27me3_minusoutlier[k27me3_minusoutlier['x2'] > 50]['x2']
x_k27me3 = x_k27me3.sort_values(ascending=False)
fig, ax = plt.subplots(figsize=[10,5])
c = sns.color_palette("Set1", n_colors=9, desat=.5)
for n, i in enumerate(range(0, len(x_k27me3), 1500)):
    gids = x_k27me3[i:i+1500].index
    plotProfile('h3k27me3', gids, 'gfp', 0, c[n], ax)
plt.legend(['1-1500','1500-3000','3000-4500','4500-6000','6000-7500','7500-8161'])
-6000','6000-7500','7500-8161'])#RNA pol II Ser5p window=int(math.ceil(5000.0/109))xticks=range(-2500,2500,window)x1_ps5_new=k4me3_minusoutlier[k4me3_minusoutlier['x2']>50]['x2']x1_ps5_new=x1_ps5_new.sort_values(ascending=False)fig,ax=plt.subplots(figsize=[10,5])c=sns.color_palette("Set1",n_colors=9,desat=.5)sns.set_style('darkgrid')forn,iinenumerate(range(0,len(x1_ps5_new),1500)):gids=x1_ps5_new[i:i+1500].indexsns.tsplot([x1_ps5[k]forkingids],time=xticks,color=c[n])sns.set_style('darkgrid')plt.legend(['1-1500','1500-3000','3000-4500','4500-6000','6000-7500','7500-8161'])Underneath is an example plot for X1 genes for all tracks, 2.5KB either side of the TSS. For publication, owing to constraints on figure size, plots were trimmed manually.
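The position-wise Spearman correlations plotted at the top of this section (between per-bin ChIP coverage and per-gene FACS proportion, as in `getPropCorr2`) can be sketched in isolation. This is a minimal illustration with toy data, not the actual inputs:

```python
import scipy.stats as ss

def positionwise_spearman(cov_vectors, proportions):
    # Spearman rho at each bin: coverage across genes vs. a per-gene scalar
    n_bins = len(cov_vectors[0])
    return [ss.spearmanr([v[i] for v in cov_vectors], proportions)[0]
            for i in range(n_bins)]

# toy input: 4 "genes" x 2 bins; the real vectors are ~109 bins of ~46 bp
cov = [[1, 2], [2, 1], [3, 4], [4, 3]]
props = [1, 2, 3, 4]
rhos = positionwise_spearman(cov, props)  # [1.0, 0.6]
```

The first bin tracks the proportions perfectly (rho = 1.0); the second only partially (rho = 0.6).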
For the RNA Pol II Ser5P tracks, a coverage matrix spanning 5 kb either side of the TSS was calculated with deepTools.
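The deepTools step can be sketched as follows. This is a hedged reconstruction, not the exact command used: `smed_genes.bed` and `ser5p_norm.bw` are hypothetical file names, and the 50 bp bin size is taken from the pausing-ratio description later in this section.

```shell
# reference-point mode anchors the matrix on the TSS of each region;
# -b/-a set the upstream/downstream extent in bp
computeMatrix reference-point --referencePoint TSS \
    -b 5000 -a 5000 --binSize 50 \
    -R smed_genes.bed \
    -S ser5p_norm.bw \
    -o RNAPolII.matrix.gz
```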
# Plot genes with the same axis according to the maximum value.
def plotExp4(tid, title, ax):
    sns.set(font_scale=1)
    sns.set_style("white")
    sns.set_style("ticks")
    sns.despine()
    window = int(math.ceil(5000.0 / 109))
    xticks = range(-2500, 2500, window)
    pd.Series(h3k4me3_gfp.ix[tid], index=xticks).plot(kind='area', color='#0B5394', ax=ax[0], fontsize=40)
    pd.Series(k36_gfp.ix[tid], index=xticks).plot(kind='area', ax=ax[1], color='#9e45b5', fontsize=40)
    pd.Series(h3k27me3_gfp.ix[tid], index=xticks).plot(kind='area', ax=ax[2], color='#14a507', fontsize=40)
    pd.Series(h3k4me1_gfp.ix[tid], index=xticks).plot(kind='area', ax=ax[3], color='red', fontsize=40)
    maxy = max(max(h3k4me3_gfp.ix[tid]), max(k36_gfp.ix[tid]),
               max(h3k27me3_gfp.ix[tid]), max(h3k4me1_gfp.ix[tid]))
    ax[0].set_ylim(0, maxy)
    ax[1].set_ylim(0, maxy)
    ax[2].set_ylim(0, maxy)
    ax[3].set_ylim(0, maxy)
    ax[0].xaxis.set_ticklabels([])
    ax[1].xaxis.set_ticklabels([xticks])
    ax[2].xaxis.set_ticklabels([xticks])
    ax[3].xaxis.set_ticklabels([xticks])
    ax[0].set_title(title, fontsize=25)

def plotExamples4(e, w, h):
    n = int(round(float(len(e)) / 2))
    rows = n * 4
    cols = 2
    fig, ax = plt.subplots(rows, cols, figsize=[w, h])
    fig.subplots_adjust(hspace=0)
    s = 0
    for item in e[:n]:
        axes = [ax[s, 0], ax[s + 1, 0], ax[s + 2, 0], ax[s + 3, 0]]
        plotExp4(item[0], item[1], axes)
        s += 4
    s = 0
    for item in e[n:]:
        axes = [ax[s, 1], ax[s + 1, 1], ax[s + 2, 1], ax[s + 3, 1]]
        plotExp4(item[0], item[1], axes)
        s += 4
    fig.tight_layout()

# example X1 gene profiles
x1newExamples = [x.split('\t') for x in '''asx1.1_ox1.0.loc.31395	Smed-piwi1
asx1.1_ox1.0.loc.09397	Smed-Cyclin-B1
asx1.1_ox1.0.loc.03346	ercc6
asx1.1_ox1.0.loc.01393	Smed-ddx52
asx1.1_ox1.0.loc.00709	Smed-cdt1
asx1.1_ox1.0.loc.09127	Smed-wee1
asx1.1_ox1.0.loc.36387	Smed-tonsoku
asx1.1_ox1.0.loc.13409	Smed-cytokenisis
asx1.1_ox1.0.loc.00891	Smed-ddx24
asx1.1_ox1.0.loc.19362	Smed-exonuclease
asx1.1_ox1.0.loc.10088	Smed-setd8
asx1.1_ox1.0.loc.08008	Smed-mcm2'''.split('\n')]
plotExamples4(x1newExamples, 60, 60)

# Order ID categories from high to low FACS rank category for each enrichment group
x1_lid_order = [x for x in df_props['x1'].sort_values(ascending=False).index
                if x not in mark_outlier and x in set(x1_lids)]
x2_lid_order = [x for x in df_props['x2'].sort_values(ascending=False).index
                if x not in mark_outlier and x in set(x2_lids)]
xins_lid_order = [x for x in df_props['xins'].sort_values(ascending=False).index
                  if x not in mark_outlier and x in set(xins_lids)]
x2ins_lids = set(df_props[df_props[['x2', 'xins']].sum(axis=1) >= 75].index) - x2_lids - xins_lids

# Load mex-3 differential list following Kallisto and Sleuth analysis against the current genome
mexFile = open('/home/share/projects/smed_lpt/whole_mex3/results.final.tab')
mexFile.next()
fc = []
lids = []
for line in mexFile:
    meta = line.strip().split()
    if meta[4] != 'NA':
        if float(meta[4]) <= 0.05:
            lids.append(meta[0])
            fc.append(float(meta[3]))
mex3_rnai = pd.Series(fc, index=lids)
mexFile.close()

# count number of IDs with log2 fold change <= -1 (i.e. fc of -2 or less)
mex3_down_lids = mex3_rnai[mex3_rnai <= -1].index
len(mex3_down_lids)

# count number of IDs in the X2_Xins category (x2 + xins FACS proportion >= 75%) with <= 10% proportion in X1
x2_xins_10x1 = pd.DataFrame([df_props.ix[x2ins_lids][df_props['x1'] <= 10].index])
len(x2_xins_10x1.columns)

fig, ax = plt.subplots(3, 2, figsize=[10, 10])
window = int(math.ceil(5000.0 / 109))
xticks = range(-2500, 2500, window)
sns.tsplot(list(df_data[df_data['mark'] == 'h3k4me3']['h3k4me3_perCov']
                .ix[x1_lid_order[:1000]].dropna()), time=xticks, ax=ax[0][0], color='#0B5394')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k27me3']['h3k27me3_perCov']
                .ix[x1_lid_order[:1000]].dropna()), time=xticks, ax=ax[0][0], color='#4AB545')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k4me3']['h3k4me3_perCov']
                .ix[x2_lid_order[:1000]].dropna()), time=xticks, ax=ax[1][0], color='#0B5394')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k27me3']['h3k27me3_perCov']
                .ix[x2_lid_order[:1000]].dropna()), time=xticks, ax=ax[1][0], color='#4AB545')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k4me3']['h3k4me3_perCov']
                .ix[xins_lid_order[:1000]].dropna()), time=xticks, ax=ax[0][1], color='#0B5394')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k27me3']['h3k27me3_perCov']
                .ix[xins_lid_order[:1000]].dropna()), time=xticks, ax=ax[0][1], color='#4AB545')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k4me3']['h3k4me3_perCov']
                .ix[mex3_down_lids[:285]].dropna()), time=xticks, ax=ax[1][1], color='#0B5394')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k27me3']['h3k27me3_perCov']
                .ix[mex3_down_lids[:285]].dropna()), time=xticks, ax=ax[1][1], color='#4AB545')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k4me3']['h3k4me3_perCov']
                .ix[df_props.ix[x2ins_lids][df_props['x1'] <= 10].index].dropna()),
           time=xticks, ax=ax[2][1], color='#0B5394')
sns.tsplot(list(df_data[df_data['mark'] == 'h3k27me3']['h3k27me3_perCov']
                .ix[df_props.ix[x2ins_lids][df_props['x1'] <= 10].index].dropna()),
           time=xticks, ax=ax[2][1], color='#4AB545')
ax[0][0].set_title('top 1000 ranked X1 loci')
ax[0][1].set_title('top 1000 ranked Xins loci')
ax[1][0].set_title('top 1000 ranked X2 loci')
ax[1][1].set_title('top 1000 ranked smed-Mex3 RNAi\n(pvalue <= 0.05, fc <= -2) loci')
ax[2][1].set_title('X2/Xins loci with <= 10% X1 proportion')
fig.tight_layout()

bivalence_gfp = pd.concat([h3k4me3_gfp, h3k27me3_gfp], axis=1)
bivalence_gfp.columns = ['h3k4me3', 'h3k27me3']

# calculate correlation between h3k4me3 and h3k27me3 values between -1 kb and +1.5 kb relative to the TSS
bivalence_corr = []
for index, row in bivalence_gfp.iterrows():
    if sum(row[0]) > 0 and sum(row[1]) > 0:
        bivalence_corr.append([index, ss.pearsonr(row[0][30:80], row[1][30:80])[0]])
bivalence_corr = pd.Series([x[1] for x in bivalence_corr], index=[x[0] for x in bivalence_corr])

x1_ranking_minus_outlier = k4me3_minusoutlier[k4me3_minusoutlier['x1'] > 50]['x1']
x1_500_minus_outlier = x1_ranking_minus_outlier.sort_values(ascending=False)
x2_ranking_minus_outlier = k4me3_minusoutlier[k4me3_minusoutlier['x2'] > 50]['x2']
x2_500_minus_outlier = x2_ranking_minus_outlier.sort_values(ascending=False)
x1_500_minus_outlier_2 = set(x1_500_minus_outlier.index[:500])
x2_500_minus_outlier_2 = set(x2_500_minus_outlier.index[:500])

fig, ax = plt.subplots(figsize=[8, 4])
ax.set_xlim(-1, 1)
bivalence_corr.ix[x1_500_minus_outlier_2].plot(kind='kde', ax=ax, color='#1C637A', fontsize=12)
bivalence_corr.ix[x2_500_minus_outlier_2].plot(kind='kde', ax=ax, color='#81CCE6')
bivalence_corr.ix[mex3_down_lids].plot(kind='kde', ax=ax, color='red')
plt.savefig('/home/anish/corr.svg')
plt.legend(['top 500 X1 genes', 'top 500 X2 genes', 'fc < -2 mex3 rnai loci'])

We first calculated RNA Pol II Ser5P coverage 2.5 kb either side of the TSS, using the input-normalized coverage scores given at 50 bp increments in the computeMatrix output of the deepTools package. We then employed the following script to print the pausing ratio for each ID. As TESs can be difficult to predict accurately, we used coverage up to 2.5 kb downstream of the TSS as a proxy for the gene body. For genes shorter than 2.5 kb, this estimate of the pausing index (PI) is likely to be conservative. For genes between 1 kb and 2.5 kb in length, we inspected the RNA Pol II Ser5P profile to validate pausing.
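Under this 50 bp binning, a matrix column index maps to a TSS-relative position as follows (a small sanity-check sketch; with 2.5 kb either side the matrix has 100 bins and the TSS sits at bin 50):

```python
def bin_to_bp(i, bin_size=50, flank=2500):
    # column i of the 100-bin matrix -> bp relative to the TSS
    return i * bin_size - flank

# TSS window (numerator):    bins 40-60  -> -500 bp .. +500 bp
# body window (denominator): bins 60-100 -> +500 bp .. +2500 bp
positions = [bin_to_bp(i) for i in (40, 50, 60, 100)]
```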
import gzip

# Pausing ratio per region: summed coverage around the TSS (bins 40-60,
# -500 bp to +500 bp) divided by summed gene-body coverage (bins 60-100,
# +500 bp to +2500 bp).
with gzip.open('/home/anish/RNAPolII.matrix.gz') as infile:
    for line in infile:
        if line.startswith('@'):
            continue  # skip the deepTools JSON header line
        cols = line.strip().replace('nan', '0').split('\t')
        row = cols[6:]  # first 6 columns are region metadata (chrom, start, end, name, score, strand)
        tss_cov, body_cov = row[40:60], row[60:100]
        try:
            ratio = sum(map(float, tss_cov)) / sum(map(float, body_cov))
        except ZeroDivisionError:
            ratio = 0
        print(ratio)
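The per-row arithmetic above can be factored into a standalone helper, which makes the ratio easy to check on toy values (the window bounds are the same as in the script; the example row is synthetic):

```python
def pausing_index(row, tss=slice(40, 60), body=slice(60, 100)):
    # ratio of summed TSS-proximal coverage to summed gene-body coverage
    tss_cov = sum(float(x) for x in row[tss])
    body_cov = sum(float(x) for x in row[body])
    return tss_cov / body_cov if body_cov > 0 else 0

# toy 100-bin row: strong promoter-proximal signal, weaker gene body
row = ['0'] * 40 + ['4'] * 20 + ['1'] * 40
print(pausing_index(row))  # 80 / 40 = 2.0
```

A region with no gene-body signal returns 0 rather than raising, matching the script's ZeroDivisionError handling.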