importgzip,os,sysfromIPython.displayimportHTMLimportpandasaspdimportseabornassnsimportmatplotlib.pyplotaspltimportscipy.statsasssimportscipyasspimportnumpyasnpfromcollectionsimportdefaultdictfromscipy.signalimportargrelmin,argrelmaximportmathimportmatplotlib.patchesaspatches#import custom functions for displaying tables, bash commandssys.path.append(os.path.abspath("/home/damian/"))fromdk_ipythonimport*%matplotlibinlineHTML(addToggle())The asexual genome (SmedAsxl_genome_v1.1.nt) from Smedgd2.0 was used for re-annotation. The re-annotation process consisted of utilizing available de-novo assembled transcriptome, available submitted sequences in NCBI, and 164 available RNA-seq libraries (list of RNA-seq SRA ids are provided in Supplementary File 3).
The following de-novo transcriptome data sets were used (downloaded from planmine and from Smedgd2.0):
oxford transcriptome
dresden transcriptome
smedgd asexual transcriptome
smedgd unigenes
NCBI complete CDS sequences *downloaded from nucleotide database at NCBI
NCBI CDS sequences were cleaned for poly-A tails.
The PASA pipeline was used to consolidate these datasets using GMAP as the mapper:
Launch_PASA_pipeline.pl -g SmedAsxl_genome_v1.1.nt -c pasa.alignAssembly.txt -C -R -t pasa_input.fa -f ncbi.complete.cds.seqclean.format.gids --ALIGNERS blat,gmap --CPU 32
defparseGTF(path):gtfFile=open(path)genes=defaultdict(lambda:defaultdict(list))forlineingtfFile:ifline[0]!='#':meta=line.strip().split('\t')ifmeta[2]=='exon':info=dict([(x.split('"')[0].strip(),x.split('"')[1]) \
forxinmeta[8].split(';')[:-1]])tid=info['transcript_id']gid=info['gene_id']genes[gid][tid].append((meta[0],meta[3],meta[4]))gtfFile.close()returngenespasa=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/pasa/transdecoder/smed_pasa.transdecoder.gtf')print'Number of PASA genes:',len(pasa)print'Number of PASA transcripts:',sum([len(v)fork,vinpasa.items()])RNA-seq data was mapped to the asexual S.med genome (SmedAsxl_genome_v1.1.nt) using hisat2-2.04. The following comand was used:
hisat2 --dta -p 32 -x SmedAsxl.hisat2 -1 libA.fastq.gz -2 libB.fastq.gz | samtools view -b - > lib.bam
for strand specific Aboobaker lab libraries:
hisat2 --rna-strandness FR --dta -p 32 -x SmedAsxl.hisat2 -1 libA.fastq.gz -2 libB.fastq.gz | samtools view -b - > lib.bam
Stringtie was used to assemble transcripts from aligned reads in the bam file (following sorting):
stringtie lib.sorted.bam -o lib.sorted.gtf -p 32
Each .gtf file from the previous stringtie run was merged together with the PASA annotations
stringtie --merge *.gtf > all.merged.gtf
stringtie=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/raw/merged_noPASA.gtf')printprint'Number of StringTie genes:',len(stringtie)print'Number of StringTie transcripts:',sum([len(v)fork,vinstringtie.items()])We next generated a consolidated set of transcript data for downstream analysis. StringTie was used to merge together RNA-seq assembly and PASA annotations.
There are 2,776 transcripts from the de-novo assembly dataset that do not map to the asexual genome.
stringtie=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/raw/merged_withPASA.gtf')printprint'Number of StringTie genes:',len(stringtie)print'Number of StringTie transcripts:',sum([len(v)fork,vinstringtie.items()])This annotation dataset contains a lot of potential redundancies where extremely similar genes were assembled separately with small differences in splice junctions.
Some of these features may be biological, but we should be able to identify obvious errors. The general strategy for cleaning this data is to identify extremely similar things to reduce redundancy.
We identify obviously similar things by:
The final set of annotations used for chip-seq and rna-seq contains
38,771 distinct loci
79,177 transcripts
For easier down-stream interpretation, we assume each locus is a gene for now. However, note that each locus is simply a set of transcriptionally active regions that overlap. This has the potential of including falsely fused genes or non-coding RNAs that happen to be within a gene.
To incorporate strand data for each gene, we used:
BLAST to Uniprot to observe forward/reverse frame hits
Data from 4 strand-specific RNA-seq libraries generated by Aboobaker lab was used.
rna_aboobaker_x1_gfp_072h_01
rna_aboobaker_x1_gfp_072h_02
rna_aboobaker_x1_lpt_072h_01
rna_aboobaker_x1_lpt_072h_02
The libraries were mapped to the Smed genome with hisat2:
hisat2 --rna-strandness FR --dta -p 32 -x SmedAsxl_genome_v1.1.nt -1 lib_A.fastsq.gz -2 lib_B.fastq.gz | samtools view -b - > lib.bam
Hisat2, when given the strandness flag will output a 'XS:A' field in the BAM file indicating strand information. The following script (getStrand.py) is then used to extract that information from read pairs. This is only performed on proper pairs with highest mapping quality possible.
get.Strand.py:
import sys
pair = []
for line in sys.stdin:
meta = line.strip().split()
pair.append(meta)
if len(pair) == 2:
if pair[0][0] == pair[1][0]:
if pair[0][6] == '=':
ref = pair[0][2]
aPos = int(pair[0][3])
bPos = int(pair[1][3])
strand = [x[-1] for x in pair[0][10:] if x[:-1] == 'XS:A:'][0]
start = min(aPos,bPos)
end = max(aPos,bPos) + 40
print '\t'.join([ref,str(start),str(end),strand,str(end - start + 1)])
pair = []
else:
pair.pop(0)
# samtools view -f 2 -q 60 | python getStrand.py > strand.data
Longest ORF
Strategy for assigning strand information:
Calculate orientation based on blast hits (1) and BAM results from strand specific libraries (2). Any cases where one strand has 5x the counts of the other strand is considered.
If blast is enough to calculate strand, assign the blast strand. If not, assign the BAM strand.
If both blast and BAM doesn't give strand information, we take the strand of the longest ORF.
If ORF doesn't give strand information (both strand give same size ORFs), we sum up strand counts from blast and BAM and assign the strand with most counts.
If there is still no strand assigned, we leave the gene strand as ambiguous.
This strand assignment process left 64 genes with ambiguous strands.
Number of expressed loci: 38,771
Number of loci with transdecoder evidence: 21,772
All expressed loci exons cover 99,671,961 bases (779,479,550 total genomic bases), 12.79%
All coding exons cover 28,359,137 bases (779,479,550 total genomic bases), 3.64%
proteinFile=open('/home/share/projects/smed_transcriptome/raw/ox_smed_asx1.1_1.0.transdecoder.protein.header')transdecoderTypes=defaultdict(int)forlineinproteinFile:meta=line.strip().split()transdecoderTypes[meta[1]]+=1proteinFile.close()pd.Series([x[1]forxintransdecoderTypes.items()]\
,index=[' '.join(x[0].split('_'))forxintransdecoderTypes.items()])\
.plot(kind='barh',color='#2B6C82',fontsize=15,title='TransDecoder protein completeness')Figure shows homology of newly annotated S. mediterranea loci to 248 Core Eukaryotic Genes (CEGs) identified in Parra et al. 2009.
Y- axis is e-value of hit, x-axis is percent coverage of CEGMA alignment to annotated sequence.
deftileCoords(coords):formatCoords=[]forcoordincoords:coord=[int(coord[0]),int(coord[1])]formatCoords.append(('s',min(coord)))formatCoords.append(('e',max(coord)))formatCoords.sort(key=lambdax:x[0],reverse=True)formatCoords.sort(key=lambdax:x[1])count=0posA=0out=[]forposinformatCoords:ifcount==0:posA=pos[1]ifpos[0]=='s':count+=1ifpos[0]=='e':count-=1ifcount==0:out.append((posA,pos[1]))returnoutdefcegmaStat(path,t):cegmaFile=open(path)hits=defaultdict(list)forlineincegmaFile:meta=line.strip().split()k=(meta[0],meta[1])hits[k].append([int(meta[6]),int(meta[7]),float(meta[-2]),int(meta[2])])kogs=defaultdict(list)forhit,datainhits.items():tCoords=tileCoords(data)evalue=min([x[2]forxindata])klength=data[0][3]hitLength=sum([x[1]-x[0]+1forxintCoords])kogs[hit[0]].append((hitLength/float(klength),evalue,hit[1]))bestKog=defaultdict(lambda:(0,0,0))forkog,datainkogs.items():data.sort(key=lambdax:x[0],reverse=True)kid=kog.split('_')[-1]ifbestKog[kid][0]<data[0][0]:bestKog[kid]=(data[0][0],data[0][1],data[0][2])df=[]forkog,datainbestKog.items():df.append([kog,data[0],data[1],data[2]])df=pd.DataFrame(df)df.columns=['KogID','coverage','evalue','tid']df.plot(kind='scatter',x='coverage',y='evalue',logy=True,fontsize=12,figsize=[10,6],\
title=t+' '+str(len(df))+' / 248 CEGMA genes found')cegmaStat('/home/share/projects/smed_transcriptome/final/cegma/ox_smed.cegma','S.mediterranea annotations CEGMA (Nucleotide): ')cegmaStat('/home/share/projects/smed_transcriptome/final/cegma/ox_smed_prot.cegma','S.mediterranea annotations CEGMA (Nucleotide): ')We sought to compare our expression based annotated genome to the previous annotations Smed GD 2.0 constructed by homology analysis in MAKER. The following 5 datasets were used in the comparison:
The figure shows the percentage of annotations left over (y-axis) at a range of expression thresholds (x-axis). The expression value is the mean estimated counts (from Kallisto output) per KB from 164 RNA-seq libraries.
smedgdFile=open('/home/share/projects/smed_transcriptome/comp/smedgd.estCountPerKB')smedgd_estCountKB=[]forlineinsmedgdFile:meta=line.strip().split()smedgd_estCountKB.append([meta[0],float(meta[1])])smedgdFile.close()oxFile=open('/home/share/projects/smed_transcriptome/comp/ox.loci.estCountPerKB')ox_estCountKB=[]forlineinoxFile:meta=line.strip().split()ox_estCountKB.append([meta[0],float(meta[1])])oxFile.close()smedgdSpecific=open('/home/share/projects/smed_transcriptome/comp/smedgd.specific.gids').read().strip().split('\n')smedgdSpecific=[x+'-mRNA-1'forxinsmedgdSpecific]ox_codingLids=set(open('/home/share/projects/smed_transcriptome/comp/ox_coding.lids').read().strip().split('\n'))oxSpecific=open('/home/share/projects/smed_transcriptome/comp/ox.specific.tids').read().strip().split('\n')oxSpecific=list(set(['.'.join(x.strip().split('.')[:-1])forxinoxSpecific]))df_smedgd=pd.Series([x[1]forxinsmedgd_estCountKB],index=[x[0]forxinsmedgd_estCountKB])df_ox=pd.Series([x[1]forxinox_estCountKB],index=[x[0]forxinox_estCountKB])defcumulativeSeries(s):o=[]total=float(len(s))foriinrange(1,200,1):o.append([i,len(s[s<i])/total*100])returnosmedgd_threshold=pd.DataFrame(cumulativeSeries(df_smedgd),columns=['threshold','percent'])smedgdSp_threshold=pd.DataFrame(cumulativeSeries(df_smedgd.loc[smedgdSpecific]),columns=['threshold','percent'])ox_threshold=pd.DataFrame(cumulativeSeries(df_ox),columns=['threshold','percent'])oxCoding_threshold=pd.DataFrame(cumulativeSeries(df_ox.loc[ox_codingLids]),columns=['threshold','percent'])oxSpecific_threshold=pd.DataFrame(cumulativeSeries(df_ox.loc[oxSpecific]),columns=['threshold','percent'])fig,ax=plt.subplots(figsize=[10,5])smedgd_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)smedgdSp_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)ox_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)oxCoding_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)oxSpecific_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)plt.legend(['SmedGD Maker annotations','SmedGD MAKER specific annotations','Oxford annotation','Oxford coding annotations','Oxford specific annotations'])ax.set_title('Cumulative percentage of annotations for Smed GD 2.0 and new Oxford loci at a range of expression thresholds')