importgzip,os,sysfromIPython.displayimportHTMLimportpandasaspdimportseabornassnsimportmatplotlib.pyplotaspltimportscipy.statsasssimportscipyasspimportnumpyasnpfromcollectionsimportdefaultdictfromscipy.signalimportargrelmin,argrelmaximportmathimportmatplotlib.patchesaspatches#import custom functions for displaying tables, bash commandssys.path.append(os.path.abspath("/home/damian/"))fromdk_ipythonimport*%matplotlibinlineHTML(addToggle())

Supplementary Data¶

Re-annotation of the planarian asexual genome¶

The asexual genome (SmedAsxl_genome_v1.1.nt) from Smedgd2.0 was used for re-annotation. The re-annotation process consisted of utilizing available de-novo assembled transcriptome, available submitted sequences in NCBI, and 164 available RNA-seq libraries (list of RNA-seq SRA ids are provided in Supplementary File 3).

a. Consolidation of de-novo transcriptomes and submitted NCBI sequences:

The following de-novo transcriptome data sets were used (downloaded from planmine and from Smedgd2.0):

oxford transcriptome
dresden transcriptome
smedgd asexual transcriptome
smedgd unigenes
NCBI complete CDS sequences *downloaded from nucleotide database at NCBI

NCBI CDS sequences were cleaned for poly-A tails.

The PASA pipeline was used to consolidate these datasets using GMAP as the mapper:

Launch_PASA_pipeline.pl -g SmedAsxl_genome_v1.1.nt -c pasa.alignAssembly.txt -C -R -t pasa_input.fa -f ncbi.complete.cds.seqclean.format.gids --ALIGNERS blat,gmap --CPU 32

defparseGTF(path):gtfFile=open(path)genes=defaultdict(lambda:defaultdict(list))forlineingtfFile:ifline[0]!='#':meta=line.strip().split('\t')ifmeta[2]=='exon':info=dict([(x.split('"')[0].strip(),x.split('"')[1]) \
                             forxinmeta[8].split(';')[:-1]])tid=info['transcript_id']gid=info['gene_id']genes[gid][tid].append((meta[0],meta[3],meta[4]))gtfFile.close()returngenes

pasa=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/pasa/transdecoder/smed_pasa.transdecoder.gtf')print'Number of PASA genes:',len(pasa)print'Number of PASA transcripts:',sum([len(v)fork,vinpasa.items()])

Number of PASA genes: 67037
Number of PASA transcripts: 67037

b. RNA-seq assembly:

RNA-seq data was mapped to the asexual S.med genome (SmedAsxl_genome_v1.1.nt) using hisat2-2.04. The following comand was used:

hisat2 --dta -p 32 -x SmedAsxl.hisat2 -1 libA.fastq.gz -2 libB.fastq.gz | samtools view -b - > lib.bam

for strand specific Aboobaker lab libraries:

hisat2 --rna-strandness FR --dta -p 32 -x SmedAsxl.hisat2 -1 libA.fastq.gz -2 libB.fastq.gz | samtools view -b - > lib.bam

Stringtie was used to assemble transcripts from aligned reads in the bam file (following sorting):

stringtie lib.sorted.bam -o lib.sorted.gtf -p 32

Each .gtf file from the previous stringtie run was merged together with the PASA annotations

stringtie --merge *.gtf > all.merged.gtf

stringtie=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/raw/merged_noPASA.gtf')printprint'Number of StringTie genes:',len(stringtie)print'Number of StringTie transcripts:',sum([len(v)fork,vinstringtie.items()])


Number of StringTie genes: 47427
Number of StringTie transcripts: 91464

c-consolidate-pasa-and-rna-seq-assemblyc. Consolidate PASA and RNA-seq assembly

We next generated a consolidated set of transcript data for downstream analysis. StringTie was used to merge together RNA-seq assembly and PASA annotations.

There are 2,776 transcripts from the de-novo assembly dataset that do not map to the asexual genome.

stringtie=parseGTF('/home/share/projects/smed_transcriptome/raw/annotation/raw/merged_withPASA.gtf')printprint'Number of StringTie genes:',len(stringtie)print'Number of StringTie transcripts:',sum([len(v)fork,vinstringtie.items()])


Number of StringTie genes: 52178
Number of StringTie transcripts: 97010

d-clean-consolidated-annotationsd. Clean consolidated annotations

This annotation dataset contains a lot of potential redundancies where extremely similar genes were assembled separately with small differences in splice junctions.

Some of these features may be biological, but we should be able to identify obvious errors. The general strategy for cleaning this data is to identify extremely similar things to reduce redundancy.

We identify obviously similar things by:

Cluster overlapping transcripts into loci using mergeBed
For each cluster of overlapping transcripts, we calculate pair-wise distances (based on intron coordinates) for all transcripts in the cluster. This pair-wise distance is similar to a jaccard index. For each pair of transcripts (A and B). We find the length of intron intersection between A and B (i_AB). Then we divide this intersection by the sum length of A introns (L_A) and sum length of B introns (L_B), giving us two values (j_A = i_AB / L_A and j_B = i_AB / LB).
j_A and j_B are 1 if the intron of A or B are completely within each other. If both j_A and j_B are 1, then both A and B have the exact same intron structure. We filter all pair-wise comparisons for comparisons where both j_A and j_B are greater than or equal to 0.9.
Using the filtered comparisons as edges, we create a graph and find maximal cliques of the graphs. These cliques represent clusters of transcripts that are highly similar to each other in terms of intron structure.
From these cliques, we then find a representative transcripts to keep while discarding other transcripts. We utilize transcript length, ORF length, and lowest Uniprot BLAST e-value as criteria (lowest to highest priority) for picking the cluster representative.

The final set of annotations used for chip-seq and rna-seq contains

38,771 distinct loci
79,177 transcripts

For easier down-stream interpretation, we assume each locus is a gene for now. However, note that each locus is simply a set of transcriptionally active regions that overlap. This has the potential of including falsely fused genes or non-coding RNAs that happen to be within a gene.

e-incorporation-of-strand-information-e. Incorporation of strand information:

To incorporate strand data for each gene, we used:

BLAST to Uniprot to observe forward/reverse frame hits

Data from 4 strand-specific RNA-seq libraries generated by Aboobaker lab was used.

 rna_aboobaker_x1_gfp_072h_01
 rna_aboobaker_x1_gfp_072h_02
 rna_aboobaker_x1_lpt_072h_01
 rna_aboobaker_x1_lpt_072h_02
 The libraries were mapped to the Smed genome with hisat2:
 hisat2 --rna-strandness FR --dta -p 32 -x SmedAsxl_genome_v1.1.nt -1 lib_A.fastsq.gz -2 lib_B.fastq.gz | samtools view -b - > lib.bam

   Hisat2, when given the strandness flag will output a 'XS:A' field in the BAM file indicating strand information. The following script (getStrand.py) is then used to extract that information from read pairs. This is only performed on proper pairs with highest mapping quality possible.

 get.Strand.py: 

 import sys
     pair = []
     for line in sys.stdin:
         meta = line.strip().split()
         pair.append(meta)

         if len(pair) == 2:
                 if pair[0][0] == pair[1][0]:
                         if pair[0][6] == '=':
                                 ref = pair[0][2]
                                 aPos = int(pair[0][3])
                                 bPos = int(pair[1][3])

                                 strand = [x[-1] for x in pair[0][10:] if x[:-1] == 'XS:A:'][0]

                                 start = min(aPos,bPos)
                                 end = max(aPos,bPos) + 40

                                 print '\t'.join([ref,str(start),str(end),strand,str(end - start + 1)])

                         pair = []
                 else:
                         pair.pop(0)

        # samtools view -f 2 -q 60 | python getStrand.py > strand.data

Longest ORF

Strategy for assigning strand information:

Calculate orientation based on blast hits (1) and BAM results from strand specific libraries (2). Any cases where one strand has 5x the counts of the other strand is considered.
If blast is enough to calculate strand, assign the blast strand. If not, assign the BAM strand.
If both blast and BAM doesn't give strand information, we take the strand of the longest ORF.
If ORF doesn't give strand information (both strand give same size ORFs), we sum up strand counts from blast and BAM and assign the strand with most counts.
If there is still no strand assigned, we leave the gene strand as ambiguous.
This strand assignment process left 64 genes with ambiguous strands.

f. Final annotation metrics:

Number of expressed loci: 38,771
Number of loci with transdecoder evidence: 21,772
All expressed loci exons cover 99,671,961 bases (779,479,550 total genomic bases), 12.79%
All coding exons cover 28,359,137 bases (779,479,550 total genomic bases), 3.64%

Number of proteins predicted by Transdecoder to have ORFs that are complete, 5' partial, 3' partial or internal

proteinFile=open('/home/share/projects/smed_transcriptome/raw/ox_smed_asx1.1_1.0.transdecoder.protein.header')transdecoderTypes=defaultdict(int)forlineinproteinFile:meta=line.strip().split()transdecoderTypes[meta[1]]+=1proteinFile.close()pd.Series([x[1]forxintransdecoderTypes.items()]\
          ,index=[' '.join(x[0].split('_'))forxintransdecoderTypes.items()])\
.plot(kind='barh',color='#2B6C82',fontsize=15,title='TransDecoder protein completeness')

<matplotlib.axes._subplots.AxesSubplot at 0x15dd0d10>

CEGMA homology of annotated nucleotide and protein sequences.

Figure shows homology of newly annotated S. mediterranea loci to 248 Core Eukaryotic Genes (CEGs) identified in Parra et al. 2009.
Y- axis is e-value of hit, x-axis is percent coverage of CEGMA alignment to annotated sequence.

deftileCoords(coords):formatCoords=[]forcoordincoords:coord=[int(coord[0]),int(coord[1])]formatCoords.append(('s',min(coord)))formatCoords.append(('e',max(coord)))formatCoords.sort(key=lambdax:x[0],reverse=True)formatCoords.sort(key=lambdax:x[1])count=0posA=0out=[]forposinformatCoords:ifcount==0:posA=pos[1]ifpos[0]=='s':count+=1ifpos[0]=='e':count-=1ifcount==0:out.append((posA,pos[1]))returnoutdefcegmaStat(path,t):cegmaFile=open(path)hits=defaultdict(list)forlineincegmaFile:meta=line.strip().split()k=(meta[0],meta[1])hits[k].append([int(meta[6]),int(meta[7]),float(meta[-2]),int(meta[2])])kogs=defaultdict(list)forhit,datainhits.items():tCoords=tileCoords(data)evalue=min([x[2]forxindata])klength=data[0][3]hitLength=sum([x[1]-x[0]+1forxintCoords])kogs[hit[0]].append((hitLength/float(klength),evalue,hit[1]))bestKog=defaultdict(lambda:(0,0,0))forkog,datainkogs.items():data.sort(key=lambdax:x[0],reverse=True)kid=kog.split('_')[-1]ifbestKog[kid][0]<data[0][0]:bestKog[kid]=(data[0][0],data[0][1],data[0][2])df=[]forkog,datainbestKog.items():df.append([kog,data[0],data[1],data[2]])df=pd.DataFrame(df)df.columns=['KogID','coverage','evalue','tid']df.plot(kind='scatter',x='coverage',y='evalue',logy=True,fontsize=12,figsize=[10,6],\
                title=t+' '+str(len(df))+' / 248 CEGMA genes found')

cegmaStat('/home/share/projects/smed_transcriptome/final/cegma/ox_smed.cegma','S.mediterranea annotations CEGMA (Nucleotide): ')

cegmaStat('/home/share/projects/smed_transcriptome/final/cegma/ox_smed_prot.cegma','S.mediterranea annotations CEGMA (Nucleotide): ')

Comparison of new expression based S.med genome annotations to MAKER annotations of Smed GD 2.0

We sought to compare our expression based annotated genome to the previous annotations Smed GD 2.0 constructed by homology analysis in MAKER. The following 5 datasets were used in the comparison:

New Oxford annotations
New Oxford annotations that are likely coding predicted by TransDecoder
New Oxford specific annotations not found in MAKER annotations from Smed GD 2.0
MAKER annotations from SmedGD 2.0
MAKER specific annotations from SmedGD 2.0 not found in Oxford annotations

The figure shows the percentage of annotations left over (y-axis) at a range of expression thresholds (x-axis). The expression value is the mean estimated counts (from Kallisto output) per KB from 164 RNA-seq libraries.

smedgdFile=open('/home/share/projects/smed_transcriptome/comp/smedgd.estCountPerKB')smedgd_estCountKB=[]forlineinsmedgdFile:meta=line.strip().split()smedgd_estCountKB.append([meta[0],float(meta[1])])smedgdFile.close()oxFile=open('/home/share/projects/smed_transcriptome/comp/ox.loci.estCountPerKB')ox_estCountKB=[]forlineinoxFile:meta=line.strip().split()ox_estCountKB.append([meta[0],float(meta[1])])oxFile.close()smedgdSpecific=open('/home/share/projects/smed_transcriptome/comp/smedgd.specific.gids').read().strip().split('\n')smedgdSpecific=[x+'-mRNA-1'forxinsmedgdSpecific]ox_codingLids=set(open('/home/share/projects/smed_transcriptome/comp/ox_coding.lids').read().strip().split('\n'))oxSpecific=open('/home/share/projects/smed_transcriptome/comp/ox.specific.tids').read().strip().split('\n')oxSpecific=list(set(['.'.join(x.strip().split('.')[:-1])forxinoxSpecific]))df_smedgd=pd.Series([x[1]forxinsmedgd_estCountKB],index=[x[0]forxinsmedgd_estCountKB])df_ox=pd.Series([x[1]forxinox_estCountKB],index=[x[0]forxinox_estCountKB])defcumulativeSeries(s):o=[]total=float(len(s))foriinrange(1,200,1):o.append([i,len(s[s<i])/total*100])returno

smedgd_threshold=pd.DataFrame(cumulativeSeries(df_smedgd),columns=['threshold','percent'])smedgdSp_threshold=pd.DataFrame(cumulativeSeries(df_smedgd.loc[smedgdSpecific]),columns=['threshold','percent'])ox_threshold=pd.DataFrame(cumulativeSeries(df_ox),columns=['threshold','percent'])oxCoding_threshold=pd.DataFrame(cumulativeSeries(df_ox.loc[ox_codingLids]),columns=['threshold','percent'])oxSpecific_threshold=pd.DataFrame(cumulativeSeries(df_ox.loc[oxSpecific]),columns=['threshold','percent'])

fig,ax=plt.subplots(figsize=[10,5])smedgd_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)smedgdSp_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)ox_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)oxCoding_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)oxSpecific_threshold.plot(kind='line',x='threshold',y='percent',ax=ax)plt.legend(['SmedGD Maker annotations','SmedGD MAKER specific annotations','Oxford annotation','Oxford coding annotations','Oxford specific annotations'])ax.set_title('Cumulative percentage of annotations for Smed GD 2.0 and new Oxford loci at a range of expression thresholds')

<matplotlib.text.Text at 0x15970290>

Epigenetic analyses of planarian stem cells demonstrate conservation of bivalent histone modifications in animal stem cells