#Gene Prediction Pipeline for Hydractinia genomes
#This document covers the steps used to generate gene models for both Hydractinia species. This pipeline involves PASA and Augustus.

#Stranded RNAseq data for each species is deposited in the SRA under BioProject PRJNA812777 (H. echinata) and PRJNA807936 (H. symbiolongicarpus).
#Trinity transcripts used for each species can be found on the Hydractinia Genome Project Portal under Downloads.


#1. Generate ab initio gene predictor training sets with PASA
#H. symbiolongicarpus:
Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g primary.polished.fa --ALIGNERS blat,gmap -t transcripts.fasta.clean --CPU 24
#Output: symbio.assemblies.fasta, symbio.pasa_assemblies.gff3

pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta symbio.assemblies.fasta --pasa_transcripts_gff3 symbio.pasa_assemblies.gff3
#Output: symbio.assemblies.fasta.transdecoder.genome.gff3

#H. echinata:
Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g primary.polished.fa --ALIGNERS blat,gmap -t transcripts.fasta.clean --CPU 16
#Output: echinata.assemblies.fasta, echinata.pasa_assemblies.gff3

pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta echinata.assemblies.fasta --pasa_transcripts_gff3 echinata.pasa_assemblies.gff3
#Output: echinata.assemblies.fasta.transdecoder.genome.gff3
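
#Optional sanity check (not part of the original pipeline): before training Augustus, it can help to confirm that the training set contains a plausible number of gene models. This is a minimal sketch, assuming a standard tab-separated 9-column GFF3 file that contains "gene" features; count_gff_features is a hypothetical helper name.
count_gff_features() {
    #Usage: count_gff_features FILE FEATURE
    #Prints the number of non-comment lines whose third column equals FEATURE.
    awk -F '\t' -v feature="$2" '!/^#/ && $3 == feature { n++ } END { print n + 0 }' "$1"
}
#Example: count_gff_features symbio.assemblies.fasta.transdecoder.genome.gff3 gene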

#2. Augustus training with PASA output
#For each species, the TransDecoder .gff3 file produced by PASA in step 1 was renamed to augustus-training.gff and then used to train Augustus with the following commands:

#H. symbiolongicarpus:
autoAug.pl -g primary.polished.fa -t augustus-training.gff --species=symbio -v --workingdir=/data/projects/hydractinia/RUNNING_SOFTWARE/Augustus/symbio/2017_08_15.strand_specific.170711_SONIC_HKNMYBBXX.allpaths.hc12.trinity.pasa/training_augustus/

#H. echinata:
autoAug.pl -g primary.polished.fa -t augustus-training.gff --species=echinata -v --workingdir=/data/projects/hydractinia/RUNNING_SOFTWARE/Augustus/echinata/2016_10_13.strand_specific.allpaths.hc12.trinity.pasa/training_augustus/2017_07_05.Hech_Dovetail_PBJelly_arrow_pilon_primary/
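
#The renaming described at the start of this step is not recorded as a command. This is a minimal sketch, assuming the TransDecoder GFF3 file from step 1 is in the current directory. prepare_training_gff is a hypothetical helper, and copying rather than moving is an assumption made here so that the PASA output is left untouched.
prepare_training_gff() {
    #Usage: prepare_training_gff SOURCE_GFF3 [DEST]
    src="$1"
    dest="${2:-augustus-training.gff}"
    #Refuse to continue if the PASA output is missing or empty.
    if [ ! -s "$src" ]; then
        echo "missing or empty file: $src" >&2
        return 1
    fi
    cp -- "$src" "$dest"
}
#Example (H. symbiolongicarpus): prepare_training_gff symbio.assemblies.fasta.transdecoder.genome.gff3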

#3. Create Augustus hints file
#A hints file was created for Augustus for each species. First, BLAT was run against each genome to generate an out_blat.psl file. Then blat2hints.pl was run as follows to generate the hints.gff file:

blat2hints.pl --nomult --in=out_blat.psl --out=hints.gff
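
#The BLAT command and any post-processing of out_blat.psl are not recorded here. As far as we understand the AUGUSTUS documentation, blat2hints.pl expects its input sorted by target sequence name and then by target start position. This is a minimal sketch of that sorting, assuming a headerless PSL file (e.g. produced with blat -noHead). sort_psl_for_hints is a hypothetical helper, and it is not known whether this sorting was applied in the original run.
sort_psl_for_hints() {
    #Usage: sort_psl_for_hints IN.psl > OUT.psl
    #PSL column 14 is the target (genome) sequence name; column 16 is the target start.
    tab=$(printf '\t')
    sort -t "$tab" -k14,14 -k16,16n "$1"
}
#Example: sort_psl_for_hints out_blat.psl > out_blat.sorted.psl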
		
#4. Run Augustus
#To run Augustus, we followed the “Incorporating Illumina RNAseq into AUGUSTUS with GSNAP” pipeline specified here: http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP

#4A. Aligning reads with GSNAP:
#See http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP

#4B. Filtering raw alignments:
#See http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP

#4C. Creating intron hints:
#H. symbiolongicarpus:
bam2hints --intronsonly --in=160901_YOSHI_C92GMANXX.fs.bam --out=hints.gff
			
#H. echinata:
bam2hints --intronsonly --in=HKNMYBBXX_15196318_S26_L003.fs.bam --out=hints.gff

#4D. Creating RepeatMasker hints:
#See http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP

#4E. Running Augustus:
#Run Augustus on the unmasked genome with the species parameters trained on the PASA output (step 2). Use the RepeatMasker output (de novo and known repeats) as repeat hints, the GSNAP-derived hints from step 4C as intron hints, and the BLAT-aligned output as exon hints (the input data for the BLAT run is not recorded here; ask Sofia which data she used).
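
#A minimal sketch of how the hint sources described above could be combined and passed to Augustus; it is not a record of the original command. The file names (hints.introns.gff, hints.repeats.gff, hints.blat.gff, hints.all.gff, extrinsic.cfg, augustus.out.gff) and both helper names are hypothetical, and any further Augustus options used in the original run are not recorded here. Steps 3 and 4C both write to hints.gff, so the sketch assumes those files were kept apart under distinct names.
merge_hints() {
    #Usage: merge_hints OUT.gff IN1.gff [IN2.gff ...]
    #Concatenates all hint files into a single file for --hintsfile.
    out="$1"
    shift
    cat -- "$@" > "$out"
}
run_augustus_with_hints() {
    #Usage: run_augustus_with_hints SPECIES GENOME.fa HINTS.gff EXTRINSIC.cfg > OUT.gff
    augustus --species="$1" --hintsfile="$3" --extrinsicCfgFile="$4" "$2"
}
#Example (H. symbiolongicarpus):
#merge_hints hints.all.gff hints.introns.gff hints.repeats.gff hints.blat.gff
#run_augustus_with_hints symbio primary.polished.fa hints.all.gff extrinsic.cfg > augustus.out.gff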

#5. Run PASA to add and update UTRs in the Augustus gene models
#See Step #1
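
#Step 1 shows only the alignment-assembly run; updating existing gene models is normally a second PASA pass. This is a minimal sketch following the annotation-comparison procedure in the PASA documentation, not a record of the commands actually run. annotCompare.config and the GFF3 file of Augustus gene models are assumed inputs, update_utrs_with_pasa is a hypothetical helper, Load_Current_Gene_Annotations.dbi is assumed to be on the PATH, and the exact options may differ between PASA versions. Set DRY_RUN=1 to print the commands instead of running them.
update_utrs_with_pasa() {
    #Usage: update_utrs_with_pasa GENOME.fa GENE_MODELS.gff3 TRANSCRIPTS.fasta.clean
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    #Load the Augustus gene models into the existing PASA database.
    $run Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g "$1" -P "$2" || return 1
    #Compare the models with the PASA assemblies and update them, including UTRs.
    $run Launch_PASA_pipeline.pl -c annotCompare.config -A -g "$1" -t "$3"
}
#Example: DRY_RUN=1; update_utrs_with_pasa primary.polished.fa augustus.gff3 transcripts.fasta.clean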