# TE-Driven RNA Analysis
This is the Schones lab pipeline to determine transposon-driven RNA. 
This script was written to work on a basic Linux command line; however, it can be adapted for use on high-performance clusters. 
!!!!! A HISAT2 or STAR index must be created before running !!!!!


### Usage ###
To use the pipeline:
sh TEDRA.sh <project-name>.list

### Before you use the Pipeline ###
The following variables in the script need to be changed to match each project.
Open TEDRA.sh and set:

species=""
geneDataBase=""
alingment=""

## Reference/ ##
    Set the variables above to match your project.
    Currently only hg19 is provided in Reference/.
    Additional genomes need:
    $species.$geneDataBase.gtf          ## transcript annotation (GTF), provided by the user
    $species-TE.bed                     ## list of transposons in BED format, e.g.:     chr1   25335   25439   IAPLTR2b/LTR/ERVK   .   -
## STAR (multimapped) or HISAT2 (unique), optional
    $species_STAR/                      ## STAR index
    $species_HIsat2/                    ## HISAT2 index
    $species-$geneDataBase-splice.txt   ## HISAT2 splice sites
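
Before running, it can help to confirm that Reference/ holds the files listed above. The following is a minimal sketch and not part of the pipeline: `check_reference` is a hypothetical helper, and the file names are taken from the list above. Note the braces in `${species}_${aligner}`: written as `$species_STAR`, the shell would look for a variable named `species_STAR`.

```sh
# Hypothetical helper: report which reference files are missing for a genome.
# Usage: check_reference <reference_dir> <species> <geneDataBase> <aligner>
# <aligner> is "STAR" or "HIsat2", matching the directory names listed above.
check_reference() {
    ref_dir=$1; species=$2; gene_db=$3; aligner=$4
    missing=0
    for f in "$ref_dir/$species.$gene_db.gtf" "$ref_dir/$species-TE.bed"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done
    # Braces are required here so the shell expands $species, not $species_STAR.
    index_dir="$ref_dir/${species}_${aligner}"
    if [ ! -d "$index_dir" ]; then
        echo "missing index directory: $index_dir"
        missing=1
    fi
    # The splice-site file is only listed for HIsat2.
    if [ "$aligner" = "HIsat2" ] && [ ! -f "$ref_dir/$species-$gene_db-splice.txt" ]; then
        echo "missing file: $ref_dir/$species-$gene_db-splice.txt"
        missing=1
    fi
    return "$missing"
}
```

For example, `check_reference Reference hg19 <geneDataBase> STAR` prints nothing and returns 0 when everything is in place.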

## List file ##
    The ".list" file should be a tab-separated file with the following pattern:
    unique_sample_ID "\t" 1<or>2 "\t" /path/to/fastq_Read1.fa "\t" *(optional)/path/to/fastq_Read2.fa

1 = control sample
2 = test sample

The pipeline works with .gz, .bz2, or uncompressed FASTQ files.

example project.list: 
Circadian_C2-rep1   1   /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C2_mRNA_M_1.merged_R1.fastq.gz     /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C2_mRNA_M_1.merged_R2.fastq.gz
Circadian_C2-rep2   1   /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C2_mRNA_M_1.merged_R1.fastq.gz     /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C2_mRNA_M_1.merged_R2.fastq.gz
Circadian_C1-rep1   2   /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C1_mRNA_M_1.merged_R1.fastq.gz     /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C1_mRNA_M_1.merged_R2.fastq.gz
Circadian_C1-rep2   2   /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C1_mRNA_M_1.merged_R1.fastq.gz     /net/isi-dcnl/ifs/user_data/dschones/bioresearch/kecostello/Circadian-db/fastq/C1_mRNA_M_1.merged_R2.fastq.gz
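
A malformed list file is an easy mistake to make, so a quick check before submitting can save a run. The sketch below is not part of the pipeline: `validate_list` is a hypothetical helper that checks only the pattern described above (3 or 4 tab-separated fields, group 1 or 2, unique sample IDs within the file). It does not check that the FASTQ files exist.

```sh
# Hypothetical helper: check a .list file against the pattern described above.
# Prints one message per problem and returns non-zero if any were found.
validate_list() {
    awk -F '\t' '
        NF == 0 { next }                      # skip blank lines
        NF < 3 || NF > 4 {
            printf "line %d: expected 3 or 4 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 != "1" && $2 != "2" {
            printf "line %d: group must be 1 (control) or 2 (test), found \"%s\"\n", NR, $2
            bad = 1
        }
        seen[$1]++ {
            printf "line %d: duplicate sample ID \"%s\"\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Usage: `validate_list project.list && sh TEDRA.sh project.list`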

If you are using a sample that has been used before, be sure to give it the same label as in the previous run; this saves disk space and prevents making multiple copies of the same BAMs. 
Be sure to give each sample in a new project a unique ID. If you name two samples from different projects "control-rep1", the pipeline will reuse the alignments and StringTie output from the old run for further analysis. 
Previously generated files are in the folder: file/ 

The pipeline can now run multiple jobs at the same time. However, if you are generating the alignments and StringTie assemblies for the first time, it is recommended to let one job run first and then submit the other variants of the comparisons. (Run time depends on the size of the job.)
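
One way to follow this recommendation is sketched below. It is an illustration only: `run_staged` is a hypothetical wrapper, it assumes the first job should finish completely before the others start, and `TEDRA_CMD` is an assumed override (not a pipeline setting) that defaults to `sh TEDRA.sh`.

```sh
# Hypothetical wrapper: run the first list alone, then the remaining lists together.
# Usage: run_staged first.list other1.list other2.list ...
# TEDRA_CMD is an assumed override for this sketch; it defaults to "sh TEDRA.sh".
run_staged() {
    first=$1
    shift
    # The first job generates the alignments and StringTie assemblies.
    ${TEDRA_CMD:-sh TEDRA.sh} "$first" || return 1
    # The remaining comparisons are started in the background, all at once.
    for list in "$@"; do
        ${TEDRA_CMD:-sh TEDRA.sh} "$list" &
    done
    # Block until every background job has finished.
    wait
}
```

Usage: `run_staged comparisonA.list comparisonB.list comparisonC.list`, where the list names are placeholders.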

### Pipeline overview ###

FASTQ files are aligned using a reference annotation.
    Reads are aligned uniquely with HISAT2, or multimapped up to 100 times with STAR.
Generated BAMs are sorted, transcripts are annotated using StringTie, and the assemblies are then merged.
TE-transcripts are determined with Bedtools and custom scripts, which intersect the 5’ end of first exons with RepeatMasker annotations.
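
The intersection step can be sketched as follows. This is not the pipeline's custom script: `five_prime_ends` and `intersect_with_tes` are hypothetical helpers, the first-exon input is assumed to be a 6-column BED file, and the 5' end is assumed to be the single strand-aware end base. Whether the pipeline also requires matching strands is not stated, so no strand option is used.

```sh
# Hypothetical helper: reduce BED6 first-exon intervals to their 1-bp 5-prime ends.
five_prime_ends() {
    awk 'BEGIN { FS = OFS = "\t" }
         $6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }   # plus strand: first base
         $6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }   # minus strand: last base
    ' "$1"
}

# Assumed bedtools call: report each 5-prime end together with the TE it overlaps.
# -wa and -wb write the original A and B entries for every overlap.
intersect_with_tes() {
    bedtools intersect -wa -wb -a "$1" -b "$2"
}
```

Usage: `five_prime_ends first_exons.bed > ends.bed`, then `intersect_with_tes ends.bed Reference/hg19-TE.bed`, where `first_exons.bed` is a placeholder name.
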
Read counts are then determined using StringTie and custom scripts. Read counts for the first exons of TE-transcripts are determined using StringTie -eb. Multimapped counts are normalized by the number of locations to which the read mapped.
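
The normalization can be illustrated with a small sketch. It is not the pipeline's custom script: `weighted_counts` is a hypothetical helper, and the two-column input (transcript ID and the number of locations the read mapped to, as given by the SAM `NH` tag) is an assumed intermediate format. Each read contributes 1/NH to its transcript.

```sh
# Hypothetical helper: sum 1/NH per transcript.
# Input: tab-separated lines "<transcript_id> <NH>", one line per aligned read.
weighted_counts() {
    awk 'BEGIN { FS = OFS = "\t" }
         $2 > 0 { count[$1] += 1 / $2 }      # a read mapped to NH locations adds 1/NH
         END { for (id in count) printf "%s\t%.3f\n", id, count[id] }
    ' "$1" | LC_ALL=C sort
}
```

For example, three reads on one transcript with NH values of 1, 2, and 4 give a weighted count of 1 + 0.5 + 0.25 = 1.75.
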
    ## Cryptic Transcripts
    Custom scripts select transcripts with expression (>10 reads) in only one condition and an average difference between conditions of >10 reads. The signal threshold is set at 10 reads; however, it can be adjusted based on read depth by editing scripts/Count.sh.
    The selected transcripts are then written to "cryptic-control.txt" and "cryptic-treat.txt".
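
The selection rule can be sketched as below. This is an interpretation, not the code in scripts/Count.sh: `select_cryptic` is a hypothetical helper, the three-column input of per-condition mean counts is assumed, and "expressed" is taken to mean a mean count above the threshold.

```sh
# Hypothetical helper: flag transcripts expressed in only one condition.
# Input: tab-separated "<transcript_id> <mean_control> <mean_treat>".
# Usage: select_cryptic counts.tsv [threshold]   (threshold defaults to 10 reads)
select_cryptic() {
    awk -v t="${2:-10}" 'BEGIN { FS = OFS = "\t" }
         # expressed in control only, and the difference exceeds the threshold
         $2 > t && $3 <= t && $2 - $3 > t { print $1, "control" }
         # expressed in treatment only, and the difference exceeds the threshold
         $3 > t && $2 <= t && $3 - $2 > t { print $1, "treat" }
    ' "$1"
}
```

With the default threshold, a transcript with means of 15 and 8 is not selected, because the difference (7) does not exceed 10 reads.
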
    ## Differential TE-expression
    Custom scripts filter out transcripts with expression in only one sample.
    Read counts for the first exons of the filtered TE-driven transcripts are then subjected to edgeR analysis to determine differential expression.
    Differentially regulated TEs are then separated by class, family, and subfamily to determine potential subfamily activation. 
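
Grouping by annotation level can be sketched as follows. `count_te_levels` is a hypothetical helper, not the pipeline's script. It assumes TE names in the form shown in the Reference/ BED example (IAPLTR2b/LTR/ERVK), read as subfamily/class/family.

```sh
# Hypothetical helper: count TE names at each annotation level.
# Input: one TE name per line in the form subfamily/class/family.
count_te_levels() {
    awk -F '/' '
        NF == 3 {
            n["subfamily\t" $1]++
            n["class\t" $2]++
            n["family\t" $3]++
        }
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$1" | LC_ALL=C sort
}
```

The output has three tab-separated columns: level, name, and count. The input would be the TE names of the differentially regulated transcripts, one per line.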


### Summary ###
Alignment --> StringTie --> merge StringTie assemblies --> find TE-driven transcripts --> calculate weighted counts for each TE-transcript --> filter counts and pull out cryptic transcripts --> run statistical analysis.


If you have any problems, please email kecostello@coh.org


