# TranspoFinder v0.6: Find transpochimeras

## Intro
This Python program finds transcripts that span two different sets of regions
across a set of samples. Using a two-step approach, it discovers the chimeric
transcripts and counts the occurrences of each one in each sample.

#### Chimeric transcripts discovery 
The first step of the program goes through the given BAM files to
**discover** the chimeric transcripts. It starts by running *Stringtie* on all the
input BAMs. The resulting GTF is then overlapped with both of the provided BED
files (using *bedtools*). The output GTF contains the chimeric transcripts and is saved for
further analysis. At this point, you have 2 GTFs for each sample: one
containing all the transcripts and the other containing only the chimeric
transcripts, i.e. those that overlap your 2 input BEDs.
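
As a rough illustration of the overlap logic described above, here is a minimal Python sketch. It is not transpo's actual code (the program delegates this step to *bedtools*). It assumes that a transcript counts as chimeric when it overlaps at least one region from each of the two BED files, and that all coordinates are already expressed as BED-style half-open intervals. The names `overlaps`, `is_chimeric` and `filter_chimeric` are hypothetical.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end), half-open as in BED. Assumption: GTF coordinates
# were already converted to this convention.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if both intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def is_chimeric(transcript: Interval,
                regions_a: Iterable[Interval],
                regions_b: Iterable[Interval]) -> bool:
    """Assumed rule: the transcript must hit a region from BOTH BED files."""
    hits_a = any(overlaps(transcript, region) for region in regions_a)
    hits_b = any(overlaps(transcript, region) for region in regions_b)
    return hits_a and hits_b


def filter_chimeric(transcripts: Iterable[Interval],
                    regions_a: Iterable[Interval],
                    regions_b: Iterable[Interval]) -> List[Interval]:
    """Keep only the transcripts that overlap both region sets."""
    regions_a = list(regions_a)  # materialise so the regions can be scanned repeatedly
    regions_b = list(regions_b)
    return [t for t in transcripts if is_chimeric(t, regions_a, regions_b)]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    first_bed = [("chr1", 100, 200)]
    second_bed = [("chr1", 500, 900)]
    transcripts = [("chr1", 150, 600), ("chr1", 550, 800), ("chr2", 150, 600)]
    print(filter_chimeric(transcripts, first_bed, second_bed))
```

The linear scan over regions is fine for a toy example; on real data an indexed tool such as *bedtools* is the sensible choice, which is presumably why the program relies on it.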

> This step can be run on SLURM; see the `--scitas` option.

#### Chimeric transcripts analysis 
In a second step, these chimeric transcripts are **analysed**. For each
transcript, its occurrences in the samples are counted. Two transcripts are
considered the same if they have the same number of exons and if each exon starts
at the same location, within a 10 bp window.
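
The matching rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not transpo's implementation: "within a 10 bp window" is read here as an absolute difference of at most 10 bp between paired exon starts, and chromosome and strand checks are left out for brevity. The names `same_transcript` and `count_occurrences` are hypothetical.

```python
from typing import Dict, List, Sequence

WINDOW_BP = 10  # tolerance from the text; the inclusive boundary is an assumption


def same_transcript(starts_a: Sequence[int],
                    starts_b: Sequence[int],
                    window: int = WINDOW_BP) -> bool:
    """Compare two transcripts given as lists of exon start positions."""
    if len(starts_a) != len(starts_b):
        return False  # different number of exons
    # Pair exons in genomic order and compare their start positions.
    pairs = zip(sorted(starts_a), sorted(starts_b))
    return all(abs(a - b) <= window for a, b in pairs)


def count_occurrences(reference: Sequence[int],
                      samples: Dict[str, List[Sequence[int]]]) -> Dict[str, int]:
    """Count, per sample, how many transcripts match the reference transcript."""
    return {
        name: sum(same_transcript(reference, t) for t in transcripts)
        for name, transcripts in samples.items()
    }


if __name__ == "__main__":
    # Made-up exon starts, for illustration only.
    reference = [100, 500, 900]
    samples = {
        "KO1": [[105, 495, 910], [100, 500]],
        "KO2": [[100, 520, 900]],
    }
    print(count_occurrences(reference, samples))
```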

## Installation
`python setup.py` should do the trick. You need `cython` preinstalled. Also install
stringtie and bedtools, and make sure gawk is installed too.

- - -

### MAC OSX
*Before anything else*, you must install **Xcode** and the developer tools to be able to
run anything bioinformatics-related on your Mac. Go to the App Store and install
Xcode.

Then install Stringtie, bedtools and gawk (because the macOS awk
implementation fails). To do so, either compile them from source or
install them via brew.

- - -

### Linux
Install stringtie, bedtools and python. Then just run 

```
pip3 install --user -U cython  # to install cython
python setup.py  # or use pip to install the local directory
```

- - -

## Usage
To see the help for transpo, run `transpo --help` in your terminal and read
through all the options.

To run transpo, you need to gather:

* a **metadata** file
* some **BAMs** of interest (mapped with HISAT2)
* two **BED** files to define the chimeric transcripts

### Metadata 
See the metadata.xls file for an example.
The metadata file is now required to run transpo. It is simply a **tab**-separated
file structured like this:

> Warning: make sure to add the header at the top of the file.
>
> Warning: make sure the sample names are unique.

| bams | sample | groups | 
|------|--------|-------|
| /path/to/bam1 | sampleName1 | C |
| /path/to/bam2 | sampleName2 | KO|

The first column contains the path to each BAM of interest. The sample column is
a name you choose to give to the sample, typically KO1, KO2, etc. The group
column is where you specify which group the sample belongs to. You must have a
control group. For each group other than the control, transpo runs comparisons
and computes p-values. If you want to use another label for the control group,
use the `--control` option to specify the name you used.
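
To catch the most common metadata mistakes before launching a run, a small check along these lines may help. This is a sketch, not part of transpo: the column names are taken from the example table above, the default control label `"C"` is an assumption based on that same example, and `check_metadata` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_COLUMNS = ["bams", "sample", "groups"]  # header names from the example table


def check_metadata(handle: TextIO, control: str = "C") -> List[Dict[str, str]]:
    """Validate a tab-separated metadata file and return its rows.

    The default control label "C" is an assumption, not a documented default.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing header column(s): " + ", ".join(missing))

    rows = list(reader)
    names = [row["sample"] for row in rows]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError("duplicated sample name(s): " + ", ".join(duplicates))

    if control not in {row["groups"] for row in rows}:
        raise ValueError("no sample belongs to the control group '%s'" % control)
    return rows


if __name__ == "__main__":
    # In-memory stand-in for a real metadata file.
    example = (
        "bams\tsample\tgroups\n"
        "/path/to/bam1\tsampleName1\tC\n"
        "/path/to/bam2\tsampleName2\tKO\n"
    )
    for row in check_metadata(io.StringIO(example)):
        print(row["sample"], row["groups"])
```

For a file on disk, `check_metadata(open("metadata.xls", newline=""))` would run the same checks.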


## Results
The results are structured like so: 

```
transpo_res
├── all_chim_trans_cat.gtf
├── all_chim_exons.bed
├── all_chim_tss.bed
├── chimeric_genes_table.xls
├── transcripts_table.xls
└── samples
    ├── BAM1_chimeric.gtf
    ├── BAM1.gtf
    ├── BAM1.log
    ├── BAM2_chimeric.gtf
    ├── BAM2.gtf
    └── BAM2.log
```


In the general results folder, the important files are:

* **chimeric_genes_table.xls**: your results. Table with the occurrences of each chimeric gene per
  sample, plus statistics. Can be opened in Excel.
* all_chim_trans_cat.gtf: concatenation of all the transcripts of all samples. Too big to load in a
    browser.
* transcripts_table.xls: all the transcripts of the samples with their occurrences in
    each sample. Too big to open with Excel.
* all_chim_tss.bed: BED file with ALL the chimeric transcripts' TSSs.
* all_chim_exons.bed: BED file with ALL the chimeric transcripts' exons (except
    the first exon of each transcript), useful for finding TEs that are
    exonized.

Each sample gets its own files in the samples directory with: 

* Stringtie GTF with all detected transcripts
* GTF containing only the chimeric transcripts
* the logs

> In principle, you should never need these files: transpo computes summary statistics
> for you in the first set of files described above.
