# Building a Custom FusionFilter Dataset

To build a custom FusionFilter dataset, you need a genome in FASTA format (e.g. 'genome.fa') and a gene annotation set in GTF format (e.g. 'transcripts.GTF'). You'll then do the following to generate the additional resources needed:

* Extract cDNA sequences for the annotated transcript isoforms, generating a cDNA FASTA file.
* Lowercase-mask repetitive regions in the cDNA FASTA file.
* Perform a BLASTN all-vs-all search of the cDNA sequences.

>If you're using a GTF file from Gencode, extract just the 'exon' records, and consider restricting the GTF to only the protein-coding and long noncoding RNA genes to reduce false positive rates.

Each of these steps is detailed below.
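
For the Gencode filtering suggested in the note above, a minimal sketch using awk might look like the following. It assumes the Gencode `gene_type` attribute with the values `protein_coding` and `lncRNA` (older Gencode releases use `lincRNA`, so check your file). The function name `filter_gencode_gtf` is hypothetical and not part of FusionFilter.

```shell
# Hypothetical helper: keep only 'exon' records for protein-coding and
# long noncoding RNA genes. Assumes Gencode-style attributes in column 9,
# e.g. gene_type "protein_coding"; adjust the values for your release.
filter_gencode_gtf() {
    awk -F'\t' '
        /^#/ { next }                      # skip header comment lines
        $3 == "exon" &&
        $9 ~ /gene_type "(protein_coding|lncRNA|lincRNA)"/
    ' "$1"
}
```

For example, `filter_gencode_gtf gencode.gtf > transcripts.GTF`, where `gencode.gtf` stands in for the annotation file you downloaded.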

## Extract cDNA sequences

Run the following to extract cDNA sequences from your genome based on your GTF annotation file:

    %  FusionFilter/util/gtf_file_to_feature_seqs.pl  transcripts.GTF  genome.fa cDNA > cDNA_seqs.fa

The format of the 'cDNA_seqs.fa' should look like so:

     >ENST00000470238.1 ENSG00000000457.9 SCYL3 
     TTTCCGGACCCGTCTCTATGGTGTAGGAGAAACCCGGCCCCCAGAAGATT
     GTGGGTGTAGTGGCCACAGCCTTACAGGCAGGCAGGGGTGGTTGGTGTCA
     ACAGGGGGGCCAACAGGGTACCAGAGCCAAGACCCTCGGCCTCCTCCCCC
     ...

The FASTA header should include the transcript_id, gene_id, and gene_symbol, in that order.  This is important, as other scripts in CTAT rely on this cDNA header format.  If a gene_symbol is not available, CTAT uses the gene identifier instead.
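
As a quick sanity check on the header format described above, a sketch like the following lists headers that carry fewer than three whitespace-separated fields. The function name `check_cdna_headers` is hypothetical, and a flagged header is only a hint to inspect the file, not necessarily an error.

```shell
# Hypothetical helper: list FASTA headers that do not carry at least
# three whitespace-separated fields (transcript_id gene_id gene_symbol).
# Exits non-zero if any such header is found.
check_cdna_headers() {
    awk '
        /^>/ && NF < 3 { print "line " NR ": " $0; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running `check_cdna_headers cDNA_seqs.fa` prints nothing and exits with status 0 when every header has all three fields.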

## Mask repetitive regions within cDNAs

It is sometimes the case that regions within transcripts, such as the 5' and 3' UTRs, contain segments of repeats such as mobile elements.  In FusionFilter, we want to filter out fusion candidates between gene pairs having sequence similarity such as paralogs.  Mobile element content in UTRs can easily confound this process and lead to over-filtering.

[RepeatMasker](http://www.repeatmasker.org/) is an effective toolkit for identifying and masking repeat regions in target nucleotide sequences.  Companion repeat databases are available for many model organisms.  Using a repeat database, you can mask repetitive sequences in the cDNA FASTA file like so (e.g. for a human target):

    RepeatMasker-open-4-0-3/RepeatMasker/RepeatMasker -pa 6 -s -species human \
                   -xsmall cDNA_seqs.fa

Explore the RepeatMasker documentation for installation instructions and information related to repeat databases available for your target organism.  If no repeat database is available for your target organism, you can consider creating a de novo repeat database using a tool such as [RepeatScout](http://bix.ucsd.edu/repeatscout/) with your genome.

The repeats masked by RepeatMasker, when run as above, are made lowercase (the -xsmall option), and the masked sequences are written to 'cDNA_seqs.fa.masked'.  When we perform the all-vs-all cDNA blastn search below, we'll include options to leverage this lowercase masking, preventing alignments from seeding within these repeat regions.
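
To confirm that the masking step actually lowercased some sequence, a small sketch such as the one below reports the percentage of lowercase bases in a FASTA file. The function name `masked_fraction` is hypothetical; it simply counts characters and is not part of RepeatMasker or FusionFilter.

```shell
# Hypothetical helper: print the percentage of lowercase (soft-masked)
# bases among all sequence characters in a FASTA file.
masked_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                       # skip header lines
        {
            total += length($0)
            masked += gsub(/[a-z]/, "")     # gsub returns the number of replacements
        }
        END {
            if (total > 0) printf "%.1f\n", 100 * masked / total
            else print "0.0"
        }
    ' "$1"
}
```

A value of 0.0 from `masked_fraction cDNA_seqs.fa.masked` would suggest that no repeats were masked, which is worth investigating before running the BLASTN search.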

## All-vs-all BLASTN search

Using the repeat-masked 'cDNA_seqs.fa.masked' file, perform an all-vs-all BLASTN search like so, using [BLAST+](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download):

     # make the cDNA_seqs.fa.masked file blastable
     makeblastdb -in cDNA_seqs.fa.masked -dbtype nucl

     # perform the blastn search
     blastn -query cDNA_seqs.fa.masked -db cDNA_seqs.fa.masked \
            -max_target_seqs 10000 -outfmt 6 \
            -evalue 1e-3 -lcase_masking \
            -num_threads ${CPU} \
            -word_size 11  >  blast_pairs.outfmt6

>Set the ${CPU} variable to the number of threads to use. If you want to run the BLAST search faster in parallel on a computing grid, see options such as [HpcGridRunner](http://hpcgridrunner.github.io).
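
One way to set the thread count before running the blastn command is sketched below. It keeps any value of `CPU` you have already set and otherwise falls back to the number of online processors reported by `getconf`; treating that number as a sensible default is an assumption, so lower it on shared machines.

```shell
# Pick a thread count for blastn: use the CPU variable if already set,
# otherwise fall back to the number of online processors (or 1).
CPU=${CPU:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
export CPU
```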

Then replace the transcript identifiers in the blast output with gene symbols, and gzip the result, by running:

     FusionFilter/util/blast_outfmt6_replace_trans_id_w_gene_symbol.pl \
             cDNA_seqs.fa blast_pairs.outfmt6  | gzip > blast_pairs.gene_syms.outfmt6.gz
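
To get a feel for the result, the sketch below counts the distinct ordered pairs of different gene symbols in the gzipped output. It assumes that, after the replacement step, columns 1 and 2 of the tabular output hold the gene symbols of the query and the subject; the function name `count_gene_pairs` is hypothetical.

```shell
# Hypothetical helper: count distinct ordered (geneA, geneB) pairs with
# geneA != geneB in a gzipped, tab-delimited BLAST output file.
# Assumes columns 1 and 2 hold gene symbols.
count_gene_pairs() {
    gzip -dc "$1" |
        awk -F'\t' '$1 != $2 { print $1 "\t" $2 }' |
        sort -u |
        wc -l |
        tr -d ' '
}
```

For example, `count_gene_pairs blast_pairs.gene_syms.outfmt6.gz` prints a single number; note that A/B and B/A are counted as two separate pairs.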


## Optional: Include fusion annotations

If you would like to include annotations for known fusions, create a tab-delimited file with lines in the following format:

     geneA--geneB(tab)some annotation text that describes this fusion
     ...

and you can also include individual gene annotations like so:

     geneA(tab)any annotation I want to include for this gene symbol
     ...


For example:
```
    ATIC--ALK       Cosmic{samples=99,mutations=12,papers=20},chimerdb_pubmed{Anaplastic large cell lymphoma (ALCL),Inflammatory myofibroblastic tumour}
    ATL2--HNRPLL    YOSHIHARA_TCGA_num_samples[BRCA:1|LUAD:1],{Klijn_CCL:Lung=1}
    ATM     ATM_serine/threonine_kinase,ArcherDX_panel,FoundationOne_panel
    ATL1    atlastin_GTPase_1
    ATL2    atlastin_GTPase_2
    ATL3    atlastin_GTPase_3
```
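
Because the key and the annotation text must be separated by a tab, a file that was edited with spaces can be hard to spot by eye. The sketch below flags lines that are not exactly two non-empty tab-separated fields. The function name `check_fusion_annot` is hypothetical, and the rule that the annotation text itself contains no tabs and that blank lines are not allowed is an assumption.

```shell
# Hypothetical helper: report lines that are not exactly two tab-separated,
# non-empty fields (key, annotation text). Exits non-zero if any are found.
check_fusion_annot() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" { print "line " NR ": " $0; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running `check_fusion_annot fusion_annotations.txt` prints nothing and exits with status 0 when every line has the expected two fields.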



## Prep the Custom FusionFilter Dataset

Using your initial genome.fa and transcripts.GTF files, along with the blast-pair data derived from your cDNA_seqs.fa, run the following to build the required indexes:

     FusionFilter/prep_genome_lib.pl \
               --genome_fa genome.fa \
               --gtf transcripts.GTF \
               --blast_pairs blast_pairs.gene_syms.outfmt6.gz

     # optionally, also append:
     #         --fusion_annot_lib /path/to/file/containing/fusion_annotations.txt


## Optional: Include Pfam domains for proteins

The above build process created a file 'ref_annot.pep' in the ctat_genome_lib_build_dir/ area. Copy this ref_annot.pep file to your workspace and use it in a search of Pfam like so:

    % hmmscan --cpu 4 --domtblout PFAM.domtblout.dat Pfam-A.hmm ref_annot.pep 

>Note: if you don't already have Pfam-A.hmm or hmmscan installed, download [Pfam](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz) and decompress it, install [hmmer3](http://hmmer.org/download.html), and run 'hmmpress Pfam-A.hmm' before searching it as shown above.

Then gzip the Pfam results:

    %  gzip PFAM.domtblout.dat

Finally, index this file for use in the CTAT genome lib like so:

    %  FusionFilter/util/index_pfam_domain_info.pl  \
        --pfam_domains PFAM.domtblout.dat.gz \
        --genome_lib_dir ctat_genome_lib_build_dir

You're now ready to run any of the CTAT fusion-finding utilities, setting the --genome_lib_dir parameter of the CTAT fusion tools to the 'ctat_genome_lib_build_dir' defined and populated above.


If you're curious about data formatting or required contents, please see the example data sets we provide, including source data files and fully built CTAT genome libs, at <https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/>