This is the README file for the pipeline implementing the STM (Scaffolding using
Translation Mapping) method.

CONTACT INFORMATION:
===============================================================================
Yann Surget-Groba yann.surget@unige.ch http://www.surget-groba.ch


SYSTEM REQUIREMENTS:
===============================================================================
The STM pipeline is written in Python and should work on any operating system
provided that Python is installed and a C compiler is available. It depends on
two external modules that need to be installed before use: Biopython (>= 1.51)
and Parallel Python (tested with v2.5.7).
The STM pipeline has been developed on a Linux system and tested on Linux and
OS X operating systems running Python 2.5 and Python 2.6.


INSTALLATION:
================================================================================
1) Install Python (it should already be installed on most Unix-like operating systems). You can download it from www.python.org
2) Install Biopython (>= 1.51) and its dependencies following the instructions on www.biopython.org. If using a Debian system (squeeze or sid) or Ubuntu (karmic or lucid), simply type: sudo aptitude install python-biopython
3) Install Parallel Python (pp) following the instructions on www.parallelpython.com (if using a Debian/Ubuntu system, type: sudo aptitude install python-pp).
4) Uncompress the STM archive (tar xvfz stm.tar.gz). This will create an STM directory containing the Python scripts, the CAP source code (cap.c) and a pre-compiled CAP binary for Linux amd64 systems (cap). For other systems you need to compile the provided source code; with the GNU C compiler, type: gcc cap.c -o cap


USAGE:
================================================================================
*Step 1: Run blastx on your assembled contigs (and optionally unassembled reads) against the reference proteome. The output format must be set to XML. Since only the first alignment is used, you can limit the number of alignments reported to 1. A typical command line would be:

blastx -db reference_proteome -query contigs_file -out blast_records.xml -evalue 1e-5 -outfmt 5 -num_alignments 1 -num_descriptions 1
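The XML report produced above is what Step 2 parses. As a rough illustration of the information it contains, here is a standard-library sketch (using xml.etree.ElementTree rather than the Biopython parser the pipeline actually relies on) that pulls the query name, subject protein and hit coordinates from a heavily trimmed record with made-up values:

```python
import xml.etree.ElementTree as ET

# A heavily trimmed blastx XML fragment (-outfmt 5); real reports contain
# many more fields. All values below are made up for illustration.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>ref_protein_42</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_hit-from>10</Hsp_hit-from>
              <Hsp_hit-to>85</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ET.fromstring(SAMPLE)
for it in root.iter("Iteration"):
    query = it.findtext("Iteration_query-def")
    hit = it.find(".//Hit")  # only the first alignment is used
    subject = hit.findtext("Hit_def")
    start = int(hit.findtext(".//Hsp_hit-from"))
    end = int(hit.findtext(".//Hsp_hit-to"))
    print(query, subject, start, end)
```

This is only meant to show the shape of the data; mapBlastHit.py (Step 2) does the real parsing for you.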

*Step 2: get coordinates of contigs/reads on reference proteins:

mapBlastHit.py -i blast_records.xml -o mapfile -s seqfile [-d idCutoff -c coverageCutoff -v]

parameters description:
    -i blast_records.xml: blast report file in xml format
    -o mapfile: output file; will be overwritten if exists
    -s seqfile: fasta file with contig sequences
    optional parameters:
    -d idCutoff (integer): minimum percent identity between the query and the subject sequence [default = 60]
    -c coverageCutoff (integer): minimum percentage of the query sequence covered by the hit [default = 90]
    -v: optional flag to print run information
    
The output of this script consists of two files:
	- mapfile: with the coordinates of mapped contigs/reads
	- mapfile_orphans.fa: with the sequences of unmapped contigs/reads.
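The two cutoffs amount to simple percentage checks on each HSP. A minimal sketch of the filtering logic, assuming fields like those Biopython's BLAST parser exposes (identities, alignment length); this is not the actual mapBlastHit.py code:

```python
def passes_cutoffs(identities, align_length, query_length,
                   id_cutoff=60, coverage_cutoff=90):
    """Check a blastx HSP against the -d and -c cutoffs.

    identities:   number of identical residues in the alignment
    align_length: alignment length in residues
    query_length: query length in residues (nucleotides / 3 for blastx)
    """
    percent_identity = 100.0 * identities / align_length
    percent_coverage = 100.0 * align_length / query_length
    return percent_identity >= id_cutoff and percent_coverage >= coverage_cutoff

# 75% identity over ~95% of the query: passes the default cutoffs
print(passes_cutoffs(identities=150, align_length=200, query_length=210))
```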
	
*Step 3: Scaffolding:

map2contigs.py -i mapfile -s seqfile -o scaffolds.fa -r reference [-j nJobs -n nCpus -h remote_hosts]

parameters description:
    -i mapfile: output file from mapBlastHit.py
    -s seqfile: fasta file with contig sequences
    -o scaffolds: output file; will be overwritten if exists
    optional parameters for parallel processing (see the documentation of the pp module for details on how to set up ppservers):
    -j: number of jobs (default 1)
    -n: number of local CPUs to use (default 1)
    -h: names of remote ppservers (comma-separated list if several; default none)
    
The output of this script consists of three files:
	- scaffolds: fasta file with scaffolds constructed
	- scaffolds_rejected.fa: fasta file with contigs/reads assembled with cap but not scaffolded
	- scaffolds_orphans.fa: fasta file with unassembled and unscaffolded contigs/reads
	
*Step 4 (for STM+ flavor only):
If reads were used (STM+ flavor), they must first be removed from the two files mapfile_orphans.fa and scaffolds_orphans.fa. To do that, use the provided script sortOrphans.py:

sortOrphans.py -i infile [-p prefix]
    -i infile: fasta file containing orphan contigs and reads
    -p prefix: prefix of the read names (default HWI)
    
The output of this script consists of two files:
	- infile_reads.fa containing reads (this can be discarded)
	- infile_contigs.fa containing contigs
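Conceptually the split is a simple partition of FASTA records by name prefix. A minimal sketch of the idea (not the actual sortOrphans.py code):

```python
def split_by_prefix(fasta_lines, prefix="HWI"):
    """Partition FASTA records into reads (name starts with prefix) and contigs."""
    reads, contigs, current = [], [], None
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first word after '>'
            name = line[1:].split()[0]
            current = reads if name.startswith(prefix) else contigs
        if current is not None:
            current.append(line)
    return reads, contigs

fasta = [">HWI-read1", "ACGT", ">contig_7", "GGCC"]
reads, contigs = split_by_prefix(fasta)
print(reads)    # lines belonging to read records
print(contigs)  # lines belonging to contig records
```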
	
*Step 5: Build the final assembly file:
The scaffolds, cap-assembled contigs and non-processed contigs need to be merged into a single file. For example, use the following command:

cat scaffolds.fa scaffolds_rejected.fa mapfile_orphans.fa scaffolds_orphans.fa > finalAssembly.fa

N.B. If using the STM+ flavor, mapfile_orphans.fa and scaffolds_orphans.fa must be replaced by the corresponding files containing only contigs (otherwise unassembled reads will be present in the final assembly file).
