==========================================================================
Random Field Aligner (RFA)
==========================================================================
1. overview
2. install + dependencies
3. config + input file formats
4. example
5. modifications for cluster
6. demultiplexing SRA deposited data

--------------------------------------------------------------------------
1. overview
--------------------------------------------------------------------------

This is the source code package that implements RFA, which confidently
aligns short reads produced by read cloud protocols and thereby enables
accurate variant discovery within repeated sequences.

RFA performs the following steps to align a set of read cloud wells:

1. wells are first aligned using bowtie2

2. features are extracted from uniquely mapped read clouds to learn the
prior P(M) over long fragments resulting from the protocol

3. wells are aligned separately with our aligner (using the read cloud
model) 

  a. bowtie2 is used again to create candidate long fragments in each well
  and candidate short read alignments to these long fragments

  b. MAP inference is performed over an MRF to converge on final
  alignments for each short read 

  c. probability queries are computed to assign confidence scores

  d. final *bam and results files are produced
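The MAP inference in step 3b can be pictured with a generic iterated
conditional modes (ICM) sketch over a toy MRF; the function name icm_map
and the toy unary/pairwise log-potentials are illustrative assumptions,
not RFA's actual read cloud model or inference procedure:

```python
def icm_map(unary, pairwise, n_iters=10):
    """Iterated conditional modes: greedy coordinate-ascent MAP over an MRF.

    unary[i][s] is the log-potential of variable i in state s;
    pairwise[(i, j)][s][t] is the log-potential of edge (i, j).
    """
    n = len(unary)
    # start each variable at its best unary state
    states = [max(range(len(u)), key=lambda s: u[s]) for u in unary]
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            def score(s):
                # local score of variable i taking state s, neighbors fixed
                total = unary[i][s]
                for (a, b), pot in pairwise.items():
                    if a == i:
                        total += pot[s][states[b]]
                    elif b == i:
                        total += pot[states[a]][s]
                return total
            best = max(range(len(unary[i])), key=score)
            if best != states[i]:
                states[i], changed = best, True
        if not changed:
            break  # converged to a local MAP assignment
    return states
```

In RFA the variables would be candidate short read alignments to candidate
long fragments, with potentials derived from the learned read cloud model.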

To run on large genomes, RFA requires a queue submission frontend together
with an NFS partition to efficiently align read cloud wells in parallel.
A local running mode is included in order to test on a smaller subset of
wells on a single node.

RFA consists of a master, rfa.py, that orchestrates actions across jobs to
carry out tasks for estimating parameters, aligning wells, and simulating
wells from a read cloud protocol.  rfa.py produces worker scripts+configs
in a build directory; these use rfa-worker.py as an entry point and can be
run standalone.  The main source for each worker task (collect well stats,
simulate wells, align wells with RFA, etc) can be found in mlib/worker.

--------------------------------------------------------------------------
2. install + dependencies
--------------------------------------------------------------------------

RFA is written in python (version >= 2.7 required) and has the following
external non-python dependencies:

* bowtie2 (>= 2.2.4)
* samtools (>= 0.1.18)
* picard (>= 1.92)
  - NOTE: $PICARDPATH must be set in shell env to path of directory holding
    picard *jar files
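Since a missing or mis-set $PICARDPATH is easy to hit, a small check like
the following (a hypothetical helper, not part of RFA) can confirm the
directory actually holds the picard *.jar files before launching a run:

```python
import glob
import os

def find_picard_jars(picard_dir):
    """Return the picard *.jar files under picard_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(picard_dir, "*.jar")))

# e.g. jars = find_picard_jars(os.environ["PICARDPATH"])
```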

The set of python dependencies is maintained in setup.py and can be
installed as a standard python package via setuptools (python setup.py
install).

--------------------------------------------------------------------------
3. config + input file formats
--------------------------------------------------------------------------

Config files are written in *json and must be specified with every rfa.py
operation.  A default config file is provided in confs/default.json

Brief descriptions of the config fields:

* technology: moleculo is the only technology currently used for
development.  In principle, a different read cloud technology can be
substituted provided the input *fastq files match the format specified
below

* referenceFasta_path: absolute path to genome *fasta

* referenceBowtieIndex_path: absolute path to the prefix of a precomputed
bowtie2 FM index built from the reference *fasta

* sampleInfo_map: a map from lane IDs (must be integers) to a lane info
map, which must specify two keys:

  1) fqDir_path: absolute path to the fastq files for this lane.  Short
  reads must be paired end, split into separate files per well, and
  suffixed with '1.fq' and '2.fq'.  As an example, wells 1 and 2 must have
  files 001_1.fq 001_2.fq and 002_1.fq 002_2.fq respectively in this
  directory.

  2) fqFnameFilter_str: a simple regex with a capture group around the
  integer well identifier (well IDs must be encoded in the file name).
  For the example above, the value would be '(...)' to capture the well
  IDs 1 and 2 from the filenames.
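The capture-group behavior of fqFnameFilter_str can be checked with
python's re module; extract_well_id is a hypothetical helper shown only to
illustrate how the regex is applied to a filename:

```python
import re

def extract_well_id(fname, filter_regex):
    # apply the configured filter regex to the fastq filename and
    # parse the first capture group as the integer well ID
    return int(re.match(filter_regex, fname).group(1))

# extract_well_id('001_1.fq', r'(...)')  # -> 1
```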

* simParams_map: a map specifying the simulation parameters, which can
include the following keys:

  1) numWells (required): number of wells to simulate

  2) numBarcodesPerWell (default = 1): number of unique barcodes per well.
  This is always 1 for Moleculo, but may be higher for technologies that
  have multiple barcodes per well

  3) genomeCovPerWell (default = estimated empirically): fraction of the
  target genome covered by each well.  This is around 0.02 for Moleculo;
  if unspecified, it is estimated empirically from the wells provided in
  sampleInfo_map

* resultsDir_path (required): path to dump all results files including
final *bams.  If not running in local mode, this must be on an NFS
partition

NOTE a single cloud model is learned per input configuration.  If running
multiple lanes of sequencing, the protocol properties must be the same
across lanes.  If the properties differ (fragment length distribution,
short read coverage of long fragments, etc), then a separate config must
be created for each.
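Pulling the fields above together, a config might look like the sketch
below; every path and value is a placeholder derived from the field
descriptions, not a verbatim copy of confs/default.json:

```json
{
  "technology": "moleculo",
  "referenceFasta_path": "/abs/path/to/ref.fasta",
  "referenceBowtieIndex_path": "/abs/path/to/ref-bowtie2-index",
  "sampleInfo_map": {
    "1": {
      "fqDir_path": "/abs/path/to/lane1/fastqs",
      "fqFnameFilter_str": "(...)"
    }
  },
  "simParams_map": {
    "numWells": 10,
    "numBarcodesPerWell": 1
  },
  "resultsDir_path": "/abs/path/on/nfs/results"
}
```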

--------------------------------------------------------------------------
4. example
--------------------------------------------------------------------------

NOTE: append --local to all commands to run serially in local mode for
testing.  To run on a cluster, the source must first be modified as
described in section 5.

1. First, estimate parameters empirically from the wells specified in the
config file confs/default.json

% python rfa.py estimate -c confs/default.json -b scratch.rfa

2. Generate simulated wells

% python rfa.py sim -c confs/default.json -b scratch.rfa

3. Align simulated wells

% python rfa.py align --mode sim -c confs/default.json -b scratch.rfa

4. Align real wells

% python rfa.py align --mode sample -c confs/default.json -b scratch.rfa

--------------------------------------------------------------------------
5. modifications for cluster
--------------------------------------------------------------------------

The following files can be modified in a straightforward way to adapt RFA to
run on a cluster submission interface:

* mlib/common/job.sh
* mlib/common/job.py

job.sh is a wrapper to set up the necessary environment variables

NOTE all workers assume the environment variable $TMPDIR is populated with
a valid path to a temporary directory (unique to that job) to cd into and
work out of

job.py is a wrapper around the cluster submission interface, which is SGE
as shipped; the submit() and getQstatStatus() functions can be modified to
issue the appropriate submission and status commands for your scheduler.
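For example, porting submit() to SLURM might start from a command builder
like the sketch below; build_submit_cmd and its parameter layout are
hypothetical, though --job-name and --cpus-per-task are real sbatch flags:

```python
def build_submit_cmd(job_script, job_name, num_cpus=1):
    # Build an sbatch invocation analogous to what an SGE submit()
    # would issue; pass the result to subprocess to actually submit.
    return [
        "sbatch",
        "--job-name", job_name,
        "--cpus-per-task", str(num_cpus),
        job_script,
    ]
```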

--------------------------------------------------------------------------
6. demultiplexing SRA deposited data
--------------------------------------------------------------------------

Each lane of Moleculo sequencing deposited on SRA multiplexes all the wells
together and the prefix of the query name for each read indicates the source
well file for that read.  For example, a read in the deposited lane file
l1_1.fastq with the query name:

@002_1.fastq:002_193_s_1_2215-1098460/1

should be demultiplexed into the file 002_1.fastq along with all the other
reads in l1_1.fastq sharing that prefix.  A simple standalone python
script, scripts/fastqsplit.py, unpacks a specified fastq into its
component wells in a provided target directory.  The unpacked wells can
then be used as input to RFA.
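The demultiplexing logic can be sketched in a few lines of python; this
illustrates the idea behind scripts/fastqsplit.py (the function name
split_fastq is ours), and is not the bundled script itself:

```python
import os

def split_fastq(lane_fq, out_dir):
    """Route each 4-line fastq record to the well file named by the
    query-name prefix (the text before the first ':')."""
    handles = {}
    with open(lane_fq) as f:
        while True:
            record = [f.readline() for _ in range(4)]
            if not record[0]:
                break  # end of lane file
            # '@002_1.fastq:002_...' -> well file name '002_1.fastq'
            well_fname = record[0][1:].split(':', 1)[0]
            out = handles.get(well_fname)
            if out is None:
                out = handles[well_fname] = open(
                    os.path.join(out_dir, well_fname), 'w')
            out.writelines(record)
    for out in handles.values():
        out.close()
```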

