This README describes how to install and use CNVer.

To install, run

1. ./configure
2. make
3. make install

Quick Start for running with the human hg18 reference:

1. You must first download and install the cs2 flow solver package version 4.3 from http://www.igsystems.com/cs2/cs2-4.3.tar .
2. Download the human companion package from the CNVer website if you haven't yet done so.
3. Map the reads from the donor.
4. Run run_pipeline.sh.  You will be prompted to set all the appropriate parameters.


Description of inputs:

Read mappings:
CNVer assumes that the reads have already been mapped to the reference genome.
We used bowtie version 0.10.1 (Langmead et al. 2009) with parameters "-v 2 -a -m 600 --best --strata".
This allows up to two mismatches, but only includes mappings in the best stratum,
that is, mappings with the minimum number of mismatches (e.g. if there is at least one exact mapping, then only exact mappings are included).
Also, any read that maps to more than 600 locations is omitted from the results.
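
The effect of these parameters on the mappings of a single read can be sketched in Python as follows. This is only an illustration of the filtering described above, not bowtie's implementation. The function name and the tuple layout are hypothetical, and the sketch assumes that the 600-location limit is applied to the mappings that remain after the best-stratum filter.

```python
def reportable_mappings(mappings, max_mismatches=2, max_locations=600):
    """Filter the candidate mappings of one read.

    mappings: list of (chr, pos, mismatches) tuples (hypothetical layout).
    Returns the mappings that would be kept, or [] if the read is dropped.
    """
    # "-v 2": discard mappings with more than two mismatches.
    allowed = [m for m in mappings if m[2] <= max_mismatches]
    if not allowed:
        return []
    # "--best --strata": keep only the mappings with the fewest mismatches.
    fewest = min(m[2] for m in allowed)
    best_stratum = [m for m in allowed if m[2] == fewest]
    # "-m 600": drop the read entirely if it maps to too many locations
    # (assumption: the limit is checked against the best stratum).
    if len(best_stratum) > max_locations:
        return []
    return best_stratum
```

For example, a read with one exact mapping and several one-mismatch mappings keeps only the exact mapping.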

For storing the read alignments, we use the bowtie concise format. 
Each line has the format 'read_id{-|+}:<chr,pos,number_of_mismatches>'.
The chr value has to be 0 for chr1, 1 for chr2, ..., 21 for chr22, 23 for chrX, and 24 for chrY.
For example, "3+:<24,11910070,2>" represents an alignment of read 3 to position 11910070 of the positive strand of chrY with two mismatches. 
It is assumed that matepairs have ids of the form 2x and 2x+1 for the x'th matepair (i.e. reads 0 and 1 form a matepair, reads 2 and 3 form a matepair, etc.).
The file has to be sorted according to increasing read_id.
The number_of_mismatches value is not actually used by our algorithm.
The mappings can be stored in multiple files; however, for any given matepair, all of its mappings must be in the same file.
The list of mapping files must be stored in a separate text file.
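
As a rough aid for checking a mapping file before running CNVer, the format rules above can be sketched in Python. All function and class names here are hypothetical and are not part of CNVer. The chr numbering is taken exactly as listed above; since the value 22 is not listed, the sketch rejects it.

```python
import re
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    read_id: int
    strand: str        # "+" or "-"
    chr_index: int     # numeric chr value as stored in the file
    pos: int
    mismatches: int    # parsed, although CNVer does not use it


_LINE_RE = re.compile(r"^(\d+)([+-]):<(\d+),(\d+),(\d+)>$")


def parse_alignment(line: str) -> Alignment:
    """Parse one line such as '3+:<24,11910070,2>'."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a concise-format alignment: {line!r}")
    read_id, strand, chr_index, pos, mismatches = match.groups()
    return Alignment(int(read_id), strand, int(chr_index), int(pos), int(mismatches))


def chr_name(chr_index: int) -> str:
    """Translate the numeric chr value using the numbering listed above."""
    if 0 <= chr_index <= 21:
        return f"chr{chr_index + 1}"
    if chr_index == 23:
        return "chrX"
    if chr_index == 24:
        return "chrY"
    raise ValueError(f"chr value {chr_index} is not listed in the README")


def mate_id(read_id: int) -> int:
    """Reads 2x and 2x+1 are mates, so the mate id differs in the lowest bit."""
    return read_id ^ 1


def is_sorted_by_read_id(lines: Iterable[str]) -> bool:
    """Check that read ids never decrease from one line to the next."""
    previous = -1
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        current = parse_alignment(line).read_id
        if current < previous:
            return False
        previous = current
    return True
```

For instance, parse_alignment("3+:<24,11910070,2>") yields read 3 on the positive strand, chr_name(24) gives "chrY", and mate_id(3) gives 2.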

Reference genome:
There should be one fasta file per chromosome.
The human hg18 reference is included in the human companion package.

Axt self-alignments:
These contain all significant local alignments of the reference genome to itself.
Alignments for the human hg18 reference are available in the human companion package.
For a description of the file format, see http://genome.ucsc.edu/goldenPath/help/axt.html

Contig break file:
This file annotates regions of the reference with unreliable sequence; CNVer ignores these regions.
It is included in the human companion package.

Repeat files:
These are annotations of regions in the reference that should be ignored by CNVer because they have a high copy count.
They are included in the human companion package.

Analyzing the solution:
The CNV regions are output into a .cnv file. 
To find out the DOC (depth of coverage) ratio over a region, use the doc_analyzer program.
To find the absolute copy counts in a region, use the segment_walker.sh program.
The segment walker program outputs, for the given regions, the sequence edges of the graph that correspond to the walk of that region.
For each edge, it outputs the number of times it appears in the reference, the number of times it appears in the donor, its total length,
the length of just the unmasked part, the DOC ratio along that edge, and the right point of the edge minus the start point of the region (the EndOffset).
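
The two derived quantities mentioned above can be written out as a short Python sketch. The function and parameter names are hypothetical; the actual column layout of the segment walker output is not described here, so the sketch works on plain numbers only.

```python
def end_offset(edge_right: int, region_start: int) -> int:
    """EndOffset: right point of the edge minus the start point of the region."""
    return edge_right - region_start


def copy_count_change(ref_count: int, donor_count: int) -> int:
    """Difference between the donor and reference copy counts of an edge.

    Positive values suggest a gain in the donor, negative values a loss.
    """
    return donor_count - ref_count
```

An edge whose right point is 1500 in a region starting at 1000 has an EndOffset of 500.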

To parallelize the creation of linking clusters: 
In run_pipeline.sh, there is a point where the script invokes three other scripts, sequentially.  
These are conc_script, sort_script, and link_script.  
You can choose to run run_pipeline.sh up to that point, run these three scripts manually, and then resume run_pipeline.sh from
the point right afterwards.
Each of the three scripts contains a list of jobs, and the jobs within one script can be run in parallel.
However, the jobs in one script must finish before the jobs in the next one can be executed.
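
One possible way to run the three scripts with this ordering constraint is sketched below in Python. It is not part of CNVer. It assumes that each script lists one shell command (job) per line, and the helper names and the worker count are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_jobs(script_path):
    """Return the jobs of a script, assuming one shell command per line."""
    with open(script_path) as handle:
        return [line.strip() for line in handle
                if line.strip() and not line.lstrip().startswith("#")]


def run_stage(script_path, workers=4):
    """Run all jobs of one script in parallel and wait for all of them."""
    jobs = read_jobs(script_path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda job: subprocess.run(job, shell=True), jobs))
    failed = [job for job, result in zip(jobs, results) if result.returncode != 0]
    if failed:
        raise RuntimeError(f"{len(failed)} job(s) failed in {script_path}")
    return len(jobs)


def run_stages(script_paths, workers=4):
    """Run the scripts one after another; a stage starts only after the
    previous stage has finished completely."""
    for path in script_paths:
        run_stage(path, workers)
```

Under these assumptions, calling run_stages(["conc_script", "sort_script", "link_script"]) from the directory that holds the scripts would run the three stages in the required order.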

Debugging and indexing of clusters:
If you want to investigate the contents of the clusters, you must run the scripts indexmaps_script and indexclus_script.
The jobs in the scripts can be run in parallel in any order.
Following that, set the parameters in cl_info.sh to their appropriate values (for now you have to do this manually).
Then, you may use cl_info.sh to look up information about a specific cluster using the format
cl_info.sh cluster <chr> <link_id> <link_type(0-3)>
The link_id, chr, and type uniquely identify each cluster and are available from each line of the links.chr* files.
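
If you look up many clusters, a small helper that assembles the command line may be convenient. The Python sketch below is hypothetical and not part of CNVer; it only builds the argument list in the format shown above and checks that the link type lies in the range 0-3.

```python
def cl_info_command(chrom, link_id, link_type, script="cl_info.sh"):
    """Build the argument list for a cluster lookup with cl_info.sh."""
    if link_type not in (0, 1, 2, 3):
        raise ValueError("link_type must be 0, 1, 2 or 3")
    return [script, "cluster", str(chrom), str(link_id), str(link_type)]
```

The resulting list can be passed to subprocess.run from the Python standard library.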







For any questions, please contact cnver@cs.toronto.edu.


