This is the README for RecoverY scripts

INSTALL : 

To install RecoverY, clone the GorillaY_project directory with the following commands

git clone https://github.com/makovalab-psu/GorillaY_project.git
cd GorillaY_project/RecoverY/


DEPENDENCIES :

RecoverY has been tested with the following dependencies :

dsk version 2.0.2 (included in the dependency/ folder of RecoverY)

bash version 4.2.37 
python version 2.7.3 
Rscript version 2.15.1

Earlier versions of these might break RecoverY (not tested)


EXAMPLE :

To test if RecoverY works on your system, please run the following commands from the GorillaY_project/RecoverY/ folder:

./rundsk.sh flowsortedY_data/r1.fastq

./recoverY_main.sh flowsortedY_data/r1.fastq flowsortedY_data/r2.fastq tables/pre_Threshold_reference_table 1


The first script generates the directory "dsk_output" and populates it with a .h5 and .histo file. Additionally, it creates the directory "tables" which contains the file "pre_Threshold_reference_table". Finally, it generates the k-mer abundance plot "myplot.jpg". 

The second script generates the directory "post_RecoverY" and creates the R1 and R2 files which contain Y-specific read pairs. Additionally, it creates the "post_Threshold_reference_table" file in the "tables" directory




RUNTIME INSTRUCTIONS : 

1. First create the k-mer table with :

./rundsk.sh READS_R1.FASTQ

This script also produces the k-mer abundance plot "myplot.jpg"

Next, pick a THRESHOLD_VALUE based on the abundance peak of the k-mer plot


2. Finally, run the main script :

FIRST, set k-mer size and strictness in py_scripts/classify_as_Y_chr.py 
Strictness is a fraction of the read length. It is in the range (0,1)


./recoverY_main.sh READS_R1.FASTQ READS_R2.FASTQ PRE_THRESHOLD_REF_TBL THRESHOLD_VALUE



THINGS TO NOTE :

- on re-running, these scripts over-write output from previous runs. Please backup your data from previous rundsk and recovery_main runs if needed

- the setting for k-mer size needs to be consistent across dsk and recoverY. Current default is 25. Any modifications made in the rundsk.sh file to kmer size must be replicated in py_scripts/classify_as_Y_chr.py

- the setting for strictness in py_scripts/classify_as_Y_chr.py can be changed. The optimal strictness found by our testing is between a third to two-thirds of the readlength. It is currently set to 0.5 and we encourage you to modify this according to your needs.





TROUBLESHOOTING : 

error : Insufficient arguments

Typing the name of the bash script provides information on number and order of arguments 



error : ./rundsk.sh: line N: dsk: command not found
	./rundsk.sh: line N+2: h5dump: command not found
	./rundsk.sh: line N+4: dsk2ascii: command not found

Please install dsk 2.0.2 or later : http://minia.genouest.org/dsk/







Please e-mail szr165@psu.edu with questions or error reports












