This README file gives instructions to reproduce our analysis. There are two directories.

The directory TRUSEQ contains scripts and related files that will help reproduce our analysis using TruSeq data. 
The directory TRIOPHASING contains scripts and related files that will help reproduce our analysis using trio data.
Please also see the Methods section of our paper for FTP links to the various 1000 Genomes Project VCF files used in our analysis. Note that our scripts depend on vcf-tools, or at least on tabix, so one of these must be installed for the scripts to work properly.

We first describe the TruSeq analysis.
The first step is to take the AGE-aligned fragments for each
deletion event and select the most balanced and perfectly aligned fragment for each event.
The AGE output is in the directory TRUSEQ/MOLAGE_FILES. Work within the TRUSEQ directory so that the scripts run
correctly. Comments in the respective scripts describe the arguments used. The names of the output files can
be chosen by the user; the names given here are only illustrative.

./snpProcAGEscrpt ageAlgndFleLst.txt prfSeqs.txt

./snpPrntVCFscrpt prfSeqs.txt NA12878_GenoTyp.txt output.TruSeqSNPs.vcf

Next, create the regions file needed by the vcf-isec commands. These regions flank the given deletion events
and are determined from the start and end coordinates of the particular TruSeq fragment. As mentioned in the
main text and Methods section of our paper, we only consider heterozygous deletion events, so only regions flanking
these are output.

./snpPrntRegnScrpt prfSeqs.txt NA12878_GenoTyp.txt output.Regions.txt

The output text file lists the regions in pairs, left region first and then right region, for each deletion event.
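
As a rough illustration of how such a pair of regions could be derived, the following Python sketch computes the
left and right flanks from one fragment and one deletion. The function name, the 1-based inclusive coordinates and
the chr:start-end output format are assumptions made for this example; snpPrntRegnScrpt itself may differ.

```python
def flanking_regions(chrom, frag_start, frag_end, del_start, del_end):
    """Return (left, right) region strings flanking a deletion inside a fragment.

    Assumes 1-based inclusive coordinates and a fragment that extends
    beyond both ends of the deletion.
    """
    if not (frag_start < del_start <= del_end < frag_end):
        raise ValueError("fragment must extend beyond both ends of the deletion")
    # Left flank: from the fragment start up to the base before the deletion.
    left = f"{chrom}:{frag_start}-{del_start - 1}"
    # Right flank: from the base after the deletion up to the fragment end.
    right = f"{chrom}:{del_end + 1}-{frag_end}"
    return left, right


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    for region in flanking_regions("1", 1000, 9000, 4000, 5000):
        print(region)
```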

It is up to the user to choose the tools for obtaining the intersection and complement of the TruSeq SNPs with the
GATK call set VCF files (see Methods). We used vcf-tools; the required steps are below.

$HOME/path_to_vcf-tools/vcf-sort -c output.TruSeqSNPs.vcf > output.TruSeqSNPs.sorted.vcf
$HOME/path_to_vcf-tools/bgzip -c output.TruSeqSNPs.sorted.vcf > output.TruSeqSNPs.sorted.vcf.gz
$HOME/path_to_vcf-tools/tabix -p vcf output.TruSeqSNPs.sorted.vcf.gz
$HOME/path_to_vcf-tools/vcf-isec -f -n =2 GATKCALLSET.vcf.gz output.TruSeqSNPs.sorted.vcf.gz -r output.Regions.txt > intersection.GATK.TruSeqSNPs.vcf
$HOME/path_to_vcf-tools/vcf-isec -f -c GATKCALLSET.vcf.gz output.TruSeqSNPs.sorted.vcf.gz -r output.Regions.txt > complement.GATK.TruSeqSNPs.vcf
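
For readers using other tools, the two set operations above can be sketched in Python as follows. This is only an
illustration of the idea, assuming that sites are matched on chromosome and position; it ignores the region
restriction (-r) and is not a replacement for vcf-isec. The function names are hypothetical.

```python
def site_keys(vcf_lines):
    """Collect (chromosome, position) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        keys.add((chrom, int(pos)))
    return keys


def intersect_and_complement(first_lines, second_lines):
    """Return (sites present in both inputs, sites present only in the first input)."""
    first = site_keys(first_lines)
    second = site_keys(second_lines)
    return first & second, first - second


if __name__ == "__main__":
    # Tiny made-up records, for illustration only.
    gatk = ["#CHROM\tPOS", "1\t100\t.\tA\tG", "1\t200\t.\tC\tT"]
    truseq = ["1\t200\t.\tC\tT", "1\t300\t.\tG\tA"]
    print(intersect_and_complement(gatk, truseq))
```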

To extract the heterozygous SNPs from these files, use the script snpVCF2VCFscrpt. For example,
./snpVCF2VCFscrpt intersection.GATK.TruSeqSNPs.vcf output.Regions.txt 10000 output.intersection.GATK.TruSeq_HET_SNPs.vcf

In this example, the window size is 10 kbp. A different value, for example 6 kbp, can be chosen. We found that 10 kbp
was enough to gather all the relevant SNPs from the TruSeq fragments.
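
The two checks involved, heterozygosity and distance from the deletion, can be sketched as below. This is a minimal
Python illustration and not the code of snpVCF2VCFscrpt: the function names are hypothetical, and measuring the
window from a single breakpoint coordinate is an assumption.

```python
def is_heterozygous(sample_field):
    """True if the GT subfield (the first colon-separated entry) holds two different called alleles."""
    gt = sample_field.split(":")[0]
    # Treat phased (|) and unphased (/) separators alike.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return False
    return alleles[0] != alleles[1]


def within_window(pos, breakpoint, window):
    """True if pos lies no more than `window` bases from the breakpoint (assumed rule)."""
    return abs(pos - breakpoint) <= window


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(is_heterozygous("0/1:35"), within_window(15000, 10000, 10000))
```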

The files output.intersection.GATK.TruSeq_HET_SNPs.vcf and output.complement.GATK.TruSeq_HET_SNPs.vcf will contain the
in-phase and out-of-phase SNPs. The second file is obtained by running the same script on complement.GATK.TruSeqSNPs.vcf.

To plot heterozygous SNP densities or histograms, the SNP counts must first be saved as numpy arrays using the script
snpCntVCFscrpt. Its logic is similar to that of snpVCF2VCFscrpt above, and it counts only heterozygous SNPs.
Move and rename the saved arrays after each run; otherwise they will be overwritten by the next run of the script.
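
One possible way to automate the move-and-rename step is sketched below in Python. The helper name, the label and
the placeholder file name counts.npy are hypothetical; substitute the names of the arrays that snpCntVCFscrpt
actually writes.

```python
import shutil
import tempfile
from pathlib import Path


def keep_arrays(array_paths, label, dest_dir):
    """Move each saved array into dest_dir, prefixing its file name with label."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in map(Path, array_paths):
        target = dest / f"{label}.{path.name}"
        # Never silently replace the result of an earlier run.
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Demonstration in a throwaway directory with a placeholder file.
    with tempfile.TemporaryDirectory() as tmp:
        placeholder = Path(tmp) / "counts.npy"  # hypothetical name
        placeholder.write_bytes(b"")
        print(keep_arrays([placeholder], "intersection.10kbp", Path(tmp) / "saved"))
```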

./snpCntVCFscrpt intersection.GATK.TruSeqSNPs.vcf output.Regions.txt 10000

Note that two arrays are saved. The second array keeps track of TruSeq fragment lengths and is used for normalization. 
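
To illustrate why fragment lengths matter for normalization: bins far from the deletion are reached by fewer
fragments, so raw counts there are biased low. The Python sketch below shows one plausible normalization, dividing
each bin's SNP count by the number of flanks that cover the bin. This scheme, the function names and the use of plain
lists instead of numpy arrays are all assumptions for illustration; see snpPlotScrpt for what is actually done.

```python
def bins_covered(flank_lengths, bin_size, n_bins):
    """For each bin, count how many flanks are long enough to cover the whole bin."""
    coverage = [0] * n_bins
    for length in flank_lengths:
        full_bins = min(length // bin_size, n_bins)
        for i in range(full_bins):
            coverage[i] += 1
    return coverage


def normalized_density(counts, coverage):
    """SNPs per covering flank for each bin; bins with no coverage yield None."""
    return [c / n if n else None for c, n in zip(counts, coverage)]


if __name__ == "__main__":
    # Made-up flank lengths (bp) and SNP counts, for illustration only.
    cov = bins_covered([2500, 1000, 400], bin_size=1000, n_bins=3)
    print(cov, normalized_density([4, 3, 0], cov))
```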

For making plots, we provide a Python script, snpPlotScrpt. It uses Python's matplotlib module. We recommend running it
within ipython invoked with the --pylab option, but other ways of running the script are possible.
Type ./snpPlotScrpt for usage information. The comments in the script are also helpful.

We next describe the trio analysis. Change directory to TRIOPHASING for the relevant scripts and files.

To phase the variants based on the genotypes of the parents,

./phasGATKNA12878varntScrpt GATKCALLSET.vcf.gz output.NA12878phasedGATKvariants.vcf output.NA12878notPhasedVariants.txt output.NA12878inConsistentVariants.txt
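
The three output files correspond to three possible outcomes for each variant. A minimal Python sketch of standard
trio phasing logic is given below, assuming genotypes are given as pairs of allele indices. It illustrates the
general technique only; the function name is hypothetical and phasGATKNA12878varntScrpt may handle more cases.

```python
def phase_child(child, mother, father):
    """Classify one child genotype given the parental genotypes.

    Each genotype is a pair of allele indices, e.g. (0, 1).
    Returns ("phased", (maternal_allele, paternal_allele)),
    ("not_phased", None) or ("inconsistent", None).
    """
    a, b = child
    # Keep every assignment of the child's alleles to the parents that is possible.
    candidates = {(m, p) for m, p in ((a, b), (b, a)) if m in mother and p in father}
    if not candidates:
        return "inconsistent", None  # no Mendelian-consistent assignment
    if len(candidates) > 1:
        return "not_phased", None  # ambiguous, e.g. all three heterozygous
    return "phased", candidates.pop()


if __name__ == "__main__":
    # Made-up genotypes, for illustration only.
    print(phase_child((0, 1), (0, 0), (0, 1)))
    print(phase_child((0, 1), (0, 1), (0, 1)))
    print(phase_child((0, 1), (0, 0), (0, 0)))
```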

The GATK call sets used are mentioned in the Methods section on trio phasing. The phased output vcf files should be further 
processed with vcf-tools. 

$HOME/path_to_vcf-tools/bgzip -c output.NA12878phasedGATKvariants.vcf > output.NA12878phasedGATKvariants.vcf.gz
$HOME/path_to_vcf-tools/tabix -p vcf output.NA12878phasedGATKvariants.vcf.gz

The deletions have to be phased too. For example, for NA12878,
 
./phasNA12878DelWithTrioScrpt NA12878_GenoTyp.txt output.NA12878phasedDel.txt output.NA12878NotPhased.txt > output.NA12878inConsistentDel.txt

Next we make the regions using the phased deletion events. The output file has the same layout as described above for
the TruSeq analysis, with alternating left and right regions (the very first region is left).
In this case the window size can be specified. We have used values of WIN equal to 100000, 50000 and 10000 (in bp).

./phasDelMakRegionScrpt output.NA12878phasedDel.txt WIN > output.NA12878phasedWINkbpRegns.txt
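
For illustration, window-based flanking regions of this kind can be computed as in the Python sketch below. The
function name, the 1-based inclusive coordinates, the chr:start-end format and the clamping of the left region at
position 1 are assumptions for this example and may not match phasDelMakRegionScrpt exactly.

```python
def window_regions(chrom, del_start, del_end, win):
    """Return (left, right) regions of up to `win` bases on either side of a deletion."""
    if del_start < 2 or del_end < del_start:
        raise ValueError("deletion must start after position 1 and end at or after its start")
    # Do not let the left region run off the start of the chromosome.
    left_start = max(1, del_start - win)
    left = f"{chrom}:{left_start}-{del_start - 1}"
    right = f"{chrom}:{del_end + 1}-{del_end + win}"
    return left, right


if __name__ == "__main__":
    # Hypothetical deletion coordinates with WIN = 10000.
    for region in window_regions("1", 50000, 60000, 10000):
        print(region)
```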

Finally the phasing of the variants with the phased deletions is achieved with the following script,

./phasSNPandDelScrpt output.NA12878phasedWINkbpRegns.txt output.NA12878phasedGATKvariants.vcf.gz output.NA12878inPhsWINkbpHET_SNPs.vcf output.NA12878outPhsWINkbpHET_SNPs.vcf

The in-phase and out-of-phase SNPs are written to the correspondingly named files. In their names, WIN can be replaced
with the appropriate number, such as 100, 50 or 10 (the units are kbp). Keeping the sample name and inPhs/outPhs as
part of the name is essential, because the counting script below uses these strings to name its output numpy arrays.
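
A small Python helper like the following can keep the names consistent across window sizes. It simply reproduces
the naming pattern used in this README; the helper itself is hypothetical and not part of the provided scripts.

```python
def phased_snp_filenames(sample, win_bp):
    """Build the in-phase and out-of-phase VCF names for a window given in bp."""
    if win_bp <= 0 or win_bp % 1000:
        raise ValueError("window must be a positive multiple of 1000 bp")
    kbp = win_bp // 1000
    in_phase = f"output.{sample}inPhs{kbp}kbpHET_SNPs.vcf"
    out_phase = f"output.{sample}outPhs{kbp}kbpHET_SNPs.vcf"
    return in_phase, out_phase


if __name__ == "__main__":
    for name in phased_snp_filenames("NA12878", 100000):
        print(name)
```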

To obtain plots, first count the SNPs with the counting script written for these trio-phased SNPs.

./phasSNPcntScrpt output.NA12878phasedWINkbpRegns.txt output.NA12878inPhsWINkbpHET_SNPs.vcf WIN

Here the output is only one numpy array. WIN must match the value used with phasDelMakRegionScrpt.

For the out-of-phase SNPs,
./phasSNPcntScrpt output.NA12878phasedWINkbpRegns.txt output.NA12878outPhsWINkbpHET_SNPs.vcf WIN

For plotting, phasSNPplotScrpt should be used. The other plotting instructions are as above.
