-------------------------------------------------------------------------------------------------
SNIP-Seq: SNP IDENTIFICATION AND GENOTYPE CALLING FROM POPULATION SEQUENCING OF TARGETED LOCI
-------------------------------------------------------------------------------------------------

Author: Vikas Bansal (vbansal@scripps.edu), last modified Feb 4 2010

This is a Python program that detects SNPs and assigns genotypes using the sequenced reads
from a population of sequenced samples. The method uses the base calls from all samples
simultaneously to identify SNPs at each position and then calls genotypes.
The program has been tested on several datasets that involved sequencing 50-300 samples
across a few hundred kilobases of the human genome using PCR. The method can also be applied
to targeted sequencing of several megabases or to population sequencing of small genomes.

INPUT FORMAT: 
------------------------------------

The program starts from the aligned reads for each sample. The alignments can be generated
using any alignment program (e.g. MAQ or BWA) but must be in the SAM format.
From the SAM file for each sample, one needs to generate a pileup file. This pileup file is similar
to a MAQ pileup file: each line summarizes the base calls of the reads covering one reference position.

Each line of this pileup file has 10 columns: 

Column 1: chromosome/locus 
Column 2: position 
Column 3: ref-base 
Column 4: coverage 
Column 5: basecalls (starting with @ and encoded using the MAQ pileup format) 
Column 6: qualityvalues (starting with @ and encoded as Sanger format)
Column 7: mapping quality values of the read covering each basecall (encoded as chr(MQ+33))
Column 8: position-in-read (comma separated list of integers)
Column 9: # of mismatches of reads (encoded as chr(mismatches+48))
Column 10: length-of-read (encoded as chr(length), therefore should be greater than 33 and less than 127). This column is required only for variable length reads.

1 1108154 A 16 @,,.,........C.,. @11>(;<3A98A5'?=> @{{y{{{{{{{ss{{s{ 36,35,5,26,15,15,23,24,25,25,26,27,10,30,6,35, 1032000021021020 %%%%%%%%%%%%%%%%
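
The column encodings above can be decoded with a few lines of Python. The following is a minimal
sketch, not part of SNIP-Seq itself: the function name parse_pileup_line and the dictionary keys are
made up for illustration. It assumes the columns are separated by whitespace, and it drops a leading
'@' from Column 7 when one is present, because the example line above carries one there even though
the column description does not mention it.

```python
# The example line from this README, used as test input.
EXAMPLE_LINE = (
    "1 1108154 A 16 @,,.,........C.,. @11>(;<3A98A5'?=> @{{y{{{{{{{ss{{s{ "
    "36,35,5,26,15,15,23,24,25,25,26,27,10,30,6,35, 1032000021021020 %%%%%%%%%%%%%%%%"
)


def parse_pileup_line(line):
    """Decode one pileup line into a dict of per-read values (illustrative only)."""
    cols = line.split()
    coverage = int(cols[3])

    # Assumption: Column 7 may carry a leading '@' like Columns 5 and 6.
    # It is only stripped when the field is one character longer than the coverage.
    mapq_field = cols[6]
    if len(mapq_field) == coverage + 1 and mapq_field.startswith("@"):
        mapq_field = mapq_field[1:]

    return {
        "locus": cols[0],
        "position": int(cols[1]),
        "ref_base": cols[2],
        "coverage": coverage,
        "basecalls": cols[4][1:],                            # drop leading '@'
        "base_quals": [ord(c) - 33 for c in cols[5][1:]],    # Sanger: chr(Q+33)
        "map_quals": [ord(c) - 33 for c in mapq_field],      # chr(MQ+33)
        "read_pos": [int(x) for x in cols[7].split(",") if x],
        "mismatches": [ord(c) - 48 for c in cols[8]],        # chr(mismatches+48)
        # Column 10 is optional (only needed for variable length reads).
        "read_lengths": [ord(c) for c in cols[9]] if len(cols) > 9 else None,
    }


if __name__ == "__main__":
    record = parse_pileup_line(EXAMPLE_LINE)
    for key, value in record.items():
        print(key, value)
```

Checking that each decoded list has as many entries as the coverage in Column 4 is a quick way to
spot a malformed line.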

Note that the method requires all sequence data to be generated using the same sequencing platform.
The method can handle variable read lengths; however, for best results, it is preferable to have
reads of the same length. The method requires each sample to have the same number of lines in the pileup file.
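
The line-count requirement can be checked before running the program. The sketch below is one possible
way to do it and is not part of the SNIP-Seq distribution; the function names are hypothetical. It
counts the lines in every file matching a glob pattern and reports the files whose count differs
from the most common one.

```python
from collections import Counter
import glob


def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def find_length_outliers(pattern):
    """Map each pileup file whose line count differs from the majority to its count."""
    counts = {path: count_lines(path) for path in sorted(glob.glob(pattern))}
    if not counts:
        return {}
    # The most frequent line count is taken as the expected one (an assumption).
    expected = Counter(counts.values()).most_common(1)[0][0]
    return {path: n for path, n in counts.items() if n != expected}


if __name__ == "__main__":
    import sys
    # Example: python check_pileups.py "directory_with_pileup_files/*pileup"
    for path, n in find_length_outliers(sys.argv[1]).items():
        print(path, n)
```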

CONVERTING FROM SAM TO PILEUP FORMAT:
-------------------------------------

python sam_to_pileup.py samplename.sorted.sam refsequence.fasta readlength > samplename.pileup 

This script has to be run separately for each sequenced sample. Note that this script works for targeted sequencing
but may need to be modified to work with other datasets.
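
Since the conversion is run once per sample, a small driver script can loop over the SAM files. This
is a sketch under stated assumptions: the helper names are hypothetical, the SAM files are assumed
to be named samplename.sorted.sam as in the command above, and sam_to_pileup.py is assumed to be in
the current directory and runnable with the "python" found on the PATH.

```python
import glob
import os
import subprocess


def pileup_name(sam_path):
    """samplename.sorted.sam (or samplename.sam) -> samplename.pileup"""
    base = os.path.basename(sam_path)
    for suffix in (".sorted.sam", ".sam"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return base + ".pileup"


def conversion_command(sam_path, ref_fasta, read_length):
    """Build the same command line as shown in this README."""
    return ["python", "sam_to_pileup.py", sam_path, ref_fasta, str(read_length)]


def convert_all(sam_pattern, ref_fasta, read_length, out_dir="."):
    """Run the conversion for every SAM file matching sam_pattern."""
    for sam_path in sorted(glob.glob(sam_pattern)):
        out_path = os.path.join(out_dir, pileup_name(sam_path))
        with open(out_path, "w") as out:
            # The script writes the pileup to stdout, so stdout is redirected to the file.
            subprocess.run(conversion_command(sam_path, ref_fasta, read_length),
                           stdout=out, check=True)


if __name__ == "__main__":
    # The pattern and file names here are placeholders.
    convert_all("sam_files/*.sorted.sam", "refsequence.fasta", 36, out_dir="pileups")
```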

RUNNING SNIP-Seq:
------------------------------------

python SNIP-seq.py "directory_with_pileup_files/*pileup" number_sequencing_cycles alternateSNPsfile dbSNPfile > outputfile


INPUT ARGUMENTS: 
-----------------------------------

directory_with_pileup_files: directory that contains the pileup files for all samples. It is passed as a quoted pattern ("directory_with_pileup_files/*pileup"), as in the command above. The name of each sample should be unique.

number_sequencing_cycles: the maximum value encountered in Column 8 of the pileup files. For single-end reads this is the same as the read length.

alternateSNPsfile: SNP genotype calls made using MAQ or another method. The format of this file is "sampleid locusname position refallele genotype score..."
If this file is not available, specify this option as "NULL".

dbSNPfile: list of SNPs in the sequenced region that are present in dbSNP. The format of this file is "rsid locusname position ......" 
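
If the value of number_sequencing_cycles is not known in advance, it can be read off the pileup files
themselves, since it is defined above as the maximum value in Column 8. The following sketch is not
part of SNIP-Seq; the function name is hypothetical, and it assumes whitespace-separated columns and
skips lines with zero coverage or fewer than 8 columns.

```python
import glob


def max_read_position(pileup_paths):
    """Return the largest position-in-read value (Column 8) across the given files."""
    best = 0
    for path in pileup_paths:
        with open(path) as handle:
            for line in handle:
                cols = line.split()
                # Assumption: uncovered positions carry no usable Column 8.
                if len(cols) < 8 or cols[3] == "0":
                    continue
                positions = [int(x) for x in cols[7].split(",") if x]
                if positions:
                    best = max(best, max(positions))
    return best


if __name__ == "__main__":
    import sys
    # Example: python max_cycles.py "directory_with_pileup_files/*pileup"
    print(max_read_position(sorted(glob.glob(sys.argv[1]))))
```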


PARSING OUTPUT:
------------------------------------

grep 'variant [01] 1' outputfile > snps.file  

This returns a list of all SNPs detected by SNIP-Seq across all sequenced samples. The output also includes statistics such as the best likelihood
score for the SNP in any sequenced sample, the two alleles, and the number of samples in which the alternate allele was found.
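
On systems without grep, the same selection can be done in Python. This is a minimal sketch with a
hypothetical function name; it keeps every line that contains the pattern used in the grep command
above and makes no other assumption about the layout of the output file.

```python
import re

# Same pattern as in the grep command: "variant", then 0 or 1, then 1.
VARIANT_PATTERN = re.compile(r"variant [01] 1")


def select_variant_lines(lines):
    """Return the lines that grep 'variant [01] 1' would print."""
    return [line for line in lines if VARIANT_PATTERN.search(line)]


if __name__ == "__main__":
    import sys
    # Example: python select_variants.py outputfile > snps.file
    with open(sys.argv[1]) as handle:
        for line in select_variant_lines(handle):
            sys.stdout.write(line)
```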

python process_variants.py outputfile > snps_and_genotypes.file

This script post-processes the output of SNIP-Seq to determine all clean SNPs and the corresponding genotypes for each sample.
It also filters out SNPs that are artifacts of misalignments due to indels. Such SNPs represent a significant fraction
of the total number of SNPs.  The output of this program is a list of SNPs and the corresponding genotype for each sample.

----------------------------------------------------------------------------------------------------------------------------------



