PURPOSE: re-map reads that originally failed to map. Some of these reads likely failed because they accumulated many mismatches from hyperediting.

This pipeline outputs a sam file in which the hyperedited reads are appended to the end of the originally mapped sam file.

The pipeline was written by Stephen Tran (Grace Xiao Lab at UCLA).


DEPENDENCIES:
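	All scripts use python 2.7 (python 2.6 might also work)
	hisat2 must be in the bash PATH variable
	python package itertools (part of the python standard library, so no separate install is needed)
	a reference genome in fasta format with a .fai index file (such as the one created by samtools faidx)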
	samtools
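
	The dependencies above can be checked before a run. The following is a minimal sketch, not part of the pipeline. It assumes Python 3 (shutil.which does not exist in python 2.7), assumes that samtools is also looked up on PATH, and assumes that the index sits next to the fasta file as <ref.fa>.fai, which is where samtools faidx writes it. The function names are hypothetical.

```python
import os
import shutil


def missing_tools(tools=("hisat2", "samtools")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_reference_files(ref_fasta):
    """Return the paths that do not exist: the fasta file and its .fai index.

    Assumption: the index is named <ref_fasta>.fai, the default name
    produced by `samtools faidx <ref_fasta>`.
    """
    candidates = [ref_fasta, ref_fasta + ".fai"]
    return [path for path in candidates if not os.path.isfile(path)]
```

	For example, missing_tools() returns an empty list when both hisat2 and samtools are on PATH, and missing_reference_files("hg19.fa") lists whichever of hg19.fa and hg19.fa.fai is absent.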

BEFORE RUNNING THIS PIPELINE:
	1) Extract unmapped reads into new fastq files. Make sure that nucleotides are in uppercase (A C T G), not lowercase (a c t g).

	2) Create the following two hisat2 indexes: hisat_index_normal_genome and hisat_index_reverse_complement_genome
		Follow these steps:
			Suppose you are using hg19 and have the hg19 fasta file in hg19.fa
			a. Make a reverse complement file of the genome: hg19.fa => hg19.reverse_comp.fa
			b. Change As to Gs in both genome files (forward and reverse complement), producing:
				hg19.As_to_Gs.fa
				and
				hg19.reverse_comp.As_to_Gs.fa
			c. Build hisat index files for both the forward and reverse complement genomes with:
				hisat2-build hg19.As_to_Gs.fa hg19.As_to_Gs.fa
				and
				hisat2-build hg19.reverse_comp.As_to_Gs.fa hg19.reverse_comp.As_to_Gs.fa
			d. hisat_index_normal_genome is hg19.As_to_Gs.fa
			   hisat_index_reverse_complement_genome is hg19.reverse_comp.As_to_Gs.fa
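
	Steps a and b above can be sketched in Python as follows. This is an illustration, not the script used to build the original indexes. It assumes Python 3, keeps the fasta headers unchanged in every output file, converts both A and a (to G and g) so that soft-masked sequence is handled, and leaves characters other than A C G T N untouched. All function names are hypothetical.

```python
# Translation tables: complement for A/C/G/T/N in both cases, and A -> G.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
A_TO_G = str.maketrans("Aa", "Gg")


def read_fasta(path):
    """Yield (header, sequence) pairs, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_record(out, header, seq, width=60):
    """Write one fasta record, wrapping the sequence at `width` columns."""
    out.write(">" + header + "\n")
    for start in range(0, len(seq), width):
        out.write(seq[start:start + width] + "\n")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def a_to_g(seq):
    """Replace every A with G (and a with g)."""
    return seq.translate(A_TO_G)


def prepare_genomes(fasta, revcomp_out, fwd_a2g_out, rev_a2g_out):
    """Write the reverse complement genome and both A-to-G converted genomes.

    Records are processed one at a time, so only a single chromosome
    (plus its converted copies) is held in memory at once.
    """
    with open(revcomp_out, "w") as rc, \
            open(fwd_a2g_out, "w") as fwd, \
            open(rev_a2g_out, "w") as rev:
        for header, seq in read_fasta(fasta):
            rc_seq = reverse_complement(seq)
            write_record(rc, header, rc_seq)
            write_record(fwd, header, a_to_g(seq))
            write_record(rev, header, a_to_g(rc_seq))
```

	With the file names used in this README, the call would be prepare_genomes("hg19.fa", "hg19.reverse_comp.fa", "hg19.As_to_Gs.fa", "hg19.reverse_comp.As_to_Gs.fa"), followed by the two hisat2-build commands from step c.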

HOW TO RUN THIS PIPELINE:
	* run hyperediting pipeline 
		use: ./run_hyperediting_pipeline_EX.sh <input_unmapped_fastq1> <input_unmapped_fastq2> <phred_encoding> <hisat_index_normal_genome> <hisat_index_reverse_complement_genome> <ref.fa> <hg19> <output_prefix> <original_sam_file>
			input_unmapped_fastq1 : input unmapped fastq R1 reads
			input_unmapped_fastq2 : input unmapped fastq R2 reads
			phred_encoding : Phred quality encoding of the input fastq files
			hisat_index_normal_genome : name of the hisat index of the normal genome
			hisat_index_reverse_complement_genome : name of hisat index of the reverse complement genome
			ref.fa : reference sequence of the normal genome in fasta format
			output_prefix : final output will be output_prefix_name.sam 
			original_sam_file : originally mapped sam file from before the hyperediting pipeline. Must be in sam format
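
	Because the script takes nine positional arguments, it is easy to pass them in the wrong order. The following is a minimal sketch of a wrapper that builds the command from named values, assuming Python 3. It is not part of the pipeline, the function names are hypothetical, and the argument names are copied from the usage line above.

```python
import subprocess

# Positional order taken from the usage line of run_hyperediting_pipeline_EX.sh.
ARGUMENT_ORDER = (
    "input_unmapped_fastq1",
    "input_unmapped_fastq2",
    "phred_encoding",
    "hisat_index_normal_genome",
    "hisat_index_reverse_complement_genome",
    "ref.fa",
    "hg19",
    "output_prefix",
    "original_sam_file",
)


def build_command(values, script="./run_hyperediting_pipeline_EX.sh"):
    """Return the argument list for the pipeline script.

    `values` maps each name in ARGUMENT_ORDER to its value. A ValueError
    is raised if any name is missing, so a forgotten argument cannot
    silently shift the remaining ones.
    """
    missing = [name for name in ARGUMENT_ORDER if name not in values]
    if missing:
        raise ValueError("missing arguments: " + ", ".join(missing))
    return [script] + [str(values[name]) for name in ARGUMENT_ORDER]


def run_pipeline(values):
    """Run the pipeline script and raise CalledProcessError on failure."""
    subprocess.check_call(build_command(values))
```

	The accepted values for phred_encoding and for the <hg19> argument are not specified here, so check the script itself before filling them in.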

