This R script and two Python scripts perform the processes described in steps 1, 2 and 4 of
the section "Analysis of single-cell islet RNA-seq data (ICA + clustering)." Both Python
scripts require at least Python version 2.7.5, as well as the Python modules pysam, scipy,
and Bio.Seq (part of Biopython). The R script doesn't require any specific libraries, and
was tested with R 3.2.2.

extractValidR2reads_wCorrection.py (step 1):
============================================
This script extracts the two 8-base barcode segments and the UMI from the R1 read, splices
the barcode segments into a single 16-base barcode, and appends the barcode and UMI to the
name of the corresponding R2 read in the format ':<barcode>:<UMI>'. To validate a read, it
first checks whether each barcode segment is present in the input barcode dictionary. If
both segments are in the dictionary, the UMI is then checked to make sure it doesn't
contain any Ns. If both of these conditions are met, the barcode and UMI are appended to
the R2 read name as described above. If either segment of the barcode is not found in the
dictionary, the script further checks whether that segment differs by a single base
(Hamming distance of 1) from exactly one barcode in the dictionary. If so, the segment is
corrected and processing continues as above. If the barcode is not correctable, or the UMI
contains any Ns, the read is discarded.
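
The validation and correction logic described above can be sketched roughly as follows.
This is only an illustration, not the code in extractValidR2reads_wCorrection.py: the
function names (hamming, correct_segment, tag_read_name) are hypothetical, and the sketch
assumes the two barcode segments and the UMI have already been cut out of the R1 read,
since the positions of those fields within R1 are not given here.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_segment(segment, barcode_set):
    """Return the segment if it is in the dictionary, the dictionary entry at
    Hamming distance 1 if there is exactly one such entry, otherwise None."""
    if segment in barcode_set:
        return segment
    hits = [bc for bc in barcode_set
            if len(bc) == len(segment) and hamming(bc, segment) == 1]
    # An ambiguous correction (two or more candidates) is treated as invalid.
    return hits[0] if len(hits) == 1 else None


def tag_read_name(r2_name, seg1, seg2, umi, barcode_set):
    """Return the R2 read name with ':<barcode>:<UMI>' appended, or None if
    the read should be discarded (hypothetical helper)."""
    if "N" in umi:
        return None
    fixed1 = correct_segment(seg1, barcode_set)
    fixed2 = correct_segment(seg2, barcode_set)
    if fixed1 is None or fixed2 is None:
        return None
    return "%s:%s%s:%s" % (r2_name, fixed1, fixed2, umi)
```

Scanning the whole dictionary for each uncorrected segment is the simplest approach; a
precomputed table of all one-mismatch neighbours would be faster for large inputs.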

The script assumes that the R1 and R2 files are parallel, such that the Nth reads in the
two files make up a matched paired-end read, and that the only difference between the two
file names is 'R1' in the R1 file name and 'R2' in the R2 file name. The output 'valid
reads' file has exactly the same name as the input R2 file, but with 'R2' replaced with
'R2_valid'. The output file is written as an uncompressed fastq file, even if the input
files are gzipped. Finally, a barcode counts file is also output, with the same name as the
input R1 file and '_bcCounts.txt' appended.
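
As a rough illustration of the naming convention above, the sketch below derives the three
paths from the R1 path. The function name derive_paths is hypothetical, and two details
are assumptions rather than documented behaviour: only the first 'R1' in the file name is
replaced, and a trailing '.gz' is dropped from the valid-reads name because that file is
not gzipped. The suffix '_bcCounts.txt' is appended literally to the R1 file name; the
actual script may trim the extension first.

```python
import os


def derive_paths(r1_path, out_dir):
    """Return (R2 input path, valid-reads output path, barcode counts path)."""
    r1_name = os.path.basename(r1_path)
    # Assumption: the first 'R1' in the file name is the read designator.
    r2_name = r1_name.replace("R1", "R2", 1)
    r2_path = os.path.join(os.path.dirname(r1_path), r2_name)

    valid_name = r2_name.replace("R2", "R2_valid", 1)
    if valid_name.endswith(".gz"):
        # Assumption: the uncompressed output drops the '.gz' suffix.
        valid_name = valid_name[:-3]

    counts_name = r1_name + "_bcCounts.txt"
    return (r2_path,
            os.path.join(out_dir, valid_name),
            os.path.join(out_dir, counts_name))
```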

Usage: python extractValidR2reads_wCorrection.py -i <inFile> -b <bcDict> -o <oDir> 
where:
	<inFile> = input R1 file (fastq or gzipped fastq)
	<bcDict> = barcode dictionary file (gel_barcode1_list.txt)
	<oDir> = the directory in which to place the output files


selectValidBarcodes.R (step 2):
===============================
This step requires visual inspection of the data. To the best of our knowledge, no one has
developed an analytical method for identifying the 'valid' barcodes, so we use this simple 
method. An example barcode counts file output from step 1 is provided (ratIslet_bcCounts.txt).

The script should be run interactively to allow the user to adjust the value of bcThresh (the
reads-per-barcode threshold). It reads in the barcode counts file from step 1 and displays
the reads per barcode in two ways. First, the unsorted counts are plotted, with reads per
barcode on a log scale. The threshold is set at the approximate midpoint of the obvious gap
between the top band of 'valid' barcodes and the other barcodes (~15,000 in this example),
and a red line is drawn to indicate the selected threshold.

Note that the threshold is identified visually, and is somewhat arbitrary. The main goal of 
this step is simply to remove the majority of 'bad' barcodes to reduce the downstream processing 
and file sizes. Barcodes (cells) with too few total UMIs can be filtered out in later analysis 
steps. Adjusting the threshold up or down within this gap region tends to have a fairly minor 
effect on the total number of barcodes selected. For example, with this data set, setting the
threshold to 10,000 results in 1130 valid barcodes, and increasing it to 20,000 reduces the
number of valid barcodes to 985, a loss of only ~13% of barcodes/cells for a 2-fold change in
the threshold. For this example we set the threshold to 15,000 reads per barcode, for 1036
valid barcodes.
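
The sensitivity check above (counting valid barcodes at several candidate thresholds) can
also be done outside R. The Python sketch below is an illustration only. The function
names read_counts and count_valid are hypothetical, and the file layout is an assumption:
one barcode per line followed by its read count, separated by whitespace. A barcode is
counted as valid when its read count is at least the threshold, matching the rule used in
step 4.

```python
def read_counts(lines):
    """Parse '<barcode> <count>' lines (assumed layout) into a dict."""
    counts = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, headers, and anything without an integer count.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def count_valid(counts, bc_thresh):
    """Number of barcodes with at least bc_thresh reads."""
    return sum(1 for n in counts.values() if n >= bc_thresh)
```

For example, after counts = read_counts(open('ratIslet_bcCounts.txt')), calling
count_valid(counts, t) for t in (10000, 15000, 20000) gives the kind of comparison
described above, provided the file matches the assumed layout.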

A second method of identifying valid barcodes is also demonstrated. First, the barcodes are
sorted by the total number of reads per barcode (low-to-high) and plotted on a log scale. For
illustration, the threshold selected above is marked with a red line. To make it easier to
identify the cutoff visually, the curve is replotted showing only the top 5000 barcodes. The red
line again indicates the threshold selected above. If using this method, the threshold should be
set somewhere in the region below the 'knee' of the curve, which corresponds to the gap region
found with the previous method.
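
The data behind the sorted curve can be prepared as in the sketch below, which leaves out
the plotting itself. The function names sorted_tail and threshold_position are
hypothetical. The first returns the top N counts in low-to-high order, and the second
returns the position along that curve where the threshold line would cross it.

```python
import bisect


def sorted_tail(counts, top_n=5000):
    """Return the top_n largest counts, sorted low-to-high."""
    ordered = sorted(counts)
    return ordered[max(0, len(ordered) - top_n):]


def threshold_position(ordered, bc_thresh):
    """Index of the first count >= bc_thresh in an ascending list."""
    return bisect.bisect_left(ordered, bc_thresh)
```

The number of barcodes at or above the threshold is then
len(ordered) - threshold_position(ordered, bc_thresh).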


filter_lowCountBC_bam.py (step 4):
==================================
After genome alignments are performed on the valid read files produced in step 1, this script
is used to remove any aligned reads whose barcode has fewer reads than the threshold chosen
in step 2. The output file is simply the input file with those reads removed.

Usage: python filter_lowCountBC_bam.py -i <inFile> -o <oFile> -b <countsFile> -n <threshold>
where:
	<inFile> = input bam file name
	<oFile> = output filtered bam file name
	<countsFile> = barcode counts file from step 1
	<threshold> = minimum number of reads required for a barcode
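
The per-read decision made by this filter can be sketched as follows. This is an
illustration, not the code in filter_lowCountBC_bam.py. The function names
barcode_from_name and keep_read are hypothetical, and the sketch assumes that the read
names in the bam file still end in ':<barcode>:<UMI>' as written in step 1, and that the
barcode counts have already been loaded into a dict.

```python
def barcode_from_name(read_name):
    """Extract the barcode from a name ending in ':<barcode>:<UMI>'.
    Returns None if the name does not have that suffix."""
    parts = read_name.rsplit(":", 2)
    if len(parts) != 3:
        return None
    return parts[1]


def keep_read(read_name, counts, threshold):
    """True if the read's barcode has at least `threshold` reads."""
    barcode = barcode_from_name(read_name)
    # Barcodes missing from the counts table are treated as having 0 reads.
    return barcode is not None and counts.get(barcode, 0) >= threshold
```

With pysam, a test like this could be applied to the query_name of each read while copying
from an input AlignmentFile to an output AlignmentFile opened with the input as template.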
