The scripts here create fragment BED files from paired-end FASTQ files.
Additionally, stranded counts for a supplied set of intervals (for example, bins) can be generated with an R script.

Needed resources:
bowtie2 v2.2.5 (newer versions will likely work but have not been tested)
samtools v1.8 (newer versions will likely work but have not been tested)
R v3.6.1 for Linux
R package: rtracklayer v1.44.3 and dependencies
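
Before running the scripts, it can help to confirm that the required executables are visible on the PATH. The following is a minimal sketch and is not part of the supplied scripts. It assumes the tools are invoked as bowtie2, samtools, and R, which may not hold on systems that rely on hard paths or environment modules.

```shell
#!/usr/bin/env bash
# Sketch: report which required executables are found on PATH.
# Assumption: the tools are named bowtie2, samtools, and R on this system.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Return non-zero when at least one tool is missing.
    [ "$missing" -eq 0 ]
}

check_tools bowtie2 samtools R || echo "Install or load the missing tools before continuing."
```

The check only confirms that each executable can be found; it does not compare versions against those listed above.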

"process_STARR_pair_fqs_to_frag_beds.sh" is used to create fragment BED files.
It contains hard-coded paths that need to be edited to match the locations of your files and executables. These are noted by comments in the file.


Example usage (the script must first be made executable, e.g. with chmod +x):

process_STARR_pair_fqs_to_frag_beds.sh <fq_pair1> <fq_pair2> <path_to_bowtie2_ref> <output_bam_path> <out_frag_bed_path> <num_parallel_processes_for_bowtie2>

<fq_pair1>	input FASTQ, read 1 of the pair
<fq_pair2>	input FASTQ, read 2 of the pair
<path_to_bowtie2_ref>	bowtie2 reference. The HTT locus plus E. coli genome is included (/HTT_locus_and Ecoli_bowtie2_ref)
<output_bam_path>	path to the created read-sorted alignment BAM file
<out_frag_bed_path>	path to the created fragment BED file (gzipped)
<num_parallel_processes_for_bowtie2>	integer; number of parallel processes bowtie2 will run for alignments
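
A call with all six arguments might look like the sketch below. Only the script name and the argument order come from the usage line above; every file name, the reference path, and the process count are hypothetical placeholders. The sketch prints the command instead of running it, so the arguments can be checked first.

```shell
#!/usr/bin/env bash
# All values below are hypothetical placeholders; substitute your own paths.
fq1="sample_R1.fastq.gz"        # <fq_pair1>
fq2="sample_R2.fastq.gz"        # <fq_pair2>
ref="path/to/bowtie2_ref"       # <path_to_bowtie2_ref>
bam="sample.bam"                # <output_bam_path>
bed="sample.frag.bed.gz"        # <out_frag_bed_path>
threads=4                       # <num_parallel_processes_for_bowtie2>

# Print the full command after checking the argument count.
build_cmd() {
    # Expect exactly six arguments, in the order of the usage line.
    if [ "$#" -ne 6 ]; then
        echo "expected 6 arguments, got $#" >&2
        return 1
    fi
    echo "./process_STARR_pair_fqs_to_frag_beds.sh $*"
}

# Dry run: show the command without executing it.
build_cmd "$fq1" "$fq2" "$ref" "$bam" "$bed" "$threads"
```

The printed line is for inspection only; paths containing spaces would need quoting before being pasted into a shell.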

"run_frag_bed_to_GenomicRanges_bins.sh" is used to count reads in intervals (bins) for an input fragment BED file and a set of intervals defined by a GenomicRanges object supplied in an R save file.
It contains a Linux module command that needs to be edited to match the user's system. This is noted by comments in the file.

Example usage (the script must first be made executable, e.g. with chmod +x):

run_frag_bed_to_GenomicRanges_bins.sh <frag_bed_file> <ref_GR_obj> <out_overlap_obj>

<frag_bed_file>	fragment BED file created by "process_STARR_pair_fqs_to_frag_beds.sh"
<ref_GR_obj>	an R save file containing ONLY a GenomicRanges object to map read counts to; a set of 290 bp bins spanning the HTT locus is provided as HTT_loc_tiled_GR.Rvar
<out_overlap_obj>	name of the output R save file containing stranded read counts for the intervals in <ref_GR_obj>
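
When several fragment BED files need to be counted against the same intervals, the call can be wrapped in a loop. The sketch below shows one possible layout and is not part of the supplied scripts: the frag_beds directory, the *.bed.gz naming, and the _counts.Rvar output suffix are all hypothetical. It only prints the commands it would run.

```shell
#!/usr/bin/env bash
ref_gr="HTT_loc_tiled_GR.Rvar"   # reference intervals provided with the scripts
bed_dir="frag_beds"              # hypothetical directory of fragment BED files

# Derive an output file name from a fragment BED path
# (hypothetical naming scheme: sampleA.bed.gz -> sampleA_counts.Rvar).
out_name() {
    base=$(basename "$1")
    echo "${base%.bed.gz}_counts.Rvar"
}

for bed in "$bed_dir"/*.bed.gz; do
    [ -e "$bed" ] || continue    # skip when the glob matches nothing
    # Dry run: print the command instead of executing it.
    echo "./run_frag_bed_to_GenomicRanges_bins.sh $bed $ref_gr $(out_name "$bed")"
done
```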

The GenomicRanges object in <out_overlap_obj> can be loaded and assigned to a variable with this R snippet:

my_GR <- get(load(<out_overlap_obj>))


#Alu analysis

A BED file of Alu positions in hg38 was created from the UCSC RepeatMasker track using "match_genome_to_rptconsensus_coords.pl". The BED file was converted to a GenomicRanges object in R.

For Johnson et al., the Alu analysis was also performed using "run_frag_bed_to_GenomicRanges_bins.sh" on fragment BED files. However, <ref_GR_obj> in this case is a GenomicRanges object containing every Alu base position in hg38. This file is ~820 MB and is not included in these Supplementary Files.

For the Van Arensbergen et al. data, the Alu base-level bin data were generated from bigWig files using "sub_map_steensel_to_alu_pos.R". This script has hard-coded paths that need to be changed; they are noted by comments within the script. Again, the reference GenomicRanges object is very large and is not included in these Supplementary Files.

example usage:

cat sub_map_steensel_to_alu_pos.R | R --vanilla <path_to_bigWig>

<path_to_bigWig>	a bigWig file from the Van Arensbergen et al. GEO submission











