LGT project - Press, Queitsch, and Borenstein (2015).
for inquiries, please email elbo@uw.edu or maximp@uw.edu.
last updated 2/22/2016

In this readme:

1) description of files and directories

2) broad overview of methods, description of a pipeline for reproducing our
analysis.

1) DESCRIPTION OF FILES
	First, a note: we have retained the intermediate data files needed to reproduce the analysis exactly as we ran it. However, they are quite large (>1 GB), and the journal does not allow a third-party repository, so the archive whose readme you are reading contains primarily code. If you would like ALL of these files, in a file structure that allows you to run each component of the analysis in place, email maximilian [at] alumni [dot] reed [dot] edu and you will be linked to the repository where they are hosted.

	A) Directories:
		I) "code/" contains the code for doing the actual
analysis. There is also some code that is not actually used in
there. Some of these scripts are called by the driver scripts that are present
in the main directory (which contains this readme file). 
		II) "gainLoss_run_files" has two directories, each of which
includes the source data on which gainLoss is run for different purposes (full
PGCE network inference vs. PGCE model training).
		III) "code/driver_scripts" contains the driver scripts which actually
run and make calls on other scripts in the "code/" directory, and data
distributed around various directories.
		IV) the "run_HGT_datascript.*" files are scripts and reports
for reproducibility of the analysis. The .R file is the driver script, which
may be passed to spin() in the knitr library. The .Rmd file (which will be produced
by the spin() function) may be passed to knit() in the same way. The .md file is a markdown file of results, including refs to figures. These can be used to generate a HTML or PDF report. (Note: this report only
handles processing the data following obtaining p-values for potential PGCEs
dependencies- the preprocessing and computation of p-values is handled
elsewhere, for simplicity; see below). Note that the intermediate data that the 
driver script relies on is not provided in this file structure. 

2) METHODS AND PIPELINE

This section lays out each step of the analysis for the LGT project.
I have tried to provide run time and memory requirement estimates for MY
dataset on the hardware available to me (~5000 genes and a tree with ~600
tips), though these will of course vary substantially with other datasets or
hardware.

======
FULL ANALYSIS
To reproduce the analysis, see the script
code/driver_scripts/REPRODUCE_ANALYSIS.sh. In principle, this script runs the
whole workflow after the gainLoss step, if you don't mind waiting ~3 weeks
for it to finish. Below I explain the components of the workflow in more
detail, and how they might be run independently of one another (recommended).

If all you want to do is see how I generate the figures from the processed
data, look at code/driver_scripts/run_HGT_datascript.R, which can be
run with knitr to produce a report reproducing the figures and tables. This
requires ~4G of memory.

Figures generated for the report are written to the folder figure/.

The run_HGT_datascript.R driver has been running into problems with HTML
conversion for me, but you can readily use the .md file and the figure/
directory with pandoc or another converter to generate the HTML (which is
what I did). Alternatively, you can pass the auto-generated .Rmd to another
of the family of knitr functions. So if the script crashes for you at the
end, after creating the .md file, that is why; don't worry.
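For reference, a pandoc invocation for the HTML conversion might look like
this (the -s flag and output filename are my choices, not part of the
original pipeline):

```shell
# convert the knitr-generated markdown to standalone HTML; run this from the
# directory containing run_HGT_datascript.md and the figure/ directory so
# that relative figure paths resolve
pandoc -s run_HGT_datascript.md -o run_HGT_datascript.html
```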

Some scripts in the following outline have versions with a _REPRODUCE.*
suffix. This suffix indicates that the script reproduces our results exactly
as they are, by using the outputs of our original simulation. Alternatively,
one can rerun the simulations de novo, which in principle might lead to
slightly different results.
=====

1) run gainLoss on the MO tree and KEGG data: I ignore some trivial
preprocessing needed to get the data into gainLoss-acceptable input formats.
	a) all source data and paramFile are in gainLoss_run_files/allSpecies/.
	b) outputs are directed to gainLoss_results/MOtree_GLrun/, which is
	   empty in this file structure.
	c) preprocessing should give you .meta files (metadata) mapping
positions in the gainLoss fasta input to KOs/COGs/whatever orthology scheme
you like.

###
# DRIVER SCRIPT FOR STEP 1: just the command
# $ ./code/gainLoss.VR01.266.Unix gainLoss_run_files/allSpecies/paramFile_KEGG2013_MOtree
# NB: this will take a long time to fully finish (weeks), though it will start producing 
# useful results after a week or so. It takes a fair amount of memory (I used 12G).
# We are only using the ancestral reconstruction, rates, and re-estimated tree from this step.
# In my runs the job eventually just crashes because the tree is too large to work with the 
# gainLoss built-in CTMC (if I understand correctly what happens), but by that point the
# program has already produced all of the output we need.
###
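Since this run takes weeks, it is worth launching it detached from your
terminal with a log file. A minimal sketch (the nohup wrapper and log
filename are my suggestions, not part of the original pipeline):

```shell
# run gainLoss in the background, immune to hangups, logging all output to
# gainLoss.log; the binary and paramFile are those from the command above
nohup ./code/gainLoss.VR01.266.Unix \
    gainLoss_run_files/allSpecies/paramFile_KEGG2013_MOtree \
    > gainLoss.log 2>&1 &
echo "started gainLoss, PID $!"
```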

2) post-process the gainLoss real data output.
	0a) MAKE SURE that you have a file TheTree.INodes.ph (the tree) in
your gainLoss output directory
	0b) MAKE SURE that you have the .meta file from step 1 in your output
directory.
	a) extract the data for each gene at each node and use it to compute
probabilities for each branch.
	b) run MMM to get the C_ij probabilistic count matrix.
	c) get gain/prevalence data and use it to filter the data.

###
# DRIVER SCRIPT FOR STEP 2: run_HGT_preprocess.R 
# TAKES <10 MINUTES, <3G memory
###
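The 0a/0b checks above are worth scripting, since a missing file will
otherwise only surface mid-run. A minimal sketch (the output path matches
step 1b above; adjust OUTDIR if yours differs):

```shell
# verify that the gainLoss output directory contains the tree and the .meta
# file before starting the post-processing
OUTDIR=${OUTDIR:-gainLoss_results/MOtree_GLrun}
missing=0
for f in "$OUTDIR/TheTree.INodes.ph" "$OUTDIR"/*.meta; do
    if [ ! -e "$f" ]; then
        echo "missing required file: $f"
        missing=1
    fi
done
if [ "$missing" -eq 0 ]; then
    echo "all required files present"
fi
```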

3) simulate null genes based on rates, gain/prevalence.
	a) simulate the evolution
	b) postprocess simulations into probabilities for each branch.
	c) MMM again, into 100 10000x10000 chunks.
	d) record prevalence/gain data

4) construct null distributions from null genes.
	a) assign real and simulated genes to each other and to 'bins' for computational convenience.
	b) sort the 10000x10000 chunks into the 'bins'.
	c) assemble the bins into individual files.

###
# DRIVER SCRIPT FOR STEPS 3 AND 4:
# code/driver_scripts/run_HGT_simulate[_REPRODUCE].R
# this will take a good while (maybe a day or two, for the scale of analysis
# that I have done).  Dependencies: R >= 2.15.3; libraries: MASS, ape.
# 
# this is the most computationally complicated (if not the most time-intensive) step of the pipeline.
# takes 6-8G memory, ~24 hours
# NB: this takes a lot of memory and disk space.  We have kept our original
# simulated data, which is probably better/easier to use in replicating our
# results.  Repetition of the simulation step gave similar results.
###

5) test hypotheses, build network. 
	a) get p-values.

#######
# this is the next most computationally intensive step, largely due to the
# number of binfiles and lookups being dealt with. Total time is ~150 hrs,
# but it is relatively easy to parallelize such that individual jobs are ~10
# hrs on average.  Not all jobs will be created equal, though, and efficiency
# will be lost by cutting the jobs smaller.
#
# DRIVER SCRIPT: code/driver_scripts/run_hypotest.R.  run time variable; 4G memory.
# it is currently set up to be run through a shell script that feeds it
# parameters and is called thusly:
# $ qsub -t 1-14 -l mfree=4G code/qsub_hypotest_full.sh
# this in turn is set up to work on our cluster at UW GS, where the -t 1-14
# parameter tells the scheduler to run the script 14 times with the values
# 1-14 replacing $SGE_TASK_ID in the shell script, and -l mfree=4G tells it
# to run with at least 4G of memory.
# the parameter values are "-t 1-14" because there are 14 rough "binfiles"
# (files storing the null distributions), and it was convenient to split them
# into 14 so that each could be read into memory.
# if you look into the code you can see how these are read in and used.
######
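For those not on an SGE cluster, the wrapper's job is simple enough to
sketch. Something like the following (the variable handling is illustrative;
see code/qsub_hypotest_full.sh for the real wrapper):

```shell
#!/bin/sh
# illustrative sketch of an SGE task-array wrapper: the scheduler sets
# SGE_TASK_ID to a value in 1..14, one per binfile of null distributions
SGE_TASK_ID=${SGE_TASK_ID:-1}   # default for running outside the cluster
echo "hypothesis testing on binfile $SGE_TASK_ID of 14"
# the real wrapper then invokes the R driver with this index, e.g.:
# Rscript code/driver_scripts/run_hypotest.R "$SGE_TASK_ID"
```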


6) Analysis of PGCEs. 
	a) assessment of PGCE network, basic manipulations, process it.
	b) permutation tests for function.
	c) pathway-pathway interactions.
	d) evaluation of specific interactions.
	e) topological sort.
	f) enrichment analysis of ranks.
	g) plotting ranks on tree.
	h) prediction analysis.

#########
# Driver script for step 6: run_HGT_datascript.R.  run
# time: ~1hr. 4-6G memory to be safe. written to be used with the knitr
# function "spin()" to generate a .md file and figures: 
# $ Rscript -e "library(knitr); spin('code/driver_scripts/run_HGT_datascript.R')"
# if you want to generate a full pdf report and have pandoc installed, you can
# just run the following command:
# $ pandoc -o press_pgce_report.pdf run_HGT_datascript.md
# and it will give you a semi-interpretable report of all the commands run and
# all the figures, re-generated.
#
# for now it is set up to be run independently, drawing on intermediate data not 
# included here.
# 
########

