Documentation for SIFTER scripts written by Philip Johnson, 
edited by Barbara Engelhardt
Questions: bee@compbio.berkeley.edu

====================================================================
PYTHON QUICK START
====================================================================

The Python scripts (all written by BEE) are quick-and-dirty scripts
that seem to work, but may need slight modification depending on
your purpose. I find Python much easier to work with than Perl, so
I'm including these scripts here (including scripts that run SIFTER
from the command line). Use them at your own risk; email me with
questions. As written, they extract only experimental evidence (TAS,
IMP, IDA); edit the code to change this.

To create .pli and .nhx files for SIFTER from a release of Pfam and
GOA uniprot databases, download the appropriate databases to a local
directory and use the files here to convert them to SIFTER format. All
paths need to be changed to your local paths. In some cases,
third-party code may need to be downloaded (e.g., FastTree,
BioPython).

* Create a file with the Pfam family ids you would like to convert
* Change all the paths/filenames in the python scripts 
* Run pfam2sifter.py to create the .pli files 
* Run pull_alignments.py followed by build_trees.py (if .nex files 
  are not there) and then clean_trees.py 
* Run SIFTER on these as encoded in run_hundred_families.py 
  (cross-validation)
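
A minimal Python sketch of the first step above, reading the file of
Pfam family ids. It assumes one accession per line, with blank lines
and '#' comments ignored; the scripts here do not document a format,
so check what pfam2sifter.py actually expects. The name
read_family_ids is hypothetical and is not part of the SIFTER scripts.

```python
import re

# Pfam accessions look like "PF00962", optionally with a version ("PF00962.12").
PFAM_ID = re.compile(r"^PF\d{5}(\.\d+)?$")


def read_family_ids(lines):
    """Return the Pfam accessions found in an iterable of text lines.

    Assumed format (not specified by the SIFTER scripts): one accession
    per line; blank lines and lines starting with '#' are skipped.
    """
    ids = []
    for lineno, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        if not PFAM_ID.match(entry):
            raise ValueError(f"line {lineno}: not a Pfam accession: {entry!r}")
        ids.append(entry)
    return ids
```

Typical use would be read_family_ids(open("families.txt")), where
families.txt is whatever file name you chose. Failing early on a
malformed id is cheaper than discovering it partway through a long run.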

====================================================================
Perl Documentation
====================================================================

Initial revision: 24 June 2005
Current revision: 6 April 2006

====================================================================
QUICK START
====================================================================

If you just want to run these scripts & don't care how they work,
follow these basic setup steps (NOTE: you may need to install the perl
modules XML::DOM and Tree::DAG_Node):

1) download Pfam-A.full, Pfam_ls, domain.pnh from PFAM

2) download gene_association.goa_uniprot from GO

3) ./pfam_index.pl -p <path-to-Pfam-A.full> -h <path-to-Pfam_ls> -g <path-to-gene_association.goa_uniprot> -s <path-to-domain.pnh>

./pfam_index.pl -p <path-to-Pfam-A.full>

4) edit pfam2pli.pl to change the hardcoded paths to PFAM data files (3 lines, near the top of the file)

5) From within the forester directory: patch forester.patch; make -f forester.mk; cp sdi.jar <directory-with-SIFTER-scripts> 

	NOTE FROM BEE: This step has never worked for me. Instead, replace
	the SDI.java file with the file included in here (will update
	for a new version of Forester, here 1.92), make according to
	the Forester instructions and then type:

 	% jar -cvf sdi.jar forester/tools/*.class

	to build the .jar file, and then copy it over to the scripts
	directory.

6) Make sure phylip (or whatever program you are using to build trees)
is installed on your computer (see what programs we have included in
the script by looking at pli2tree.pl command line options, or edit it
to include your own).

Now you can run the following as many times as you want:
1) ./pfam2pli.pl -i <PFAM id> | xsltproc indent.xsl -  >  <pli filename>
2) ./pli2tree.pl -i <pli file>  >  <nhx tree file>
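
If you would rather drive these two steps from Python than from the
shell, a rough sketch follows. The command lines are exactly the ones
shown above; the function names are hypothetical, and the sketch
assumes it is run from the scripts directory with xsltproc on your
path.

```python
import subprocess


def pfam2pli_commands(pfam_id):
    """argv lists for: ./pfam2pli.pl -i <PFAM id> | xsltproc indent.xsl -"""
    return [["./pfam2pli.pl", "-i", pfam_id], ["xsltproc", "indent.xsl", "-"]]


def pli2tree_command(pli_path):
    """argv list for: ./pli2tree.pl -i <pli file>"""
    return ["./pli2tree.pl", "-i", pli_path]


def prepare_family(pfam_id, pli_path, nhx_path):
    """Run both steps, writing the .pli file and then the .nhx tree file."""
    extract, indent = pfam2pli_commands(pfam_id)
    with open(pli_path, "wb") as pli_out:
        producer = subprocess.Popen(extract, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(indent, stdin=producer.stdout, stdout=pli_out)
        producer.stdout.close()  # only xsltproc should hold the read end of the pipe
        indent_status = consumer.wait()
        extract_status = producer.wait()
    if extract_status != 0 or indent_status != 0:
        raise RuntimeError(f"pfam2pli.pl/xsltproc failed for {pfam_id}")
    with open(nhx_path, "wb") as nhx_out:
        subprocess.run(pli2tree_command(pli_path), stdout=nhx_out, check=True)
```

For example, prepare_family("PF00962", "PF00962.pli", "PF00962.nhx").
Checking both exit statuses matters because a shell pipeline normally
reports only the status of the last command.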

To prepare a family given a set of sequences (this requires fa_select to be in your path, perl to be compiled with large file support, and the hardcoded sequence path in scan_gos.mk to be edited):
1) make -f scan_gos.mk FAMILY=<path to pli file w/o .pli extension>


pli.dtd
-------
SYNOPSIS: A document type definition for the "PLI" XML files used as input to SIFTER.
DETAILS: A single PLI (initials come from ??) file completely describes a single protein family.  The original SIFTER PLI files did not follow a consistent DTD, which means validation may fail on old PLI files.  The Protein subelements ECNumber, SpeciesName, PFamNumber, and GOReal are deprecated and included only for historical purposes.


pfam2pli.pl
-----------
SYNOPSIS: extracts a single protein family from PFAM-formatted files and outputs a single XML document (called a "PLI" document) for input to SIFTER.
DETAILS: Given a PFAM accession (e.g., PF00962), parses relevant information from three files (the first two downloadable from the PFAM ftp site, the last from GO):
	 - Pfam-A.full (family member accessions & alignments in Stockholm format)
	 - domain.pnh  (*species* trees for each PFAM family)
	 - gene_association.goa_uniprot (GO annotations for uniprot proteins)
Given the large size of these files, indexes are used to speed up parsing (see pfam_index.pl and FF_Index.pm).  Note that SIFTER may require special indenting and newlines to parse XML properly -- see indent.xsl.
OPTIONS: Normally, this script only extracts GO annotations for "molecular function."  However, this can be switched to use "biological process" or "cellular component" instead.  Also, as an alternative to extracting alignments from Pfam-A.full, one can supply a *single* Stockholm-formatted alignment in a file.  This latter feature is used when processing combined GOS and PFAM data.
MODULES: FF_Index (see below)
	 XML::DOM (standard XML Document Object Model)
URLS:	ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/README
	ftp://ftp.sanger.ac.uk/pub/databases/Pfam/userman.txt
	http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html
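
To illustrate the kind of GO filtering described for pfam2pli.pl, here
is a hedged Python sketch (not the Perl script's actual code). It
assumes the standard tab-delimited gene association layout: protein
accession in column 2, GO id in column 5, evidence code in column 7,
aspect (F, P or C) in column 9, and comment lines starting with '!'.
The evidence codes are the TAS, IMP, IDA set mentioned in the Python
quick start; the function name is hypothetical.

```python
# Evidence codes kept by default (the set named in the Python quick start).
EXPERIMENTAL = frozenset({"TAS", "IMP", "IDA"})


def experimental_annotations(lines, aspect="F", evidence=EXPERIMENTAL):
    """Yield (protein accession, GO id, evidence code) for matching rows.

    aspect is "F" (molecular function), "P" (biological process) or
    "C" (cellular component).
    """
    for line in lines:
        if line.startswith("!"):        # header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:               # malformed or truncated row
            continue
        accession, go_id, code, row_aspect = cols[1], cols[4], cols[6], cols[8]
        if row_aspect == aspect and code in evidence:
            yield accession, go_id, code
```

Passing aspect="P" or aspect="C" corresponds to switching the ontology
as described under OPTIONS. On the full file you would want to seek to
the protein's rows through the index rather than scan every line.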


indent.xsl
----------
SYNOPSIS: XSL transform to indent PLI file such that SIFTER can parse it
DETAILS: In unix, use something like: | xsltproc indent.xsl - > my_indented.pli


pfam_index.pl
-------------
SYNOPSIS: indexes four large PFAM-related files
DETAILS: the files:
	 - Pfam-A.full (indexed by PFAM accession)
	 - domain.pnh  (indexed by PFAM accession)
	 - Pfam_ls (indexed by PFAM accession)
	 - gene_association.goa_uniprot (indexed by protein accession)
Creates index files that sit side-by-side with the original files but with the suffix '.idx'.
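
As a rough picture of what indexing Pfam-A.full by accession involves,
the Python sketch below records the byte offset at which each
Stockholm record starts. It relies on two features of the Stockholm
format: the accession appears on a "#=GF AC" line and each record ends
with a "//" line. Dropping the version suffix (".12" in "PF00962.12")
is an assumption, and the function name is hypothetical; this is not
how pfam_index.pl itself is written.

```python
def index_stockholm(path):
    """Map Pfam accession (version suffix dropped) -> byte offset of its record."""
    offsets = {}
    with open(path, "rb") as handle:
        record_start = 0
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"#=GF AC"):
                # e.g. b"#=GF AC   PF00962.12" -> "PF00962"
                accession = line.split()[2].decode("ascii").split(".")[0]
                offsets[accession] = record_start
            elif line.strip() == b"//":
                # the next record begins right after the terminator line
                record_start = handle.tell()
    return offsets
```

With the offsets in hand, a reader can seek() straight to a family
instead of scanning the whole multi-gigabyte file.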


FF_Index.pm
-----------
SYNOPSIS: generic indexing routines for flat files
DETAILS: Not necessarily the most robust code, but appears to work.  Requires function pointer to create index; saves index in a simple text format (<key>\t<position>); uses unix 'sort' command to sort index; record retrieval uses binary search of index file.  See pfam_index.pl for a real-life usage example.
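
The indexing scheme described above (a key-extracting function, a
sorted <key>\t<position> text index, binary search for retrieval) can
be sketched in Python as follows. This is an illustration and not a
port of FF_Index.pm: it sorts in memory rather than calling unix
'sort', and it searches a list loaded from the index rather than the
index file itself. All function names are hypothetical.

```python
import bisect


def build_index(path, key_of):
    """Return sorted (key, byte offset) pairs for lines where key_of(line) is not None."""
    entries = []
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = key_of(line.decode("utf-8", errors="replace"))
            if key is not None:
                entries.append((key, offset))
    entries.sort()
    return entries


def write_index(entries, idx_path):
    """Save the index as one '<key>\\t<position>' line per entry."""
    with open(idx_path, "w") as out:
        for key, offset in entries:
            out.write(f"{key}\t{offset}\n")


def read_index(idx_path):
    """Load an index written by write_index."""
    entries = []
    with open(idx_path) as handle:
        for line in handle:
            key, offset = line.rstrip("\n").split("\t")
            entries.append((key, int(offset)))
    return entries


def lookup(entries, key):
    """Binary-search the sorted entries; return the first offset for key, or None."""
    i = bisect.bisect_left(entries, (key, -1))  # offsets are >= 0, so -1 sorts first
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return None
```

The key_of argument plays the role of the function pointer mentioned
above: each file type supplies its own rule for pulling a key out of a
line.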


pli2tree.pl
-----------
SYNOPSIS: takes a PLI file and outputs a tree in New Hampshire eXtended format
DETAILS: This script has three primary steps:
	 1) gene tree creation
	 2) species tree extraction (should be in PLI file already)
	 3) reconciled tree labelling speciation vs. duplication events
The gene tree is created by calling one of several potential external programs: quicktree (fastest; least memory), paup (more flexible), phylip (??).  The reconciled tree is created by calling a tweaked version of FORESTER (see forester.patch below).
OPTIONS: Can specify to output the gene or species tree (i.e., intermediate steps) instead of the reconciled tree.  This feature is used for processing the combined GOS/PFAM data -- environment samples have no species tree, so reconciliation would require some clever trickery; the workaround is to just use the gene tree.  Can also specify which tree building program to run (i.e., quicktree, paup, phylip).
MODULES: XML::DOM (standard XML Document Object Model)
	 Tree::DAG_Node (from CPAN)
URLS:	http://evolution.genetics.washington.edu/phylip/newicktree.html
	http://www.genetics.wustl.edu/eddy/forester/NHX.html
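
For checking the output of pli2tree.pl, the Python sketch below pulls
the [&&NHX:...] annotations out of a tree string and counts nodes
labelled as duplications versus speciations. It assumes the NHX
convention of a D tag per internal node, and accepts both Y/N and T/F
values since the spelling differs between NHX versions; check which
one your FORESTER build writes. The function names are hypothetical.

```python
import re

# Matches one NHX annotation, e.g. "[&&NHX:S=human:D=N]".
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_tags(tree_text):
    """Return one {tag: value} dict per [&&NHX:...] annotation, in order."""
    annotations = []
    for body in NHX_COMMENT.findall(tree_text):
        tags = {}
        for field in body.split(":"):
            if "=" in field:
                name, value = field.split("=", 1)
                tags[name] = value
        annotations.append(tags)
    return annotations


def count_events(tree_text):
    """Count nodes whose D tag marks a duplication or a speciation."""
    counts = {"duplication": 0, "speciation": 0}
    for tags in nhx_tags(tree_text):
        if tags.get("D") in ("Y", "T"):
            counts["duplication"] += 1
        elif tags.get("D") in ("N", "F"):
            counts["speciation"] += 1
    return counts
```

A tree with no D tags at all (for instance a gene tree output for GOS
data) gives zero for both counts.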


forester.patch
--------------
SYNOPSIS: patch to forester to avoid opening an X window during reconciliation
DETAILS: comments out a few lines in forester/tools/SDI.java


forester.mk
-----------
SYNOPSIS: makefile for compiling forester
DETAILS: makefile produces a JAR for speciation duplication inference (sdi.jar) that can be moved (and called) outside the forester directory structure.  Use along the lines of 'java -jar sdi.jar'


scan_gos.mk
-----------
SYNOPSIS: makefile for processing GOS proteins with SIFTER
DETAILS: This makefile handles the many steps required to analyze the GOS dataset with SIFTER.  A rough outline follows:
	 1) hmmsearch (from HMMER) of a single PFAM HMM against all GOS proteins, keeping only hits that meet the "trusted cutoff" and have an e-value below 0.01
	 2) hmmalign (from HMMER) to align the GOS proteins identified as family members, plus the original PFAM proteins, to the PFAM HMM
	 3) convert the resulting stockholm alignment into a PLI file
	 4) generate gene tree from PLI file


fa_select
---------
SYNOPSIS: select specified FASTA sequences from large FASTA file
DETAILS: used by scan_gos.mk
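
If fa_select is not available, its job as described here can be
approximated in a few lines of Python. This is a stand-in sketch, not
the fa_select program: it assumes a sequence's id is the first
whitespace-delimited word after '>', and the function name is
hypothetical.

```python
def select_fasta(lines, wanted):
    """Yield the lines of every FASTA record whose id is in the set 'wanted'."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            yield line
```

Because it streams line by line, this works on a FASTA file too large
to load into memory; pass it an open file handle and write the yielded
lines to the output file.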
