====================================================================
SIFTER: Statistical Inference of Function Through Evolutionary Relationships
====================================================================
Copyright (c) 2010 Barbara E Engelhardt.  All Rights Reserved.
http://sifter.berkeley.edu


Under active development by:
  Barbara Engelhardt <bee@compbio.berkeley.edu>

Previous developers:
  Philip Johnson <plfjohnson@berkeley.edu>
  Steven R. Chan <steven@berkeley.edu>

Please cite the new paper:
       Engelhardt BE, Srouji JE, Jordan MI, Brenner SE (2010)
       
Original paper available at PLoS Computational Biology
  http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0010045
  
  Engelhardt BE, Jordan MI, Muratore KE, Brenner SE (2005)
  Protein Molecular Function Prediction by Bayesian Phylogenomics.
  PLoS Comput Biol 1(5): e45

An ICML paper describing a newer version of the model is available on
BE's website:

  Engelhardt, BE, Jordan, MI, Brenner SE (2006)
  A statistical graphical model for predicting protein molecular function
  Proceedings of the International Conference on Machine Learning (accepted).

Server in progress: http://sifter.berkeley.edu

====================================================================
COMPILING
====================================================================

SIFTER requires at least Java 1.4 to run. We've included a makefile,
so that you can simply enter:

% make 

at your prompt to create the appropriate .jar file. The most recent
version was compiled with jdk1.6.0. You will also need to download
JLAPACK 0.6 if you don't already have it. It is freely available from

http://www.netlib.org/java/f2j/

====================================================================
USAGE
====================================================================

usage: java -jar sifter.jar [OPTIONS] FAMILYNAME
 -sfx,--scale <filename>           Set family .fx scale filename (default:
                                   data/scale-<FAMILY>.fx)
 -fx,--familyfile <filename>       Set family .fx parameter filename
                                   (default: data/infer-<FAMILY>.fx)
 -bg,--ida-background              Use only experimental annotations (IDA,
                                   IMP, TAS, IGI, IPI) to generate candidate
                                   functions
 -iea,--with-iea                   Use protein annotations inferred by
                                   electronic annotation.
 -igi,--with-igi                   Use protein annotations from those
                                   inferred from genetic interaction.
 -ipi,--with-ipi                   Use protein annotations from those
                                   inferred from physical interaction.
 -cutoff,--cutoff <number>         Cutoff delta for gradient ascent in EM
                                   (M-step) (default: 0.0050)
 -step,--step <number>             Step size for gradient ascent in EM
                                   (M-step) (default: 0.01)
 -tas,--with-tas                   Use protein annotations from traceable
                                   author statements.
 -nas,--with-nas                   Use protein annotations from
                                   non-traceable author statements.
 -folds,--folds <number>           Number of folds in cross validation,
                                   leave-one-out is 0 (default: 0)
 -x,--xvalidation                  Use cross-validation with EM.
 --help                            Show help for arguments. (More help is
                                   available via README.txt)
 -em,--em                          Perform EM to estimate parameters
 -g,--generate                     Generates a set of input parameters for
                                   the inference problem.
 -iter,--iter <number>             Number of iterations. At the moment,
                                   this applies only to EM. (default: 10000)
 -nex,--reconciled <filename>      Set reconciled .nex tree (default:
                                   reconciled/reconciled_<FAMILY>.nex)
 -ontology,--ontology <filename>   Specify which ontology file you want
                                   (default: "data/function.ontology")
 -output,--output <filename>       Set output file (default:
                                   output/default.rdata)
 -pli,--protein <filename>         Set protein file (default:
                                   proteins/proteinfamily_<FAMILY>.pli)
 -truncation,--truncation <number>   Number of functions to truncate to in
                                     approximation (default: 4)
 -v,--verbose                      Verbose operation.

Note about command line options: use option name with "--", except in
the case of verbose (use -v).
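
As a quick check before a run, the default paths listed above can be
collected for one family and tested for existence. This is only a
sketch: the helper names default_inputs and missing_inputs are
hypothetical and not part of SIFTER, and the paths are copied from the
defaults in the usage text. The two .fx files are expected to be
missing until the parameter files have been generated.

```python
from pathlib import Path


def default_inputs(family, root="."):
    """Return the default input paths from the usage text for one family."""
    root = Path(root)
    return {
        "scale": root / "data" / f"scale-{family}.fx",
        "familyfile": root / "data" / f"infer-{family}.fx",
        "reconciled": root / "reconciled" / f"reconciled_{family}.nex",
        "protein": root / "proteins" / f"proteinfamily_{family}.pli",
        "ontology": root / "data" / "function.ontology",
    }


def missing_inputs(family, root="."):
    """List the long option names whose default file does not exist."""
    return [name for name, path in default_inputs(family, root).items()
            if not path.is_file()]


if __name__ == "__main__":
    # Report which defaults would not be found for the bundled "test" family.
    for name in missing_inputs("test"):
        print(f"missing default for --{name}: {default_inputs('test')[name]}")
```

Any file reported as missing can instead be supplied explicitly with
the matching long option (for example --reconciled <filename>).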


 SIFTER does not strictly require the original directory
 structure, i.e.
 	data/
 	lib/
 	output/
 	proteins/
 	reconciled/
 but if you run into problems, make sure these directories exist
 under the top SIFTER directory.
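
 If the directories are missing, they can be created by hand or with a
 few lines of Python. The sketch below is one way to do it; the function
 name ensure_layout is hypothetical and not part of SIFTER. The demo at
 the bottom uses a throwaway directory, so pass the top SIFTER directory
 instead when using it for real.

```python
import tempfile
from pathlib import Path

# Directory names taken from the list above.
SIFTER_DIRS = ("data", "lib", "output", "proteins", "reconciled")


def ensure_layout(root):
    """Create any missing SIFTER subdirectory; return the names created."""
    root = Path(root)
    created = []
    for name in SIFTER_DIRS:
        path = root / name
        if not path.is_dir():
            path.mkdir(parents=True)
            created.append(name)
    return created


if __name__ == "__main__":
    # Demonstration on a temporary directory to avoid touching real data.
    with tempfile.TemporaryDirectory() as scratch:
        print("created:", ensure_layout(scratch))
```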

====================================================================
SETUP
====================================================================

To set up the files, databases, ontologies etc. required to run SIFTER,
see the README in the scripts/ directory.

After you have successfully

      * generated a phylogeny (and put it in
                               <SIFTER>/reconciled/reconciled-<FAMILY>.nex)
      * generated a .pli file (and put it in
                               <SIFTER>/proteins/proteinfamily-<FAMILY>.pli)
      * downloaded the appropriate ontology (from Gene Ontology, renamed
        if necessary and placed in <SIFTER>/data/function.ontology)
      * compiled the Java code

then type:

% java -jar sifter.jar <FAMILY> --generate

which will generate the parameter files. They are set to default values
that are robust in practice, but feel free to edit them before running
SIFTER:

% java -jar sifter.jar <FAMILY> -v

Then play around with learning parameters, different
datasets/phylogenies, cross validation, etc.

We have included files for a family called "test" to run here.
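
The two commands above can also be driven from a script. The following
is a minimal sketch, assuming sifter.jar sits in the current directory
and java is on the PATH. The helper names sifter_command and run_setup
are hypothetical; the argument order simply mirrors the examples in
this section.

```python
import subprocess
from pathlib import Path


def sifter_command(family, *options, jar="sifter.jar"):
    """Build the argument list in the order used in the SETUP examples."""
    return ["java", "-jar", jar, family, *options]


def run_setup(family, jar="sifter.jar"):
    """Generate the parameter files, then run inference verbosely."""
    for options in (("--generate",), ("-v",)):
        # check=True stops the sequence if a step exits with an error.
        subprocess.run(sifter_command(family, *options, jar=jar), check=True)


if __name__ == "__main__":
    if Path("sifter.jar").is_file():
        run_setup("test")
    else:
        # Without the jar, just show the first command that would be run.
        print(" ".join(sifter_command("test", "--generate")))
```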

====================================================================
DEBUGGING
====================================================================

Please send any problems/comments/questions to bee@compbio.berkeley.edu

Also, if you have updated/improved/written any code, I'd love to see
it and incorporate it into the next release (with due credit).

====================================================================
SIFTER output
====================================================================

The output for SIFTER (running inference) is a tab-delimited file
(default: output/default.rdata) with the following columns:

<NODE NAME> <POSTERIOR FN1> ... <POSTERIOR FNm> <MAX POSTERIOR PREDICTION>

The order of the functions is identical to the order in the transition
matrix parameter file (data/infer-<FAMILY>.fx): see initial row for
specific order.
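
A short Python sketch for reading this file is given below. It assumes
there is no header row and keeps the final column as text, since its
exact form is not specified here. The function name read_posteriors and
the sample rows are made up for illustration.

```python
import csv
import io


def read_posteriors(handle):
    """Parse rows of: node name, m posteriors, max posterior prediction."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 3:
            continue  # skip blank or too-short lines
        rows.append({
            "node": fields[0],
            "posteriors": [float(x) for x in fields[1:-1]],
            "prediction": fields[-1],  # kept as text (format assumed unknown)
        })
    return rows


# Made-up sample with two functions; real data is in output/default.rdata.
SAMPLE = "nodeA\t0.9\t0.1\tF1\nnodeB\t0.25\t0.75\tF2\n"

if __name__ == "__main__":
    for row in read_posteriors(io.StringIO(SAMPLE)):
        print(row["node"], row["posteriors"], row["prediction"])
```

To read a real run, open the output file and pass the handle, e.g.
read_posteriors(open("output/default.rdata", newline="")).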

The output for SIFTER (running EM) is the scale and transition matrix
parameter files. Their exact locations are printed at the end of EM.

The output for SIFTER (running leave-one-out cross-validation) is
printed to the command line: the percentage of left-out elements that
are predicted correctly according to their annotations. However, this
percentage is often wrong, so I would advise you to double check it
manually from the output (I was lazy enough not to consider ties in
posterior probabilities).
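
One way to make the manual check easier is to flag the nodes whose
largest posterior is shared by more than one function. The sketch below
is not part of SIFTER; the function names, the tolerance and the
example values are assumptions chosen for illustration.

```python
def tied_maxima(posteriors, tol=1e-9):
    """Return the indices of every function tying for the largest posterior."""
    if not posteriors:
        return []
    top = max(posteriors)
    return [i for i, p in enumerate(posteriors) if top - p <= tol]


def nodes_with_ties(rows, tol=1e-9):
    """Map node name -> tied indices, for nodes without a unique maximum."""
    ties = {}
    for node, posteriors in rows:
        tied = tied_maxima(posteriors, tol)
        if len(tied) > 1:
            ties[node] = tied
    return ties


if __name__ == "__main__":
    # Made-up posteriors: nodeA has a two-way tie, nodeB does not.
    example = [("nodeA", [0.5, 0.5, 0.0]), ("nodeB", [0.7, 0.2, 0.1])]
    print(nodes_with_ties(example))
```

Nodes reported here are the ones worth inspecting by hand, since a
single reported prediction cannot represent a tie.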

