
InterPool: interpreting pooling results.
Nicolas Thierry-Mieg (Nicolas.Thierry-Mieg@imag.fr).
09/10/2006.


**************
Prerequisites:
**************

InterPool has the following dependencies:
- Perl (used by the Makefile, and needed to run scripts/analyze.pl).

In addition it uses the following, but you probably don't
care about these unless you're a developer:
- expat (XML library) is needed for everything that deals with
signature files (ipoolDecoding, ipoolSimulations, ipoolValidation),
but version 2.0.0 is included in this distribution and statically
linked into the InterPool binaries, so you don't need to do
anything for this one;
- etags (for emacs; to develop using etags, add TAGS as a dependency
of the all: target in the Makefile);
- GMP (GNU Multiple Precision library) is needed for ipoolEntropy
(not currently distributed);
- doxygen can be used to generate the documentation (although I
no longer use it myself, so my comments may not parse too well).


If you use InterPool with a pooling design other than
STD, you must undef USING_STD (in config.h).
Simply undefining USING_STD makes decoding work for any design,
but it will be slower. Ideally, you should also implement fast
calculations in the few functions that currently take advantage of
the specific properties of STD (all in design.c; just grep USING_STD
to find them), and place them under an ifdef USING_XXX for your design.


**********
Compiling:
**********

Everything in this section happens in the src/ subdir.

You may edit the Makefile to set ARCH to whatever is
best for your machine. This is passed to gcc (eg -march=k8).
See the gcc man page for possible choices.

If you have icc (the Intel C compiler), you can use it instead
of gcc by changing CC and CFLAGS in the Makefile. In just a few
tests on a 2.6GHz P4, icc-generated code was ~20% faster in
my hands (this is icc 9.1 vs gcc 4.0.2), so it might be worth a
try for you.

You might want to customize the cost vector in distance.h
(values of DIST_NEG, DIST_FAINT, DIST_WEAK and DIST_POS).
The current default (2-1-2-4) favours sensitivity but may yield
more false positives. An alternative possibility is to use
the Hamming distance (1-1-1-1), although it may yield more
false negatives. This mostly influences the results 
if your pooling design is "borderline" relative to
the number of positives and errors that you have in your
observations (whether real or simulated).
This is described in the InterPool paper (Thierry-Mieg
and Bailly, Bioinformatics 2008).
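
To get a feel for what the cost vector does, here is a small Python
sketch (not InterPool code) of a weighted distance between an observed
signature and the outcomes expected for one candidate set of positives.
It assumes that "2-1-2-4" lists the costs in the order DIST_NEG,
DIST_FAINT, DIST_WEAK, DIST_POS, and that a pool contributes its cost
only when its observation contradicts the expectation. The function and
variable names are made up for illustration; see distance.h and the
paper for the real definitions.

```python
# Cost of declaring one observation erroneous, keyed by its observed class.
# ASSUMPTION: "2-1-2-4" is in the order NEG, FAINT, WEAK, POS.
DEFAULT_COSTS = {"NEG": 2, "FAINT": 1, "WEAK": 2, "POS": 4}
HAMMING_COSTS = {"NEG": 1, "FAINT": 1, "WEAK": 1, "POS": 1}


def weighted_distance(observed, expected_positive, costs):
    """Sum the cost of every pool whose observation contradicts expectation.

    observed: list of class names ("NEG", "FAINT", "WEAK", "POS"), one per pool.
    expected_positive: list of booleans, True if the pool should be positive.
    """
    total = 0
    for obs, exp_pos in zip(observed, expected_positive):
        observed_positive = obs in ("WEAK", "POS")
        if observed_positive != exp_pos:
            total += costs[obs]
    return total


observed = ["POS", "NEG", "FAINT", "WEAK", "NEG"]
expected = [True, True, True, False, False]
print(weighted_distance(observed, expected, DEFAULT_COSTS))  # prints 5
print(weighted_distance(observed, expected, HAMMING_COSTS))  # prints 3
```

With the default costs, explaining away a strong positive (4) is twice
as expensive as explaining away a regular negative (2), which is one way
to see why the default favours sensitivity.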

You may also want to modify errorModel.h. This is used
in simulations and validations. It specifies the 
fractions of observations that should be POS (==STRONG)
or WEAK among true or false positives, and similarly for
negatives. Read the file for more info.

Then, type make.
Please report any errors to Nicolas.Thierry-Mieg@imag.fr, and 
include "interPool" in the subject. Thanks!


****************
Using InterPool:
****************

This distribution of InterPool contains four programs and one script:
ipoolSTD, ipoolDecoding, ipoolSimulations, ipoolValidation, and
analyze.pl (in the scripts/ subdir).
(If you use Cygwin, add .exe at the end of the program names
discussed here.)


- ipoolSTD
*******
Creates STD design files in the format expected by the other programs
(see "File formats" below).
The files are created in the designFiles/ subdir.
The designs are as defined in (Thierry-Mieg N, BMC Bioinformatics 2006, 7:28).

USAGE: ipoolSTD n q k
where n is the number of variables, q is the number of pools per layer,
and k is the number of layers (ie redundancy).
The number of pools is therefore q*k.


EXAMPLE: The pooling design STD(940;13;13) is produced by calling:

src/ipoolSTD 940 13 13

This creates the file designFiles/STD.n940.q13.k13. A correct copy
of the expected file is included; you can compare the two files with:
diff designFiles/STD.n940.q13.k13 designFiles/STD.n940.q13.k13_ok
You should get no output (the files are identical).
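
The other programs take the total number of pools (nbPools) as an
argument, so it helps to compute q*k once. The following Python sketch
does that and rebuilds the design file name; the name pattern is simply
inferred from the STD(940;13;13) example above, and std_summary is a
hypothetical helper, not part of InterPool.

```python
def std_summary(n, q, k):
    """Return (number of pools, expected design file name) for STD(n;q;k)."""
    nb_pools = q * k  # k layers of q pools each
    # ASSUMPTION: name pattern inferred from designFiles/STD.n940.q13.k13.
    file_name = f"designFiles/STD.n{n}.q{q}.k{k}"
    return nb_pools, file_name


nb_pools, file_name = std_summary(940, 13, 13)
print(nb_pools, file_name)  # prints 169 designFiles/STD.n940.q13.k13
```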


- ipoolDecoding 
************
Decodes (ie interprets) one signature. Results go in the
InterPool.Results/ subdir (created if necessary, customizable
in config.h).

USAGE: ipoolDecoding n nbPools designFile sigFile
n: number of variables,
nbPools: number of pools,
designFile: a design file in interPool format,
sigFile: a signature in V 2.0 (XML) format.


EXAMPLE: you can decode the example signature Sigs/example.sig,
which corresponds to the pooling design STD(940;13;13), by calling:

src/ipoolDecoding 940 169 designFiles/STD.n940.q13.k13 Sigs/example.sig

The result file (InterPool.Results/example.sig.Decoded) is plain text.
It should report a single interactor (variable 66) and a few false-positive
and false-negative observations.
A copy of the expected output is included for comparison:

diff InterPool.Results/example.sig.Decoded InterPool.Results/example.sig.Decoded_ok

The only difference should be the line with the date and time.


TROUBLESHOOTING: If you get an error message about random() on this
example, your system's random() probably doesn't behave as it does
on GNU systems. You can work around this by editing myrand.c: in myrandom,
replace the call to random() with rand(), and similarly in plantSeed
replace srandom(seed) with srand(seed). This may reduce the quality
of the random numbers, but should solve the issue.


- ipoolSimulations 
***************
Performs nsim simulations, ie:
1. randomly choose nbPosVars among the n variables - these are the
"simulated positives".
2. calculate the expected noiseless result of testing the nbPools pools 
defined in designFile, given the simulated positives.
3. randomly flip the outcomes of some positive and negative pools 
(these simulate the false-negatives and false-positives respectively), 
as specified by falsePosFrac and falseNegFrac. The result of this
step is the "simulated signature".
4. decode the simulated signature, as you would a real observed
signature, to obtain the "decoded positives".
5. compare the decoded and simulated positives, output comparison 
results in the InterPool.Results/ subdir.
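
Steps 1 to 3 above can be sketched in a few lines of Python. This is
only an illustration under simplifying assumptions, not what
ipoolSimulations actually does: outcomes are plain positive/negative
(no WEAK or FAINT classes, which the real program draws according to
errorModel.h), the number of flipped pools is the rounded fraction, and
the toy design (rows and columns of a 3x3 grid) and all names are made up.

```python
import random


def simulate_signature(design, n, nb_pos_vars, false_pos_frac, false_neg_frac, rng):
    """Steps 1-3: pick positives, compute noiseless outcomes, flip some pools."""
    # Step 1: randomly choose the simulated positives among the n variables.
    positives = set(rng.sample(range(n), nb_pos_vars))
    # Step 2: a pool is positive if it contains at least one positive variable.
    noiseless = [any(v in positives for v in pool) for pool in design]
    # Step 3: flip a fraction of the positive pools and of the negative pools.
    pos_pools = [i for i, is_pos in enumerate(noiseless) if is_pos]
    neg_pools = [i for i, is_pos in enumerate(noiseless) if not is_pos]
    flipped = set(rng.sample(pos_pools, round(false_neg_frac * len(pos_pools))))
    flipped |= set(rng.sample(neg_pools, round(false_pos_frac * len(neg_pools))))
    signature = [(not v) if i in flipped else v for i, v in enumerate(noiseless)]
    return positives, noiseless, signature


# Toy design: 9 variables on a 3x3 grid, pooled by rows then by columns.
TOY_DESIGN = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
              [0, 3, 6], [1, 4, 7], [2, 5, 8]]

rng = random.Random(1)  # fixed seed so that the run is reproducible
positives, noiseless, signature = simulate_signature(TOY_DESIGN, 9, 1, 0.25, 0.5, rng)
print("positives:", sorted(positives))
print("noiseless:", noiseless)
print("simulated:", signature)
```

Steps 4 and 5 (decoding and comparison) are the actual work of
InterPool and are not sketched here.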

USAGE: ipoolSimulations n nbPools designFile nbPosVars falsePosFrac falseNegFrac nsim
n: number of variables;
nbPools: number of pools;
designFile: a design file in interPool format;
nbPosVars: number of positive variables;
falsePosFrac: fraction of neg pools that are erroneously observed as pos or weak;
falseNegFrac: fraction of pos pools that are erroneously observed as neg or faint;
nsim: number of simulations to perform.


EXAMPLE: assume you are considering using the 169 pools of STD(940;13;13) 
to smart-pool 940 objects, among which 3 might be positive, and you expect 
5% false-positives and 20% false-negatives with your assay. To get a rough 
idea whether this pooling design would be powerful enough, you could perform 
100 simulations corresponding to this specification by calling:

src/ipoolSimulations 940 169 designFiles/STD.n940.q13.k13 3 0.05 0.2 100

This might take a minute depending on your hardware.
You will then obtain a plain text result file in the InterPool.Results/ subdir.


- analyze.pl 
************
Typically, we want to perform large numbers of simulations, using several
different specifications (ie numbers of positives, error-rates), before
choosing a pooling design. Therefore the detailed output files produced
by ipoolSimulations become unwieldy. This small script parses the files
generated by ipoolSimulations, and outputs to stdout a summary of
the results.

USAGE: analyze.pl [resultFile]
If the optional resultFile is specified, it must be the name of a 
file produced by ipoolSimulations.
If it is omitted, analyze.pl will analyze all files present in
the current directory.


EXAMPLE: The result file produced by the above example for 
ipoolSimulations can be summarized with analyze.pl, by calling:

scripts/analyze.pl InterPool.Results/<resultFile>

where you replace <resultFile> by the real result file name (should
be STD.n940.q13.k13.pos3.fp13.fn3.nsim100.seedXXX where XXX is the
value of the random seed used).
You should find that STD(940;13;13) can cope with the specified
numbers of positives and error-rates, although some rare mis-taggings
may occur.
Alternatively, if you have performed many simulation runs with
various parameters, you can analyze all results at once with:
cd InterPool.Results ; ../scripts/analyze.pl


- ipoolValidation 
**************
This is similar to ipoolSimulations, except that it decodes the simulated
signatures with two different methods (algorithms) and compares
their results. Of course these should be identical whichever algorithms are used.
This is useful for cross-validating the decoding algorithms themselves.
You won't need it for a biological project: it is for developers.



*************
File formats:
*************

The format for a design file is:
one line per pool,
each line is a list of variable numbers (from 0 to n-1) separated by ':'.
There is an example in the designFiles/ subdir.
Other examples can be made using ipoolSTD.
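
If you write design files for your own (non-STD) design, a quick check
before feeding them to InterPool can save time. Here is a minimal Python
sketch of a reader for the format described above. It is not part of
InterPool; skipping blank lines and tolerating a trailing ':' are
assumptions made for convenience, and the function names are made up.

```python
def parse_design(text):
    """Parse design-file content: one pool per line, variables separated by ':'."""
    pools = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ASSUMPTION: blank lines are ignored
        # ASSUMPTION: empty fields (eg from a trailing ':') are skipped.
        pools.append([int(field) for field in line.split(":") if field])
    return pools


def check_design(pools, n, nb_pools):
    """Raise ValueError if the pool count or a variable number is inconsistent."""
    if len(pools) != nb_pools:
        raise ValueError(f"expected {nb_pools} pools, found {len(pools)}")
    for pool_number, pool in enumerate(pools):
        for variable in pool:
            if not 0 <= variable < n:
                raise ValueError(f"pool {pool_number}: variable {variable} not in 0..{n - 1}")


example = "0:1:2\n3:4:5\n0:3\n"
pools = parse_design(example)
check_design(pools, n=6, nb_pools=3)
print(pools)  # prints [[0, 1, 2], [3, 4, 5], [0, 3]]
```

For a real file you would read its content first, eg with
open("designFiles/STD.n940.q13.k13").read(), and call check_design with
n=940 and nb_pools=169.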


A signature file (which holds the result of an observation) is in
an XML format. There was an older text-only format for signatures,
so the current format is called 2.0.
There is a DTD in Sigs/signature.dtd, and an example signature
in the Sigs/ subdir.
ipoolDecoding only cares about the version attribute of <signature>
and the <values> data: version must be "2.0", and <values> holds the
pool numbers (from 0 to nbPools-1) that were observed as:
- strong positives (STRONG);
- weak positives (WEAK);
- negatives with a very faint signal (FAINT): these are considered
negative, but with a higher chance of being false-negative;
- regular negatives (NEG).



********************************************
Stuff that should be OK (but you can check):
********************************************

For optimal performance, MOT (in types.h) should correspond to
your architecture (32-bit or 64-bit?). Currently we use
unsigned long, which should be OK for most systems, but
you may want to check that it's OK for you.
You can print the sizes of various types on your system with:

cd src ; make printSizes ; ./printSizes


Some macros should be tuned for production builds.
Here is a list of things to watch out for:

In every file that has a #define DEBUG, you should undef it.
Likewise for files that have #define SANITY.

In unitClosures.c, the following 2 macros should be undef'd 
(they are for debugging and optimizing):
#undef BUILD_EMPTY_UNIT_INIT
#undef RESIZE_CHECK_NOMOVE

If you are running out of memory (or swapping): you can try 
defining DO_RESIZE_UNITS in unitClosures.c.

You should undef CONST_SEED in config.h (if it is defined, myrand.c
uses a constant seed; this is good for profiling but not for
production!).

