

      AnChro   -   June 2015


-------------------------------------------------------------------------------
Table of Contents
-------------------------------------------------------------------------------
 1. Overview 
 2. Input Description
 3. AnChro Usage 
 4. Output Description


-------------------------------------------------------------------------------
 1. Overview
-------------------------------------------------------------------------------
AnChro is a software tool that, given two genomes G1 and G2 and a set of 
reference genomes {G3..Gn} (or rather, the oriented synteny blocks shared in 
the pairwise comparisons G1/G2, G1/G3..G1/Gn and G2/G3..G2/Gn), reconstructs 
the gene order of the associated ancestral genome A.

AnChro requires genome description files and sets of pairwise synteny blocks. 
(i)   If synteny blocks have already been computed with SynChro, you can 
      run AnChro directly (see 3.) using SynChro's output as input.
(ii)  To reconstruct synteny blocks with SynChro, see the SynChro webpage 
      (http://www.lcgb.upmc.fr/CHROnicle/SynChro.html).
(iii) AnChro is not designed to work on synteny blocks other than the ones
      reconstructed by SynChro. However, you can use any files as long as 
      they respect the same format and replace the specific ones (see 2.).

AnChro only allows G1/G2 comparisons that take less than 5 minutes to 
compute. The idea is that if validating the different possible paths and 
cycles takes longer, these paths and cycles are probably not very 
informative (because they are too long), i.e. the genomes are too distant to 
allow an interesting ancestral reconstruction (one more complete than the 
shared synteny blocks alone).


-------------------------------------------------------------------------------
 2. Input Description
-------------------------------------------------------------------------------
Each species name is abbreviated to a 4-letter short name ("NAME" in the 
following).

Three types of files are needed to run AnChro:
(1) The 'NAME.def' files (one file for each {G1..Gn} genome)
(2) The 'NAM1.NAM2.orth.pairs' files (one file for each pairwise genome 
      comparison - see above)
(3) The 'NAM1.NAM2.orth.synt' files (one file for each pairwise genome 
      comparison - see above)

If you want to use AnChro on synteny blocks that were not computed by SynChro, 
you must replace the content of the (2) and (3) files.
(These files are described - as (5) and (6) - in README_SynChro.txt, which you 
can download from the SynChro webpage: 
http://www.lcgb.upmc.fr/CHROnicle/SynChro.html)
(These files can be found in "CladeName/11Blocks/DeltaD/OrthBlocks/")
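
A minimal sketch, in Python, of how the expected input file names could be
listed for one ancestor definition (for instance to check that nothing is
missing before a run). The function name and the short names "AAAA", "BBBB"
and "CCCC" are hypothetical, and the order of the two names inside a pairwise
file name is an assumption: SynChro may write them in the other order.

```python
def required_inputs(g1, g2, references):
    """Return the input file names needed for G1, G2 and the references.

    Assumption: pairwise files are named with the genomes in the order
    given here; the real files may use the opposite order.
    """
    genomes = [g1, g2, *references]
    # Pairwise comparisons listed in the Overview:
    # G1/G2, G1/G3..G1/Gn, G2/G3..G2/Gn
    pairs = [(g1, g2)]
    pairs += [(g1, ref) for ref in references]
    pairs += [(g2, ref) for ref in references]

    files = [f"{name}.def" for name in genomes]
    for first, second in pairs:
        files.append(f"{first}.{second}.orth.pairs")
        files.append(f"{first}.{second}.orth.synt")
    return files


if __name__ == "__main__":
    for file_name in required_inputs("AAAA", "BBBB", ["CCCC"]):
        print(file_name)
```

With n genomes this gives n '.def' files and 2n-3 pairwise comparisons, each
with one '.orth.pairs' and one '.orth.synt' file.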


-------------------------------------------------------------------------------
 3. AnChro Usage 
------------------------------------------------------------------------------- 
(a) Reconstruct an ancestral genome, or a list of ancestral genomes, for a 
given pair (Delta',Delta"), by choosing G1,G2,G3..Gn among the n 
species of a given clade (those with a '.def' description file present in the 
"CHROnicle/CladeName/01Genomes/" directory).
Go to the "CHROnicle/Programs/4AnChro/" directory and execute:

   ./AnChro.py CladeName Delta' Delta" a

 Delta' : an integer >= 0 (usually between 1 and 6) selecting the SynChro 
          synteny blocks that you wish to use to reconstruct cycles/paths 
          between G1 and G2 (i.e. to identify rearrangements).
          Note that you must have reconstructed these Delta'-associated G1/G2 
          synteny blocks before running AnChro (see SynChro's webpage).
          Note: the higher Delta' is, the more complete (in terms of number 
          of scaffolds) the ancestral reconstruction tends to be:
          - this holds for two reasons: (i) as Delta' increases, the synteny 
            blocks get larger and fewer ancestral adjacencies remain to be 
            identified; (ii) a higher Delta' abstracts away more of the 
            micro-rearrangements, which makes the macro-rearrangements 
            easier to identify
          - but it can fail, as higher Delta' values can either wrongly 
            enlarge synteny blocks, adding noise to the breakpoint 
            definitions, or include important synteny blocks, which may 
            then be deleted.
Delta" :  an integer >= 0 (usually between 1 and 6) selecting the SynChro 
          synteny blocks that you wish to use to compare the breakpoints 
          identified between G1 and G2 to the reference genomes {G3..Gn} (the 
          smaller Delta" is, the more incomplete, but safer, the ancestral 
          reconstruction is).
          Note: the higher Delta" is, the more complete (in terms of number 
          of scaffolds) the ancestral reconstruction tends to be:
          - this holds because, as Delta" increases, the synteny blocks get 
            larger and more ancestral adjacencies are "found" in the 
            reference genomes
          - but it can fail, as higher Delta" values increase the cScore 
            noise, making it harder to recover the signal.
      a : can be equal to:
          - 1, which allows you to identify a specific ancestor (by giving its 
             name and the genomes G1, G2, G3.. Gn)
          - a textfile path/name, where each line contains an ancestral name and 
             numbers associated to the genomes G1, G2, G3.. Gn (all space-
             separated)
             (you can find an example of such a file in CHROnicle/4AnChro/)
             (The numbers of the genomes must be the ones that are displayed by 
             executing AnChro.py with a=1. So be careful: if you add a genome, 
             you may need to rewrite this file).

(ex: ./AnChro.py Yeast 2 3 ExempleFileForAnChro.txt)

This will create different outputs in 
"CHROnicle/CladeName/40Ancestors/A/NAM1.NAM2.3..n/Delta'Delta"/":
   -"1PacksBlocks/" which describes the pack step and the synteny blocks after 
     the pack step (i.e. without the deleted blocks and with 'non-overlapping' 
     signed blocks only)
   -"2Rearrangements/" which describes cycles, breakpoint comparisons to 
     reference genomes, contradictions, induced rearrangements
   -"3PostMacro/" which contains the ancestral genome (described either with 
     the genes of G1 or G2, or as a succession of synteny blocks) after the 
     resolution of the identified macro-rearrangements
   -"4Micro/" which contains the details of the management of the 
     micro-rearrangements (the small inversions included in the synteny blocks)
   -"5Ancestor/" which contains the final description of the ancestral genome 
     with, in particular, the score associated to each gene adjacency and a 
     summary file which recapitulates the ancestral reconstruction 
     (and the identified rearrangements between A and G1, A and G2)
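
When 'a' is a text file, each of its lines holds an ancestral name followed
by the numbers of the genomes G1, G2, G3..Gn, all space-separated. The sketch
below shows one possible way to read such a line in Python. The function name
and the sample line "A1 3 5 1 2" are made up for illustration (they are not
taken from ExempleFileForAnChro.txt), and treating the numbers as integers is
an assumption.

```python
def parse_ancestor_line(line):
    """Split one line of the 'a' file into its name and genome numbers."""
    fields = line.split()
    if len(fields) < 3:
        # A name plus at least the numbers of G1 and G2 are expected.
        raise ValueError(f"expected a name and at least two numbers: {line!r}")
    numbers = [int(value) for value in fields[1:]]
    return {
        "ancestor": fields[0],
        "g1": numbers[0],
        "g2": numbers[1],
        "references": numbers[2:],  # G3..Gn, possibly empty
    }


if __name__ == "__main__":
    # Hypothetical line: ancestor "A1", G1=3, G2=5, references 1 and 2.
    print(parse_ancestor_line("A1 3 5 1 2"))
```

The genome numbers themselves must still be the ones displayed by AnChro.py
when it is run with a=1.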
     

(b) Reconstruct a series of ancestral genomes, for different Delta' and 
Delta" values, by going to the "CHROnicle/Programs/4AnChro/" directory and 
executing:

   ./AllAncestors.py CladeName a Delta'_b Delta"_b

       a : a textfile path/name where each line contains an ancestral name and 
           numbers associated to the genomes G1, G2, G3.. Gn (all space-
           separated)
           (you can find an example of such a file in CHROnicle/4AnChro/)
           (The numbers associated to the genomes must be the ones that are 
           displayed by executing AnChro.py with a=1. So be careful: if you 
           add a genome, you may need to rewrite this file).
Delta'_b : is an interval of Delta' values (all >= 0) associated to the 
           SynChro's synteny blocks that you wish to use to reconstruct 
           cycles/paths between G1 and G2.
           Note that you must have reconstructed these delta-associated G1/G2 
           synteny blocks before running AnChro (see SynChro's webpage).
Delta"_b : is an interval of Delta" values (all >= 0) associated to the 
           SynChro's synteny blocks that you wish to use to compare G1/G2 
           breakpoints to the reference genomes {G3..Gn}. 

(ex: ./AllAncestors.py Yeast ExempleFileForAnChro.txt 3-3 2-4)

For each ancestor defined in the 'a' file (here: ExempleFileForAnChro.txt), 
the reconstructions associated to every possible (Delta',Delta") pair are 
computed (here: (3,2), (3,3) and (3,4)).
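
The expansion of the two intervals into (Delta',Delta") pairs can be sketched
as follows in Python. This is only an illustration, not the code of
AllAncestors.py: the function names are hypothetical, and the inclusive
"low-high" syntax is inferred from the example above.

```python
from itertools import product


def parse_interval(text):
    """Turn a string such as "2-4" into the inclusive range 2, 3, 4."""
    low, high = (int(part) for part in text.split("-"))
    if low < 0 or high < low:
        raise ValueError(f"invalid delta interval: {text!r}")
    return range(low, high + 1)


def delta_pairs(delta1_b, delta2_b):
    """List every (Delta', Delta") pair covered by the two intervals."""
    return list(product(parse_interval(delta1_b), parse_interval(delta2_b)))


if __name__ == "__main__":
    print(delta_pairs("3-3", "2-4"))  # [(3, 2), (3, 3), (3, 4)]
```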


(c) Then, summarize the different ancestral reconstructions, run with 
AnChro.py and/or AllAncestors.py, obtained with different definitions and 
different delta values:

   ./Summary.py CladeName
   
(ex: ./Summary.py Yeast)

This will create two output files: 
- "CHROnicle/CladeName/40Ancestors/Summary_CladeName_Rea.txt", where the 
   [#rea_G1 : number of macro-rearrangements from A to G1
   (micro), : number of micro-rearrangements/inversions from A to G1
   #rea_G2  : number of macro-rearrangements from A to G2
   (micro), : number of micro-rearrangements/inversions from A to G2
   #rea     : number of macro-rearrangements not well-localized 
              (either from A to G1 or from A to G2) 
   (micro)] : number of micro-rearrangements/inversions not well-localized 
              (either from A to G1 or from A to G2)
are given for each ancestor definition (line) 
and each (Delta',Delta") pair (column) computed previously.

- "CHROnicle/CladeName/40Ancestors/Summary_CladeName_ChrGene.txt", where the 
   number of (scaffolds,genes) 
is given for each ancestor definition (line) 
and each (Delta',Delta") pair (column) computed previously.


-------------------------------------------------------------------------------
 4. Output Description
-------------------------------------------------------------------------------
The output files for an ancestor A defined by the genomes G1,G2,{G3..Gn} for 
a given pair of (Delta',Delta"), in 
"CHROnicle/CladeName/40Ancestors/A/NAM1.NAM2.3..n/Delta'Delta"/", are
(note that the (14), (15) and (16) outputs may be the most interesting ones) :

In the "/1PacksBlocks/" directory : 
(1) NAM1.NAM2.3..n.Packs.txt
    -> details on packs 
for each of them:
- one line with:
  ([list of the possible block orderings in G1],[list of the possible block 
     orderings in G2])   
  ([list of the associated scores in G1 
     (sum of the average+std of the length of the smallest cycles passing 
     through all brkpts associated to the combination)],
     [list of the associated scores in G2])
- a second line with, for each pack:
  its name and the index (starting from 0) of the validated combination. 

(2) NAM1.NAM2.3..n.BlockspostPack.txt
    -> blocks after the pack step (i.e. 'non-overlapping' and no longer 
       included in one another, at least "virtually"...)
for each block, 7 tab-separated columns:
   1 name   (e.g. B00001G1) starting with a B, followed by its ID and 
            ending with G1 or G2 depending on whether it is defined on 
            the genome G1 or G2,
            followed by 'vp' or 'vn' if "virtual"
   2 sign   positive 1 or negative -1 (or 0 if unsigned)
   3 chr    chromosome
   4 start  the number/id of its first gene 
   5 end    the number/id of its last gene 
   6 prev   the name of the previous block (0 if telomeric)
   7 next   the name of the next block (0 if telomeric)
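
As an illustration, one line of this file could be read in Python as sketched
below. The class and function names are hypothetical, and so are the values
of the sample line. Only the sign is converted to an integer; the other
columns are kept as strings, since their exact content depends on the genome
description files.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str    # e.g. B00001G1, with a 'vp' or 'vn' suffix if "virtual"
    sign: int    # 1, -1, or 0 if unsigned
    chrom: str   # chromosome
    start: str   # number/id of the first gene
    end: str     # number/id of the last gene
    prev: str    # name of the previous block ("0" if telomeric)
    nxt: str     # name of the next block ("0" if telomeric)


def parse_block(line):
    """Parse one tab-separated line of the BlockspostPack file."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 7:
        raise ValueError(f"expected 7 columns, got {len(columns)}")
    name, sign, chrom, start, end, prev, nxt = columns
    return Block(name, int(sign), chrom, start, end, prev, nxt)


def is_virtual(block):
    """A block is "virtual" when its name ends with 'vp' or 'vn'."""
    return block.name.endswith(("vp", "vn"))


def is_telomeric(block):
    """A block is telomeric when it has no previous or no next block."""
    return block.prev == "0" or block.nxt == "0"


if __name__ == "__main__":
    # Hypothetical line, for illustration only.
    block = parse_block("B00001G1\t1\t1\t1\t12\t0\tB00002G1\n")
    print(block, is_virtual(block), is_telomeric(block))
```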

In the "/2Rearrangements" directory : 
(3) NAM1.NAM2.3..n.contr.txt
    -> list of conflicting breakpoints (A,B) and (A,C) that are both found in 
       one or more reference genomes
for each of them (for (A,B)_G1 as for (A,C)_G2):
- its description [genome, chr, (the 2 blocks surrounding the brkpt), 
  (the 2 gene ids surrounding the brkpt), (the 2 gene names surrounding the 
  brkpt), (the 2 nucleotide coordinates surrounding the brkpt)]
- followed by the name of the reference genome in which the adjacency has been 
  found and its associated cScore (see the paper for details on cScore)

(4) NAM1.NAM2.3..n.cycles.txt
    -> the cycles/paths validated by the algorithm, classified by size, with
       their number and
       their associated breakpoints (defined either between 2 blocks, which may 
       be "virtual" ones, or by one telomeric block) as (B00015G1,B00016G1)
(as [breakpoints of G1;breakpoints of G2])

(5) NAM1.NAM2.3..n.details.txt
    -> cycles/paths details with :
- for each brkpt, its associated cScore 
- the validated breakpoints (the ones with higher cScores)
- and the deduced rearrangements

(6) NAM1.NAM2.3..n.rea.txt
    -> details on ancestral/NON and post-rearrangement/POST breakpoints for 
       each cycle/path (or NorP when we cannot know)
for each breakpoint:
  [NON/POST/NorP, modified cScore according to the other brkpts of the cycle 
  (original cScore), genome, chr, (the 2 blocks surrounding the brkpt), 
  (the 2 gene ids surrounding the brkpt), (the 2 gene names surrounding the 
  brkpt), (the 2 nucleotide coordinates surrounding the brkpt)]
    
In the "/3PostMacro" directory : 
(7) MacroBlocks.txt
    -> description of the reconstructed ancestral scaffolds as a succession of 
       synteny blocks 
       (the ones described in (2) NAM1.NAM2.3..n.BlockspostPack.txt)
       with the cScore associated to each (validated ancestral) block adjacency
       (between the block on the same line and the one on the next line)

(8) NAM1.def (resp. NAM2.def)
    -> definition of the ancestral genome at gene level
       with the genes of G1 (resp. G2) 
       (included in the synteny blocks described in (7))
These .def files are identical to the ones used by SynChro (see 
README_SynChro.txt), except that the 3 id columns correspond respectively to: 
   - the feature number of the gene in the genome G, 
   - the gene number in the scaffold in A
   - the gene number in the genome in A
and that there are 2 additional columns, corresponding to the scores 
associated to the gene adjacencies (resp. previous and next ones): 
   - if the genes were adjacent in the same synteny block: the score is 1.01,
   - if the genes are adjacent because they respectively end and start two 
     adjacent synteny blocks:
     the score is equal to the cScore associated to this block adjacency

(9) NAM1.NAM2.orth.pairs and NAM1.NAM2.orth.synt
    -> description of the synteny blocks reconstructed with Delta=0 
       between these two very similar genomes: NAM1.def and NAM2.def

In the "/4Micro" directory :
(10) NAM1.predel (resp. NAM2.predel)
     -> file similar to the (8) one, but simplified and containing only genes 
     with a homolog in G2 (resp. G1)
where the 3 id columns correspond respectively to : 
   - the feature number of the gene in the genome G, 
   - the previous (before deletion of non-homologous gene) gene number in the 
     genome in A
   - the new gene number in the genome in A

(11) MicroBlocks.txt
     -> details of the micro-synteny blocks 
for each of them:
   - block id
   - [chr
   - sign, : 1 if it is the same in G1 and G2, else 0
   - %sim, : average of the similarities between the orthologs defining the 
     block
   - len]  : number of genes in the block
   - borne_G1 : (series of) boundaries of the block in A_G1 (defined with the 
     new gene ids) 
   - borne_G2 : (series of) boundaries of the block in A_G2 (defined with the 
     new gene ids) 

(12) NAM1.postdel (resp. NAM2.postdel)
     -> file similar to the (10) one, but after deletion of every duplicated 
        gene
where the 3 id columns correspond respectively to : 
   - the feature number of the gene in the genome G, 
   - the previous (before deletion of non-homologous gene) gene number in the 
     genome in A
   - the new gene number in the genome in A

(13) MicroDetails.txt
     -> details of the micro-inversions
After every duplicated and badly-placed gene has been deleted, inversions can 
be defined, and reference genomes can be used to determine whether each one is 
an inversion between A and G1 or between A and G2. 

In the "/5Ancestor" directory :
(14) A.NAM1.NAM2.def
     -> ancestral genome 
one gene per line :
   - chr     : chromosome id 
   - id_G1   : the feature number of the gene in G1
   - id_G2   : the feature number of the gene in G2
   - strand  : 1 or -1
   - id/ch   : the gene number in the chromosome in A
   - id/g    : the gene number in the genome in A
   - scMacro : scores associated to the gene adjacencies at the macro-level
   - scMicro : scores associated to the gene adjacencies at the micro-level

(15) A.NAM1.def, A.NAM1.prt and A.NAM1.ch (A.NAM2.def, A.NAM2.prt and A.NAM2.ch)
     -> final ancestral genome A defined with G1's genes (resp. G2's genes)
These files can be used as inputs for SynChro, PhyChro, ReChro and AnChro.

(16) A.NAM1.NAM2.sum
     -> summary of the ancestral reconstruction




