Title : Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution
Authors: Chris Papadopoulos, Isabelle Callebaut, Jean-Christophe Gelly, Isabelle Hatin, Olivier Namy, Maxime Renard, Olivier Lespinet, Anne Lopes


This pipeline was built to reconstruct the ancestral sequences of the de novo 
genes of S. cerevisiae. The aim is to detect the non-genic regions in the genomes of the 
neighboring species of S. cerevisiae that correspond to de novo genes in S. cerevisiae. 
To do so, we use BLAST to detect homologous regions. We first search the CDS 
sequences with blastp, then the intergenic regions with tblastn, and finally all the 
ORFs detected stop-to-stop with a size of more than 12 nucleotides. In this way we were able to
detect anchors on the genome (based on sequence homology) and extract the non-coding
genomic sequence for every de novo gene.
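
As an illustration of the stop-to-stop ORF definition used above, here is a minimal Python sketch. It is not the code of the pipeline (ORF detection is done with ORFtrack); the function name `stop_to_stop_orfs` is hypothetical, only the forward strand is scanned, and we assume that the 12-nucleotide threshold applies to the region between the two stop codons, stops excluded.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_to_stop_orfs(sequence, min_length=12):
    """Return (frame, start, end) for regions lying between two in-frame stops.

    Coordinates are 0-based and end-exclusive, and exclude both flanking stop
    codons. Only regions strictly longer than min_length nucleotides are kept
    (assumption: the threshold does not count the stop codons).
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        previous_stop_end = None  # no upstream stop seen yet in this frame
        for pos in range(frame, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] not in STOP_CODONS:
                continue
            # A stop codon closes the region opened by the previous stop.
            if previous_stop_end is not None and pos - previous_stop_end > min_length:
                orfs.append((frame, previous_stop_end, pos))
            previous_stop_end = pos + 3
    return orfs
```

Regions that are not bounded by a stop codon on both sides (for instance at the ends of a sequence) are ignored in this sketch.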

In the second part of the script we work on every de novo gene separately. We generate 
a nucleotide multiple sequence alignment (MSA) of each de novo gene with its non-coding partners
in the neighboring species using MACSE. The MSA was used by PhyML to generate
one phylogenetic tree per de novo gene. Then, the ancestral nucleotide sequence of 
every de novo gene was reconstructed using PRANK. Finally, the reconstructed ancestral 
nucleotide sequence was translated into ancIGORFs (3 possible reading frames, stop-to-stop)
and the ancIGORFs which gave birth to the de novo gene were identified by homology using 
the LALIGN tool. 
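
The translation of a reconstructed ancestral sequence into stop-to-stop fragments in the 3 reading frames can be sketched as follows. This is only an illustrative sketch, not the code of Extract_ancestors_fragments.py: the function names are hypothetical, the standard genetic code is assumed, codons containing ambiguous characters are translated as "X", and fragments at the ends of the sequence are kept even though they are bounded by a stop on one side only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame):
    """Translate a nucleotide sequence in the given frame (0, 1 or 2)."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    # Unknown codons (e.g. containing N or gaps) become "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_fragments(sequence):
    """Return, for each frame, the peptides delimited by stop codons."""
    return {
        frame: [pep for pep in translate(sequence, frame).split("*") if pep]
        for frame in range(3)
    }
```

For example, `three_frame_fragments("ATGAAATAGGGGCCC")` gives the fragments `MK` and `GP` in frame 0, `NRG` in frame 1 and `EIGA` in frame 2.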

The whole ancestral reconstruction procedure is presented step-by-step in the file 
Reconstruction_pipeline.txt found in the main directory. 
ATTENTION   /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
This protocol IS NOT a script that can be launched from a terminal as it is. It is an 
indicative step-by-step description of the procedure, the software and the parameters used 
for the ancestral reconstruction and identification. 
/!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\

In order to use this pipeline you need to pre-install the following software:  
	1. BLAST
	2. ORFtrack and ORFget (from package ORFmine : https://github.com/i2bc/ORFmine)
	3. pyHCA (from https://github.com/T-B-F/pyHCA)   
	4. MACSE
	5. seqret
	6. PhyML
	7. PRANK
	8. LALIGN
	


DIRECTORY ORGANIZATION:
	
	A. inputs: 
	   In this directory we have all the input files to be used for the analysis:
		a. genomes : The genomes fasta files for all the species 
		b. annotations : The genomes annotation GFF files for all the species
		c. CDS_protein_fasta : The fasta file of all the CDS sequences for every species
		d. denovo.pfasta : The protein sequence fasta file of all the de novo genes
		e. denovo_fastas : One protein sequence fasta file per de novo gene
		
	B. scripts:
	   In this directory we have all the in-house scripts used for the pipeline:
	   a. Reconstruction_pipeline.sh : The general pipeline with the commands to follow step-by-step
	   b. Extract_sequences.py
	   c. Extract_IGR.py
	   d. Extract_ancestors_fragments.py
	   e. Detect_IGORFs_on_denovo.py
	   f. Create_table_to_keep.R
	   g. tools.py
	   
	C. intermediate: 
	   All the intermediate files produced during the ancestral reconstruction procedure
	   are generated in this directory. For more details on these intermediate files, read
	   Reconstruction_pipeline.txt, which explains step-by-step all the commands and all the 
	   expected outputs. We provide all the intermediate files generated by one run of the 
	   pipeline with the given 6 genomes. 
	   
	D. AncFragments: 
	   Contains the ancIGORF sequences as reconstructed per de novo gene for S. cerevisiae.
	   a. ${gene_name}.frags : The reconstructed ancIGORFs per de novo gene
	   b. ${gene_name}.frags_ali : The reconstructed ancIGORFs per de novo gene aligned on the de novo gene sequence
	   
	E. AncFragments_HCA.tab : The final table containing the information for all the ancIGORFs
	   of all the reconstructed de novo genes. The columns in the table file are the following: 
	   1. De novo gene name ; 
	   2. ancIGORF name ; 
	   3. Ancestor species ; 
	   4. Localization on the sequence ; 
	   5. HCA score of de novo gene ;  
	   6. HCA score of ancIGORF ; 
	   7. coverage of the de novo gene by this ancIGORF ;  
	   8. coverage of the ancIGORF by this de novo gene ; 
	   9. HCA barcode of de novo gene; 
	   10. HCA barcode of ancIGORF; 
	   11. ancIGORF amino acid sequence

	F. Reconstruction_pipeline.txt : step-by-step ancestral reconstruction procedure pipeline
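
To work with the 11 columns of AncFragments_HCA.tab listed above, the table can be loaded into Python along these lines. This is a minimal sketch under several assumptions: the file is assumed to be tab-separated (suggested by the .tab extension only), the field names in `COLUMNS` are hypothetical labels chosen here to mirror the column descriptions, and all values are kept as strings because the exact formats are not specified. A header line, if present, is not treated specially.

```python
import csv

# Hypothetical field names, in the order of the 11 columns described above.
COLUMNS = [
    "denovo_gene",
    "ancigorf_name",
    "ancestor_species",
    "localization",
    "hca_score_denovo",
    "hca_score_ancigorf",
    "coverage_of_denovo",
    "coverage_of_ancigorf",
    "hca_barcode_denovo",
    "hca_barcode_ancigorf",
    "ancigorf_sequence",
]


def read_anc_fragments(lines, delimiter="\t"):
    """Parse an iterable of text lines into one dict per ancIGORF.

    Rows that do not have exactly 11 fields are skipped. The delimiter is an
    assumption and can be changed if the file uses another separator.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]
```

An open file handle can be passed directly, for example `with open("AncFragments_HCA.tab", newline="") as handle: rows = read_anc_fragments(handle)`.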

 
	



