Title : Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution
Author: Chris Papadopoulos, Isabelle Callebaut, Jean-Christophe Gelly, Isabelle Hatin, 
		Olivier Namy, Maxime Renard, Olivier Lespinet, Anne Lopes
		
		
This directory contains all the data and the script needed to repeat the analyses and 
regenerate every figure (main and supplemental) presented in the manuscript. 
All the data files are in the inputs directory. 
All the figures are generated by the R script Papadopoulos_et_al_ANALYSIS.R and are
stored in the outputs directory. 

	ATTENTION /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
	   The "files.path" variable must be modified: it is the path relative to which all 
	   the data tables are read and all the figures are written. 
	   Once you have downloaded the directory, edit line 256 of the script, where the 
	   files.path variable is defined, and set it to your own directory path. 
	/!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
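
A minimal sketch of setting files.path and checking the directory layout before 
running the script. The path "/path/to/Papadopoulos_et_al" and the helper variables 
inputs.dir, outputs.dir and cds.table are hypothetical; how the script itself joins 
files.path with file names is not shown in this README, so check line 256 and its 
surroundings (for example, whether a trailing slash is expected).

```r
# Hypothetical location: replace with the place where you unpacked the directory.
files.path <- "/path/to/Papadopoulos_et_al"

# Sub-directories described in DIRECTORY ORGANIZATION.
inputs.dir  <- file.path(files.path, "inputs")
outputs.dir <- file.path(files.path, "outputs")

# Example of one input table (inputs/Tables/CDS.csv).
cds.table <- file.path(inputs.dir, "Tables", "CDS.csv")

# Warn early if the layout is not found, instead of failing inside the script.
if (!dir.exists(inputs.dir)) {
  message("inputs directory not found under: ", files.path)
}
if (!dir.exists(outputs.dir)) {
  message("outputs directory not found under: ", files.path)
}
```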

DIRECTORY ORGANIZATION:

	A. inputs : All the input files needed to generate the figures. 
	
		a. Abundance : Abundance data for every protein in the cell 
			--> 4932-WHOLE_ORGANISM-integrated.tab  (from PaxDB)
			--> CYTOPLASM_NAMES.txt  (from UniProt)
			
		b. pfasta : Protein sequences in FASTA format
			--> AncIGORF.pfasta 
			--> Scer_CDS.pfasta
			--> Scer_IGORF.pfasta
			
		c. RiboSeq_periodicity_tabs : Counts of RPF (ribosome-protected fragment) reads 
		   for each nucleotide at the beginning of every CDS sequence. The reads are 
		   positioned on the sequence. 
		   R1 to R5 are the five experiments used in our analyses. 
		   --> R1_counts_mapping.tab  ===>  for GSM2147982 data
		   --> R2_counts_mapping.tab  ===>  for GSM2147983 data
		   --> R3_counts_mapping.tab  ===>  personal data under submission 
		   --> R4_counts_mapping.tab  ===>  personal data under submission
		   --> R5_counts_mapping.tab  ===>  for GSM1850252 data
		   
		d. RiboSeq_Reads_tables : Counts of RPF reads for each CDS sequence, both cumulated 
		   and split by each of the three reading frames of the ORF (P0, P1, P2). 
		   R1 to R5 are the five experiments used in our analyses. 
		   R6 should be ignored: unlike the other five experiments, its data are not 
		   of good quality. 
		   --> Scer_transcriptome_genes_riboreplicas_2020-12-10.tab
		   
		e. Tables : All the data tables used for the analyses and the figure generation. 
		    They contain the HCA foldability score, the aggregation and disorder propensities,
		    the number of predicted transmembrane domains and the sequence size. For the IGORFs,
		    they also contain the translation status ("non_translated","selectively","occasionally").
			
			--> CDS.csv : CSV file containing all the information about the CDS of S.cerevisiae
			--> IGORF.csv : CSV file containing all the information about the IGORFs of S.cerevisiae
			--> AncIGORF.csv : CSV file containing all the information about the ancIGORFs 
			    that gave birth to S.cerevisiae de novo genes.
			--> disprot_v7_protein_predictors_minsize30.tab : Information for reference disorder sequences
			--> globular.tab : Information for reference globular sequences
			--> Transmembrane_helices_20_nonreduntant.tab : Information for reference transmembrane helices sequences
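
As an illustration of the note on R6 in item d, the sketch below drops the R6 
columns from a per-frame read table and computes the fraction of in-frame (P0) reads 
for one experiment. The data frame is synthetic and the column names (gene, R1_P0, 
R6_P0, ...) are assumptions made for the example; the actual header of 
Scer_transcriptome_genes_riboreplicas_2020-12-10.tab is not described here and may differ.

```r
# Synthetic stand-in for the per-frame read table (column names are hypothetical).
reads <- data.frame(
  gene  = c("g1", "g2"),
  R1_P0 = c(80, 40), R1_P1 = c(10, 5), R1_P2 = c(10, 5),
  R6_P0 = c(3, 2),   R6_P1 = c(3, 2),  R6_P2 = c(4, 1),
  stringsAsFactors = FALSE
)

# Drop every column belonging to the R6 experiment.
reads.clean <- reads[, !grepl("^R6_", names(reads)), drop = FALSE]

# Fraction of R1 reads falling in the frame of the ORF (P0), per gene.
r1.total <- reads.clean$R1_P0 + reads.clean$R1_P1 + reads.clean$R1_P2
r1.p0.fraction <- reads.clean$R1_P0 / r1.total
```

With a real table, the same grepl pattern would need to be adapted to the actual 
column naming.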
			
	B. outputs : The directory where the script Papadopoulos_et_al_ANALYSIS.R writes all 
	   the figures (main and supplemental) presented in the manuscript. 
	   
	C. Papadopoulos_et_al_ANALYSIS.R : the R script that loads all the data files stored 
	   in the inputs directory and creates the figures of the manuscript. 
	   
	   ATTENTION /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
	   The "files.path" variable must be modified: it is the path relative to which all 
	   the data tables are read and all the figures are written. 
	   /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
	   
	   ATTENTION /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
	   The R script requires the following R packages to be installed before it is 
	   launched: 
	   		1. seqinr
	   		2. stringr
	   		3. dplyr
	   		4. MASS
	   		5. ineq
	   		6. ggpubr
	   		7. ggplot2
	   		8. cowplot
	   		9. fmsb
	   /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
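
One possible way to check which of the packages listed above are missing, using only 
base R. The variable names required and to.install are illustrative; the 
install.packages call is left commented out so that nothing is installed without 
your consent.

```r
# Packages listed in this README.
required <- c("seqinr", "stringr", "dplyr", "MASS", "ineq",
              "ggpubr", "ggplot2", "cowplot", "fmsb")

# Names of the packages available in the current R library paths.
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet.
to.install <- setdiff(required, installed)

if (length(to.install) > 0) {
  message("Missing packages: ", paste(to.install, collapse = ", "))
  # install.packages(to.install)  # uncomment to install them from CRAN
} else {
  message("All required packages are installed.")
}
```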


	   ATTENTION /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
	   In the Papadopoulos_et_al_ANALYSIS.R script, the variables for the HIGHLY translated 
	   IGORFs are systematically named "selectively". The initial name was "selectively 
	   translated"; it was later changed to "highly translated" for simplicity. In this 
	   analysis, the terms highly translated and selectively translated refer to 
	   exactly the same object. 
	   /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\ /!\
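
If you reuse the tables outside the script, the naming difference can be handled by 
mapping the stored status values to the manuscript wording. This is a sketch on a 
synthetic data frame: the column names id, status and label, and the exact display 
labels, are assumptions and are not taken from IGORF.csv.

```r
# Synthetic example; the real IGORF table may use a different column name.
igorf <- data.frame(
  id     = c("igorf_1", "igorf_2", "igorf_3"),
  status = c("non_translated", "selectively", "occasionally"),
  stringsAsFactors = FALSE
)

# Map the status values used in the data to manuscript-style labels.
status.labels <- c(
  non_translated = "non translated",
  selectively    = "highly translated",
  occasionally   = "occasionally translated"
)

igorf$label <- unname(status.labels[igorf$status])
```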
	     

		   