# FishComparativeAtlas snakemake workflow archive: dataset and pipeline freeze at the time of publication

This archive can be used to reproduce the generation of the fish comparative atlas, to inspect the result directly, or both.

## Pipeline

FishComparativeAtlas is a snakemake pipeline that traces the evolution of sister duplicated chromosomes derived from whole-genome duplication in teleost genomes. The snakemake workflow is defined in the file `Snakefile` and calls python scripts stored in `src/`. This archive contains version v1.0.0 of the FishComparativeAtlas code, as used to generate the comparative atlas of 74 teleost genomes.

The conda environment to run the FishComparativeAtlas pipeline is provided in `envs/fish_atlas.yaml`.

## Input data

The paths to all input data are stored in the snakemake configuration file `config_atlas74_fish.yaml`.

### Main inputs

The main inputs to the FishComparativeAtlas pipeline are:

- ancestral chromosomes from before the teleost-specific genome duplication (pre-TGD) mapped on 4 teleost genomes (taken from Nakatani and McLysaght 2017), stored in `data/MacrosyntenyTGD/`,

- the SCORPiOs-corrected gene trees with genes of the 74 teleosts and 33 outgroups, in `data/atlas_74fish/SCORPiOs_corrected_forest_5_complete_tags.nhx`,

- the species tree used to build and reconcile the gene trees, in `data/atlas_74fish/species_tree.nwk`,

- the gene coordinate files for all 74 teleosts, in `data/atlas_74fish/genes/`.

### Additional inputs

Additional inputs in the `data/atlas_74fish/` folder include:

- `data/atlas_74fish/rename_chr.txt`, a conversion table to rename the pre-TGD ancestral chromosomes in `data/MacrosyntenyTGD/`, making them consistent with the ancestral chromosome names previously published in Nakatani and McLysaght (2017).

- `data/atlas_74fish/assembly_conversion/`, files to convert genes and coordinates from Ensembl 75 (data in Nakatani and McLysaght 2017) to Ensembl 95 (data in our gene trees).

- `data/atlas_74fish/CompAtlas_stats_trees_noSCORPiOs.txt`, annotation statistics from a previous run of the FishComparativeAtlas pipeline on phylogenetic gene trees built with TreeBeST but without SCORPiOs correction of WGD duplication nodes.

## Output

The generated comparative atlas is stored in `output/comparative_atlas.tsv`. It is a tab-delimited file with 3 columns: the unique identifier of the post-duplication gene family, all teleost genes in the family and the predicted post-duplication ancestral chromosome (1a, 1b, 2a...).

Gene names can be cross-referenced with the gene coordinate files (`data/atlas_74fish/genes/`) to obtain the gene-to-species correspondence.
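To inspect the atlas programmatically, the three-column layout described above can be read with the Python standard library. The following is a minimal sketch, not part of the pipeline: the function names `read_atlas` and `families_per_chromosome` are hypothetical, and the sketch assumes that the file has no header row and that genes within the second column are separated by a single space (adjust `gene_sep` if the file uses another separator).

```python
import csv
from collections import Counter


def read_atlas(lines, gene_sep=" "):
    """Yield (family_id, genes, ancestral_chromosome) for each atlas row.

    `lines` is any iterable of text lines, e.g. an open file handle.
    ASSUMPTION: no header row, and genes in column 2 are separated by
    `gene_sep`; neither detail is confirmed by this README.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        family_id, genes, chromosome = row
        yield family_id, genes.split(gene_sep), chromosome


def families_per_chromosome(rows):
    """Count gene families assigned to each post-duplication chromosome."""
    return Counter(chromosome for _, _, chromosome in rows)


def summarize(path="output/comparative_atlas.tsv"):
    """Print the number of gene families per predicted chromosome."""
    with open(path, newline="") as handle:
        counts = families_per_chromosome(read_atlas(handle))
    for chromosome, n_families in sorted(counts.items()):
        print(chromosome, n_families, sep="\t")
```

Calling `summarize()` from the root of the archive prints one line per predicted post-duplication chromosome with the number of gene families assigned to it.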

## Reproducing the output

- Create and activate the conda environment (alternatively, you can manually install the dependencies listed in `envs/fish_atlas.yaml`):

    `conda install mamba`

    `mamba env create -f envs/fish_atlas.yaml`

    `conda activate fish_atlas`

- Run FishComparativeAtlas (~ 5 minutes):

    `snakemake --configfile config_atlas74_fish.yaml --cores 4`

The output file `out_atlas74_fish/comparative_atlas.tsv` will be generated, along with figures with genomic annotations and statistics in `out_atlas74_fish/figures`.
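One way to check a reproduction is to compare the regenerated atlas with the copy shipped in `output/`. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and it assumes that the pipeline is deterministic and that gene family identifiers are stable between runs, which this README does not state. Rows are compared as sets, so differences in row order are ignored.

```python
def row_set(lines):
    """Return the non-empty rows of an atlas as a set (row order ignored)."""
    return {line.rstrip("\r\n") for line in lines if line.strip()}


def diff_atlases(archived_lines, regenerated_lines):
    """Return (rows only in the archived atlas, rows only in the new one)."""
    archived = row_set(archived_lines)
    regenerated = row_set(regenerated_lines)
    return archived - regenerated, regenerated - archived


def compare_files(archived_path="output/comparative_atlas.tsv",
                  regenerated_path="out_atlas74_fish/comparative_atlas.tsv"):
    """Report row-level differences between the two atlas files.

    ASSUMPTION: identical runs produce identical rows; if family
    identifiers change between runs, every row will differ.
    """
    with open(archived_path) as old, open(regenerated_path) as new:
        missing, extra = diff_atlases(old, new)
    print(f"{len(missing)} archived rows missing, {len(extra)} unexpected rows")
    return not missing and not extra
```

`compare_files()` returns `True` when both files contain exactly the same rows. If it reports differences, inspecting a few of the differing rows should show whether they reflect real changes or only formatting.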


## References

FishComparativeAtlas takes as input the pre-TGD ancestral chromosome predictions from:

- [Nakatani and McLysaght 2017](https://academic.oup.com/bioinformatics/article/33/14/i369/3953974): Nakatani Y, McLysaght A. 2017. Genomes as documents of evolutionary history: a probabilistic macrosynteny model for the reconstruction of ancestral genomes. Bioinformatics 33:i369–i378.
