# README

This repository contains the code used to perform the analysis for the Lake Malawi cichlid pangenome.

The most important subfolder is `script/`, where most of the code is located. Within this folder, several subdirectories cover many of the early parts of the analysis (graph construction, preparing reference genomes), while `R_SCRIPT/` hosts R scripts for downstream analyses and plots.



## Descriptions of scripts

### A_preliminary
Download data and reference genomes, followed by preprocessing such as indexing, creating chromosome aliases and BUSCO analysis.

### B_minigraph
Build pangenome graphs using `minigraph`, do variant calling and calculate spanning coverage.

### C_repmodel_repmask
Annotate potential transposable elements with RepeatModeler and RepeatMasker. Also test different parameters and options for RepeatModeler2 (LTR option, RepBase).

### D_pseudoreference
Generate pseudogenomes for all species based on the reference graph, to calculate SV/TE enrichment and do presence-absence variation analysis. This makes use of the `ODGI` software.

### E_zebra_bb_checks
Rerun selected analyses with mayZeb as the backbone, including graph construction (1) and PAV analysis (2 & 3).

### malawi_haplochromines_conserv
Snakemake workflow to perform pairwise comparisons of species by constructing bi-assembly graphs and calculating the amount of sequence conserved between them.

### malawi_haplochromines_sweep
Parameter sweep to test minigraph's robustness to backbone choice, the order in which species are added, and minimum variant size (L).

### pcr_mapping_validation
Select certain bubbles to validate with PCR. Generate FASTA sequences of the different alleles at the bubbles for reference and comparison.

### te_permutation_shuffle
Snakemake workflow to test for statistical enrichment of TEs in SVs through a shuffling randomisation algorithm.

### wider_context_comparisons
Snakemake workflow to construct multiassembly graphs in other organisms (great apes, chicken) and the wider East African radiation, providing comparisons that put the graph statistics in context.



## Description of R scripts and Markdown
See the `script/R_SCRIPT/` directory.

### 0: early stage diagnostics, preprocessing, transposon annotation
* Create data structures required for certain analyses (`0a-createCichlidTxDB.R`, `0a-extractTxDB_asBED.R`)
* Explore parameter sweep results (`0b-parameter_sweep_analysis.Rmd`)
* Annotate genome with different RepeatModeler parameters (`0c-compare_RepModel_params.Rmd`)
* Identify potentially problematic gene and TE annotations (`0d-identifyFalseGenes.R`, `0d-identify_artifact_DNA_TEfamily.R`)
* Miscellaneous analyses such as contig size distribution and BUSCO (`0e-misc.Rmd`)

### 1: preprocessing of SVs and TEs
* Preprocess bubbles identified by minigraph and explore properties of the multiassembly graph, mainly to show that the graph has low overall complexity (`1a-preprocess_SV.Rmd`)
* Preprocess transposon annotations across species, as annotated by the astCal-specific TE library, and explore global-level statistics (`1b-preprocess_TE.Rmd`)
* Inspect some complex bubbles manually (`1c-explore_complex_bubble.R`)

### 2: analysis of structural variants
* Further explore species relationships through structural variant alleles, and characterise the type, diversity and sample frequency of SVs (`2a-analyse_sv_properties.Rmd`)
* Link structural variants to gene features (`2b-link_sv_to_genes.Rmd`)
* See how many nodes and alleles are shared between assemblies (`2c-allele_sharing.Rmd`)
* Bootstrapping to estimate significance of SV and gene feature overlap (`2d-estimate_gene_sv_overlap_significance.R`)
* Functional gene ontology enrichment (`2e-run_gene_ontology_SV.R`)

### 3: intersection of SVs and TEs
* Compute transposon composition in flexible regions of species (`3a-calculate_TE_composition.Rmd`)
* Permutation test for statistical TE enrichment (`3b-plot_permutation_test.Rmd`)
* Produce gene lists for TE/SV and gene intersections (`3c-link_te_and_sv_to_genes.R`)

### 4: presence-absence variation analysis
This is done for both astCal and mayZeb backbones.

* Preprocessing: find backbone regions and segments with coverage (`4a-pav_preprocessing.R`, `4a-pav_preprocessing_zebra.R`)
* The actual PAV analysis (`4b-pav_plots.Rmd`, `4b-pav_plots_zebra.Rmd`)
* PFAM protein domains in private genes (`4c-pfam_plots.Rmd`)
* Identify artefactual private genes that actually have an ortholog, and perform Gene Ontology analysis (`4d-double_check_private_genes.R`)

### 5: pangenome growth and overall view
* Describe the growth of the pangenome (`5a-pangenome_growth.Rmd`)
* Plot SV and TE density along chromosomes (`5b-plot_density_by_chr.Rmd`)

### 6: TE subclass expansions
* Detect polymorphic TE insertion events across genomes, anchored to astCal backbone (`6a-TE_subclass_expensions.Rmd`)
* Gene Ontology analysis, which reveals nothing significant (`6b-TE_gene_ontology.R`)

### 7: Phylogenetic tree and shared events
* Construct a presence-absence matrix from SV data (`7a-prepare_phylogenetic_data.R`)
* Build tree and plot the results (`7b-build_phylogenetic_tree-docker.R`, `7c-plot_phylogenetic_tree.R`)



## Environment setup

Most scripts here need to be run in one of the Python or R environments.

### Python
The Python environment is set up using `conda` (the commands below use `mamba`, a drop-in replacement), and the environment `yml` files are found in the root directory.

```sh
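# create the environment from the yml file (first-time setup)
mamba env create -f env.yml
# update an existing environment after env.yml has changed
mamba env update -f env.yml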
```

### R

For R, in order to use the `renv` virtual environment, start a new RStudio project in this `script/` directory. The idea behind `renv` is that it detects what packages are used in this directory and subdirectories, and writes them into a lock file, which can be used to restore the environment.

```R
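renv::init()     # set up renv for this project
renv::status()   # report differences between the lockfile and the project library
renv::restore()  # install the package versions recorded in renv.lock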
```

Bioconductor packages may need to be installed separately. More info [here](https://rstudio.github.io/renv/reference/install.html).
```R
renv::install("bioc::____")
```

These files need to be committed whenever the R environment is changed.
```
.Rprofile
renv.lock
renv/activate.R
renv_dependencies.R  # lists packages not imported explicitly and Bioconductor packages
```

### Bash

Most Bash scripts make use of global environment variables that need to be loaded first.
```sh
source source_alias.sh  # contains aliases to main directories in this repo
```
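
A script that runs without these variables can fail in confusing ways, so a small guard at the top of a script may help. The following is a minimal sketch only: `require_vars` is a hypothetical helper that is not part of this repository, and `STORAGE_DIR` in the usage note is a placeholder name, not a variable known to be defined in `source_alias.sh`.

```sh
# Hypothetical helper (not part of this repository):
# fail early if any of the named variables is unset or empty.
require_vars() {
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "variable not set: $name (did you source source_alias.sh?)" >&2
            return 1
        fi
    done
    return 0
}
```

A script could then start with `require_vars STORAGE_DIR || exit 1`, substituting the variable names it actually relies on.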

Most scripts in this repository use symlinks to access the large data files that cannot be committed to this repository.
```sh
# create a symlink to the storage location for large files that should not be committed
ln -s /Volumes/path/to/storage/ storage
```
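
Because the symlink points at external storage, it can be missing or dangling (for example, when the volume is not mounted). As a hedged sketch, the hypothetical helper below (`check_storage_link`, not part of this repository) checks that a path is a symlink and that it resolves to an existing directory; the default name `storage` is taken from the command above.

```sh
# Hypothetical helper (not part of this repository):
# succeed only if $1 (default: storage) is a symlink that resolves to a directory.
check_storage_link() {
    link="${1:-storage}"
    if [ ! -L "$link" ]; then
        echo "missing symlink: $link" >&2
        return 1
    fi
    # -d follows the symlink, so this fails when the target does not exist.
    if [ ! -d "$link" ]; then
        echo "broken symlink: $link (is the volume mounted?)" >&2
        return 2
    fi
    echo "ok: $link"
}
```

Running `check_storage_link storage || exit 1` before a long job is one way to catch an unmounted volume early.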

