# NMRDMR Snakemake Workflow: Cross-Species Regulatory Elements from Histone Mark Peaks

[![Snakemake](https://img.shields.io/badge/snakemake-≥5.13-brightgreen.svg)](https://snakemake.bitbucket.io)
[![Snakemake-Report](https://img.shields.io/badge/snakemake-report-green.svg)](report.html)

The pipeline starts from **H3K4me3, H3K4me1 and H3K27ac peaks** in different species and outputs sets of **promoters, enhancers and primed enhancers** for each species, all mapped to a **common coordinate system** using a reference species (mouse). **H3K27ac read density** for the sets of **orthologous promoters and enhancers** is then extracted from .bam files and **normalized** across species and replicates.

## Table of contents

  - [Description](#description)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Versions](#versions)
  - [Authors](#authors)
  - [References](#references)

## Description

The workflow image is provided in `nmrdmr_pipeline_dag.pdf`. The pipeline takes as input:

- (i) a samplesheet describing the samples (species, mark, tissue, peak file name...). Its format is rigid, see `data/NMRDMR_DatasetSummary_Villar_210816.txt` for an example.

- (ii) the corresponding peak files. Peak files must have the `.narrowPeak` extension to be recognized by the pipeline.

- (iii) the bam files (or a file with urls to automatically download the bams, see `data/samples.tsv`).

- (iv) one TSS .bed file per species, used for plots. File names should match the pattern `TSS.biomart.{species_name}.bed`.
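The file naming conventions above can be checked before launching a run. The following is a minimal sketch and not part of the pipeline: the helper `check_inputs` and the example file names are hypothetical, and only the `.narrowPeak` extension and the `TSS.biomart.{species_name}.bed` pattern come from this README.

```python
import re
from pathlib import Path

# Pattern stated in this README for TSS files: TSS.biomart.{species_name}.bed
TSS_PATTERN = re.compile(r"^TSS\.biomart\.(?P<species>.+)\.bed$")


def check_inputs(peak_files, tss_files):
    """Return a list of naming problems (hypothetical helper, not pipeline code)."""
    problems = []
    for name in peak_files:
        # Peak files are only recognized with the .narrowPeak extension.
        if Path(name).suffix != ".narrowPeak":
            problems.append(f"peak file without .narrowPeak extension: {name}")
    for name in tss_files:
        # Only the file name is checked, not the directory it sits in.
        if TSS_PATTERN.match(Path(name).name) is None:
            problems.append(
                f"TSS file not matching TSS.biomart.{{species_name}}.bed: {name}"
            )
    return problems


if __name__ == "__main__":
    # Example file names are made up for illustration.
    for problem in check_inputs(
        ["peaks/sample1.narrowPeak", "peaks/sample2.bed"],
        ["tss/TSS.biomart.Mus_musculus.bed"],
    ):
        print(problem)
```

An empty list means that all file names follow the two conventions. The content of the files is not inspected.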

Briefly, the pipeline runs in 7 steps, each implemented in its own module:

- Step 1, reproducible peaks, takes peak files as input and finds reproducible peaks among replicates (blue bubbles in the workflow image, see `module_consensus_peaks.smk` for implementation details).

- Step 2, histone marks combination, combines histone marks to predict regulatory elements (yellow bubbles in the workflow image, see `module_regulatory_elements.smk` for implementation details).

- Step 3, first liftover pass, converts regulatory element sets into one reference coordinate system, using Ensembl pairwise alignments (alternatively custom .chain files, see below) and the liftover tool (red bubbles in the workflow image, see `module_liftover_to_ref.smk` for implementation details).

- Step 4, solve regulatory element overlaps, homogenizes regulatory element types across species, using a majority vote procedure. This resolves cases where, for instance, orthologous sequences are defined as promoters in some species but as enhancers in others (brown bubbles in the workflow image, see `module_solve_element_overlaps.smk` for implementation details).

- Step 5, greylists, defines greylisted genomic regions from ChIP input .bam files, using chipseq-greylists (dark grey bubbles in the workflow image, see `module_greylists.smk` for implementation details).

- Step 6, second liftover pass, defines the set of orthologous regions, i.e. sets of regulatory elements that can be confidently aligned across all species (green bubbles in the workflow image, see `module_mappable_regions.smk` for implementation details).

- Step 7, read coverage, computes read coverage in .bam files for the sets of orthologous regulatory elements and normalizes the values across species and replicates (pink bubbles in the workflow image, see `module_read_coverage.smk` for implementation details).

The last module `module_quality_control_plots.smk` (purple bubbles in the workflow image) produces plots at key steps of the pipeline.
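The majority vote of Step 4 can be illustrated with a short sketch. This is not the code of `module_solve_element_overlaps.smk`: the function `majority_vote`, the species names and the handling of ties (returning an `"ambiguous"` label) are assumptions made for the example, since this README does not describe how ties are resolved.

```python
from collections import Counter


def majority_vote(labels, tie_label="ambiguous"):
    """Return the most frequent element type among species.

    Ties return `tie_label`; this tie rule is an assumption of the sketch.
    """
    counts = Counter(labels).most_common()
    if not counts:
        raise ValueError("at least one label is required")
    # A tie occurs when the two most frequent labels have the same count.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


if __name__ == "__main__":
    # Hypothetical calls for one orthologous region in three species.
    calls = {"species_a": "promoter", "species_b": "promoter", "species_c": "enhancer"}
    print(majority_vote(calls.values()))
```

In this example, the region is called a promoter in two species out of three, so the vote assigns the promoter type to the orthologous region in all species.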

## Installation

### Dependencies

- conda
- snakemake=6.9

### Install conda

The Miniconda3 package management system manages all of the pipeline's dependencies, including python packages and other software (bedtools, liftover...).

To install Miniconda3:

- Download the Miniconda3 installer for your system [here](https://docs.conda.io/en/latest/miniconda.html)

- Run the installation script: `bash Miniconda3-latest-Linux-x86_64.sh` or `bash Miniconda3-latest-MacOSX-x86_64.sh`, and accept the defaults

- Open a new terminal, run `conda update conda` and press `y` to confirm updates

### Install snakemake

To install snakemake in a conda environment (for example in an env named `snake`), run the following commands:

- `conda install -c conda-forge mamba`

- `mamba create -c conda-forge -c bioconda -n snake snakemake==6.9`

After these commands, the installation is complete. Before running the pipeline, activate the environment with the command `conda activate snake`.

## Usage

### Configuration

To run the pipeline, first move to its root folder with `cd nmrdmr_pipeline`. Then, define the paths to the input files and all parameters in the configuration file, using `config_nmrdmr_final.yaml` as a template.

### Using custom chain files for liftover

Custom chain files can be provided instead of being downloaded directly from Ensembl. In this case, indicate their path in the configuration file, as follows:

```
    chain_files: "lastZ/mmus_grcm38.v.{sps}_lastz_net.all.chain"
```
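The `{sps}` placeholder in this path presumably stands for a species identifier, filled in once per species. This is an assumption based on the usual Snakemake wildcard syntax and is not stated in this README. Under that assumption, the expected file names can be listed as follows. The helper `chain_paths` and the identifiers `sps_a` and `sps_b` are hypothetical.

```python
def chain_paths(template, species):
    """Expand the {sps} placeholder for each species (hypothetical helper)."""
    return {sps: template.format(sps=sps) for sps in species}


if __name__ == "__main__":
    # Template copied from the configuration example above.
    template = "lastZ/mmus_grcm38.v.{sps}_lastz_net.all.chain"
    # Species identifiers are made up; use the ones of your samplesheet.
    for sps, path in chain_paths(template, ["sps_a", "sps_b"]).items():
        print(sps, path)
```

Listing the expanded paths this way helps to verify that one chain file exists per non-reference species before starting the liftover steps.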


### Running

- To run the full workflow with configuration defined in `config.yaml`:

    `snakemake --configfile config.yaml --cores=10 --use-conda`


- To execute a dry-run with full command details (useful to see what will be run without actually running it):

    `snakemake --configfile config.yaml --use-conda -n -p`


- To execute the workflow until a particular step is done, use the `--until` flag. For instance, to run until consensus peak sets have been defined:

    `snakemake --configfile config.yaml --cores=10 --use-conda --until merge_into_consensus`


- After a run, to generate an HTML run report with all generated figures:

    `snakemake --configfile config.yaml --report report.html`

## Versions

- beta2 version 23.08.2023 (commit 07a04d)
- beta version 19.09.2022 (commit 778147)
- alpha version 3.08.2020 (commit e86a5)


## Authors

* [**Elise Parey**](mailto:elise.parey@bio.ens.psl.eu)
* **Diego Villar Lozano**
* **Camille Berthelot**