Crunch
================

An integrated ChIP-seq analysis pipeline

Requirements
------------

To run the Crunch pipeline, the following software must be available on your system:

* Anduril 1 (http://www.anduril.org/site/resources/anduril1/)
* Java (https://www.java.com)
* Python 2.7 (https://www.python.org/)
* Perl 5 (https://www.perl.org/)
* R (https://www.r-project.org/)
* Bowtie 1 (http://bowtie-bio.sourceforge.net/index.shtml)
* Bedtools (https://bedtools.readthedocs.io)
* Phylogibbs (http://swissregulon.unibas.ch/sr/software)
* Motevo (http://swissregulon.unibas.ch/sr/software)

The Crunch pipeline is designed to be executed in an HPC environment running the SLURM workload manager.

Data necessary to run the Crunch pipeline
-----------------------------------------

To perform the Crunch analysis, the pipeline requires a number of datasets to be available on your system. Below is a list of required data files and corresponding parameter names which should be set before running the pipeline.

* The genomic sequence of the species from which the data derives and, when available, the genomes of related species, all in FASTA format. Provide one uncompressed FASTA file (.fa) per chromosome. The path to these data is set with the GENDIR_PATH option in the config/param_[organismID].yaml file and with the DB option in the AlignmentPipeline/conf/global_pipeline-login.conf file.

* A file specifying chromosome lengths. You can use the "fetchChromSizes" script from the UCSC Genome Browser website to retrieve these data. The corresponding option name is CHROMOSOME_INFO in the config/param_[organismID].yaml file.

* Output from RepeatMasker for the species of interest. Again, there should be one file per chromosome, all in a single RepeatMasker directory. For many species, these files can be downloaded from the UCSC Genome Browser website. The corresponding option name is REPEAT_PATH in the config/param_[organismID].yaml file.

* Annotation files containing promoter and associated gene information. Promoter annotations are available from the SwissRegulon database (swissregulon.unibas.ch) or from the ENSEMBL database (www.ensembl.org). The corresponding option name is ANNOTATION_FILE in the config/param_[organismID].yaml file.

* Files containing regulatory motifs in the form of position specific weight matrices (PSWMs). There should be one file per PSWM, all in a single motif directory. Such motifs can be downloaded from various databases. The Crunch web server uses a library of motifs from the SwissRegulon (swissregulon.unibas.ch), JASPAR (jaspar.genereg.net), HOCOMOCO (hocomoco.autosome.ru), HOMER (homer.ucsd.edu/homer), UNIPROBE (the_brain.bwh.harvard.edu/uniprobe) and ENCODE (www.encodeproject.org) databases. The corresponding option name is WMLIBRARY in the config/param_[organismID].yaml file.

* Bowtie index files for the species of interest. The corresponding option name is BOWTIE_INDEX_PATH in the config/param_[organismID].yaml file.

* Pairwise alignments of the genome of the organism of interest with those of related species. Such pairwise alignment files can be downloaded from the UCSC Genome Browser website. The files should be in MAF format, again with one file per chromosome. The corresponding option name is PAIRWISE_ALN_DIR in the AlignmentPipeline/conf/global_pipeline-login.conf file.
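
Because several of these datasets are expected as one file per chromosome, it can help to check a directory against the chromosome lengths file before starting a run. The following Python snippet is a minimal sketch and not part of Crunch. It assumes that the chromosome lengths file has the chromosome name in its first whitespace-separated column (as in the output of fetchChromSizes) and that the fasta files are named [chromosome].fa, which is an assumption about the naming scheme. The function names are hypothetical.

```python
import os


def read_chrom_names(chrom_sizes_path):
    """Return the chromosome names found in the first column of a sizes file."""
    names = []
    with open(chrom_sizes_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip empty lines
                names.append(fields[0])
    return names


def missing_fasta_files(chrom_sizes_path, genome_dir):
    """List chromosomes that have no <name>.fa file in genome_dir.

    Assumption: fasta files are named <chromosome>.fa; adjust the suffix
    or the naming rule if your files are organised differently.
    """
    return [
        name
        for name in read_chrom_names(chrom_sizes_path)
        if not os.path.isfile(os.path.join(genome_dir, name + ".fa"))
    ]
```

Calling missing_fasta_files with the file behind CHROMOSOME_INFO and the directory behind GENDIR_PATH returns an empty list when every chromosome has a matching fasta file.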

Preparing configuration files
-----------------------------

To run the Crunch pipeline, you have to enter the details of your system's setup into the configuration files. These files are located in the "config/" directory and are named params_[organism id].yaml. Change all paths in these files to match your system's configuration.
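
Since every path in the configuration has to match your system, a quick existence check can catch typos early. The following Python snippet is a minimal sketch and not part of Crunch. The function name and the placeholder paths are hypothetical, and the option values are typed in by hand here rather than read from the YAML file, because the layout of that file is not described in this document. BOWTIE_INDEX_PATH is left out on purpose, since a Bowtie index is often referenced by a file prefix rather than by an existing file or directory.

```python
import os


def find_missing_paths(options):
    """Return the sorted option names whose configured path does not exist."""
    return sorted(
        name for name, path in options.items() if not os.path.exists(path)
    )


if __name__ == "__main__":
    # Placeholder values; replace them with the paths from your own config file.
    example_options = {
        "GENDIR_PATH": "/path/to/genome_fasta_dir",
        "CHROMOSOME_INFO": "/path/to/chrom.sizes",
        "REPEAT_PATH": "/path/to/repeatmasker_dir",
        "ANNOTATION_FILE": "/path/to/promoter_annotation_file",
        "WMLIBRARY": "/path/to/motif_dir",
    }
    for name in find_missing_paths(example_options):
        print("Path not found for option " + name)
```

An empty result only means that the paths exist; it does not verify that the files have the expected format.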


Setting up and running the Crunch pipeline for your project
-----------------------------------------------------------

First, use a configuration template from the "config/" directory (i.e. the params_[organism id].yaml file) to make a project config file. Next, depending on the format of your input data, add paths to your input files to either the IP/BG_FASTQ_FILES, IP/BG_FASTA_FILES or IP/BG_BED_FILES section.

Then, execute the following command::

    [CRUNCH_ROOT]/scripts/run_Pipeline.py project_config_file

The run_Pipeline.py script will create all necessary files to set up the Crunch pipeline for your project, as well as print further instructions on how to start the actual Crunch pipeline. Please follow these instructions.
