
README by Zane Kliesmete on 20.12.2023
e-mail: kliesmete@bio.lmu.de, zane.kliesmete@gmail.com





Most important scripts (under scripts): 

FIGURE 1

1.0_DHS_peakcalling : scripts to call peaks per tissue and then merge the overlapping coordinates to generate CRE coordinates
1.0_JAMM_DA.R : differential accessibility (DA) analysis between tissues where a CRE was called and tissues where it was not
1.1_basics.Rmd : some preprocessing of DHS tables and peak analyses; saving most important summary tables on DHS-peaks (suppl fig1)
1.2_tissueExpression.Rmd : rpkm expression value wrangling per tissue (suppl fig2)
1.3_reg2gene.Rmd : region_id to gene association; linear model (LM) predicting expression (figure 2)
1.4_run_model_permutations_expression_pleiotropy.R : control permutations for expression prediction using CRE pleiotropic degree across human tissues
1.5_figure1.R : put figure 1 together (roadmap_DHS_summaries overview of the data etc)
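
As a minimal illustration of the merging step mentioned for 1.0_DHS_peakcalling, the sketch below merges overlapping per-tissue peak intervals into union regions. It is written in Python for brevity and is not taken from the repository: the function name merge_peaks, the (chrom, start, end) tuple layout, the half-open coordinate convention and the example coordinates are all assumptions, and the actual scripts may use different tools and merging rules.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals into union regions.

    Hypothetical helper: coordinates are assumed to be half-open, so
    intervals that merely touch (end == next start) are NOT merged.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # first interval on this chromosome
                current_start, current_end = start, end
            elif start < current_end:
                # overlaps the running region: extend it
                current_end = max(current_end, end)
            else:
                # gap: close the running region and start a new one
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up peaks pooled from several tissues
example = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
print(merge_peaks(example))
```

With these made-up inputs the two overlapping chr1 peaks collapse into one region, while the other two peaks are kept as they are.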


FIGURE 3

2.0_run_liftoff_mf6.sh and 2.0_final_filter_liftoffgtf.R : generate a gtf file for macaque from the human gtf using Liftoff
2.1_DE_analysis.R : gene expression preprocessing
2.2_RLO_prepCbust.R : identification of DHS coordinates in macaque; preparation of sequences and expressed-TF PWMs as inputs for Cluster-Buster
ATACseq/scripts/count_table.sh : count ATAC-seq reads in the CREs
2.3_OL_filtering_DA.R : more stringent 1-to-1, 0-to-0, 1-to-0 and 0-to-1 peak filtering between DHS vs ATAC for both species, and human vs macaque
2.4.0_analyse_roller2021 : additional analyses regarding CRE activity across mammals based on published data from Roller et al. 2021
2.4.1_Species_openness.Rmd : differential accessibility (openness DA) vs differential expression (DE) between species
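
The four categories named for 2.3_OL_filtering_DA.R (1-to-1, 0-to-0, 1-to-0, 0-to-1) can be read as the combination of two presence/absence calls for the same region. The Python sketch below only illustrates that labelling and is not the repository's filtering logic: the helper name peak_category, the region names and the reading that the first digit refers to the first data set (for example DHS or human) and the second digit to the second one (for example ATAC or macaque) are assumptions.

```python
def peak_category(called_in_first, called_in_second):
    """Label a region by whether a peak was called in each of two data sets.

    Hypothetical helper: returns "1-to-1", "0-to-0", "1-to-0" or "0-to-1".
    """
    return f"{int(bool(called_in_first))}-to-{int(bool(called_in_second))}"


# Made-up presence/absence calls per region: (first data set, second data set)
calls = {
    "region_a": (True, True),
    "region_b": (True, False),
    "region_c": (False, True),
    "region_d": (False, False),
}

categories = {region: peak_category(*pair) for region, pair in calls.items()}
print(categories)
```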


FIGURES 4+5

3.1_run_INSIGHT.Rmd : setting up all sequence conservation scripts 
3.2_plot_INSIGHT.Rmd : analysis + plots of the sequence conservation outcomes


FIGURES 2+6

in the ATACseq/scripts folder:

run_cbust_gorilla3.sh : runs Cluster-Buster (cbust)
summarizeTFsPerRegion.R : summarize TFs per cluster

splitCbust.R : splitting cbust outputs per sequence for human and macaque orthologs for the following position investigations
runTFBSPosition.sh : runs calculatePositionCons.R per region (slurm)
summarizeTFBSDistances.R : summarizes the TFBS distances, as the name says (also run via slurm)

back in the scripts folder:

4.1_TFBS_diversity.Rmd : put the new figure 2 together on TFs
4.2_combineAll.Rmd : make one big summary table; generate the combined scaled seq-cons + TFBS repertoire cons + TFBS position cons plot and the hexagon spider plot.
4.3.0_analyse_ballester : analyses of TF binding across mammalian species from Ballester et al. 2014
4.3.1_TFBS_repertoire.Rmd : put figure 5 together


FIGURE 7

5.1_figureExample.R : example gene promoter investigation and visualization



Note that the raw INSIGHT outputs, CRE fasta files, raw cluster-buster outputs, and per-TF-per-CRE binding site conservation summaries were too large to include. Their summarized outputs are included, as are the scripts to generate them. Contact me if anything is unclear.
