# CARDlongread_meth_R.9vs10
Comparison of methylation calling between ONT long-read R.9 and R.10 data

## modkit_methcompare.py

This Python script generates the split violin plots that compare modkit methylation frequencies over specified intervals for R9, R10, and bisulfite sequencing modkit files. These graphs were featured in xxx paper. 

The script compares interval-specific methylation frequencies across three modkit files, one of which is used to define the bins. 

## Input data

The script requires three different bed or bed-like files with columns for genomic position (chromosome and start/end position), probability of the target base being modified, and coverage level of the base called.  

The violin plot in the paper was made from bedMethyl files generated by modkit, a package for analysing ONT modified bases.  
More info about the modkit package and bedMethyl output file can be found at https://github.com/nanoporetech/modkit.  
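
As a rough illustration of the input format, the sketch below reads tab-separated bedMethyl lines and keeps only sites inside a coverage window. It is not taken from modkit_methcompare.py: the function name `read_sites`, the assumed column positions (coverage in the 10th column, percent modified in the 11th), and the inclusive coverage bounds are all assumptions that should be checked against the modkit documentation and the script itself.

```python
import csv
import io

# Assumed 0-based column positions in a tab-separated bedMethyl file;
# verify these against the modkit documentation for your version.
CHROM, START, COVERAGE, PERCENT_MOD = 0, 1, 9, 10


def read_sites(handle, cov_min=20, cov_max=200):
    """Return {(chrom, start): percent_modified} for sites within the coverage window."""
    sites = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        coverage = int(row[COVERAGE])
        # Inclusive bounds are an assumption of this sketch
        if cov_min <= coverage <= cov_max:
            sites[(row[CHROM], int(row[START]))] = float(row[PERCENT_MOD])
    return sites


# Two made-up records: the second falls below the minimum coverage of 20
example = io.StringIO(
    "chr1\t100\t101\tm\t25\t.\t100\t101\t255,0,0\t25\t80.00\t20\t5\t0\t0\t0\t0\t0\n"
    "chr1\t200\t201\tm\t5\t.\t200\t201\t255,0,0\t5\t40.00\t2\t3\t0\t0\t0\t0\t0\n"
)
sites = read_sites(example)
print(sites)
```

Keying sites by chromosome and start position makes it straightforward to match the same CpG across the three input files.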

The command used to generate the modkit files in the paper is shown below:

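```
#!/bin/bash

# Positional arguments: sample name, reference FASTA, input BAM, output path
SAMPLE_NAME=$1
REF=$2
BAM_FILE=$3
# OUT_PATH is concatenated directly with the file name, so it should end with "/"
OUT_PATH=$4

ml modkit 

# CpG pileup, ignoring 5hmC (h) calls and combining counts from both strands
modkit pileup --cpg --ref ${REF} --only-tabs --threads 24 --ignore h --combine-strands ${BAM_FILE} ${OUT_PATH}${SAMPLE_NAME}.hg38.modkit.comb.bed
```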

## Parameters


```--sample_name``` : sample name (string value)

```--r9_modkit``` : path to R9 bedfile

```--r10_modkit``` : path to R10 bedfile 

```--bis_modkit``` : path to bisulfite bedfile 

```--cov_min``` : minimum coverage threshold, default = 20 (int value)

```--cov_max``` : maximum coverage threshold, default = 200 (int value)

```--interval``` : number of evenly spaced intervals for binning data, default = 10 (ex. 0, 10, 20, 30, ... 100) (int value)

```--custom_interval``` : a list of custom unevenly spaced interval values for binning data (ex. [0, 5, 10, 50, 90, 95, 100])

```--binning``` : dataset to bin the graph by, either 'r9', 'r10', or 'Bisulfite' (string value)

```--bw``` : number from 0.0 - 1.0 that scales the violin plot bandwidth for more or less smoothing, default = 0.1 (float value)

```--scale``` : method used to normalize each density when determining the violin's width: 'width' = default, all violins have the same width; 'area' = all violins have the same area; 'count' = violin widths are proportional to the number of observations (string value)

```--out_dir``` : output directory path
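
To make the two binning options concrete, here is a minimal sketch of how `--interval` and `--custom_interval` could translate into bin edges over the 0 to 100 methylation range. The helper names `make_edges` and `assign_bin` are hypothetical, and the left-closed bin convention is an assumption; the script's actual implementation may differ.

```python
from bisect import bisect_right


def make_edges(interval=10, custom_interval=None):
    """Return bin edges: custom edges if given, else `interval` even bins over 0-100."""
    if custom_interval:
        return sorted(custom_interval)
    step = 100 / interval
    return [round(i * step, 6) for i in range(interval + 1)]


def assign_bin(value, edges):
    """Return a label such as '10-20' for the bin that contains `value`."""
    # bisect_right gives the index of the first edge greater than value;
    # clamping keeps a value equal to the top edge in the last bin.
    idx = max(1, min(bisect_right(edges, value), len(edges) - 1))
    return f"{edges[idx - 1]:g}-{edges[idx]:g}"


even_edges = make_edges(interval=10)
custom_edges = make_edges(custom_interval=[0, 5, 10, 50, 90, 95, 100])

print(assign_bin(15, even_edges))
print(assign_bin(100, even_edges))
print(assign_bin(97, custom_edges))
```

Uneven custom edges are useful when most CpG sites sit near 0% or 100% methylation, since narrower bins at the extremes keep those violins from lumping distinct sites together.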


## Sample run command 

```
python modkit_methcompare.py \
--cov_min 20 \
--cov_max 200 \
--r9_modkit /path/to/r9_modkit.bed \
--r10_modkit /path/to/r10_modkit.bed \
--bis_modkit /path/to/bis_modkit.bed \
--interval 10 \
--custom_interval [0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100] \
--sample_name HG002 \
--binning Bisulfite  \
--out_dir /path/to/out_dir
```

## Graphs

The modkit_methcompare.py script generates three separate graphs that can then be overlaid to form the final figure. 

The first graph is a split violin plot of the R9 and R10 methylation proportions binned by bisulfite intervals. 
![HG002_bis_VP](https://github.com/rgenner/R9_R10/assets/87498696/e1453298-0103-4c9e-b09c-4184705bdf2c)

The second graph is a line plot with lines connecting the median interval points in each sample.  
This can be vectorized to overlay the split violin plot. 
![HG002_Bisulfitebins_lines](https://github.com/NIH-CARD/CARDlongread_meth_R.9vs10/assets/87498696/7d86bd30-a37c-4c15-8f4b-1068b4d33f95)
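
The median points that the line plot connects could be computed roughly as follows. This is an illustrative sketch rather than the script's own code: `medians_by_bin` is a hypothetical helper, the site dictionaries hold made-up values, and the choice to keep only sites present in both datasets is an assumption.

```python
from collections import defaultdict
from statistics import median


def medians_by_bin(reference, sample, edges):
    """Median of `sample` values, grouped by the bin of the matching `reference` value."""
    grouped = defaultdict(list)
    for site, ref_value in reference.items():
        if site not in sample:
            continue  # keep only sites present in both datasets
        for lo, hi in zip(edges, edges[1:]):
            # Left-closed bins; the final bin also includes its upper edge
            if lo <= ref_value < hi or (hi == edges[-1] and ref_value == hi):
                grouped[(lo, hi)].append(sample[site])
                break
    return {bin_: median(values) for bin_, values in sorted(grouped.items())}


# Made-up methylation percentages keyed by (chromosome, start)
bis = {("chr1", 100): 4.0, ("chr1", 200): 8.0, ("chr1", 300): 95.0, ("chr1", 400): 100.0}
r10 = {("chr1", 100): 6.0, ("chr1", 200): 10.0, ("chr1", 300): 90.0, ("chr1", 400): 98.0}

result = medians_by_bin(bis, r10, [0, 10, 90, 100])
print(result)
```

Here the bisulfite values decide which bin a site belongs to, and the R10 values at those same sites supply the numbers that are summarized, mirroring the role of `--binning Bisulfite`.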

The last plot is a panel showing the distribution of CpG site methylation frequencies for each sample.  
This can be rotated 90° and added to the right side of the split violin graph.
![HG002_bis_VP_cov](https://github.com/rgenner/R9_R10/assets/87498696/5ac25345-2492-44bf-a02a-c3520320e7ac)


Below is the final figure assembly:
![HG002_bis_final_VP](https://github.com/rgenner/R9_R10/assets/87498696/cb812069-6732-4159-836f-d804e4e21cc9)
