# SPAD
This code is used to call Speckle Associated Domains (SPADs), compare SPADs across cell lines, and correlate SPADs with gene expression.

## Percentile normalization
SPADs are defined as genomic regions in the top 5% of SON TSA-Seq scores (above the 95th percentile). We therefore first convert the 20kb-binned smoothed TSA-Seq enrichment scores (TSA-Seq_hanning_20kbx21.wig) into percentiles. This code generates the .wig and .bw files of 20kb-binned TSA-Seq score percentiles (TSA-Seq_hanning_20kbx21_percentile.wig, TSA-Seq_hanning_20kbx21_percentile.bw).

```shell
python TSA_percentile_norm_TSA2.0.py -w TSA-Seq_hanning_20kbx21.wig -q 100 -g utilities/hg38_Gap.bed -o TSA-Seq_hanning_20kbx21_percentile -gg utilities/hg38F.genome
# Use genome size file hg38F.genome for the female cell line (K562) and hg38M.genome for the male cell lines (H1, HCT116, HFFc6).
```

Figure 2A (middle) was generated from the .bw files (TSA-Seq_hanning_20kbx21_percentile.bw) by this code.
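
As an illustration of the idea only, the sketch below ranks each bin's score against all other bins and reports the result on a 0-100 scale. It is not the code in TSA_percentile_norm_TSA2.0.py: the function name `scores_to_percentiles` is hypothetical, the rank-based definition is an assumption, and the gap filtering (`-g`) and the `-q` option of the real script are not reproduced here.

```python
from bisect import bisect_right


def scores_to_percentiles(scores):
    """Map each score to the percentage of bins whose score is <= it.

    Assumption: a plain rank-based percentile on a 0-100 scale. The real
    script may bin, round, or exclude gap regions differently.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, s) / n for s in scores]


# Hypothetical enrichment scores for four 20kb bins.
example_scores = [0.2, -1.0, 1.5, 0.7]
print(scores_to_percentiles(example_scores))  # [50.0, 25.0, 100.0, 75.0]
```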

## SPADs calling
Using the percentile .wig file from the previous step (TSA-Seq_hanning_20kbx21_percentile.wig), identify bins above the 95th percentile and merge adjacent bins to call SPADs.

```shell
python BigPercentiles_TSA2.0.py -w TSA-Seq_hanning_20kbx21_percentile.wig -o TSA-Seq_hanning_20kbx21_percentile -p 95 -g utilities/hg38F.genome -win 20000
# Use genome size file hg38F.genome for the female cell line (K562) and hg38M.genome for the male cell lines (H1, HCT116, HFFc6).
```
This code generates TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed, which contains all 20-kb bins above the 95th percentile, and TSA-Seq_hanning_20kbx21_percentile_above_95.0_mergeAdjacent.bed (with a corresponding .bb file), which contains the SPADs obtained by merging adjacent 20-kb bins.

It also generates histograms summarizing region size and number.

Figure 2A (bottom) was generated from the .bb files (TSA-Seq_hanning_20kbx21_percentile_above_95.0_mergeAdjacent.bb) by this code.
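
The thresholding and merging step can be sketched as follows. This is a minimal illustration, not the code in BigPercentiles_TSA2.0.py: the function names are hypothetical, bins are represented as in-memory tuples rather than read from a .wig file, and treating "above" as strictly greater than the threshold is an assumption.

```python
def merge_adjacent_bins(bins):
    """Merge (chrom, start, end) bins that touch end-to-start on one chromosome."""
    merged = []
    for chrom, start, end in sorted(bins):
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            # The previous region ends where this bin starts: extend it.
            merged[-1] = (chrom, merged[-1][1], end)
        else:
            merged.append((chrom, start, end))
    return merged


def call_regions(percentile_bins, threshold=95.0):
    """Keep bins above the threshold (assumed strict), then merge adjacent ones."""
    above = [(c, s, e) for c, s, e, p in percentile_bins if p > threshold]
    return merge_adjacent_bins(above)


# Hypothetical (chrom, start, end, percentile) records for four 20kb bins.
example_bins = [
    ("chr1", 0, 20000, 96.0),
    ("chr1", 20000, 40000, 97.5),
    ("chr1", 40000, 60000, 80.0),
    ("chr1", 60000, 80000, 99.0),
]
print(call_regions(example_bins))
# [('chr1', 0, 40000), ('chr1', 60000, 80000)]
```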

## Compare SPADs from different cell lines
### Take SPADs from one cell line, check region percentiles in other cell lines
Take the TSA-Seq_hanning_20kbx21_percentile_above_95.0_mergeAdjacent.bed file generated in the previous step for one cell line (cell0) and collect all of its regions (SPADs). For each region, calculate the mean percentile (the mean of the 20kb-binned values within the region) in each of the other cell lines (cell1, cell2, cell3), using their TSA-Seq_hanning_20kbx21_percentile.wig files as input. The code returns a box plot showing the distribution of region mean percentiles in the other three cell lines (cell0withOthers.eps).

```shell
python checkSpadsInOtherCellLine_v2_TSA2.0.py -b cell0_TSA-Seq_hanning_20kbx21_percentile_above_95.0_mergeAdjacent.bed -w1 cell1_TSA-Seq_hanning_20kbx21_percentile.wig -w2 cell2_TSA-Seq_hanning_20kbx21_percentile.wig -w3 cell3_TSA-Seq_hanning_20kbx21_percentile.wig -o cell0withOthers.eps
```

Figure 2B and Supplementary Figures 5A-C were generated by this code.
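
The region mean percentile can be sketched as below. This is an illustration under stated assumptions, not the code in checkSpadsInOtherCellLine_v2_TSA2.0.py: the function name is hypothetical, bins are in-memory tuples, and only bins lying fully inside a region are counted.

```python
def region_mean_percentile(region, percentile_bins):
    """Mean percentile of the bins that fall entirely inside a region.

    region: (chrom, start, end) from one cell line's SPAD .bed file.
    percentile_bins: (chrom, start, end, percentile) from another cell line.
    Returns None when no bin overlaps the region.
    """
    chrom, start, end = region
    values = [
        p for c, s, e, p in percentile_bins
        if c == chrom and s >= start and e <= end
    ]
    return sum(values) / len(values) if values else None


# Hypothetical SPAD from cell0 and percentile bins from cell1.
spad = ("chr1", 0, 40000)
cell1_bins = [
    ("chr1", 0, 20000, 90.0),
    ("chr1", 20000, 40000, 80.0),
    ("chr1", 40000, 60000, 10.0),
]
print(region_mean_percentile(spad, cell1_bins))  # 85.0
```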

### 4-way Venn diagram
Use the 20kb-binned TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed files from the four cell lines, generated in the "SPADs calling" step, to draw a 4-way Venn diagram with Intervene (version 0.6.4, https://intervene.readthedocs.io/en/latest/).

```shell
intervene venn -i cell0_TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed cell1_TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed cell2_TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed cell3_TSA-Seq_hanning_20kbx21_percentile_above_95.0.bed --filenames
```

Supplementary Figure 5F was generated by this code. The numbers reported by Intervene are counts of 20kb bins. Genomic sizes were calculated by multiplying the reported counts by 20 kb.
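
The conversion from bin counts to genomic size is simple arithmetic, shown here for convenience. The function name and the example count are hypothetical and do not come from the actual Venn diagram.

```python
BIN_SIZE_BP = 20_000  # 20kb bins, as used throughout this README


def bins_to_megabases(n_bins, bin_size_bp=BIN_SIZE_BP):
    """Convert a count of fixed-size bins into a genomic size in Mb."""
    return n_bins * bin_size_bp / 1_000_000


# Hypothetical count for one Venn diagram sector.
print(bins_to_megabases(150))  # 3.0
```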

## Check data noise
Use the 20kb-binned TSA-Seq_hanning_20kbx21_percentile.wig files generated in the "Percentile normalization" step to compare two biological replicates and assess data noise.

```shell
python percentile_correlation_color_hist2d_TSA2.0.py -w1 rep1_TSA-Seq_hanning_20kbx21_percentile.wig -w2 rep2_TSA-Seq_hanning_20kbx21_percentile.wig -x bio_rep1 -y bio_rep2 -o bio_rep_percentile_scatter.eps
```
This code generates a genome-wide 2D histogram of percentile (20kb bin) correlation between the two biological replicates, with the number of bins color-coded (bio_rep_percentile_scatter.eps).

```shell
python percentile_correlation_color_hist2d_90_TSA2.0.py -w1 rep1_TSA-Seq_hanning_20kbx21_percentile.wig -w2 rep2_TSA-Seq_hanning_20kbx21_percentile.wig -x bio_rep1 -y bio_rep2 -o bio_rep_above90_percentile_scatter.eps
```
This code generates a 2D histogram of percentile (20kb bin) correlation between the two biological replicates for the top-10-percentile regions, with the number of bins color-coded (bio_rep_above90_percentile_scatter.eps).

Supplementary Figures 5D,E were generated by this code.
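
The counting behind such a 2D histogram can be sketched as follows. This is an illustration only, not the code in percentile_correlation_color_hist2d_TSA2.0.py: the function name and the grid cell width are assumptions, and plotting is left out.

```python
from collections import Counter


def percentile_hist2d(rep1, rep2, cell_width=10.0):
    """Count 20kb bins per grid cell of (rep1 percentile, rep2 percentile).

    rep1 and rep2 hold the percentiles of the same genomic bins, in the
    same order. Percentiles are assumed to lie in [0, 100]; a value of
    exactly 100 is placed in the last grid cell.
    """
    last_cell = int(100 // cell_width) - 1
    counts = Counter()
    for p1, p2 in zip(rep1, rep2):
        i = min(int(p1 // cell_width), last_cell)
        j = min(int(p2 // cell_width), last_cell)
        counts[(i, j)] += 1
    return counts


# Hypothetical percentiles for four bins in two replicates.
rep1 = [5.0, 95.0, 97.0, 100.0]
rep2 = [8.0, 92.0, 99.0, 100.0]
print(dict(percentile_hist2d(rep1, rep2)))  # {(0, 0): 1, (9, 9): 3}
```

Counts concentrated along the diagonal cells indicate that the two replicates assign similar percentiles to the same bins.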

## Gene expression analysis comparing SPADs and other genomic regions

R code and analyses by Yang Zhang (Ma lab, CMU). See the "Expression" subfolder for code and details.

Figure 2C and Supplementary Figures 5G,H,I were generated by this code.


