# pathSTR-1000G

Repository for the pathSTR web app at <https://pathstr.bioinf.be>

If pathSTR is useful to you, please cite our publication:  
De Coster W, Höijer I, Bruggeman I, D'Hert S, Melin M, Ameur A, Rademakers R. 2024.  
Visualization and analysis of medically relevant tandem repeats in nanopore sequencing of control cohorts with pathSTR.  
Genome Res. doi: 10.1101/gr.279265.124.  


## aSTRonaut companion script

We provide [aSTRonaut](https://github.com/wdecoster/pathSTR/blob/main/scripts/aSTRonaut.py) for a more flexible approach to create the "sequence" repeat composition plot. This script requires VCF files, as generated by STRdust, but could be adapted to handle LongTR genotypes. Please let me know if that is of interest.

### Installation aSTRonaut

The script requires a few Python dependencies, which can be installed using pip (run from the root of the repository, where `requirements.txt` is located) or conda/mamba:

```bash
pip install -r requirements.txt
# or
conda create -n astronaut pandas cyvcf2 plotly
conda activate astronaut
```

Clone the repository and run the script:

```bash
git clone https://github.com/wdecoster/pathSTR/
python pathSTR/scripts/aSTRonaut.py --help
```

### aSTRonaut usage

#### Example

```bash
python aSTRonaut.py data/*.vcf.gz --kmer 3 --repeat chr1:1000000 -o test.html
python aSTRonaut.py data/*.vcf.gz --motifs ATC,ATG,ATT --repeat chr1:1000000 -o test.html
```

```text
usage: aSTRonaut.py [-h] [-k KMER] [--motifs MOTIFS] [--repeat REPEAT] [-o OUT] [-m MINLEN] [-n NUMBER]
                    [--hide-labels] [--publication] [--title TITLE] [--sampleinfo SAMPLEINFO]
                    vcf [vcf ...]

Create a repeat sequence plot similar to the pathSTR <sequence> composition vizualization, but stand-alone

positional arguments:
  vcf                   VCF files to analyze

options:
  -h, --help                show this help message and exit
  -k, --kmer KMER           Kmer length to use for plot
  --motifs MOTIFS           Specify the motifs to plot, comma separated
  --repeat REPEAT           Chromosome and POS of repeat to plot e.g. chr1:12345
  -o, --out OUT             Output file name (html)
  -m, --minlen MINLEN       Minimal repeat length to plot
  -n, --number NUMBER       Number of kmers to plot
  --hide-labels             Hide sample labels
  --publication             Create a plot for publication
  --title TITLE             Title of the plot
  --sampleinfo SAMPLEINFO   TSV file with sample information
```

The tool takes either an integer motif length (`-k`, `--kmer`) or a predefined, comma-separated list of motifs (`--motifs`). The optional sampleinfo file is a TSV that should contain a "name" and a "group" column; samples whose group is "case" are highlighted in the plot, and other group values are ignored.
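
As an illustration of the sample information file described above, the following sketch writes a minimal TSV with the two expected columns. The sample names and the "control" label are made up for the example; only the "name" and "group" columns and the "case" value come from the description above.

```python
import csv
import io


def write_sampleinfo(rows, handle):
    """Write (name, group) pairs as a tab-separated file with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "group"])
    writer.writerows(rows)


# Hypothetical sample names; "case" is the group value that gets highlighted,
# any other value (here "control") is ignored.
example_rows = [
    ("sample_A", "case"),
    ("sample_B", "control"),
    ("sample_C", "case"),
]

buffer = io.StringIO()
write_sampleinfo(example_rows, buffer)
sampleinfo_text = buffer.getvalue()
print(sampleinfo_text)
```

Writing the same content to a file instead of a buffer gives something that could be passed to `--sampleinfo`.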

We did our best to make aSTRonaut generate pretty visualizations for all scenarios, but if it doesn't look great for your data, please let us know. Feature requests are highly appreciated.

## Repeat genotyping workflow

### Installation

This repository contains the code for the pathSTR web app. For a local installation, you will need to install the dependencies as specified in requirements.txt. You can do this by running the following command in the root directory of the repository:

```bash
pip install -r requirements.txt
```

The pathSTR-1000G.smk workflow uses snakemake to manage the pipeline. You can install snakemake using pip:

```bash
pip install snakemake
```

### Usage

Some paths have to be set at the top of the workflow file (`workflow/pathSTR-1000G.smk`), including the reference genomes, the `work_dir` and the location of the STRdust binary.
This was not developed into a more convenient configuration file yet, as I do not anticipate many users running this pipeline. It is provided for completeness and transparency. Note that reading from remote files may, on rare occasions, fail.

The pipeline is run as:

```bash
snakemake -s workflow/pathSTR-1000G.smk --use-conda --cores 24 --keep-going --retries 3
```

The workflow extracts the VCFs of samples that were sequenced to at least 32 Gb (or 10X estimated genome coverage) and generates a zip archive per caller/build dataset for the samples passing that filter (e.g. `pathSTR_STRdust_good_samples.zip`). Those zip archives can then be moved and used for the PathSTR-db (`1000G.pathSTRdb`) and for the web app.
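
To make the sequencing-yield filter concrete, here is a minimal sketch of the arithmetic. It assumes that "Gb" means gigabases and uses an approximate human genome size of 3.1 Gb; both constants and the function names are illustrative and are not taken from the workflow itself.

```python
# Assumed values for illustration only; the workflow's own constants may differ.
MIN_YIELD_BASES = 32e9      # the 32 Gb threshold mentioned above
GENOME_SIZE_BASES = 3.1e9   # approximate human genome size (assumption)


def estimated_coverage(yield_bases, genome_size=GENOME_SIZE_BASES):
    """Estimate genome coverage as total sequenced bases / genome size."""
    return yield_bases / genome_size


def passes_yield_filter(yield_bases, min_yield=MIN_YIELD_BASES):
    """Return True if a sample reaches the minimum sequencing yield."""
    return yield_bases >= min_yield


# Hypothetical per-sample yields in bases.
example_yields = {"sample_A": 45e9, "sample_B": 20e9}
for name, bases in example_yields.items():
    print(name, round(estimated_coverage(bases), 1), passes_yield_filter(bases))
```

Under these assumptions, a yield of 32 Gb corresponds to slightly more than 10X coverage.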

The web app can build the database from a file of filenames (fofn) that lists the paths to the VCF files together with the reference genome build and the genotyper. The file should not have a header line. A fofn can be generated for each caller/build combination using the commands below, which concatenate the results into a single file.

```bash
python scripts/prep_fofn.py --vcf_dir ~/local/pathSTR/STRdust/hg38/ --caller STRdust --build hg38 > genotypes.fofn
python scripts/prep_fofn.py --vcf_dir ~/local/pathSTR/STRdust/t2t/ --caller STRdust --build t2t >> genotypes.fofn
python scripts/prep_fofn.py --vcf_dir ~/local/pathSTR/LongTR/hg38/ --caller LongTR --build hg38 >> genotypes.fofn
python scripts/prep_fofn.py --vcf_dir ~/local/pathSTR/LongTR/t2t/ --caller LongTR --build t2t >> genotypes.fofn
```

The database, an hdf5 archive, can then be constructed using the command below. The sample_info file is also generated by the snakemake workflow.

```bash
python app.py --vcf genotypes.fofn --save_db 1000G.pathSTRdb --sample_info data/pathSTR_samples.tsv --store_only
```

The web app is then started with the following command, optionally with `--debug` to run Dash in debug mode.

```bash
python app.py --db 1000G.pathSTRdb
```

## Further notes

### Preparing a list of samples for the Gustafson source

List all files in the AWS bucket, for both the hg38 and chm13/t2t BAM files from the standard minimap2 pipeline:

```bash
aws s3 ls 1000g-ont --no-sign-request --recursive | grep -i minimap | grep bam$  | cut -f4 -d' ' > data/list-miller-20240619.txt
```

This results in a list of paths to BAM files, with duplicates, for both chm13 and hg38. These are handled further by the snakemake pipeline.
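
For readers who want to inspect such a list by hand, the sketch below removes exact duplicate lines and groups the remaining paths by reference build. It is not the logic used in the snakemake workflow; it assumes that the build name ("hg38" or "chm13") appears somewhere in each path, and the example paths are hypothetical rather than real bucket contents.

```python
def deduplicate(paths):
    """Drop exact duplicate paths while keeping the original order."""
    return list(dict.fromkeys(paths))


def split_by_build(paths):
    """Group paths by reference build, assuming the build name is in the path."""
    groups = {"hg38": [], "chm13": []}
    for path in paths:
        for build in groups:
            if build in path.lower():
                groups[build].append(path)
    return groups


# Hypothetical paths, not real bucket contents.
example_paths = [
    "sampleA/hg38/sampleA.minimap2.bam",
    "sampleA/chm13/sampleA.minimap2.bam",
    "sampleA/hg38/sampleA.minimap2.bam",
]

unique_paths = deduplicate(example_paths)
by_build = split_by_build(unique_paths)
print(by_build)
```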

### Adapting LongTR to genotype remote cram files

LongTR checks whether its input is a valid local path, but otherwise works fine on remote CRAM files.
To remove the local path check, remove or comment out [these two lines](https://github.com/gymrek-lab/LongTR/blob/d9323818eea55cbf55ac72ee7992c6b901a25bdc/src/bam_io.cpp#L70) in bam_io.cpp before running `make`. It is possible that this [will become a feature](https://github.com/gymrek-lab/LongTR/issues/10) of LongTR and that this hack will then no longer be necessary.
