
# PanTax: Strain-level metagenomic profiling using pangenome graphs

> [!IMPORTANT]
> This repository contains the raw custom code used for the experiments. The scripts include absolute paths.

## Database construction

+ `ref13404`, designed for simulated datasets and for real datasets including PD human gut, Omnivorous human gut, and Healthy human gut, in the `Base tasks` of multi-species profiling: `db_construction/tools/pantax/ref13404/work.sh`

+ `zymo`, designed for `Zymo1` and `Zymo2` in `Base tasks`

+ `gtdb100`, designed for `sim-high-gtdb` in `Benchmark 4`

+ `ses_sim_strain_mixtures`, designed for `S. epidermidis strain mixtures` in `Base tasks` and `Benchmark 2`

+ `ses_real_strain_mixtures`, designed for `two cultured S. epidermidis strain mixtures` in `Base tasks`

+ `refdiv`, designed for `Benchmark 6`

## Strain taxonomy

Add strains to `nodes.dmp` and `names.dmp`, assign custom strain taxids, and obtain the corresponding mappings.

The strain-level taxonomy construction for the `ref13404` and `zymo` databases is located in `db_construction/tools/kraken2/*/kraken_build.sh`.

The strain-level taxonomy construction for the `gtdb100` database is located in `gtdb_taxonomy/work.sh`.
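
The first step of this section (adding strains to `nodes.dmp` and `names.dmp` with custom taxids) can be sketched as follows. This is a minimal illustration, not the code in `kraken_build.sh` or `gtdb_taxonomy/work.sh`. It assumes the standard NCBI dump layout (fields separated by `\t|\t`, rows ending in `\t|`), writes only the first three columns of `nodes.dmp`, and uses a hypothetical `genome_to_species` dictionary and function names introduced here for illustration.

```python
FIELD_SEP = "\t|\t"   # NCBI taxonomy dump field separator
ROW_END = "\t|\n"     # NCBI taxonomy dump row terminator


def max_taxid(nodes_path):
    """Return the largest taxid already present in nodes.dmp."""
    with open(nodes_path) as fh:
        return max(int(line.split(FIELD_SEP, 1)[0]) for line in fh if line.strip())


def add_strains(nodes_path, names_path, genome_to_species, first_taxid=None):
    """Append one strain node per genome and return {genome_id: strain_taxid}.

    genome_to_species is a hypothetical input mapping genome IDs to the
    taxid of their parent species. Both dump files are assumed to end
    with a newline. Only taxid, parent taxid, and rank are written to
    nodes.dmp; real dumps carry additional columns.
    """
    next_taxid = first_taxid if first_taxid is not None else max_taxid(nodes_path) + 1
    mapping = {}
    with open(nodes_path, "a") as nodes, open(names_path, "a") as names:
        # Sort genome IDs so that taxid assignment is reproducible.
        for genome_id, species_taxid in sorted(genome_to_species.items()):
            strain_taxid = next_taxid
            next_taxid += 1
            nodes.write(
                FIELD_SEP.join([str(strain_taxid), str(species_taxid), "strain"]) + ROW_END
            )
            names.write(
                FIELD_SEP.join([str(strain_taxid), genome_id, "", "scientific name"]) + ROW_END
            )
            mapping[genome_id] = strain_taxid
    return mapping
```
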

## Benchmark

> [!IMPORTANT]
> Most of the evaluation tasks can be automatically generated by combining datasets and different tools using `scripts/auto_report.py`; specific commands can be found in `scripts/auto.sh`. Information about the datasets is available in `scripts/configs/dataset`, information about the tools is in `scripts/configs/tools`, and the corresponding tool commands are in `scripts/configs/tools/tools_work_shell`.

### Benchmark 1: Base tasks (Datasets for multi- and single-species strain-level taxonomic profiling)

Taking PanTax as an example, the listing below illustrates the mapping between tasks, datasets, and tools.

+ Datasets for multi-species strain-level taxonomic profiling evaluation tasks, Simulated datasets (sim-low, sim-high)

`sim-low` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow_pantax_mode0.sh`

`sim-low` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow_pantax_mode1.sh`

`sim-high` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh_pantax_mode0.sh`

`sim-high` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh_pantax_mode1.sh`

+ Datasets for multi-species strain-level taxonomic profiling evaluation tasks, Simulated datasets with introduced mutations (sim-low-mut1, sim-high-mut1, sim-low-mut2, sim-high-mut2)

`sim-low-mut1` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-sub0.001_pantax_mode0.sh`

`sim-low-mut1` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-sub0.001_pantax_mode1.sh`

`sim-high-mut1` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh-sub0.001_pantax_mode0.sh`

`sim-high-mut1` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh-sub0.001_pantax_mode1.sh`

`sim-low-mut2` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-sub0.01_pantax_mode0.sh`

`sim-low-mut2` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-sub0.01_pantax_mode1.sh`

`sim-high-mut2` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh-sub0.01_pantax_mode0.sh`

`sim-high-mut2` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simhigh-sub0.01_pantax_mode1.sh`
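
The simulated-dataset script paths listed above follow a regular pattern: the dataset label maps to a short identifier, and the `default` and `fast` settings map to `mode0` and `mode1`. The sketch below reproduces that pattern as a reading aid. It is inferred from the listed paths only and is not taken from `scripts/auto_report.py`; the names `DATASET_IDS`, `MODES`, and `script_path` are hypothetical.

```python
# Mode suffixes as they appear in the listed script names.
MODES = {"default": 0, "fast": 1}

# Dataset labels used in this README and the identifiers in the script names.
DATASET_IDS = {
    "sim-low": "simlow",
    "sim-high": "simhigh",
    "sim-low-mut1": "simlow-sub0.001",
    "sim-high-mut1": "simhigh-sub0.001",
    "sim-low-mut2": "simlow-sub0.01",
    "sim-high-mut2": "simhigh-sub0.01",
}

WORK_DIR = "base_tasks_multi_benchmark/pantax/strain_level_work2"


def script_path(dataset, mode):
    """Build the PanTax script path for a simulated dataset and mode."""
    return f"{WORK_DIR}/strain_level_{DATASET_IDS[dataset]}_pantax_mode{MODES[mode]}.sh"


if __name__ == "__main__":
    # Print every dataset/mode combination covered by the mapping above.
    for dataset in DATASET_IDS:
        for mode in MODES:
            print(dataset, mode, script_path(dataset, mode))
```
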

+ Datasets for multi-species strain-level taxonomic profiling evaluation tasks, Real datasets (Zymo1 and Zymo2, PD human gut, Omnivorous human gut, Healthy human gut)

`Zymo1` `PanTax` default: `base_tasks_multi_benchmark/pantax/zymo1_strain_level_work2/zymo1_strain_level_zymo1_pantax_mode0.sh`

`Zymo1` `PanTax` fast: `base_tasks_multi_benchmark/pantax/zymo1_strain_level_work2/zymo1_strain_level_zymo1_pantax_mode1.sh`

`Zymo2` `PanTax` default: `base_tasks_multi_benchmark/pantax/zymo1_strain_level_work2/zymo1_strain_level_zymo1-log_pantax_mode0.sh`

`Zymo2` `PanTax` fast: `base_tasks_multi_benchmark/pantax/zymo1_strain_level_work2/zymo1_strain_level_zymo1-log_pantax_mode1.sh`

`PD human gut` `PanTax` default: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_pd_pantax_mode0.sh`

`PD human gut` `PanTax` fast: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_pd_pantax_mode1.sh`

`Omnivorous human gut` `PanTax` default: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_omnivorous_pantax_mode0.sh`

`Omnivorous human gut` `PanTax` fast: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_omnivorous_pantax_mode1.sh`

`Healthy human gut` `PanTax` default: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_healthy_pantax_mode0.sh`

`Healthy human gut` `PanTax` fast: `base_tasks_multi_benchmark/pantax/human_gut_strain_level_work2/human_gut_strain_level_healthy_pantax_mode1.sh`

+ Datasets for multi-species strain-level taxonomic profiling evaluation tasks, Spiked-in datasets

`spiked-in` `PanTax` default: `spiked-in/pantax/spiked_in_strain_level_species666_large_pangenome_work2/spiked_in_strain_level_species666_large_pangenome_pantax.sh`

+ Datasets for single-species strain-level taxonomic profiling evaluation tasks, Simulated datasets: S. epidermidis strain mixtures (3 strains, 5 strains, 10 strains)

`S. epidermidis strain mixtures` `PanTax` default: `base_tasks_single_benchmark/simulated_datasets_ses/pantax/pantax_work_dst3.sh`

+ Datasets for single-species strain-level taxonomic profiling evaluation tasks, Real datasets: two cultured S. epidermidis strain mixtures

`two cultured S. epidermidis strain mixtures` `PanTax` default: `base_tasks_single_benchmark/real_datasets_ses/pantax/work_v2.sh`


### Benchmark 2: Effects of divergence versus ratio of abundances

`benchmark2/pantax/pantax_work_dst.sh`

### Benchmark 3: Effects of reducing sequencing coverage

`sim-low-sub1` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-subsample0.5_pantax_mode0.sh`

`sim-low-sub1` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-subsample0.5_pantax_mode1.sh`

`sim-low-sub2` `PanTax` default: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-subsample0.2_pantax_mode0.sh`

`sim-low-sub2` `PanTax` fast: `base_tasks_multi_benchmark/pantax/strain_level_work2/strain_level_simlow-subsample0.2_pantax_mode1.sh`

### Benchmark 4: PanTax is effective for larger reference metagenome databases

`sim-high-gtdb` `PanTax` fast: `base_tasks_multi_benchmark/pantax/gtdb_strain_level_work2/gtdb_strain_level_simhigh-gtdb_pantax_mode1.sh`

### Benchmark 5: Benchmarking PanTax against long-read metagenome assemblers for strain-level profiling

`PanTax` profiling results are reported in `Benchmark 1: Base tasks`.

`sim-low` `hifiasm`: `benchmark5/hifiasm/strain_level_work/strain_level_simlow_hifiasm.sh`

`sim-high` `hifiasm`: `benchmark5/hifiasm/strain_level_work/strain_level_simhigh_hifiasm.sh`

`Zymo1` `metaMDBG`: `benchmark5/metamdbg/zymo1_strain_level_work/zymo1_strain_level_zymo1_metamdbg.sh`

### Benchmark 6: Robustness of PanTax across reference diversity benchmarks

`SimRef1` `PanTax` default: `benchmark6/pantax/reference_diversity_strain_level_work2/reference_diversity_strain_level_refdiv_pantax1.sh`

### Benchmark 7: Benchmarking the impact of graph complexity and long read sequencing technologies on alignment accuracy

Based on the results from `Benchmark 6`.

`SimRef1` `PanTax` default: `benchmark6/pantax/reference_diversity_strain_level_work2/reference_diversity_strain_level_refdiv_pantax1.sh`

### Benchmark 8: Alternative solvers benchmarking

`benchmark8/scripts/work.sh`

### Benchmark 9: Sensitivity analysis of key parameters in strain level profiling

`benchmark9`

## Evaluation

`scripts/strain_evaluation.py`

## Plots

All plotting scripts are stored in `scripts/plot_scripts`.
