# filter analysis

## consensus calling for phylogenetic tree construction

Trees are generated by calling the consensus base sequence from the aligned ONT reads. The threshold, expressed as the percentage of reads that must support the majority base, was chosen by comparing different majority-base thresholds against Illumina-sequenced genomes.
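
To make the thresholding idea concrete, here is a minimal sketch of majority-base consensus calling. It is not the code in `consensus.nf`; the function name `call_consensus`, the `min_depth` parameter, and the toy pileup columns are all assumptions made for illustration. A position is called only if the most common base reaches the required fraction of supporting reads, otherwise it is masked as `N`.

```python
from collections import Counter


def call_consensus(columns, threshold=0.8, min_depth=1):
    """Call one base per pileup column; emit 'N' when support is too low.

    columns   -- iterable of strings, each holding the read bases at one position
    threshold -- fraction of reads that must agree with the majority base
    min_depth -- minimum number of A/C/G/T observations needed to call a base
    """
    consensus = []
    for column in columns:
        # Ignore gaps and ambiguity codes; only count real base observations.
        bases = [b for b in column.upper() if b in "ACGT"]
        if len(bases) < min_depth:
            consensus.append("N")
            continue
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count / len(bases) >= threshold else "N")
    return "".join(consensus)


# Toy data (made up): one string per reference position.
columns = ["AAAAAAAAAA", "AAAAAAACCC", "GGGGGGGGGT", "--"]
print(call_consensus(columns, threshold=0.8))  # ANGN
print(call_consensus(columns, threshold=0.6))  # AAGN
```

In this sketch, a stricter threshold masks more positions, while a looser one calls more bases on weaker support.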

```
nextflow run ~/soft/ngonpipe/threshold_analysis/consensus.nf --inFiles inputs.txt --base ~/soft/ngonpipe/
python3 ~/soft/ngonpipe/scripts/parseDNAdiff.py diffs/
python3 ~/soft/ngonpipe/threshold_analysis/analysis.py
```

This produced a couple of plots including:

### Coverage over percentage of supporting reads required to call a base

![](images/Coverage_over_supporting_reads.png)

### Number of False SNPs over percentage of supporting reads required to call a base

![](images/SNPs_over_supporting_reads.png)


## Nanopolish variant calling VCF filtering

The Nanopolish variant caller produces VCF files with a QUAL column. Some SNPs with lower QUAL scores are incorrect. Here I am trying to determine which QUAL threshold to use to remove false SNPs and INDELs.
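
As a rough illustration of the threshold search, the sketch below sweeps candidate QUAL cutoffs over a set of VCF records and counts how many retained calls are true or false against a truth set. It is not `vcf_stats.py`; the helper names `parse_vcf` and `sweep_thresholds`, the contig name, the QUAL values, and the truth set are all made up for the example. It assumes a truth set of `(chrom, pos, ref, alt)` tuples is available, for instance from Illumina data.

```python
def parse_vcf(lines):
    """Yield (chrom, pos, ref, alt, qual) for each record in VCF text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        if qual == ".":
            continue  # record has no QUAL score
        yield chrom, int(pos), ref, alt, float(qual)


def sweep_thresholds(records, truth, thresholds):
    """Return (threshold, true_positives, false_positives) for each cutoff."""
    rows = []
    for cutoff in thresholds:
        kept = [(c, p, r, a) for c, p, r, a, q in records if q >= cutoff]
        tp = sum(1 for variant in kept if variant in truth)
        rows.append((cutoff, tp, len(kept) - tp))
    return rows


# Made-up example records; "contig1" and all values are hypothetical.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tA\tG\t250.0\tPASS\t.",
    "contig1\t200\t.\tC\tT\t35.5\tPASS\t.",
    "contig1\t300\t.\tG\tGA\t21.0\tPASS\t.",
    "contig1\t400\t.\tT\tC\t120.0\tPASS\t.",
]
truth = {
    ("contig1", 100, "A", "G"),
    ("contig1", 300, "G", "GA"),
    ("contig1", 400, "T", "C"),
}

records = list(parse_vcf(vcf_lines))
for cutoff, tp, fp in sweep_thresholds(records, truth, [0, 30, 100]):
    print(f"QUAL >= {cutoff}: {tp} true, {fp} false")
```

In the toy data, a cutoff of 100 removes the one false call but also drops a true INDEL with a low QUAL score.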


```
nextflow run ~/soft/ngonpipe/threshold_analysis/variants.nf --inFiles inputs.txt
python3 ~/soft/ngonpipe/threshold_analysis/vcf_stats.py VCFs_map/ /mnt/Data2/MDR_GC/spiked/ngonpipe/basemixes/H18-208_R00000419_v3.fasta
```

This produced a couple of plots including:

### QUAL score by True or False SNPs

![](images/SNP_validated_qualts.png)

### QUAL score over support fraction

![](images/QUAL_over_support_fraction.png)

### QUAL score over total reads

![](images/QUAL_over_reads.png)

### QUAL score by indel length

![](images/indel_lengths_qual.png)

### Relative SNP or INDEL proximity doesn't appear to be a factor for high quality False SNPs

![](images/SNP_relative_proximity.png)

### No bias from genome position 

![](images/SNP_locations.png)

## Todo

- [ ] determine cutoff for nanopolish
    - [ ] look at a range of depths by downsampling BAM files: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 100, 200, 500 etc
    - [ ] look at proximity to other INDELs or SNPs
    - [ ] Look at different references - closer reference, fewer high QUAL FPs?
- [ ] track FP and FN against Illumina
    - [ ] how does filtering affect these numbers
    - [ ] how many do we miss from filtering?
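
For the downsampling item above, one possible approach is to compute the fraction of reads to keep for each target depth and pass it to `samtools view -s`, which takes a value of the form `SEED.FRACTION`. The sketch below only builds the command strings and does not run them. The mean depth of 200, the file name `sample.bam`, the seed, and both function names are assumptions for illustration. It also assumes depth scales roughly linearly with the number of reads kept.

```python
def subsample_fractions(mean_depth, targets):
    """Map each target depth to the fraction of reads to keep."""
    fractions = {}
    for target in targets:
        fraction = round(target / mean_depth, 4)
        # Targets at or above the original depth cannot be reached by downsampling.
        if 0 < fraction < 1:
            fractions[target] = fraction
    return fractions


def samtools_command(bam, target, fraction, seed=42):
    """Build a samtools command string that keeps `fraction` of the reads."""
    # samtools view -s reads the integer part as the seed and the decimals as the fraction.
    decimals = f"{fraction:.4f}".split(".")[1]
    out = bam.replace(".bam", f".depth{target}.bam")
    return f"samtools view -b -s {seed}.{decimals} -o {out} {bam}"


targets = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 100, 200, 500]
fractions = subsample_fractions(mean_depth=200, targets=targets)  # 200x is hypothetical
for target, fraction in fractions.items():
    print(samtools_command("sample.bam", target, fraction))
```

Targets of 200 and 500 are skipped here because the assumed starting depth is 200.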

