AEP Genome Assembly

This document describes the assembly process for the strain AEP H. vulgaris genome. Our approach was as follows:

1. We generated an initial draft assembly with Canu using Nanopore reads.
2. We polished the draft with Pilon using 10X linked reads.
3. We identified and broke mis-assemblies with Tigmint using the 10X data.
4. We merged uncollapsed haplotigs with purge_haplotigs based on the 10X read depth distribution.
5. We scaffolded the de-duplicated contigs with ARCS using the 10X data.
6. We filled the gaps introduced by ARCS with PBJelly using Nanopore and PacBio reads.
7. We assembled pseudo-chromosome scaffolds with the Juicer and 3d-dna pipelines using Hi-C data.
8. We filled the new gaps created by the Hi-C scaffolding with PBJelly using PacBio and Nanopore data.
9. Finally, we polished the genome with Pilon using 10X, PacBio, and Nanopore data.

De-Novo Strain AEP H. vulgaris Genome Assembly

To generate the initial draft genome, we used Canu (v2.0) to perform a de novo assembly of our Nanopore sequencing data:

(00_canuAssembly/runCanu.sh)

The resulting fasta file hydra_aep.canu.contigs.fasta was used as the starting point for the rest of the genome assembly process.

Resulting Genome Stats

Assembly size statistics were generated using assembly-stats (github.com/sanger-pathogens/assembly-stats)

(stats for 00_canuAssembly/hydra_aep.canu.contigs.fasta)
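For reference, the headline N50 metric reported by assembly-stats can be computed as follows (an illustrative Python sketch, not part of our pipeline):

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```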

Polishing the Initial Draft Assembly with Pilon

While the Nanopore reads used to generate the Canu draft assembly are quite long, they are also error prone. In this step, we attempted to fix some of those errors using our 10X data, which was sequenced on a conventional Illumina instrument and has much more accurate base calls.

Mapping 10X Data to the Canu Draft Genome

Pilon requires as input a bam file of reads aligned to the genome. We used the longranger pipeline (v2.2.2) to map our 10X reads. First, we prepped the Canu draft genome for mapping:

(01_initialPilon/makeCanuRef.sh)

Then we mapped the 10X fastq files:

(01_initialPilon/runLrAlign.sh)

This generated the file possorted_bam.bam that we used for the Pilon polishing.

Prepping Canu Draft for Pilon Run

We next polished the genome using Pilon (v1.23). Because Pilon has high memory requirements, we split the draft genome into ~50 Mb chunks to reduce memory overhead. First, we determined contig lengths:

Then we used the following R script to split the genome fasta into ~50 Mb chunks

(01_initialPilon/makeContigGroups.r)
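The grouping logic amounts to greedily packing contigs into a chunk until it approaches the 50 Mb target. A Python sketch of that logic (function name hypothetical; our actual implementation is the R script above):

```python
def chunk_contigs(contig_lengths, max_chunk=50_000_000):
    """Greedily group (name, length) pairs so that each group's
    total length stays at or below the max_chunk target."""
    chunks, current, size = [], [], 0
    for name, length in contig_lengths:
        # start a new chunk when adding this contig would exceed the target
        if current and size + length > max_chunk:
            chunks.append(current)
            current, size = [], 0
        current.append(name)
        size += length
    if current:
        chunks.append(current)
    return chunks
```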

Running Pilon

The following script looped through the contig chunk list generated above, extracted the corresponding sequences, and ran Pilon on just that subset of sequences.

(01_initialPilon/runPilon.sh)

Likely because contGroups.txt didn't end with an empty line, the script missed the last chunk of contigs, so this script caught that last chunk of the genome:

(01_initialPilon/runPilonLastOne.sh)

We then combined the output files into a new, polished draft genome

cat pilOut/*fasta > canuPilon.fasta

Resulting Genome Stats

(stats for 01_initialPilon/canuPilon.fasta)

Breaking Mis-Assemblies with TigMint

Next, we wanted to cross-reference our Canu draft with our 10X reads to try to identify possible mis-assemblies. We did this using TigMint (v1.1.2).

Prepping 10X Fastq Files with Longranger

Although the TigMint pipeline handles the actual mapping of the data, it requires that the raw 10X data go through some initial processing by longranger:

(01_initialPilon/runLrBasic.sh)

The resulting processed fastq file (barcoded.fastq.gz) was unzipped and renamed to reads.fastq

Running TigMint

TigMint is executed using a makefile that's included when the software is installed. We slightly modified the config section of this makefile (lines 1 to 58) and renamed the file from tigmint-make to tigmint-make-mod:

(02_tigmint/tigmint-make-mod)

Prior to running TigMint, we set aside any contigs smaller than 2 Kb, which TigMint would ignore anyway
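The length-based split can be sketched as follows (illustrative Python with a minimal FASTA parser; not the commands we actually ran):

```python
def read_fasta(text):
    """Minimal FASTA parser: yield (header, sequence) tuples."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def split_by_length(text, min_len=2000):
    """Partition records into those long enough for TigMint to use
    and the small contigs to set aside and re-add afterwards."""
    big, small = [], []
    for name, seq in read_fasta(text):
        (big if len(seq) >= min_len else small).append((name, seq))
    return big, small
```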

We then executed the TigMint pipeline on only the > 2 Kb contigs using the following script (bwa version used: v0.7.9a)

(02_tigmint/runTigmintMake.sh)

We then brought the small contigs back in

cat draft.tigmint.fa draft.small.fa > draft.tigmint.full.fa

Finally, we renamed the contigs to have simpler headers:

bioawk -c fastx '{ print ">scaffold-" ++i"\n"$seq }' < draft.tigmint.full.fa > draft.tigmint.final.fa

Resulting Genome Stats

(stats for 02_tigmint/tigmint.fa)

Collapsing Haplotigs with Purge_Haplotigs

In some cases, the two copies of a particular locus in a genome will be different enough in sequence composition that they will be treated as two distinct sequences by the genome assembler. For this assembly, we want to generate a haploid genome without alternative alleles. Also, some of our downstream assembly steps assume a haploid input, so if we don't address this issue it could cause mis-assemblies.

Identifying Uncollapsed Haplotigs Using 10X Data

To identify uncollapsed haplotigs in our TigMint-processed assembly, we mapped our 10X data to our draft genome and looked at the read depth distribution

To do this, we first had to prep the TigMint-processed reference genome for mapping:

(03_purgeHaplotigs/makeTigmintRef.sh)

We then used the longranger pipeline to map the 10X data to the TigMint-processed genome

(03_purgeHaplotigs/runLrAlign.sh)

The resulting bam file (possorted_bam.bam) was renamed to posSort.10x.bam

We then generated a distribution plot of read depth across the contigs in our assembly using a function provided by the Purge_Haplotigs package (v1.1.1)

(03_purgeHaplotigs/runMakePlot.sh)

posSort.10x.bam.histogram

The bimodal distribution in read depth clearly indicated that we had uncollapsed heterozygosity in the assembly.

Collapsing Haplotigs

We used the purge_haplotigs pipeline to remove this uncollapsed heterozygosity. Based on the distribution plot, we specified the bounds of the two populations: the half-depth population (read depth 17 to 67) and the full-depth population (read depth 67 to 185).

(03_purgeHaplotigs/runPurge.sh)
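The role of the three depth cutoffs we passed to purge_haplotigs can be illustrated with a simple classifier (a sketch of the cutoff logic only; the real pipeline also uses contig-to-contig alignment coverage before flagging anything as a haplotig):

```python
LOW, MID, HIGH = 17, 67, 185  # depth cutoffs read off the histogram

def classify_depth(median_depth, low=LOW, mid=MID, high=HIGH):
    """Assign a contig to the half-depth (candidate haplotig) or
    full-depth (properly collapsed) population by its median coverage."""
    if median_depth < low or median_depth > high:
        return "junk/repeat"
    return "suspect-haplotig" if median_depth < mid else "collapsed"
```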

Following haplotig removal, the pipeline produced the fasta file curated.fasta, containing the new haploid assembly.

Resulting Genome Stats

(stats for 03_purgeHaplotigs/curated.fasta)

Scaffolding Contigs with 10X Data

We next attempted to scaffold together some of the contigs broken by TigMint. For this, we used ARCS (v1.1.1) in conjunction with our 10X data.

Mapping 10X Data to the Haplotig-Purged Genome

First we prepped the haplotig-purged genome for mapping with longranger

(04_arcs/makeRef.sh)

We then aligned our 10X data to the genome

(04_arcs/runLrAlign.sh)

ARCS requires name-sorted bam files, which we generated using samtools (v1.12)

(04_arcs/nameSort.sh)

Running ARCS

We ran ARCS using the following three scripts:

(04_arcs/runArcs.sh)

(04_arcs/runMakeTsvTig.sh)

Before running the last script, we ran:

touch empty.fof

(04_arcs/runLINKS.sh)

(Links version: v1.8.6)

After scaffolding, we dropped all sequences shorter than 1 Kb and gave the remaining scaffolds simpler names
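The filtering and renaming step can be sketched as (illustrative Python; the actual work was done with command-line tools, as in the bioawk renaming shown earlier):

```python
def filter_and_rename(records, min_len=1000, prefix="scaffold-"):
    """Drop scaffolds shorter than min_len and rename the rest
    sequentially with a simple uniform prefix."""
    kept = [seq for _, seq in records if len(seq) >= min_len]
    return [(f"{prefix}{i}", seq) for i, seq in enumerate(kept, 1)]
```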

Resulting Genome Stats

(stats for 04_arcs/arcs.final.fa)

Filling Gaps with PBJelly

Although the Arcs scaffolding did increase assembly contiguity, it also introduced gaps. We attempted to fill in some of those gaps with PBJelly (PBSuite v15.8.24) using our long read data. Our long read data consisted of a relatively high coverage Nanopore dataset (~40X) that we used for generating the initial draft genome as well as a relatively low coverage PacBio library (~4X). Because the PacBio data was generated using relatively error-free chemistry (v3), and because it was generated using an entirely different platform from the Nanopore data, we opted to use both for the gap filling.

Correcting Long Reads Using Canu

Long read data is fairly error-prone. To make sure the input provided to PBJelly was as accurate as possible, we corrected the reads before mapping them to our draft genome. To do this, we used the read correction functionality built into the Canu assembly pipeline (v2.2-development).

Correcting PacBio Reads

Our starting file for the PacBio data was a bam file, which we first had to convert to a fastq file before performing the correction. We did this using samtools.

(05_initialPBJ/genFasta.sh)

We then corrected the PacBio reads using the following script:

(05_initialPBJ/runCanuPB.sh)

The following text was used in the canuSpecPB.txt config file describing run parameters for the read correction pipeline

(05_initialPBJ/canuSpecPB.txt)

The corrected reads were written to the file aepPB.correctedReads.fasta

Correcting Nanopore Reads

We used the following script to correct the Nanopore reads:

(05_initialPBJ/runCanuNano.sh)

The contents of the canuSpecNano.txt document were as follows:

(05_initialPBJ/canuSpecNano.txt)

We split the corrected Nanopore reads into 9 chunks to facilitate parallelization with PBJelly:

(05_initialPBJ/splitReads.sh)
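One simple way to split reads into a fixed number of chunks is to deal records round-robin into bins; the sketch below illustrates the idea (hypothetical function, not our actual splitting script):

```python
def split_reads(records, n_chunks=9):
    """Deal records round-robin into n_chunks bins so the chunks
    end up with nearly equal numbers of reads."""
    bins = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        bins[i % n_chunks].append(rec)
    return bins
```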

The resulting fasta files were named as follows:

Filling Gaps with PBJelly

PBJelly requires that the genome fasta file include quality scores, which we didn't have for our Arcs-processed assembly, so we used a utility script provided by PBJelly to generate a fake scores file:

fakeQuals.py arcs.fa arcs.qual

We also had to rename the genome file from arcs.fa to arcs.fasta (PBJelly doesn't recognize fasta files ending in .fa)
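For reference, a .qual file pairs each sequence with one quality score per base. A minimal stand-in for this behavior might look like the following (the uniform score of 40 is our assumption for illustration, not necessarily what fakeQuals.py emits):

```python
def fake_quals(fasta_records, score=40):
    """Emit a .qual entry per (header, sequence) record: the header,
    then one placeholder score per base."""
    lines = []
    for header, seq in fasta_records:
        lines.append(">" + header)
        lines.append(" ".join([str(score)] * len(seq)))
    return "\n".join(lines) + "\n"
```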

PBJelly uses a config XML document to specify a number of parameters. Our config file (config.xml) was as follows:

(05_initialPBJ/config.xml)

We experienced a previously documented parsing error when initially running the PBJelly pipeline; we addressed this by making the following changes to the Jelly.py script:

(modified version available in the file 05_initialPBJ/Jelly.py)

We then executed the PBJelly pipeline using the following script:

(05_initialPBJ/runPBJ.sh)

This pipeline produced the fasta file jelly.out.fasta. We standardized all unfilled or partially filled gaps in this output to be 100 bases long:
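The gap standardization boils down to collapsing every run of Ns to exactly 100 Ns; an illustrative Python sketch (not the script we used):

```python
import re

def standardize_gaps(seq, gap_len=100):
    """Collapse every run of one or more Ns to exactly gap_len Ns."""
    return re.sub(r"N+", "N" * gap_len, seq.upper())
```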

We then gave the genome scaffolds simpler/more uniform names

Resulting Genome Stats

(stats for 05_initialPBJ/jelly.shrink.fasta)

Scaffolding Using Hi-C Data

To go from the scaffolds in our post-PBJelly assembly to pseudo-chromosomes, we employed Hi-C data for the final scaffolding step.

Mapping Hi-C Reads Using Juicer

The Juicer pipeline requires a list of possible restriction enzyme cut site coordinates for the Hi-C protocol used to generate the sequencing data. We used a utility script provided with Juicer to identify cut sites for the Arima kit that we used to generate our libraries:

(06_hic/getCuts.sh)

We found that we needed to modify the Juicer pipeline (v1.6) for it to run on our computing cluster. We named the modified script juicerMod.sh

These are the changes we made to the original juicer script (output from diff -B juicer.sh juicerMod.sh):

(modified script available in the file 06_hic/juicerMod.sh)

We then used the following script to run the Juicer pipeline. Note that we specified a subdirectory work/ as the working directory. Within that working directory we created a fastq folder in which we placed the raw Hi-C reads.

(06_hic/runJuicer.sh)

We found we had to run the above script twice because part of the job scheduling code didn't function properly: jobs began running asynchronously after a certain point, causing the pipeline to choke. Rerunning the script after the first run errored out allowed the pipeline to complete successfully.

Scaffolding using 3d-dna

We then took the aligned reads from the Juicer pipeline merged_nodups.txt and fed them into the 3d-dna pipeline (v180922):

(06_hic/run3dDna.sh)

This initial step produced 14 pseudo-chromosomal scaffolds:

pre_review_hic_map

We then did some slight manual rearrangement of the assembly (recorded in the file jelly.shrink.rawchrom.review.assembly) and ran the final steps of the assembly pipeline:

(06_hic/runFinalize3d.sh)

This is the resulting final assembly post-scaffolding:

final_assembly

We then removed the debris sequences not incorporated into the pseudo-chromosomal scaffolds (a small minority of the overall sequence):

Resulting Genome Stats

(stats for 06_hic/aepChroms.fasta)

Filling Pseudo-Chromosome Scaffold Gaps with PBJelly

The Hi-C scaffolding introduced a large number of gaps, so we performed another round of PBJelly gap filling using the same approach as before. We again used the corrected versions of the long reads for this step (created for the first PBJelly step).

For this PBJelly run, we wanted to reduce the total runtime by parallelizing the mapping step of the pipeline. To do this, we first ran the setup function within the Jelly.py script using a standard configuration file (config.xml):

(07_finalPBJ/runPBJ.sh)

(07_finalPBJ/config.xml)

We then spread the mapping step of the pipeline across ten different nodes (one for each corrected read fasta file), with a custom config file for each node.

(07_finalPBJ/runMapArray.sh)

The config for the first node was:

(07_finalPBJ/mCon1.xml)

The config for the second node (mCon2.xml) was identical to mCon1.xml except for the name of the read fasta file:

(output from diff mCon1.xml mCon2.xml)

This same pattern was used to generate config files 3 through 10.

After mapping, we completed the pipeline using a single node with the config.xml config file

(07_finalPBJ/runScaf.sh)

After the pipeline ran, we again standardized gap sizes to be 100 Ns long and gave the pseudo-chromosomes standardized, simple names

Resulting Genome Stats

(stats for 07_finalPBJ/aepChr.gapfill.fa)

Final Polish Using Pilon

To finalize our assembly, we performed one last polishing step using Pilon. For this round, we used our 10X, Nanopore, and PacBio reads as input to try to maximize accuracy.

Mapping the 10X, Nanopore, and PacBio Reads

To map the 10X reads, we again used the longranger pipeline. First, we prepped the gap-filled pseudo-chromosome scaffolds for mapping:

(08_finalPilon/makeRef.sh)

We then mapped the 10X reads with the longranger align pipeline:

(08_finalPilon/runLrAlign.sh)

We next mapped both the Nanopore and PacBio reads using minimap2 (v2.17-r941). This required we perform separate indexing of the genome fasta file for the Nanopore and PacBio datasets:

(08_finalPilon/makeMMI.sh)

We again used the corrected versions of the long reads for this step (created for the first PBJelly step). This was the script we used to map the PacBio reads following genome indexing:

(08_finalPilon/mmPB.sh)

And this was the script for mapping the Nanopore reads:

(08_finalPilon/mmON.sh)

Pilon requires sorted and indexed bam files as input. Because minimap2 doesn't sort its output by default, we sorted and indexed the initial bams using samtools:

(08_finalPilon/sort.sh)

Polishing Pseudo-Chromosomes w/ Pilon

Next we looped through the fifteen pseudo-chromosome scaffolds and polished them using the 10X, Nanopore, and PacBio reads:

(08_finalPilon/runPilon.sh)

We then pooled the corrected fasta files and standardized gaps to be 100 Ns long. Note that we opted to output the genome fasta without line breaks (each chromosome takes up a single line in the fasta file). Our final assembly was named aep.final.genome.fa.
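Writing the fasta without line breaks simply means emitting each sequence on a single line after its header, as in this illustrative sketch:

```python
def write_single_line_fasta(records):
    """Render (name, seq) records as FASTA text with each full
    sequence on one line (no internal wrapping)."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```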

Final Assembly Stats

Our starting assembly stats:

(stats for 00_canuAssembly/hydra_aep.canu.contigs.fasta)

Our final assembly stats:

(stats for 08_finalPilon/aep.final.genome.fa)

Our starting BUSCO scores:

Our Final BUSCO scores:

Files Associated with This Document