Assembling and Annotating a Draft Genome for Hydra oligactis

This document describes our approach for assembling and annotating a draft genome for Hydra oligactis. This entailed assembling and polishing the genome using Oxford Nanopore reads, and generating gene models with BRAKER2 using previously published whole-animal RNA-seq data.

Downloading the Hydra oligactis RNA-seq datasets from the NCBI SRA

To prepare a reference transcriptome that provides transcript evidence about potential proteins for the protein prediction, we downloaded publicly available RNA-seq datasets from the NCBI Sequence Read Archive (SRA).

The *R1 and *R2 files across the three sequencing runs of each dataset were each concatenated into a single file.
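The concatenation can be sketched as follows. This is a minimal sketch, not the exact commands used here: the helper names `concat_runs` and `count_reads` and all file names are hypothetical. The point that matters is that the R1 and R2 files are concatenated in the same run order, so that mates stay paired record for record.

```shell
# Hypothetical helper: concatenate per-run FASTQ files into one file.
# Usage: concat_runs OUTPUT INPUT [INPUT ...]
concat_runs() {
    out="$1"
    shift
    cat "$@" > "$out"
}

# Hypothetical helper: number of reads in an uncompressed FASTQ file,
# assuming the usual four lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Example with placeholder file names (not the real run names):
# concat_runs dataset_R1.fastq run1_R1.fastq run2_R1.fastq run3_R1.fastq
# concat_runs dataset_R2.fastq run1_R2.fastq run2_R2.fastq run3_R2.fastq
# [ "$(count_reads dataset_R1.fastq)" -eq "$(count_reads dataset_R2.fastq)" ] \
#     || echo "R1 and R2 read counts differ" >&2
```

Checking that both concatenated files hold the same number of reads is a cheap way to catch a missing or misordered run before the files go into trimming and assembly.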

All four files were compressed using pigz v2.4.

Next, the script "Trinity_and_Trinotate_final_version_PHIL" was run to 1) quality assess the reads, 2) correct them using rcorrector, 3) trim them, 4) curate them using TranscriptomeAssemblyTools, and 5) quality assess them again before assembling them with Trinity. The finished transcriptome was then quality assessed using BUSCO in transcriptome mode.

Basecalling the Nanopore reads

Two libraries were prepared to generate long reads for Hydra oligactis. The first library was loaded twice on a Nanopore flow cell, resulting in two read files (lib1_1 and lib1_2). The second library was loaded six times. All 8 files were basecalled with the guppy basecaller v4.5.2 using the high accuracy (HAC) model on an NVIDIA 2070 Super GPU.

Quality assessment of the Nanopore reads

Next, we ran NanoPlot v1.30.1 to quality assess each of the "runs".

Concatenating all the QC-passed reads for assembly

All reads from the 8 individual "runs" that passed the guppy basecaller QC criteria (Q-Score >7) were then concatenated into a single fastq file.
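A minimal sketch of this step, assuming guppy's quality filter was enabled so that each run's output directory contains a `pass` subdirectory holding the reads that met the Q-score threshold. The function name `collect_pass_reads` and all directory and file names are hypothetical.

```shell
# Hypothetical helper: append every FASTQ file found in RUN_DIR/pass/
# to a single output file, skipping anything outside the pass directories.
# Usage: collect_pass_reads OUTPUT RUN_DIR [RUN_DIR ...]
collect_pass_reads() {
    out="$1"
    shift
    : > "$out"                      # start from an empty output file
    for run_dir in "$@"; do
        for fq in "$run_dir"/pass/*.fastq; do
            [ -e "$fq" ] || continue   # the glob matched nothing
            cat "$fq" >> "$out"
        done
    done
}

# Example with placeholder directory names:
# collect_pass_reads all_pass_reads.fastq lib1_1 lib1_2
```

Reading only from the `pass` directories means the Q-score decision made by the basecaller is reused as is, rather than being recomputed from the quality strings.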

Assembling the genome

Then, the Flye v2.8.3 assembler was run to generate the draft genome assembly.

The Flye v2.8.3 assembly finished with the following metrics:
* Total length: 1283640785
* Fragments: 18685
* Fragments N50: 272153
* Largest frg: 6425239
* Scaffolds: 114
* Mean coverage: 17
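For reference, N50 is the length of the fragment at which the cumulative length of all fragments, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below recomputes it from a FASTA file with standard tools. The function names `fasta_lengths` and `n50` and the file name `assembly.fasta` are hypothetical, and this is not how Flye itself produces its report.

```shell
# Print the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Read sequence lengths on stdin (one per line) and print the N50.
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # stop at the first fragment that brings the running sum to half the total
                if (running * 2 >= total) { print len[i]; break }
            }
        }'
}

# Example with a placeholder file name:
# fasta_lengths assembly.fasta | n50
```

Keeping the two steps separate makes it easy to reuse `fasta_lengths` for other summaries, such as the largest fragment (`fasta_lengths assembly.fasta | sort -rn | head -n 1`).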

QC with BUSCO on the newly created genome file

We checked the genome file for completeness using BUSCO v5.0.0 in genome mode.

Polishing the genome with medaka

"Raw" draft assemblies built from uncorrected Nanopore reads are known to contain errors, especially in homopolymer regions. The tool medaka was developed to find and correct these errors. We used the best available medaka model; however, it was trained on data basecalled with guppy v3.0.3, which is less accurate than the guppy v4.5.2 used to basecall the reads for the present genome.

. $SOFTWARE/medaka/bin/activate
PATH=$SOFTWARE/htslib-1.11/:$PATH
PATH=$SOFTWARE/samtools-1.11/:$PATH
PATH=$SOFTWARE/minimap2/:$PATH
PATH=$SOFTWARE/bcftools/:$PATH

QC with BUSCO on the polished genome file

The polished genome was re-assessed with BUSCO v5.0.0 to check for improvements.
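One way to make this comparison concrete is to pull the percentage of complete BUSCOs out of the two short summary files. This sketch assumes BUSCO's usual one-line summary notation (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`); the function names `busco_complete` and `busco_gain` and the file names are hypothetical.

```shell
# Print the "complete" percentage (the C: value) from a BUSCO short summary file.
busco_complete() {
    sed -n 's/.*C:\([0-9.]*\)%\[S:.*/\1/p' "$1" | head -n 1
}

# Print the change in complete BUSCOs, in percentage points, between two runs.
# Usage: busco_gain BEFORE_SUMMARY AFTER_SUMMARY
busco_gain() {
    awk -v before="$(busco_complete "$1")" -v after="$(busco_complete "$2")" \
        'BEGIN { printf "%.1f\n", after - before }'
}

# Example with placeholder file names:
# busco_gain short_summary_raw.txt short_summary_polished.txt
```

A positive value indicates that more BUSCOs were recovered as complete after polishing; a negative value would suggest the polishing step made things worse.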

Soft-masking repetitive elements in the genome

Before generating gene models for the H. oligactis genome, we soft-masked all repeats in the draft assembly. The process for generating the masked version of the genome (olig_genome.sm.fa) is described in the 02_repeatMasking.md document.
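Soft-masking writes repeat regions in lowercase while leaving the sequence itself intact, so the masked fraction of a FASTA file can be checked by counting lowercase bases. A minimal sketch of such a check follows; the function name `softmasked_fraction` is hypothetical and is not part of the repeat-masking workflow itself.

```shell
# Print the fraction of bases written in lowercase (soft-masked) in a FASTA file.
softmasked_fraction() {
    awk '!/^>/ {
             total += length($0)             # count bases before the line is modified
             masked += gsub(/[a-z]/, "")     # gsub returns how many lowercase bases it removed
         }
         END { if (total > 0) printf "%.4f\n", masked / total; else print "NA" }' "$1"
}

# Example:
# softmasked_fraction olig_genome.sm.fa
```

A value of 0 would mean the file is not soft-masked at all, which is worth ruling out before handing the genome to a gene predictor that expects masked repeats.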

Mapping the de novo transcriptome to the genome

The next step is protein prediction on the new genome. First, the newly assembled transcriptome was mapped to the genome using minimap2.

Protein prediction with the BRAKER2 pipeline

Finally, we ran the BRAKER2 pipeline (in a conda environment) to predict proteins.

QC of the predicted proteins

In addition, we ran BUSCO v5.2.2 (now in a conda environment) in protein mode to check for completeness of the predicted proteins.

Final information

The computational needs of the complete project did not exceed the capacity of a single workstation with:
* AMD Ryzen Threadripper 3970X 32-Core Processor
* 256 GB memory
* NVIDIA 2070 Super GPU

Files Associated with This Document