AEP Genome Gene Model Generation

This document covers the generation of gene models for the strain AEP H. vulgaris genome assembly, from repeat annotation to final gene models. It also covers the generation of functional annotations for the AEP gene models. Finally, it describes our approach for benchmarking ATAC-seq and RNA-seq read mapping efficiency for the 105 and AEP assemblies. The starting point for this document is the finalized, non-masked AEP assembly, the creation of which is described in 01_aepGenomeAssembly.md.

This gene model prediction process entailed generating an initial set of annotations with BRAKER2 using protein hints from a custom metazoan proteome database and transcript hints using whole animal Hydra RNA-seq data. We then supplemented these gene models with a second set of predictions generated with exonerate using the Hydra LRv2 transcriptome and a custom database of Hydra GenBank mRNA sequences. The final set of gene models was generated with the PASA pipeline by using a new transcriptome assembly to augment the splice isoform and UTR annotations. We generated functional annotations using OrthoFinder, InterProScan, and BLAST searches.

Generating Hints for Gene Predictions

Transcriptomic Hints

Aligning Whole-Animal RNA-seq Data to the AEP Assembly

To provide transcriptomic data for the gene prediction software, we made use of four paired-end whole-animal RNA-seq libraries generated from various AEP-derived transgenic lines. In the file names below, W indicates data from the watermelon line, O indicates data from the operon line, IW indicates data from the inverse watermelon line, and E indicates data from the enGreen1 line.

In addition we generated two PE RNA-seq libraries from whole male and female Kiel AEP polyps (non-transgenic):

Prior to performing the analysis the file names were simplified as follows:

Reads were then processed with Trimmomatic (v0.36) to remove low-quality base calls and sequencing adapter contamination.

(01_prepHints/trim.sh)

Following processing, R1 and R2 files were pooled:
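A minimal sketch of the pooling step (the library file names below are illustrative stand-ins for two libraries, not the actual trimmed output names):

```shell
# Hypothetical sketch of the pooling step; real trimmed file names differ.
printf '@W.1\nACGT\n+\nIIII\n' > W_R1.fq
printf '@O.1\nTTGG\n+\nIIII\n' > O_R1.fq
printf '@W.1\nCCAA\n+\nIIII\n' > W_R2.fq
printf '@O.1\nGGTT\n+\nIIII\n' > O_R2.fq

# Concatenate all forward reads, then all reverse reads, keeping the
# library order identical so read pairing is preserved for the aligner.
cat W_R1.fq O_R1.fq > pooled_R1.fq
cat W_R2.fq O_R2.fq > pooled_R2.fq
```

Keeping the library order identical between the two pooled files preserves the mate pairing that STAR expects.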

We then prepped the AEP genome (with interspersed repeats hard-masked) for mapping with STAR (v2.7.5c)

(01_prepHints/makeRef.sh)

Next, we mapped the reads to the genome:

(01_prepHints/runAlign.sh)

Generating an AEP Transcriptome for Use in Gene Predictions

To aid later genome annotation steps, we also generated a transcriptome using this genome-mapped RNA-seq data. Although we had already generated an AEP transcriptome (the LRv2 transcriptome; Siebert et al., 2019), it was not produced using any data from animals undergoing gametogenesis, which could cause us to miss transcripts specific to male or female polyps. In addition, that transcriptome was designed to have low redundancy and may have omitted some splicing complexity.

We therefore sought to generate a new transcriptome that both incorporated reads from polyps producing gametes and captured as much of the transcriptomic complexity of adult Hydra as possible.

To generate the transcriptome, the mapped reads from the previous section were provided as input to the Trinity reference-guided transcriptome assembly pipeline (v2.11.0)

(01_prepHints/runTrinity.sh)

Quantifying BUSCO (v5.beta_cv1) metrics for the transcriptome indicated high levels of redundancy (to be expected in a relatively unprocessed transcriptome), but high levels of overall completeness:

Compiling a Protein Hints Database

Protein sequences from both closely and distantly related species can also provide valuable guides for gene prediction software. Our goal was to make use of diverse metazoan proteomes, with a particular focus on cnidarian species. We downloaded the following proteomes to serve as hints for gene prediction:

Translating transcriptome sources into proteomes

Notably, the files H_vulgarisZurich.fa, H_oligactis.fa, H_circumcincta.fa, and H_echinata.fa were transcriptomes, and therefore first needed to be translated into protein sequences. We translated these files using TransDecoder.

Creating a Custom Protein Database to Guide ORF Selection

To guide TransDecoder's selection of likely open reading frames, we used BLAST results from a custom protein database.

To generate the BLAST (v2.10.0+) database, we started with the metazoan orthoDB database:

We then supplemented these sequences with refseq mRNA entries from cnidarians (excluding Hydra), which were retrieved using the following query on refseq:

srcdb_refseq[prop] AND ("Cnidaria"[Organism] AND biomol_mrna[PROP]) NOT "Hydra vulgaris"[Organism]

Sequences returned by this query were downloaded into the file cnido_prot_sequence.fa

Before we pooled cnido_prot_sequence.fa and proteins.fa, we first removed all proteins in cnido_prot_sequence.fa that were already present in proteins.fa using CD-hit (with a 95% sequence similarity threshold; CD-hit version v4.7):

(01_prepHints/protFilt.sh)

Finally, we pooled proteins.fa and cnido_prot_sequence.fa to make a final proteins.fa file (the old proteins.fa file was removed).

Translating the H. echinata transcriptome

After compiling proteins.fa we used it to generate a BLAST-able database:

makeblastdb -in proteins.fa -dbtype prot -title proteins -out proteins

We then ran TransDecoder (v5.2.0) on the H. echinata transcriptome and incorporated BLAST results for the candidate peptide sequences when compared to the protein database:

(01_prepHints/transDecoderHech.sh)

Translating Brown Hydra Transcriptomes

We used essentially the same approach for the brown Hydra transcriptomes, although we used diamond (v2.0.6) instead of BLAST to speed things up.

To make the diamond database from proteins.fa:

diamond makedb -d proteins --in proteins.fa

We first reduced the redundancy in the Hydra transcriptomes using CD-hit:

We then ran TransDecoder using the following script:

(01_prepHints/transDecoderClose.sh)

For all TransDecoder runs, the resulting .pep file served as the proteome for that species in downstream applications.

Final Proteome Compilation and Formatting

After generating all the individual protein fasta files, we used a script provided as part of a standard OrthoFinder (v2.5.4) installation (primary_transcript.py) to extract primary isoforms from each file (this is mainly needed for Ensembl-sourced files). We also removed stop codon symbols from the proteomes. The output was directed into a subdirectory called primary_transcripts.

Files in the primary_transcripts directory were then concatenated into the file allPrimProts.fa.

Finally, atypical non-AA characters were removed to prevent parsing errors later on:

sed -i -e '/^[^>]/s/[^AaRrNnDdCcEeQqGgHhIiLlKkMmFfPpSsTtWwYyVvBbZzJjXx]/X/g' allPrimProts.fa
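As a quick sanity check, the substitution can be demonstrated on a toy fasta (demoProts.fa is an invented example; the real command above edits allPrimProts.fa in place):

```shell
# Toy demonstration of the non-AA character substitution.
printf '>prot1 demo\nMKV.LL-UU\n' > demoProts.fa

# Replace anything outside the accepted amino-acid alphabet with 'X',
# but only on sequence lines (the /^[^>]/ address skips fasta headers).
sed -i -e '/^[^>]/s/[^AaRrNnDdCcEeQqGgHhIiLlKkMmFfPpSsTtWwYyVvBbZzJjXx]/X/g' demoProts.fa
# sequence line is now MKVXLLXXX; the header line is untouched
```

Note that '.', '-', and 'U' (selenocysteine) all fall outside the character class and are masked to 'X', while header text is left alone.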

Performing Ab Initio Gene Predictions

We next used BRAKER2 (v2.1.5) to generate gene models using hints provided by our new transcriptome, proteomic database, and genome-mapped RNA-seq data.

Note: The file aepAligned.sortedByCoord.out.bam created by STAR after mapping the RNA-seq data to the genome was renamed to rna.bam

(02_braker2/brakerScript.sh)

The script above was executed on the cluster within a Singularity container on which BRAKER2 was installed

(02_braker2/runBraker.sh)

Reformatting/Fixing BRAKER2 GFF3 Files

BRAKER2 incorporates gene model predictions from both GeneMark and Augustus. Unfortunately, the GeneMark models were improperly formatted in the GFF3 file produced by BRAKER2, in that they all lacked mRNA/transcript and gene rows. In addition, some Augustus predictions also lacked mRNA rows, and all Augustus predictions lacked gene rows. We used the following R script to fix these issues:

(02_braker2/brakerFixGname.R)

Some BRAKER2 gene models included incomplete ORFs with internal stop codons, which we had to filter out. For this step we used a perl script (transcript_keeper.pl) from a very useful repo of genome annotation tools to retain only the complete gene models. We also used several other tools from this repo throughout the annotation process.
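The core filtering idea can be sketched as follows; note that this is a simplified stand-in for transcript_keeper.pl, and it assumes single-line fasta records:

```shell
# Simplified stand-in for the incomplete-ORF filter: drop any protein
# whose sequence contains an internal stop codon, marked here with '*'.
# (Demo fasta with invented gene IDs; assumes one sequence line per record.)
printf '>g1.t1\nMKVLLR\n>g2.t1\nMKV*LLR\n>g3.t1\nMTTR\n' > brakerDemo.fa

awk '/^>/{hdr=$0; getline seq; if (seq !~ /\*/) print hdr "\n" seq}' \
  brakerDemo.fa > brakerDemo.filt.fa
```

Here g2.t1 is removed because of its internal stop, while g1.t1 and g3.t1 are retained.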

We generated some stats for these initial gene models. First we looked at the number of genes/transcripts, exon length, etc.

We also looked at BUSCO stats:

Supplementing Gene Models Using Exonerate

The stats for the BRAKER2 gene models were quite good; however, the number of complete BUSCOs was somewhat lower than for the genome-guided transcriptome we had produced, which suggested that there were additional BUSCOs in our genome that weren't being annotated. We therefore wrote a custom pipeline to produce gene models from nucleotide alignments generated by Exonerate (v2.2.0). This allowed us to use our previous 'gold standard' annotation (the LRv2 transcriptome) as well as manually deposited Hydra GenBank entries as input to supplement the BRAKER2 annotations, hopefully filling in some of the gaps in the annotation.

Compiling the Input Sequences to Be Used for Alignment

We wanted to include manually deposited Hydra GenBank sequences because all of those sequences were experimentally validated in some way, making them high quality coding sequence predictions. To get these sequences from GenBank we started with this query on NCBI:

"Hydra vulgaris"[porgn] AND (biomol_mrna[PROP] AND ddbj_embl_genbank[filter])

We downloaded the results as a multi-GenBank file (downloaded on November 13, 2020). This file contained a large number of procedurally deposited GenBank entries that had not been experimentally validated. We filtered out those entries and exported the remaining nucleotide sequences using the following python script:

(03_exonerate/hydraAnnotations.py)

We then combined these sequences with the AEP LRv2 transcriptome to generate the input for our alignment pipeline:

cat hydraAnnotations.fasta aepLRv2.fasta > query.fa

Pipeline for Generating Gene Models from Exonerate Alignments

In principle, the pipeline is simple, in that Exonerate can take an mRNA sequence and a genome sequence as input and output gene model coordinates in a GFF format; however, there are multiple complications when actually implementing this approach. First, the specific Exonerate algorithm that produces high-quality alignments (cdna2genome) is prohibitively slow when given a genome-sized search space. Second, Exonerate has finicky input requirements, including a requirement for a file that specifies the ORF coordinates for a given input transcript. Third, the GFF files produced by Exonerate need extensive formatting fixes.

Pipeline software versions for software not yet mentioned:

BedTools v2.30.0, EMBOSS v6.6.0.0, agat v0.6.1

The pipeline script is provided below:

(03_exonerate/gbMap.sh)

This script uses several supplemental R scripts, primarily for reformatting text files.

The accessory script below fixes formatting issues with Exonerate's GFF output. In addition it converts the coordinates into their proper genomic equivalents (in the initial output the coordinates are relative to the small stretch of sequence used for the alignment).

(03_exonerate/reformatGff.R)
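The coordinate conversion amounts to adding the genomic start offset of the extracted alignment window back onto Exonerate's window-relative coordinates. A toy example of the arithmetic (the offset value and GFF row are invented; the real fix in reformatGff.R also handles strand and other formatting issues):

```shell
# Exonerate reported coordinates relative to a small extracted genomic
# window; add the window's genomic start offset back to columns 4 and 5.
offset=100000
printf 'chr1\texonerate\texon\t15\t210\t.\t+\t.\tID=demo\n' > exoLocal.gff

awk -v OFS='\t' -v off="$offset" '{$4 += off; $5 += off; print}' \
  exoLocal.gff > exoGenomic.gff
# exon coordinates become 100015-100210
```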

The accessory script below fixes the coordinates for ORFs generated by getorf in part by including the stop codon in the final ORF coordinates.

(03_exonerate/addStopCoord.R)

The accessory script below makes sure all of the gene and mRNA IDs are uniformly formatted following the AGAT conversion to GFF3. It also makes sure all rows associated with a gene model have the appropriate parent ID.

(03_exonerate/fixParents.R)

Running the Exonerate Pipeline

To prep for the pipeline, we generated a blast db from the AEP genome file that had all repeats soft-masked

makeblastdb -in aep.genome.fullsoft.fa -dbtype nucl -title AEPgenome -parse_seqids -out AEPgenome

We also prepped a file with chromosome sizes (needed by bedtools):

Finally, we split our multifasta file of query sequences to run the pipeline in parallel

seqkit split -p 24 query.fa

The pipeline was executed using the following script:

(03_exonerate/slurmRunExo.sh)

We then concatenated the resulting output files from the 24 separate runs

cat query.part_0*/fullRes.gff3 > exoCat.gff3

We filtered out short or incomplete ORFs from the resulting gene predictions.

Note: AGAT can't parse fasta files with excessively long lines. The initial genome file didn't have line breaks, so we added them using the command: seqkit seq -w 60 aep.final.genome.fa > aep.final.genome.rfmt.fa

'Bad' gene models (those models not listed in exoHeaders.txt) were removed from the exonerate GFF3 file using this R script:

(03_exonerate/subExoComp.R)
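The filtering concept can be approximated with a simple ID lookup (the actual subExoComp.R operates on the parsed GFF3 and matches IDs more carefully; all file contents and model names below are invented):

```shell
# Keep only GFF3 rows whose model ID appears in the list of complete
# models (stand-in for exoHeaders.txt).
printf 'goodModel1\ngoodModel2\n' > headersDemo.txt
printf 'chr1\texo\tgene\t1\t100\t.\t+\t.\tID=goodModel1\nchr1\texo\tgene\t200\t300\t.\t+\t.\tID=badModel9\n' > exoDemo.gff3

# -F: fixed-string matching; -f: read patterns from the headers file.
grep -F -f headersDemo.txt exoDemo.gff3 > exoDemo.filt.gff3
```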

Curating and Combining Exonerate and BRAKER2 Gene Models

Merging the BRAKER2 and Exonerate Gene Models

We next needed a way to merge the BRAKER2 and Exonerate gene models into a unified set of predictions. In many cases the two approaches identified a gene model at the same locus, meaning we needed a way to pick the better of the two options. We did this by BLASTing the gene models against our database of proteins from other species (used initially to provide hints to the BRAKER2 pipeline) and then picking whichever gene model had the best alignment score.

First, we pooled the BRAKER2 and Exonerate protein sequences:

cat exoFilt.fa braker.prots.fa > gmCandidates.fa

Then we removed any stop codon symbols (these cause errors when BLASTing):

sed -i 's/\.$//g;s/\([A-Z]\)\./\1/g;s/\.\([A-Z]\)/\1/g' gmCandidates.fa

sed -i 's/\*//g' gmCandidates.fa

Then we used diamond to align the protein models to allPrimProts.fa

(04_mergeMods/runBlast.sh)

Next, we needed to identify which gene models from the two approaches overlapped (indicating a redundancy that needed to be resolved). We did this by looking for genes whose coordinates intersected each other in the genome

We then used the following R script to reduce the redundancy between the BRAKER2 and Exonerate models using the BedTools and BLAST output. We kept any gene model that had no intersections; when two models intersected, we kept only the one with the higher alignment score from our protein database BLAST run.

(04_mergeMods/gmFilt.R)
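For any overlapping pair, the selection rule reduces to keeping the higher-scoring model. A deliberately minimal sketch (model names and bitscores are made up; gmFilt.R handles the full intersection table):

```shell
# Two models overlap at one locus; keep whichever scored higher in the
# diamond search against the protein hint database.
printf 'brakerG1\t250\nexoG1\t310\n' > scoresDemo.tsv   # model<TAB>bitscore

sort -k2,2nr scoresDemo.tsv | head -n1 | cut -f1 > keepDemo.txt
# keepDemo.txt now contains the winning model ID
```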

We then pooled the filtered gene models to generate the preliminary merged set of predictions

Removing TEs and Short Gene Models

Although we performed extensive repeat masking, there were still contaminating TE proteins in our gene models. To identify and remove at least some of these TEs, we used InterProScan (v5.51-85.0) to scan our preliminary protein models for genes with transposase domains that we could then filter out.

(04_mergeMods/comboIPR.sh)

We used the domain prediction TSV to identify gene models with a transposase domain

grep 'transpos' combined.prots.fa.tsv | cut -f 1 | sort | uniq > teIDs.txt

We also flagged any proteins 50 AA or shorter:

seqkit seq -M 50 -i combined.prots.fa | grep ">" | sed 's/>//g' > shortProts.txt

Both lists of flagged IDs were used to filter the merged gene set
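The filtering step can be sketched like this (file contents and gene IDs are invented, and the awk one-liner assumes single-line fasta records):

```shell
# Pool the TE and short-protein ID lists, then drop those records
# from a demo fasta of merged gene models.
printf 'teGene1\n' > teIDsDemo.txt
printf 'tinyGene1\n' > shortProtsDemo.txt
cat teIDsDemo.txt shortProtsDemo.txt | sort -u > dropDemo.txt

printf '>teGene1\nMKTRANSPOS\n>okGene1\nMKVLLRTTPQWW\n>tinyGene1\nMKV\n' > mergedDemo.fa

# First pass (NR==FNR) loads the flagged IDs; second pass toggles a keep
# flag at each header and prints only unflagged records.
awk 'NR==FNR{bad[$1]=1; next}
     /^>/{id=substr($1,2); keep=!(id in bad)}
     keep' dropDemo.txt mergedDemo.fa > mergedDemo.filt.fa
```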

Below are the updated gene model stats after the TE and short AA filtering:

Renaming Gene Models

Next, we prettied up the gene model names using a utility function from MAKER3 (v3.01.03), giving them names that roughly follow Ensembl naming conventions.

To evaluate the completeness, we selected the longest isoform for each gene model, extracted protein sequences, and ran BUSCO

Including the Exonerate models gave a decent boost to completeness:

Updating Gene Models with PASA

The merged Exonerate/BRAKER2 gene models were very complete based on the BUSCO metrics; however, they had relatively few isoforms, meaning we were likely underestimating overall transcriptional complexity. Also, only the Exonerate models had UTRs. This motivated us to try to incorporate more of the information from our transcriptome into our gene models. We used the PASA pipeline (v2.4.1), which provides this functionality.

We first prepped the transcriptome we generated using Trinity for the PASA pipeline:

(05_pasaUpdate/runCleanup.sh)

We then ran the main PASA pipeline, which aligned the transcriptome to the genome

(05_pasaUpdate/runAlignment.sh)

We executed the above script from within a Singularity container on a Slurm computing cluster using the script below:

(05_pasaUpdate/slurmRunAlignment.sh)

The PASA pipeline was then run again with the -A flag, triggering the annotation comparison mode. In this mode, PASA compares the aligned transcripts from the transcriptome to the provided gene annotations, and updates the gene models in cases where the aligned transcripts contained more/better information (e.g., splice sites or UTR coords)

(05_pasaUpdate/runCompare.sh)

The resulting GFF3 file was named HVAEP1.geneModels.pUpdate1.gff3

Polishing and Finalizing Gene Models

In some cases PASA ended up breaking ORFs of gene models that were previously complete. We dropped the PASA updated versions of those disrupted gene models and restored them to their prior pre-PASA state.

While reviewing the PASA-updated gene models, we found a problem in the Exonerate predictions: very large introns were sometimes inserted in an attempt to align the entire provided 3' UTR sequence (which in some cases included poly(A) sequence that hadn't been removed prior to alignment). We addressed this issue by dropping all 3' UTRs shorter than 20 nt that sat alone in a terminal exon, using the following R script:

(06_finalize/uFix.R)

Because some gene models were merged or otherwise modified after we had used MAKER to reformat the gene names, we had to adjust the names so that they were still numbered consecutively according to their order in the genome.
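Conceptually, the renumbering is just a coordinate sort followed by sequential ID assignment, e.g. (the HVAEP1_G prefix and zero-padding are illustrative guesses at the naming scheme, and the input table is invented):

```shell
# Demo table: chromosome<TAB>start<TAB>old gene ID
printf 'chr1\t5000\toldB\nchr1\t100\toldA\nchr2\t50\toldC\n' > geneCoordsDemo.tsv

# Sort by chromosome, then numerically by start coordinate, and assign
# consecutive IDs in that genomic order.
sort -k1,1 -k2,2n geneCoordsDemo.tsv |
  awk '{printf "%s\tHVAEP1_G%06d\n", $3, NR}' > renameDemo.tsv
```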

Finally, because these gene models were passed through many different programs that often added odd/unconventional tags, there were quite a few formatting quirks in the 9th column of the GFF3. The following R script tries to catch and correct most of those formatting issues:

(06_finalize/postRnPolish.R)

Following these modifications, we finalized the gene models and generated the final fasta and GFF files:

Final gene model stats:

Generating Functional Annotations

We next set about generating functional annotations for the AEP gene models. To make inferences about gene function, we used protein domain predictions as well as orthology/sequence similarity to genes in better annotated animal models.

Predicting Protein Domains Using InterProScan

To predict protein domains, we used the InterProScan pipeline (including the optional modules for Phobius, SignalP, and TMHMM):

(07_funAnnot/runFinalIpr.sh)

This generated the output file HVAEP1.prot.longestIso.fa.tsv, which we used as our primary resource for determining the protein domain composition of the AEP gene models.

Predicting Orthology Using OrthoFinder

Identifying orthologs is critical for any comparative genomics analysis, and can also be a useful way of preliminarily assigning functions to genes of interest. We used OrthoFinder to systematically identify orthologs of the AEP gene models in diverse metazoan species.

We assembled a total of 45 proteomes for the OrthoFinder analysis.

The sources for most of these proteomes were described above. Below are the sources for the additional proteomes that we added for the OrthoFinder analysis:

We also dropped the H_vulgarisZurich.fa proteome, which was in the original list of proteomes that we used as hints for gene model prediction.

One of the new sources was the AEP LRv2 transcriptome, which we needed to translate into protein sequence. We did this using TransDecoder, similar to what was described above for other transcriptomic sources, although in this case we used the NCBI NR database instead of a custom protein database for generating BLAST hits to prioritize predicted ORFs.

(07_funAnnot/transDecoder.sh)

After we had compiled our protein sources, we reformatted the proteomes to be compatible with OrthoFinder (primarily dropping stop codon symbols and spaces in header text) and selected a single representative isoform for each gene (when possible):

To make interpreting the OrthoFinder results easier, we wanted to incorporate gene names from certain well-studied species (e.g., humans, flies, etc.) into the sequence IDs used in the analysis. By doing this, we would be able to discern the identity of at least some genes in the OrthoFinder gene trees without having to first convert a complex gene ID into something more human readable.

We used a custom R script to identify proteomes that were associated with functional annotations/gene names in Ensembl. We then used the gene IDs from those proteomes to download gene names (as well as GO terms) from BioMart. We exported new versions of the proteome fasta files with modified headers that included the abbreviated gene name. We also exported tables that included all the metadata we downloaded (Ensembl ID, short gene name, long gene name, GO terms, and UniParc ID) for each proteome as a separate reference.

(07_funAnnot/getSymbols.R)

We then ran the OrthoFinder pipeline on our processed protein fasta files (with settings aimed at maximizing sensitivity/accuracy). The output from this run was placed in the directory Results_Sep15_1.

(07_funAnnot/runOrthoF.sh)

Included in the input was a newick tree defining the phylogenetic relationships of all the species in the analysis:

 

(Image: orthoTree)

OrthoFinder can predict the species tree on its own based purely on the protein sequence input, but we found that this did not result in an accurate tree. Because the tree topology is important for getting accurate orthology predictions, we manually specified the tree structure as part of the input to the pipeline.

However, we also wanted an estimate of protein sequence divergence between species, so we repeated the OrthoFinder analysis while omitting the manually generated tree.

(07_funAnnot/getSpecTreeDist.sh)

We then manually fixed the topology of the resulting species tree using Mesquite while retaining the OrthoFinder-generated distances.

(Image: orthoTreeLength)

Finally, to more conveniently review the gene trees generated by OrthoFinder, we used the following R script to generate PDFs of tree plots from the Newick-formatted tree files produced by OrthoFinder:

(07_funAnnot/TreePlots.R)

BLASTing Against UniProt and GenBank

In addition to identifying direct orthologs using OrthoFinder, we also used BLAST to identify proteins from well-annotated databases with significant sequence similarity to the AEP gene models. We started with the database of manually deposited Hydra sequences from GenBank (described above).

We first generated a BLAST database:

We then used the fasta file containing the nucleotide sequence of the longest isoform of each AEP gene model as the query for a BLASTN search:

(07_funAnnot/gbAepBlast.sh)

We also generated a BLAST database from the UniProt protein database (downloaded on July 30, 2021):

We then used the fasta file containing the AA sequence of the longest isoforms from each AEP gene model as a query for a diamond BLASTP search:

(07_funAnnot/upAepBlast.sh)

Combining Different Annotation Sources

We then combined our InterProScan (specifically the PANTHER and PFAM output), OrthoFinder, and BLAST results into a single functional annotation table (called HVAEP1_annotation.csv) using the R script below. This table served as our general point of reference for exploring possible functions for genes of interest.

For integrating the OrthoFinder results into this table, we opted to only pull orthologs from a handful of well-studied systems (namely H. sapiens, M. musculus, X. tropicalis, D. melanogaster, and C. elegans) because of the abundance of functional data available from these systems. In cases where there were orthologs from more than one of these five species, we (somewhat arbitrarily) prioritized species based on the order they were written above. That is, an ortholog from H. sapiens was prioritized over orthologs from M. musculus, but orthologs from M. musculus were prioritized over orthologs from X. tropicalis.

We also attempted to collapse orthologs in cases where multiple members of the same gene family were assigned as orthologs to a single Hydra gene, such as collapsing wnt8a and wnt8b to wnt8 and fgf1 and fgf2 to fgf1/2.

(07_funAnnot/orthologTables.R)
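The collapsing rule can be illustrated with a deliberately simplified example (orthologTables.R implements a more careful version of this; the sed rule below only strips a trailing a/b paralog letter):

```shell
# Two same-family orthologs assigned to one Hydra gene collapse to a
# single family name (e.g., wnt8a + wnt8b -> wnt8).
printf 'wnt8a\nwnt8b\n' > orthsDemo.txt

sed 's/[ab]$//' orthsDemo.txt | sort -u > collapsedDemo.txt
```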

Benchmarking Mapping Statistics for the AEP and 105 Genome Assemblies

To evaluate differences in mapping efficiency for AEP-derived sequencing data when using a strain AEP versus a strain 105 genome reference, we used two different types of sequencing data: RNA-seq and ATAC-seq.

Aligning Whole-Animal RNA-seq Data to the AEP and 105 Reference Genomes

For the RNA-seq alignment benchmarking, we used data from an experiment that generated a total of 206,106,125 SE100 reads from whole adult Hydra exhibiting different sexual phenotypes. Specifically there were three male replicates (samples M1-3), three female replicates (samples F1-3), and three replicates of Hydra that weren't producing gametes (samples NS1-3).

To generate alignments with these data, we first filtered the reads using Trimmomatic to remove low-quality base calls and adapter contamination. We then used STAR, as implemented within rsem (v1.2.31), to align the filtered reads to either the strain 105 or the strain AEP genome assembly.

Prior to running the alignment pipeline, we first had to prepare the reference sequences for the two genomes. rsem requires a gene-to-transcript map, a text file linking transcript IDs to their parent gene IDs. We generated this for the 105 reference from the gene model GTF file:
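For a standard GTF, the gene-to-transcript map can be extracted with a short awk one-liner; this is a plausible sketch rather than the exact command we ran (demo.gtf is an invented single-transcript example):

```shell
# Minimal demo GTF with standard gene_id/transcript_id attributes.
printf 'chr1\tx\ttranscript\t1\t90\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";\n' > demo.gtf

# Pull the quoted gene_id and transcript_id values out of column 9 and
# emit gene<TAB>transcript pairs, deduplicated.
awk -F'\t' '$3 == "transcript" {
  match($9, /gene_id "[^"]+"/);       g = substr($9, RSTART + 9,  RLENGTH - 10)
  match($9, /transcript_id "[^"]+"/); t = substr($9, RSTART + 15, RLENGTH - 16)
  print g "\t" t
}' demo.gtf | sort -u > geneTransMapDemo.txt
```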

We then ran the rsem-prepare-reference command to generate the reference metadata required by the STAR aligner.

(08_crossMap/prepRef.sh)

We then ran the RNA-seq mapping pipeline. The pipeline was executed on a computing cluster using the following script:

(08_crossMap/slurmMap_Dove.sh)

Below is the pipeline script executed within the computing cluster wrapper. The code is almost entirely recycled from a previous publication

(08_crossMap/RNA_Mapping_Pipeline_dovetail.sh)

After we had aligned the data to the 105 reference, we replaced the STAR genome reference with the AEP reference and re-ran the pipeline.

Generating the gene to transcript map for the AEP gene models:

Generating the STAR reference for the AEP assembly:

(08_crossMap/prepRefAEP.sh)

We then simply re-ran the RNA mapping pipeline with some minor modifications:

(08_crossMap/slurmMap.sh)

The pipeline script RNA_Mapping_Pipeline.sh differed from the script used for mapping to the 105 reference (RNA_Mapping_Pipeline_dovetail.sh) in the following (superficial) ways:

(Output from diff RNA_Mapping_Pipeline.sh RNA_Mapping_Pipeline_dovetail.sh)

Aligning Whole-Animal ATAC-seq Data to the AEP and 105 Reference Genomes

For the ATAC-seq alignment benchmarking, we used the whole-animal ATAC-seq data generated for this study. We used the mapping results from the analysis described in 08_creIdentification.md for the AEP alignment statistics. For the 105 assembly, we simply replaced the AEP bowtie reference with a 105 assembly bowtie reference using the following command:

bowtie2-build Hm105_Dovetail_Assembly_1.0.fasta hydra_genome

We then re-ran the mapping pipeline.

Plotting Differences in Mapping Rates

We pulled the logs from both the RNA-seq (NS[1-3]_RNALog.final.out, F[1-3]_RNALog.final.out, and M[1-3]_RNALog.final.out from the tmp folders generated by rsem) and ATAC-seq (ATAC_Pipeline_[0-2].err) mapping pipelines and used them to create plots summarizing the mapping results using the following R script:

(08_crossMap/mapStats.R)

(Image: crossMapStats)

Files Associated with This Document