Identifying and Masking Repetitive Regions in Hydra Genomes

This document covers our analysis of repetitive elements in the H. vulgaris and H. oligactis genome assemblies. For the strain AEP H. vulgaris and H. oligactis genomes, this included ab initio prediction of repeat families. RepeatMasker was then used to identify repetitive regions in the H. oligactis, strain AEP H. vulgaris, and strain 105 H. vulgaris genomes. The RepeatMasker output was also used to generate plots visualizing the repeat landscape of each genome.

Masking Repeats in the Strain AEP H. vulgaris assembly

Identifying Repeat Families Using RepeatModeler2

Public repeat databases currently lack an extensive, well-curated set of repetitive elements (REs) for Hydra, so relying on them alone would likely miss a large portion of the repeats. We therefore used RepeatModeler2 (v2.0.1) to perform ab initio repeat family predictions for the AEP assembly.

Before running the primary RepeatModeler2 pipeline, we first had to index the genome fasta file:

(01_aepRep/buildRMRef.sh)

We then ran the RepeatModeler2 pipeline with default settings, except that we also enabled the optional LTR prediction pipeline:

(01_aepRep/runRM.sh)
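As a rough illustration of the indexing and modeling steps above (not the contents of the referenced scripts), the sketch below wraps the two standard RepeatModeler2 entry points, `BuildDatabase` and `RepeatModeler`, with `-LTRStruct` enabling the optional LTR pipeline. The database name `aep`, the file name `aep.genome.fa`, and the thread count are placeholders assumed for this example, and the tools are called directly rather than through a container.

```bash
# Hypothetical sketch: names and thread counts are placeholders,
# not values taken from the original scripts.

# Index a genome FASTA into a RepeatModeler database.
build_rm_db() {
    local db_name=$1 genome_fa=$2
    BuildDatabase -name "$db_name" "$genome_fa"
}

# Run RepeatModeler2 with default settings plus the optional LTR pipeline.
run_repeatmodeler() {
    local db_name=$1 threads=$2
    RepeatModeler -database "$db_name" -pa "$threads" -LTRStruct
}

# Example (requires RepeatModeler2 on PATH):
#   build_rm_db aep aep.genome.fa
#   run_repeatmodeler aep 8
```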

The resulting Hydra-specific repeat file (consensi.fa.classified, which was renamed to aep-families.fa) was then used to mask the AEP assembly using RepeatMasker.

Masking Repeats using RepeatMasker

We used RepeatMasker (v4.0.7) to identify and mask repetitive sequences in the AEP assembly using repeat families predicted by RepeatModeler2:

(01_aepRep/runRMaskFull.sh)

Because RepeatModeler2 may have missed some repeat families, we also ran RepeatMasker using the eumetazoa Dfam repeat database (included with the RepeatMasker installation, Dfam version 3.1) to try to catch some of the missed repeats.

(01_aepRep/runRMaskEuk.sh)
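The two masking modes described above differ mainly in where RepeatMasker gets its repeat library: `-lib` points at a custom FASTA of repeat families, while `-species` selects a clade from the bundled database. The sketch below shows both under assumed placeholder arguments (output directory, genome file name, thread count); it is an illustration, not a copy of the referenced scripts.

```bash
# Hypothetical sketch: directory names, file names, and thread counts
# are placeholders chosen for illustration.

# Mask using a custom repeat library (e.g. RepeatModeler2 families).
mask_with_lib() {
    local lib_fa=$1 out_dir=$2 genome_fa=$3 threads=$4
    RepeatMasker -lib "$lib_fa" -pa "$threads" -dir "$out_dir" "$genome_fa"
}

# Mask using a clade from the database bundled with RepeatMasker.
mask_with_species() {
    local species=$1 out_dir=$2 genome_fa=$3 threads=$4
    RepeatMasker -species "$species" -pa "$threads" -dir "$out_dir" "$genome_fa"
}

# Example (requires RepeatMasker on PATH):
#   mask_with_lib aep-families.fa maskFull aep.genome.fa 8
#   mask_with_species eumetazoa eukMaskFull aep.genome.fa 8
```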

We then pooled the two masking results with the following command (note: the RepeatModeler2-based results were moved to a folder called maskFull and the eumetazoa Dfam results were moved to a folder called eukMaskFull):

zcat maskFull/*cat eukMaskFull/*cat > bothMaskFull.cat
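The pooling commands in this document match both `*cat` and `*cat.gz` patterns, so the `.cat` outputs may be plain or gzipped depending on the run. A small helper that tolerates both, offered as a sketch with a hypothetical function name, could look like this:

```bash
# Concatenate RepeatMasker .cat outputs to stdout,
# decompressing only the files that end in .gz.
pool_cat_files() {
    local f
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc -- "$f" ;;
            *)    cat -- "$f" ;;
        esac
    done
}

# Example:
#   pool_cat_files maskFull/*.cat* eukMaskFull/*.cat* > bothMaskFull.cat
```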

The pooled file was then processed to generate output files (including a hard masked fasta file and a repeat coordinates gff file) that combined the two masking results:

(01_aepRep/processBothFull.sh)

As part of this command, we included the -a flag, which created an optional alignment file needed to create the repeat landscape plots (used for visualization, described below).
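A hedged sketch of such a `ProcessRepeats` call is shown below. The flags used (`-a` for the alignment file, `-gff` for the coordinates file, `-species`, and `-maskSource` to write a masked copy of the original sequence) are standard ProcessRepeats options as we understand them, but the exact combination in the referenced script is an assumption, as are the file names.

```bash
# Hypothetical sketch: the flag combination and file names are assumed
# for illustration and are not taken from the original script.
process_pooled() {
    local species=$1 genome_fa=$2 pooled_cat=$3
    ProcessRepeats -a -gff -species "$species" \
        -maskSource "$genome_fa" "$pooled_cat"
}

# Example (requires RepeatMasker's ProcessRepeats on PATH):
#   process_pooled eumetazoa aep.genome.fa bothMaskFull.cat
```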

The ProcessRepeats command generated the following statistics report:

(excerpt from 01_aepRep/bothMaskFull.tbl)

Because some analyses require that only simple or only complex (i.e., interspersed) repeats be masked, we performed additional RepeatMasker runs to selectively mask just one of the two repeat types.

We first masked simple repeats using the RepeatModeler2 library:

(01_aepRep/runRMaskSimple.sh)

Then we masked simple repeats using the Dfam eumetazoa library:

(01_aepRep/runRMaskEukSmpl.sh)

We then combined the RepeatMasker output files:

zcat eukMaskSimp/*cat.gz maskSimp/*cat.gz > bothMaskSimp.cat

And generated the final masked product:

(01_aepRep/processBothSimp.sh)

We next performed interspersed/complex repeat masking, first with the RepeatModeler2 library:

(01_aepRep/runRMaskCplx.sh)

Then with the eumetazoa library:

(01_aepRep/runRMaskEukCplx.sh)

We then combined these two outputs:

zcat eukMaskCplx/*cat.gz maskCplx/*cat.gz > bothMaskCplx.cat

And generated the final masked files:

(01_aepRep/processBothCplx.sh)
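The selective runs above map onto two RepeatMasker switches: `-noint` restricts masking to low-complexity and simple repeats, and `-nolow` skips them so that only interspersed repeats are masked. The wrapper below is a sketch with a hypothetical function name and placeholder arguments; it shows the custom-library case only.

```bash
# Hypothetical sketch: selects the RepeatMasker switch for one repeat type.
mask_one_type() {
    local mode=$1 lib_fa=$2 out_dir=$3 genome_fa=$4 threads=$5 flag
    case "$mode" in
        simple)  flag=-noint ;;  # simple/low-complexity repeats only
        complex) flag=-nolow ;;  # interspersed repeats only
        *) echo "mode must be 'simple' or 'complex'" >&2; return 1 ;;
    esac
    RepeatMasker "$flag" -lib "$lib_fa" -pa "$threads" -dir "$out_dir" "$genome_fa"
}

# Example (requires RepeatMasker on PATH):
#   mask_one_type simple  aep-families.fa maskSimp aep.genome.fa 8
#   mask_one_type complex aep-families.fa maskCplx aep.genome.fa 8
```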

Finally, we created softmasked versions of all of the above RepeatMasker outputs using bedtools.
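One common way to softmask with bedtools is `bedtools maskfasta` with the `-soft` option, which lowercases the bases inside the supplied intervals instead of replacing them with N. The sketch below assumes the repeat coordinates come from the GFF written during processing; the function and file names are hypothetical.

```bash
# Hypothetical sketch: lowercase all bases that fall within repeat intervals.
softmask() {
    local genome_fa=$1 repeats_gff=$2 out_fa=$3
    bedtools maskfasta -soft -fi "$genome_fa" -bed "$repeats_gff" -fo "$out_fa"
}

# Example (requires bedtools on PATH):
#   softmask aep.genome.fa bothMaskFull.out.gff aep.genome.softmasked.fa
```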

To enable visualization of repeat density throughout the AEP assembly, we generated a bigWig file that quantified the number of repeats overlapping each position along the genome (in practice, this is essentially a binary present/absent signal).

(01_aepRep/repeatDensity.sh)
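To make the "binary" track concrete, the sketch below collapses overlapping repeat intervals from a BED file into a bedGraph in which every covered base has the value 1. This is one possible approach and not necessarily what the referenced script does; the function name is hypothetical, and the conversion to bigWig is left as a commented call to the UCSC `bedGraphToBigWig` utility.

```bash
# Collapse repeat intervals (BED: chrom, start, end) into a binary bedGraph.
# Overlapping or adjacent intervals are merged; covered bases get value 1.
repeats_to_bedgraph() {
    LC_ALL=C sort -k1,1 -k2,2n "$1" | awk 'BEGIN { OFS = "\t" }
        $1 != chrom || $2 > cur_end {      # new sequence or gap: flush
            if (chrom != "") print chrom, cur_start, cur_end, 1
            chrom = $1; cur_start = $2; cur_end = $3
            next
        }
        $3 > cur_end { cur_end = $3 }      # overlap: extend current interval
        END { if (chrom != "") print chrom, cur_start, cur_end, 1 }'
}

# Example (file names are placeholders):
#   repeats_to_bedgraph repeats.bed > repeats.bg
#   bedGraphToBigWig repeats.bg genome.chrom.sizes repeats.bw
```

Positions outside any repeat are simply absent from the bedGraph, which most genome browsers display as zero or no signal.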

Masking Repeats in the Strain 105 H. vulgaris assembly

Masking Repeats Using RepeatMasker

Because we wanted to use the 105 assembly as a point of comparison for the AEP assembly, and because we needed a repeat-masked version of the 105 assembly for our whole genome alignment (described in 07_genomeConservation.md), we also performed repeat masking on the 2.0 version of the strain 105 H. vulgaris genome.

For the 105 assembly, we opted to just use the repeat families we identified using the AEP assembly. The two strains are relatively closely related, so our approach was likely sufficient to capture most repeats, with the caveat that certain very recent repeat families may have been missed.

Overall, our strain 105 masking approach was essentially identical to our approach for strain AEP.

We first masked with the RepeatModeler2 families:

(02_105Rep/runRMaskFull105.sh)

We then masked with the Dfam eumetazoa library:

(02_105Rep/runRMaskEuk105.sh)

We combined the two outputs:

zcat 105Full/*cat 105EukFull/*cat > bothMaskFull105.cat

And generated combined output files:

(02_105Rep/processBothFull105.sh)

This produced the following results table:

(excerpt from 02_105Rep/bothMaskFull105.tbl)

We also generated a softmasked version of the genome fasta that was used for a cross-species whole-genome alignment.

Masking Repeats in the H. oligactis Assembly

Identifying Repeat Families Using RepeatModeler2

As H. oligactis is somewhat distantly related to H. vulgaris, we opted to generate an oligactis-specific repeat library using RepeatModeler2.

First we prepped the oligactis fasta file:

(03_oligRep/buildRMRefOlig.sh)

We then executed the RepeatModeler2 pipeline using default settings.

Note that the script below runs the RepeatModeler2 pipeline through the dfam-tetools wrapper script, whereas 01_aepRep/runRM.sh ran RepeatModeler2 through a Singularity container. Ultimately, the actual command executed by the two scripts was identical.

(03_oligRep/runRMolig.sh)

For the oligactis genome, we were not able to run the RepeatModeler2 pipeline in its entirety, as it repeatedly crashed during the LTR prediction step. Because the pipeline iteratively updates its repeat family predictions over multiple rounds of analysis, and because LTR prediction is the last step in the pipeline, we were able to recover repeat predictions equivalent to the output of a normal RepeatModeler2 run without the optional LTRPipeline step. We used this recovered repeat family file (consensi.fa.classified, which was renamed to oligConsensi.fa.classified) for subsequent repeat masking of the oligactis assembly.

Masking Repeats Using RepeatMasker

For the repeat masking process, we applied the same basic approach as we did for the H. vulgaris genomes.

We first masked the genome using our set of repeat families predicted by RepeatModeler2:

(03_oligRep/runRMaskFullOlig.sh)

We then performed an additional masking step using the Dfam eumetazoa repeat library:

(03_oligRep/runOligMaskEuk.sh)

We combined these two results:

zcat oligFullMask/*cat oligEuk/*cat > olig_genome_combined.fa.cat

And processed them to generate a final set of masked repeats:

(03_oligRep/runProcessOlig.sh)

This generated the following results table:

(excerpt from olig_genome_combined.fa.tbl)

We also generated a softmasked version of the genome fasta for subsequent gene prediction analyses.

Visualizing Repeat Prediction Results

To visualize our repeat annotation results, we generated repeat landscape plots, which use sequence divergence in individual repeat instances throughout a genome to infer the history of transposition events.

To generate repeat landscapes, we used the .align files generated by the RepeatMasker ProcessRepeats function. RepeatMasker has built-in functionality both to calculate the divergence statistics needed to create a repeat landscape (saved in a .divsum file) and to generate interactive plots of the results (saved as a .html file).

We ran these utility functions for all three genomes we had masked with RepeatMasker. First for the AEP assembly:

(04_visRep/calcKimura.sh)

Then the 105 assembly:

(04_visRep/calcKimura105.sh)

Then the oligactis assembly:

(04_visRep/calcKimuraOlig.sh)
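The two RepeatMasker utilities involved here are `calcDivergenceFromAlign.pl`, which summarizes divergence from an .align file into a .divsum file, and `createRepeatLandscape.pl`, which turns the .divsum file into an HTML plot. The sketch below chains them; the function name, output prefix, and genome size argument are placeholders assumed for illustration rather than values from the referenced scripts.

```bash
# Hypothetical sketch: compute divergence statistics, then render the
# interactive landscape. genome_size is the assembly length in bp.
make_landscape() {
    local align_file=$1 genome_size=$2 prefix=$3
    calcDivergenceFromAlign.pl -s "$prefix.divsum" "$align_file" &&
        createRepeatLandscape.pl -div "$prefix.divsum" -g "$genome_size" \
            > "$prefix.html"
}

# Example (requires the RepeatMasker util scripts on PATH;
# GENOME_SIZE_BP stands in for the real assembly length):
#   make_landscape bothMaskFull.align "$GENOME_SIZE_BP" aep
```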

We wanted to customize the repeat landscape plots, so we extracted the relevant results table from each .divsum file.
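One way to pull out that table is sketched below. It assumes the per-class divergence table is the final section of the .divsum file and is introduced by a line beginning "Coverage for each repeat class and divergence (Kimura)"; that marker text is our understanding of the file layout and should be checked against an actual file. The function name is hypothetical.

```bash
# Print everything after the marker line that introduces the
# per-class divergence table (the marker itself is not printed).
extract_divsum_table() {
    awk 'found { print }
         /^Coverage for each repeat class and divergence \(Kimura\)/ { found = 1 }' "$1"
}

# Example (file names are placeholders):
#   extract_divsum_table aep.divsum > aep.divsum.tbl
```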

We then used the following R script to generate stacked bar graphs of each genome's repeat landscape. We generated two plots per genome: one that grouped repeats by class (e.g., DNA element, retro-element, etc.) and one that grouped repeats by family (CR1, LTR, Mariner, etc.).

(04_visRep/kimuraPlot.R)

(AEP repeats grouped by class)

repFamKimura105

(AEP repeats grouped by family)

repSubFamKimura105

(105 repeats grouped by class)

repFamKimuraAEP

(105 repeats grouped by family)

repSubFamKimuraAEP

(oligactis repeats grouped by class)

repFamKimuraOlig

(oligactis repeats grouped by family)

repSubFamKimuraOlig

Finally, we generated a simple bar plot summarizing the repeat composition of each genome using the results tables generated by the RepeatMasker ProcessRepeats function:

(04_visRep/repPercPlots.R)
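For a quick check of the headline number before plotting, the overall masked percentage can be read straight from a .tbl file. The sketch below assumes the summary contains a line of the form `bases masked: <N> bp ( <P> %)`, which is our understanding of the ProcessRepeats table layout; the function name is hypothetical.

```bash
# Print the overall percentage of masked bases from a RepeatMasker .tbl file.
percent_masked() {
    awk '/^bases masked:/ {
        gsub(/[()%]/, "")   # drop parentheses and the percent sign
        print $NF           # last remaining field is the percentage
    }' "$1"
}

# Example:
#   percent_masked bothMaskFull.tbl
```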

repPercBar

Files Associated with This Document