# New *Hydra* genomes reveal conserved principles of hydrozoan transcriptional regulation

This repository contains step-by-step descriptions of all analyses and associated code related to the following manuscript:

>Cazet JF, Siebert S, Morris Little H, Bertemes P, Primack AS, Ladurner P, Achrainer M, Fredriksen MT, Moreland RT, Singh S, Zhang S, Wolfsberg TG, Schnitzler CE, Baxevanis AD, Simakov O, Hobmayer B, and Juliano CE. 2022. New *Hydra* genomes reveal conserved principles of hydrozoan transcriptional regulation. bioRxiv 2022.06.21.496857; doi: [10.1101/2022.06.21.496857](https://doi.org/10.1101/2022.06.21.496857)

The manuscript is also accompanied by a genome portal, available [here](https://research.nhgri.nih.gov/HydraAEP/), that allows users to interact with and download the data generated in this study. A BLAST server is available to search for genes of interest in the *H. oligactis* and strain AEP *H. vulgaris* gene models. The portal includes an interactive genome browser for visualizing gene models, repetitive regions, ATAC-seq and CUT&Tag peaks, ATAC-seq and CUT&Tag read density, and sequence conservation across the AEP assembly. The website also features an interactive ShinyCell portal for viewing the AEP-aligned *Hydra* single-cell atlas.

##### Structure and intent of this repository

This repository is organized around markdown documents that are each focused on one particular computational aspect of the manuscript. Each markdown includes all code used for the analysis in question along with accompanying text that explains the code's purpose and rationale. Each markdown is also accompanied by a folder that contains the original script files used for the analysis as well as files generated by the analysis itself. Descriptions for all files within each folder can be found at the bottom of the accompanying markdown document.

Our intention in generating this repository was to document the methodology we used to produce the results reported in the manuscript in sufficient detail for other researchers to reproduce our findings. However, the code is written in a manner that relies on directory/file structures and software path configurations that are specific to the systems on which the analyses were initially performed. This original file organization is not recapitulated by this repository. In addition, because of file size limitations, we are not able to provide all necessary files for every analysis via GitHub. As such, users will need to modify the paths within each script and download additional files from other sources (described below) for the code to run properly after the repository has been cloned.

##### Accessing additional necessary files

All files necessary for performing the analyses described in this repository are available through the [*Hydra vulgaris*, strain AEP genome portal](https://research.nhgri.nih.gov/HydraAEP/download/index.cgi?dl=fa). Specifically, we provide complete versions of the folders that accompany each markdown document in this repository, including all files that were too large to host on GitHub. We also provide all sequencing data as well as R binary files containing various versions of the AEP-mapped single-cell RNA-seq atlas formatted as Seurat objects (v4).

Raw sequencing data is also available via NCBI under the BioProject ID PRJNA816482. The strain AEP *H. vulgaris* genome assembly is hosted on GenBank under the accession JALDPZ010000000 and the *H. oligactis* assembly is hosted under the accession JALDAD010000000. 

##### A note on naming conventions

When preparing the new genome portal, modifications were made to the naming conventions used for genome contigs/scaffolds and gene/transcript models. Because these changes were made after all analyses for the manuscript had been completed, the code in this repository uses a different naming convention than the one used on the genome portal.

The scaffold naming convention used for the *H. vulgaris*, strain AEP genome assembly and annotation process uses the prefix 'chr-' (for chromosome) followed by a number. For example, chr-1 refers to the scaffold 'chromosome 1'. Gene models were named using the format HVAEP1_G######, with 'HVAEP1' indicating the genome version (i.e., *H. vulgaris*, strain AEP, version 1), 'G' indicating that the identifier refers to a gene, and '######' being a unique padded numeric ID for a particular gene model. For example, the gene name for *wnt3* is HVAEP1_G010730. Genes are named according to their order in the genome, such that HVAEP1_G010729 is the gene immediately upstream of HVAEP1_G010730 and HVAEP1_G010731 is the gene immediately downstream of HVAEP1_G010730. The transcript naming convention uses the format HVAEP1_T######.#, with 'HVAEP1' again indicating the genome version, 'T' indicating that the identifier refers to a transcript, '######' being the same unique numeric ID as the parent gene, and '.#' indicating the transcript isoform number. For example, the first isoform for *wnt3* in the AEP gene models is HVAEP1_T010730.1.
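As an illustration of the original ID format, the following is a minimal Python sketch that parses a transcript ID into its parent gene ID and isoform number, and derives the IDs numbered directly before and after a gene. The function names are hypothetical and are not part of the repository's scripts. The neighbor lookup assumes only what is stated above (numbering follows genome order); it does not check that the neighboring IDs exist or sit on the same scaffold.

```python
import re

# Patterns for the original (pre-portal) strain AEP identifiers described above.
GENE_RE = re.compile(r"^HVAEP1_G(\d{6})$")
TRANSCRIPT_RE = re.compile(r"^HVAEP1_T(\d{6})\.(\d+)$")


def parse_transcript_id(transcript_id):
    """Return (parent gene ID, isoform number) for an original-format transcript ID."""
    match = TRANSCRIPT_RE.match(transcript_id)
    if match is None:
        raise ValueError(f"not an original-format AEP transcript ID: {transcript_id}")
    number, isoform = match.groups()
    return f"HVAEP1_G{number}", int(isoform)


def adjacent_gene_ids(gene_id):
    """Return the gene IDs numbered immediately before and after gene_id.

    Assumption: consecutive numbers correspond to neighboring genes. This does
    not verify that either ID exists or lies on the same scaffold.
    """
    match = GENE_RE.match(gene_id)
    if match is None:
        raise ValueError(f"not an original-format AEP gene ID: {gene_id}")
    number = int(match.group(1))
    return f"HVAEP1_G{number - 1:06d}", f"HVAEP1_G{number + 1:06d}"
```

For the *wnt3* example, `parse_transcript_id("HVAEP1_T010730.1")` gives the parent gene HVAEP1_G010730 and isoform 1.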

On the genome portal, the AEP scaffold prefix was modified from 'chr-' to 'HVAEP', such that the scaffold chr-1 became HVAEP1. For the AEP gene and transcript models, the 'HVAEP1' prefix, which was initially intended to indicate genome version, was modified to instead reflect the scaffold that contains the gene model. In addition, the underscore was replaced with a dot. Thus, the *wnt3* transcript ID HVAEP1_T010730.1 was changed to HVAEP6.T010730.1.
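The renaming rule for strain AEP can be sketched in Python as follows. This is an illustration of the rule as described, not the procedure used to build the portal, and the function names are hypothetical. Because the original ID does not encode the scaffold, the scaffold has to be supplied separately (for example from the annotation file); the *wnt3* example implies that this gene lies on chr-6.

```python
import re

OLD_SCAFFOLD_RE = re.compile(r"^chr-(\d+)$")
# Original gene (HVAEP1_G######) or transcript (HVAEP1_T######.#) identifier.
OLD_MODEL_RE = re.compile(r"^HVAEP1_(G\d{6}|T\d{6}\.\d+)$")


def to_portal_scaffold(old_scaffold):
    """Convert an original scaffold name (e.g. chr-1) to its portal name (HVAEP1)."""
    match = OLD_SCAFFOLD_RE.match(old_scaffold)
    if match is None:
        raise ValueError(f"unexpected scaffold name: {old_scaffold}")
    return f"HVAEP{match.group(1)}"


def to_portal_model_id(old_id, old_scaffold):
    """Convert an original gene/transcript ID to the portal format.

    old_scaffold is the scaffold that contains the gene model; it must be
    looked up elsewhere because the original ID does not carry it.
    """
    match = OLD_MODEL_RE.match(old_id)
    if match is None:
        raise ValueError(f"unexpected gene/transcript ID: {old_id}")
    # Swap the version prefix for the scaffold name and the underscore for a dot.
    return f"{to_portal_scaffold(old_scaffold)}.{match.group(1)}"
```

With these definitions, `to_portal_model_id("HVAEP1_T010730.1", "chr-6")` reproduces the portal ID HVAEP6.T010730.1 given above.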

The contig/scaffold naming convention used for the *H. oligactis* genome assembly process uses either the prefix 'contig_' or 'scaffold\_' followed by an arbitrary number. The transcript models follow the standard AUGUSTUS format (e.g., g1842.t1): a 'g' prefix indicating that the identifier refers to a gene model, a unique non-padded numeric ID for each gene ('1842'), and a transcript isoform ID ('.t1').

On the genome portal, the *H. oligactis* contigs/scaffolds were renamed to all have the prefix 'HOLI'. The numbering was also changed to reflect contig/scaffold size, with the largest contig/scaffold being assigned the ID HOLI00001 (previously contig_18179) and the smallest being assigned the ID HOLI16314 (previously contig_5588). Gene models kept the same AUGUSTUS-formatted ID, but the parent contig/scaffold was prepended to each ID, such that g1842.t1 became HOLI00150.g1842.t1.
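The *H. oligactis* scheme can be sketched the same way. The Python below is a hypothetical illustration: it ranks contigs/scaffolds by length and assigns five-digit 'HOLI' names, then prepends the portal name to an AUGUSTUS ID. How ties between equal-length sequences were resolved for the portal is not stated, so the alphabetical tie-break here is an assumption, and the function names are invented for this example.

```python
def assign_portal_names(lengths):
    """Map original contig/scaffold names to HOLI names, largest sequence first.

    lengths: dict of original name -> sequence length in bp.
    Assumption: ties are broken alphabetically by original name.
    """
    ranked = sorted(lengths, key=lambda name: (-lengths[name], name))
    return {old: f"HOLI{rank:05d}" for rank, old in enumerate(ranked, start=1)}


def holi_model_id(augustus_id, portal_scaffold):
    """Prepend the portal contig/scaffold name to an AUGUSTUS-formatted ID."""
    return f"{portal_scaffold}.{augustus_id}"
```

For example, `holi_model_id("g1842.t1", "HOLI00150")` returns HOLI00150.g1842.t1, matching the ID given above.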

All files and code associated with this repository, including the downloadable files hosted on the genome portal (specifically via the 'Scripts and Data' page, available [here](https://research.nhgri.nih.gov/HydraAEP/download/index.cgi?dl=fa)), use the original naming convention, whereas all other parts of the genome portal use the new naming convention. We include tables in the folder 'ID_Conversion' that provide the necessary information for mapping the IDs from the old naming convention to their equivalent ID under the new naming convention.
