WEB SITE LINKS FOR DATA SUBMISSION, APPROPRIATE NOMENCLATURE, AND ADDITIONAL RESOURCES

Genome Research requires that data from a publication be readily available to the broader community: in publicly held databases where these exist, and otherwise at the Genome Research Web site and, if desired, at the author's Web site. The following list of public databases and resources serves as an introductory guide to data submission and appropriate nomenclature for authors contributing to Genome Research. This list is not comprehensive. If there is an additional database or resource not listed here that would be of use to authors, please contact us.

SEQUENCE DATA

All new sequence data should be submitted to and assigned an accession number(s) by an International Nucleotide Sequence Database Collaboration member (GenBank, EMBL-Bank, or DDBJ) prior to publication.

GenBank, the NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. Instructions for sequence data submission.

The EMBL Nucleotide Sequence Database (EMBL-Bank) obtains DNA and RNA sequences from direct submissions by individual researchers, genome sequencing projects and patent applications. Instructions for sequence data submission.

The DNA Data Bank of Japan (DDBJ) collects DNA sequences from researchers and issues internationally recognized accession numbers to data submitters. Instructions for sequence data submission.

miRBase collects microRNA (miRNA) data, containing all published miRNA sequences, genomic locations and associated annotation. The miRBase Registry section provides a confidential service assigning official names for novel miRNA genes prior to publication of their discovery.

GENOTYPE/PHENOTYPE AND GENOMIC VARIATION DATA

As the study of structural variation in the genome (e.g. indels, duplications, copy number variations, inversions, and translocations) has outpaced the development of standards for the collection of data, authors are currently advised to review the structural variation data guidelines recommended by Scherer et al., (2007) Nat Genet. 39 (7 Suppl):S7-15. Sequence variations and small indels up to 10,000 bp are typically submitted to dbSNP.

The Database of Single Nucleotide Polymorphisms (dbSNP) includes data on genetic variation such as single nucleotide polymorphisms (SNPs), small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation in humans and other organisms. Instructions for SNP data submission.

The Database of Genotype and Phenotype (dbGaP) archives and distributes the results of studies investigating the interaction of genotype and phenotype, including genome-wide association studies, medical sequencing, and molecular diagnostic assays, as well as associations between genotype and non-clinical traits.

The Database of Genomic Variants (DGV) provides a comprehensive summary of structural variation in the human genome and serves as a catalog of control data for studies aiming to correlate genomic variation with phenotypic data.

The Human Structural Variation Database catalogues human genomic polymorphisms ascertained by experimental and computational analyses, including large-scale structural variation (LSV), copy number polymorphisms (CNPs) and intermediate-sized structural variation (ISV).

MICROARRAY DATA

The Gene Expression Omnibus (GEO) is a gene expression/molecular abundance repository and curated resource supporting MIAME-compliant data submissions, including microarray-based experiments that measure gene expression, detect genomic gains and losses (arrayCGH), detect SNPs, identify protein-binding genomic regions in conjunction with ChIP-chip, or locate transcribed regions. GEO also accepts non-array-based high-throughput data, including SAGE, MPSS, and some peptide profiling techniques such as MS/MS. Instructions for data submission.

ArrayExpress is a repository for MIAME-compliant microarray data available for browsing and querying. The ArrayExpress Data Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository. Instructions for data submission.

SEQUENCE READS

The Short Read Archive (SRA), to be fully deployed in 2008 by NCBI in collaboration with Ensembl, archives short read data from next-generation sequencing technologies (e.g. 454 [Roche], Illumina, ABI SOLiD, Helicos). Instructions for data submission.

dbEST, a division of GenBank, contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags (ESTs), from a number of organisms. Instructions for data submission.

PROTEOMICS AND MOLECULAR INTERACTIONS

The International Molecular Exchange Consortium (IMEx), a group of major public interaction data providers, has established standards for the collection and curation of molecular interaction data. The IMEx site provides instructions for submitting interaction data to any of the partner databases (DIP, IntAct, HPRD, MINT, MPact, BioGRID, BOND).

The Database of Interacting Proteins (DIP) catalogs experimentally determined protein interactions from a variety of sources to create a single set of protein-protein interactions.

IntAct is an open-source database system and analysis tool for freely available protein interaction data derived from literature curation or direct user submissions.

GENE AND GENE PRODUCT NOMENCLATURE

Nomenclature for genes and proteins should be in the appropriate format (including appropriate italics and/or capitalization as it applies for each organism's standard nomenclature format) in text and figures, and where available, submitted and approved by the appropriate nomenclature committees. Specific nomenclature guidelines for commonly studied organisms are listed below.

Human nomenclature guidelines from the Human Genome Organisation (HUGO) Gene Nomenclature Committee. Search for current and approved gene names/symbols.

Chicken nomenclature guidelines from the Poultry Species Committee of the National Animal Genome Research Program (NAGRP).

Rat nomenclature guidelines from the Rat Genome Nomenclature Committee (RGNC). Search for current and approved gene names/symbols.

Mouse nomenclature guidelines from the Mouse Genomic Nomenclature Committee (MGNC). Search for current and approved gene names/symbols.

Zebrafish nomenclature guidelines from the Zebrafish Nomenclature Committee (ZNC). Search for current and approved gene names/symbols.

Drosophila nomenclature guidelines adopted by FlyBase. Search for current and approved gene names/symbols.

Arabidopsis nomenclature guidelines adopted by The Arabidopsis Information Resource (TAIR). Search for current and approved gene names/symbols.

C. elegans nomenclature guidelines from WormBase and the Caenorhabditis Genetics Center (CGC). Search for current and approved gene names/symbols.

S. cerevisiae nomenclature guidelines adopted by the Saccharomyces Genome Database (SGD). Search for current and approved gene names/symbols.

Bacterial nomenclature should follow the guidelines established by Demerec et al., (1966) Genetics 54:61-76.

ADDITIONAL RESOURCES

The Gene Ontology (GO) project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

The ENCyclopedia Of DNA Elements (ENCODE) project aims to identify all functional elements in the sequence of the human genome. The recently completed pilot project phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence.

The International HapMap Project, a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States, developed a haplotype map resource to describe the common patterns of human DNA sequence variation to help researchers find genes associated with human disease and response to pharmaceuticals.

SeattleSNPs focuses on identifying, genotyping, and modeling the associations between SNPs in candidate genes and pathways underlying the human inflammatory response.

The H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts, containing curated annotations of human genes and transcripts that include gene structures, alternative splicing isoforms, non-coding functional RNAs, genetic polymorphisms (SNPs, indels and microsatellite repeats), relation with diseases, gene expression profiling, molecular evolutionary features, protein-protein interactions (PPIs) and gene families/groups.

The Cancer Genome Atlas (TCGA) is a comprehensive effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of biological systems, consisting of genes and proteins, endogenous and exogenous chemicals, interaction and reaction networks, and hierarchies and relationships of various biological objects.

The Human Protein Reference Database (HPRD) is a centralized platform to depict and integrate information manually extracted from the literature regarding domain architecture, post-translational modifications, interaction networks, and disease association for each protein in the human proteome.

The Biomolecular Object Network Databank (BOND), formerly BIND, combines sequence, interaction, and related interactome data and content, containing GenBank and BIND data, as well as related tools and information.

The Reactome project is a curated resource of core pathways and reactions in human biology, as well as electronically inferred orthologous events in 22 non-human species including mouse, rat, chicken, puffer fish, C. elegans, Drosophila, yeast, two plants, and E. coli.

The Clusters of Orthologous Groups (COGs) resource was constructed by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Online Mendelian Inheritance in Man (OMIM), a phenotypic companion to the human genome project, is a catalog of human genes and genetic disorders, focusing primarily on heritable genetic diseases.

Pseudogene.org is a comprehensive database of identified pseudogenes, utilities to identify pseudogenes, various publication data sets, and a pseudogene knowledgebase.

Repbase Update (RU) is a database of prototypic sequences representing repetitive DNA from a number of eukaryotic species, with instructions for the submission of sequence data.