Database Divisions and Homology Search Files: A Guide for the Perplexed

(Downloading may take up to 30 seconds. If the slide opens in your browser, select File -> Save As to save it.)

Click on image to view larger version.

Figure 1.Figure 1.Figure 1.
Figure 1.

An example of a genomic sequence record (DDBJ/EMBL/GenBank accession number AC000003) as it progresses from an unfinished to a finished state. (These records have been truncated for the printed journal. Full views of these sequence can be retrieved fromhttp://www.ncbi.nlm.nih.gov/Entrez/nucleotide.htmlby entering the corresponding NID numbers (excluding the initial “g”) into the query box and specifying “Sequence ID” as the search field. Using the accession number, i.e., AC000003, as the query term will always and only retrieve the latest (finished) version of the record.) (A) Phase 1 records consist of multiple sequences derived from a single genomic clone such as the insert of a cosmid vector or bacterial artificial chromosome (BAC). The entire insert is represented by a single accession number, even though at this stage it consists of multiple sequence fragments, the order and orientation of which are unknown. Such records can be identified in GenBank by the keywords HTG; HTGS_PHASE1 and are found in the HTG Division of GenBank. (B) Phase 2 records consist of ordered sequence fragments with one or more gaps and are identified by the keywords HTG; HTGS_PHASE2. (C) Phase 3 records represent finished data with no gaps and an assumed accuracy of 10–4 errors or less. When records reach this finished state, they are moved to the appropriate organismal division of GenBank, in this case the Primate (PRI) Division. The only distinctions between these records and traditional GenBank records are their size and the keyword, HTG, which indicates their origin as part of a high-throughput sequencing project. Note well that although the accession number remains constant as the genomic sequence progresses through the various stages of completion, a different nucleotide sequence identifier (NID) number is assigned to each phase (e.g. g1556454 → g2182283 → g2204282). In practice, not all laboratories employ these phase definitions and not all records go through all phases. Some records are submitted initially as finished (phase 3); others may come in initially as phase 1 and updated directly to phase 3. Also note that records tend to include more and more annotation as they progress through the process; however, this is not a requirement for finished sequence and the degree of annotation varies considerably depending upon the submitting laboratory.

This Article

  1. Genome Res. 7: 952-955

Preprint Server