Database Divisions and Homology Search Files: A Guide for the Perplexed

B.F. Francis Ouellette and Mark S. Boguski

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894 USA

The exponential growth of DNA sequence data has become a challenge for end users and database curators alike. When one of us (M.S.B.) was finishing graduate school, GenBank® (release 42) contained a mere 6.7 Mb in 9700 sequences. As we write this, however, GenBank (Benson et al. 1997) has topped 1000 Mb in >1.6 million sequences (release 102). (Information on GenBank releases is available at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.) The National Center for Biotechnology Information (NCBI) and its partners in the international database collaboration, the DNA Database of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL), all strive to collect, manage, and distribute these data in the most efficient and usable manner possible. These organizations also provide homology search, database query, and information retrieval services that serve the general molecular biology community as well as more specialized users. Unfortunately, it is easy to become confused about the many ways in which the data are made available for downloading, homology searching, and more general information retrieval purposes. We hope to clarify some of these issues here, with an emphasis on the manner in which high-throughput genomic sequence is processed, distributed, and made available for BLAST searching. We will emphasize services provided through NCBI but also note comparable services at the European Bioinformatics Institute (EBI) and the slight differences between GenBank, DDBJ, and the EMBL Data Library.

Divisions of the Nucleotide Sequence Databases

The nucleotide sequence databases were originally organized around loosely defined taxonomic groupings that reflected the research trends and sequencing activity of a former era. These divisions are not as biologically relevant today, but so many public and private software systems have been developed to process them that the databases must be conservative when contemplating changes in the structure of data distributions. The current divisional structures of GenBank, EMBL, and DDBJ are shown in Table 1. The reader will note that not all of these divisions are taxonomically based and that certain “functional” divisions have been added over time. Notably, in recent years, new divisions were added for EST and STS data because these sequences differed from traditional GenBank entries in many ways, including the way in which people computed on the data (Boguski et al. 1993). The newest functional division, the High Throughput Genomic (HTG) Sequence Division, is described below. Additional information is available at http://www.ncbi.nlm.nih.gov/HTGS.

Table 1. Database Divisions

HTG

Although the issue is still a matter of some controversy (Adams and Venter 1996; Bentley 1996), a consortium of large-scale sequencing centers and their funding agencies has reached a consensus agreement (the “Bermuda Principles”) regarding data produced in publicly funded projects. This agreement states that “unfinished” sequence data be released as soon as they are “usable” for homology searching and other types of sequence analysis. Usable data are currently defined as all sequences existing in contigs of >2 kb. Preliminary data such as these can be generated quite rapidly, as they usually represent automated assemblies of single-pass, shotgun sequences. However, conversion to the “finished” state (complete contiguity with an error rate of 10^-4 or less) may take considerably longer; hence the motivation to release unfinished but usable sequence earlier. This process of data generation and public release is entirely different from traditional GenBank data submission, and the international collaborators have devised and implemented a system to accommodate this new paradigm. Unfinished sequences are submitted to and stored in the HTG Division, and each record is plainly labeled to indicate the preliminary nature of the data. An example is given in Figure 1.
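As a rough illustration of the release criterion and the finishing standard described above, the following Python sketch filters an assembly for contigs longer than 2 kb and computes how many errors a finished sequence of a given length may contain at a rate of one error per 10,000 bases. The function names, the representation of contigs as a dictionary of sequence strings, and the reading of "2 kb" as 2000 bases are assumptions made for illustration; they are not part of the agreement or of any submission software.

```python
USABLE_MIN_LENGTH = 2000   # assumption: "2 kb" is read as 2000 bases
BASES_PER_ERROR = 10_000   # finished standard: error rate of 10^-4 or less


def usable_contigs(contigs):
    """Return only the contigs strictly longer than 2 kb."""
    return {name: seq for name, seq in contigs.items()
            if len(seq) > USABLE_MIN_LENGTH}


def max_expected_errors(length):
    """Upper bound on expected erroneous bases in a finished sequence."""
    return length / BASES_PER_ERROR


# Hypothetical assembly of three contigs from one clone.
contigs = {
    "contig_a": "A" * 1500,
    "contig_b": "ACGT" * 600,  # 2400 bases
    "contig_c": "G" * 2000,    # exactly 2 kb, so not ">2 kb"
}

print(sorted(usable_contigs(contigs)))  # ['contig_b']
print(max_expected_errors(100_000))     # 10.0
```

At this rate, a finished 100-kb clone insert could still contain up to about 10 erroneous bases.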

Figure 1.

An example of a genomic sequence record (DDBJ/EMBL/GenBank accession number AC000003) as it progresses from an unfinished to a finished state. (These records have been truncated for the printed journal. Full views of these sequences can be retrieved from http://www.ncbi.nlm.nih.gov/Entrez/nucleotide.html by entering the corresponding NID numbers (excluding the initial “g”) into the query box and specifying “Sequence ID” as the search field. Using the accession number, i.e., AC000003, as the query term will always retrieve only the latest (finished) version of the record.) (A) Phase 1 records consist of multiple sequences derived from a single genomic clone such as the insert of a cosmid vector or bacterial artificial chromosome (BAC). The entire insert is represented by a single accession number, even though at this stage it consists of multiple sequence fragments, the order and orientation of which are unknown. Such records can be identified in GenBank by the keywords HTG; HTGS_PHASE1 and are found in the HTG Division of GenBank. (B) Phase 2 records consist of ordered sequence fragments with one or more gaps and are identified by the keywords HTG; HTGS_PHASE2. (C) Phase 3 records represent finished data with no gaps and an assumed error rate of 10^-4 or less. When records reach this finished state, they are moved to the appropriate organismal division of GenBank, in this case the Primate (PRI) Division. The only distinctions between these records and traditional GenBank records are their size and the keyword HTG, which indicates their origin as part of a high-throughput sequencing project. Note well that although the accession number remains constant as the genomic sequence progresses through the various stages of completion, a different nucleotide sequence identifier (NID) number is assigned to each phase (e.g., g1556454 → g2182283 → g2204282). In practice, not all laboratories employ these phase definitions and not all records go through all phases. Some records are submitted initially as finished (phase 3); others may come in initially as phase 1 and be updated directly to phase 3. Also note that records tend to include more and more annotation as they progress through the process; however, this is not a requirement for finished sequence, and the degree of annotation varies considerably depending upon the submitting laboratory.
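The keyword conventions in the caption above can be expressed as a small classifier. The Python sketch below is only an illustration: the function names and the idea of passing the keywords and division as plain strings are assumptions, and no particular flat-file parser is implied. It also shows the NID adjustment mentioned for Entrez queries (dropping the initial "g").

```python
def htg_phase(keywords, division):
    """Classify a record using the HTG keyword conventions.

    Returns 1, 2, or 3 for HTG records and None for records
    that do not carry the HTG keyword at all.
    """
    kw = {k.strip().upper() for k in keywords}
    if "HTG" not in kw:
        return None
    if "HTGS_PHASE1" in kw:
        return 1  # unordered, unoriented fragments
    if "HTGS_PHASE2" in kw:
        return 2  # ordered fragments with one or more gaps
    if division.upper() != "HTG":
        return 3  # finished: HTG keyword, organismal division
    raise ValueError("HTG Division record without a phase keyword")


def entrez_sequence_id(nid):
    """Strip the initial 'g' from an NID such as 'g1556454'."""
    return nid[1:] if nid.startswith("g") else nid


print(htg_phase(["HTG", "HTGS_PHASE1"], "HTG"))  # 1
print(htg_phase(["HTG"], "PRI"))                 # 3
print(entrez_sequence_id("g1556454"))            # 1556454
```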

HTG records contain sequences derived from a single genomic clone, and the entire set receives a single GenBank accession number that remains with the sequence as it progresses to the finished state. When declared finished by the submitting laboratory, these records move into the traditional repositories of finished data, the organismal divisions of GenBank, and are placed according to the biological source of the sequence. Thus, finished human sequences are distributed in the Primate (PRI) Division of GenBank (or the HUM Division for EMBL and DDBJ), whereas finished nematode and Arabidopsis sequences are found in the Invertebrate (INV) and Plant (PLN) Divisions, respectively (Table 2). It may seem rather coarse to lump Homo sapiens with other primates and Caenorhabditis elegans with other invertebrates, but this legacy of the earlier history of GenBank is irrelevant in the face of meta-information retrieval systems such as NCBI’s Entrez, which, in conjunction with NCBI’s Taxonomy Database, permits one to explore and retrieve sequence records for any of the ∼25,000 biological species in GenBank. Furthermore, new versions of the BLAST software permit homology searches based on inclusive taxonomy parameters (Zhang and Madden 1997).
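The placement of finished records described above amounts to a lookup from organism to division. The following Python sketch covers only the three organisms named in the text; the function name is hypothetical, the species name "Arabidopsis thaliana" is supplied here for illustration, and applying INV and PLN unchanged to EMBL and DDBJ is an assumption, since the text states only the human (PRI versus HUM) difference explicitly.

```python
# GenBank divisions for finished records of the organisms named in the text.
GENBANK_DIVISION = {
    "Homo sapiens": "PRI",
    "Caenorhabditis elegans": "INV",
    "Arabidopsis thaliana": "PLN",
}


def finished_division(organism, database="GenBank"):
    """Return the division holding a finished record for this organism."""
    if database not in ("GenBank", "EMBL", "DDBJ"):
        raise ValueError(f"unknown database: {database}")
    division = GENBANK_DIVISION[organism]  # KeyError for other organisms
    # EMBL and DDBJ place human sequences in HUM rather than PRI.
    if database != "GenBank" and organism == "Homo sapiens":
        return "HUM"
    return division


print(finished_division("Homo sapiens"))          # PRI
print(finished_division("Homo sapiens", "EMBL"))  # HUM
```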

Table 2. Relationships Between Divisions and Homology Search Files

Homology Search Files at NCBI and EBI

The divisional structures of GenBank, DDBJ, and EMBL Data Library were primarily designed for the purposes of efficient data distribution and file storage. For homology search purposes, there are other, more practical and desirable ways to organize the sequence data. For example, unfinished data such as EST and HTG sequences always need to be analyzed with error-tolerant software (such as BLASTX or TBLASTN) (Altschul et al. 1994). On the other hand, finished (accurate and annotated) data may have coding features that can be automatically converted to conceptual translations in a protein database where BLASTP provides a more sensitive and specific search tool. Thus, it is inefficient to combine finished and unfinished data in a single file for homology search purposes. It is also undesirable to combine qualitatively different types of data in a single search file. STSs, for example, have their own division of GenBank, and homology searching is not the most appropriate method for querying these data (Schuler 1997).

Another important consideration in the construction of homology search files is the issue of sequence redundancy (Altschul et al. 1994). GenBank, DDBJ, and EMBL Data Library are historical archives and may contain many nearly identical versions of the same sequence. The “nr” (for nonredundant) data set (Altschul et al. 1994) is NCBI’s attempt to provide a more streamlined, yet comprehensive, collection of sequences for homology search purposes. nr includes finished (but not unfinished) HTG records (Table 2). Another important example is the “month” data set, which provides a rolling one-month view of new GenBank entries. month is provided so that one does not have to repeatedly search previously examined portions of nr to identify matches to new sequences that have appeared since the last search was performed. month includes newly finished HTG records. Unfinished (phase 1 and phase 2) HTG data are accessible for BLAST searching at NCBI by specifying the htgs database (Table 2).

As described previously, there are slight variations in the divisional structures of the three collaborating databases (Table 1). There are also differences in the ways in which the sequence data are made available for homology searching. One important example of this is the EMBL “ALL” database (emall) that combines both finished and unfinished HTG sequences for FASTA searching (Table 2).
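The rules stated so far for where HTG records can be searched may be summarized in a few lines of Python. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and only the four search files named in the text (nr, month, htgs, and emall) are covered, not the full contents of Table 2.

```python
def htg_search_files(phase, site="NCBI", newly_finished=False):
    """Return the search files that contain an HTG record of a given phase."""
    if phase not in (1, 2, 3):
        raise ValueError("phase must be 1, 2, or 3")
    if site == "EBI":
        # emall combines finished and unfinished HTG sequences (FASTA).
        return ["emall"]
    if site != "NCBI":
        raise ValueError(f"unknown site: {site}")
    if phase in (1, 2):
        return ["htgs"]  # unfinished data, searched with BLAST
    files = ["nr"]       # finished records are part of nr
    if newly_finished:
        files.append("month")  # rolling view of new entries
    return files


print(htg_search_files(1))                       # ['htgs']
print(htg_search_files(3, newly_finished=True))  # ['nr', 'month']
print(htg_search_files(2, site="EBI"))           # ['emall']
```

A user who wants to cover both finished and unfinished human genomic sequence at NCBI would therefore need to search nr and htgs separately, whereas a single emall search at EBI covers both.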

DDBJ, EMBL, and GenBank must be conservative in contemplating changes to the divisional structures of the databases. However, these organizations can be and have been more flexible in producing specialized collections for homology searching. Thus, the user community should view the databases listed in Table 2 as subject to changes and improvements, driven by the ever-increasing quantity and variety of new sequence data.

Other Ways to Access Data

Entrez is a meta-information system that has been described in detail elsewhere (Schuler et al. 1996; Benson et al. 1997) and allows the user to query an extensive information space characterized by six divisions: (1) DNA sequences; (2) protein sequences; (3) maps and genomes; (4) macromolecular structures; (5) biomedical literature; and (6) taxonomy. Regarding DNA sequences in Entrez, all data in GenBank, regardless of Division, are available, including unfinished HTG records. These data may be queried using accession numbers, nucleotide sequence identifiers (NIDs), authors’ names, and a variety of other key words, as well as by accessing precomputed homology search results—a concept referred to as neighboring. In the near future, NCBI hopes to make available BLASTX neighbors through its Entrez service. This would allow users to access sequence similarities between even unfinished HTG records and the proteins they may encode.

Summary

All of the data in GenBank (and EMBL and DDBJ) are made available in a variety of ways, tailored to particular uses such as efficient data submission, distribution, and sequence homology searching. Unfortunately, this can be somewhat confusing for contributors, data managers, and end users, all of whom have somewhat different perspectives and needs. The international database collaborators have striven to meet the various requirements of a diverse community, but new suggestions are always welcomed and may be directed to NCBI’s service desk at info@ncbi.nlm.nih.gov. Information resource providers will continue to experiment with new ways in which to make sequence data more accessible and useful to the community, particularly for homology search purposes.

