Construction of Reference SNP Summary Records
During the submission process, the flanking sequences of all dbSNP submissions are compared, pairwise, with the BLAST algorithm (Altschul et al. 1990) to identify cases of independent discovery and reporting. Independent discovery is not a rare event, as many of the groups involved with SNP discovery are working from a large but common set of initial reagents, e.g., sequences from dbEST (Boguski et al. 1993), clones from the genome sequencing pipeline, and submissions from dbSTS (Olson et al. 1989). In addition, new submissions to dbSNP can consist exclusively of additional frequency or genotype data on previously submitted variations. In such cases, the submission of a variation’s flanking sequence can be replaced by a reference to the database record to which the new data apply, using the SNP_LINK line type in the SNPASSAY section. Sets of two or more identical submissions are identified by a stepwise algorithm that first checks flanking sequence for probable identity and then checks the set of STS or GenBank accession numbers that are submitted with the records to ensure that their best representatives have been identified as high-scoring pairs (HSPs) in the NCBI BLAST database (Fig. 1, step A). Our current acceptance criteria have been optimized through heuristic analysis of the BLAST output returned for each pair of markers and their flanking sequence. 
These criteria are as follows: (1) The variable sites must be at the same position in the sequence alignment returned by ungapped BLAST; (2) there must be a maximum of five partial matches in the aligned sequence [defined as an exact match between two International Union of Pure and Applied Chemistry (IUPAC) ambiguous nucleotide codes (e.g., R-R) or a match between a nucleotide and an ambiguity code of which it is a set member (e.g., A-R)]; (3) there must be a maximum of one mismatch in the BLAST alignment (e.g., A-T, A-G, or A-C); and (4) there must be a percent identity score, P ≥ 0.89, where P = (I + M)/min(qlen,slen), I is the number of identical matches in the alignment, M is the number of partial matches in the alignment, and min(qlen,slen) is the minimum length of the two sequences (query and subject) returned from the BLAST alignment. Criterion 4 is necessary to eliminate false matches produced by very short, but highly significant, sequence alignments returned by the BLAST algorithm. Pairs of successful matches as defined by the above criteria are then evaluated for sequence similarity in their submitter-associated accession numbers. This extra level of validation is performed when possible to ensure that our presumed pairs occur within the context of a larger region of identical sequence. Sequence similarity is currently checked by selecting the longest sequence from the set of accession numbers (STS or GenBank) for each marker, and querying the NCBI HSP database to ensure that the pair has been externally neighbored for NCBI BLAST analysis or Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) retrieval. By working through all pairs of submitted records in such a fashion, sets of two or more identical records can be collected into a single reference SNP cluster. These reference SNP records are numbered sequentially and are prefixed with rs to distinguish them from individual submission records. 
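The four pairwise criteria and the collection of matched pairs into clusters can be illustrated with a short sketch. This is not the dbSNP implementation. The function names (classify, same_variation, cluster) are hypothetical, and the inputs are assumed to be the two aligned segments of an ungapped alignment, the alignment column of each variable site, and the full lengths of the two flanking sequences. The text does not say how two different ambiguity codes (e.g., R-N) are scored, nor how matched pairs are merged into sets, so the sketch assumes the former count as mismatches and that merging is transitive (a simple union-find).

```python
from collections import Counter

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = set("ACGT")


def classify(a, b):
    """Label one alignment column as 'identical', 'partial', or 'mismatch'."""
    a, b = a.upper(), b.upper()
    if a in BASES and b in BASES:
        return "identical" if a == b else "mismatch"
    if a not in IUPAC or b not in IUPAC:
        return "mismatch"  # unknown symbol: treated as a mismatch (assumption)
    if a == b:
        return "partial"  # same ambiguity code on both sides, e.g. R-R
    if a in BASES and a in IUPAC[b]:
        return "partial"  # base is a member of the code, e.g. A-R
    if b in BASES and b in IUPAC[a]:
        return "partial"
    return "mismatch"  # two different ambiguity codes (assumption)


def same_variation(q_aln, s_aln, q_site, s_site, qlen, slen,
                   max_partial=5, max_mismatch=1, min_identity=0.89):
    """Apply criteria 1-4 to one ungapped alignment of two flanking sequences."""
    if len(q_aln) != len(s_aln):
        raise ValueError("ungapped alignment segments must have equal length")
    if q_site != s_site:  # criterion 1: variable sites in the same column
        return False
    counts = Counter(classify(a, b) for a, b in zip(q_aln, s_aln))
    # criterion 4: P = (I + M) / min(qlen, slen)
    p = (counts["identical"] + counts["partial"]) / min(qlen, slen)
    return (counts["partial"] <= max_partial        # criterion 2
            and counts["mismatch"] <= max_mismatch  # criterion 3
            and p >= min_identity)


def cluster(matched_pairs):
    """Group record IDs linked by accepted pairs (transitive merge, assumed)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```

With a 21-base alignment that covers two 21-base flanking sequences entirely, P is 1.0 and the pair is accepted. The same 21-base alignment between two 100-base sequences gives P = 0.21 and is rejected, which is the short-alignment case that criterion 4 is meant to exclude.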
The annotation of reference SNP records onto other NCBI resources is accomplished by a second round of ungapped BLAST analysis of the flanking sequences against the GenBank divisions NR, EST, STS, GSS, and HTGS (Fig. 1, step B). The current acceptance criteria for this process are (1) a maximum of five partial matches; (2) a maximum of two mismatches; and (3) an identity score of P ≥ 0.95. These high-stringency criteria were adopted to reduce the false-positive hit rate in our initial pass, and they may be modified as we continue our heuristic optimization of the algorithm.
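The difference between the two sets of thresholds can be made concrete with a small sketch that treats each set as a parameter bundle applied to precomputed alignment counts. The class name Criteria and the constants CLUSTERING and ANNOTATION are hypothetical. The sketch also assumes that P in the annotation step is defined as in the clustering step, P = (I + M)/min(qlen,slen), which the text implies but does not state, and it omits the variable-site position check because that criterion is not listed for the annotation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    """One set of acceptance thresholds for an ungapped BLAST alignment."""
    max_partial: int
    max_mismatch: int
    min_identity: float

    def accepts(self, identical, partial, mismatch, qlen, slen):
        # P = (I + M) / min(qlen, slen), assumed to be shared by both steps
        p = (identical + partial) / min(qlen, slen)
        return (partial <= self.max_partial
                and mismatch <= self.max_mismatch
                and p >= self.min_identity)


# Thresholds as stated in the text for Fig. 1, steps A and B.
CLUSTERING = Criteria(max_partial=5, max_mismatch=1, min_identity=0.89)
ANNOTATION = Criteria(max_partial=5, max_mismatch=2, min_identity=0.95)
```

Under these assumptions, an alignment with 90 identical matches, 2 partial matches, and 1 mismatch against a shorter sequence of length 100 (P = 0.92) would satisfy the clustering thresholds but not the annotation thresholds. An alignment with 96 identical matches and 2 mismatches (P = 0.96) would do the reverse, because the annotation step tolerates one more mismatch but requires a higher identity score.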











