Table 4.

Genes Identified in the ∼1.4 Mb of Finished Sequence from the Mouse WS Region

                                 Mouse-human comparisons
Gene[i]     CpG island[ii]   CDS length in bp, mouse (human)[iii]   CDS, % identity[iv]   AA sequence, % identity[v]
AK005040    Yes              1163 (NA)                              NA                    NA
Gtf2ird2    Yes              2811 (2673)                            79.3                  82.1
Ncf1        No               1173 (1170)                            81.4                  82.5
Gtf2i       Yes              2940 (2937)                            87.7                  96.8
Gtf2ird1    Yes              2071 (2077)                            88.0                  87.1
Cyln2       Yes              3136 (3134)                            86.0                  91.4
Rfc2        Yes              1050 (1066)                            84.9                  92.8
Wbscr15     No               576 (610)                              74.0                  64.6
Eif4h       Yes              747 (747)                              91.3                  98.4
Limk1       Yes              1944 (1944)                            88.0                  95.2
Eln         Yes              2582 (2274)                            81.4                  81.8
Cldn4       Yes              631 (628)                              82.7                  83.2
Cldn3       Yes              660 (663)                              88.2                  91.3
AK017044    No               838 (NA)                               NA                    NA
AK004244    Yes              924 (NA)                               NA                    NA
AK008014    No               544 (529)                              75.3                  NA
Stx1a       Yes              867 (863)                              91.0                  98.3
AK003386    Yes              1135 (1045)                            81.2                  74.9
AK019256    Yes              530 (530)                              78.8                  76.3
BE290321    Yes              521 (NA)                               NA                    NA
Wbscr14     Yes              2595 (2559)                            83.9                  81.6
Tbl2        Yes              1329 (1344)                            85.4                  87.8
Bcl7b       Yes              546 (546)                              88.5                  94.6
Baz1b       Yes              4440 (4452)                            86.6                  91.1
Fzd9        Yes              1648 (1648)                            87.6                  95.8
Fkbp6       Yes              864 (864)                              81.6                  86.0
BF522554    Yes              1455 (1466)                            84.2                  78.8
BE630793    Yes              1211 (1212)                            83.2                  NA
Pom121      Yes              3361 (3440)                            78.1                  71.1
Hip1        Yes              2518 (2518)                            87.6                  87.6

[i] The 30 genes identified within the ∼1.4 Mb of finished sequence from the mouse WS region are listed in their order on mouse chromosome 5G1-G2 (from centromere to telomere; see Fig. 1). Of these 30 genes, 21 have been previously published (listed in Table 1 and depicted in Fig. 1) or, in the case of Gtf2ird2, submitted as an annotated GenBank record (AY014963). In the case of the 9 genes previously not reported as residing in the WS region, representative GenBank accession numbers are provided (see Fig. 3).

[ii] The presence (yes) or absence (no) of an overlap between the 5′ exon of the gene and a CpG island (regions of ≥50% G+C content where the ratio of CpG dinucleotides relative to GpC is ≥60% within a 200-bp window) is indicated. In two cases (BE290321 and BF522554), cDNA sequence was not available to define the 5′ exons; instead, the 5′ exons were predicted by GenScan based on extending an existing EST (to a methionine codon).
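The CpG island criterion in footnote [ii] can be illustrated with a minimal Python sketch. It assumes the 200-bp window is slid along the sequence one base at a time and that a window with CpG but no GpC dinucleotides passes the ratio test; neither detail is stated in the footnote, and the function names are hypothetical rather than taken from the tool actually used.

```python
def window_stats(window_seq):
    """Return (G+C fraction, CpG/GpC ratio) for one window."""
    s = window_seq.upper()
    gc_fraction = (s.count("G") + s.count("C")) / len(s)
    cpg = s.count("CG")  # CpG dinucleotides (cannot overlap themselves)
    gpc = s.count("GC")  # GpC dinucleotides
    if gpc == 0:
        # Assumption: no GpC means the ratio is unbounded if any CpG exists.
        ratio = float("inf") if cpg > 0 else 0.0
    else:
        ratio = cpg / gpc
    return gc_fraction, ratio


def has_cpg_island(seq, window=200, min_gc=0.50, min_ratio=0.60):
    """True if any window meets both thresholds from footnote [ii]."""
    # Assumption: windows advance by 1 bp; shorter sequences never qualify.
    for start in range(len(seq) - window + 1):
        gc_fraction, ratio = window_stats(seq[start:start + window])
        if gc_fraction >= min_gc and ratio >= min_ratio:
            return True
    return False
```

A gene would then be scored "Yes" if its 5′ exon overlaps any qualifying window; the overlap test itself depends on exon coordinates and is not shown here.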

[iii] The length of each mouse coding sequence (CDS) was established by one of several methods. If a mouse RefSeq entry was available for the gene (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), the length of the CDS in that record was used. In the absence of a mouse RefSeq record but presence of a human gene sequence (HIP1), a BLASTZ alignment was used to identify the putative mouse coding and predicted amino acid sequences. In the absence of a human gene, other sources were used to annotate the mouse genes. For example, the rat Pom121 gene aligned with the mouse genomic sequence at >85% identity with precise exon boundaries and was therefore used to annotate the mouse Pom121 exons. Two genes (BF522554 and BE630793) were identified by a MegaBLAST search of the mouse genomic sequence against the TIGR EST database (http://www.tigr.org/tdb/tgi.shtml); the resulting information was used in conjunction with GenScan to establish the mouse gene model. The length of each human coding sequence was estimated by PipMaker (this was done for consistency because there was no corresponding human RefSeq record or human LocusLink mRNA entry for roughly a third of the mouse genes). Of note, analyses performed using available human RefSeq records yielded the same results as those obtained using the PipMaker-predicted human coding sequences. In one case (ELN), PipMaker failed to predict a human coding sequence, and the available RefSeq record was used instead. In another case (GTF2IRD2), PipMaker failed to predict a coding sequence and no full-length human cDNA sequence was available in GenBank, so a GenScan prediction of the human coding sequence was used. In four cases (indicated by NA), none of the above means for predicting the human coding sequence was effective, most often due to the lack of available human genomic or cDNA sequence.

[iv] The tool EMBOSS (http://www.ebi.ac.uk/emboss/align), which uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length, was used to calculate the percent-identity of the mouse and human coding sequences over the aligned regions. In four cases, no human coding sequence was available for this analysis (indicated by NA).
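To make the percent-identity calculation in footnote [iv] concrete, the following is a simplified Python sketch of a Needleman-Wunsch global alignment followed by an identity count. It is not the EMBOSS implementation: the scoring values (match +1, mismatch −1, linear gap −1) are assumptions chosen for brevity, and identity is taken here as identical columns divided by total alignment columns, which is one plausible reading of "over the aligned regions".

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

For example, `percent_identity(*needleman_wunsch(mouse_cds, human_cds))` would give a value comparable in spirit to the "CDS, % identity" column, where `mouse_cds` and `human_cds` are hypothetical strings holding the two coding sequences. Production tools use affine gap penalties and substitution matrices, so exact values will differ.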

[v] The predicted amino acid (AA) sequence derived from each orthologous mouse–human gene pair was compared using EMBOSS. The indicated percent-identity corresponds to the percentage of the total amino acids with identical matches between the two sequences over the aligned regions. When available, the amino acid sequences were derived from RefSeq records; otherwise, matching GenBank protein records were used. In the case of BF522554, neither of these sources was available; thus, a translated version of the coding sequence predicted by PipMaker was used. When PipMaker failed to predict a human coding sequence for a mouse gene or no open reading frame could be found in the predicted coding sequence, BLASTX or BLASTP was used to search the National Center for Biotechnology Information database. For three genes (AK003386, AK019256, and Pom121), this yielded an aligning human protein (XP_042880, XP_042882, and XP_034753.1, respectively). In some cases (indicated by NA), amino acid sequence alignments could not be generated, either because the mouse coding sequence did not provide an open reading frame that enabled an accurate prediction of a protein sequence or because a human amino acid sequence could not be obtained for alignment with the predicted mouse protein.