Genes Identified in the ∼1.4 Mb of Finished Sequence from the Mouse WS Region
| Gene | CpG island | CDS length in bp, mouse (human) | CDS, % identity (mouse-human) | AA sequence, % identity (mouse-human) | |
| AK005040 | Yes | 1163 (NA) | NA | NA | |
| Gtf2ird2 | Yes | 2811 (2673) | 79.3 | 82.1 | |
| Ncf1 | No | 1173 (1170) | 81.4 | 82.5 | |
| Gtf2i | Yes | 2940 (2937) | 87.7 | 96.8 | |
| Gtf2ird1 | Yes | 2071 (2077) | 88.0 | 87.1 | |
| Cyln2 | Yes | 3136 (3134) | 86.0 | 91.4 | |
| Rfc2 | Yes | 1050 (1066) | 84.9 | 92.8 | |
| Wbscr15 | No | 576 (610) | 74.0 | 64.6 | |
| Eif4h | Yes | 747 (747) | 91.3 | 98.4 | |
| Limk1 | Yes | 1944 (1944) | 88.0 | 95.2 | |
| Eln | Yes | 2582 (2274) | 81.4 | 81.8 | |
| Cldn4 | Yes | 631 (628) | 82.7 | 83.2 | |
| Cldn3 | Yes | 660 (663) | 88.2 | 91.3 | |
| AK017044 | No | 838 (NA) | NA | NA | |
| AK004244 | Yes | 924 (NA) | NA | NA | |
| AK008014 | No | 544 (529) | 75.3 | NA | |
| Stx1a | Yes | 867 (863) | 91.0 | 98.3 | |
| AK003386 | Yes | 1135 (1045) | 81.2 | 74.9 | |
| AK019256 | Yes | 530 (530) | 78.8 | 76.3 | |
| BE290321 | Yes | 521 (NA) | NA | NA | |
| Wbscr14 | Yes | 2595 (2559) | 83.9 | 81.6 | |
| Tbl2 | Yes | 1329 (1344) | 85.4 | 87.8 | |
| Bcl7b | Yes | 546 (546) | 88.5 | 94.6 | |
| Baz1b | Yes | 4440 (4452) | 86.6 | 91.1 | |
| Fzd9 | Yes | 1648 (1648) | 87.6 | 95.8 | |
| Fkbp6 | Yes | 864 (864) | 81.6 | 86.0 | |
| BF522554 | Yes | 1455 (1466) | 84.2 | 78.8 | |
| BE630793 | Yes | 1211 (1212) | 83.2 | NA | |
| Pom121 | Yes | 3361 (3440) | 78.1 | 71.1 | |
| Hip1 | Yes | 2518 (2518) | 87.6 | 87.6 | |
- The 30 genes identified within the ∼1.4 Mb of finished sequence from the mouse WS region are listed in their order on mouse chromosome 5G1-G2 (from centromere to telomere; see Fig. 1). Of these 30 genes, 21 have been previously published (listed in Table 1 and depicted in Fig. 1) or, in the case of Gtf2ird2, submitted as an annotated GenBank record (AY014963). For the 9 genes not previously reported as residing in the WS region, representative GenBank accession numbers are provided (see Fig. 3).
- The presence (yes) or absence (no) of an overlap between the 5′ exon of the gene and a CpG island (regions of ≥50% G+C content where the ratio of CpG dinucleotides relative to GpC is ≥60% within a 200-bp window) is indicated. In two cases (BE290321 and BF522554), cDNA sequence was not available to define the 5′ exons; instead, the 5′ exons were predicted by GenScan based on extending an existing EST (to a methionine codon).
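The CpG island criterion stated above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `window_stats` and `has_cpg_island` are hypothetical, the window is slid one base at a time, and a window with CpG but no GpC dinucleotides is treated as passing the ratio test. The footnote does not say how the original analysis handled these details.

```python
def window_stats(seq):
    """Return (G+C fraction, CpG/GpC ratio) for one window of sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    gc = sum(1 for base in seq if base in "GC") / n
    # Count overlapping dinucleotides along the window.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    gpc = sum(1 for i in range(n - 1) if seq[i:i + 2] == "GC")
    if gpc == 0:
        # Assumption: no GpC at all means the ratio is unbounded if any CpG exists.
        ratio = float("inf") if cpg else 0.0
    else:
        ratio = cpg / gpc
    return gc, ratio


def has_cpg_island(seq, window=200, min_gc=0.50, min_ratio=0.60):
    """True if any window of the given size meets both thresholds."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1):
        gc, ratio = window_stats(seq[start:start + window])
        if gc >= min_gc and ratio >= min_ratio:
            return True
    return False
```

Calling `has_cpg_island` on the sequence of a gene's 5′ exon plus flanking sequence would give a yes/no call comparable in spirit to the "CpG island" column. Sequences shorter than the window return `False` in this sketch.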
- The length of each mouse coding sequence (CDS) was established by one of several methods. If a mouse RefSeq entry was available for the gene (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), the length of the CDS in that record was used. In the absence of a mouse RefSeq record but presence of a human gene sequence (HIP1), a BLASTZ alignment was used to identify the putative mouse coding and predicted amino acid sequences. In the absence of a human gene, other sources were used to annotate the mouse genes. For example, the rat Pom121 gene aligned with the mouse genomic sequence at >85% identity with precise exon boundaries and was therefore used to annotate the mouse Pom121 exons. Two genes (BF522554 and BE630793) were identified by a MegaBLAST search of the mouse genomic sequence against the TIGR EST database (http://www.tigr.org/tdb/tgi.shtml); the resulting information was used in conjunction with GenScan to establish the mouse gene model. The length of each human coding sequence was estimated by PipMaker (this was done for consistency because there was neither a corresponding human RefSeq record nor a human LocusLink mRNA entry for roughly a third of the mouse genes). Of note, analyses performed using available human RefSeq records yielded the same results as those obtained using the PipMaker-predicted human coding sequences. In one case (ELN), PipMaker failed to predict a human coding sequence, and the available RefSeq record was used instead. In one case (GTF2IRD2), PipMaker failed to predict a coding sequence and no full-length human cDNA sequence was available in GenBank; here, a GenScan prediction of the human coding sequence was used. In four cases (indicated by NA), none of the above means for predicting the human coding sequence was effective, most often due to the lack of available human genomic or cDNA sequence.
- The tool EMBOSS (http://www.ebi.ac.uk/emboss/align), which uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences over their entire length, was used to calculate the percent identity of the mouse and human coding sequences over the aligned regions. In four cases, no human coding sequence was available for this analysis (indicated by NA).
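To make the percent-identity calculation concrete, here is a minimal sketch of Needleman-Wunsch global alignment followed by an identity count. It is not the EMBOSS implementation: the scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for brevity, whereas EMBOSS uses substitution matrices and affine gap penalties. Identity is computed here as identical columns divided by all alignment columns, which is one plausible reading of "over the aligned regions".

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Percent of alignment columns holding the same residue in both sequences."""
    aligned_a, aligned_b = needleman_wunsch(a, b)
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

The same two functions work on amino acid strings, although a realistic protein comparison would score substitutions with a matrix such as BLOSUM62 rather than a flat match/mismatch value.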
- The predicted amino acid (AA) sequences derived from each orthologous mouse–human gene pair were compared using EMBOSS. The indicated percent identity corresponds to the percentage of the total amino acids with identical matches between the two sequences over the aligned regions. When available, the amino acid sequences were derived from RefSeq records; otherwise, matching GenBank protein records were used. In the case of BF522554, neither of these sources was available; thus, a translated version of the coding sequence predicted by PipMaker was used. When PipMaker failed to predict a human coding sequence for a mouse gene or no open reading frame could be found in the predicted coding sequence, BLASTX or BLASTP was used to search the National Center for Biotechnology Information database. For three genes (AK003386, AK019256, and Pom121), this yielded an aligning human protein (XP_042880, XP_042882, and XP_034753.1, respectively). In some cases (indicated by NA), amino acid sequence alignments could not be generated, either because the mouse coding sequence did not provide an open reading frame that enabled an accurate prediction of a protein sequence or because a human amino acid sequence could not be obtained for alignment with the predicted mouse protein.











