Genome Research

Table 4.

Genes Identified in the ∼1.4 Mb of Finished Sequence from the Mouse WS Region

Gene[i]	CpG island[ii]	CDS length in bp, mouse (human)[iii]	Mouse-human CDS, % identity[iv]	Mouse-human AA sequence, % identity[v]
AK005040	Yes	1163 (NA)	NA	NA
Gtf2ird2	Yes	2811 (2673)	79.3	82.1
Ncf1	No	1173 (1170)	81.4	82.5
Gtf2i	Yes	2940 (2937)	87.7	96.8
Gtf2ird1	Yes	2071 (2077)	88.0	87.1
Cyln2	Yes	3136 (3134)	86.0	91.4
Rfc2	Yes	1050 (1066)	84.9	92.8
Wbscr15	No	576 (610)	74.0	64.6
Eif4h	Yes	747 (747)	91.3	98.4
Limk1	Yes	1944 (1944)	88.0	95.2
Eln	Yes	2582 (2274)	81.4	81.8
Cldn4	Yes	631 (628)	82.7	83.2
Cldn3	Yes	660 (663)	88.2	91.3
AK017044	No	838 (NA)	NA	NA
AK004244	Yes	924 (NA)	NA	NA
AK008014	No	544 (529)	75.3	NA
Stx1a	Yes	867 (863)	91.0	98.3
AK003386	Yes	1135 (1045)	81.2	74.9
AK019256	Yes	530 (530)	78.8	76.3
BE290321	Yes	521 (NA)	NA	NA
Wbscr14	Yes	2595 (2559)	83.9	81.6
Tbl2	Yes	1329 (1344)	85.4	87.8
Bcl7b	Yes	546 (546)	88.5	94.6
Baz1b	Yes	4440 (4452)	86.6	91.1
Fzd9	Yes	1648 (1648)	87.6	95.8
Fkbp6	Yes	864 (864)	81.6	86.0
BF522554	Yes	1455 (1466)	84.2	78.8
BE630793	Yes	1211 (1212)	83.2	NA
Pom121	Yes	3361 (3440)	78.1	71.1
Hip1	Yes	2518 (2518)	87.6	87.6

[i] The 30 genes identified within the ∼1.4 Mb of finished sequence from the mouse WS region are listed in their order on mouse chromosome 5G1-G2 (from centromere to telomere; see Fig. 1). Of these 30 genes, 21 have been previously published (listed in Table 1 and depicted in Fig. 1) or, in the case of Gtf2ird2, submitted as an annotated GenBank record (AY014963). In the case of the 9 genes previously not reported as residing in the WS region, representative GenBank accession numbers are provided (see Fig. 3).

[ii] The presence (yes) or absence (no) of an overlap between the 5′ exon of the gene and a CpG island (regions of ≥50% G+C content where the ratio of CpG dinucleotides relative to GpC is ≥60% within a 200-bp window) is indicated. In two cases (BE290321 and BF522554), cDNA sequence was not available to define the 5′ exons; instead, the 5′ exons were predicted by GenScan based on extending an existing EST (to a methionine codon).
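
The CpG island criterion in note [ii] can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the 1-bp window step, the half-open exon coordinates, and the handling of windows that contain no GpC dinucleotide are all assumptions made for illustration.

```python
def window_stats(seq):
    """Return (G+C fraction, CpG/GpC ratio) for one window of sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    gc_fraction = (seq.count("G") + seq.count("C")) / n
    cpg = seq.count("CG")  # CpG dinucleotides
    gpc = seq.count("GC")  # GpC dinucleotides
    if gpc == 0:
        # Assumption: with no GpC the ratio is undefined; treat it as
        # unbounded when CpG is present and as zero otherwise.
        ratio = float("inf") if cpg else 0.0
    else:
        ratio = cpg / gpc
    return gc_fraction, ratio


def is_cpg_island_window(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the two thresholds from note [ii] to a single window."""
    gc_fraction, ratio = window_stats(seq)
    return gc_fraction >= min_gc and ratio >= min_ratio


def cpg_island_windows(seq, window=200, step=1):
    """Start positions (0-based) of every window that meets both thresholds."""
    return [
        start
        for start in range(0, len(seq) - window + 1, step)
        if is_cpg_island_window(seq[start:start + window])
    ]


def exon_overlaps_island(seq, exon_start, exon_end, window=200):
    """True if the half-open interval [exon_start, exon_end) overlaps a qualifying window."""
    return any(
        start < exon_end and exon_start < start + window
        for start in cpg_island_windows(seq, window)
    )
```

A call such as `exon_overlaps_island(genomic_seq, 450, 500)` would then correspond to a "Yes" entry in the CpG island column, under the assumptions stated above.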

[iii] The length of each mouse coding sequence (CDS) was established by one of several methods. If a mouse RefSeq entry was available for the gene (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), the length of the CDS in that record was used. In the absence of a mouse RefSeq record but presence of a human gene sequence (HIP1), a BLASTZ alignment was used to identify the putative mouse coding and predicted amino acid sequences. In the absence of a human gene, other sources were used to annotate the mouse genes. For example, the rat Pom121 gene aligned with the mouse genomic sequence at >85% identity with precise exon boundaries and was therefore used to annotate the mouse Pom121 exons. Two genes (BF522554 and BE630793) were identified by a MegaBLAST search of the mouse genomic sequence against the TIGR EST database (http://www.tigr.org/tdb/tgi.shtml); the resulting information was used in conjunction with GenScan to establish the mouse gene model. The length of each human coding sequence was estimated by PipMaker (this was done for consistency because there was neither a corresponding human RefSeq record nor a human LocusLink mRNA entry for roughly a third of the mouse genes). Of note, analyses performed using available human RefSeq records yielded the same results as those obtained using the PipMaker-predicted human coding sequences. In one case (ELN), PipMaker failed to predict a human coding sequence; in this case, the available RefSeq record was used. In one case (GTF2IRD2), PipMaker failed to predict a coding sequence and no full-length human cDNA sequence was available in GenBank; in this case, a GenScan prediction of the human coding sequence was used. In four cases (indicated by NA), none of the above means for predicting the human coding sequence was effective, most often due to the lack of available human genomic or cDNA sequence.

[iv] The tool EMBOSS (http://www.ebi.ac.uk/emboss/align), which uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length, was used to calculate the percent-identity of the mouse and human coding sequences over the aligned regions. In four cases, no human coding sequence was available for this analysis (indicated by NA).
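
The percent-identity calculation in note [iv] can be sketched as a small Needleman-Wunsch implementation in Python. This is an illustration only and is not the EMBOSS implementation: the scoring values (match +1, mismatch -1, linear gap -1) are assumptions, and so is the choice to divide identical positions by the total number of alignment columns, gaps included.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of all alignment columns."""
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)
```

For example, `percent_identity(*needleman_wunsch(mouse_cds, human_cds))` would give a value comparable in spirit to the "CDS, % identity" column, although EMBOSS uses affine gap penalties and substitution matrices, so the exact figures would differ.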

[v] The predicted amino acid (AA) sequence derived from each orthologous mouse–human gene pair was compared using EMBOSS. The indicated percent-identity corresponds to the percentage of the total amino acids with identical matches between the two sequences over the aligned regions. When available, the amino acid sequences were derived from RefSeq records; otherwise, matching GenBank protein records were used. In the case of BF522554, neither of these sources was available; thus, a translated version of the coding sequence predicted by PipMaker was used. When PipMaker failed to predict a human coding sequence for a mouse gene or no open reading frame could be found in the predicted coding sequence, BLASTX or BLASTP was used to search the National Center for Biotechnology Information database. For three genes (AK003386, AK019256, and Pom121), this yielded an aligning human protein (XP_042880, XP_042882, and XP_034753.1, respectively). In some cases (indicated by NA), amino acid sequence alignments could not be generated, either because the mouse coding sequence did not provide an open reading frame that enabled an accurate prediction of a protein sequence or because a human amino acid sequence could not be obtained for alignment with the predicted mouse protein.