Generation and Comparative Analysis of ∼3.3 Mb of Mouse Genomic Sequence Orthologous to the Region of Human Chromosome 7q11.23 Implicated in Williams Syndrome

Table 4.

Genes Identified in the ∼1.4 Mb of Finished Sequence from the Mouse WS Region

The last three columns are mouse-human comparisons.

Gene      CpG island  CDS length in bp, mouse (human)  CDS, % identity  AA sequence, % identity
AK005040  Yes         1163 (NA)                        NA               NA
Gtf2ird2  Yes         2811 (2673)                      79.3             82.1
Ncf1      No          1173 (1170)                      81.4             82.5
Gtf2i     Yes         2940 (2937)                      87.7             96.8
Gtf2ird1  Yes         2071 (2077)                      88.0             87.1
Cyln2     Yes         3136 (3134)                      86.0             91.4
Rfc2      Yes         1050 (1066)                      84.9             92.8
Wbscr15   No          576 (610)                        74.0             64.6
Eif4h     Yes         747 (747)                        91.3             98.4
Limk1     Yes         1944 (1944)                      88.0             95.2
Eln       Yes         2582 (2274)                      81.4             81.8
Cldn4     Yes         631 (628)                        82.7             83.2
Cldn3     Yes         660 (663)                        88.2             91.3
AK017044  No          838 (NA)                         NA               NA
AK004244  Yes         924 (NA)                         NA               NA
AK008014  No          544 (529)                        75.3             NA
Stx1a     Yes         867 (863)                        91.0             98.3
AK003386  Yes         1135 (1045)                      81.2             74.9
AK019256  Yes         530 (530)                        78.8             76.3
BE290321  Yes         521 (NA)                         NA               NA
Wbscr14   Yes         2595 (2559)                      83.9             81.6
Tbl2      Yes         1329 (1344)                      85.4             87.8
Bcl7b     Yes         546 (546)                        88.5             94.6
Baz1b     Yes         4440 (4452)                      86.6             91.1
Fzd9      Yes         1648 (1648)                      87.6             95.8
Fkbp6     Yes         864 (864)                        81.6             86.0
BF522554  Yes         1455 (1466)                      84.2             78.8
BE630793  Yes         1211 (1212)                      83.2             NA
Pom121    Yes         3361 (3440)                      78.1             71.1
Hip1      Yes         2518 (2518)                      87.6             87.6
  • The 30 genes identified within the ∼1.4 Mb of finished sequence from the mouse WS region are listed in their order on mouse chromosome 5G1-G2 (from centromere to telomere; see Fig. 1). Of these 30 genes, 21 have been previously published (listed in Table 1 and depicted in Fig. 1) or, in the case of Gtf2ird2, submitted as an annotated GenBank record (AY014963). For the 9 genes not previously reported as residing in the WS region, representative GenBank accession numbers are provided (see Fig. 3).

  • The presence (yes) or absence (no) of an overlap between the 5′ exon of the gene and a CpG island (regions of ≥50% G+C content where the ratio of CpG dinucleotides relative to GpC is ≥60% within a 200-bp window) is indicated. In two cases (BE290321 and BF522554), cDNA sequence was not available to define the 5′ exons; instead, the 5′ exons were predicted by GenScan based on extending an existing EST (to a methionine codon).

  • The length of each mouse coding sequence (CDS) was established by one of several methods. If a mouse RefSeq entry was available for the gene (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), the length of the CDS in that record was used. In the absence of a mouse RefSeq record but presence of a human gene sequence (HIP1), a BLASTZ alignment was used to identify the putative mouse coding and predicted amino acid sequences. In the absence of a human gene, other sources were used to annotate the mouse genes. For example, the rat Pom121 gene aligned with the mouse genomic sequence at >85% identity with precise exon boundaries and was therefore used to annotate the mouse Pom121 exons. Two genes (BF522554 and BE630793) were identified by a MegaBLAST search of the mouse genomic sequence against the TIGR EST database (http://www.tigr.org/tdb/tgi.shtml); the resulting information was used in conjunction with GenScan to establish the mouse gene model. The length of each human coding sequence was estimated by PipMaker (this was done for consistency because there was no corresponding human RefSeq record nor human LocusLink mRNA entry for roughly a third of the mouse genes). Of note, analyses performed using available human RefSeq records yielded the same results as those obtained using the PipMaker-predicted human coding sequences. In one case (ELN), PipMaker failed to predict a human coding sequence; in this case, the available RefSeq record was used. In one case (GTF2IRD2), PipMaker failed to predict a coding sequence and no full-length human cDNA sequence was available in GenBank; in this case, a GenScan prediction of the human coding sequence was used. In four cases (indicated by NA), none of the above means for predicting the human coding sequence was effective, most often due to the lack of available human genomic or cDNA sequence.

  • The tool EMBOSS (http://www.ebi.ac.uk/emboss/align), which uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length, was used to calculate the percent-identity of the mouse and human coding sequences over the aligned regions. In four cases, no human coding sequence was available for this analysis (indicated by NA).

  • The predicted amino acid (AA) sequence derived from each orthologous mouse–human gene pair was compared using EMBOSS. The indicated percent-identity corresponds to the percentage of the total amino acids with identical matches between the two sequences over the aligned regions. When available, the amino acid sequences were derived from RefSeq records; otherwise, matching GenBank protein records were used. In the case of BF522554, neither of these sources was available; thus, a translated version of the coding sequence predicted by PipMaker was used. When PipMaker failed to predict a human coding sequence for a mouse gene or no open reading frame could be found in the predicted coding sequence, BLASTX or BLASTP was used to search the National Center for Biotechnology Information database. For three genes (AK003386, AK019256, and Pom121), this yielded an aligning human protein (XP_042880, XP_042882, and XP_034753.1, respectively). In some cases (indicated by NA), amino acid sequence alignments could not be generated, either because the mouse coding sequence did not provide an open reading frame that enabled an accurate prediction of a protein sequence or because a human amino acid sequence could not be obtained for alignment with the predicted mouse protein.
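
The percent-identity columns rest on a global (Needleman-Wunsch) alignment of each mouse-human sequence pair. As a rough illustration only, the Python sketch below aligns two short sequences and reports identity over the aligned columns. The function names, the scoring values (match, mismatch, and a linear gap penalty), and the choice to divide by the full alignment length are assumptions made for this example; EMBOSS applies its own substitution matrices and gap penalties, so this sketch is not expected to reproduce the values in the table.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the two gapped strings.

    Scoring values are illustrative assumptions, not EMBOSS defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def percent_identity(a, b):
    """Identical columns as a percentage of all alignment columns (gaps included)."""
    x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    identical = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * identical / len(x)


if __name__ == "__main__":
    print(needleman_wunsch("ATGC", "ATC"))
    print(percent_identity("ATGC", "ATC"))
```

The same routine works on amino acid strings, although a realistic protein comparison would replace the flat match/mismatch scores with a substitution matrix.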
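
The CpG island criterion stated in the footnotes (≥50% G+C content and a CpG-to-GpC ratio of ≥60% within a 200-bp window) can be sketched as a simple window test. The Python code below is a minimal illustration under stated assumptions, not the procedure used to build the table: the function names, the sliding step, and the decision to treat a window with no GpC dinucleotides as failing the test are all choices made for this example.

```python
def window_stats(window):
    """Return (G+C fraction, CpG count, GpC count) for one window."""
    w = window.upper()
    gc_fraction = sum(1 for base in w if base in "GC") / len(w)
    cpg = sum(1 for i in range(len(w) - 1) if w[i:i + 2] == "CG")
    gpc = sum(1 for i in range(len(w) - 1) if w[i:i + 2] == "GC")
    return gc_fraction, cpg, gpc


def is_cpg_island_window(window, min_gc=0.5, min_ratio=0.6):
    """Apply the two thresholds from the footnote to a single window."""
    gc_fraction, cpg, gpc = window_stats(window)
    if gpc == 0:
        # Assumption: the ratio is undefined here, so the window is rejected.
        return False
    return gc_fraction >= min_gc and cpg / gpc >= min_ratio


def cpg_island_windows(seq, size=200, step=1):
    """Start offsets of all windows of the given size that pass the test."""
    return [start
            for start in range(0, len(seq) - size + 1, step)
            if is_cpg_island_window(seq[start:start + size])]


if __name__ == "__main__":
    example = "CG" * 100 + "AT" * 100  # synthetic sequence, not real data
    print(cpg_island_windows(example, size=200, step=100))
```

To mirror the table's Yes/No column, the offsets returned here would still have to be checked for overlap with the annotated 5′ exon of each gene, a step this sketch leaves out.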

This Article

  1. Genome Res. 12: 3-15
