Generation and Comparative Analysis of ∼3.3 Mb of Mouse Genomic Sequence Orthologous to the Region of Human Chromosome 7q11.23 Implicated in Williams Syndrome
- Udaya DeSilva1,4,5,
- Laura Elnitski3,4,
- Jacquelyn R. Idol1,
- Johannah L. Doyle1,
- Weiniu Gan2,6,
- James W. Thomas1,
- Scott Schwartz3,
- Nicole L. Dietrich2,
- Stephen M. Beckstrom-Sternberg1,2,
- Jennifer C. McDowell2,
- Robert W. Blakesley1,2,
- Gerard G. Bouffard1,2,
- Pamela J. Thomas2,
- Jeffrey W. Touchman1,2,
- Webb Miller3, and
- Eric D. Green1,2,7
Abstract
Williams syndrome is a complex developmental disorder that results from the heterozygous deletion of a ∼1.6-Mb segment of human chromosome 7q11.23. These deletions are mediated by large (∼300 kb) duplicated blocks of DNA of near-identical sequence. Previously, we showed that the orthologous region of the mouse genome is devoid of such duplicated segments. Here, we extend our studies to include the generation of ∼3.3 Mb of genomic sequence from the mouse Williams syndrome region, of which just over 1.4 Mb is finished to high accuracy. Comparative analyses of the mouse and human sequences within and immediately flanking the interval commonly deleted in Williams syndrome have facilitated the identification of nine previously unreported genes, provided detailed sequence-based information regarding 30 genes residing in the region, and revealed a number of potentially interesting conserved noncoding sequences. Finally, to facilitate comparative sequence analysis, we implemented several enhancements to the program PipMaker, including the addition of links from annotated features within a generated percent-identity plot to specific records in public databases. Taken together, the results reported here provide an important comparative sequence resource that should catalyze additional studies of Williams syndrome, including those that aim to characterize genes within the commonly deleted interval and to develop mouse models of the disorder.
[The sequence data described in this paper have been submitted to GenBank under accession nos. AF267747, AF289666, AF289667, AF289664, AF289665, AC091250, AC079938, AC084109, AC024607, AC074359, AC024608, AC083858, AC083948, AC084162, AC087420, AC083890, AC080158, AC084402, AC083889, AC083857, and AC079872.]
The past decade has brought spectacular advances in our understanding of the contiguous gene deletion disorder Williams syndrome (WS, also known as Williams-Beuren syndrome; OMIM 194050 [see http://www.ncbi.nlm.nih.gov/Omim]). This complex and intriguing developmental disorder is associated with defects in multiple physiological systems, with the classic phenotypic features including cardiovascular disease, dysmorphic facial characteristics, infantile hypercalcemia, and unique cognitive and personality components (Burn 1986; Morris et al. 1988; Bellugi et al. 1990, 1999; Lashkari et al. 1999; Mervis et al. 1999; Donnai and Karmiloff-Smith 2000; Mervis and Klein-Tasman 2000; Morris and Mervis 2000).
A key turning point in elucidating the genetic basis of WS came in 1993 with the discovery that the disorder is associated with hemizygous microdeletions within human chromosome 7q11.23 that include the elastin gene (ELN; Ewart et al. 1993). Since that time, there have been numerous studies aiming to map this region of chromosome 7, identify the genes residing within the commonly deleted interval, and associate the phenotypic features of the disorder to the haploinsufficiency of specific genes. These efforts have been aided by a joint effort between our group and the Washington University Genome Sequencing Center (http://genome.wustl.edu/gsc) to map and sequence the human WS region. However, significant challenges have been encountered. For example, attempts to establish contiguous and accurate long-range physical maps of the human WS region have been hampered by a number of problems, including unstable yeast artificial chromosome (YAC) clones derived from the region (which are most likely a consequence of the notably high density of repetitive sequences) and the presence of several large (∼300 kb), closely spaced blocks of DNA with near-identical sequence (Gorlach et al. 1997; Osborne et al. 1997a; Hockenhull et al. 1999; Korenberg et al. 2000; Peoples et al. 2000; Valero et al. 2000). The latter genomic segments, which greatly confound conventional mapping and sequencing strategies, are particularly important, both because they contain gene and pseudogene sequences (Gorlach et al. 1997; Osborne et al. 1997a; Perez Jurado et al. 1998) and because they appear to play a central role in mediating the inter- and intrachromosomal recombination events that lead to the WS-associated deletions (Perez Jurado et al. 1996; Robinson et al. 1996; Baumer et al. 1998).
Despite the challenges associated with mapping and sequencing the human WS region, numerous genes residing within the commonly deleted interval and the flanking duplicated segments have been identified (Fig. 1; Table 1; Francke 1999; Osborne 1999; Osborne and Pober 2001). The diverse phenotypic features associated with WS likely result from haploinsufficiency of these and/or yet-to-be-identified genes that reside within the deleted interval. However, with the exception of ELN and cardiovascular/connective tissue disease, correlating individual genes with specific phenotypic features has proven difficult.
Figure 1. Long-range organization of human and mouse Williams syndrome (WS) regions. A physical map of the WS regions on human chromosome 7q and mouse chromosome 5G is depicted emphasizing the positions of the known genes residing within and flanking the interval commonly deleted in WS (DeSilva et al. 1999; Francke 1999; Hockenhull et al. 1999; Osborne 1999; Korenberg et al. 2000; Peoples et al. 2000; Valero et al. 2000). In the human WS region, this interval spans ∼1.6 Mb (indicated by a bold dashed line) and is flanked by duplicated blocks of DNA of near-identical sequence (estimated at ∼300 kb in size; indicated by dark rectangles). The relative positions of the centromere (CEN) and telomere (TEL) are indicated in each case. Note the inverted orientation of the two discontiguous segments of human chromosome 7 relative to the single contiguous segment of mouse chromosome 5G. The relative positions of the known human and mouse genes residing in this region are indicated, with additional details provided in Table 1. Depicted below the map of the mouse WS region are the 21 overlapping BAC/PAC clones selected for sequencing (see http://bio.cse.psu.edu/publications/desilva for a complete contig map of the mouse WS region), with the current sequencing status (finished, full shotgun, or working draft) indicated at the bottom (also see Table 2). Note that the depicted genomic regions and the BAC/PAC clones are not drawn to scale.
Table 1. Known Human/Mouse Genes Residing Within or Near the WS Region
As a complement to the above efforts, our interests have focused on the comparative mapping and sequencing of the WS region in the human and mouse genomes. Previously, we established a bacterial clone-based contig map of the mouse genomic region encompassing the Eln and Ncf1 (p47-phox) genes (DeSilva et al. 1999); note that NCF1 gene/pseudogene sequences reside within the duplicated blocks in the human WS region (Fig. 1; Table 1). Interestingly, we discovered that the mouse WS region is devoid of the large duplicated segments that are characteristic of its human counterpart. To acquire a more detailed view of this important genomic interval, we have now extended our mouse physical mapping efforts as well as sequenced the entire mouse WS region. Here, we report the generation of ∼3.3 Mb of mouse genomic sequence and the results of detailed computational analyses, which included extensive comparisons with the available sequence of the human WS region.
RESULTS
Physical Mapping of the Mouse WS Region
The segment of the mouse genome corresponding to the human WS region resides on distal mouse chromosome 5. Our previous clone-based physical mapping efforts resulted in the construction of a bacterial artificial chromosome (BAC)/P1-derived artificial chromosome (PAC) contig spanning a large portion of this genomic region, including the entire interval flanked by the Eln and Ncf1 genes (DeSilva et al. 1999). As part of a broader effort to generate BAC-based physical maps of the portions of the mouse genome orthologous to human chromosome 7 (Thomas et al. 2000), we extended this contig map to encompass the entire WS region (including the interval commonly deleted in WS, the segment that is duplicated in human, and additional flanking DNA). The complete contig map is available as part of an electronic supplement accompanying this paper (at http://bio.cse.psu.edu/publications/desilva). Based on our earlier (DeSilva et al. 1999) and expanded physical mapping efforts, a set of 21 clones, which together fully encompass the mouse WS region, was selected for systematic sequencing (Fig. 1).
Consistent with our previous mapping studies (DeSilva et al. 1999), we encountered no evidence for the presence of large, duplicated blocks of DNA within the mouse WS region, such as those residing in the orthologous segment on human chromosome 7q11.23. Indeed, the clone-based physical mapping of the mouse WS region proceeded smoothly, in striking contrast to our efforts and those of others (Osborne et al. 1996; Hockenhull et al. 1999; Korenberg et al. 2000; Peoples et al. 2000; Valero et al. 2000) in mapping the human WS region.
The long-range organization of the mouse and human WS regions is also different in other ways. Specifically, a single contiguous block of mouse chromosome 5 encompassing the WS region is orthologous to two discontiguous segments of human chromosome 7, one on 7q11.23 and one on 7q22. The former segment contains the interval commonly deleted in WS and the flanking duplicated blocks; interestingly, the orientation of the central portion of this region is inverted in mouse versus human (Fig. 1). The inverted orientation of the mouse WS region (compared to the human WS region) was confirmed by two-color fluorescent in situ hybridization (FISH) studies with Ncf1- and Fkbp6-containing BACs; the results clearly showed that Ncf1 is at the centromeric end and Fkbp6 at the telomeric end of the WS region on mouse chromosome 5 (data not shown). These physical mapping studies are consistent with the BSS JAX panel genetic mapping data (http://www.jax.org/resources/documents/cmdata/bkmap/BSS.html). Importantly, the breakpoints associated with this evolutionary inversion correspond to the locations of the duplicated blocks in the human WS region, which are also the most common sites of deletion breakpoints seen in WS (Fig. 1). Our finding of an inverted orientation of the mouse versus human WS region is consistent with data generated by others (Peoples et al. 2000; Valero et al. 2000).
Immediately telomeric to the interval commonly deleted in WS is a genomic segment encompassing the HIP1/Hip1, MDH2/Mdh2, POR/Por, and ZP3/Zp3 genes; this region is oriented the same in mouse and human. However, in mouse, this segment is contiguous (in the telomeric direction) with a region that is orthologous to human 7q22 and that contains the Cutl1 and Pai genes. In human, this segment is not contiguous with the WS region and, in fact, is inverted in orientation (relative to the mouse segment; see Fig. 1).
Sequencing of the Mouse WS Region
The 21 overlapping mouse clones depicted in Figure 1 were sequenced by a shotgun sequencing strategy. The GenBank accession number for each resulting sequence is provided in Table 2. Note that the first five clones (391O16, 92N10, P510M19, 303E12, and 42J20) were isolated from libraries derived from the 129SV mouse strain and sequenced prior to the decision to use the C57BL/6J mouse strain (with an emphasis on the RPCI-23 mouse BAC library) for sequencing the mouse genome as part of the Human Genome Project (Battey et al. 1999; Denny and Justice 2000). The remaining 16 clones were isolated from the RPCI-23 library. Taken together, a total of ∼3.3 Mb of nonredundant mouse genomic sequence was generated, of which a single contiguous block of just over 1.4 Mb is finished, high-accuracy sequence (i.e., with an error rate of <1 in 10,000 bp), another ∼1.4 Mb is at a full-shotgun stage (with ∼11-fold average coverage in Phred Q20 bases; Ewing et al. 1998; Ewing and Green 1998) and is currently being finished, and the remaining ∼0.5 Mb is at a working-draft stage (with ∼5-fold average coverage in Phred Q20 bases), as indicated in Figure 1 and Table 2.
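The quality thresholds quoted above follow from the Phred scale, in which a quality score Q corresponds to an estimated per-base error probability p through Q = -10 log10(p). A Q20 base therefore has an estimated error of 1 in 100, and an error rate of 1 in 10,000 bp corresponds to Q40. The following is a minimal sketch of these conversions and of a fold-coverage calculation in Q20 bases. The function names and the toy quality values are hypothetical and are not taken from the sequencing pipeline used in this study.

```python
import math


def phred_to_error_prob(q):
    """Estimated per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability p."""
    return -10 * math.log10(p)


def q20_fold_coverage(read_qualities, clone_length_bp, threshold=20):
    """Fold coverage counting only bases with quality >= threshold.

    read_qualities is a list of reads, each a list of per-base Phred scores
    (an assumed, simplified representation of shotgun read data).
    """
    good_bases = sum(1 for read in read_qualities for q in read if q >= threshold)
    return good_bases / clone_length_bp


# Toy example: two short "reads" covering a 2-bp "clone".
reads = [[30, 25, 10], [20, 19, 40]]
coverage = q20_fold_coverage(reads, clone_length_bp=2)
```

Under this definition, a clone at ∼11-fold Q20 coverage has roughly 11 bases of quality 20 or better for each base of clone length.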
Table 2. Sequenced Mouse Clones
Mouse–Human Comparative Sequence Analysis
The resulting mouse genomic sequence was subjected to rigorous computational analyses. Emphasis was placed on studying the large (∼1.4 Mb), contiguous block of finished sequence, which included the entire region orthologous to the interval commonly deleted in WS. For comparison to the finished mouse sequence, we were able to identify finished or draft-level human sequence in GenBank for all but ∼200 kb of the corresponding region on human chromosome 7q11.23 (with the notable segments unavailable for comparative analyses being ∼40 kb encompassing the gene represented by AK005040, ∼100 kb at the 5′ end of ELN, and ∼20 kb just 5′ to CLDN3).
The central analytical and organizational tool for our comparative sequence analyses was the program PipMaker (Hardison et al. 1997; Ellsworth et al. 2000; Schwartz et al. 2000). The core function of this program is to perform direct comparisons between large blocks of orthologous sequences. In addition, though, PipMaker provides an effective and convenient mechanism for assimilating and displaying relevant annotations about large segments of genomic sequence, including the location of repetitive elements and CpG islands, the intron–exon organization of genes, and, most importantly, the areas (both coding and noncoding) found to be highly conserved between two orthologous sequences. To enhance the utility of PipMaker, we recently added a feature that incorporates hyperlinks from annotated regions of the resulting percent-identity plot (PIP) to relevant Internet sites. This allows the creation of an informative and dynamic electronic supplement that captures the key elements of each comparative analysis. An illustration of this new PipMaker feature is provided in Figure 2, which shows a small portion of the PIP generated by comparing the sequences of the mouse and human WS regions (note that the entire PDF-formatted PIP is available at http://bio.cse.psu.edu/publications/desilva).
Figure 2. Representative portion of the percent-identity plot (PIP) comparing mouse and human sequence from the Williams syndrome (WS) region. The finished mouse sequence reported here was compared with the available orthologous human sequence using PipMaker. The complete PIP and details about the various annotations it contains are available at http://bio.cse.psu.edu/publications/desilva. Shown here is a ∼60-kb region containing portions of the Gtf2i/GTF2I and Gtf2ird1/GTF2IRD1 genes and the interval residing between them. Note that only gap-free segments that are ≥50% identical between mouse and human are plotted. The first two exons and last nine exons of Gtf2i/GTF2I and Gtf2ird1/GTF2IRD1, respectively, are represented by vertical rectangles and numbered accordingly; most of these exons are associated with high levels of mouse–human sequence conservation. Note the two conserved noncoding sequences at ∼205 kb and ∼239 kb (both are gap-free segments of >100 bp in length with mouse–human sequence identities of >70% and >90%, respectively, as indicated by the different colored vertical lines at those positions). Also note the various colored horizontal bars drawn above the two genes; in the actual PDF file generated by PipMaker, these bars provide direct links to relevant Internet sites (e.g., appropriate PubMed citation[s] for the gene [pink], the GenBank record containing the predicted amino acid sequence of the protein encoded by the gene [light blue], and the LocusLink entry for the gene [dark blue]). The bookmarks along the left side provide links to compiled information about the various genes and other annotations generated during the comparative analysis of these sequences.
Our comparative analyses revealed a number of interesting general features of the WS region. First, the GC content of the mouse and human WS regions is similar, in both the overall level (48.8% and 49.2%, respectively) and the relative uniformity across the region (ranging from 41.7% to 51.7% in mouse and 40.2% to 55.5% in human when calculated in 50-kb windows). In contrast, the mouse and human WS regions differ substantially in their repeat content, consisting of 35.9% and 54.2% interspersed repetitive elements (mostly SINEs and LINEs), respectively. In addition, there is a notable lack of uniformity of repeat content across the region, ranging from 30.6% to 62.7% in mouse and 27.9% to 84.3% in human (when calculated in 50-kb windows). The difference in the amount of repetitive sequences largely accounts for the slight compression of the mouse WS region compared to its human counterpart. This is clearly evident in the interval encompassing the genes GTF2IRD2/Gtf2ird2, NCF1/Ncf1, and GTF2I/Gtf2i, with finished sequence being available for both the mouse and human regions; the size of the same genomic segment is ∼124 kb and ∼169 kb in mouse and human, respectively (consisting of 34.3% and 50.0% interspersed repeats, respectively). Finally, PipMaker analysis revealed numerous segments that are highly conserved between the mouse and human WS regions. Most of these correspond to exons within known and newly identified genes (see below); however, many others appear to be conserved noncoding sequences. Specifically, within the ∼1.4 Mb of finished mouse sequence, 55 gap-free alignments of ≥100 bp in length and with ≥70% mouse–human sequence identity were identified that do not overlap any of the identified exons. Two of these are shown in Figure 2, with the complete list available at http://bio.cse.psu.edu/publications/desilva.
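The windowed statistics above can be illustrated with a short sketch. The text does not state whether the 50-kb windows overlapped or how a trailing partial window was handled, so the code below assumes consecutive non-overlapping windows and keeps the final partial window; the function names are hypothetical. The second function reproduces the arithmetic behind the compression argument: subtracting the interspersed repeats from the ∼124-kb mouse and ∼169-kb human segments leaves roughly 81.5 kb and 84.5 kb of nonrepetitive sequence, respectively, so most of the size difference lies in the repeats.

```python
def windowed_gc_percent(sequence, window=50_000):
    """Return (window_start, GC percent) for consecutive non-overlapping windows.

    Assumption: ambiguous bases (e.g., N) are excluded from the denominator,
    and windows containing no A/C/G/T bases are skipped.
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        counted = sum(chunk.count(base) for base in "ACGT")
        if counted == 0:
            continue
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / counted))
    return results


def nonrepeat_kb(total_kb, repeat_percent):
    """Length (kb) remaining after removing the interspersed-repeat fraction."""
    return total_kb * (1 - repeat_percent / 100)


# Segment sizes and repeat percentages quoted in the text above.
mouse_nonrepeat = nonrepeat_kb(124, 34.3)   # about 81.5 kb
human_nonrepeat = nonrepeat_kb(169, 50.0)   # 84.5 kb
```

The same windowing loop could be applied to a repeat-masked sequence to obtain per-window repeat content, for example by counting masked positions instead of G and C bases.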
PipMaker analysis also revealed that mouse–human sequence conservation across the WS region is relatively low compared to other genomic regions examined to date, both in terms of the total amount of noncoding, nonrepetitive sequence that is at least moderately conserved (i.e., can be reliably aligned between mouse and human) and the amount that is highly conserved. To quantify this, we focused attention on the finished sequence from the mouse WS region. Following removal of segments for which the orthologous human sequence was not available and the masking of both repeats and annotated coding regions, the remaining mouse sequence was aligned with its human counterpart. Only 20.3% of the nonexonic, nonrepetitive sequence could be aligned between mouse and human, providing a benchmark for the overall level of conservation (Table 3). Only 1.1% of the sequence was found to be highly conserved (i.e., resided within a gap-free alignment of ≥100 bp in length and ≥70% mouse–human sequence identity). For comparison, we performed the same analysis on 12 other genomic regions for which large blocks of finished sequence were available for both mouse and human. For these other regions, we first masked repeats and annotated exons in the human (rather than mouse) sequence. In all but two cases, there is a greater degree of total mouse–human sequence conservation than that encountered with the WS region (Table 3), with a greater percentage of highly conserved sequence seen in all but three cases. In addition, the data presented in Table 3 suggest a potential correlation between mouse–human sequence divergence and the content of G+C nucleotides and/or interspersed repetitive elements; note that the latter is consistent with the findings of Chiaromonte et al. (2001). However, a more systematic study is certainly required before firm conclusions can be reached.
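The "highly conserved" criterion used here (a gap-free alignment of ≥100 bp with ≥70% identity that does not overlap an annotated exon) amounts to a simple filter over aligned segments. The sketch below illustrates that filter and the resulting percentage of sequence covered. It is not PipMaker code: the `Segment` type, the function names, and the toy coordinates are hypothetical, and the coverage calculation assumes the retained segments do not overlap one another.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int        # 0-based start in the mouse sequence
    end: int          # exclusive end
    identity: float   # percent identity over the gap-free segment


def overlaps_any(segment, intervals):
    """True if the segment overlaps any (start, end) half-open interval."""
    return any(segment.start < end and start < segment.end
               for start, end in intervals)


def highly_conserved_noncoding(segments, exons, min_len=100, min_identity=70.0):
    """Keep gap-free segments meeting both thresholds and overlapping no exon."""
    return [s for s in segments
            if s.end - s.start >= min_len
            and s.identity >= min_identity
            and not overlaps_any(s, exons)]


def percent_covered(segments, total_bp):
    """Percent of total_bp covered, assuming segments do not overlap each other."""
    return 100.0 * sum(s.end - s.start for s in segments) / total_bp


# Toy data: only the first segment passes all three criteria.
toy_segments = [
    Segment(0, 150, 82.0),     # long enough, conserved, noncoding
    Segment(200, 260, 95.0),   # too short
    Segment(300, 420, 65.0),   # identity too low
    Segment(500, 650, 90.0),   # overlaps an exon
]
toy_exons = [(600, 700)]
kept = highly_conserved_noncoding(toy_segments, toy_exons)
```

Applied to real alignment output, the length of `kept` would correspond to the count of conserved noncoding segments, and `percent_covered` to the highly conserved fraction reported in Table 3.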
Table 3. Mouse–Human Sequence Conservation in Selected Genomic Regions
Significant effort was also focused on the computational detection and annotation of genes residing in the WS region. The availability of both mouse and human genomic sequences greatly enhanced the ability to detect genes and to define their long-range organization. Table 4 provides a summary of the 30 genes identified within the ∼1.4 Mb of finished mouse sequence, with additional details (e.g., deduced coding sequences, predicted amino acid sequences of the corresponding proteins, and presence of conserved domains) available at http://bio.cse.psu.edu/publications/desilva. Of these 30 genes, 20 have been assigned names and reported previously as residing within the WS region (see Table 1), while one (Gtf2ird2) is associated with an annotated GenBank record (AY014963) indicating its presence in the WS region. Importantly, the remaining 9 (in each case indicated in Table 4 by a representative GenBank record containing a corresponding full-length cDNA sequence or an associated expressed-sequence tag [EST]) represent newly identified genes with respect to their presence in the WS region. The evidence that these are authentic genes includes the identification of cDNA sequences matching the mouse genomic sequence, their overlap with GenScan-predicted gene models (in all but one case), and the presence of strong mouse–human sequence conservation; these features are detailed in Figure 3. Remarkably, 6 of these newly identified genes (AK017044, AK004244, AK008014, AK003386, AK019256, and BE290321) clearly reside within the genomic interval commonly deleted in WS. Additional features of the newly identified genes are summarized in an electronic table at http://bio.cse.psu.edu/publications/desilva.
Table 4. Genes Identified in the ∼1.4 Mb of Finished Sequence from the Mouse WS Region
Figure 3. Identification of previously unreported genes in the Williams syndrome (WS) region. Of the 30 genes identified within the ∼1.4 Mb of finished mouse sequence (see Table 4), 9 have not been previously reported to reside within the WS region. Information about each of these 9 genes is provided (listed in order across the mouse WS region), including (1) a representative GenBank accession number for the mouse cDNA sequence (note in one case, BF522554, the only available cDNA sequence was from rat); (2) the type of sequence contained in that GenBank record (Riken full-length [FL] cDNA sequence [Kawai et al. 2001] or EST); (3) the percent-identity between the mouse genomic sequence and the matching cDNA sequence; (4) an indication of whether or not the putative gene overlaps a GenScan-predicted gene (specifically, if >1 exon matches a GenScan-predicted exon or, in the case of AK019256, the single exon matches the predicted exon for >500 bp; note that the only gene not meeting these criteria, AK017044, did have one of its exons matching a GenScan-predicted exon); and (5) the gene-containing portion of the percent-identity plot (PIP) showing the pattern of mouse–human sequence conservation (except for AK005040 and AK017044, for which no human sequence was available). See Fig. 2 for additional details about the PIP.
The 30 identified genes are associated with a number of other interesting features. First, all but 4 (87%) have a CpG island at their 5′ end (Table 4); this is a considerably higher fraction than that reported previously for mouse genes (Antequera and Bird 1993; Jareborg et al. 1999). Second, the splice sites and intron–exon organization of the genes are the same in mouse and human (at least for the genes for which genomic sequence was available in both species) except for Eln/ELN, which has 81% amino acid identity between mouse and human but shows a lack of conservation at the splice junctions. Third, the coding-sequence conservation between the mouse–human orthologous gene pairs (Table 4) falls within the typical range established previously (Makalowski et al. 1996; Makalowski and Boguski 1998), with the exceptions being the less conserved Wbscr15/WBSCR15 (as we reported previously [Doyle et al. 2000]) and perhaps Pom121/POM121. Finally, with the exception of the changes associated with the evolutionary inversions depicted in Figure 1, gene order is the same in the mouse and human WS regions.
The ∼1.9-Mb segment of draft-level mouse sequence that we generated (corresponding to the seven clones taken to full-shotgun and three clones taken to working-draft levels of redundancy; see Table 2) is orthologous to a region of human chromosome 7 that is telomeric to the interval commonly deleted in WS (Fig. 1). As such, less rigorous computational analyses have thus far been performed with this mouse sequence. However, since human sequence is available for virtually all of this segment, a routine set of comparative analyses was performed using PipMaker, with the resulting PIPs available at http://bio.cse.psu.edu/publications/desilva.
DISCUSSION
It is now well established that the comparative analysis of genomic sequence from different organisms represents a powerful means for identifying conserved coding and noncoding regions, including regulatory elements (Duret and Bucher 1997; Hardison et al. 1997; Hardison 2000; Miller 2000; Wasserman et al. 2000; Cliften et al. 2001; Pennacchio and Rubin 2001; Touchman et al. 2001). With the recent completion of a working-draft sequence of the human genome (International Human Genome Sequencing Consortium 2001; Venter et al. 2001), increasing attention is being given to the sequencing of other organisms (Green 2001). In particular, the sequencing of the mouse genome is now taking center stage (Battey et al. 1999; Denny and Justice 2000), with the recognition that the resulting data will provide both an invaluable infrastructure for performing research with this important experimental animal and the ability to more rigorously annotate the human sequence by comparative analyses (Batzoglou et al. 2000; Bouck et al. 2000).
Indeed, the past few years have brought a sizable crescendo in the generation of mouse genomic sequence, allowing insightful comparisons to be made with the orthologous human sequence. Notable examples of large (e.g., >300 kb) blocks of generated mouse sequence include that from the velocardiofacial syndrome region (∼634 kb; Lund et al. 2000), the Cftr region (∼358 kb; Ellsworth et al. 2000), the Bpa/Str region (∼430 kb; Mallon et al. 2000), the region on chromosome 7 containing an imprinted genomic domain (∼1 Mb; Onyango et al. 2000), the region on chromosome 11 containing a cluster of interleukin genes (∼1100 kb; Loots et al. 2000), the region containing the protocadherin gene cluster (∼900 kb; Wu et al. 2001), the cat eye syndrome region (∼450 kb; Footz et al. 2001), the region on chromosome 17 containing a cluster of olfactory receptor genes (∼330 kb; Younger et al. 2001), a segment on mouse chromosome 16 orthologous to the Down's syndrome critical region (∼470 kb; Pletcher et al. 2001), the Fra14A2/Fhit region (∼600 kb; Shiraishi et al. 2001), and the 15 mouse genomic segments orthologous to human chromosome 19 (totaling ∼42 Mb; Dehal et al. 2001); note that a handful of other examples are also cataloged at www.ncbi.nlm.nih.gov/genome/seq/MmProgress.shtml. Together, the generated mouse sequence has played a key role in the establishment and refinement of computational approaches for systematic comparative sequence analysis (Mallon and Strivens 1998; Stojanovic et al. 1999; Batzoglou et al. 2000), with the emergence of tools such as PipMaker (http://bio.cse.psu.edu; Schwartz et al. 2000), VISTA (http://sichuan.lbl.gov/vista; Mayor et al. 2000), and Alfresco (http://www.sanger.ac.uk/Software/Alfresco; Jareborg and Durbin 2000).
The ∼3.3 Mb of sequence reported here for the mouse WS region represents one of the largest and most complete blocks of mouse sequence reported to date. This is particularly the case with respect to the ∼1.4-Mb contiguous segment of finished, high-accuracy sequence. Indeed, in many of the cases listed above, only draft-level mouse sequence has thus far been generated. Our extensive and high-quality data set provided the opportunity to perform detailed computational analyses, with particular emphasis on mouse–human sequence comparisons. Several general findings deserve special mention. First, the order and structure of genes in the mouse and human WS regions are well conserved, with the only exceptions relating to the two large evolutionary inversions illustrated in Figure 1. Second, comparative sequence analysis in conjunction with cDNA/EST comparisons and GenScan predictions has provided strong evidence for the presence of at least nine previously unreported genes within the WS region (see Fig. 3 and below). Finally, numerous conserved noncoding sequences can be readily identified within the human and mouse WS regions; these represent viable candidates for regulatory elements associated with the numerous genes residing in the region or perhaps serve some other biologically important function(s). Of note, during the generation of our mouse sequence data, Martindale et al. (2000) reported the elucidation and analysis of ∼115 kb of sequence from the mouse WS region, specifically a segment encompassing the genes Limk1, Eif4h, Wbscr15, and Rfc2. Their analyses of this portion of the mouse WS region are concordant with the results presented here.
Our experience in analyzing the sequence of the mouse WS region once again illustrates the tremendous value of mouse–human sequence comparisons for annotating genes. Simple comparisons of genomic sequences and collections of cDNA-derived (e.g., EST) sequences often fail to detect certain mRNAs (e.g., those expressed at low levels or in a tissue-restricted fashion). In addition, false-positive results are common, typically due to contaminating genomic sequences amongst the ESTs. However, a combined strategy employing both mouse–human genomic sequence comparisons and genomic-cDNA sequence comparisons provides an efficient and effective path toward the construction of accurate gene models. For example, such a combined approach led to our identification of a previously undetected 5′ terminal exon of HIP1/Hip1, leading to refined information about the structure of this gene beyond that available in RefSeq. In addition, evidence of mouse–human sequence conservation provided critical clues that directly led to the identification of the nine previously unreported genes in the WS region. Once detected, the conserved regions were more carefully compared to available sequence databases, resulting in the identification of matching full-length cDNA sequences in a majority of cases.
PipMaker is now a well-established program for performing the types of routine comparative sequence analyses mentioned above. The new enhancements to PipMaker reported here should further increase the utility of this tool. In particular, PipMaker can now be used to capture and disseminate the large amount of ancillary information that is routinely generated during the comparative analysis of large blocks of genomic sequence, in essence providing an archive of both the underlying data and a detailed account of any analyses performed with it. This is accomplished through the creation of a PDF-based file that contains both the PIP and links from relevant features of the PIP to specific Internet sites. Such a PDF file can serve as an electronic supplement to a publication, which inevitably can only provide highlights of the comparative analyses being reported (e.g., Figs. 2, 3). Indeed, this is just one facet of the expanding synergy between traditional scientific publishing and the Internet. An alternate approach to this problem was recently described (Wilson et al. 2001), which involves the use of a sequence-alignment viewer that is provided as part of the electronic supplement and downloaded automatically by the Web browser when viewing alignments. An advantage of the Wilson et al. strategy is that it provides greater interactivity to the end-user, for example, allowing access to alignments with nucleotide-level resolution. An advantage of PipMaker is that it only utilizes features of the PDF language, making the supplemental archive much easier to create and to access.
The region of human chromosome 7q11.23 commonly deleted in WS is of great medical and biological interest because of the relative frequency of the disease (∼1:20,000), the complex and intriguing phenotypic features of WS (Burn 1986; Morris et al. 1988; Bellugi et al. 1990, 1999; Lashkari et al. 1999; Mervis et al. 1999; Donnai and Karmiloff-Smith 2000; Mervis and Klein-Tasman 2000; Morris and Mervis 2000), and the involvement of large, duplicated blocks of DNA in the deletional events leading to the syndrome (Perez Jurado et al. 1996; Robinson et al. 1996; Baumer et al. 1998). The mouse sequencing efforts reported here should accelerate research aiming to better understand the genetic basis of WS. First, our data provide a comprehensive resource for characterizing the genes residing within and around the interval commonly deleted in WS. This includes information about gene structure as well as valuable clues about potential regulatory regions. The value of this mouse sequence deserves highlighting in light of the difficult-to-generate and, at present, fragmentary nature of the human sequence for the WS region. Second, our comparative analyses have revealed the presence of at least nine genes that were not previously known to reside within the WS region. Importantly, six of these genes are located within the interval commonly deleted in WS, making each an important candidate to evaluate for its possible role in the disorder. Finally, the mouse sequence we generated should aid the creation of mouse models of WS. Specifically, significant efforts are currently ongoing to create mouse strains completely deleted or hemizygous for one or more genes within the WS region. Our efforts have provided a key infrastructure (i.e., complete genomic sequence) that should greatly facilitate the design of appropriate knockout constructs as well as a set of additional gene targets.
In light of the difficulty to date in assigning specific genes to WS-associated phenotypic features, the ability to generate mouse models is regarded as key for untangling the complex genetics of WS.
In a slightly different context, our studies provide insight about the evolution of the WS region and the genes residing therein. Based on our comparative mapping and sequence data, this region has undergone extensive evolutionary changes in the human and/or mouse lineages since their last common ancestor. For example, the genomic complexities (with respect to large, closely spaced duplicated segments) encountered in the human and other great apes are not present in more distantly related mammals, such as the mouse (DeSilva et al. 1999). Interestingly, these duplicated segments reside at the breakpoints associated with an evolutionary inversion, such that the interval commonly deleted in WS has an inverted orientation in the human versus the mouse genome. In addition, there is a second evolutionary inversion associated with a genomic segment residing just telomeric to the WS region; this segment is contiguous with the rest of the WS region in mouse but discontiguous in human. It is interesting to contemplate the steps that produced two evolutionary inversions and one breakpoint within the human and mouse lineages, as discussed by Valero et al. (2000). At a sequence level, there is also evidence for significant divergence between the mouse and human WS regions. Indeed, the overall level of mouse–human sequence conservation across the WS region is atypically low; this is particularly the case for the noncoding (and nonrepetitive) sequence (Table 3), but is also evident for some genes (e.g., Wbscr15/WBSCR15 [Doyle et al. 2000; Martindale et al. 2000] and Pom121/POM121; see Table 4).
In summary, our studies show how comparative sequence analysis can simultaneously provide valuable data for addressing problems in both human genetics and genome evolution. Based on this experience and the anticipated surge in the acquisition of genomic sequence for numerous other organisms, one can now readily envision a new era of scientific inquiry, in which sequence-based comparisons drive the study of genome structure, function, and evolution.
METHODS
Mouse Genomic Sequencing
The overlapping set of mouse BAC (Shizuya et al. 1992) and PAC (Ioannou et al. 1994) clones shown in Figure 1 and listed in Table 2 were selected from either the contig reported previously (DeSilva et al. 1999; specifically, clones 391O16, 92N10, 303E12, and 42J20 isolated from the Research Genetics CITB-CJ7-B [strain 129SV] mouse BAC library [http://www.resgen.com] and clone P510M19 isolated from the RPCI-21 [strain 129SV] mouse PAC library [http://www.chori.org/bacpac]) or one more recently constructed as part of a larger mouse mapping effort (Thomas et al. 2000; specifically, clones with the prefix ‘RP23’ that were isolated from the RPCI-23 [strain C57BL/6J] mouse BAC library [http://www.chori.org/bacpac; Osoegawa et al. 2000]). Colony-pure clone isolates were subjected to restriction enzyme digest-based fingerprint analysis (Marra et al. 1997), and the resulting data were analyzed with the programs Image and FPC (http://www.sanger.ac.uk/Software; Soderlund et al. 1997, 2000) to assemble BAC/PAC contig maps, which in turn were used to guide the selection of overlapping clones for sequencing. Each selected clone was subjected to shotgun sequencing (Wilson and Mardis 1997; Green 2001), essentially as described previously (DeSilva et al. 2000; Ellsworth et al. 2000; Touchman et al. 2000). Sequences were edited and assembled with the Phred/Phrap/Consed suite of programs (Ewing et al. 1998; Ewing and Green 1998; Gordon et al. 1998).
Comparative Analyses of Mouse and Human Sequences
The generated mouse sequence reported here was subjected to detailed computational analyses, including comparisons with the orthologous human sequence (when available). Genomic sequence from the human WS region was obtained as follows. The available sequence encompassing the LIMK1-RFC2 interval (Martindale et al. 2000) was supplemented with individual sequence records found by searching the NCBI databases (nr and htgs); most often, these records contained draft-level (as opposed to finished) sequence. In some cases, only small sequence contigs were available. For example, the CLDN3 gene could only be found on a ∼1.6-kb stretch of sequence, with the regions immediately flanking the gene not available for comparison with the mouse sequence.
Mouse and human genomic sequences were compared by constructing a percent-identity plot (Hardison et al. 1997; Ellsworth et al. 2000; Schwartz et al. 2000). Specifically, the generated mouse sequence and available human sequence were subjected to repeat masking with the RepeatMasker program (A.F.A. Smit and P. Green, unpubl. data; see http://www.genome.washington.edu/UWGC/analysistools/repeatmask.htm). The human sequence was then aligned relative to the mouse sequence using the BLASTZ component of the PipMaker program (http://bio.cse.psu.edu; Schwartz et al. 2000). In the resulting PIP, segments that were ≥50% identical between mouse and human were plotted, with other regions appearing blank. Gaps within an alignment appear as discontinuities between adjacent horizontal lines. Representative portions of the PIP generated with the sequences from the mouse and human WS regions are shown in Figures 2 and 3, with a more complete summary of the PipMaker results available at http://bio.cse.psu.edu/publications/desilva. Additional information about the range of computational analyses performed is also detailed in Tables 3 and 4.
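The plotting rule described above can be illustrated with a minimal sketch. It is not PipMaker's code. It assumes a pairwise alignment supplied as two equal-length gapped strings with "-" marking gaps, and the function names (gap_free_segments, plotted_segments) are hypothetical. Each gap-free run of aligned columns is treated as one horizontal line of a PIP, positioned along the mouse sequence and drawn only if its identity is at least 50%.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (mouse_start, mouse_end, percent_identity)


def gap_free_segments(mouse_aln: str, human_aln: str) -> List[Segment]:
    """Split a gapped pairwise alignment into gap-free runs.

    Coordinates are 0-based, end-exclusive positions in the ungapped
    mouse sequence. Illustrative only; not PipMaker's implementation.
    """
    if len(mouse_aln) != len(human_aln):
        raise ValueError("aligned strings must have equal length")
    segments: List[Segment] = []
    pos = 0                      # position in the ungapped mouse sequence
    start = matches = length = 0
    for m, h in zip(mouse_aln, human_aln):
        if m == "-" or h == "-":
            # A gap closes the current run, if one is open.
            if length:
                segments.append((start, start + length, 100.0 * matches / length))
            matches = length = 0
        else:
            if length == 0:
                start = pos
            matches += m.upper() == h.upper()
            length += 1
        if m != "-":
            pos += 1
    if length:
        segments.append((start, start + length, 100.0 * matches / length))
    return segments


def plotted_segments(segments: List[Segment], min_identity: float = 50.0) -> List[Segment]:
    """Keep only the runs that would be drawn; the rest stay blank."""
    return [s for s in segments if s[2] >= min_identity]


if __name__ == "__main__":
    for seg in plotted_segments(gap_free_segments("ACGTAC--GGTTAA", "ACGTTCAAGG--AA")):
        print(seg)
```

Because every gap ends a run, adjacent runs appear as separate horizontal lines, which corresponds to the discontinuities noted above.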
Acknowledgments
We thank the staff of the NIH Intramural Sequencing Center (NISC) for their dedicated work in generating the mouse sequence reported here, with special thanks to Michelle Walker, Jyoti Gupta, Sirintorn Stantripop, and Quino Maduro for their efforts in sequence finishing. We also thank the Washington University Genome Sequencing Center for generating the human sequence; Amalia Dutra for FISH studies; Jennifer Munsterteiger for editorial assistance; and Elliott Margulies, Matthew Portnoy, and Arjun Prasad for critical review of the manuscript. This work was supported in part by grant HG02238 (W.M.), grant HG02325-01 (L.E.), and funds for mouse sequencing (E.D.G.) from the National Human Genome Research Institute (NIH).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
- 4 These authors contributed equally to this work.
- Present addresses: 5 Department of Animal Science, Oklahoma State University, Stillwater, OK 74078, USA; 6 Celera Genomics, Rockville, MD 20850, USA.
- 7 Corresponding author. E-MAIL egreen{at}nhgri.nih.gov; FAX 301-402-4735.
- Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.214802.
- Received September 18, 2001.
- Accepted November 7, 2001.
- Cold Spring Harbor Laboratory Press














