Identification and Chromosomal Localization of Human Genes Containing CAG/CTG Repeats Expressed in Testis and Brain

Frédérique Bulle,1,3 Nuchanard Chiannilkulchai,2 André Pawlak,1 Jean Weissenbach,2 Gabor Gyapay,2 and Georges Guellaën1

1Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 99, Hôpital Henri Mondor, 94010 Créteil, France; 2Centre National de la Recherche Scientifique (CNRS), URA 1922, Généthon, 91002 Evry CEDEX, France

Abstract

Human genes containing triplet repeats have been shown to be involved in several neurodegenerative diseases through expansion of the repeat in succeeding generations. To identify novel genes involved in such pathologies, we have isolated transcripts containing (CAG/CTG)n repeats using two approaches. First, we screened 4 × 10^6 clones representing 10 copies of a human testis cDNA library using a (CAG)14 oligonucleotide probe. Among the 910 clones identified, the 243 clones with the strongest hybridization signal were sequenced partially from 3′ or 5′ ends. This provided us with 251 partial sequences that grouped into clusters corresponding to 39 genes, of which 19 represent unknown species. Second, we selected 203 additional ESTs containing (CAG/CTG)n repeats representing 121 clusters from the IMAGE consortium infant brain cDNA library. From these two series of sequences, we have localized 95 genes on human chromosomes using a whole-genome radiation hybrid panel (Genebridge 4). These genes are located on all of the chromosomes except for chromosome X, the highest density being observed on chromosome 19.

[The sequence data described in this paper have been submitted to GenBank under accession nos. AA065241–AA065346.]

The human genome contains a large number of short tandem repeats (also known as microsatellites), including trinucleotide repeats in stretches of five or more, that have been detected in at least 50 genes (Riggins et al. 1992). Expansions of various types of these trinucleotide repeats have been implicated in genetic diseases. Although CGG and GAA repeats are expanded in different fragile X syndromes (Kremer et al. 1991; Verkerk et al. 1991; Yu et al. 1991; Knight et al. 1993; Jones et al. 1994; Nancarrow et al. 1994; Parrish et al. 1994) and Friedreich’s ataxia (Campuzano et al. 1996), respectively, CTG and CAG repeats are involved in a larger series of pathologies. Amplification of CTG repeats has been described in myotonic dystrophy (MD) (Aslanidis et al. 1992; Brook et al. 1992; Fu et al. 1992), whereas amplifications of CAG repeats have been observed in spinal and bulbar muscular atrophy (SBMA) (La Spada et al. 1991), spinocerebellar ataxia type 1 (SCA1) (Orr et al. 1993), Huntington’s disease (HD) (The Huntington’s Disease Collaborative Research Group 1993), dentatorubral-pallidoluysian atrophy (DRPLA) (Koide et al. 1994; Nagafuchi et al. 1994), Machado-Joseph disease or spinocerebellar ataxia type 3 (MJD or SCA3) (Kawaguchi et al. 1994), spinocerebellar ataxia type 2 (SCA2) (Imbert et al. 1996; Pulst et al. 1996; Sanpei et al. 1996), and spinocerebellar ataxia type 6 (SCA6), the last identified SCA associated with a CAG expansion (Zhuchenko et al. 1997). All of these diseases exhibit instability in transmission of the expanded repeat from parent to offspring. In some of them, the increase in repeat size correlates with an increase in disease severity and a decrease in age of onset or penetrance as established for HD (Duyao et al. 1993) and MJD (Kawaguchi et al. 1994). In addition, the expansion of the repeats occurs in the transcribed part of the gene; GAA repeats are located in the intron, CGG and CTG in the untranslated exons, and CAG in the coding exons of the related gene.

The study of CAG repeats is of special interest for at least three reasons: (1) As mentioned previously, this type of repeat is involved in at least six neurodegenerative diseases; (2) CAG repeats are translated into polyglutamine stretches, domains that are often present in transcription factors and may function as a polar zipper interacting with other proteins (Perutz et al. 1994); and (3) experiments in Escherichia coli have shown that CAG/CTG tracts are expanded at least eight times more frequently than any of the other nine triplets (Ohshima et al. 1996).

Therefore, the identification and mapping of genes containing CAG repeats are of importance, as such genes represent potential candidates for diseases that exhibit genetic anticipation (La Spada et al. 1994), such as unipolar and bipolar disorders (McInnis et al. 1993; Engstrom et al. 1995; O’Donovan et al. 1995), autosomal dominant cerebellar ataxia (ADCA) type I (Durr et al. 1996) and type II (Benomar et al. 1995), familial nonspecific dementia (Brown et al. 1995), and schizophrenia (Ross et al. 1993; Bassett and Honer 1994; Morris et al. 1995; O’Donovan et al. 1995; Bowen et al. 1996), although anticipation is still questionable for this last disease (Petronis et al. 1996; Sasaki et al. 1996).

To aid in the identification of new genes containing CAG/CTG repeats, we decided to look for transcripts containing such repeats in human testis. This tissue expresses a large number of mRNAs, many of which are shared exclusively with nervous tissue, such as those encoding neuropeptide precursors (proenkephalin or pro-opiomelanocortin) (Wolgemuth and Watrin 1991) or enzymes of neurotransmitter biosynthesis (glutamate decarboxylase) (Persson et al. 1990). In the present work, the screening of a human testis cDNA library with a CAG-specific probe resulted in the identification of 39 CAG-containing genes, 19 of them corresponding to new genes. In parallel, we analyzed expressed sequence tags (ESTs) from the IMAGE consortium obtained from infant brain cDNA clones positive for CAG hybridization. From this analysis, we collected 121 CAG-containing clusters. From these two pools of transcripts, we have mapped 95 CAG/CTG-containing genes using radiation hybrid mapping.

RESULTS

The strategy for the analysis of sequences containing CAG repeats is outlined in Figure 1.

Figure 1.

Schematic outline of the strategy used to identify and map new CAG-containing transcripts.


Isolation of Human Testis mRNAs Containing CAG Repeats (Group A)

The screening of 4 × 10^6 clones from a human testis cDNA library, using (CAG)14 as a probe, produced 910 positive clones. We selected the 243 clones eliciting the strongest signal and sequenced them from their 3′ end, as this region is likely to correspond to a single exon (Hawkins et al. 1988) and, therefore, is more suitable for deriving sequence-tagged sites (STSs) for mapping purposes (Hayes et al. 1996). After the analysis of these sequences, we rejected all clones lacking a poly(A) tail and we sequenced the 5′ end of the insert when no repeat was detected in the 3′ end sequence or when the sequence in the 3′ end was not informative enough. From the 243 cDNA clones, we obtained 251 sequences from the 3′ or 5′ ends that ranged between 200 and 400 bp with an average size of 380 bp. These sequences have been deposited in dbEST under accession numbers AA065241–AA065346.

These sequences were submitted to two types of analysis. First, they were compared among themselves to eliminate exact duplicates. This analysis led us to discard 146 sequences (58%). We kept overlapping sequences as well as identical sequences containing repeats of various sizes. The remaining 105 sequences (42%) were assembled into 39 independent clusters (group A) (Table 1). Sequences were incorporated in the same cluster when they exhibited at least 98% identity in nucleic acid sequence. Second, the sequences corresponding to these 39 clusters were compared with sequences present in nucleic acid and protein sequence databases using BLAST programs (Altschul et al. 1990): 12 clusters corresponded to genes already known in human, 8 clusters were found to be homologous to known genes in human or in other species, 16 clusters matched only anonymous ESTs, and 3 clusters did not give any match at all.
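The duplicate removal and 98% identity grouping described above can be sketched as follows. This is a minimal illustration only: the `identity` function is a hypothetical ungapped, position-by-position comparison standing in for the FASTA comparison actually used, and the single-linkage merging rule is an assumption, not a description of the original procedure.

```python
def identity(a: str, b: str) -> float:
    """Ungapped identity over the shared length (assumption: a stand-in
    for a real alignment-based identity such as one reported by FASTA)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def cluster(seqs, threshold=0.98):
    """Drop exact duplicates, then group sequences by single linkage:
    a sequence joins (and merges) every cluster that holds a member
    with identity >= threshold."""
    unique = list(dict.fromkeys(seqs))  # exact duplicates removed, order kept
    clusters = []
    for s in unique:
        keep, merged = [], [s]
        for c in clusters:
            if any(identity(s, m) >= threshold for m in c):
                merged.extend(c)   # s links this cluster to the growing one
            else:
                keep.append(c)
        clusters = keep + [merged]
    return clusters


if __name__ == "__main__":
    a = "ACGT" * 25                 # 100-bp toy sequence
    b = a[:10] + "T" + a[11:]       # one mismatch: 99% identity to a
    c = "TTTT" * 25                 # unrelated sequence
    for members in cluster([a, b, a, c]):
        print(len(members), "sequence(s) in cluster")
```

A real analysis would replace `identity` with scores from a gapped alignment, since overlapping partial reads rarely start at the same position.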

Table 1.

Summary of Testis cDNA Sequences


Analysis of the repeats present in the sequences revealed that 21 clusters exhibit a CAG or CTG repeat located either in the 3′ or in the 5′ region of the cDNA. One-third of these repeats contained three to nine triplets, the remaining two-thirds had between 10 and 20 triplets, and three cases had >20 repeats. In 15 clusters, we observed insertions of CAA or TTG triplets in the CAG or CTG repeats, respectively, thus extending the stretch of glutamine. In six clusters, we observed small insertions of nine bases or less in the CAG repeat. Two variations in the size of the repeat were observed; 13, 15, and 17 CAG repeats are present in the different cDNAs of cluster 14, as well as 10 and 11 CAG repeats in cluster 25. In 18 clusters, we did not detect any repeat in the partial sequences that were obtained. Nevertheless, it remains likely that these clones contain CAG repeats. This statement is based on the fact that 4 clusters among 18, corresponding to already known human genes, identify genes containing CAG repeats (e.g., monocyte differentiation antigen precursor, myotonic dystrophy kinase, nucleolar phosphoprotein p130, human 54-kD protein mRNA). Only one cluster matches a human gene that contains a CAG-rich region with no perfect successive repeats (e.g., human XRCC4 mRNA). Complete sequencing of these cDNAs will be needed to confirm the presence of the CAG repeat.
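The repeat features reported here (number of uninterrupted triplets, and longer tracts once CAA or TTG interruptions are tolerated) can be measured with a short script. The sketch below is illustrative; the function names and the example sequence are hypothetical and were not part of the original analysis.

```python
import re


def longest_run(seq: str, pattern: str) -> int:
    """Length, in triplets, of the longest uninterrupted tandem run of
    triplets matching `pattern` (a regex alternation of triplets)."""
    runs = re.findall(rf"(?:{pattern})+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def describe_repeat(seq: str) -> dict:
    """Summarize CAG/CTG tracts, with and without CAA/TTG interruptions."""
    return {
        "pure_CAG": longest_run(seq, "CAG"),
        "pure_CTG": longest_run(seq, "CTG"),
        "CAG_or_CAA": longest_run(seq, "CAG|CAA"),   # glutamine-coding tract
        "CTG_or_TTG": longest_run(seq, "CTG|TTG"),   # same tract, other strand
    }


if __name__ == "__main__":
    # Hypothetical read: five CAG, one CAA interruption, seven CAG.
    read = "GGA" + "CAG" * 5 + "CAA" + "CAG" * 7 + "TTC"
    print(describe_repeat(read))
```

Counting `CAG|CAA` together reflects the observation that a CAA interruption breaks the perfect repeat without breaking the encoded glutamine stretch.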

Selection of Human ESTs Containing CAG Repeats (Group B)

The IMAGE consortium screened a subset of 40,000 clones from the normalized infant brain cDNA library 1 (NIB1) of B. Soares, using CAG oligonucleotides. One hundred eighty-six positive clones likely to contain CAG were obtained and listed in the IMAGE web page (http://www.bio.llnl.gov/bbrp/image/itri.htlm). From this series of clones, 203 sequences (350–500 bp), from either the 3′ or 5′ region of the insert (mean 1820 bp), were recovered from dbEST. These sequences were assembled into 121 clusters according to the same strategy as the one described for the human testis cDNA clones (Fig. 1).

Chromosomal Localization of cDNA Clones and ESTs

The human testis CAG/CTG-containing clones and the human ESTs were localized using a radiation hybrid panel. In this technique, segments of human chromosomes obtained by X-ray irradiation are rescued in rodent recipient cells. The distance between two loci can be estimated from their frequency of coretention, which reflects the number of radiation-induced breaks between them (Cox et al. 1990). In our case, the CAG/CTG-containing clones were mapped using 90 hybrids from the Genebridge 4 radiation hybrid panel (Gyapay et al. 1996). In general, for each cluster only one sequence, located preferentially at the 3′ end, was retained for primer design. For some infant brain EST clusters, the localization was achieved by using oligonucleotides derived from ESTs that belong to the same Unigene cluster as the CAG-positive clone (Schuler et al. 1996).
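To illustrate how coretention translates into a distance, the sketch below applies the standard two-point radiation hybrid model. It assumes random breakage, independent retention of fragments, and a single retention frequency r shared by both markers; under these assumptions the expected fraction of discordant hybrids is 2θr(1 − r), where θ is the breakage probability, and the distance is −ln(1 − θ) Rays. This is a simplified estimator for illustration, not the multipoint calculation performed by RHMAP, and the retention vectors shown are invented.

```python
import math


def two_point(a: str, b: str):
    """a, b: retention vectors over the same hybrids ('1' present, '0' absent;
    any other character is treated as untyped and skipped).
    Returns (retention frequency r, breakage probability theta, distance in cR)."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "01" and y in "01"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no hybrids typed for both markers")
    r = sum((x == "1") + (y == "1") for x, y in pairs) / (2 * n)
    if not 0 < r < 1:
        raise ValueError("retention frequency must lie strictly between 0 and 1")
    discordant = sum(x != y for x, y in pairs) / n
    # Expected discordant fraction is 2 * theta * r * (1 - r); solve for theta.
    theta = min(discordant / (2 * r * (1 - r)), 1.0)
    dist_cr = math.inf if theta >= 1 else -100 * math.log(1 - theta)
    return r, theta, dist_cr


if __name__ == "__main__":
    # Invented typing results over 90 hybrids: the markers differ in 2 hybrids.
    marker = "1" * 30 + "0" * 60
    est = "1" * 28 + "0" * 62
    r, theta, d = two_point(marker, est)
    print(f"retention={r:.3f} theta={theta:.3f} distance={d:.1f} cR")
```

Markers that are almost always retained or lost together give a small θ and therefore map close to each other; unlinked markers give θ near 1.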

Of the 39 genes expressed in human testis that we analyzed, 27 (69%) were localized (Table 2), and of 121 EST clusters derived from the brain cDNA library, 68 (56%) were mapped successfully (Table 3). Several clusters could not be localized for different reasons: (1) primers failed to amplify (majority of cases); (2) presence of background from hamster DNA; (3) human bands had a different size from that expected; and (4) sequences were too short for primer determination.

Table 2.

Localization of CAG-Containing cDNA from Human Testis


Table 3.

Summary and Localization of CAG Sequences from IMAGE Source


DISCUSSION

In the present study we used cDNA sequences obtained by two different approaches to identify and map new human genes containing CAG repeats.

The screening of 4 × 10^6 clones, equivalent to 10 copies of a human testis cDNA library, allowed us to identify 910 clones containing CAG repeats. The 243 clones that exhibited the strongest hybridization signals were further characterized. When duplicates were eliminated, these clones appeared to correspond to 39 genes containing CAG repeats. With respect to the number of plated clones, this represents roughly 1 in 10,000 (39 in 400,000 for one copy of the library). This number is lower than the ratio of 37 in 10,000 reported by Néri et al. (1996) from a human fetal brain cDNA library, or 28 in 10,000 from a human cerebral cortex cDNA library (Li et al. 1993), and 7 in 10,000 from another fetal brain library (Riggins et al. 1992). Such differences might result from the differential expression of transcripts between testis and brain, but also from differing experimental conditions. In the series that we analyzed, we selected clones with intense hybridization signals. In addition, preliminary tests that we performed on 26 of the remaining 667 clones, which gave low hybridization signals, allowed us to identify new transcripts that contain 6–11 CAGs. Thus, the population of transcripts that contain CAG repeats in normal human testis is certainly larger than our initial observations indicate. In addition, at least in our study, the intensity of the hybridization signal correlates more or less with the number of repetitions.
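The frequency comparison in this paragraph is simple arithmetic, reproduced below as a check. The figures are those quoted in the text; the variable names are illustrative only.

```python
LIBRARY_COPY = 400_000   # clones in one copy of the testis library, as quoted
TESTIS_GENES = 39        # CAG-containing genes identified in testis


def per_10k(count: int, total: int) -> float:
    """Frequency expressed per 10,000 clones."""
    return count / total * 10_000


testis_rate = per_10k(TESTIS_GENES, LIBRARY_COPY)

# Frequencies per 10,000 clones reported by the earlier brain studies.
reported = {
    "Néri et al. 1996 (fetal brain)": 37,
    "Li et al. 1993 (cerebral cortex)": 28,
    "Riggins et al. 1992 (fetal brain)": 7,
}

if __name__ == "__main__":
    print(f"testis: {testis_rate:.3f} per 10,000 clones")
    for study, rate in reported.items():
        print(f"{study}: {rate / testis_rate:.1f}-fold the testis frequency")
```

Because only the 243 strongest-hybridizing clones were characterized, the testis figure is best read as a lower bound, as the text goes on to note.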

The average size of the CAG/CTG repeats in human testis cDNA analyzed in this study (strong hybridization signal) was ∼13, with at least 30% of the 39 clusters above this value. The different lengths that we observed are within the same range of repeat numbers usually observed in normal alleles of disease genes (5–54 trinucleotide repeats). Other reports, analyzing either genomic DNA or cDNA, mentioned lower numbers in CAG repeats. Gastier et al. (1996), analyzing the (CAG/CTG)n repeat lengths in 479 unique genomic clones, observed 30% of the repeats with six triplets, whereas the repeats with 13 copies represented only 2%. In human fetal brain cDNA, Néri et al. (1996) observed only 13 of 88 (15%) clones that exhibited repeats of size above nine. The larger size repeats that we observed result from our selection of the clones with the highest hybridization signal as mentioned above.

In most of the clusters, the CAG repeats were not perfect. First, in 15 clusters, we observed the presence of CAA triplets in the CAG repeat. This triplet also encodes glutamine and is also present in genes for HD, DRPLA, SBMA, SCA2, and MJD1. Second, for six clusters, we detected small insertions with sizes between 3 and 9 nucleotides. For five of those clusters, the insertion did not change the possible open reading frame (ORF). Similar insertions were found in the gene of SCA1, SBMA, and MJD1. As already described for the SCA1 gene, this might contribute to the stabilization of the repeat length (Chung et al. 1993). For the remaining cluster, the sequence was not accurate enough to determine whether the insertion induces a frameshift in the ORF.

Until now, genes with CAG expansions that are likely involved in genetic diseases have had two specific features: (1) the stretch of CAG is translated into polyglutamine; and (2) in general, this locus is highly polymorphic. With partial sequencing of cDNA, it is impossible to predict with accuracy the ORF in which the CAG repetition is inserted. Therefore, such stretches could be translated as poly(Gln), poly(Ser), or poly(Ala). As an example, one of our clones corresponds to the nucleolar phosphoprotein p130 and contains a CAG repeat coding for a poly(Ser), which is not known to be involved in a neurodegenerative disease. Nevertheless, one cannot exclude implication of a CAG amplification in a poly(Ser) or poly(Ala) in genetic diseases. For example, an expansion of a polyalanine stretch in the amino-terminal region of HOXD13 is associated with synpolydactyly (Muragaki et al. 1996). For two genes, we detected some polymorphisms in the length of the CAG/CTG repeat. We observed a variation of 13 to 17 CAGs for gene 14, and 10 or 11 CAGs for gene 25. This could reflect the allelic mosaicism observed previously for this kind of gene in human testis (Zühlke et al. 1993; Telenius et al. 1995; Zhang et al. 1995), which may occur in meiosis during spermatogenesis.
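The dependence of the encoded homopolymer on the reading frame can be verified directly. The short sketch below translates a CAG tract in each of the three forward frames using the standard genetic code; the CTG entries (the same tract read on the opposite strand) are added for completeness and are not discussed in the text. The function name is illustrative.

```python
# Standard genetic code, restricted to the codons that occur in CAG/CTG tracts.
CODON = {
    "CAG": "Q", "AGC": "S", "GCA": "A",   # CAG tract: Gln, Ser, Ala
    "CTG": "L", "TGC": "C", "GCT": "A",   # CTG tract: Leu, Cys, Ala
}


def frames(unit: str, n: int = 6) -> dict:
    """Translate n codons of a tandem triplet repeat in frames 0, 1, and 2."""
    tract = unit * (n + 1)                 # one extra unit so shifted frames fit
    out = {}
    for f in range(3):
        codons = [tract[i:i + 3] for i in range(f, f + 3 * n, 3)]
        out[f] = "".join(CODON[c] for c in codons)
    return out


if __name__ == "__main__":
    for unit in ("CAG", "CTG"):
        print(unit, frames(unit))
```

A partial cDNA read that does not reach the start codon cannot distinguish among these frames, which is why the encoded homopolymer remains undetermined for most clusters.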

From the 39 clusters that we identified, 19 correspond to unknown genes and 20 to already characterized genes. Among this population, we retrieved two of the seven genes already described to be involved in CAG diseases, the MJD1 protein and the myotonic dystrophy kinase, which are the products of genes involved in Machado-Joseph disease and myotonic dystrophy, respectively. This is the first observation that indicates an active transcription of these genes in human testis.

The cloning of cDNAs containing CAG repeats by hybridization is an efficient method for the detection of new candidate genes, but this technique is very time-consuming and it is difficult to eliminate redundancy. As a complementary approach, we screened databases for ESTs containing CAG. The IMAGE consortium identified by hybridization with a CAG specific probe a large series of transcripts that are likely to contain such repeats. The corresponding cDNAs, once characterized by 3′ partial sequencing, did not reveal long stretches of CAG, but larger repeats can be present upstream in the transcripts. This approach, although less reliable, allows us to screen rapidly a larger pool of cDNA from various tissues.

Until now, the various studies that have been done to identify new genes containing CAG repetitions have reported ∼100 genes (Riggins et al. 1992; Li et al. 1993; Jiang et al. 1995; Aoki et al. 1996; Néri et al. 1996). Of these genes, only 17 have been assigned to chromosomes and 7 of them sublocalized. In the present study, we have localized the largest group of genes containing CAG repeats. We have mapped 27 testis cDNAs and 68 EST sequences containing CAG/CTG repeats using a radiation hybrid panel. All the present localizations agree with the previous assignments when available.

In our study, CAG-containing genes were found on all chromosomes except chromosome X (chromosome Y was not included in the radiation hybrid panel). The distribution of the CAG-containing genes is not even, the largest number (10) being present on chromosome 19, whereas only one gene was detected on chromosome 2.

Some of the localizations of anonymous sequences that we have found are very close to loci involved in autosomal dominant genetic diseases associated with progressive neuropathy, such as Charcot-Marie-Tooth disease type B (3q13–q22), schizophrenia disorder 1 (5q11.2–q13.3), Charcot-Marie-Tooth neuropathy (8q13–q21.1), spinocerebellar ataxia 4 and 8 (16q, 10q23.1–24.1), or schizophrenia disorder 4 (22q11). Further studies are needed to investigate whether some of these genes might be related to these genetic diseases.

METHODS

cDNA Cloning

Poly(A)+ RNA from the testis of a 27-year-old man was isolated as described previously (Matsuoka et al. 1992). The cDNA library was constructed using the Superscript plasmid system and plasmid cloning kit (Life Technologies, Inc.) according to the manufacturer’s specifications, except that reverse transcription was done at 42°C for 1 hr. The cDNA was size-selected on S400 Sephacryl columns, and the material >700 bp was inserted in an oriented manner into the pSPORT1 vector, using NotI–SalI adaptors. The resulting plasmids, once transfected into XL1-Blue, gave 360,000 independent colonies. Ten copies of this library (4 × 10^6 clones) were plated onto filters and hybridized with a 5′-32P-labeled oligonucleotide (CAG)14 in 6× SSC [1× SSC: 150 mM NaCl/15 mM sodium citrate (pH 7.0)]; 5× Denhardt’s solution (1× Denhardt’s solution: 0.2% bovine serum albumin, 0.02% Ficoll, 0.02% polyvinylpyrrolidone); 0.1% SDS; 5 mM EDTA (pH 7.5); and 100 μg/ml of denatured salmon sperm DNA for 16 hr at 42°C. After hybridization, filters were washed twice in 0.5× SSC with 0.1% SDS at 65°C for 1 hr and exposed at −80°C to Amersham Hyperfilm for 16 hr with one intensifying screen.

DNA Sequencing

Plasmid minipreps were performed using a minikit Tip 20 (Qiagen, Chatsworth, CA) according to the manufacturer’s specifications. Plasmid DNA concentrations were adjusted to 250 ng/μl based on absorbance at 260 nm. Plasmids were sequenced according to Sanger’s method using fluorescent dye-labeled primers and cycle sequencing kits (Applied Biosystems) as described previously (Pawlak et al. 1995). The reaction products were analyzed on a 373A automated DNA sequencer (Applied Biosystems). Sequencing was performed systematically on the 3′ end of the cDNA using the SP6 or −21M13 primer and, when necessary, on the 5′ end using the T7 or M13 reverse primer.

Sequence Analysis

The sequences were edited manually and limited to 400 bp and 2% ambiguities (N). The redundancy was evaluated by internal comparison of those sequences using the FASTA program. The sequences were sent to the National Center for Biotechnology Information (NCBI) for BLASTX and BLASTN analysis (Altschul et al. 1990) against the nonredundant nucleic acid and protein libraries. Sequence similarities identified by the BLAST programs were considered statistically significant when scores were >150 and >75 for nucleic acid and amino acid sequences, respectively, or when the Poisson P value was <0.05. The BLASTX and BLASTN results for each clone were analyzed simultaneously and processed manually. We always selected the protein match when a hit was detected with both types of analyses.
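The significance criteria and the preference for protein matches can be expressed as a small filter. The sketch below is illustrative: the `Hit` record, its field names, and the rule of taking the highest-scoring hit within the preferred pool are assumptions, since the original results were processed manually.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    program: str     # "BLASTN" (nucleic acid) or "BLASTX" (protein)
    subject: str     # database entry matched (hypothetical identifiers below)
    score: float
    p_value: float   # Poisson P value reported by BLAST


# Score thresholds stated in the text.
MIN_SCORE = {"BLASTN": 150, "BLASTX": 75}


def is_significant(hit: Hit) -> bool:
    """Significant if the score exceeds the threshold or P < 0.05."""
    return hit.score > MIN_SCORE[hit.program] or hit.p_value < 0.05


def best_match(hits) -> Optional[Hit]:
    """Prefer significant protein (BLASTX) matches over nucleic acid ones.
    Assumption: within the chosen pool, the highest score wins."""
    significant = [h for h in hits if is_significant(h)]
    protein = [h for h in significant if h.program == "BLASTX"]
    pool = protein or significant
    return max(pool, key=lambda h: h.score, default=None)


if __name__ == "__main__":
    hits = [
        Hit("BLASTN", "est_example", 400, 1e-30),
        Hit("BLASTX", "protein_example", 90, 1e-5),
        Hit("BLASTX", "weak_protein", 60, 0.2),
    ]
    print(best_match(hits))
```

In this example the protein match is returned even though the nucleic acid hit has the higher score, mirroring the rule stated in the last sentence above.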

PCR for Radiation Hybrid Mapping

Primers for the PCRs were designed using the program described by Rychlik and Rhoads (1989), which was adapted to large-scale primer design. The repeat elements, such as Alu, Kpn, and LINE, were masked first and then the primers were selected according to the desired criteria. PCRs were performed on DNA obtained from the Genebridge 4 radiation hybrid panel (Gyapay et al. 1996). The PCRs were carried out in a volume of 15 μl. The final concentrations in the PCR were as follows: 2 ng/μl of DNA, 125 nM dNTP (31 nM of each), 1.33 μM primers (of each), 50 mM KCl, 2 mM MgCl2, 0.1% of Triton X-100, 0.01% of gelatine, 10 mM of Tris-HCl (pH 9.0) (25°C), and 0.25 units per 15 μl of Taq polymerase. The samples were overlaid with heavy mineral oil. Amplifications were performed using the hot start procedure. The first three cycles consisted of 30 sec of annealing at 61°C and 40 sec of denaturing at 94°C. The annealing temperature was lowered successively by 2°C for each consecutive three cycles until 55°C, followed by 25 further cycles at an annealing temperature of 55°C. After completion of the PCR, 4 μl of loading mixture containing 0.1% (wt/vol) bromophenol blue and 50% (vol/vol) glycerol were added to each well. The PCR products were allowed to migrate on an agarose gel containing 1% SeaKem and 3% NuSieve agarose in TBE buffer with 0.25 μg/ml ethidium bromide. Then, the images of the gels were recorded with a CCD camera and scoring of the results was carried out semiautomatically with the BioImage software developed by Millipore. Typing results were downloaded into a database. The calculations were performed using the RHMAP package (Boehnke et al. 1991). Positioning of the CAG/CTG-containing clones or ESTs was carried out relative to ∼1000 evenly distributed Généthon genetic markers. In the course of the calculations, the program positioned the ESTs into each interval defined by the adjacent genetic markers and the probability of this position was calculated. The highest probability was retained and considered as the real position of the given locus.
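The stepped annealing profile described above can be written out explicitly. The sketch below generates the annealing temperature for each cycle. One point is an interpretation rather than a stated fact: whether three cycles at 55°C are run as the last step of the ramp before the 25 further cycles. The `include_stop_step` parameter (a hypothetical name) covers both readings.

```python
def touchdown_schedule(start=61, stop=55, step=2, cycles_per_step=3,
                       final_cycles=25, include_stop_step=True):
    """Return the annealing temperature (°C) of every cycle, in order.

    include_stop_step=True  : 61, 59, 57, 55 for three cycles each, then 25 more at 55
    include_stop_step=False : 61, 59, 57 for three cycles each, then 25 at 55
    """
    schedule = []
    temp = start
    while temp > stop or (include_stop_step and temp == stop):
        schedule.extend([temp] * cycles_per_step)
        temp -= step
    schedule.extend([stop] * final_cycles)
    return schedule


if __name__ == "__main__":
    cycles = touchdown_schedule()
    print(len(cycles), "cycles in total")
    print("ramp:", cycles[:12])
```

Starting above the final annealing temperature favors specific priming in the earliest cycles, which is useful when human sequences must be amplified against a rodent background.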

Acknowledgments

We thank Y. Laperche and T. Rohn for the critical reading of the manuscript, R. Derreumaux for excellent technical assistance, and Edith Grandvilliers for her secretarial assistance. This work was funded by INSERM and the Groupement de Recherche sur l’Etude des Génomes.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

  • 3 Corresponding author.

  • E-MAIL bulle{at}im3.inserm.fr; FAX 33-1-48-98-09-08.

    • Received February 13, 1997.
    • Accepted May 1, 1997.

REFERENCES