The biological effects of simple tandem repeats: Lessons from the repeat expansion diseases
Abstract
Tandem repeats are common features of both prokaryote and eukaryote genomes, where they can be found not only in intergenic regions but also in both the noncoding and coding regions of a variety of different genes. The repeat expansion diseases are a group of human genetic disorders caused by long and highly polymorphic tandem repeats. These disorders provide many examples of the effects that such repeats can have on many biological processes. While repeats in the coding sequence can result in the generation of toxic or malfunctioning proteins, noncoding repeats can also have significant effects including the generation of chromosome fragility, the silencing of the genes in which they are located, the modulation of transcription and translation, and the sequestering of proteins involved in processes such as splicing and cell architecture.
Tandem repeats are a ubiquitous feature of the genomes of many organisms, and the human genome database is replete with examples of repeat tracts of different sequence and repeat number in different genomic locations (Riggins et al. 1992; Epplen et al. 1993; Subramanian et al. 2003; Subirana and Messeguer 2008). Estimates from the human genome sequencing project indicate that such repeats make up ∼3% of the sequenced human genome (Lander et al. 2001). However, the true number and length of such repeats may be much higher than the current database suggests since these sequences are prone to deletion when propagated in bacteria and yeast. Additionally, these repeats are often unstable or hypervariable in mammals. Thus, genomes are polymorphic with respect to many of these repeats, with some individuals or families having some tandem repeat tracts that are significantly longer than those seen in the general population. For example, most humans have 30 CGG•CCG repeats in the 5′ UTR of their FMR1 gene (Fu et al. 1991; Eichler et al. 1994; Strom et al. 2007). However, population studies in Caucasians, the only population for which significant data exist, indicate that ∼1 in 246–468 females have 55–200 repeats and ∼1 in 3717–8918 males have 200 to >1000 repeats at this locus (Crawford et al. 2001).
These larger repeat lengths are not necessarily biologically neutral. For example, FMR1 alleles with 55–200 CGG•CCG-repeats are associated with neurodegeneration (Hagerman and Hagerman 2004) and ovarian insufficiency (Murray 2000; Sherman 2000). Alleles with >200 repeats are associated with intellectual disability and autistic symptoms (Hagerman 2006). The FMR1 gene is not unique in having significant repeat length polymorphism that is associated with disease pathology. In fact, to date 20 disorders have been identified in humans that result from the presence of a large expansion-prone DNA tandem repeat. These diseases, known collectively as the repeat expansion diseases, are listed in Table 1. Two recent books discuss most of these diseases in some detail, and the reader is referred to these sources for more information as to their incidence, genetics, and pathology (Fry and Usdin 2006; Wells and Ashizawa 2006). Diseases are also known that involve an expanded amino acid tract generated by insertion or duplication of an imperfect trinucleotide repeat. These diseases differ in key ways from “classic” members of the repeat expansion diseases. This review will focus primarily on those “classic” or “nucleotide repeat expansion diseases.” More information about those diseases arising from expansions at the protein level can be found elsewhere (Albrecht and Mundlos 2005).
The repeat expansion diseases
Because of their association with human disease, the biological effects of the repeats responsible for these diseases have been studied intensively. As a result, a number of new paradigms have emerged for understanding how tandem repeats can affect genome structure and function. This review will focus on current ideas as to what effects these repeats have on the region of the genome in which they are located and what the ramifications may be for related repeats elsewhere.
The repeat expansion diseases can be divided into two categories: those like Huntington Disease (HD) or spinobulbar muscular atrophy (SBMA), where the repeat is located in an exon, and those like myotonic dystrophy (DM) or Fragile X syndrome (FXS), where the repeat is outside of the open reading frame (ORF) (Table 1). Both coding and noncoding repeats can have significant impact.
Open reading frame expansions
In all repeat expansion diseases identified thus far in which the repeat is in an exon, the repeat unit is a triplet with the sequence CAG•CTG. To date, at least 10 such diseases have been identified (Table 1). The result of the expansion is an increase in the length of a polyglutamine (polyQ) tract in the encoded protein. This is associated with the death of vulnerable neurons in the brain. The specific features of each disorder are thought to result from the particular combination of the properties of the expanded polyQ tract and the protein in which it is located (Orr and Zoghbi 2007). PolyQ is thought to cause conformational changes that confer toxic properties to the protein. This is accompanied by the appearance of polyQ-containing inclusions that may or may not be directly pathogenic. It has been suggested that failure to properly degrade the misfolded polyQ proteins via either the autophagic or the ubiquitin-proteasome pathways contributes to polyQ toxicity. Mitochondrial dysfunction, excitotoxicity, and disrupted intracellular trafficking have also been reported in cells expressing polyQ proteins (for review, see Orr and Zoghbi 2007). Many of the polyQ proteins also interact with glutamine-rich transcription factors and cofactors like CREB-binding protein (CBP, also known as CREBBP) and SP1 (for review, see Orr and Zoghbi 2007). In some cases, factors like CBP have been found in inclusions, and CBP overexpression is known to rescue the polyQ toxicity in cells in tissue culture (Nucifora et al. 2001; Steffan et al. 2001). Sequestering of these proteins in aggregates may explain the widespread effect of polyQ on transcription. Reduced availability of transcription factors like SP1 could account for specific changes in gene expression, while inhibition of the histone acetyltransferase activities of proteins like CBP could account for the more global changes that are seen.
Histone deacetylase inhibitors alleviate polyQ toxicity in various model systems (Steffan et al. 2001; Hughes 2002; Hockly et al. 2003; Bates et al. 2006), consistent with the hypothesis that polyQ toxicity results from the sequestering of proteins that affect gene expression by altering normal chromatin.
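The relationship between an exonic CAG•CTG repeat and the polyQ tract it encodes can be illustrated with a minimal sketch. The function name and the toy sequence below are hypothetical and chosen only for illustration; the sketch counts in-frame CAG codons and deliberately ignores CAA, which also encodes glutamine.

```python
def longest_inframe_cag_run(coding_seq: str) -> int:
    """Return the longest run of consecutive in-frame CAG codons.

    Assumes `coding_seq` starts at the first base of a codon. Each CAG
    codon encodes one glutamine, so the run length equals the length of
    the uninterrupted CAG-encoded polyQ tract.
    """
    seq = coding_seq.upper()
    # Split into complete codons, discarding any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    best = run = 0
    for codon in codons:
        run = run + 1 if codon == "CAG" else 0
        best = max(best, run)
    return best


# Toy ORF (not a real gene): ATG, five CAGs, one CCG interruption, two CAGs, stop.
toy_orf = "ATG" + "CAG" * 5 + "CCG" + "CAG" * 2 + "TAA"
print(longest_inframe_cag_run(toy_orf))
```

Reading the sequence codon by codon, rather than searching for the text "CAG" anywhere, keeps out-of-frame matches from being counted as glutamines.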
In addition to the toxic gain of function seen in expanded alleles, a reduction in the normal functioning of the affected protein can occur. For example, in SCA6, the small polyQ expansions seen in the affected protein, the alpha subunit of the P/Q-type voltage-gated calcium channel, affect channel function (Kordasiewicz and Gomez 2007). The relative contribution of impaired channel function and polyQ toxicity to SCA6 pathology is still the subject of some debate (Kordasiewicz and Gomez 2007). Small increases in the number of glutamines are also associated with a reduced transactivation ability of the androgen receptor (Mhatre et al. 1993; Chamberlain et al. 1994).
Long polyalanine repeat tracts generated by imperfect trinucleotide repeats share many of the properties of polyglutamine repeats. They have been implicated in a variety of congenital abnormalities, including cleidocranial dysplasia, holoprosencephaly, oculopharyngeal muscular dystrophy, and synpolydactyly 1 (for review, see Albrecht and Mundlos 2005). Thus, long polyalanine tracts can also be deleterious. In HD and SCA3, a small amount of frameshifting occurs during translation of the repeat (Davies and Rubinsztein 2006). This results in low levels of proteins with polyalanine and polyserine tracts. Their contribution to disease pathology is currently unknown. Reduced protein translation is also sometimes seen (Brockschmidt et al. 2007).
It has been estimated that as many as 1 in 20 human proteins have tandem repeat polymorphisms (O'Dushlaine et al. 2005). Of these, approximately one fifth involve tandem repeat units that are not multiples of three, and thus these polymorphisms involve frameshifts (O'Dushlaine et al. 2005). Many examples of tandem repeat-mediated frameshifts are seen in pathogenic microorganisms, where they are a frequent source of phase variation in bacteria like Neisseria and Mycoplasma (van Belkum et al. 1999). Frameshifting may be mediated at the RNA level, perhaps by affecting ribosomal scanning. This phenomenon will be discussed in more detail below in the context of RNA-mediated effects of noncoding repeats.
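The arithmetic behind repeat-mediated frameshifts at the DNA level can be sketched as follows. The function name is hypothetical; the only assumption is the standard one that codons are three nucleotides long, so a change in tract length preserves the reading frame only when the number of nucleotides gained or lost is a multiple of three.

```python
def shifts_reading_frame(unit_length: int, repeat_change: int) -> bool:
    """Return True if gaining (positive) or losing (negative)
    `repeat_change` copies of a repeat unit of `unit_length`
    nucleotides moves downstream codons out of frame."""
    if unit_length <= 0:
        raise ValueError("unit_length must be a positive number of nucleotides")
    # Python's % returns a non-negative result for a positive modulus,
    # so contractions (negative changes) are handled the same way.
    return (unit_length * repeat_change) % 3 != 0


print(shifts_reading_frame(3, 7))   # triplet unit: any change keeps the frame
print(shifts_reading_frame(4, 1))   # tetranucleotide unit: one extra copy shifts the frame
print(shifts_reading_frame(4, 3))   # three extra copies add 12 nt and restore the frame
```

This is why variation in the copy number of such repeats can switch a downstream ORF on or off, as in the phase variation described above.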
Noncoding expansions
Many of the repeat expansion diseases involve a repeat that is in a noncoding region of the gene. In these disorders, self-evidently, the relationship between repeat expansion and disease pathology is not simply a matter of a change in the properties of the protein product of the affected gene. These diseases include progressive myoclonus epilepsy type 1 (EPM1); neurodegenerative disorders such as Friedreich ataxia (FRDA), spinocerebellar ataxia type 8 (SCA8) and 10 (SCA10), and fragile X-associated tremor and ataxia syndrome (FXTAS); the mental retardation (MR) syndromes, fragile X syndrome (FXS) and FRAXE and FRA12A MR, as well as myotonic dystrophy type 1 (DM1) and 2 (DM2).
EPM1 is caused by a dodecamer repeat in the promoter of the cystatin B gene. It is characterized by severe stimulus-sensitive myoclonus and generalized tonic-clonic seizures (Lalioti et al. 1997; Virtaneva et al. 1997). FRDA is a relentlessly progressive ataxia with associated hypertrophic cardiomyopathy resulting from the presence of >66 GAA•TTC repeats in intron 1 of the frataxin (FXN) gene (Pandolfo 2002). SCA8 and SCA10 are characterized by cerebellar dysfunction and seizures, with variable expression of polyneuropathy, as well as cognitive and neuropsychiatric impairment. The CAG•CTG repeat responsible for SCA8 lies in the region of overlap of the 3′ end of the ORF of the ATXN8 gene and ATXN8OS, a gene that produces a noncoding transcript from the opposite strand (Moseley et al. 2006). SCA10 results from expansion of the repeat ATTCT•AGAAT in intron 9 of the ATXN10 gene (Matsuura et al. 2000).
FXTAS (Hagerman and Hagerman 2004) and premature ovarian insufficiency (Murray 2000; Sherman 2000) result from the presence of 55–200 CGG•CCG repeats in the 5′ UTR of the FMR1 gene. A quite different disorder, FXS, occurs when the repeat number in the FMR1 gene exceeds 200 (Hagerman 2006). Expansion in the region of overlap between the divergently transcribed genes AFF2 (formerly FMR2) and FMR3 results in a mild form of intellectual disability known as FRAXE MR (Gecz 2000a, b). FRA12A MR results from the presence of a large CGG•CCG-repeat tract in the 5′ UTR of the DIP2B gene (Winnepenninckx et al. 2007).
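The repeat-number ranges given for the FMR1 locus can be summarized in a short sketch. The function name and the category labels are hypothetical shorthand; the boundaries (55–200 repeats, and more than 200 repeats) are the ones stated in the text, and the sketch is not a diagnostic tool.

```python
def fmr1_repeat_range(cgg_repeats: int) -> str:
    """Map an FMR1 CGG•CCG repeat number to the range discussed in the text."""
    if cgg_repeats < 0:
        raise ValueError("repeat number cannot be negative")
    if cgg_repeats > 200:
        # Repeat numbers exceeding 200 are associated with FXS.
        return "above_200"
    if cgg_repeats >= 55:
        # 55-200 repeats are associated with FXTAS and ovarian insufficiency.
        return "55_to_200"
    # Below the disease-associated ranges described here.
    return "below_55"


for n in (30, 55, 200, 201, 1000):
    print(n, fmr1_repeat_range(n))
```

An allele with exactly 200 repeats falls in the middle category because the text describes FXS as occurring when the repeat number exceeds 200.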
DM1, which results from a large CTG•CAG repeat in the 3′ UTR of the DMPK gene, is a dominantly inherited multisystemic disorder (for recent reviews, see Ranum and Cooper 2006; Wheeler and Thornton 2007). A juvenile-onset form and an adult-onset form are characterized by skeletal muscle myotonia, together with progressive muscle weakness and wasting, cardiac conduction defects, cataracts, insulin insensitivity, and testicular atrophy. Extremely large repeat numbers are associated with a congenital form of DM1 (CDM), which presents initially with delayed motor development, mental retardation, and hypotonia, with symptoms typical of adult-onset DM1 developing later. DM2 is a disorder with symptoms very similar to DM1 but that results from a large CTGG•CCAG-repeat tract in an intron of the unrelated CCHC-type zinc finger, nucleic acid binding protein (CNBP) gene (formerly known as ZNF9) (for recent reviews, see Ranum and Cooper 2006; Wheeler and Thornton 2007).
The noncoding repeats responsible for these diseases may affect cell and chromatin structure, as well as transcription, splicing, and translation in a variety of different ways.
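Repeats are written here as strand pairs such as CGG•CCG or ATTCT•AGAAT, in which the second member is the reverse complement of the first. A minimal sketch of a consistency check for this notation follows; the function names are hypothetical, and accepting any cyclic rotation of the reverse complement is a design choice of the sketch, made because the same tract can be named in more than one register.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_consistent_pair(notation: str) -> bool:
    """Check that the two units in 'X•Y' notation describe complementary strands."""
    top, bottom = (part.strip().upper() for part in notation.split("•"))
    rc = reverse_complement(top)
    # Every rotation of `rc` appears as a substring of `rc + rc`.
    return len(bottom) == len(rc) and bottom in rc + rc


for pair in ("CGG•CCG", "CAG•CTG", "GAA•TTC", "ATTCT•AGAAT", "CTGG•CCAG"):
    print(pair, is_consistent_pair(pair))
```

A check of this kind flags pairs whose two members differ in length or are not complementary.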
Tandem repeats as origins of replication
The ATTCT•AGAAT repeats responsible for SCA10 are prone to unwinding in vitro and readily form unpaired regions in supercoiled plasmids (Potaman et al. 2003). This behavior is consistent with a role in the initiation of replication. Indeed, SCA10 alleles show elevated origin activity compared with unaffected alleles (Liu et al. 2007). CGG•CCG repeats have very different properties in vitro (Fry and Loeb 1994; Chen et al. 1995; Nadel et al. 1995; Usdin and Woodford 1995; Mariappan et al. 1996). Nonetheless, origins of replication have been mapped to the region containing the repeat in both the FMR1 (Brylawski et al. 2007; Gray et al. 2007) and the AFF2 genes (Chastain II et al. 2006). The ability to act as origins of replication may play a role in the tendency of these repeats to expand (Chastain II et al. 2006; Brylawski et al. 2007; Gray et al. 2007).
Paradoxically, the individual strands of the CGG•CCG and CTG•CAG repeats form folded structures like hairpins and tetraplexes that block DNA synthesis (Kang et al. 1995; Usdin and Woodford 1995; Weitzmann et al. 1996, 1998; Samadashwily et al. 1997; Ohshima et al. 1998; Usdin 1998). This may contribute to the association of repeats like these with chromosome fragility as discussed below.
Tandem repeats and chromosome fragility
Some large repeat tracts cause the formation of a cytogenetic abnormality known as a fragile site. Mammalian fragile sites are gaps or constrictions in chromosomes that are visible by microscopy when cells are grown in the presence of particular inducers. They are frequent sites of chromosome breakage and translocation (for review, see Arlt et al. 2006). Since microscopy is a relatively low-resolution technique, it may be that fragile sites occur with shorter repeat lengths but are not visible with this technique. Breakage-prone sites can be seen to occur spontaneously with even relatively short repeat tracts in yeast using genetic assays (Freudenreich et al. 1998; Balakumaran et al. 2000).
A number of inducers of mammalian fragile sites have been identified. These include folate stress, aphidicolin, bromodeoxyuridine, and distamycin. To date, all seven folate-sensitive fragile sites whose sequence is known contain long CGG•CCG-repeat tracts. This includes the disease-associated tracts seen in FXS, FRAXE MR, and FRA12A MR. The 33-bp AT-rich repeats responsible for the distamycin-inducible fragile site at FRA16B (Yu et al. 1997) and the ∼42-bp AT-rich repeats responsible for the bromodeoxyuridine-inducible site at FRA10B (Hewett et al. 1998) are not associated with any known disease. However, they do illustrate the potential for other tandem repeat sequences to cause chromosome fragility. Long CTG•CAG repeats are fragile in yeast (Freudenreich et al. 1998). However, no agent has yet been identified that induces fragility of these repeats in human cells, and whether, in fact, they are fragile in mammalian cells remains to be seen.
Agents that induce the different classes of fragile sites all interfere with DNA replication. In addition, many fragile sites are inherently difficult to replicate (Kang et al. 1995; Usdin and Woodford 1995; Samadashwily et al. 1997; Ohshima et al. 1998) or are associated with regions of late replication. This has led to the idea that fragile sites result from some sort of replication block or from incompletely replicated regions of the chromosome (Le Beau et al. 1998).
Most recent studies in human cells have been carried out on the common aphidicolin-inducible fragile sites, whose sequence basis is unknown (Rassool et al. 1996). It has been shown that reduced levels of ataxia telangiectasia-related (ATR), CHEK1, BRCA1, and SMC1 proteins lead to elevated levels of expression of these sites (Casper et al. 2002, 2004; Arlt et al. 2004; Musio et al. 2005; Durkin et al. 2006). Since these proteins are key components of the cellular response to stalled replication forks, this finding supports the idea that fragile sites block or retard the progress of the DNA replication machinery. Common fragile sites may represent double-strand breaks formed because of stalled replication forks that were not rescued via the ATR/CHEK1/BRCA1/SMC1 pathway. Neoplastic transformation may result from those cells that escape the ATR checkpoint and enter mitosis with an incompletely resolved or improperly repaired fragile site (Musio et al. 2005). While similar experiments have not yet been reported for other classes of fragile sites, there is reason to think that they arise by a similar mechanism. In this regard, it may be relevant that ATR is required for the stable maintenance of long CGG•CCG repeats in mice (Entezam and Usdin 2008).
It is unlikely that chromosome fragility per se plays any role in FXS, FRAXE MR, or FRA12A MR. However, some of the deletions common in this region may be related to this phenomenon, and at least one balanced translocation involving a breakpoint in AFF2 is known (Honda et al. 2007). In other locations, such repeat tracts may cause the loss or disruption of vital genes or the generation of an oncogene. For example, Jacobsen syndrome (JBS) (OMIM #147791), a disorder that includes trigonocephaly, cardiac anomalies, and thrombocytopenia, results from deletions involving the long arm of chromosome 11. Deletion breakpoints map to FRA11B, a folate-sensitive fragile site, as well as to other more distal CGG•CCG-repeat tracts (Jones et al. 2000). Fragile sites in primates have been shown to constitute breakpoints that have contributed to chromosome evolution (Ruiz-Herrera et al. 2005, 2006). In fact, it has been suggested that mammalian chromosomal evolution has been driven by such sites and that “the human genome can be considered a mosaic comprising regions of fragility that are prone to reorganization” (Ruiz-Herrera et al. 2005).
Tandem repeats as intrinsic promoter components
The bidirectional promoter for the AFF2 and FMR3 genes implicated in FRAXE MR is composed largely of CGG•CCG repeats, with the transcription start site for each gene being located within 30 bp of one end of the repeat tract (Gecz 2000a). Thus, the repeats present in the normal allele may play a positive role in regulating transcription from this promoter. Antisense transcripts are produced from the FMR1 gene (Ladd et al. 2007). The major transcription start sites on normal alleles are within 350 bp of the CGG-repeat tract. Thus, it is possible that the repeats are an integral part of this promoter as well. Since normal FMR1 and AFF2/FMR3 alleles have 5–54 repeats and 3–42 repeats, respectively, even relatively short CGG•CCG-repeat tracts may be able to contribute significantly to promoter activity. A variety of CGG•CCG-binding proteins have been described that have the potential to affect transcription, including CGGBP (Deissler et al. 1997), the multifunctional Pur proteins (Gallia et al. 2000), and the Egr family of transcription factors (Christy and Nathans 1989).
In EPM1, repeat numbers in the disease range lead to decreased cystatin B mRNA production. As few as 12 repeats can cause a significant reduction in promoter activity in a reporter assay (Alakurtti et al. 2000). Since a random sequence of the same length has the same effect, the consequences of repeat expansion in EPM1 have been attributed to a nonspecific effect of the repeats on the relative position of important regulatory elements (Lalioti et al. 1999). A similar mechanism is thought to explain the effect of the polymorphic GAA•TTC repeat in the Mycoplasma gallisepticum pMGA promoter on the phase variation of the adhesin proteins (Glew et al. 2000). It is also possible to envision a scenario in which the repeats in the promoter may bind an activator or repressor, thereby positively or negatively affecting gene activity.
Tandem repeats as transcription enhancers
Increased numbers of CGG•CCG repeats are associated with increased FMR1 transcription up to ∼200 repeats (Tassone et al. 2000, 2007). The antisense transcript shows a similar repeat-related increase (Ladd et al. 2007). A more open chromatin architecture may be responsible, perhaps related to the tendency of CGG•CCG repeats to exclude nucleosomes (Wang et al. 1996). Alternatively, the repeat may bind more CGG•CCG-binding proteins that facilitate transcription initiation or elongation. The increased transcription is associated with the increased usage of upstream transcription start sites for both the sense and antisense transcripts (Beilina et al. 2004; Ladd et al. 2007). In the case of the antisense transcript, this upstream start site is located ∼10 kb away. This demonstrates that the effect of these repeats on transcription initiation can be exerted over a large distance. A positive effect of the repeats on transcription has also been seen in transient transfection experiments in which transcription is driven by the cytomegalovirus immediate-early promoter (Chen et al. 2003). Thus, the effect of the repeats is not promoter specific.
Tandem repeats and gene silencing
In contrast to the elevated levels of transcription seen when the CGG•CCG-repeat number in the FMR1 gene is 55–200, the gene becomes silenced when the repeat number exceeds 200 (Pieretti et al. 1991). This silencing is responsible for FXS. Repeat-mediated gene silencing is also responsible for both FRAXE MR (Gu et al. 1996) and FRA12A MR (Winnepenninckx et al. 2007). In addition to being hypermethylated, the 5′ end of the FMR1 gene is associated with marks of transcriptionally silent chromatin. H3 histones dimethylated at lysine 9 and hypomethylated at lysine 4 accumulate on expanded alleles (Coffee et al. 1999, 2002; Tabolacci et al. 2005). H4K16 is also hypoacetylated (Biacsi et al. 2008). Similar epigenetic changes are also seen in CDM and FRDA (Otten and Tapscott 1995; Thornton et al. 1997; Filippova et al. 2001; Cho et al. 2005; Herman et al. 2006; Greene et al. 2007).
Of the three repeats associated with heterochromatin formation, only CGG•CCG repeats can be methylated. Therefore, DNA methylation cannot be the trigger for all of these chromatin changes as initially envisioned for FXS (Smith et al. 1994; Chen et al. 1995). RNAs ∼21 nucleotides in length that are homologous to the region containing the repeats in the DMPK locus are seen (Cho et al. 2005). These small RNAs are characteristic of the products of Dicer (DICER1) digestion. DICER1 is a key component of the RNA interference (RNAi) pathway. RNAi leads to post-transcriptional gene silencing by targeting homologous mRNAs for degradation or by inhibiting their translation (for review, see Peters and Meister 2007). In some organisms, the RNAi machinery can also be involved in the formation of transcriptionally silent chromatin (Mette et al. 2000).
The presence of these small RNAs would be consistent with a role of an RNAi-like mechanism in heterochromatin formation at the DMPK locus, although a role for DICER1 has yet to be demonstrated. One potential source of dsRNA for this process could be the antisense transcript that is initiated downstream of the repeat (Cho et al. 2005). Interaction of the sense and antisense transcript would then generate a dsRNA substrate for DICER1. A similar sense-antisense pair is possible for the FMR1 gene (Ladd et al. 2007), and the FRDA repeat is upstream of a short interspersed nuclear element that would be transcribed in the antisense direction (Greene et al. 2007). However, another source of dsRNA is present in both CDM and diseases involving long CGG•CCG repeats: Both CGG-RNA and CUG-RNA form stable hairpins involving a mixture of Watson-Crick (WC) and non-WC base pairs (Napierala and Krzyzosiak 1997; Handa et al. 2003; Zumwalt et al. 2007). Despite the presence of the non-WC base pairs, these RNAs are also DICER1 substrates (Handa et al. 2003; Krol et al. 2007). Thus, it may be that the repeats directly contribute to gene silencing. In this regard, it is interesting to remember that every large CGG•CCG repeat that has been described to date is associated with DNA methylation. While antisense transcripts are relatively common, the methylation of all of these repeats may indicate that it is the repeat itself that is the trigger for gene silencing.
There is another human genetic disorder that lends support to the generality of tandem repeat-mediated epigenetic changes in humans. Facioscapulohumeral muscular dystrophy (FSHD1) (OMIM #158900) is caused by contraction of the subtelomeric D4Z4 repeat array on chromosome 4. This array consists of 1–150 large G + C-rich repeats (Wijmenga et al. 1993). Contraction of the array to <11 repeats results in hypomethylation of the region in affected individuals (van Overveld et al. 2003). This relative demethylation is thought to lead to the activation of an otherwise silent gene, whose identity is, as yet, unknown, that has negative effects on muscular development.
A repeat-mediated effect on chromatin structure in these noncoding disorders suggests an unexpected parallel with disorders resulting from expansions in the coding sequence. While the mechanisms involved are distinctly different, the ability of both polyQ and these long noncoding repeats to alter normal chromatin suggests that both groups of diseases are chromatinopathies.
Tandem repeats as blocks to transcription elongation
The GAA•TTC repeat that causes FRDA forms triplexes as well as related structures known as “sticky DNA” (Bidichandani et al. 1998; Sakamoto et al. 1999; Grabczyk and Usdin 2000a, b; Vetcher et al. 2002; Potaman et al. 2004). Experiments in vitro and in bacteria show that during transcription the repeats trap RNA polymerase on the template, thus blocking transcription elongation. The pattern of transcription termination and the effects of blocking oligonucleotides are consistent with the formation of some sort of three-stranded structure. This effect on transcription elongation could contribute to the transcription deficit responsible for FRDA. In principle, a similar phenomenon could occur at other polypurine:polypyrimidine sequences in the genome, since they too have the potential to form triplexes.
Transcribed repeats as protein traps
Since the symptoms of DM1 and DM2 are similar despite the different genes affected, the mechanism of disease pathology is probably not related specifically to DMPK or CNBP (Ranum and Cooper 2006; Wheeler and Thornton 2007). The expanded repeats in both diseases generate numerous nuclear foci or inclusions in affected cells. These foci include, in addition to the repeat-containing transcript, members of the muscleblind (MBNL) family of splicing proteins (Mankodi et al. 2005). MBNL proteins interact antagonistically with members of the CUGBP1 and CUGBP2-like (CELF) family of splicing factors. The interaction of these different proteins is important for developmentally appropriate RNA splicing. MBNL proteins favor the adult splice isoforms and CUGBP1 favors the retention of fetal exons (Ho et al. 2004). Consistent with the idea that DM pathology results from sequestering of MBNL proteins by the DM RNA, DM1 and DM2 cells have aberrant splicing of cardiac troponin T, chloride channel 1, and insulin receptor transcripts. Furthermore, Mbnl1 nullizygous mice develop symptoms similar to those of DM and have aberrant splicing of the same transcripts (Kanadia et al. 2003). Similar effects are also seen in transgenic mice expressing large numbers of CUG repeats in the transcriptional unit of an unrelated gene (Mankodi et al. 2000), lending support to the idea that it is the CUG-RNA that is responsible for this effect.
FMR1 mRNA-containing intranuclear neuronal inclusions are seen in the brains of individuals with FXTAS (Greco et al. 2002) and knock-in mice (Willemsen et al. 2003; Entezam et al. 2007). An RNA-mediated mechanism of pathology is supported by the observations that RNA with large numbers of CGG repeats is toxic to human cells (Arocena et al. 2005; Handa et al. 2005) and causes neurodegeneration in flies (Jin et al. 2003). Some of the antisense transcripts seen in carriers of alleles with 55–200 repeats also contain the repeat region (Ladd et al. 2007), and CCG repeats also cause neurodegeneration in flies (Sofola et al. 2007a). The antisense transcript may thus also contribute to FXTAS pathology. The FXTAS inclusions contain many proteins (Iwahashi et al. 2006). This makes it difficult to identify a single responsible protein, if there is one. Lamin A/C is one of the proteins found in these inclusions, and cells from FXTAS brains show aberrant lamin localization and abnormal nuclear morphology (Arocena et al. 2005). This has led to the suggestion that FXTAS is a laminopathy. On the other hand, overexpression of Pur-α, HNRNPA2B1, or CUGBP1 suppresses the phenotype of the CGG transgenic fly (Jin et al. 2007; Sofola et al. 2007b). These proteins all interact directly or indirectly with the CGG-RNA, supporting the idea that neurodegeneration may result from the sequestering of such binding proteins. The presence of MBNL1 in the inclusions is interesting given the role of this protein in DM1 and DM2. MBNL1 binds to CCG-RNA but not CGG-RNA (Kino et al. 2004) lending support to the idea that the repeats in the antisense transcript may be significant in FXTAS pathology. This transcript also contains an ORF that encodes polyproline, but whether this peptide is made and whether protein-mediated effects contribute to the pathology remains to be seen.
In addition to a polyQ tract in the ATXN8 gene of individuals with SCA8 (Ikeda et al. 2008), repeat expansion at this locus also results in a long CUG tract in the antisense transcript produced from the ATXN8OS gene. Long CUG repeats cause neurodegeneration in a fly model (Mutsuddi et al. 2004). The symptoms of SCA8 may thus result from the combined effect of protein and RNA-mediated pathology. Interestingly, overexpression of even normal numbers of CUG repeats in this model causes pathology (Mutsuddi et al. 2004), thus demonstrating that even very short repeats may be deleterious if their transcript is abundant.
Tandem repeats as translational regulators
Paradoxical effects on translation have been demonstrated for the CGG•CCG repeats in the 5′ UTR of the FMR1 gene. Short repeat tracts are associated with increased levels of expression of a reporter construct (Chen et al. 2003). This effect may be related to enhanced recruitment of CGG-RNA binding proteins that facilitate translation. In contrast, large numbers of repeats produce a transcript that is translated poorly (Feng et al. 1995). This results in decreased FMR1 protein in humans (Kenneson et al. 2001) and knock-in mice with similar numbers of repeats (Entezam et al. 2007). This decrease may be responsible for some of the FXS-like symptoms seen in carriers of alleles with 55–200 repeats. This negative effect on translation is thought to be due to stalling of the 40S ribosomal subunit in the vicinity of the repeat (Feng et al. 1995). Secondary structures, particularly when they occur in the 5′ UTR, are associated with problems with ribosome scanning. It is thus reasonable to think that the translation deficit is related to the ability of CGG-RNA to form stable RNA hairpins (Handa et al. 2003; Zumwalt et al. 2007). A similar negative effect of CUG repeats on translation has also been reported (Raca et al. 2000). This would be consistent with the ability of CUG-RNA to form stable hairpins (Napierala and Krzyzosiak 1997). CAG-RNA hairpins are also stable (Krol et al. 2007) and may be responsible for the frameshifting and decreased translation seen in repeat expansion diseases where the repeat is in the coding sequence (Davies and Rubinsztein 2006; Brockschmidt et al. 2007). Other repeats involving palindromes or quasipalindromic sequences that can form stable RNA hairpins would be expected to behave in a similar way.
As mentioned previously, some of the disease-associated hairpins are substrates for the ribonuclease DICER1. DICER1 knockdown leads to a 40% increase in the amount of the CUG repeat-containing transcript produced from the DM1 locus (Krol et al. 2007). Even relatively short repeats can have this effect. For example, the levels of transcripts from an HD allele with 44 CAG repeats and from a SCA1 allele with 53 CAG repeats are twice as high in cells in which DICER1 is depleted as in cells with normal levels of DICER1 (Krol et al. 2007). The effect, if any, on disease pathology in these disorders is unclear. However, it does illustrate how even relatively short, transcribed tandem repeats can affect mRNA levels. Furthermore, this effect can be exerted in trans (Krol et al. 2007), raising the possibility that the repeats may also reduce the transcript levels of unrelated genes. Once again, other repeats with the potential to form RNA hairpins may have similar effects.
Concluding remarks
The repeat expansion diseases illustrate the myriad ways that tandem repeats can affect gene structure and function: Exonic repeats can have relatively large effects on protein synthesis and function even with small changes in repeat number, while the very much larger part of the genome that is composed of noncoding sequences creates a large window of opportunity for repeat-mediated effects that do not involve changes in protein sequence.
While genome sequencing efforts are likely to underestimate the prevalence of long repeats because of difficulties with stable cloning of these sequences, relatively long repeat sequences have been identified in several sequenced organisms. For example, 402 GAA•TTC-repeat tracts with >99 repeats have been identified in the genome of the malaria vector, Anopheles gambiae, and >44 such repeat tracts have been found on the human X chromosome alone (Subirana and Messeguer 2008). Thus, it is reasonable to project that other examples of strong repeat-mediated effects like those discussed here may emerge with time.
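Counts of long repeat tracts such as those quoted above come from scanning assembled sequence for uninterrupted runs of a repeat unit. The following is a minimal sketch of such a scan, not the method used in the cited study. The function names and the toy sequence are hypothetical, and the scan detects only perfect, uninterrupted tracts of the unit or its reverse complement (so GAA and TTC are both reported).

```python
import re

# Minimal sketch: find uninterrupted tandem repeat tracts of a given unit
# (and of its reverse complement) in a DNA sequence.
# Assumption: only perfect repeats are counted; interrupted tracts are split.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_repeat_tracts(seq: str, unit: str, min_repeats: int):
    """Return sorted (start, repeat_count, unit_found) tuples.

    start is the 0-based position of the tract in seq, and unit_found is the
    form of the unit (forward or reverse complement) seen on the given strand.
    """
    seq = seq.upper()
    unit = unit.upper()
    tracts = []
    for u in sorted({unit, reverse_complement(unit)}):
        # Greedy quantifier so each match covers the whole uninterrupted tract.
        pattern = re.compile(r"(?:%s){%d,}" % (re.escape(u), min_repeats))
        for match in pattern.finditer(seq):
            tracts.append((match.start(), len(match.group()) // len(u), u))
    return sorted(tracts)


if __name__ == "__main__":
    toy = "ACGT" + "GAA" * 5 + "CCC" + "TTC" * 3 + "GG"  # made-up sequence
    print(find_repeat_tracts(toy, "GAA", 3))
```

For a genome-scale scan, the same function could be applied chromosome by chromosome with `min_repeats` set to 100 to match the >99-repeat threshold mentioned above. Note that tracts lost during cloning would still be missed, as the text cautions.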
There are many examples of polymorphic tandem repeat tracts with smaller repeat numbers that are overrepresented in patient populations, suggesting a link with disease pathology. These include tracts with short repeat units, sometimes known as variable number tandem repeats (VNTRs), and those with much larger repeat units, sometimes known as copy number variants (CNVs). Examples of promoter VNTRs include the VNTR in the insulin gene, where alleles with fewer repeats are associated with increased insulin gene transcription (Lucassen et al. 1995). Smaller alleles have been proposed to contribute to susceptibility to diabetes, an elevated body mass index, and metabolic syndrome (O'Dell et al. 1999). An increase in the number of (TAAAA) repeats in the promoter of the sex hormone-binding globulin gene (SHBG) has been linked to cardiac artery disease and polycystic ovary syndrome (Hogeveen et al. 2001; Ferk et al. 2007; Alevizaki et al. 2008). CNVs have also been implicated in a number of disorders including rheumatoid arthritis, bipolar disorder, and early-onset Parkinson disease (for review, see Estivill and Armengol 2007).
While more data are needed to firmly establish the role of these VNTRs and CNVs in disease pathology, there is good reason to think that some could be important modifiers of disease severity or contribute to disease symptoms in polygenic disorders.
Acknowledgments
This work was made possible by funding from the Intramural program of NIDDK (NIH).
Footnotes
1 Corresponding author. E-mail ku{at}helix.nih.gov; fax (301) 402-0053.
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.070409.107.
Copyright © 2008, Cold Spring Harbor Laboratory Press