Reading TE leaves: New approaches to the identification of transposable element insertions
- 1Department of Biochemistry and Molecular Biology, Mississippi State University, Mississippi State, Mississippi 39762, USA;
- 2Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana 70803, USA
Abstract
Transposable elements (TEs) are a tremendous source of genome instability and genetic variation. Of particular interest to investigators of human biology and human evolution are retrotransposon insertions that are recent and/or polymorphic in the human population. As a consequence, the ability to assay large numbers of polymorphic TEs in a given genome is valuable. Five recent manuscripts each propose methods to scan whole human genomes to identify, map, and, in some cases, genotype polymorphic retrotransposon insertions in multiple human genomes simultaneously. These technologies promise to revolutionize our ability to analyze human genomes for TE-based variation important to studies of human variability and human disease. Furthermore, the approaches hold promise for researchers interested in nonhuman genomic variability. Herein, we explore the methods reported in the manuscripts and discuss their applications to aspects of human biology and the biology of other organisms.
Transposable elements (TEs), comprising two major classes (retrotransposons and DNA transposons), are ubiquitous components of eukaryotic genomes that are often thought of as genomic parasites. They are also powerful agents of evolutionary change. For example, they impact gene expression via the introduction of alternative regulatory elements, exons, and splice junctions (Jurka 1995; Speek 2001; Nigumann et al. 2002; Kazazian 2004; Peaston et al. 2004; Matlik et al. 2006; Babushok et al. 2007; Hasler et al. 2007). However, TEs need not be actively mobilizing to have an effect on genome structure. TE-mediated genome rearrangements through nonhomologous recombination are well-documented (Batzer and Deininger 2002; Lonnig and Saedler 2002; Eichler and Sankoff 2003; Hancks and Kazazian 2010) and deletions, duplications, inversions, translocations, and chromosome breaks have all been linked to the presence of TEs in a variety of genomes (Weil and Wessler 1993; Lim and Simmons 1994; Mathiopoulos et al. 1998; Caceres et al. 1999; Gray 2000; Zhang and Peterson 2004).
The obvious evolutionary question that arises is, “Why are TEs tolerated if they cause so many problems?” Of course, they may simply be too adaptable to be completely eliminated. However, along with recombination, independent assortment, and sex, TE-mediated mutation plays a major role in generating genetic diversity. As potent mutagens, TEs create genetic changes upon which natural selection can act. Their prevalence in eukaryotic genomes may indicate that TEs are, on balance, selectively advantageous and several studies have suggested important roles in genome biology (Vidal et al. 1993; Hamdi et al. 2000; Deininger and Roy-Engel 2002; Nouaud et al. 2003; Lowe et al. 2007; Mikkelsen et al. 2007).
For example, one of the most exciting contributions of TEs to a genome is as a source of raw material in the evolution of new genes and regulatory pathways, a process known as exaptation or molecular domestication (for examples, see Kapitonov and Jurka 2005; Cordaux et al. 2006b; Feschotte 2008; Lu and Clark 2010; Volff 2010). TEs are recognized as important players in the diversification of taxa by way of their involvement in gene regulation. This point was emphasized with the publication of the Monodelphis domestica (opossum) genome (Mikkelsen et al. 2007) and by numerous other authors (Medstrand et al. 2005; Thornburg et al. 2006; Lowe et al. 2007; Feschotte 2008; Faulkner et al. 2009). In the Monodelphis research, the investigators noted that much of the evolutionary innovation distinguishing metatherian from eutherian mammals was not due to differentiation in coding sequences but was instead due to differences in noncoding DNA and that TEs are a “major creative force” in mammalian evolution. Furthermore, one recent publication provided strong arguments suggesting that increases in transposable element activity in response to physiological stress may provide the foundation for the punctuated equilibrium model of evolutionary change (Zeh et al. 2009).
As genetic markers, TEs provide certain advantages over other more widely used systems and have proven to be nearly ideal markers for phylogenetic and population genetic analyses (Murata et al. 1993, 1998; Stoneking et al. 1997; Tatout et al. 1999; Nikaido et al. 2001; Kawai et al. 2002; Xiao et al. 2002; Terai et al. 2003, 2004; Nishihara et al. 2005, 2006; Schmitz et al. 2005; Xing et al. 2005, 2007; Witherspoon et al. 2006; among many others). This is especially true of the retrotransposons, particularly the SINEs (Short INterspersed Elements). First, the presence of an element in multiple individuals at a given locus represents identity by descent in almost all cases because of the very large number of potential insertion sites for any element (Batzer and Deininger 2002; Okada et al. 2004; Ray et al. 2006). Polymorphic TE insertions therefore reflect relationships more accurately than many other genetic markers (e.g., single nucleotide polymorphisms [SNPs], microsatellites, and restriction fragment length polymorphisms [RFLPs]). In other words, SINEs have been demonstrated to be essentially homoplasy-free (Shedlock et al. 2004; Salem et al. 2005a; Schmitz et al. 2005; Ray et al. 2006). A second advantage is that the ancestral state of a SINE insertion locus is known to be the absence of the element (Perna et al. 1992; Batzer et al. 1994), making assumptions about this aspect of the analysis unnecessary.
Retrotransposons are of particular interest to human biology. They comprise a substantial proportion (∼42%) of the mass of our genome and include the only human TE families known to exhibit current mobilization activity (Fig. 1). All three recently active non-LTR retrotransposons in the human genome, LINE-1 (Long INterspersed Element 1, L1), Alu, and SVA, have insertions that are human specific and many that are recent enough to still be polymorphic in the human population (Kazazian et al. 1988; Batzer and Deininger 1991, 2002; Batzer et al. 1991; Brouha et al. 2003; Ostertag et al. 2003; Wang et al. 2005). These insertions have tremendous potential to be informative for human biology at a number of levels. Unfortunately, assaying genomes for lineage-specific TE insertions, especially those that are polymorphic among individuals, can be a time-consuming and expensive proposition.
Figure 1. Recently active human retrotransposons (Long Terminal Repeat [LTR] and non-LTR groups) and their approximate representation in the human genome (in parentheses). Although they all share a polyA tail, the non-LTR retrotransposons are structurally distinct. The autonomous LINE-1 element (L1) contains two open reading frames while Alu and SVA do not. Alu is instead composed of two monomers linked by an A-rich linker sequence (A5TACA6). SVA is a composite element made up of a hexamer repeat of varying copy number, an Alu-like region, a region of variable numbers of tandem repeats, and an HERV-K derived region known as SINE-R. All non-LTR elements are flanked by target site duplications (arrows) that are typically between 5 and 10 bp. The only recently active LTR element in the human genome (HERV-K) has a distinct structure resembling most endogenous retroviruses: full-length copies contain a central region encoding the Gag, Pol, and Env proteins flanked by identical long terminal repeats and short TSDs. HERV-K was assayed only by Huang et al. (2010), exhibited relatively low insertion rates compared to non-LTR retrotransposons, and will not be mentioned further. L1, Alu, and SVA all mobilize via a mechanism known as TPRT (Target Primed Reverse Transcription; for review, see Ostertag and Kazazian 2001). During this process, the mobilizing element is transcribed via RNA pol II (LINE-1 and SVA) or RNA pol III (Alu). In the case of LINE-1, ORFs 1 and 2 are translated on the ribosomes and the ORF1 protein will typically bind to its own transcript for transport back to the nucleus. Once in the nucleus, the ORF2 protein, which has endonuclease and reverse transcriptase activity, is responsible for creating and integrating a cDNA copy at some other location. Alu, and likely SVA elements, “hijack” the L1 enzymatic machinery, probably via docking to the ribosome, in order to facilitate their own nuclear reentry and reverse transcription (Boeke 1997; Ostertag et al. 2003).
Some authors have attempted various experimental methods to identify human-specific TE polymorphisms (Roy et al. 1999; Sheen et al. 2000; Budzin et al. 2002; Badge et al. 2003; Mamedov et al. 2005), but the approaches tended to be rather cumbersome and difficult to optimize. Limited sequencing and computational capacity were two further obstacles. As a result, most of the loci used in analyses of human population structure were discovered as part of a disparate set of projects and just happened to be informative with regard to human population differentiation. Thus, their large-scale utilization in the scientific community has been rather limited (Bamshad et al. 2003; Watkins et al. 2003).
Fortunately, over the past several years new techniques have revolutionized our ability to generate and analyze DNA sequence data. Whereas only 10 yr ago we were able to generate at best a few hundred thousand bases per day using chain termination sequencing methods, we can now generate gigabases of data in a single run of a 454 (Roche) or Illumina machine. Five recent papers report multiple methods based on second-generation sequencing techniques as well as hybridization arrays to rapidly and relatively inexpensively characterize genome-wide TE insertion patterns (Beck et al. 2010; Ewing and Kazazian 2010b; Huang et al. 2010; Iskow et al. 2010; Witherspoon et al. 2010) and identify a plethora of human TE polymorphisms. With these novel methods, the identification of markers for a vast array of applications can be specifically targeted.
The methods
Many of the methods are novel augmentations of the PCR-based techniques cited above (Roy et al. 1999; Sheen et al. 2000; Budzin et al. 2002; Badge et al. 2003; Mamedov et al. 2005). Ewing and Kazazian (2010b) took advantage of the unique sequence characteristics of the most recently active family of human L1 elements (L1Ta; Kazazian et al. 1988; Skowronski et al. 1988; Kazazian and Moran 1998; Boissinot et al. 2000; Sheen et al. 2000) to generate a library of half-sites (loci containing sequence from an insertion of interest and the neighboring flank) via multiple rounds of PCR. Libraries for 25 individuals including six family groups were then sequenced using Illumina technology to generate a huge data set of ∼12 million 36- or 76-bp single-end reads per individual, that is, ∼20% of a human genome consisting solely of sequences adjacent to recent L1 insertions. These sequence reads were mapped to the human genome reference sequence to identify the locations of the potentially polymorphic L1 insertions.
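As a rough check on the scale of such a data set, the per-individual yield can be converted into a fraction of the genome. The sketch below uses the read count and read lengths quoted above; the haploid genome size of about 3.1 Gb is an assumption introduced here for illustration, and the function name is hypothetical.

```python
# Back-of-the-envelope sequencing yield per individual.
READS_PER_INDIVIDUAL = 12_000_000  # approximate value quoted in the text
READ_LENGTHS_BP = (36, 76)         # read lengths quoted in the text
HAPLOID_GENOME_BP = 3.1e9          # assumption: approximate haploid human genome size


def yield_fraction(n_reads, read_len_bp, genome_bp=HAPLOID_GENOME_BP):
    """Return total sequenced bases as a fraction of the genome size."""
    return n_reads * read_len_bp / genome_bp


fractions = {length: yield_fraction(READS_PER_INDIVIDUAL, length)
             for length in READ_LENGTHS_BP}

for length, fraction in fractions.items():
    print(f"{length}-bp reads: about {fraction:.0%} of a haploid genome")
```

Under these assumptions the two read lengths bracket the figure of roughly 20% given above, with the shorter reads falling below it and the longer reads above it.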
Similarly, Witherspoon et al. (2010) utilized Illumina technology, but with a different method that targets the genomic sequence junctions of Alu elements. Subsequent steps enriched for Alu-containing PCR amplicons and the resulting libraries were sequenced using a paired-end protocol. Although it could reduce the total number of insertions that could potentially be assayed, using paired-end sequencing gives this method the advantage of having not only a sequence read just upstream of the insertion but also the sequence of the 5′ insertion junction itself, thereby providing a mechanism to verify that the initial round of PCR was due to proper annealing of the Alu-specific primer.
Also taking advantage of the junction between retrotransposons and the adjacent flank were Iskow et al. (2010) in their study of L1 and Alu activity. Using both Sanger sequencing and 454 (Roche) pyrosequencing technology to interrogate the junctions, they investigated insertions in 46 individuals of diverse ancestry to identify 152 novel L1 insertions. Unique to this study, however, was the inclusion of DNA from eight cell lines derived from human tumors, thereby allowing a comparison of activity in normal somatic genomes and genomes thought to be under a differential regulatory regime.
Huang et al. (2010) took a very different approach. Following genome digestion and vectorette PCR, the resulting amplicons were hybridized to a human genome tiling microarray. Analysis of the hybridization data provided information on locations of the sequences flanking L1 insertions in the genomes analyzed.
Finally, Beck et al. (2010) were the only team not to utilize PCR to select for TE insertions in their initial assays. Instead, they used Sanger sequencing to determine the ends of 40 kb fosmid inserts. These end sequences were then used to identify potential size differences in the range of a full-length L1 insertion between these inserts and the human reference sequence. Using this method to survey the genomes of six geographically diverse individuals, they were able to identify 65 insertions not present in the human genome reference or dbRIP, the Database of Retrotransposon Insertion Polymorphisms in Humans (Wang et al. 2006). Furthermore, using cell culture analyses they estimate that each genome contained between three and nine “hot” L1 elements, those with increased activity compared to a previously characterized active LINE, L1.3 (Brouha et al. 2003).
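The size-discordance logic behind the fosmid approach can be sketched in a few lines. This is a simplified illustration, not the pipeline used by Beck et al. (2010): the class and function names are hypothetical, the 6-kb length for a full-length L1 is an approximation, and the fixed tolerance is an arbitrary placeholder (a real analysis would presumably use the empirical insert-size distribution of the library).

```python
from dataclasses import dataclass

EXPECTED_INSERT_BP = 40_000  # nominal fosmid insert size, from the text
FULL_LENGTH_L1_BP = 6_000    # assumption: approximate length of a full-length L1
TOLERANCE_BP = 2_000         # hypothetical tolerance for insert-size variation


@dataclass
class FosmidEndPair:
    """Reference coordinates of the two end reads of one fosmid (hypothetical layout)."""
    name: str
    chrom: str
    left_start: int
    right_end: int


def reference_span(pair):
    """Distance between the mapped fosmid ends on the reference genome."""
    return pair.right_end - pair.left_start


def is_candidate_l1_insertion(pair, expected=EXPECTED_INSERT_BP,
                              l1_len=FULL_LENGTH_L1_BP, tol=TOLERANCE_BP):
    """Flag fosmids whose reference span is shorter than the insert by about one L1.

    If the sampled genome carries a full-length L1 that the reference lacks,
    the fosmid ends map closer together on the reference than the physical
    insert size would predict.
    """
    deficit = expected - reference_span(pair)
    return abs(deficit - l1_len) <= tol


# Made-up coordinates for illustration only.
concordant = FosmidEndPair("fosmid_A", "chr1", 1_000_000, 1_040_000)
discordant = FosmidEndPair("fosmid_B", "chr1", 2_000_000, 2_034_000)

print(is_candidate_l1_insertion(concordant))  # False
print(is_candidate_l1_insertion(discordant))  # True
```

Candidates flagged this way would still need to be sequenced across the putative insertion to confirm that the extra sequence is in fact an L1.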
Of course, each approach has its own advantages and disadvantages. For example, by utilizing Illumina sequencing technology, Ewing and Kazazian (2010b) and Witherspoon et al. (2010) were able to scan entire genomes of multiple individuals to identify polymorphisms. However, as is often the case, the cost of so many reads comes in the form of reduced read length and both studies are somewhat limited in their ability to query the human genome reference, especially in highly repetitive regions. Iskow et al. (2010) increased read lengths by utilizing Sanger and pyrosequencing but sacrificed throughput as a consequence. Additionally, for all three of these methods there are also problems with optimization of multiplex sequencing runs and PCR amplification.
The latter optimization problem was overcome by Beck et al. (2010) by eliminating the PCR and instead detecting size differences in large genome fragments. An advantage of this method is the ability to identify full-length insertions at single bp resolution. The major disadvantage, however, is the inability to recognize smaller insertions such as those produced by Alu activity and incomplete reinsertion of L1 elements (most L1 insertions are <1 kb), thereby decreasing throughput. The hybridization (TIP-chip) method of Huang et al. (2010) suffers from both PCR and hybridization optimization problems but this may be offset by the ability to build custom chips for particular genomic regions and the relatively low cost. The individual researcher who considers utilizing any of these methods must choose the appropriate path for his or her laboratory.
Human applications: variation
Many different genetic markers ranging from mitochondrial DNA polymorphisms to microsatellites to SNPs have been applied to investigations of human genetic variation and origins (for reviews, see Relethford 1998; Excoffier 2002; Cavalli-Sforza and Feldman 2003; Pakendorf and Stoneking 2005). Regardless, the ability to assay all or a substantial number of L1, Alu, or SVA insertions in a human genome represents a practical boon to fields related to human genetic variation. One application is to human population genetic and forensic analysis. Because of the homoplasy-free nature of retrotransposon insertions, a number of publications have applied variation in Alu insertion frequencies to ascertaining human demography and its extension, forensic identification of particular individuals or groups. For example, Bamshad et al. (2003), Witherspoon et al. (2006), and Watkins et al. (2003) utilized either Alu or L1 (or a combination of both) to not only explore ancient human origins and migrations but also to cluster continental human populations. Others have extended these results to forensic applications by genotyping unknown individuals and identifying their genetic ancestry with high probability, a potentially useful tool for limiting the field of suspects in a criminal investigation (Ray et al. 2005a).
While these studies have been successful, the identification of novel polymorphisms in the various human populations to provide additional resolution (e.g., intracontinental assignments) has been a difficult task (Mamedov et al. 2005; Cordaux et al. 2007) yielding only a few to a couple of dozen loci per study. However, the studies discussed herein identified numerous insertions with the potential to be useful in this area. For example, Ewing and Kazazian (2010b) identified over 300 nonreference L1 insertions while Witherspoon et al. (2010) simultaneously identified and mapped nearly 500 novel polymorphic Alu insertions in four individuals. Additionally, Beck et al. (2010) identified three L1 insertions apparently restricted to persons of African origin.
Two of the four studies focusing on L1 insertions (Ewing and Kazazian 2010b; Huang et al. 2010) suggest that the current estimates of the rate of L1 insertions in the human genome should be increased. The most recent estimate prior to this work was one new insertion for every 225 births (Xing et al. 2009). Ewing and Kazazian (2010b) and Huang et al. (2010) raised this rate by roughly 1.6- to 2-fold, to one in 140 births and one in 108 births, respectively. While Beck et al. (2010) did not directly estimate rates of L1 retrotransposition activity, they did note the potential for multiple active L1 elements in all of the genomes surveyed, suggesting the potential for substantial retrotransposition activity. Further support for this idea was provided by Iskow et al. (2010) with their finding that 19% of their population samples exhibited private L1 insertions.
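The size of the upward revision follows directly from the figures quoted above. The short calculation below simply takes the ratio of births per insertion under the earlier and newer estimates; the dictionary keys and function name are illustrative labels, not part of any published analysis.

```python
# Births per new L1 insertion, as quoted in the text.
BIRTHS_PER_INSERTION = {
    "Xing et al. 2009": 225,
    "Ewing and Kazazian 2010b": 140,
    "Huang et al. 2010": 108,
}


def fold_increase(old_births_per_insertion, new_births_per_insertion):
    """Fold change in insertion rate; the rate is the reciprocal of births per insertion."""
    return old_births_per_insertion / new_births_per_insertion


baseline = BIRTHS_PER_INSERTION["Xing et al. 2009"]
fold_changes = {
    study: fold_increase(baseline, births)
    for study, births in BIRTHS_PER_INSERTION.items()
    if study != "Xing et al. 2009"
}

for study, fold in fold_changes.items():
    print(f"{study}: {fold:.1f}-fold higher rate than the earlier estimate")
```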
Observing such high rates of L1 mobilization activity is interesting in its own right, but its importance is emphasized when one considers the two other active families of retrotransposons in our genome, Alu and SVA. Both families are considered parasites of L1 and likely rely on L1 for their mobilization (Dewannieux et al. 2003; Ostertag et al. 2003). Alu has been amazingly successful in colonizing our genome (>1 million copies; Lander et al. 2001) and Cordaux et al. (2006a) found an insertion rate for Alu of around one insertion for every 20 births. Of course, these are estimates of overall rates for the human population and do not consider differential rates or the mutational load in individuals, which may vary widely (Brouha et al. 2003; Seleme et al. 2006), or differences in transposition activity between alleles at the same source locus (Lutz et al. 2003). However, in light of the upward revision of our estimates of L1 retrotransposition, should estimates of Alu or SVA retrotransposition rates be increased correspondingly? Such a revision may not be necessary because the estimation methods of Cordaux et al. (2006a) were very different from those used in any of these studies and their estimate was therefore independently derived. Unfortunately, Witherspoon et al. (2010) made no attempt to calculate the rate of Alu retrotransposition using their data, likely because they were examining a relatively small subset of Alu elements, the Yb8 and Yb9 subfamilies. No estimates of SVA retrotransposition frequency are available. However, given its likely dependence on L1 enzymatic machinery, the rate of L1 retrotransposition must have some impact on SVA rates.
We should not overlook additional human variation impacts of TE-mediated transduction leading to the duplication of portions of the human genome and potentially to exaptation and the formation of novel genes (Fig. 2). Transduction by transposable elements generates genome diversity by exon shuffling (Moran et al. 1999; Goodier et al. 2000; Pickeral et al. 2000; Beck et al. 2010) or through gene family formation (Xing et al. 2006) and at least two of the active human retrotransposon families, LINE-1 and SVA, are known to have participated in transduction events (Holmes et al. 1994; Goodier et al. 2000; Pickeral et al. 2000; Ostertag et al. 2003; Xing et al. 2006). These events provide a means of rapid lineage-specific evolution. The ability to assay all of the polymorphic insertions that may occur between any two individuals allows us the chance to observe evolutionary change in action. Large scale TE display along with powerful computing will allow a direct means to estimate the levels of these types of events within individual genomes and species to determine the contributions that they make to the architecture of the genome. Might some intrepid researchers actually identify a case of exon shuffling or gene duplication due to retrotransposition still segregating in the human population? It is entirely possible given that Beck et al. (2010) noted numerous such transductions ranging from 18 bp to over 1 kb.
Figure 2. Schematic illustrating the mechanism of 3′ transduction by non-LTR retrotransposons and possible gene-related impacts. TE-mediated 3′ transduction occurs when the transcription machinery skips a weak or nonexistent polyadenylation signal (pA). Transcription continues until a downstream polyadenylation signal is recognized. The resulting transcript will contain a portion of the 3′ genomic flank and a secondary homopolymer tract, which will be reverse transcribed into cDNA upon reinsertion into the genome (Boeke and Pickeral 1999; Moran et al. 1999; Goodier et al. 2000). If the transduced sequence contains an exon, it may be inserted near existing exons, resulting in an exon shuffling event. Assuming RNA pol II transcription and normal post-transcriptional processing, two or more exons in the transduced sequence may be merged and reinserted, resulting in a processed pseudogene.
Finally, the recent publication of the pilot paper of the 1000 Genomes Project (http://www.1000genomes.org/; Durbin et al. 2010) and an analysis of the L1 elements in the project data by Ewing and Kazazian (2010a) provide a context for the methods described and the TE variation observed. Briefly, the project's stated aim is to provide deep characterization of human genomic variation and its connection to phenotype. Obviously, any method of sampling comes with some inherent ascertainment bias and the studies described herein are no exception. One of the great strengths of the 1000 Genomes Project is an unbiased comparison of multiple genomes that were all sequenced and assembled in an identical manner. Nevertheless, initial analysis suggests that the methods discussed by Iskow et al. (2010), Ewing and Kazazian (2010b), and Beck et al. (2010), all of which focused on L1 insertions, have managed to capture snapshots of L1-derived human variation that are very similar to that found by the 1000 Genomes Project. In all cases, nonreference L1 insertions tend to be of relatively low frequency in the human population. Thus, the newly reported methods appear able to accurately ascertain TE diversity in multiple genomes.
Human applications: biomedical
Previous studies have indicated that retrotransposon insertions from all three active families have played a role in the occurrence of human disease either directly, by insertion into or near coding sequences, or indirectly, by serving as loci for nonhomologous recombination (Ostertag and Kazazian 2001; Ostertag et al. 2003; Callinan and Batzer 2006; Cordaux and Batzer 2009). The identification of large numbers of TE insertions with differing levels of variation may provide a new set of markers to deploy in genome-wide association studies (Gibson 2010). Furthermore, the introduction of the new high-throughput ascertainment methods adds a valuable toolkit for identifying potential retrotransposon-based etiologies for de novo instances of genetic disease. For example, in their examination of L1 insertions via the TIP-chip method, Huang et al. (2010) searched specifically for L1 insertions that may be associated with X-linked disorders. While no direct link to a particular pathology was made, at least two insertions with correlations to known human X-linked disorders were indeed observed, suggesting further examination may be needed in these cases.
Somatic retrotransposition events have been identified previously. For example, researchers interested in the mechanism and impact of retrotransposition have engineered L1 elements to demonstrate retrotransposition in somatic cells (Babushok and Kazazian 2007; Garcia-Perez et al. 2007; Coufal et al. 2009; Kano et al. 2009). By including tumor-derived cell lines in their study, Iskow et al. (2010) were able to distinguish germline mutations from those made in somatic cells. Additionally, they noted a somatic mobilization in a lung tumor in their small (n = 8) sample of tumor-derived data. Pursuing this outcome, they sampled from additional tumors along with neighboring tissues. Results indicate that lung cancers, in particular, appear to be home to high levels of L1 retrotransposition activity. In all, nine L1 insertions were identified that, when assayed against normal tissues from the same individual, were found to be specific to the tumor. Further analysis suggested that hypomethylation in the tumor cell lines is at least partially responsible for the increased activity, an observation that is in agreement with numerous studies of L1 regulation (e.g., Alves et al. 1996; Jurgens et al. 1996; Yoder et al. 1997; Steinhoff and Schulz 2003; Suter et al. 2004; Coufal et al. 2009; among many others). Is this a general pattern for human tumors? Such conclusions are not possible from this study alone due to its limited sample sizes, but other research has suggested that low methylation levels in tumor tissues may allow for increased retrotransposition (for review, see Slotkin and Martienssen 2007).
Extensions to other organisms
While all of the potential discoveries within Homo sapiens represent an exciting prospect, many consider the potential applications to other taxa to be even more exciting. Just as in humans, retrotransposon insertions in other taxa have potential as powerful tools for studying population biology. Most studies of population genetics in nonhuman species are facilitated by mitochondrial DNA, microsatellites, AFLP, or RFLP. Unlike in human studies, SNPs are typically too expensive to use for non-model species and have thus far had limited utility. However, retrotransposon insertions represent a valuable new tool because of their unique combination of genetic properties and the observation that they are one of the least expensive molecular markers to assay. Essentially, all one needs to assay a population is a thermal cycler and gel electrophoresis equipment. Of course, as was the case with humans, developing that all-important library of polymorphic insertions has been a major stumbling block to the widespread use of retrotransposon insertions as population genetic markers (Ray 2007), especially given the paucity of reference genomes from non-model organisms.
However, while each of these studies utilizes the human reference genome to identify specific locations for individual insertions, Witherspoon et al. (2010) point out that with the longer sequence reads now available to users of the Illumina sequencing platform, one could develop a library of polymorphic insertions to “study the population dynamics of nearly any [TE] family in any organism.” As such, this is an opportunity not to be missed by researchers interested in the population dynamics of non-model taxa. It should be noted, however, that there may be substantial effort involved in designing and optimizing methods for other taxa. Not the least of the challenges is identifying the polymorphic TE families in a given genome, which can be a daunting prospect. Compiling an inventory of potentially useful retrotransposons is beyond the scope of this commentary. However, for interested researchers, Ohshima and Okada (2005) provided a useful list in their discussion of LINE/SINE interactions.
Similar applications also exist outside of individual species. For many of the same reasons TEs are good population genetic markers, they also make good markers for the inference of organismal phylogenies (Shedlock and Okada 2000; Okada et al. 2004; Ray et al. 2006). However, the problem of applying the published methods to the identification of insertions polymorphic among taxa could be both more and less difficult. Obviously, there are likely to be multiple polymorphisms when comparing two species that diverged multiple millions of years ago. Thus, finding random differential insertions could be a trivial task. However, because of the evolution of the TEs themselves, a problem could be observed when it comes to identifying informative insertion patterns across the species group.
Researchers familiar with Alu SINEs will be aware that distinct subfamilies of Alu exist in each primate lineage (Carter et al. 2004; Hedges et al. 2004; Otieno et al. 2004; Garber et al. 2005; Ray and Batzer 2005; Ray et al. 2005b; Salem et al. 2005b; Han et al. 2007; Liu et al. 2009; Locke et al. 2011). Each of the methods described relies on sequence characteristics unique to particular subfamilies of elements. Herein lies the problem. When sampling among taxa, should one target particular subfamilies? If so, one may find insertions in one taxon but recover essentially nothing in any other taxa. For example, imagine that a researcher decides to develop a library of polymorphic insertions for inferring the relationships among humans, chimpanzees, gorillas, and orangutans. He or she unwisely follows the protocol of Witherspoon et al. (2010) exactly and targets Alu elements from the Yb8/9 subfamilies. As a result, he or she will find a plethora of insertions in the human genome but nothing of interest from any of the other taxa because these subfamilies are essentially human specific (Carter et al. 2004; Hedges et al. 2004). The end result will be an unresolved tree because only humans will contain any of the discovered elements.
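The reason such a data set yields an unresolved tree can be made concrete with a toy presence/absence matrix. In the sketch below, all locus names and genotype calls are invented for illustration. Because the ancestral state of an insertion locus is absence, a locus can group taxa only if the element is shared by at least two, but not all, of the taxa compared.

```python
# Taxa from the example in the text.
TAXA = ("human", "chimpanzee", "gorilla", "orangutan")


def is_phylogenetically_informative(calls):
    """True if the insertion is present in at least two, but not all, taxa."""
    n_present = sum(calls[taxon] for taxon in TAXA)
    return 2 <= n_present < len(TAXA)


def count_informative(loci):
    """Count loci whose presence/absence pattern can support a grouping."""
    return sum(is_phylogenetically_informative(calls) for calls in loci.values())


# Hypothetical calls (1 = insertion present, 0 = absent).
human_specific_loci = {
    "Yb8_locus_1": {"human": 1, "chimpanzee": 0, "gorilla": 0, "orangutan": 0},
    "Yb9_locus_2": {"human": 1, "chimpanzee": 0, "gorilla": 0, "orangutan": 0},
}
older_subfamily_loci = {
    "older_locus_1": {"human": 1, "chimpanzee": 1, "gorilla": 0, "orangutan": 0},
    "older_locus_2": {"human": 1, "chimpanzee": 1, "gorilla": 1, "orangutan": 0},
}

print(count_informative(human_specific_loci))   # 0
print(count_informative(older_subfamily_loci))  # 2
```

Insertions restricted to a single taxon remain useful as population markers within that taxon; they simply carry no information about relationships among the taxa.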
It is therefore clear that targeting insertions that have been recently active in one taxon may not be the best way to proceed. Instead, one may cast a broader net and target a generalized SINE element from the group of interest. This would likely be a more productive avenue. Again, using humans as a model, we can imagine that the typical primate genome is home to approximately one million Alu insertions. Because of the initial success of Alu early in primate evolution, the majority of these insertions belong to the older subfamilies, J and S (Batzer and Deininger 2002). Thus, when comparing relatively recently diverged taxa, identifying the few hundred or thousand informative insertions will be like searching for the proverbial “needles in the haystack.” Fortunately, modern computational tools may prove to make the problem more tractable and we would encourage interested persons to pursue this as a potential methodology.
Finally, one additional benefit of discovering population-specific and/or taxon-specific insertions is the ability to develop TE-based ascertainment tests for forensic applications in wildlife conservation. A prime example is the investigation into the illegal trade of endangered species. Wildlife conservation officers often encounter only samples that are not readily identifiable as belonging to one species/population or another. A readily available library of species-specific or population-specific markers would be valuable, especially in cases where DNA is limited or degraded (Walker et al. 2003, 2004).
Conclusions
The observations reported in these manuscripts are powerful reminders of the impacts that TEs continue to have on the human genome and have provided valuable information on the way our genomes are being shaped not only in the germline but also in somatic cells, including cells destined to become cancerous. Not only have the investigators given us new perspectives on ongoing retrotransposon activity but they have each developed a new toolkit from which other researchers interested in various aspects of biology, ranging from human disease to endangered species conservation, can select.
Acknowledgments
H. Pagan, J. Smith, V. Joshi, and R. Platt contributed valuable comments to earlier versions of this manuscript. Transposable element research in the Batzer laboratory is supported by the Louisiana Board of Regents Governor's Biotechnology Initiative GBI (2002-005) and National Institutes of Health RO1 GM59290. The Ray laboratory is supported by the Life Sciences Biotechnology Institute and the Mississippi Agricultural and Forestry Experiment Station at Mississippi State University, and by the National Science Foundation (MCB-0841821, DEB-1020865).
Footnotes
3 Corresponding author. E-mail mbatzer{at}lsu.edu.
Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.110528.110.
Copyright © 2011 by Cold Spring Harbor Laboratory Press