Transposon accumulation at xenobiotic gene family loci in aphids
Abstract
The evolution of resistance is a major challenge for the sustainable control of pests and pathogens. Thus, a deeper understanding of the evolutionary and genomic mechanisms underpinning resistance evolution is required to safeguard health and food production. Several studies have implicated transposable elements (TEs) in xenobiotic-resistance evolution in insects. However, analyses are generally restricted to one insect species and/or one or a few xenobiotic gene families (XGFs). We examine evidence for TE accumulation at XGFs by performing a comparative genomic analysis across 20 aphid genomes, considering major subsets of XGFs involved in metabolic resistance to insecticides: cytochrome P450s, glutathione S-transferases, esterases, UDP-glucuronosyltransferases, and ABC transporters. We find that TEs are significantly enriched at XGFs compared with other genes. XGFs show similar levels of TE enrichment to those of housekeeping genes. But unlike housekeeping genes, XGFs are not constitutively expressed in germline cells, supporting the selective enrichment of TEs at XGFs rather than enrichment owing to chromatin availability. Hotspots of extreme TE enrichment occur around certain XGFs. We find, in aphids of agricultural importance, particular enrichment of TEs around cytochrome P450 genes with known functions in the detoxification of synthetic insecticides. Our results provide evidence supporting a general role for TEs as a source of genomic variation at host XGFs and highlight the existence of considerable variability in TE content across XGFs and host species. These findings show the need for detailed functional verification analyses to clarify the significance of individual TE insertions and elucidate underlying mechanisms at TE–XGF hotspots.
Xenobiotics are substances that are foreign to an organism, including naturally occurring plant allelochemicals and man-made insecticides (Li et al. 2007). Xenobiotic toxicity can present a strong selective pressure, leading to the rapid evolution of resistance via efficient xenobiotic removal, metabolism, and tolerance (Chung et al. 2007; Li et al. 2007). Consequently, numerous insects have evolved resistance to both plant allelochemicals, produced for plant defense, and synthetic insecticides, developed to protect economically important crops from insect pests or to protect humans and livestock from insect vectored diseases (Sternberg and Thomas 2018; Sparks et al. 2021). The intensive use of numerous synthetic insecticides has led many insects to evolve resistance to multiple insecticide classes (Li et al. 2007; Bradshaw et al. 2016), severely affecting our ability to control target insects (Singh et al. 2021). The emergence of multiple-insecticide resistance is a major global challenge, threatening global food security in the case of agricultural pests and threatening human and animal health in the case of disease vectors (Worner and Gevrey 2006; Deutsch et al. 2018).
Two major mechanisms are frequently implicated in conferring xenobiotic resistance across a wide range of arthropods (Li et al. 2007; Kliot and Ghanim 2012): (1) “target-site resistance,” involving structural changes (mutations) in the gene encoding the insecticide target protein that make it less sensitive to the toxic effect of the insecticide, and (2) “metabolic resistance,” involving the increased production or activity of metabolic enzymes that break down or sequester the insecticide. Regarding the latter, several major gene families are associated with metabolic resistance, which act during three distinct phases of xenobiotic metabolism (Kennedy and Tierney 2013): (1) cytochrome P450 (CYP) monooxygenases (CYP genes); (2) glutathione S-transferases (GSTs), esterases, and UDP-glucuronosyltransferases (UGTs); and (3) ATP-binding cassette transporters (ABC transporters). We briefly summarize these three phases below. During phase I, CYP genes catalyze functionalization reactions, in which hydrophobic xenobiotics are typically converted to more water-soluble metabolites through the addition of functional groups (Kennedy and Tierney 2013). CYP genes are diverse enzymes involved in several purposes from biosynthesis to metabolism and are sometimes referred to as “nature's blowtorch,” owing to their high-valence iron chemistry oxidation mechanism (Guengerich 2009). CYP genes can mediate resistance to all classes of insecticides owing to their broad substrate specificity and versatility, whereas they are also involved in several other processes, such as juvenile hormone and ecdysteroid metabolism and fatty acid biosynthesis (Feyereisen 1999, 2005; Kennedy and Tierney 2013). During phase II, GSTs, UGTs, and esterases catalyze conjugation reactions between phase I substrates and endogenous molecules to form water-soluble metabolites (Kennedy and Tierney 2013). During phase III, ABC transporters, which are cell membrane transport proteins that efflux toxins and modified toxins from the cell, export xenobiotics and products of phase I and II (Gott et al. 2017). In addition, ABC transporters can also block the cellular import of xenobiotics to protect the host, which is termed phase 0 (Gott et al. 2017). Insecticide resistance can arise via mutations acting on the genes involved in the processes described above through various genetic mechanisms, including up-regulation, changes to coding sequence, and gene amplification (Li et al. 2007). It is also important to note that many xenobiotic gene family (XGF) members are involved in other metabolic processes beyond xenobiotic resistance. For example, some GSTs have roles in processes including eye pigmentation, intracellular transport, and cell signaling pathways (Ranson and Hemingway 2005; Ketterman et al. 2011), whereas ABC transporters have numerous roles in processes including heme biosynthesis, iron homeostasis, and protection against oxidative stress (Dermauw and Van Leeuwen 2014). UGTs also have roles in the modulation of endobiotics and olfactory processes (Ahn et al. 2012), and esterases have roles in neurodevelopment and pheromone signaling (Gilbert and Gill 2014).
There is growing evidence that transposable elements (TEs) can play an important role in the evolution of xenobiotic resistance in insects (Rostant et al. 2012; Gilbert et al. 2021), with examples from several major lineages: for Diptera, Drosophila (Aminetzach et al. 2005; Bogwitz et al. 2005; Feyereisen 2005; Schmidt et al. 2010; Remnant et al. 2013; Mateo et al. 2014; Salces-Ortiz et al. 2020), Culex (Darboux et al. 2007; Itokawa et al. 2011), Musca domestica (Kasai and Scott 2001; Li et al. 2007), and Anopheles funestus (Weedall et al. 2020); for Lepidoptera, Heliothis (Gahan et al. 2001; Chen and Li 2007; Yang et al. 2007; Zhao et al. 2010) and Pectinophora gossypiella (Fabrick et al. 2011; Wang et al. 2019); and for Hemiptera, Myzus persicae (Singh et al. 2020; Panini et al. 2021). TEs are DNA sequences capable of moving from one genomic location to another within the genome. TEs occur in nearly all eukaryotic genomes, and are implicated in the evolution of host genomic novelty through diverse processes, including the modification of regulatory networks, chromosomal rearrangements, exon shuffling, and donation of coding sequence (Sundaram et al. 2014; Bourque et al. 2018; Cosby et al. 2019; Wells and Feschotte 2020). In the case of xenobiotic resistance, TEs are reported to have contributed to evolvability via myriad mechanisms, including gene amplification and duplication (Remnant et al. 2013; Singh et al. 2020), knockout of a susceptible allele in heterozygous individuals (Panini et al. 2021), increases in detoxification gene expression (Bogwitz et al. 2005; Itokawa et al. 2011), allelic succession leading to increases in resistance gene copy number (Schmidt et al. 2010), and alternative splicing and production of truncated proteins that prevent interactions with xenobiotics (Gahan et al. 2001; Darboux et al. 2007; Yang et al. 2007; Zhao et al. 2010).
Studies examining the contribution of TEs to resistance evolution suggest that there can be an additive effect of successive TE insertions at focal host loci over time. For example, in the well-known case of the CYP gene CYP6G1 and DDT resistance in Drosophila, successive TE insertions are linked with an increasing ability to detoxify the insecticide (Daborn et al. 2002; Catania et al. 2004; Chung et al. 2007; Schmidt et al. 2010). A similar process involving the accumulation of TEs, in this case at CYP6CY3, is implicated in the evolution of resistance to nicotine, following a host plant shift to tobacco in the aphid M. persicae nicotianae (Puinean et al. 2010; de Little et al. 2017). More widely, there is evidence that the selective accumulation of TEs at rapidly evolving host loci under strong selection may represent a general evolutionary process. For example, studies have reported the enrichment of TEs at other gene classes, such as immune genes and those involved in responses to external stimuli (Song et al. 1998; van de Lagemaat et al. 2003). Existing studies strongly suggest that TEs are involved in individual cases of resistance to certain insecticides, but there is currently very limited understanding of how associations with TEs vary among different classes of XGFs or among whole clades of insects. However, patterns in TE accumulation remain poorly described, and it is unclear to what extent TEs may be enriched at XGFs compared with other host genes, or how much variability exists across XGFs.
Here we consider patterns of TE accumulation at XGFs in the genomes of 20 species sampled from across aphid phylogeny, for which high-quality genome assemblies are available. The aphid family is a relevant clade within which to explore the genomic bases of insecticide resistance, as it includes numerous globally important crop pests responsible for causing billions of U.S. dollars of crop losses annually (Van Emden and Harrington 2017), and members that have been intensively targeted with insecticides and evolved resistance. For example, the severely damaging global crop pest M. persicae (the green peach aphid) is resistant to at least 82 insecticides via at least eight different resistance mechanisms (Bass et al. 2014; Singh et al. 2021; Mota-Sanchez and Wise 2023). We hypothesize similar broadscale patterns for TE accumulation at XGFs among aphid genomes, given similar challenges from xenobiotics that target conserved biological pathways. Meanwhile, given the strong selection pressure that can arise from xenobiotic exposure, as well as repeated xenobiotic challenges over time, we anticipate an ability to detect patterns at the species-level.
Because the TE landscapes of aphids are poorly described, we begin by characterizing patterns of TE content across our focal genomes. After this, we examine key fundamental questions relating to the pattern of associations between TEs and XGFs in aphid genomes. First, we test whether XGFs are enriched for TEs compared with other host genes to evaluate the overall signature of TE accumulation at XGFs. Second, we examine the extent to which TE content varies among individual XGFs and among major XGF classifications to determine if contributions from TEs are more pronounced at certain types of XGF. Third, we consider if specific TE classifications are enriched at XGFs (e.g., DNA TEs, rolling circles, Penelope-like elements, LINEs, SINEs, and LTRs) to explore if particular TEs are predisposed to contribute to the evolution of host resistance. Fourth, we assess whether patterns of TE accumulation at XGFs vary among aphid species to examine if patterns vary across aphid phylogeny. Fifth, we investigate TE enrichment at XGFs that have shown associations with the detoxification of synthetic insecticides to test if these show particularly strong signatures of TE recruitment. Last, we explore whether our results suggest a selective role for patterns in TE accumulation at XGFs versus the alternative explanation of increased availability for TE insertion owing to chromatin availability.
Results
TE landscapes in aphids
TEs make considerable contributions to genome size in eukaryotes (Kidwell 2002; Gregory 2005). In aphids, we find that TE content varies from 8.21% (Melanaphis sacchari) to 35.52% (Metopolophium dirhodum) of total genome size (Fig. 1B; Supplemental Table S1). This is in accordance with levels of TE content identified among a large-scale survey of 196 insect species, in which TE content varied from 0.08% to 53.93% of total genome size (Peccoud et al. 2017). Similarly, consistent with findings in other eukaryotic lineages, we report a significant strong positive association between genome size and TE content in aphids (linear regression, F1,19 = 46.35, P < 0.01, R2= 0.69) (Supplemental Fig. S1). Aphids show variation in sexual and asexual life cycles (Van Emden and Harrington 2017). It is hypothesized that asexual lineages would suffer from Muller's ratchet as deleterious mutations accumulate (Muller 1964). Despite this, we find no significant difference in TE content between holocyclic and anholocyclic lineages (Wilcoxon rank-sum, W14,6 = 35, P > 0.05) (Fig. 1B).
Summary of TE content and activity in aphid genomes, plotted in phylogenetic order. Nodes in the aphid phylogeny with bootstrap support less than 100 are labeled. Reproductive mode (Van Emden and Harrington 2017) is labeled next to species names. Genome size is represented by the black bars (A). TE content is expressed as a percentage of total genome size for each species, with major TE classifications represented by the colors indicated in the key (B). Kimura two-parameter distance (CpG adjusted) from each TE family consensus is used as a proxy for relative TE activity (C), where a lower Kimura distance indicates more recent TE activity. Because of challenges with accurately estimating TE age, divergence from consensus better reflects relative TE activity, with lower Kimura distances signaling more recent TE activity. Activity is organized such that recent TE activity is toward the RHS of the x-axis.
TEs have a nonrandom distribution within host genomes (Bourque et al. 2018). As expected, given the predominantly deleterious effects of TE insertion into host genes (Sultana et al. 2017; Bourque et al. 2018), exonic regions show the lowest levels of TE sequence content in aphid genomes (mean TE content in exons = 1.90% of total genome size). Across 19 of the 20 aphid genomes considered here, the greatest TE coverage (expressed as proportion of total genome size) is found in 5′ and 3′ gene-flanking regions, where it varies between 3.49% (M. sacchari) and 17.94% (M. dirhodum) (Fig. 2B). However, this pattern is not universal across all aphid species. In Sipha flava, the majority of TE coverage is present in intronic regions (10.60%), whereas in Myzus cerasi, the majority occurs in intergenic space (11.28%) (Fig. 2B).
Overview of the extent to which TE families are shared among aphid genomes, as well as their insertion locations. (A) Quantification of shared and unique TE families in aphids. Main TE classifications are represented by the colors indicated in the key. (B) TE genome compartment occupancy, expressed by TEs as a percentage of total genome size. Gene-flanking regions are defined as 20 kb directly upstream of or downstream from the gene body.
The amount of TE sequence and the frequency of TE insertions in genic versus intergenic regions in aphid genomes are not significantly correlated with genome compactness, described as the ratio of genic (intron and exon) to intergenic (gene flanks and intergenic space) base pairs (Spearman's rank, S = 1680, P > 0.05, rho = −0.09) (Supplemental Fig. S2; Supplemental Table S2). S. flava has the most compact genome, with a genic:intergenic base pair ratio of 1.50:1 (250.0 Mb:166.8 Mb), and so presumably has a higher likelihood of TE insertion into genic regions compared with intergenic space, leading to the observation that the majority of TE insertions are found in intronic regions. In contrast, M. cerasi has the least compact genome, with a much lower genic:intergenic base pair ratio of 0.13:1 (76.1 Mb:559.2 Mb). In less compact genomes such as M. cerasi, TEs can accumulate in expanded intergenic regions that act as “genomic safe havens,” where insertion is less likely to exert deleterious effects (Arkhipova 2018). For context, the human genome (GCF_009914755.1) has a TE content of 44%–69% (de Koning et al. 2011; Nurk et al. 2022) and a genic:intergenic base pair ratio of 0.55:1 (1717.7 Mb:3117.3 Mb), making it several times more compact than M. cerasi but much less compact than S. flava.
Across all aphid species and in most individual genomes, the dominant TE classification is DNA TEs, which comprise between 36.6% (Schlechtendalia chinensis) and 68.9% (Aphis craccivora) of total TE content (μ = 45.27%, SD = 18.00%) (Fig. 1B; Table 1; Supplemental Table S1). SINEs are present only in the genomes of seven of the 20 species considered (Supplemental Table S1), and where present, they comprise just 0.009% (Acyrthosiphon pisum) to 0.1% (A. craccivora) of total TE content.
Mean and standard deviation of the percentage of total TE content attributed to each of the main TE classifications for the 20 aphid genomes included in this study
There is little evidence of ancient TE activity in aphid genomes, as indicated by the relatively low number of TEs showing high divergence from their respective TE consensus models in repeat landscape plots (Fig. 1C). On the other hand, there is considerable evidence of relatively recent TE activity in most aphid species, with particular evidence of very recent TE activity in M. persicae and M. dirhodum, indicated by large numbers of TEs with low genetic distance from their respective consensus sequences (Fig. 1C). The relative absence of ancient TE activity is presumably a consequence of a relatively rapid genomic turnover rate in aphids and contrasts strongly with TE landscapes in mammals, in which a low rate of genomic turnover leads to considerable accumulations of aging TEs (Blass et al. 2012). Instead, the dynamics we observe for aphids are consistent with patterns reported for other insects, such as lepidopterans (Lavoie et al. 2013; Baril and Hayward 2022).
The extent to which TE content differs among the genomes of closely related species can vary greatly. We find no significant effect of phylogeny on TE abundance (no. of TEs present in a genome) or TE diversity (no. of distinct TE families present in a genome) (Wicker et al. 2007) in aphids (phylogenetic GLMM, TE abundance: posterior mean = 0.592, 95% HPD = 0.005,0.999; TE diversity: posterior mean = 0.022, 95% HPD = 0.000,0.111). Thus, more closely related aphid species do not possess similar numbers of TEs or a similar level of TE diversity. We also find that most TE families present in aphid genomes are species specific, with almost half of the TE families present in a particular aphid genome being unique to that species (μ = 46.97%) (Fig. 2A). On average, 10.02% of the TE families identified in an aphid genome are shared among all 20 species (Fig. 2A). The remaining 43.01% of TE families are shared in similar proportions among two to 19 aphid species (i.e., μ = 2.53% per category) (Fig. 2A). The strong signature of species specificity observed for aphid TE families suggests that the independent gain of new TE families is a major factor driving aphid TE landscapes. Collectively, these findings suggest a dynamic repeat landscape in aphid genomes, characterized by relatively frequent gain and turnover of TEs.
TEs are enriched at XGFs compared with other host genes
Previous studies have implicated TEs in the evolution of xenobiotic resistance across a range of insects and resistance loci (Li et al. 2007). However, the extent to which TE association at XGFs represents a general mechanism for the evolution of new resistance phenotypes remains unclear.
Across all aphid genomes considered here, we find significant enrichment of TEs, both in TE coverage and in TE count, at XGFs in comparison to all other genes excluding XGFs (TE coverage: Wilcoxon rank-sum, W6488,401787 = 732,129,770, P < 0.01; TE count: Wilcoxon rank-sum, W6488,401787 = 770,552,556, P < 0.01) (Fig. 3; Supplemental Fig. S3). Specifically, XGFs in aphids have a mean of 2.4× TE coverage and 2.1× TE count compared with all other genes.
Differences in TE coverage and TE count at XGFs compared with all other host genes. Coverage and count of genic (exon and intron) XGF sequence plus flanking regions (20 kb directly upstream of or downstream from the gene body) are shown. Each point represents a single XGF locus. Black lines indicate the expected TE coverage difference whereby XGFs and other genes are equally enriched for TEs (i.e., zero enrichment). Colored lines indicate species mean coverage and count for each XGF locus and TE type. Major TE classifications are indicated by the colors in the key.
TE content varies among individual XGF loci
We observe an enrichment of TEs around all major XGF categories considered compared with other host genes: CYPs, 2.4× TE coverage and 2.1× TE count; GSTs, 2.7× TE coverage and 2.3× TE count; esterases, 2.0× TE coverage and 1.8× TE count; UGTs, 2.0× TE coverage and 1.7× TE count; and ABC transporters (ABCs), 2.8× TE coverage and 2.4× TE count (Fig. 3; Supplemental Fig. S3; Supplemental Table S3). Although all major XGF categories are enriched for TEs, there is no significant difference in relative enrichment among XGF types, suggesting relatively equal levels of enrichment across XGF types (Kruskal–Wallis, TE sequence: χ24 = 7.72, P = 0.10; TE count: χ24 = 4.25, P = 0.37). However, considerable variability in TE content is present within XGF categories, attributable to extremely large accumulations of TEs at certain individual XGFs (Table 2; Fig. 4). Some of these XGFs are unusually large in terms of sequence length compared with the mean size for the XGF type in question, as indicated by gene size Z-scores > 3 (Table 2), which is potentially a consequence of the increased presence of TEs in their intronic regions (Fig. 4). To investigate this, we consider gene sizes with genic TEs removed. We find that five of 18 of the most TE-enriched XGFs have significantly inflated gene sizes when TE insertions are removed, indicated by a Z-score of greater than three. This suggests that these loci were significantly larger than average before accounting for TE contributions. However, the remaining 13 highly TE-enriched XGFs have gene sizes within the expected range in the absence of TEs, suggesting that TEs are responsible for the significant increases in gene size at the majority of TE-enriched XGF loci. Individual XGFs with the greatest enrichment of TE coverage and TE count are listed in Table 2 (i.e., >35 kb TE sequence and more than 80 TE insertions compared with a non-XGF mean of 1.5 kb TE sequence and 1.48 TE insertions). Although all XGF types are represented in the TE hotspots listed in Table 2, there are markedly more hotspots at UGTs and CYP genes and fewer at ABCs, GSTs, and esterases.
Karyoplots illustrating the most TE-dense XGF loci, corresponding to the XGFs highlighted in bold in Table 2. For comparison, the gene region represented in the shaded box (gene10004) shows a non-XGF from M. persicae with representative TE coverage close to the mean for non-XGFs (1519 bp). XGFs 125_CYP380C52 and 105_CYP380C19 are not shown, as these are nested at the CYP cluster containing 109_CYP380C31. Sea green–shaded regions indicate gene-flanking regions. Gray bars represent gene bodies, and black bars show exonic regions. TEs are annotated above their respective gene tracks, with colors indicating the main TE classifications, as depicted in the key at the bottom of the figure.
XGF loci with the greatest enrichment for TE coverage and TE count in aphids
Specific TE types are enriched at XGFs
Although most TE types contribute to enrichment at XGFs, overall enrichment is primarily driven by increases in DNA TE content, owing to the high frequency of DNA TEs in aphid genomes, where they comprise almost half of total TE content (Table 3).
Mean base pair coverage of each TE classification at XGFs among aphids
The TE type with the greatest degree of enrichment in coverage around XGFs in comparison to other genes is SINE, although there is a slight nonsignificant reduction in SINE count at XGFs (TE coverage, 3.2× enrichment: XGFs μ = 1144 bp, all other genes μ = 358 bp [Wilcoxon rank-sum, W9482 = 439, P < 0.01]; TE count, 0.82× enrichment: XGFs μ = 1.00, all other genes μ = 1.22 [Wilcoxon rank-sum, W9482 = 2592, P = 0.14]). This is notable because SINEs are generally very scarce in aphid genomes and suggests a particular retention of SINE sequence at XGFs (Figs. 1B, 3; Table 1; Supplemental Table S1). However, given their scarcity, the relative contribution that SINEs make to TE enrichment at XGFs compared with other TE types is extremely low. Conversely, DNA TEs are the most abundant TE type in aphid genomes, and they are significantly enriched at XGFs both in TE coverage and in TE count (TE coverage, 2.0× enrichment: XGFs μ = 4026 bp, all other genes μ = 1974 bp [Wilcoxon rank-sum, W6156,336473 = 616,739,338, P < 0.01]; TE count, 6.4× enrichment: XGFs μ = 10.05, all other genes μ = 1.57 [Wilcoxon rank-sum, W6156,336473 = 1,267,581,920, P < 0.01]) (Fig. 3). Rolling circle TEs show significant but lower levels of enrichment (TE coverage, 1.1× enrichment: XGFs μ = 794 bp, all other genes μ = 728 bp [Wilcoxon rank-sum, W1801,55720 = 46,195,509, P < 0.01]; TE count, 1.8× enrichment: XGFs μ = 2.14, all other genes μ = 1.22 [Wilcoxon rank-sum, W1801,55720 = 56,336,944, P < 0.01]). In contrast, LINEs do not show significant levels of enrichment at XGFs in terms of TE coverage, but there is a significant enrichment of LINE TE count (TE coverage, 1.03× enrichment: XGFs μ = 992 bp, all other genes μ = 959 bp [Wilcoxon rank-sum, W2575,101686 = 128,128,748, P = 0.06]; TE count, 1.3× enrichment: XGFs μ = 1.79, all other genes μ = 1.41 [Wilcoxon rank-sum, W2575,101686 = 167,286,786, P < 0.01]). Similarly, we find only a slight, nonsignificant enrichment of LTR TE coverage at XGFs, but we do identify a significant enrichment of LTR TE count (TE coverage, 1.4× enrichment: XGFs μ = 1817 bp, all other genes μ = 1238 bp [Wilcoxon rank-sum, W1004,51486 = 25,656,118, P = 0.69]; TE count, 1.7× enrichment: XGFs μ = 2.54, all other genes μ = 1.51 [Wilcoxon rank-sum, W1004,51486 = 32,328,754, P < 0.01]). Only PLEs show a depletion, albeit insignificant, in TE coverage around XGFs, but they also show significant enrichment in count around XGFs (TE coverage, 0.7× enrichment: XGF μ = 456 bp, all other genes μ = 608 bp [Wilcoxon rank-sum, W47,6823 = 185,654, P = 0.06]; TE count, 1.3× enrichment: XGF μ = 1.43, all other genes μ = 1.11 [Wilcoxon rank-sum, W47,6823 = 174,958, P < 0.05]). However, like SINEs, PLEs account for a very small proportion of total TE content (Supplemental Table S1).
XGF enrichment varies among aphid genomes
We find no significant difference in TE enrichment among different XGF types (i.e., CYP genes, esterases, UGTs, GSTs, ABCs; Kruskal–Wallis, TE coverage: χ24 = 7.72, P = 0.10; TE count: χ24 = 4.25, P = 0.37) (Fig. 3). However, we do find significant variation in the magnitude of TE enrichment at XGFs among aphid species (linear regression, TE coverage: F19,78 = 37.49, P < 0.01, adjusted R2= 0.88, TE count: F19,78 = 28.28, P < 0.01, adjusted R2= 0.84). Specifically, in terms of TE coverage, Cinara cedri shows the highest levels of enrichment across all XGFs, with a mean enrichment of 15.9× compared with non-XGFs (Fig. 5; Supplemental Table S3). This is considerably higher than the lowest level of mean TE coverage enrichment, which was found in M. sacchari with 1.9× TE coverage enrichment at XGFs compared with non-XGFs (Fig. 5; Supplemental Table S3). In terms of TE count, the highest level of enrichment at XGFs compared with non-XGFs was also found in C. cedri, with a mean TE count enrichment of 28.5× the number of TE insertions, whereas the lowest level was found in A. craccivora, with a mean enrichment of 4.6× the number of TE insertions at XGFs compared with non-XGFs (Fig. 5; Supplemental Table S3).
TE enrichment at XGFs compared with genome-wide levels. Fold-change enrichment for TE coverage and TE count at XGFs compared with within-genome levels at non-XGFs across 20 aphid species.
Overall, although we observe enrichment in both TE coverage and TE count at XGFs compared with non-XGFs in aphid genomes, fold-change increases are larger for TE count. High enrichment for TE count combined with lower increases in TE coverage suggests that enrichment is driven by the presence of large numbers of TE fragments at XGFs as opposed to full-length TEs (Fig. 6). Indeed, mean TE length at XGFs across aphids is just 536 bp, which is much shorter than most full-length TEs, excluding SINEs (Wicker et al. 2007; Wells and Feschotte 2020). Further, 90.40% of TEs across XGFs are <1000 bp in length (Fig. 6). Therefore, it is likely that a general pattern in TE accumulation at XGFs is TE truncation with retention of specific TE domains, as previously observed for certain individual cases of TE co-option during the evolution of host insecticide resistance (Gahan et al. 2001; Bogwitz et al. 2005; Darboux et al. 2007; Yang et al. 2007; Schmidt et al. 2010; Zhao et al. 2010; Itokawa et al. 2011; Remnant et al. 2013; Singh et al. 2020; Panini et al. 2021).
TEs are enriched at XGFs associated with the detoxification of synthetic insecticides in certain aphid species
CYP genes are a highly important gene family for the detoxification of xenobiotics, including insecticides (Feyereisen 2005; Katsavou et al. 2022). Indeed, increases in the expression of certain CYP genes can provide increases in insecticide resistance (Nauen et al. 2022). Additionally, several cases of TE interactions with CYP genes have been characterized (Daborn et al. 2002; Chen and Li 2007; Chung et al. 2007; Karunker et al. 2008; Schmidt et al. 2010; Itokawa et al. 2011; Nauen et al. 2022). Therefore, we undertook a targeted analysis to examine evidence for the accumulation of TEs at CYP genes with putative functions in xenobiotic resistance (Supplemental File S6). For this, aphid CYP genes were labeled based on the availability of evidence to support their xenobiotic-resistance functions from Katsavou et al. (2022), as some CYP gene family members play other roles, such as hormone breakdown. This resulted in the assignment of two to 29 CYP genes per aphid genome to the xenobiotic detoxification category (Supplemental Table S5). In S. flava, C. cedri, and M. cerasi, we find significant enrichment of TEs surrounding CYP genes in CYP gene (sub)families frequently implicated in the breakdown of xenobiotics compared with those with other functions (Fig. 7A). However, we find the opposite pattern in Schizaphis graminum, M. dirhodum, and M. persicae, in which, overall, there is a significant depletion of TEs surrounding CYP genes associated with xenobiotic resistance (Fig. 7A).
TE association with cytochrome P450 (CYP) genes in aphids. (A) TE coverage around CYP genes linked to xenobiotic resistance versus those with other functions. Numbers on the right indicate the number of genes in each group being compared. (B) TE coverage around CYP genes linked to resistance to synthetic insecticides versus those conferring resistance to natural xenobiotics. Numbers on the right indicate the number of genes in each group under comparison.
Next, we considered patterns of TE accumulation at CYP genes specifically involved in resistance to synthetic insecticides rather than resistance to xenobiotics more generally. For this, CYP genes associated with xenobiotic-resistance functions were labeled to reflect the compounds to which they confer resistance. CYP genes that confer resistance to insecticides were then considered separately from CYP genes associated with the detoxification of other xenobiotics, such as those offering resistance to naturally occurring plant allelochemicals. We find a significant enrichment of TEs around CYP genes associated with resistance to synthetic insecticides in Pentalonia nigronervosa, A. pisum, M. cerasi, and M. persicae (Fig. 7B). With the exception of P. nigronervosa, these are all species of “most agricultural importance” (Van Emden and Harrington 2017) and are therefore likely to have experienced particularly high exposure to synthetic insecticides and extremely strong selection to evolve resistance. The significant enrichment of TEs at XGFs associated with resistance to synthetic insecticides in these species is thus consistent with their selective recruitment for resistance evolution. However, no significant enrichment of TEs around CYP genes associated with resistance to synthetic insecticides was found in seven other species of “most agricultural importance” included in our analysis: A. craccivora, A. gossypii, Diuraphis noxia, M. dirhodum, Rhopalosiphum padi, S. graminum, and Sitobion avenae (Van Emden and Harrington 2017).
Evidence for the selective enrichment of TEs at XGFs
We hypothesized that TE enrichment at resistance loci could arise through two distinct processes: (1) chromatin availability, whereby XGFs are surrounded by open chromatin owing to frequent transcriptional activity or constitutive expression and so represent locations that are more accessible for transposition, and (2) selective retention of beneficial TE insertions. Given that XGFs are essential for the detoxification of harmful plant metabolites and insecticides and so are of key importance for survival and reproduction, TE insertions that modify XGFs in beneficial ways may be selectively retained and so spread through host populations. Consequently, the combined effects of XGFs as selective hotspots (especially considering the extremely strong selection mediated by intensive insecticide regimes) and the shown evolutionary potential of TEs could result in the selective retention of TE insertions surrounding XGFs over time. Such effects may be magnified by the influence of host stress under insecticide exposure, which has been implicated in increases in TE activity (Chénais et al. 2012; Stapley et al. 2015; Horváth et al. 2017). Specifically, increased TE activity may result in a higher likelihood of TE insertions occurring at any genomic location, including XGFs. However, TE insertions at XGFs may be more likely to be selectively retained if they offer host benefits in the form of resistance mutations, owing to their capacity to contribute to host evolvability through diverse genetic mechanisms (Rebollo et al. 2012; Bourque et al. 2018).
Only germline TE activity, as opposed to somatic TE activity, is relevant from the perspective of TE accumulation at XGFs, because only TE insertions that occur in the germline have the potential to be transmitted to the next generation (novel TE insertions in the genomes of somatic cells are an evolutionary dead end). TE density at a gene, defined as the proportion of its sequence occupied by TEs, is positively correlated with germline gene expression, suggesting that TEs preferentially insert into regions that are actively transcribed in the germline (McVicker and Green 2010). Therefore, a key question to distinguish between alternative hypotheses that explain TE enrichment at XGFs is the following: To what extent are XGFs expressed in the germline? We addressed this question using RNA-seq data from a recent study on A. pisum (Jaquiéry et al. 2022) to examine gene expression in testes and ovaries for XGFs and a large panel of housekeeping genes (HKGs). HKGs are constitutively expressed in all cell types and so represent a “maximally accessible” case with which to examine TE enrichment (Supplemental Table S4; Butte et al. 2001). We find that germline HKG expression is significantly higher compared with XGFs considering mean reads per kilobase of transcript per million base pairs sequenced (RPKM) in A. pisum (Wilcoxon rank-sum, W612,112 = 52252, P < 0.01) (Fig. 8C). The observed pattern of expression of HKGs and XGFs is also present in Drosophila melanogaster, suggesting conservation across insect diversity (Supplemental Fig. S4). We find that, consistent with chromatin availability as an explanation for TE enrichment, TEs are significantly enriched at HKGs compared with all other host genes, excluding XGFs (TE coverage: Wilcoxon rank-sum, W546,475492 = 218,813,232, P < 0.01; TE count: Wilcoxon rank-sum, W546,475492 = 170,683,160, P < 0.01), to a similar level as that observed for XGFs versus all other host genes (Fig. 8A,B; Supplemental Fig. S5). However, given that germline expression of XGFs is consistently low across all XGF types (Fig. 8C), we conclude that germline expression cannot explain the observed enrichment of TEs at XGFs. Instead, our findings support the alternative explanation for the enrichment of TEs at XGFs, which is selective retention of TE sequences owing to genetic contributions toward the evolution of xenobiotic-resistance mutations. There is no mechanistic basis to suggest that TEs specifically target the regions surrounding XGFs as insertion sites compared with other genomic locations. Instead, under this model, TE insertions at XGFs are more likely to be retained and spread through the host population (compared with TE insertions at other loci), owing to the combination of their capacity to contribute to host evolvability and the extremely strong selection pressure imposed by insecticide treatment.
TE occupancy at XGFs, housekeeping genes (HKGs), and all other genes and the expression of XGFs and HKGs in A. pisum (Jaquiéry et al. 2022). (A) TE coverage around XGFs, HKGs, and all other genes. Box limits indicate upper and lower quartiles, and central lines indicate mean TE coverage. (B) TE count around XGFs, HKGs, and all other genes. Box limits indicate upper and lower quartiles, and central lines indicate mean TE count. (C) Expression of XGFs and HKGs in germline tissues (testis and ovary) in A. pisum, expressed as reads per kilobase of transcript per million base pairs sequenced (RPKM), to determine expected expression of XGFs in germline cell types. Gene types are indicated by colors in the key.
Discussion
Resistance evolution remains a major societal challenge that greatly affects efforts to control harmful pests and pathogens (Van Emden and Harrington 2017). To develop new approaches to overcome the challenge of resistance evolution, a deeper understanding of the genomic mechanisms leading to the repeated emergence of resistant phenotypes is required. Although isolated examples of TE involvement in the evolution of xenobiotic resistance exist, the extent to which TEs act as a source of genomic novelty for resistance evolution, as well as host evolution more generally, remains an open question.
We examine whether genes implicated in xenobiotic resistance are enriched for TE insertions, potentially because of the generation of novel advantageous mutations for resistance evolution. We find an enrichment of TEs, both in terms of TE sequence coverage and in number of TE insertions, at XGFs across all aphids and all XGF types. Further, we provide evidence that this enrichment is the consequence of selection at XGFs as opposed to elevated insertion rates owing to chromatin availability at XGFs in the germline.
Our results also uncover considerable variability in the enrichment of TEs around XGFs both within and among aphid species and XGF types. This shows that TEs are not globally enriched at every XGF but instead are enriched only at certain XGFs and depleted at others. Consequently, the patterns we observe are driven by hotspots of TE enrichment around certain XGF types and, in some cases, individual XGFs, which show extremely high levels of enrichment compared with other XGFs. For example, considering all aphid species examined, we identify significant enrichment of TEs around GSTs. However, this result is driven by a subset of aphids, with 11 of the 20 species showing no evidence of significant enrichment around this XGF class (Supplemental Fig. S4). This is consistent with varying levels of selection acting on particular detoxification genes and particular aphid species. Meanwhile, we find markedly more TE hotspots at UGTs and CYP genes, whereby certain XGFs are massively enriched for TE fragments. These observations show the nuances involved in patterns of TE accumulation at XGFs and underline a need for functional validation analyses to test the impact of individual TE insertions on XGF expression and to test whether highly TE-enriched loci are indeed involved in xenobiotic resistance.
Important crop pests are likely to experience much higher levels of insecticide stress than are nontarget species, shown by the increased number and diversity of insecticide-resistance mutations in target species compared with nontarget species (Mota-Sanchez and Wise 2023). Consequently, variability in TE enrichment is expected among aphid species and XGFs, depending on variation in the degree of insecticide exposure. This prediction is confirmed when considering CYP genes, for which data clarifying the relationship to specific insecticides are available (rather than just to xenobiotics in general). Specifically, we find that TE enrichment is significantly higher around CYP genes that are implicated in the detoxification of synthetic insecticides compared with those involved in the detoxification of natural xenobiotics, but only in three out of 10 aphids of “major agricultural importance” (Van Emden and Harrington 2017) included in our study (A. pisum, M. cerasi, and M. persicae). Consequently, either CYP genes are less important for survival in the other species (perhaps because of differences in pest control strategies, such as a greater reliance on forms of non-insecticide-based control, such as biocontrol strategies) or there are genomic mechanisms that limit the co-option of TEs for resistance evolution in these species.
Shown contributions of TEs to xenobiotic resistance often involve small fragments of the original TE insertions and contain motifs that act as host regulatory elements (Gahan et al. 2001; Bogwitz et al. 2005; Darboux et al. 2007; Yang et al. 2007; Schmidt et al. 2010; Itokawa et al. 2011; Remnant et al. 2013; Singh et al. 2020; Panini et al. 2021). Meanwhile, the emergence of a resistant phenotype frequently appears to result from a single TE insertion, or up to a few TE insertions, at each specific XGF locus, rather than large accumulations of TE DNA (Daborn et al. 2002; Chung et al. 2007; Schmidt et al. 2010; Panini et al. 2021). This presumably occurs because of genomic and selective processes that act to retain the core TE regions that convey a selective benefit, while purging long repetitive and potentially deleterious regions (e.g., that promote nonhomologous recombination with similar repetitive elements). Meanwhile, the impacts of these TE insertions can occur via complex multistep processes including the retention of several truncated TE fragments (such as for CYP6G1) (Schmidt et al. 2010). Thus, a key consideration is that TE-associated resistance mechanisms go beyond simple enrichment at specific loci.
We find that DNA TEs are highly enriched at XGFs despite the more widely acknowledged role of TEs with more complex cis-regulatory elements, such as LTR retrotransposons, in altering host gene expression (Chung et al. 2007). This may suggest that other processes beyond TE-mediated increases in expression are involved in the patterns of enrichment we uncover, although it is entirely possible that the DNA TEs involved also contain sequences relevant for the recruitment of transcription factors that increase expression. Meanwhile, at present, we cannot currently explain the repeated cases whereby certain XGFs are massively enriched in terms of TE coverage and TE count compared with other XGFs. Thus, unraveling the influence of TEs at these loci represents a key target for future research.
Gaining a deeper understanding of the mechanisms that drive patterns of association between TEs and XGFs will require contributions from complementary approaches to the broadscale comparative genomic analyses performed here. Of particular relevance are functional validation studies to examine the influence of individual TE insertions on nearby XGF loci, which our results provide considerable scope to address. Further analyses are also required to elucidate the underlying mechanisms by which TE sequences may contribute to resistance phenotypes, such as screening to investigate if implicated TE sequences contain transcription factor binding sites, as well as information on chromatin conformation and the availability of DNA within TE sequences at XGFs. Meanwhile, detailed intra-species analyses can provide valuable evolutionary insights into individual TE insertions of interest (e.g., Aminetzach et al. 2005) and information on TE occupancy within populations (e.g., Barrón et al. 2014). Future work considering strains displaying different resistance phenotypes using high-quality long-read data sets and individual TE annotations (e.g., Rech et al. 2022) will enable the characterization of TE variants likely contributing to resistance evolution in aphids. Collectively, such approaches will ultimately allow evaluation of the extent to which TEs are used in resistance mutations versus the contributions arising from other classes of genetic variation such as SNPs and CNVs.
Methods
Estimating host phylogeny
The genome assemblies of 20 aphid species and of seven diverse hemipteran outgroup species (Ericerus pela, Maconellicoccus hirsutus, Phenacoccus solenopsis, Bemisa tabaci, Sogatella furcifera, Diaphorina citri, Pachypsylla venusta) were obtained along with gene annotations where available (Supplemental Table S6). To estimate host phylogeny, a supermatrix approach was used. Specifically, for each genome assembly, BUSCO (version 5.2.2) (Manni et al. 2021a,b) was used with the hemiptera_odb10 gene set to identify conserved gene orthologs in each genome assembly. The identifiers for all complete genes were extracted, and those present in fewer than three genomes were removed. The amino acid sequences for each gene from all species were extracted and saved in individual FASTA files. For each gene, sequences were aligned using MAFFT (version 7.453; ‐‐auto) (Katoh and Standley 2013). The subsequent alignments were concatenated using Phykit create_concat (Steenwyk et al. 2021) to generate a supermatrix. To generate the species phylogeny, we performed 1000 ultrafast bootstrap repetitions (-bb 1000) in IQ-TREE (version 1.6.12) (Nguyen et al. 2015), using the best fit amino acid model identified by Modelfinder (Supplemental File S1; Kalyaanamoorthy et al. 2017).
TE annotation
To curate and annotate TEs, each genome assembly was annotated with the Earl Grey TE annotation pipeline (version 1.3) (Baril et al. 2022; https://github.com/TobyBaril/EarlGrey). Briefly, known TEs from Arthropoda (-r arthropoda) were first annotated using both the Dfam 3.4 (Hubley et al. 2016) and Repbase RepeatMasker edition (version 20181026) (Jurka et al. 2005; Kapitonov and Jurka 2008) TE databases. Following this, Earl Grey identified de novo TE families and refined these using an iterative “BLAST, extract, extend” process (Platt et al. 2016). Following final annotation with the combined library of known and de novo TE families, annotations were processed to remove overlapping annotations and to defragment annotations likely to originate from the same TE insertion. TE annotation GFF files are provided in Supplemental File S2.
To characterize TEs as either shared or species specific, de novo TE libraries from each genome assembly were clustered to the TE family definition of Wicker et al. (2007), implemented as described by Goubert et al. (2022) using cd-hit-est (-d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1) (Li and Godzik 2006; Fu et al. 2012). Sequences within each cluster were designated a number between zero and 19 to determine the number of other species that the TE family was shared with, with zero identifying TEs unique to a species and 19 identifying TEs shared across all 20 aphid species considered in this study. For known TE families from Dfam and Repbase, each TE was labeled using the same system based on the number of species the TE family was identified in.
Genetic distance from TE consensus sequence is used as a proxy for estimated timing of TE activity (i.e., TE age). Using genetic distance from TE consensus as a proxy for insertion time should be taken with caution, as this can be influenced by the methodology used to generate TE consensus sequences, as well as whether known TE sequences from other organisms are used as a reference or purely de novo TE consensus sequences from the species of interest. However, this metric can still be used to provide a rough relative estimate of TE activity.
Phylogenetic heritability of TE abundance and diversity
To determine the phylogenetic heritability of TE abundance, expressed as total TE count, and TE diversity, expressed as total number of distinct TE families, we measured the amount of variation explained by phylogenetic relationships using Bayesian phylogenetic mixed models with Markov chain Monte Carlo estimations, with an intercept fitted as a fixed effect and the phylogeny fitted as a random effect, in the MCMCglmm package (Hadfield 2010) in R (version 4.2.1) (R Core Team 2023) using the Rstudio IDE (Racine 2013; RStudio Team 2015). A Poisson error distribution and log link function were used, and inverse gamma priors were specified for all R and G-side random effects (V = 1, ν = 0.002). Models were run for 11,000,000 iterations with a burn-in of 1,000,000 and a thinning interval of 1000. This approach generated 10,000 posterior samples from which the posterior mode and 95% CIs were calculated. The proportion of between-species variation explained by phylogeny was calculated from the model using the equation Vp/(Vp + Vs), where Vp and Vs represent the phylogenetic and species-specific components of between-species variance (Freckleton et al. 2002), which is equivalent to phylogenetic heritability.
Gene predictions and TE association with genomic compartments
For genome assemblies lacking a gene annotation GFF file at the time of analysis (A. gossypii, Hormaphis cornu, M. dirhodum, P. nigronervosa, R. padi, S. graminum, S. chinensis, S. avenae, S. miscanthi), genes were annotated using AUGUSTUS (version 3.3.3) (Stanke et al. 2008; Keller et al. 2011) with the pea aphid reference gene set (‐‐species = pea_aphid ‐‐strand = both ‐‐genemodel = partial ‐‐introns = on ‐‐exonnames = on ‐‐gff3 = on).
To prepare genome annotation files for intersection with TE loci, each gene annotation GFF was processed to generate a GFF file containing coordinates of exons, introns, 5′ and 3′ flanking regions, and intergenic space. Gene-flanking regions are defined here as 20 kb directly upstream of or downstream from the gene body. These flanking regions are determined to identify TEs at a distance that could be described as the proximate promoter region, rather than just accounting for the core promoter region, to include both promoter and more distal enhancer regions for genes (Lenhard et al. 2012). Intergenic space coordinates were generated using the BEDTools complement (Quinlan and Hall 2010) to generate inverse coordinates to those of genic compartments and flanking regions. To quantify TE occupation in different genomic compartments (introns, exons, 5′ and 3′ flanking regions, and intergenic space), BEDTools intersect was used to calculate overlap (-wao) between all compartments and TEs.
Identification of XGF loci and HKGs
The gene identities for all ABC transporters, CYPs, UGTs, esterases, and GSTs were manually curated from the functional annotation of the M. persicae G006v2 genome (Singh et al. 2021). Using these identifiers, exon coordinates were obtained from the M. persicae gene annotation GFF3 file, and corresponding nucleotide sequences were extracted from the genome assembly. The resulting exon sequences from M. persicae were used as queries to identify XGFs in all other aphid assemblies using BLASTN (Camacho et al. 2009) with an e-value threshold of 1 × 10−10. BLAST hits for each species were manually inspected to extract those that best represented each XGF, with the inclusion of gene duplications where appropriate. Following this, a BED file of XGF coordinates for each species was generated including the annotation of flanking regions as described above (Supplemental File S3).
HKGs are required for basic cellular functions and are typically expressed in all cell types (Butte et al. 2001). Amino acid sequences of 28 HKGs characterized for aphids (Bansal et al. 2012; Yang et al. 2014, 2015; Koramutla et al. 2016; Kang et al. 2017; Li et al. 2020) were obtained from the NCBI GenBank database (https://www.ncbi.nlm.nih.gov/genbank/) (Benson et al. 2013) for M. persicae (Supplemental File S4). Subsequently, a BLASTX search was performed, with default parameters, against each aphid genome to identify HKG loci in the other 19 species. For each species, HKGs were manually curated, and a BED file of HKG coordinates, including the annotation of flanking regions, was generated (Supplemental File S5). To quantify TE occupation around XGFs and HKGs, BEDTools intersect was used to calculate overlap (-wao) between annotated TEs and XGFs.
Competing interest statement
The authors declare no competing interests.
Acknowledgments
T.B. was supported by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC)–funded South West Biosciences Doctoral Training Partnership (grant no. BB/M009122/1). A.P. was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no. 646625). C.B. was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) grant (grant no. BB/S006060/1) and the ERC under the European Union's Horizon 2020 research and innovation program (grant agreement no. 646625). A.H. was supported by a BBSRC David Phillips Fellowship (grant no. BB/N020146/1). Myzus cerasi DNA sequencing data were downloaded from AphidBase. Funding for the sequencing was provided by an ERC starting grant APHIDHOST-310190 awarded to Jorunn Bos at the James Hutton Institute, United Kingdom.
Author contributions: T.B. performed TE annotation and downstream analyses, produced the figures, and drafted the manuscript. A.P. manually curated xenobiotic-resistance genes from M. persicae. C.B. and A.H. contributed to the analytical design and data analysis and participated in writing the manuscript. A.H. conceived and coordinated the study. All authors read and approved the final manuscript.
Footnotes
-
[Supplemental material is available for this article.]
-
Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.277820.123.
-
Freely available online through the Genome Research Open Access option.
- Received February 20, 2023.
- Accepted August 29, 2023.
This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.



















