Transposable element variants and their potential adaptive impact in urban populations of the malaria vector Anopheles coluzzii

  1. Josefa González1
  1. 1Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), 08003 Barcelona, Spain;
  2. 2Centre Interdisciplinaire de Recherches Médicales de Franceville (CIRMF), BP 769, Franceville, Gabon;
  3. 3École Doctorale Régional (EDR) en Infectiologie Tropicale d'Afrique Centrale, BP 876, Franceville, Gabon;
  4. 4Faculté de Médecine et des Sciences Pharmaceutiques, Université de Douala, BP 2701, Douala, Cameroun;
  5. 5Maladies Infectieuses et Vecteurs: Ecologie, Génétique, Evolution et Contrôle (MIVEGEC), Université Montpellier, CNRS, IRD, 64501 Montpellier, France
  • Corresponding authors: diego.ayala{at}ird.fr, josefa.gonzalez{at}csic.es
  • Abstract

    Anopheles coluzzii is one of the primary vectors of human malaria in sub-Saharan Africa. Recently, it has spread into the main cities of Central Africa threatening vector control programs. The adaptation of An. coluzzii to urban environments partly results from an increased tolerance to organic pollution and insecticides. Some of the molecular mechanisms for ecological adaptation are known, but the role of transposable elements (TEs) in the adaptive processes of this species has not been studied yet. As a first step toward assessing the role of TEs in rapid urban adaptation, we sequenced using long reads six An. coluzzii genomes from natural breeding sites in two major Central Africa cities. We de novo annotated TEs in these genomes and in an additional high-quality An. coluzzii genome, and we identified 64 new TE families. TEs were nonrandomly distributed throughout the genome with significant differences in the number of insertions of several superfamilies across the studied genomes. We identified seven putatively active families with insertions near genes with functions related to vectorial capacity, and several TEs that may provide promoter and transcription factor binding sites to insecticide resistance and immune-related genes. Overall, the analysis of multiple high-quality genomes allowed us to generate the most comprehensive TE annotation in this species to date and identify several TE insertions that could potentially impact both genome architecture and the regulation of functionally relevant genes. These results provide a basis for future studies of the impact of TEs on the biology of An. coluzzii.

    The deadly success of the malaria mosquito Anopheles coluzzii is rooted in its extraordinary ecological plasticity, inhabiting virtually every habitat in West and Central Africa where it spreads the human malaria parasite (Fontaine et al. 2015; Tene Fossog et al. 2015). Noteworthy, the larvae of An. coluzzii exploit more disturbed and anthropogenic sites than its sister species An. gambiae. An. coluzzii shows a higher tolerance to salinity and organic pollution and, as a consequence, is the predominant species in coastal and urban areas (Tene Fossog et al. 2013, 2015; Kengne et al. 2019; Longo-Pendy et al. 2021). However, this mosquito not only has a greater resilience to ion-rich aquatic environments, but it has also become resistant to DDT and pyrethroid insecticides used for vector control (Wiebe et al. 2017; Fouet et al. 2018; Vontas et al. 2018). The adaptive flexibility of An. coluzzii is also exemplified by its rapid competence to expand its range of peak biting times to avoid insecticide-treated bed nets (Perugini et al. 2020). This extraordinary adaptive capacity makes this malaria vector a threat for malaria control. Thus, elucidating the natural genetic variants underlying the ecological and physiological responses to fluctuating environments in this species is key for its control.

    At the molecular level, several genetic mechanisms have been related back to the adaptive capacity of An. coluzzii. The most prominent and historically studied are chromosomal inversions (Coluzzi et al. 2002; Ayala et al. 2017). An. coluzzii shows a large number of polymorphic inversions (Costantini et al. 2009; Simard et al. 2009), many of them associated with environmental adaptation through environmental clines and/or correlation with specific climatic variables (Coluzzi et al. 1979; Fouet et al. 2012; Ayala et al. 2017, 2019). Other types of rearrangements, such as gene duplications, have been involved in insecticide resistance. For example, the acetylcholinesterase (ACE1) gene has been duplicated, maintaining at least a sensitive and a resistance copy, to counteract the fitness cost of the resistant phenotype (Labbé et al. 2007; Assogba et al. 2015; Weetman et al. 2018). Moreover, a recent genome-wide analysis showed that genes containing copy number variants were enriched for insecticide functions (Lucas et al. 2019). However, although several of the candidate genes responsible for the adaptive capacity of An. coluzzii have been identified, including detoxification and immune-related genes, our knowledge of the genetic variants underlying differences in these genes lags behind (Tene Fossog et al. 2013; Mitri et al. 2015; Kamdem et al. 2017; King et al. 2017). In particular, very little is known about natural variation in transposable element (TE) insertions in An. coluzzii.

    TEs are key players in multiple adaptive processes across species owing to their capacity to generate a wide variety of mutations (Casacuberta and González 2013; Schrader and Schmitz 2019). TEs can disperse across the genome regulatory sequences such as promoters, enhancers, and repressive elements thus affecting nearby gene expression (Chuong et al. 2017; Ullastres et al. 2021). Additionally, they can also act as substrates for ectopic recombination leading to chromosomal rearrangements (Mathiopoulos et al. 1998; Gray 2000; Reis et al. 2018). However, TEs are often ignored when analyzing functional variants in genomes. Because they are repetitive sequences, TEs are difficult to annotate, and reads derived from TEs are often discarded in genome-wide analyses (Goerner-Potvin and Bourque 2018). Long-read sequencing techniques are needed to get a comprehensive view of TE variation in genomes, because they allow to annotate TE insertions in the genome rather than merely inferring their position (Logsdon et al. 2020; Shahid and Slotkin 2020).

    Although TE insertions have been annotated genome-wide in several anopheline species including An. coluzzii, most studies to date have characterized the TE repertoire in a single genome for each species (for review, see Vargas-Chávez and González 2020). To capture the full extent of TE natural variation and the potential consequences of TE insertions, it is necessary to evaluate multiple genomes to comprehensively assess diversity within a species (Yang et al. 2019; Bayer et al. 2020; Weissensteiner et al. 2020). This becomes especially relevant when attempting to identify recent TE insertions and their effect in the genome structure and genome function, given that they might be restricted to local populations. So far, our knowledge of An. coluzzii genome variation attributed to TE insertions is limited to a few well-characterized families that have been found to vary across genomes (Quesneville et al. 2006; Boulesteix et al. 2007; Esnault et al. 2008; Santolamazza et al. 2008b; Salgueiro et al. 2013).

    In this work, we sequenced (using long-read technologies) and assembled the genomes of An. coluzzii larvae collected in six natural breeding sites in two major cities in Central Africa: Douala (Cameroon) and Libreville (Gabon). Our work aims at generating a comprehensive TE annotation that could be used to identify insertions that can potentially impact both genome architecture and gene regulation in An. coluzzii.

    Results

    Six new whole-genome assemblies of An. coluzzii from two major cities in Central Africa

    To explore the TE diversity in An. coluzzii, we used long-read sequencing to generate whole-genome assemblies of larvae collected from six natural urban breeding sites: three from Douala (Cameroon), and three from Libreville (Gabon) (Table 1; Fig. 1A; Supplemental Table S1A). Reference-guided scaffolding was performed for the six assemblies and for the available An. coluzzii AcolN1 genome (Kingan et al. 2019). Although the seven genomes analyzed varied in sequencing coverage and read length, these differences only had an effect on contig N50 but did not affect other assembly and scaffold metrics (Supplemental Table S1B). Although the number of scaffolds varied from 5 to 107 (median = 20), the scaffolds’ N50 was similar across the seven genomes (Table 1). The BUSCO percentages of complete genes ranged from 94.2% to 96.6% except for the DLA155B sample, which had a lower completeness value (89.5%) (Table 1; Simão et al. 2015). These completeness values were similar to those from the AcolN1 genome assembly (98.9%) (Table 1; Supplemental Table S1). Thus, the analyzed genomes are overall comparable in terms of scaffold contiguity and completeness (Table 1; Supplemental Table S1).

    Figure 1.

    Transposable elements in three urban populations of An. coluzzii. (A) Geographic location of the six breeding sites analyzed, three in Douala (DLA) and three in Libreville (LBV) (in red), and of the place of origin of the Ngousso colony (in gray) that was used to generate the AcolN1 genome. (B) Number of TE families identified when using a single genome or when using all possible combinations of more than one genome. The red line shows the total number of TE families, and the blue line shows the number of newly described families. Note that on average 76% of all the TE copies where already identified when analyzing a single genome (Supplemental Fig. S1). (C) Classification of all TE families and newly described families in An. coluzzii. The three most abundant superfamilies from each order are shown.

    Table 1.

    Genome assemblies and scaffold statistics for the seven An. coluzzii genomes analyzed in this work

    Sixty-four new anopheline TE families discovered in An. coluzzii

    To identify the TE families present in each of the genomes, we used the TEdenovo pipeline from the REPET package (Flutre et al. 2011). After several rounds of manual curation, we identified between 172 and 294 TE families per genome (Table 1; Supplemental Table S2; Platt et al. 2016). We discovered that differences in sequencing and assembly metrics correlated with the number of families identified in only one of the genomes (Supplemental Table S1B). Thus, although using a single reference would have only allowed the identification of a median of 244 TE families, clustering the TE libraries from an increasing number of genomes allowed the identification of 435 well-supported TE families (Fig. 1B; Methods). Sixty-four of these families are described here for the first time (Fig. 1B; see below).

    To annotate the individual copies in each one of the genomes analyzed, besides the 435 families identified by REPET, we also added to our library 85 TE families from the TEfam database that were also present in the seven An. coluzzii genomes (Methods). Although the majority of these families were indeed identified by REPET, we initially discarded them during manual curation (Supplemental Table S3; Methods). The final total of 520 families were classified into 23 superfamilies and then further grouped into four orders (Fig. 1C). The 82 families initially discovered in all the genomes had a higher copy number (Wilcoxon test, P-value < 0.001) and were more abundant in euchromatin (two proportion Z-test P-value < 0.001) compared with the 353 families only discovered in some of the genomes.

    To further characterize the novel families, we estimated their average number of insertions in the seven An. coluzzii genomes (Fig. 2; Supplemental Fig. S2; Supplemental Table S4). Copies from all 64 new families were found in all seven An. coluzzii genomes, further suggesting that these are bona fide families. Although the majority of families contain full-length copies in at least one of the seven genomes analyzed, truncated copies were the most abundant (Fig. 2A; Supplemental Table S3). We identified a median of 72 insertions (ranging from 16 to 1445) per family and genome (Fig. 2A; Supplemental Fig. S2), with new families having lower copy numbers compared with previously described families (Wilcoxon test, P-value = 0.0214). No differences were found in the number of genomes containing new and previously described families (Wilcoxon test, P-value = 0.9381). Two of the four TRIM elements identified (Acol_LTR_Ele 4 and Acol_LTR_Ele 6) are among the most abundant new families, with more than 150 insertions (Fig. 2A). Indeed, TRIM elements have not been previously described in anopheline genomes and are still underexplored in insect genomes in general (Marsano et al. 2012; Zhou and Cahan 2012; Elsik et al. 2014). In plants, some TRIM elements have the capacity to restructure genomes by acting as target sites for retrotransposon insertions, alter host gene structure, and transduce host genes (Witte et al. 2001; Gao et al. 2016). Although we found TRIM elements in An. coluzzii genomes to be underrepresented in gene bodies (χ2 test P-value > 0.05), they were overrepresented in nested insertions (χ2 test P-value < 0.05).

    Figure 2.

    Structure, abundance, and phylogenetic distribution of novel TE families. The four newly identified TRIM families are shown; for the remaining 60 novel families see Supplemental Figure S2. (A) The structure of each new family is displayed: the light blue box represents the full extension of the TE, and the red arrows represent LTRs. All insertions for each TE family were identified and are shown as a coverage plot in which each line represents a copy in the genome. The large number of stacked horizontal lines in the extremes of the plot represent an abundance of solo LTRs. (B) Phylogenetic distribution of the TE family insertions in 15 members of the Anopheles genus, including the eight members of the An. gambiae complex, two more distantly related mosquito species (Culex quinquefasciatus and Aedes aegypti), and Drosophila melanogaster. The number of insertions with >80% identity and spanning at least 80% of the consensus in each species is shown using a black and white gradient: species with no insertions are shown in white, and species with 15 or more insertions are shown in black.

    Finally, the 64 new families were unevenly distributed among the members of the Anopheles genus (Fig. 2B; Supplemental Fig. S2; Supplemental Table S4). Ten families were exclusively found in members of the Pyretophorus series, suggesting that these elements emerged after the split of this series from the Cellia subgenus. Moreover, 13 families were also found in at least one of the other three non-anopheline species (Supplemental Fig. S2C). The distribution of these 13 families was patchy, with some of them present only in distantly related species whereas others were present in members of the Anopheles genus or in members of the Pyretophorus series, suggesting that some of them might have been acquired through horizontal transfer events (de Melo and Wallau 2020).

    TEs are nonrandomly distributed throughout the genome

    The percentage of the genome represented by TEs across the seven genomes varied between 16.94% and 20.21% (Table 2). However, differences across genomes in assembly and scaffolding statistics did not explain the variation in TE content (Supplemental Table S1B). We found a positive correlation between TE content and genome size as has been previously described in Anopheles and other species (Pearson's r = 0.90, significance = 0.007) (Supplemental Fig. S3; Sessegolo et al. 2016; de Melo and Wallau 2020). The percentage of TEs in euchromatin (11.73%–13.40%) was much lower than the percentage of TEs in heterochromatin (65.76%–74.77%; χ2 test P-value < 0.05) (Table 2; Fig. 3A), making heterochromatin a more variable compartment (Sharma et al. 2020).

    Figure 3.

    TE insertion distributions throughout the genomes. (A) Percentage of euchromatin and heterochromatin occupied by TEs in each of the seven analyzed genomes. Each order is shown in a different color. (B) Box plots of the percentage of the euchromatin of each chromosome covered by TEs. Autosomes are shown in blue, and the X Chromosome is in red. (C) Percentage of TE insertions in each genome that fall in a specific genomic region. A red line is used to display the expected percentage that should be covered by TEs taking into consideration the size of the genomic region. Each order is shown in a different color as in A. Significant differences were found across orders and superfamilies (Supplemental Table S5).

    Table 2.

    TE content in the seven An. coluzzii genomes analyzed

    Because we are mostly interested in the potential functional impact of TE insertions, for the rest of this work we focused on the TE insertions present in euchromatin. We first assessed whether the seven analyzed genomes differed in TE content at the order and superfamily levels, and we found this to be the case (χ2 test P-value = 1.07 × 10−21 and P-value = 1.69 × 10−14, respectively). The largest differences were found in the LTR order: LTRs were more abundant in the DLA112 and LBV88 genomes and less abundant in AcolN1 (Supplemental Fig. S4A). At the superfamily level, we found that the largest differences were in the Gypsy superfamily, which belongs to the LTR order. Indeed, most of the differences in TE content between the evaluated genomes appear to be in retrotransposon families (Supplemental Fig. S4B). Superfamily abundance did not clearly reflect the geographical population structure (Supplemental Fig. S4B,C).

    When comparing the TE content in autosomes versus the X Chromosome, as expected, the X Chromosome had a larger fraction of its euchromatin spanned by TEs (Fig. 3B; Xia et al. 2010). Finally, to evaluate the distribution of TE insertions regarding genes, we divided the genome in five regions: 1 kb upstream, exon, intron, 1 kb downstream, and intergenic (Ruiz et al. 2021). The number of insertions in intergenic regions was higher than expected by chance, whereas the number of insertions in exons and introns was lower (χ2 test P-value < 0.001) (Fig. 3C; Supplemental Table S5). Although the upstream regions were neither enriched nor depleted for TE insertions, the downstream regions had a smaller amount of TEs than expected by chance, consistent with downstream regions being more commonly found in a closed chromatin state (Ruiz et al. 2021).

    Overall, TEs were not randomly distributed in the genome, because they were more abundant in heterochromatic than in euchromatic regions, more abundant in the X Chromosome euchromatin than in autosomes, and more abundant in euchromatic intergenic regions than in gene bodies or gene flanking regions.

    MITE insertions are present in several inversion breakpoints

    TEs have been found in close proximity to the breakpoints of the 2La in An. gambiae and An. melas, and to the breakpoints of the 2Rb inversions in An. gambiae and An. coluzzii (Sharakhov et al. 2006; Lobo et al. 2010). We thus explored the TE content in the breakpoints of these two inversions and in three other common polymorphic inversions in An. coluzzii: 2Rc, 2Rd, and in the distal breakpoint of the 2Ru (Corbett-Detig et al. 2019). The analysis of these breakpoint regions suggested that the analyzed genomes have the standard conformation for all five inversions (Methods; Supplemental Table S6). We identified several TEs nearby the proximal and the distal breakpoints of 2La and 2Rb, in agreement with previous studies (Fig. 4; Mathiopoulos et al. 1998; Sharakhov et al. 2006; Lobo et al. 2010). For the standard 2La proximal breakpoint, Sharakhov et al. (2006) identified several DNA transposons and a SINE insertion. We also identified a cluster of MITE insertions, which are DNA transposons, and we additionally identified an Outcast (LINE) element (Fig. 4). Regarding the standard 2La distal breakpoint, we observed two MITEs similar to one of the insertions in the proximal breakpoint, in agreement with the findings by Sharakhov et al. (2006) (Fig. 4). We also observed similar TE content in the 2Rb breakpoints as the one described by Lobo et al. (2010): tandem repeats flanking the inversion in the standard and inverted forms, and TEs in the internal sequences of both breakpoints (Fig. 4). Finally, for the 2Rd inversion, we identified MITEs near both breakpoints.

    Figure 4.

    TE insertions near known inversion breakpoints. Diagram of Chromosome 2 with the analyzed inversions. For each inversion both breakpoints, proximal (closer to the centromere) and distal (farther from the centromere), plus 2.5 kb on each side are shown. When the position of a breakpoint was not identified at the single base pair level, the interval where the breakpoint is predicted to be is shown in a gray box. Genes are shown as blue boxes, and TEs are shown as red boxes. Below each TE, the family and the number of genomes where the insertion was found/the number of genomes where the breakpoint region was identified are given. Breakpoints are shared among some of the inversions.

    Six of the seven potential active families are LTR insertions

    To identify potentially active TE families, we first estimated their relative age by analyzing the TE landscapes (Smit et al. 2013–2015; Diesel et al. 2019). We observed an “L” shape landscape in all genomes which is indicative of a recent TE burst (Supplemental Fig. S5; Fonseca et al. 2019). This “L” shape landscape, dominated by retrotransposons, had previously been described for the sister species An. gambiae (Diesel et al. 2019; Petersen et al. 2019), where numerous Gypsy LTR retrotransposons (up to 75%) might currently be active (Tubío et al. 2005, 2011). Indeed, piRNAs predominantly target LTR retrotransposons in An. gambiae (George et al. 2015). We further investigated the families in the peak of the landscape, and we identified eight families with more than two identical full-length fragments and with more than half of their copies identical to the consensus (Supplemental Table S7A). Additionally, we assessed the potential ability of our candidates to actively transpose by identifying their intact open read frames (ORFs), LTRs (in the case of LTR retrotransposons), and target site duplications (TSDs) and determined that seven of these families are potentially fully capable of transposing. For the LTR families, we further evaluated the identity between LTRs of the same copy. Mean identities ranged from 97.38% to 99.44% across the six LTR families, with 17.64%–40.35% of the copies having identical LTRs (Supplemental Table S7B). These results further suggest that these families might be responsible for the recent retrotransposon burst in An. coluzzii (Supplemental Table S7A). LTR elements accounted for most of the differences in TE content across genomes (Supplemental Fig. S4).

    As a first step toward assessing the potential functional consequences of the TE insertions from these seven putatively active families, we focused on insertions that occurred in gene bodies, and 1 kb upstream or downstream from a gene. We identified 66 genes with insertions from these families, with four genes containing up to two insertions in the same gene region (Supplemental Fig. S6; Supplemental Table S8). Six of the genes have functions related to vectorial capacity: insecticide resistance, immunity, and biting ability (Table 3). We checked whether the TE insertions nearby these genes contained binding sites for transcription factors or promoter motifs (Supplemental Table S9). We focused on identifying binding sites for three transcription factors that are known to be involved in response to xenobiotics (cap'n' collar: cnc) and in immune response and development (dorsal: dl; signal transducer and activator of transcription: STAT) given the availability of matrix profiles from D. melanogaster (Osta et al. 2004; Ingham et al. 2017). We identified binding sites for dl, STAT, or both in three insertions; with the Acol_gypsy_Ele18 and the Acol_copia_Ele8 insertions having more than three binding sites for the same transcription factors, suggesting that they might be functional (Table 3; Xie et al. 2010). Additionally, the genes that contained these TE insertions also contained binding sites for the same transcription factors, which suggest that they already played a role in their regulation. We also identified a putative promoter sequence in the Acol_copia_Ele24 insertion found upstream of the CLIPA1 protease encoded by AGAP011794 that could also potentially lead to changes in the regulation of this gene (Supplemental Table S9C).

    Table 3.

    Seven TE insertions from putatively active families are located near genes related with vectorial capacity

    TE insertions could influence the regulation of genes involved in insecticide resistance

    The usage of pyrethroids, carbamates, and DDT as vector control mechanisms has led to the rapid dispersion of insecticide resistance alleles in natural populations (Dabiré et al. 2014; Silva et al. 2014; Cheung et al. 2018; Elanga-Ndille et al. 2019; Fadel et al. 2019). Among the best characterized resistance point mutations are L1014F (kdr-west), L1014S (kdr-east), N1575Y in the voltage gated sodium channel para (also known as VGSC), and G119S in the acetylcholinesterase ACE1 gene (Santolamazza et al. 2008a; Jones et al. 2012; Essandoh et al. 2013). We first investigated whether the seven genomes analyzed in this work contained these resistance alleles. We found the kdr-west mutation in the six genomes from Douala and Libreville but not in the AcolN1 genome (Kingan et al. 2019), whereas previous estimates in Douala reported a 68.2% frequency of the kdr resistant allele (Antonio-Nkondjio et al. 2011). None of the other mutations were identified in our data set; however, a previously undescribed nonsynonymous substitution (L1688M) in the fourth domain of para was identified in the aforementioned six genomes. Whether this replacement also increases insecticide resistance is yet to be assessed.

    TEs have been hypothesized to play a relevant role specifically in response to insecticides (Wilson 1993; ffrench-Constant et al. 2006; Rostant et al. 2012), and a few individual insertions affecting insecticide tolerance in anopheline mosquitoes have already been described (Weedall et al. 2020). Thus, we searched for TE insertions in the neighborhood of insecticide-related genes that could potentially lead to differences in their regulation. We focused on eight well-known insecticide resistance genes: ACE1, CYP6P3, GSTD1-6, para, Rdl, CYP6M2, CYP6Z1, and GSTE2. We also considered differentially expressed genes in response to insecticides in An. gambiae (Supplemental Fig. S7; Supplemental Table S10; Tene Fossog et al. 2013; Main et al. 2018; Adolfi et al. 2019; Bamou et al. 2019). We found that 21 of the 43 genes analyzed contained at least one TE insertion, which is similar to the number of genes containing insertions genome-wide (χ2 test P-value = 0.6285). Overall, insertions were enriched in intronic regions (χ2 test P-value < 0.05), although 10 were located in the 1-kb gene upstream region, one in the 3′ UTR, and five in the first intron, and thus are more likely to have a functional impact. Moreover, the majority (44/59) of TE insertions located nearby insecticide-related genes were absent or present at low frequencies (<10%) in two rural populations from the Ag1000G project (Anopheles gambiae 1000 Genomes Consortium 2020), suggesting that they might have increased in frequency in urban populations (Supplemental Table S11).

    To further explore if TEs could influence the regulation of insecticide resistance genes, we focused on polymorphic (present in two or more genomes) and fixed (present in all seven genomes analyzed) insertions located in gene bodies or 1 kb upstream or downstream from the gene. We searched for cnc binding sites; for those insertions located in gene upstream regions, we also looked for promoter motifs (Supplemental Table S9). We identified 15 insertions in 10 genes containing either cnc binding sites or promoter sequences (Fig. 5). Three insertions contained cnc bindings sites: one located in CYP4C28 and two in para, although the genes did not contain binding sites for this transcription factor (Fig. 5). The other 12 insertions contained promoter motifs. In some cases, such as the Acol_m2bp_Ele10 MITE insertion in ABCA4 or the tSINE insertion in GSTMS2, although the same TE insertion was found in six and seven genomes, respectively, the promoter motifs were found only in four and one genome, respectively (Fig. 5; Supplemental Table S9). We analyzed the consensus sequence of these two families and found that although the Acol_m2bp_Ele10 had the promoter motif, the tSINE did not, suggesting that some of the Acol_m2bp_Ele10 elements lost the promoter motifs, whereas the tSINE copies acquired them.

    Figure 5.

    TE insertions in the neighborhood of genes involved in insecticide resistance. Gene structures are shown in black with arrows representing the exons. TE insertions are depicted as red boxes. When containing a TFBS for cnc or a promoter they are filled in red, otherwise they are empty. The red color is darker on fixed TEs and lighter on polymorphic TEs. Promoters are shown as arrows, and cnc binding sites are shown in blue. Resistance mutations are shown for para (kdr).

    Immune-related genes could also be potentially affected by TEs

    Mosquitoes breeding in urban and polluted aquatic environments overexpress immune-related genes, suggesting that immune response is relevant for urban adaptation (Cassone et al. 2014). To assess the potential role of TEs in immune response, we searched for TE insertions in genes putatively involved in immunity according to ImmunoDB (Supplemental Table S12; Waterhouse et al. 2007). We identified TE insertions in 148 of the 281 genes analyzed, similar to the number of genes containing insertions genome-wide (χ2 test P-value = 0.7788). Eleven of the 148 genes containing TE insertions are differentially expressed in response to a Plasmodium invasion (Supplemental Table S12). These 11 genes participate in several pathways of the immune response including the small regulatory RNA pathway, pathogen recognition, the nitric oxide response, and ookinete melanization (Osta et al. 2004; Volz et al. 2006; Oliveira et al. 2012; Dennison et al. 2015). The majority of insertions were located in the 1-kb gene flanking regions (χ2 test P-value < 0.05), with 56 of them located in the first intron and 112 in the 5′ upstream gene region. More than half (262/438) of the TE insertions located nearby immune-related genes in urban populations were absent or present at low frequencies (<10%) in two rural populations (Supplemental Table S11).

    Finally, we further explored polymorphic and fixed insertions to identify binding sites for dl and STAT and promoter motifs. We found that 19 TEs contained binding sites for dl, 21 TEs contained binding sites for STAT, and 12 TEs contained binding sites both for dl and STAT (Supplemental Table S12). Additionally, we identified 81 insertions, in the upstream region of 56 genes, which carried putative promoter sequences. Five of the TEs located nearby Plasmodium responsive genes added TFBS and promoter sequences, thus suggesting that these TE insertions can potentially influence the response to this pathogen (Supplemental Table S12; Ruiz et al. 2019).

    Discussion

    In this study, we de novo annotated transposable element (TE) insertions in seven genomes of An. coluzzii, six of them sequenced here. A comprehensive genome-wide TE annotation was possible because we used long-read technologies to perform the genome sequencing and assembly. Long reads allow identifying TE insertions with high confidence given that the entire TE insertion sequence can be spanned by a single read (Logsdon et al. 2020; Shahid and Slotkin 2020). Although the genome-wide TE repertoire has been studied in other anopheline species, particularly in An. gambiae, to our knowledge there are no other studies that have explored TE variation in multiple genomes from a single species (Holt et al. 2002; Fernández-Medina et al. 2012; Marinotti et al. 2013; Neafsey et al. 2015; Diesel et al. 2019; de Melo and Wallau 2020). As reported in other species, we observed that increasing the number of available genomes analyzed allowed us to increase the number of identified TE families from a median of 244 (172–294) to 435 (Fig. 1B; Hufford et al. 2021). Moreover, having the full sequences of seven genomes also allowed us to discover 64 new TE families, including four TRIM families previously undescribed in anopheline genomes. This might be relevant because TRIM elements have been shown to be important players in genome evolution in other species (Witte et al. 2001; Gao et al. 2016). The wide range of families identified across genomes was not directly related to the quality of the genome assembly taking into consideration the more generally used quality parameters such as read length, number of contigs, and contig N50 (Ou et al. 2020). This suggests that there are possibly other characteristics of each genome that affected the identification of high-quality TE families, such as biases in the location of the TE insertions, given that TE families are challenging to identify in regions with low complexity or with numerous nested TEs. Nonetheless, the identification of TE families is dependent on the methodology used to perform TE annotations; therefore, other annotation strategies could lead to the discovery of still undescribed families (Vargas-Chávez and González 2020).

    The availability of several genome assemblies also allowed us to determine that the majority of the intraspecies differences in the TE content were in heterochromatic regions. Although we cannot discard that these differences are at least partly explained by differences in the quality of the genome assemblies, it is known that the heterochromatin compartment is highly variable even among members of the same species (Jagannathan et al. 2017; Sharma et al. 2020). Additionally, there were also significant differences in the TE content in euchromatic regions, as has been previously observed in several organisms including Drosophila (Kofler et al. 2015; Rech et al. 2019), mammals (Rishishwar et al. 2015; Diehl et al. 2020), maize (Haberer et al. 2020), and Arabidopsis (Quadrana et al. 2016). TE insertions were not randomly distributed throughout the genome and instead were consistently enriched in intergenic regions, most likely owing to purifying selection, as suggested in the wild grass Brachypodium distachyon (Stritt et al. 2020). We also analyzed the TE content in the breakpoints of five common polymorphic inversions, three of them analyzed here for the first time. We found TE insertions in all but one of the inversion breakpoints, with MITE elements being the most common TE family (Fig. 4).

    As a first step toward identifying the potential role of TEs in rapid adaptation to urban habitats (Johnson and Munshi-South 2017), we focused on insertions from recently active families located near genes that are relevant for the vectorial capacity of An. coluzzii (Table 3). Because adaptation can also happen from standing variation, in the case of insecticide resistance genes, which have been shown to be shaped by TE insertions in several organisms, and immune-related genes, we analyzed all insertions independent of age (Mateo et al. 2014; Salces-Ortiz et al. 2020; Weedall et al. 2020). Although the role of nonsynonymous substitutions and copy number variation in resistance to insecticides commonly used in urban environments has been studied, the potential role of TEs has not yet been comprehensively assessed in An. coluzzii or any other anopheline species (Kamgang et al. 2018; Bamou et al. 2019; Lucas et al. 2019; Grau-Bové et al. 2020; Anopheles gambiae 1000 Genomes Consortium 2020). We identified several insertions that were polymorphic or fixed nearby functionally relevant genes (Table 3; Fig. 5; Supplemental Table S12). Some of the identified candidate insertions contained binding sites for transcription factors related to the function of the nearby genes, and promoter regions. Besides adding regulatory regions, TEs can also affect the regulation of nearby genes by affecting gene splicing and generating long noncoding RNAs among many other molecular mechanisms (Sundaram et al. 2014; Chuong et al. 2017; Jiang and Upton 2019; Villanueva-Cañas et al. 2019; Sundaram and Wysocka 2020). Thus, it is possible that the candidate TE insertions identified, which lack binding sites and promoters, could be affecting nearby genes through other molecular mechanisms. Our results are a first approximation to the potential role of TEs in An. coluzzii adaptation to the challenging environment that urban ecosystems entail. Establishing a direct link between the TEs and the traits involved in urban adaptation will require sampling a larger number of individuals and characterizing the phenotypes associated with the insertions. A better understanding of the biology of An. coluzzii and its ability to rapidly adapt to urban environments should further facilitate the development of novel strategies to combat malaria. Better management strategies can be implemented if we understand and are able to predict changes in the frequency of genetic variants relevant for the vectorial capacity of this species.

    Methods

    Sample collection and DNA isolation

    We sampled An. coluzzii larvae in two cities of Central Africa: Libreville (Gabon) in January 2016 and Douala (Cameroon) in April 2018 (Supplemental Table S1). We collected immature third and fourth stage larvae of Anopheles from water bodies using the standard dipping method (Service 1993). All the samples were PCR tested to differentiate An. coluzzii larvae from An. gambiae larvae before library preparation.

    For Pacific Biosciences (PacBio) sequencing, DNA from a single An. coluzzii larva from the LBV11 site was extracted using the MagAttract HMW DNA extraction kit (Qiagen) following the manufacturer's instructions. For Nanopore sequencing, DNA from six larvae from each of the five breeding sites was extracted either with the QiaAMP UCP DNA kit (Qiagen) or MagAttract HMW DNA extraction kit (Qiagen). For Illumina sequencing, DNA from an additional larva from each of the six different breeding sites was extracted following the same extraction protocol as for Nanopore sequencing (for further details, see Supplemental Methods).

    Library preparation and sequencing

    Quality control of the DNA sample for PacBio sequencing (Qubit, NanoDrop, and Fragment analyzer) was performed at the Center for Genomic Research facility of the University of Liverpool before library preparation. The library was prepared by shearing DNA to obtain fragments of approximately 30 kb and sequenced on two SMRT cells using Sequel SMRT cell, 3.0 chemistry. Nanopore libraries were constructed using the Native Barcoding Expansion 1–12 (PCR-free) and the Ligation Sequencing Kit following the manufacturer's instructions. A minimum of 400 ng of DNA from each larva was used to start with the library workflow. For each breeding site, six larvae were barcoded, and equal amounts of each barcoded sample were pooled before sequencing. The samples from the same breeding site were run in a single R9.4 flow cell in a 48-h run, except for sample DLA112, which was run in two flow cells. The DNA concentration was assessed during the whole procedure to ensure enough DNA was available for sequencing.

    The quality control of the samples, library preparation, and Illumina sequencing was performed at the Center for Genomic Research facility of the University of Liverpool. Low input libraries were prepared with the NEBNext Ultra II FS DNA library kit (300-bp inserts) on the Mosquito platform, using a 1/10 reduced volume protocol. Paired-end sequencing was performed on the Illumina NovaSeq platform using S2 chemistry (2 × 150 bp).

    Genome assemblies

    The PacBio sequenced genome was assembled using Canu version 1.8 (Koren et al. 2017). The Nanopore genomes were assembled using Canu version 1.8 followed by a round of polishing using Racon version 1.3.3 (Vaser et al. 2017), followed by nanopolish version 0.11.1 (Loman et al. 2015) and Pilon version 1.23-0 (Walker et al. 2014) with the fix parameter set on “bases.” Although we cannot discard that using Illumina data from an additional individual could introduce novel variants, BUSCO values increased after the polishing step. Allelic variants were identified and removed using Purge Haplotigs version 1.0.4 (Roach et al. 2018) with the “-l 15 -m 100 -h 195” parameters. Finally, BlobTools version 1.1.1 (Laetsch and Blaxter 2017) was used to remove contamination from all six genome assemblies taking into consideration fragment sizes, their taxonomic assignation, and the coverage using the Illumina reads (for further details, see Supplemental Methods).

    As a proxy of the completeness, the BUSCO values for the six newly assembled genomes plus the AcolN1 genome were obtained using BUSCO version 3.0.2 (Simão et al. 2015) with the diptera_odb9 set as reference. Finally, the contigs for all seven assemblies were scaffolded with RaGOO v1.1 (Alonge et al. 2019) using the chromosome level An. gambiae AgamP4 assembly with default parameters.

    Gene annotation transfer

    The GFF for the genome annotation for AgamP4 was transferred into the newly assembled genomes using Liftoff (Shumate and Salzberg 2021) with default parameters. The annotation was manually inspected using UGENE version 35 (Okonechnikov et al. 2012) and whenever needed the annotation was accordingly corrected. Ninety-six percent of the AgamP4 genes were correctly transferred.

    Construction of the curated TE library and de novo TE annotation

    We ran the TEdenovo pipeline (Flutre et al. 2011) independently on each of the seven genomes with default parameters. The obtained consensus in each genome were manually curated (for further details, see Supplemental Methods). To ensure that we identified as much of the TE diversity as possible, we also annotated our genomes with the mosquito libraries present in the TEfam database (tefam.biochem.vt.edu). The consensus from the TEfam library identified in our genomes were added to the REPET library, and all the consensuses were clustered using CD-HIT version 4.8.1 (Fu et al. 2012) with the -c and -s parameters set to 0.8. The sequences belonging to the same cluster were used to perform a multiple sequence alignment and the consensuses were obtained.

    The consensuses were classified using PASTEC (Hoede et al. 2014) with default parameters. Next, their bidirectional best hits were calculated using BLAST (Camacho et al. 2009) against the TEfam (tefam.biochem.vt.edu), AnoTExcel (Fernández-Medina et al. 2011), and Repbase (Bao et al. 2015) databases (for further details, see Supplemental Methods). These classified consensuses were used to reannotate the assembled genomes with the TEannot pipeline using default parameters, and we discarded copies whose length overlapped >80% with satellite annotations (Quesneville et al. 2005).

    Transfer of TE annotations to the AcolN1 reference genome

    We transferred the euchromatic TE annotations from the six genomes we sequenced to the AcolN1 genome. Briefly, we built a GFF file composed by the coordinates of two 500-bp-long “anchors” adjacent to each TE. We transferred these features considering each pair of anchors as exons from a single gene using the Liftoff tool (Shumate and Salzberg 2021). We conserved only transfers in which both anchors were transferred to the AcolN1 genome. Next, following the same strategy we transferred these regions from the AcolN1 genome to the other six genomes. This step allowed the identification of TEs that were present in these genomes but that had not been initially annotated by REPET. When both anchors were separated by less than 10 bp, we considered the TE to be absent; when the anchors were found more than 10 bp away, the TE was considered to be present; finally, when any of the anchors was not transferred the TE was not transferred either. Overall, we transferred 53,893 TEs. A manual inspection of 98 of these insertions (686 TE calls) lead to the annotation of six insertions that were initially missed and the removal of 16 TEs that were incorrectly annotated. Thus, both the false positive (2.3%) and the false negative rate (0.87%) of our annotations were low (for further details, see Supplemental Methods).

    Identification of newly described families in other species

    To determine the presence of the newly described TE families in other species, RepeatMasker version open-4.0.9 (Smit et al. 2013–2015) was run on 18 dipteran genomes with default parameters and using the 64 newly described families as the library (for the list of the species, see Supplemental Methods).

    Identification of heterochromatin

    The coordinates for the pericentric heterochromatin, compact intercalary heterochromatin, and diffuse intercalary heterochromatin in An. gambiae AgamP3 were obtained from a previous work (Sharakhova et al. 2010). The An. gambiae AgamP3 genome assembly was mapped against the seven An. coluzzii genome assemblies using progressiveMauve (Darling et al. 2010), and the corresponding coordinates on each of the assemblies were retrieved (Supplemental Table S1C). To confirm that the heterochromatin coordinates were accurately transferred to each genome, we plotted the TE abundance throughout the whole genome and, as expected, we observed a sharp decrease in TE density near the heterochromatin–euchromatin boundaries (Supplemental Figure S8).

    Transfer of known inversion breakpoints

    The coordinates for the breakpoints of inversions 2La, 2Rb, 2Rc, and 2Rd, and the distal 2Ru breakpoint were obtained from Corbett-Detig et al. (2019). Fifty-kilobase regions flanking each side of the insertion were obtained and mapped using minimap2 (Li 2018) against the scaffolded genome assemblies to transfer the breakpoints. To validate the breakpoints, we analyzed long reads spanning the breakpoints using the Integrative Genomics Viewer (IGV) version 2.4.19 (Robinson et al. 2011).

    Detection of putatively active TE families

    To identify potentially active TE families, we identified families with more than two identical full-length copies in at least six of the seven annotated genomes. We determined the fraction of identical copies of these families by identifying all their insertions in the genome and calculating the sequence identity of all their bases against the consensus by performing a nucleotide BLAST. Given that long reads have a higher rate of sequencing errors that could affect the age estimation (although we used Illumina reads for polishing), we also used dnaPipeTE (Goubert et al. 2015) to estimate the relative age of the TE families using the raw Illumina reads for the six genomes that we sequenced. We compared the TE landscape obtained using dnaPipeTE with that obtained using the BLAST procedure, using a Kolmogorov–Smirnov test corrected for multiple testing using the Benjamini–Hochberg procedure (Supplemental Table S13). Because we observed few significant differences, we continued using the landscape data obtained using the BLAST procedure. We identified the families where the majority of the bases of their insertions were on the peak of identical sequences in the TE landscape (>50% of the bases with >99% base identity) in more than five of the seven genomes we analyzed. Finally, we assessed the ability to actively transpose strong candidates by identifying their intact ORFs, LTRs (in the case of LTR retrotransposons), and target site duplication (TSD), and estimated the percentage identity between the two LTR of each TE copy.

    Classification of TEs by their genomic location

    To determine the location of TEs we used the findOverlaps function from the GenomicAlignments R package (Lawrence et al. 2013) using default parameters. Both the TE and the gene annotation were converted to GenomicRanges objects ignoring strand information in the case of TEs.

    Insecticide resistance and immune-related genes

    A list with a total of 43 relevant insecticide resistance genes was generated taking several works into consideration (Supplemental Table S10; Tene Fossog et al. 2013; Main et al. 2018; Adolfi et al. 2019; Bamou et al. 2019). To determine the position of the L to M nonsynonymous substitution that we observed in AGAP004707 (para) we used the position from the CAM12801.1 reference sequence.

    The full list of 414 immune-related genes from An. gambiae was downloaded from ImmunoDB (Waterhouse et al. 2007). We focused on the 281 most reliable genes filtering by the STATUS field and conserving only those with A or B scores (A refers to genes confirmed with high confidence and expert-refined cDNA supplied, and B refers to genes confirmed with high confidence, no refinement required).

    TFBS and promoter identification

    The matrices for dl (MA0022.1), cnc::maf-S (MA0530.1), and Stat92E (MA0532.1) were downloaded from JASPAR (https://jaspar.genereg.net/) (Sandelin et al. 2004). The sequences for the TEs of interest were obtained using getSeq from the Biostrings R package. The TFBS in the sequences were identified using the web version of FIMO (Grant et al. 2011) from the MEME SUITE (Bailey et al. 2009) with default parameters. The ElemeNT online tool was used to identify promoter motifs (Sloutskin et al. 2015).

    Insertion frequency estimation in rural populations

    We compared the frequencies of the insertions where the presence/absence status was unambiguously determined in at least four of the seven genomes analyzed with the frequencies of these insertions in two rural populations from the Ag1000G project: Bana in Burkina Faso and Tiassalé in Ivory Coast (Anopheles gambiae 1000 Genomes Consortium 2020). We used the PoPoolationTE2 v-1.10.03 pipeline (Kofler et al. 2016) to compute the TE insertion frequencies in these two rural populations. Briefly, we used the AcolN1 reference genome and the newly generated TE library for An. coluzzii as reference and mapped the Illumina paired-end data from all samples. Next PoPoolationTE2 detected signatures of TE presence/absence and estimated their frequencies in every sample (for additional information, see Supplemental Material).

    Data access

    The sequencing data generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA676011. TE and gene annotations in each of the seven genomes analyzed are provided as GFF files available at https://digital.csic.es/handle/10261/224416 and as Supplemental Files 1–14, respectively. The TE library and the transferred annotations across the seven genomes are also available at https://digital.csic.es/handle/10261/224416 and as Supplemental Files 15, 16, respectively. The TE consensus sequences have also been deposited at Dfam release 3.6 (Storer et al. 2021).

    Competing interest statement

    The authors declare no competing interests.

    Acknowledgments

    We thank members of the González laboratory for comments on the manuscript. We thank the Ecology of Vectorial Systems team at the Centre Interdisciplinaire de Recherches Médicales de Franceville (CIRMF) (Franceville, Gabon) for their support in field collections. We thank Jean Pierre Agbor and Serge Donfanck for their commitment in larvae collections in Douala (Cameroon). This study was supported by grants from the Ministerio de Economía, Industria y Competitividad, Gobierno de España (MINECO/AEI/FEDER, EU) (BFU2017-82937-P) and grant PID2020-115874GB-I00 funded by Ministerio de Ciencia e Innovación/AEI 10.13039/501100011033 awarded to J.G. D.A. was supported by an Agence Nationale de la Recherche grant (ANR-18-CE35-0002-01—WILDING). N.M.L.P. was funded by Agence universitaire de la Francophonie (AUF) and CIRMF scholarships.

    Author contributions: D.A. and J.G. conceived and designed the experiments. N.M.L.P., S.E.N., and L.A. generated data. C.V.-C., D.A., and J.G. performed the data analysis. C.V.-C. and J.G. wrote and revised the manuscript with input from all authors. All authors read and approved the final manuscript.

    Footnotes

    • Received May 12, 2021.
    • Accepted November 24, 2021.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    | Table of Contents

    Preprint Server