Functional characterization of enhancer activity during a long terminal repeat's evolution

Many transposable elements (TEs) contain transcription factor binding sites and are implicated as potential regulatory elements. However, TEs are rarely functionally tested for regulatory activity, which in turn limits our understanding of how TE regulatory activity has evolved. We systematically tested the human LTR18A subfamily for regulatory activity using massively parallel reporter assay (MPRA) and found AP-1- and CEBP-related binding motifs as drivers of enhancer activity. Functional analysis of evolutionarily reconstructed ancestral sequences revealed that LTR18A elements have generally lost regulatory activity over time through sequence changes, with the largest effects occurring owing to mutations in the AP-1 and CEBP motifs. We observed that the two motifs are conserved at higher rates than expected based on neutral evolution. Finally, we identified LTR18A elements as potential enhancers in the human genome, primarily in epithelial cells. Together, our results provide a model for the origin, evolution, and co-option of TE-derived regulatory elements.

(calJac3) genomes (Smit et al.). For baboon (papAnu2) which is not available on www.repeatmasker.org, we ran RepeatMasker 4.1.0 using the RepBase RepeatMasker library 20170127. Since LTR18A consensus sequences are 98% similar between the two repeat libraries, we believe that most if not all LTR18A elements will be identified in papAnu2 in the same way as the other primate genomes. For the closest two subfamilies, LTR18B and LTR18C consensus sequences are ~75% and 67% similar to the LTR18A consensus respectively.
For manual curation, we examined the alignment of each annotated LTR18A element and removed the element if it satisfied any of our filtering criteria (Supplemental Table S3). First, we exclude LTR18A elements that have significant alignments to LTR18B or LTR18C. RepeatMasker outputs alignment scores for each repetitive element, some of which have multiple significant alignment scores for different subfamily consensus sequences. RepeatMasker then chooses the subfamily with the highest alignment score to annotate elements with the same ID. A consequence of this method is that fragmented elements can be annotated for the same subfamily even when the highest scoring alignment differs for each fragment. Since LTR18B and LTR18C consensus sequences are ~75% and 67% similar to LTR18A respectively, some LTR18A elements have significant alignments to LTR18B and/or LTR18C. Thus, we discard these elements with multiple possible alignments to avoid ambiguity from subfamily assignment. Second, we use paired LTR information to remove LTR18A elements that have discordant annotations. Due to the mechanism of ERV retrotransposition, we expect non-solo LTR18A elements to exist as same orientation pairs that are separated by the ERV internal region. Using this logic, we reasoned that paired LTRs that are assigned to different subfamilies have uncertain annotation.

Manual curation in LTR18A ancestral reconstruction
For manual curation of reconstructed ancestral sequences, we focused on insertions rather than deletions due to the possibility of insertions propagating up the tree. We determined insertion sites by examining the multiple sequence alignment of ortholog ancestors and finding gaps in the alignment created by insertions in only a few ortholog ancestors. Generally, we used parsimony when deciding to keep or remove an insertion. For example, if the insertion is present in only one primate lineage, then it is less likely for the insertion to have existed in the ortholog ancestor. Our reasoning is that an insertion in the ortholog ancestor and subsequent deletion in the other lineages requires at least two mutation events, whereas a single insertion in one primate lineage requires only one mutation event. Following reconstruction from ortholog ancestors, we again applied parsimony to manually adjust the LTR18A subfamily ancestor.
The MPRA library was constructed as previously described with some adjustments. An AgeI site was introduced upstream of the SV40 polyA signal and the BamHI site downstream of the SV40 polyA signal was deleted using the QuikChange Lightning site-directed mutagenesis kit (Agilent). Synthesized oligos were amplified with 0.05pmol of template per 50µL PCR reaction for seven cycles using MPRA library amplification primers. A total of 32 reactions were performed. Following amplification and gel purification, oligos were cloned into a pGL backbone with the AgeI insert using NheI and AgeI sites. Multiple ligations were pooled, purified by PCR cleanup (Nucleospin), and transformed into 5-alpha electrocompetent E. coli (NEB). The hsp68 promoter driving dsRed reporter was cloned using EcoRI and BamHI sites. The MPRA library with the hsp68 promoter and dsRed reporter was purified and transformed into E. coli before plasmid extraction. The final library was concentrated by ethanol precipitation.

Cell culture and library transfection
Cell culture and library transfections were performed as previously described (Kwasnieski et al. 2014).

RNA extraction was performed 24 hours after transfection using PureLink RNA Mini Kit with on-column
DNase treatment (Life Technologies) followed by DNase I treatment using TURBO DNA-free kit (Invitrogen). Samples were prepared for RNA-seq as previously described (Kwasnieski et al. 2014). First strand cDNA synthesis was performed using Superscript III Reverse Transcriptase (Life Technologies).
Barcodes were amplified from cDNA from three transfections and three technical replicates of DNA from the plasmid library. Amplified barcodes were digested with KpnI and EcoRI and ligated to Illumina adapters. Ligation products were further amplified, after which replicates and plasmid library DNA input were pooled for sequencing. We obtained over 1000x average coverage for each transfection replicate and the DNA input.

MPRA evaluation
Enrichment scores of elements were highly reproducible across transfection replicates in HepG2 (average R 2 =0.904) while moderately reproducible in K562 (average R 2 =0.666) (Supplemental Figure S3). We confirmed that orientation does not have large effects on enrichment score in both HepG2 and K562 (Supplemental Figure S4). We also found that selected control sequences from Ernst et al. follow expected trends for both their original annotations as well as redefined annotations based on expression values from Ernst et al. MPRA results (Supplemental Figure S5).

Subfamily age estimate
The average divergence, weighted by copy length, was calculated for the LTR18A subfamily using the RepeatMasker output for hg19. The age was obtained by using the average divergence and the average mammalian genome mutation rate of 2.2 x 10 -9 per base per year (Kumar and Subramanian 2002).