
## TE annotation

TE annotation is performed by a pipeline that runs the following steps for each genome:
  - Construction of the TE library
  - Filtering out low quality consensuses
  - Annotation of TE insertions in the genome
  - Merging internal sequences to LTRs
  - Removing highly repeated insertions
  - Removing overlapping insertions

Dependencies list:
  - [EarlGrey](https://github.com/TobyBaril/EarlGrey/releases/tag/v2.2)
  - [blast](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  - [seqtk](https://github.com/lh3/seqtk)
  - [RepeatMasker](https://www.repeatmasker.org/RepeatMasker/)
  - [samtools](http://www.htslib.org/download/)
  - [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)
  - [trf](https://tandem.bu.edu/trf/trf.html)
  - [fastx_toolkit](https://github.com/agordon/fastx_toolkit/tree/master)
  - [pandas package](https://pandas.pydata.org/docs/getting_started/install.html)
  - [LTR_FINDER_parallel](https://github.com/oushujun/LTR_FINDER_parallel)
  - [RepeatCraft](https://github.com/niccw/repeatcraftp)
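
A quick way to confirm that the command-line dependencies are installed is to look each one up on `PATH`. The sketch below is not part of the pipeline. The executable names are assumptions based on how each tool is usually installed (for example `blastn` for blast and `fasta_formatter` for fastx_toolkit) and may differ on your system. RepeatCraft is referenced by path rather than looked up on `PATH`, so it is not checked here.

```shell
#!/usr/bin/env bash
# Print every tool name passed as an argument that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions; adjust them to your installation.
missing_tools earlGrey blastn seqtk RepeatMasker samtools bedtools trf \
    fasta_formatter LTR_FINDER_parallel

# pandas is a Python package, so it is checked by importing it.
python3 -c 'import pandas' 2>/dev/null || echo "pandas (Python package)"
```

Any name printed is a dependency that could not be found; no output means everything in the list was located.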

Data:
  - [TE library from Dfam](https://zenodo.org/api/records/13117512/draft/files/Dfam3.7_droso_49kclassified.fa.zip/content)
  - [Reference CDS](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/153/725/GCF_018153725.1_ASM1815372v1/GCF_018153725.1_ASM1815372v1_cds_from_genomic.fna.gz)
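
The reference CDS is distributed as a gzip archive, while the example command further down passes the uncompressed `.fna` file. As a minimal sketch, assuming `curl` and `gunzip` are available, the file could be fetched as follows; `fetch_cds` is a hypothetical helper name, not part of the pipeline.

```shell
#!/usr/bin/env bash
# URL of the reference CDS listed above.
CDS_URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/153/725/GCF_018153725.1_ASM1815372v1/GCF_018153725.1_ASM1815372v1_cds_from_genomic.fna.gz"

CDS_GZ="${CDS_URL##*/}"   # file name without the URL path
CDS_FNA="${CDS_GZ%.gz}"   # file name after decompression

# Hypothetical helper: download the archive and decompress it in place.
fetch_cds() {
    curl -fL -o "$CDS_GZ" "$CDS_URL" && gunzip -f "$CDS_GZ"
}

echo "Reference CDS will be saved as: $CDS_FNA"
```

Calling `fetch_cds` from the working directory should leave the uncompressed file there under the name stored in `CDS_FNA`.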

**Command line: example for *D. mojavensis wrigleyi***

EarlGrey:

```
earlGrey -g D_moj_wrigleyi_genome.fasta -s D_moj_wrigleyi -o d_moj_wri -t 24
```
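
The library passed to the next step has a name ending in `-families.fa.strained`. The exact subdirectory layout may differ between EarlGrey versions, so one hedged way to locate the file is to search the output directory. `find_te_library` is a hypothetical helper, not an EarlGrey command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list EarlGrey library files under an output directory.
find_te_library() {
    find "$1" -type f -name '*-families.fa.strained'
}

# d_moj_wri is the output directory given to earlGrey with -o above.
if [ -d d_moj_wri ]; then
    find_te_library d_moj_wri
fi
```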

Once you have created the library with EarlGrey, the remaining polishing steps are performed by the script `TEannotation/TEannot.sh`. Before running it, set the variable `PATH_TO_REPEAT_CRAFT` on line 23 of the script.
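
The variable can be edited by hand in any text editor. As an alternative, the sketch below prints line 23 for inspection and offers a hypothetical helper, `set_repeatcraft_path`, that rewrites it. It assumes that line 23 is a plain shell assignment beginning with `PATH_TO_REPEAT_CRAFT=`; if the line looks different, the file is left unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite line 23 of a script as
#   PATH_TO_REPEAT_CRAFT="<value>"
# Only acts when line 23 already starts with PATH_TO_REPEAT_CRAFT= (assumption).
set_repeatcraft_path() {
    local script="$1" value="$2" tmp status
    tmp="$(mktemp)" || return 1
    awk -v value="$value" '
        NR == 23 && /^[ \t]*PATH_TO_REPEAT_CRAFT=/ {
            print "PATH_TO_REPEAT_CRAFT=\"" value "\""
            next
        }
        { print }
    ' "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Inspect the line before editing it.
if [ -f TEannotation/TEannot.sh ]; then
    sed -n '23p' TEannotation/TEannot.sh
fi
```

For example, `set_repeatcraft_path TEannotation/TEannot.sh /path/to/repeatcraftp` would point the variable at a local RepeatCraft checkout; the path shown is a placeholder.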

Then run the script, using the TE library from Dfam and the CDS file from *D. mojavensis* as references:

```
# Options (comments are kept here because they cannot follow a trailing backslash):
#   --genome     genome assembly, available on Zenodo
#   --consensus  TE library generated by EarlGrey
#   --annot      gene annotation generated by Liftoff
#   --mate1/2    paired-end RNA-seq reads, available on NCBI
#   --species    output directory
#   --cds        reference CDS, available on NCBI
#   --database   reference TE consensus, available on Zenodo and Dfam
#   --strand     RNA-seq strandedness
#   --threads    number of threads
bash TEannot.sh --genome D_moj_wrigleyi_genome.fasta \
    --consensus dmoj22-families.fa.strained \
    --annot D_moj_wrigleyi_genes.gff \
    --mate1 dmoj22_head_P2_R1.fastq.gz \
    --mate2 dmoj22_head_P2_R2.fastq.gz \
    --species TEannot_dmoj22 \
    --cds GCF_018153725.1_ASM1815372v1_cds_from_genomic.fna \
    --database Dfam3.7_droso_UNC_classfied.fa \
    --strand rf-stranded \
    --threads 24
```
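
Because the run takes several input files, it can help to confirm that each one exists and is non-empty before launching it. The following is a minimal sketch, not part of `TEannot.sh`; `missing_inputs` is a hypothetical helper and the file names are simply those from the example command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each path that is missing or empty.
missing_inputs() {
    local f
    for f in "$@"; do
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}

# File names taken from the example command above.
missing_inputs D_moj_wrigleyi_genome.fasta \
    dmoj22-families.fa.strained \
    D_moj_wrigleyi_genes.gff \
    dmoj22_head_P2_R1.fastq.gz \
    dmoj22_head_P2_R2.fastq.gz \
    GCF_018153725.1_ASM1815372v1_cds_from_genomic.fna \
    Dfam3.7_droso_UNC_classfied.fa
```

No output means every input was found.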
