

Provided files required to analyze a new sample
===============================================

  - Ty3_LTR.fa          28bp Ty3 LTR sequence
  - Yeast_Genome.fa     Genome Reference Sequence
  - Native_LTRs.csv     List of Native LTRs
  - tRNA_Genes.csv      List of tRNA genes
  - PolIII_Genes.csv    PolIII Transcribed Genes
  - SGD_Database.csv    SGD Gene Annotations


Instructions to analyze a single sample
=======================================

The program starts from the raw sequencing data for the sample.
Sequencing reads must be paired-end reads and must be provided
in two separate files, one for each end. For instance:

  - SAMPLE-R1.fastq
  - SAMPLE-R2.fastq

The files generated during each step of the analysis can be named
freely. The names used in this guide are only examples and are not
required to analyze a sample. Note also that the scripts provided
for each step of the analysis are intended to run on a UNIX
operating system.


STEP 1 : sequencing reads trimming and filtering
================================================

This step prepares the reads for alignment by trimming the random
tags and Ty3 sequences from the genomic DNA. Reads for which the
Ty3 sequence is not found at the expected position or for which at
least one base is not called (N) in the 8bp random tag are discarded.
Detailed filtering statistics are displayed by the program at the end
of the processing. This step is performed by the provided script:

          1-Filter_Sequencing_Reads.pl

The program requires 6 arguments that must be provided on the
command line in the following order:

          Ty3_LTR.fa               // Input file  : provided
          SAMPLE-R1.fastq          // Input file  : Fastq READ 1
          SAMPLE-R2.fastq          // Input file  : Fastq READ 2

          SAMPLE-FILT-R1.fastq     // Output file : Fastq READ 1
          SAMPLE-FILT-R2.fastq     // Output file : Fastq READ 2
          SAMPLE-FILT-R1.tags      // Output file : Random tags

For instance, using the file names above and assuming all the input
files are located in the current folder, execute the command line:

./1-Filter_Sequencing_Reads.pl Ty3_LTR.fa SAMPLE-R1.fastq SAMPLE-R2.fastq SAMPLE-FILT-R1.fastq SAMPLE-FILT-R2.fastq SAMPLE-FILT-R1.tags
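
As a rough illustration of the filtering rule described in this step,
the Python sketch below processes a single read. It is not the code of
1-Filter_Sequencing_Reads.pl. The read layout (8bp random tag at the
start of the read, immediately followed by the Ty3 LTR sequence), the
exact-match comparison, the names trim_read and DEMO_LTR, and the
placeholder LTR string are all assumptions made for the example.

```python
TAG_LEN = 8  # length of the random tag, as stated in this guide


def trim_read(read, ltr, tag_len=TAG_LEN):
    """Return (tag, genomic_sequence), or None if the read is discarded.

    Assumed layout (hypothetical): [random tag][Ty3 LTR][genomic DNA].
    """
    tag = read[:tag_len]
    # Discard reads with an incomplete tag or an uncalled base (N) in the tag.
    if len(tag) < tag_len or "N" in tag:
        return None
    # Discard reads where the LTR is not found at the expected position.
    if not read.startswith(ltr, tag_len):
        return None
    # Keep only the genomic DNA that follows the tag and the LTR.
    return tag, read[tag_len + len(ltr):]


# Short placeholder, NOT the real 28bp sequence stored in Ty3_LTR.fa.
DEMO_LTR = "ACGTACGTAC"

print(trim_read("AAAACCCC" + DEMO_LTR + "GGGTTT", DEMO_LTR))  # ('AAAACCCC', 'GGGTTT')
print(trim_read("AAANCCCC" + DEMO_LTR + "GGGTTT", DEMO_LTR))  # None (N in the tag)
print(trim_read("AAAACCCCTTTT" + DEMO_LTR, DEMO_LTR))         # None (LTR shifted)
```

The provided script may tolerate mismatches or handle the two reads of
a pair differently. Refer to the statistics it prints for the actual
filtering results.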


STEP 2 : filtered reads alignment to reference genome
=====================================================

The filtered fastq files obtained during the previous step:

          SAMPLE-FILT-R1.fastq
          SAMPLE-FILT-R2.fastq

must then be aligned to the reference genome sequences provided
in the file named "Yeast_Genome.fa" using the short-read aligner
of your choice. Any short-read aligner can be used in this step
but when configuring the options of the aligner, make sure that:

  - the alignments are reported for the entire read (end-to-end mode
    in bowtie for instance). Reads hard-clipped or soft-clipped by
    the short-read aligner will be discarded afterwards. Forcing the
    aligner to report full matches only prevents reads from being
    discarded because of incomplete alignment results.

  - the results are reported in SAM file format. This is the default
    output file format for most aligners but some of them provide BAM
    files instead. In that case, convert them to SAM format, for
    instance with samtools.
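
To check whether an aligner configuration still produces clipped
alignments, the CIGAR string (column 6 of each SAM record) can be
inspected: in the SAM format, the operations S and H denote soft and
hard clipping. The Python sketch below shows one possible check. The
function name is_clipped is hypothetical, and this is not necessarily
the test applied by the provided scripts.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_clipped(cigar):
    """Return True if a SAM CIGAR string contains soft (S) or hard (H) clipping."""
    return any(op in "SH" for _, op in CIGAR_OP.findall(cigar))


print(is_clipped("50M"))    # False: the whole read is aligned
print(is_clipped("5S45M"))  # True: 5 bases soft-clipped
print(is_clipped("45M5H"))  # True: 5 bases hard-clipped
```

If a large fraction of the records is clipped, the aligner is probably
running in a local alignment mode.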

Multiple matches for a single paired-end read can be present in the
alignment results but are not recommended. Paired-end reads with
non-concordant alignment results (half-mapped only or mapped on
different chromosomes) will be discarded afterwards. By default, all
read pairs aligned on the same chromosome are considered for the
insertion site analysis (next steps). If you want to discard read
pairs aligned far apart on the chromosome, use the short-read aligner
options to do so: most aligners let you set a maximum insert length
for paired reads. There is no such filter in the scripts provided for
the next steps of the analysis.
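
The notion of concordance used above can be read directly from a SAM
record: the FLAG column (column 2) tells whether the read or its mate
is unmapped, and the RNAME and RNEXT columns (columns 3 and 7) give the
chromosome of the read and of its mate. The Python sketch below is only
an illustration of that rule. The function name, the read names, and
the chromosome names are made up, and the provided scripts may
implement the test differently.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate of this read is unmapped


def pair_on_same_chromosome(sam_line):
    """Return True if both mates are mapped and lie on the same chromosome."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, rnext = int(fields[1]), fields[2], fields[6]
    if flag & (READ_UNMAPPED | MATE_UNMAPPED):
        return False  # unmapped or half-mapped pair
    # In SAM, "=" in RNEXT means "same reference as RNAME".
    return rnext == "=" or rnext == rname


# Made-up SAM records (11 mandatory columns, sequence and qualities omitted).
both = "\t".join(["r1", "99", "chrIV", "1000", "42", "50M", "=", "1200", "250", "*", "*"])
half = "\t".join(["r2", "73", "chrIV", "1000", "42", "50M", "=", "1000", "0", "*", "*"])
split = "\t".join(["r3", "65", "chrIV", "1000", "42", "50M", "chrXII", "500", "0", "*", "*"])

print(pair_on_same_chromosome(both))   # True
print(pair_on_same_chromosome(half))   # False: mate unmapped
print(pair_on_same_chromosome(split))  # False: mates on different chromosomes
```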

The output SAM file can be named freely. The rest of this
documentation assumes you named it:

          SAMPLE-FILT-PE.sam


STEP 3 : merging reads, random tags, and alignment results together
===================================================================

This step will perform several tasks at once:

 - retrieves for each paired-end read the sequencing data,
   the random tag trimmed during step 1, and the corresponding
   alignment results in the SAM file.
 - discards all discordant alignment results.
 - reports all the concordant alignment results found for each
   paired-end read together with the corresponding sequences and tag
   in a tab-separated file that will be used during the next steps.

These tasks are performed by the provided script named:

          2-Merge_Reads_Tags_Mapping.pl

The program requires 5 arguments that must be provided on the
command line in the following order:

          SAMPLE-FILT-R1.tags      // Input file  : from STEP 1
          SAMPLE-FILT-R1.fastq     // Input file  : from STEP 1
          SAMPLE-FILT-R2.fastq     // Input file  : from STEP 1
          SAMPLE-FILT-PE.sam       // Input file  : from STEP 2
          SAMPLE.aln               // Output file : merged data
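
For instance, using the file names above, the command line can be
assembled as in the Python sketch below. The helper name merge_command
is hypothetical, and the sketch assumes, as in the STEP 1 example, that
the script and all input files are located in the current folder.

```python
def merge_command(sample, script="./2-Merge_Reads_Tags_Mapping.pl"):
    """Build the STEP 3 argument list in the order required by the script."""
    return [
        script,
        f"{sample}-FILT-R1.tags",   # input  : random tags from STEP 1
        f"{sample}-FILT-R1.fastq",  # input  : filtered READ 1 from STEP 1
        f"{sample}-FILT-R2.fastq",  # input  : filtered READ 2 from STEP 1
        f"{sample}-FILT-PE.sam",    # input  : alignment results from STEP 2
        f"{sample}.aln",            # output : merged data
    ]


cmd = merge_command("SAMPLE")
print(" ".join(cmd))  # the equivalent shell command line
```

The list can be passed to subprocess.run(cmd, check=True) to launch
the script from Python, or the printed line can be pasted in a shell.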

The program takes up to 8 GB of RAM and may take a few hours
to complete depending on the size of the dataset. Detailed filtering
statistics are printed by the program at the end of the processing.

Once this step is completed, the output file SAMPLE.aln is all you need
for the next steps of the analysis. The following files:

          SAMPLE-R1.fastq
          SAMPLE-R2.fastq
          SAMPLE-FILT-R1.tags
          SAMPLE-FILT-R1.fastq
          SAMPLE-FILT-R2.fastq
          SAMPLE-FILT-PE.sam

can safely be archived or removed from your workspace.


STEP 4 : extracting insertion sites and random tags
===================================================

This step will perform the following tasks:

 - extracts the full list of insertion sites based on the read alignment
   results. Insertion sites are identified by (1) the chromosome (2) the
   strand and (3) the position on the chromosome.
 - extracts the full list of random tags observed at each insertion site
   at least one time (i.e. at least one sequencing read was located at
   the insertion site with the corresponding random tag).
 - extracts exact read counts for each insertion site / random tag pair.
 - annotates each insertion site with a list of six surrounding genes
   (closest gene, closest gene upstream of the insertion site, closest
   gene downstream of the insertion site, and the three closest genes
   for which the insertion site is upstream of the gene).
 - extracts a filtered list of insertion sites and random tags by first
   removing all random tags observed fewer than 50 times at each insertion
   site and then removing all insertion sites for which no random tag is
   left after filtering.
 - reports both sets of filtered and unfiltered insertion sites and
   random tags in comma-separated files that can be imported into Excel.
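
The counting and filtering rules listed above can be sketched in
Python as follows. This is a simplified illustration, not the code of
the provided script: the function names, the in-memory data layout, and
the example reads are assumptions. Only the site definition
(chromosome, strand, position) and the threshold of 50 reads come from
this guide.

```python
from collections import Counter, defaultdict

MIN_TAG_READS = 50  # threshold stated in this guide


def count_pairs(reads):
    """Count reads for each insertion site / random tag pair.

    `reads` yields (chromosome, strand, position, tag) tuples.
    """
    counts = defaultdict(Counter)
    for chrom, strand, pos, tag in reads:
        counts[(chrom, strand, pos)][tag] += 1
    return counts


def filter_pairs(counts, min_reads=MIN_TAG_READS):
    """Drop tags seen fewer than `min_reads` times, then sites left without tags."""
    kept = {}
    for site, tags in counts.items():
        strong = {tag: n for tag, n in tags.items() if n >= min_reads}
        if strong:
            kept[site] = strong
    return kept


# Made-up reads for illustration only.
reads = (
    [("chrIV", "+", 1000, "ACGTACGT")] * 60
    + [("chrIV", "+", 1000, "TTTTCCCC")] * 3
    + [("chrXII", "-", 500, "GGGGAAAA")] * 10
)
counts = count_pairs(reads)
selected = filter_pairs(counts)
print(len(counts), "sites before filtering,", len(selected), "after")
```

In this example the site on chrXII is removed because its only tag was
observed 10 times, and the tag TTTTCCCC is removed from the chrIV site.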

These tasks are performed by the provided script named:

          3-Extract_Insertion_Sites.pl

The program requires 6 arguments that must be provided on the
command line in the following order:

          Native_LTRs.csv          // Input file  : provided
          tRNA_Genes.csv           // Input file  : provided
          PolIII_Genes.csv         // Input file  : provided
          SGD_Database.csv         // Input file  : provided
          SAMPLE.aln               // Input file  : from STEP 3
          OUTPUTPREF               // Output files name prefix

The program outputs 4 files with names starting with the prefix
provided to the script on the command line:

          OUTPUTPREF_all_sites.csv  // All insertion sites
          OUTPUTPREF_all_pairs.csv  // All pairs site / tag
          
          OUTPUTPREF_sel_sites.csv  // Filtered insertion sites
          OUTPUTPREF_sel_pairs.csv  // Filtered pairs site / tag
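
As an example, the Python sketch below assembles the command line for
this step and lists the output files to expect for a given prefix. The
helper names extract_command and expected_outputs are hypothetical, and
the sketch assumes that the script and the provided input files are
located in the current folder.

```python
def extract_command(aln_file, prefix, script="./3-Extract_Insertion_Sites.pl"):
    """Build the STEP 4 argument list in the order required by the script."""
    return [
        script,
        "Native_LTRs.csv",   # input  : provided
        "tRNA_Genes.csv",    # input  : provided
        "PolIII_Genes.csv",  # input  : provided
        "SGD_Database.csv",  # input  : provided
        aln_file,            # input  : from STEP 3
        prefix,              # output files name prefix
    ]


def expected_outputs(prefix):
    """List the four output file names derived from the prefix."""
    return [
        f"{prefix}_{scope}_{kind}.csv"
        for scope in ("all", "sel")
        for kind in ("sites", "pairs")
    ]


print(" ".join(extract_command("SAMPLE.aln", "OUTPUTPREF")))
print(expected_outputs("OUTPUTPREF"))
```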

Output file description - files *_sites.csv
===========================================

The insertion sites are reported in the files *_sites.csv
with the following information for each site:

          Insertion Site                Generic site identifier. This identifier is unique
                                        for each insertion site and is used in each of the
                                        four output files in order to easily match entries
                                        in each file. Note that these identifiers are only
                                        valid for the processed sample and cannot be used
                                        to compare multiple samples processed separately.

          Chromosome                    Insertion site description.
          Strand
          Position

          Num Random Tags               Total number of distinct random tags observed at
                                        the corresponding insertion site. The actual list
                                        of random tags is provided in the files *_pairs.csv

          Total Read Count              The total number of sequencing reads aligned at
                                        the corresponding insertion site. The detail of how
                                        many of these sequencing reads are observed with
                                        each random tag is provided in the files *_pairs.csv

          Native LTR                    Reports if the insertion site is actually a native
                                        LTR position on the reference genome. The values in
                                        this column are either equal to "no" meaning that
                                        there is no LTR in the reference genome at the
                                        corresponding position or are equal to the native
                                        LTR identifier in the file "Native_LTRs.csv".

The next 30 columns provide gene annotations for the insertion sites.
Six genes are systematically reported for each insertion site:

          CLOSEST GENE                  Closest gene to the insertion site position

          CLOSEST UPSTREAM GENE         Closest gene upstream of the insertion site

          CLOSEST DOWNSTREAM GENE       Closest gene downstream of the insertion site

          UPSTREAM OF GENE #1           Closest gene the insertion site is upstream of

          UPSTREAM OF GENE #2           Second closest gene the insertion site is upstream of

          UPSTREAM OF GENE #3           Third closest gene the insertion site is upstream of

For each gene, 5 columns are used to provide information about the gene:

          Gene Identifier               The corresponding generic gene identifier used in
                                        the file "tRNA_Genes.csv", or "PolIII_Genes.csv",
                                        or "SGD_Database.csv" depending on which gene set
                                        the gene comes from.

          Distance                      The distance of the gene to the insertion site.

          Strand                        The gene strand.

          Identifier                    The SGD gene identifier.

          Name                          The standard gene name (gene symbol).
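
The 30 annotation columns are therefore the combination of the six
reported genes with the five fields above. The Python sketch below
enumerates them. It assumes that the columns are grouped gene by gene
in the order listed in this guide. The exact header strings written to
the *_sites.csv files may differ from the labels used here.

```python
from itertools import product

GENE_SLOTS = [
    "CLOSEST GENE",
    "CLOSEST UPSTREAM GENE",
    "CLOSEST DOWNSTREAM GENE",
    "UPSTREAM OF GENE #1",
    "UPSTREAM OF GENE #2",
    "UPSTREAM OF GENE #3",
]
GENE_FIELDS = ["Gene Identifier", "Distance", "Strand", "Identifier", "Name"]

# One (gene, field) pair per annotation column, grouped gene by gene.
annotation_columns = list(product(GENE_SLOTS, GENE_FIELDS))

print(len(annotation_columns))  # 30
print(annotation_columns[5])    # ('CLOSEST UPSTREAM GENE', 'Gene Identifier')
```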

Output file description - files *_pairs.csv
===========================================

The insertion site / random tag pairs are reported in the files *_pairs.csv
together with the detailed read count for each random tag.

          Insertion Site                Generic site identifier.

          Chromosome                    Insertion site description.
          Strand
          Position

          Total Read Count              The total number of sequencing reads aligned
                                        at the corresponding insertion site.

          Random Tag                    Random tag 8bp sequence.

          Tag Read Count                Number of reads aligned at the corresponding
                                        insertion site with the random tag sequence
                                        reported in the previous column.

The four output files are provided in comma-separated values (CSV) format
and can be imported into Excel for visualization or into another program
such as R for further analysis.
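
The same files can also be read with a short script. The Python sketch
below groups the rows of a *_pairs.csv file by insertion site. It
assumes that the file starts with a header row whose column names match
the ones listed above, which should be checked against an actual output
file. The function name, the site identifiers S1 and S2, and the
example rows are made up.

```python
import csv
import io
from collections import defaultdict


def tag_counts_by_site(handle):
    """Map each insertion site identifier to {random tag: read count}."""
    sites = defaultdict(dict)
    for row in csv.DictReader(handle):
        sites[row["Insertion Site"]][row["Random Tag"]] = int(row["Tag Read Count"])
    return dict(sites)


# Made-up content standing in for a *_pairs.csv file.
demo = io.StringIO(
    "Insertion Site,Chromosome,Strand,Position,Total Read Count,Random Tag,Tag Read Count\n"
    "S1,chrIV,+,1000,63,ACGTACGT,60\n"
    "S1,chrIV,+,1000,63,TTTTCCCC,3\n"
    "S2,chrXII,-,500,10,GGGGAAAA,10\n"
)
by_site = tag_counts_by_site(demo)
for site, tags in by_site.items():
    # site identifier, number of distinct tags, reads summed over the tags
    print(site, len(tags), sum(tags.values()))
```

To process a real file, replace the demo object with a file handle, for
example open("OUTPUTPREF_all_pairs.csv", newline="").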

