
The
ESAT-Quantitation module is an analytical toolkit to measure digital gene
expression (DGE) in end RNA-sequencing libraries. This is a complete re-write
of our old software that vastly improves running speed and has a fixed memory
footprint that is a function only of genome size and not library depth or
complexity. An overview of the analysis performed by ESAT is included below.
The toolkit is implemented in Java and requires version 1.6
or higher. ESAT can be run on a single or multiple samples. When running
multiple samples, all samples are processed together and the results reported
as a matrix. To handle multiple samples you will need to prepare a file
describing the samples indicating the alignment file locations and the
experiment condition assigned to each file. (See Usage Examples below for
a description of the file format.)
Download: esat.v1.jar
Usage:
java -jar esat.jar -task <score3p | score5p>
-alignments <sample description file> |
-in <BAM/SAM alignment file>
-annotations <transcription annotation
file> | -geneMapping <gene-to-transcript map
file>
-out <Output file name prefix> <Options>
|
Tasks: |
|
||
|
|
-task
<score3p | score5p> |
Specifies the type of analysis to perform. ESAT supports two types
of libraries (3' or 5') and you can process either a single sample or
multiple samples together. Multiple samples are specified using the -alignments option, whereas the single sample is specified by
the -alignment option for both 3' and 5' libraries.
When processing a single sample use score3p or score5p for 3' or 5' libraries respectively (default is score3p). |
|
|
|
|
|
|
|
|
|
|
|
|
Arguments: |
|
||
|
|
-annotations
<bed file> |
Points to a transcript annotation file
in BED format. If using this option, read alignment start counts will be
reported for each transcript listed
in the file. So, if a gene has multiple isoforms listed in this file, counts
will be reported as if each transcript corresponded to a separate gene. |
|
|
|
|
OR: |
|
|
|
-geneMapping
<gene-to-transcript mapping file> |
Points to a gene-to-transcript
mapping file whose format is described below.
If using this option, all transcripts for each gene are collapsed into a
single meta-transcript covering all bases contained in any transcript for that gene. This ensures that all isoforms for
each gene are represented in the output. The output file references genes by
the gene symbol listed in the mapping file, rather than the transcript ID. |
|
|
|
|
|
|
|
|
-in
<bam/sam file> |
For single sample runs, specifies
the path to the alignment BAM/SAM file. |
|
|
|
|
OR: |
|
|
|
-alignments
<samples list file> |
A sample file describing all the samples to be processed. This file
should contain 2 tab-separated columns, specifying one sample per line. Each
line should include: Column 1: Experiment/condition label (eID).
All samples with the same eID will be grouped together
as if all data came from the same input file. Column 2: Sample file path (BAM or SAM format). The
file name must provide the full file name path, and each file must
contain a header. In addition, the segments/chromosomes listed in the header must
match the ones listed in the annotation or gene-to-transcript mapping file. |
|
|
|
-out <filename prefix> |
Specifies the prefix of the output
files. Output files will be named <filename
prefix>.gene.txt and <filename prefix>.window.txt, containing counts of the number of reads with alignments
starting within each gene and window, respectively. |
|
|
|
|
|
|
|
|
|
|
|
|
Options: |
|
||
|
|
Scanning window parameters: |
||
|
|
-wLen
<window length> |
Length of the sliding scanning window. (Default=400) |
|
|
|
-wOlap
<window overlap> |
Overlap between adjacent sliding windows. (Default=0) |
|
|
|
-wExt
<window extension> |
Number of bases past the end (for 3’ libraries) or beginning (for 5’ libraries)
of the transcript to continue the sliding windows. (Note: The actual amount
of extension is truncated to prevent collision with neighboring transcripts.
If the extension is truncated, a warning message is issued giving both how
much the extension was truncated and the gene(s) causing the truncation.)
(Default=0) |
|
|
|
-sigTest
<p-value> |
Significance test threshold for each window. A SCAN statistic p-value
is computed for each window, based on the number of reads starting within the
full transcript and the number of reads starting within the window. If a
significance p-value is provided, only windows with a p-value less than this
threshold are saved. The default setting for this parameter is 1.0, so any
windows with non-zero counts are considered “significant”. If multiple
significant windows abut or overlap, the range from the beginning of the
first to the end of the last abutting/overlapping window is scanned with a
window (of the length specified by the –wLen parameter), and the most significant
window is located. Only this window is reported. (Default=1.0) |
|
|
|
-all |
As described above, if multiple abutting/overlapping windows have
significant counts, only a single, most significant, window is reported. If
the –all parameter is supplied, ALL significant windows
will be reported. If the –sigTest parameter is not provided, this will result in
reporting counts for all windows with non-zero counts. (Default: not
selected) |
|
|
|
-unstranded |
Ignores the strand of the alignment and
reference, and treats all reads as if they align on the + strand. (Default: not
selected) |
|
|
|
|
|
|
|
|
Alignment quality check: |
||
|
|
-quality <minimum quality> |
Minimum alignment quality. Reads with
alignment quality less than this threshold will not be processed. (Default=0) |
|
|
|
|
|
|
|
|
Multimapped reads: |
|
|
|
|
-multimap
<multimap option> |
Indicates how multimapped reads are handled. This only applies to
BAM/SAM files where the multimapped flag (in the
optional fields of the alignment fields section) are provided. Multimapped reads are indicated with NH:<i>, where <i> is
greater than 1. Options: ·
normal: treat multimapped reads as normal reads, receiving 1 count for
each start location. This results in multiple-counting ambiguous reads. (Default) ·
ignore: do not count multimapped reads. This avoids double-counting ambiguous
reads. ·
scale: counts each multimapped read as 1/<i>
count, where <i> is the number of locations
to which the read is mapped (provided in the NH:<i>
field.) NOTE: Using this option greatly
increases the working memory required for the program, since counts must be
recorded in a floating-point array (rather than the default short integer
storage). |
|
ESAT takes two primary inputs – SAM/BAM
alignment file(s) and a transcript reference file – and identifies locations in
the reference transcripts where a significant number of the aligned reads
start. It identifies these locations as follows:
Identification of significant window locations:
·
First, all of the SAM/BAM
alignment files are scanned, and the total number of aligned reads that start
at each genomic location is counted.
·
Next, the reference genome
file is loaded. If the gene-to-transcript mapping file is provided (using the –geneMapping parameter), all transcripts
for each gene are collapsed into a single composite transcript containing all
bases in all exonic regions of all transcripts for each
gene. If annotations are provided in a .BED file (with the –annotations parameter),
each transcript is treated separately.
·
The main scanning window
parameters (described above) define the length of the scanning window (-wLen),
the amount of overlap between windows as the window steps across the transcript
(-wOlap),
and the number of bases that the windows extend past the transcript boundary (-wExt).
·
Before beginning the scanning
process, intronic regions are removed and the
remaining regions are effectively “spliced” together for windowed counting purposes.
The window is then stepped along, with each window overlapping the previous
window by the amount specified with the –wOlap parameter.
The process is illustrated below:

·
For each transcript, the
total number of reads is counted. Then, for each window, a family wise error corrected p-value is
calculated based on a SCAN statistic, which specifies how “significant” the
counts in that window are with respect to the total number of reads within that
transcript, under the null hypothesis that reads are uniformly distributed across the transcript. Windows with a p-value less than the threshold specified with the –sigTest parameter are discarded, as
well as any partial windows. (Note: If no significance threshold is specified, the
default threshold of 1.0 is used, and all windows with non-zero counts are
considered significant.)
·
Finally, if a set of adjacent
(or overlapping) windows all have a significant number of counts, that region
is re-scanned from beginning to end in one-base steps to locate the window
location with the largest number of reads. All of the overlapping significant
windows are then replaced with this single “best” window. (Note: This
re-scanning process can be turned off by using the –all parameter.
If this parameter is selected, counts for all significant windows will be
reported.)
·
The locations of all
significant windows are saved for second pass, which counts the reads in each
of these significant windows for each input experiment/condition separately.
Collection of counts for individual experiments:
·
In the second pass, each of
the input alignment files are scanned again, and read counts are collected for
each experiment or condition specified in the alignment file list file
(described below). Read counts are recorded for each experiment/condition and
the read counts are reported in the gene-level and window-level output files
(described below). The gene-level file provides the counts of the total number
of reads aligned to the exonic regions (plus any
extension) of each transcript, and the window-level file provides counts of
reads for each experiment/condition within the significant windows determined
during the first-pass analysis.
Output
file formats:
The gene-level output file (<output file prefix>.gene.txt)
contains a tab-delimited matrix with one row per gene/transcript provided. The
first three columns contain the gene symbol or transcript ID, the chromosome,
and the strand. This is followed by one column for each experiment/condition
indicating the number of reads aligned to locations within the exons of the
gene/transcript for that experiment/condition. The column headers for the first
three columns are: “Symbol”, “chr”, and “strand”. The
column headers for the remaining columns are the names of the experiments (or
conditions) provided in the input file (with the –alignments parameter). If a single input file is provided
with the –in parameter, the data column is named “Exp1”.
The window-level output file (<output file prefix>.window.txt)
also contains a matrix with one row per window. The first 5 columns contain the
gene symbol (or transcript ID), chromosome, genomic start coordinate, genomic
end coordinate, and strand. This is followed by one column for each
experiment/condition indicating the number of reads aligned to locations within
the exonic regions contained in that window for that
experiment/condition.
For
both files, counts include any extension past the end of the gene/transcript
specified by the window extension (-wExt)
parameter. Partial windows (i.e., windows with length less than the value
specified with the –wLen parameter) are discarded,
with one exception: if the total length of the transcript, plus any extension,
is less than one full window length, read start counts are reported for all
available bases for that transcript.
Input Files:
ESAT takes alignment BAM/SAM files as input. Any alignment
software, such as Tophat, can be used for this
purpose. BAM or SAM files don’t need to be sorted and indexed in order to be
processed, but processing speed might be improved if this pre-processing has
been performed. (See samtools or Picard for tools to
sort and index SAM/BAM files.) The alignment files MUST have headers. BAM or
SAM files don’t need to be sorted and indexed in order to be processed, but
processing speed may be improved if this pre-processing has been performed.
Annotations file:
This is the genome
annotation file specified with the -annotations parameter, containing annotations for all
transcripts being evaluated. This file must be in standard BED file format (see
BED format for
details). Note that the genomic coordinates specified in this file must match
those used for the alignment files.
Gene
Mapping file:
This
is the gene-to-transcript mapping file provided with the –geneMapping parameter, containing annotations
for all transcripts being evaluated, along with the gene symbol associated with
each transcript. A table containing all required information can be downloaded
from the UCSC Genome Browser website (UCSC Table
Browser) by selecting the species and genome assembly if interest, and the
following additional fields:
·
group: Genes and Gene Predictions
·
track: RefSeq Genes
·
table: knownGene
·
output format: all
fields from selected table
Then, provide an output file
name and select “get output”. If the table is downloaded in gzip
format, it must be un-zipped to a tab-delimited text file for input to ESAT. At
a minimum, the gene mapping file must have the following (labeled) columns:
·
name: transcript ID
·
chrom: reference
sequence chromosome (or scaffold)
·
txStart: transcription start position
·
txEnd: transcription
end position
·
name2: Gene symbol
·
strand: + or - strand
·
exonStarts: exon start positions
·
exonEnds: exon end position
If
any of these mandatory columns are missing, the ESAT will terminate with an
error message. Feature coordinates are assumed to be 1-based, “right open” such
that exonStart
coordinates point to the base position of the start of an exon and the exonEnd
coordinates point to one base after the end of the exon. (Using this coordinate
system, exonEnd-exonStart=exonLength.) ExonsStarts and exonEnds are
given as a matching set of comma-separated lists of values indicating the start
and end position of each exon relative to txStart, similar to the blockStarts
feature in BED files.
When
the –geneMapping parameter is selected, all
exons for all transcripts assigned to a given gene symbol are collapsed down to
form a composite meta-transcript whose “exons” cover all exons of all transcripts
for that gene. An example is shown below:

Alignment file list
file:
The
file provided with the –alignments parameter is a tab-delimited list of
experiment label-alignment file pairs. Comment lines can be included by staring
the line with the ‘#’ character. An example file is shown below:
# Experiment 1 files:
Exp1 /home/usr1/sample1a.bam
Exp1 /home/usr1/sample1b.bam
# Experiment 2 files:
Exp2 /home/usr1/sample2a.bam
Exp2 /home/usr1/sample2b.bam
Note that multiple alignment files can be included for each
experimental condition. With this example file, two columns of counts would be
provided in the gene-level and window-level output files, with column headers “Exp1”
and “Exp2”.
Usage
examples:
Processing a single .bam file using a .bed file to provide
genomic coordinates of regions of interest:
java -jar esat.jar -in sample1.bam -annotations genes.bed -out test1
This will use all default parameters (wLen=400, wExt=0, wOlap=0, 3’ scoring, stranded, no
quality filtering, treating multmapped reads as
normal reads, and reporting only the most significant, non-zero window within
each set of adjacent non-zero-count windows), and produce two output files: test1.gene.txt and test1.window.txt.
Processing
a set of .bam files using a gene-to-transcript mapping file to provide genome
coordinates of transcripts:
java –Xmx12g –jar esat.jar –alignments
sampleList.txt –geneMapping geneMap.txt –wLen 200 –wExt 1000 –wOlap 50 –sigTest 0.01 –multimap ignore –out test2
The –Xmx12g switch is provided to allocate 12Gb of memory for the Java virtual machine. Alignment files
are listed in the sampleList.txt file (described below), and
the gene/transcript coordinates are given in the geneMap.txt file. If multiple transcripts for a gene are
listed in the geneMap.txt file, all transcripts are collapsed into a single
composite transcript as described in the “Gene Mapping file” section above. The
scanning window width is set to 200 bases with an overlap of 50 bases, and
scanning extends 1000 bases past the 3’ end of the transcript. A significance
test is performed for each window, windows with a p-value greater than 0.01 are
discarded, and only the most significant window of length 200 within a range of
overlapping “significant” windows is reported. Any reads that are marked as multimapped are skipped. Two output files, test2.gene.txt and test2.window.txt, are produced.
A
possible (tab-delimited) sampleList.txt file is shown below. Note
that lines beginning with the comment character ‘#’ are ignored.
Warning/error
messages:
Extension warnings:
If
the –wExt parameter is set a value greater than 0, it is
possible that when attempting to extend past the end of a transcript, the extension
would collide with a neighboring transcript. If this occurs, the region is
extended only as far as possible without overlapping the neighboring gene, and
reported with a warning such as:
“WARN
… Gene ENSMUSG00000077212.1 (-) extension overlaps ENSMUSG00000042354.6_5.
400-base extension shortened to 59”
This
indicates that, when attempting to extend the 6th exon of the transcript
ENSMUSG00000042354.6 (the “_5” indicates the zero-based exon ID) by 400 bases,
it collided with the transcript for ENSMUSG00000077212.1 and had to be
truncated to 59 bases in order to avoid the collision.
Isoform mismatch warnings:
When
a transcript-to-gene file is provided with the –geneMapping
parameter, all transcripts for each gene are collapsed into a single composite transcript,
as described in the Gene Mapping file section above. The transcripts of
some genes include copies of that gene on other chromosomes, rather than just
additional isoforms (splice variants) of that gene. For simplicity, the first
transcript listed for each gene is assumed to have the “correct” chromosome for
that gene. If transcripts of that gene listed later in the file appear on a
different chromosome, they are rejected with a warning message such as:
“WARN
… New isoform mismatch for Gm5512 (chr10+) with NR_002891 (chr19+)”
If this causes problems for your analysis, the
gene-to-transcript mapping file can be edited with this in mind. For example,
if there are two known copies of gene X
one two different choromsomes, and both copies are of
interest to your analysis, the file could be modified such that the gene symbol
for all transcripts on chromosome A are X.chrA, and all transcripts on chromosome B are named X.chrB. If it
should turn out that the “true” main isoform of a gene of interest is not the
first transcript listed in the file, simply modify the file to put the main
isoform first.
Count overflow
warnings:
The counts of the number of reads starting at each genomic
location are stored in unsigned short integer (16-bit) arrays in order to minimize
the memory footprint of ESAT. If the number of reads with alignments starting
at a particular location exceeds the maximum possible number that can be stored
(65,535), it is flagged with a warning message such as:
“WARN … location 14238898 in chr7 (-) has >65535 counts.”
Each overflow location is only flagged once, and counts for
that location will then be saved as floating-point numbers so that the correct
number of counts at that location is recorded.