README

nmer_stats v. 1.00

Compiled for Linux

This program takes a FASTA file of in-frame protein-coding DNA segments as input and counts the occurrences of every n-mer of a given length, both in the input itself (real counts) and in randomizations of the input. The randomizations conserve codon and di-codon bias. Counts are reported for all reading frames combined, for each frame individually, and for different exon regions (beginning, middle, end). The program can also output the randomized genomes, which can be further analyzed for other properties.
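
As an illustration of what counting n-mers per frame means, here is a minimal Python sketch. It is not the nmer_stats implementation: the function name is hypothetical, and it assumes each sequence starts in frame at position 0. It groups n-mers by start position modulo 3 and does not try to decide which offset the program labels "-1" or "+1".

```python
from collections import Counter

def count_nmers_by_offset(sequences, n=6):
    """Count n-mers grouped by start position modulo 3 (hypothetical helper)."""
    counts = {0: Counter(), 1: Counter(), 2: Counter()}
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width n over the sequence, one base at a time.
        for start in range(len(seq) - n + 1):
            counts[start % 3][seq[start:start + n]] += 1
    return counts

counts = count_nmers_by_offset(["ATGGCCAAATGA"], n=6)
# Offset 0 holds n-mers that start on a codon boundary (in-frame).
in_frame = counts[0]
# Summing the three offsets gives the "all frames" count.
all_frames = counts[0] + counts[1] + counts[2]
```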

-----
TO RUN

To run, type "./nmer_stats" followed by your choice of flags. The general format is:

./nmer_stats -i <INPUT FILE NAME> -f <OUTPUT FILE NAME FORMAT> -s <N-MER LENGTH> -r <NUMBER OF RANDOM GENOMES> -rs <COEFFICIENT OF NUMBER OF CODON SWAPS> -cuf <CODON USAGE FILE OR TEMPLATE> -cec -numExons <NUMBER OF EXONS> -maxExon <LENGTH OF MAX EXON> -prg <FILE NAME FORMAT>

-i <INPUT FILE NAME> : A FASTA file containing coding segments only.
-f <OUTPUT FILE NAME FORMAT> : The prefix that will be used for the resulting output files with n-mer counts.
-s <N-MER LENGTH> : The length of the n-mers you want to count. (Default is 6)
-r <NUMBER OF RANDOM GENOMES> : The number of random genomes to generate from your input genome. The randomized n-mer occurrence counts in the output are averages over all the generated random genomes. (Recommended 20 based on empirical measurements. Default is 20. More is okay.)
-rs <COEFFICIENT OF NUMBER OF CODON SWAPS> : If the number you provide here is X, each randomized genome will undergo X*Y*log(Y) codon swaps, where Y is the total number of basepairs in your input file. (Recommended 3 based on empirical measurements. Default is 3. More is okay.)
-cuf <CODON USAGE FILE OR TEMPLATE> : The codon usage file representing the codon usage in your input file. You may also provide a template file here (e.g. "template.cu" in this package) and specify the "-cec" flag to generate an empirical codon usage file based on your input data.
-cec : Measure the codon usage from the input data. You must also provide a codon usage template file using the -cuf flag (e.g. "template.cu"). The resulting empirical codon usage file will have the same name as the file provided in the -cuf flag, with the added suffix "_EMPIRICAL" (e.g. "template.cu_EMPIRICAL"). Optional: omit this flag to save time if you already have a codon usage file generated for your input data and pass that file to -cuf.
-numExons <NUMBER OF EXONS> : The number of exons in the input file.
-maxExon <LENGTH OF MAX EXON> : The number of basepairs in the longest exon in the input file.
-prg <FILE NAME FORMAT> : Output the generated random genomes with this file name format (e.g. "-r 3" with "-prg rand_" will print the 3 generated random genomes with names "rand_1", "rand_2", "rand_3" in FASTA format). Optional.
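
To get a feel for how the -rs coefficient scales, the X*YlogY formula can be evaluated directly. The Python sketch below is only an estimate: this README does not state the base of the logarithm or how the result is rounded, so the natural logarithm and rounding to the nearest integer are assumptions, and the function name is hypothetical.

```python
import math

def codon_swaps(coefficient, total_basepairs):
    """Estimate the number of codon swaps per randomized genome.

    Assumes a natural logarithm and rounding to the nearest integer;
    nmer_stats itself may use a different base or rounding.
    """
    return round(coefficient * total_basepairs * math.log(total_basepairs))

# Default coefficient (3) applied to a hypothetical input of 100,000 basepairs.
swaps = codon_swaps(3, 100_000)
print(swaps)
```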

-----
AN EXAMPLE

An example command to run nmer_stats is provided below, with the referenced example files (data_all_cds.fas and template.cu) included in the package you downloaded. The output files from executing this command are also included in the package.
Example command:
./nmer_stats -i data_all_cds.fas -f output -s 6 -r 20 -rs 3 -cuf template.cu -cec -numExons 277 -maxExon 2967 -prg output_random_genome

When running on your own input file, you may reuse the codon usage template file provided with the example (template.cu), since a codon usage file or template is always required. If you use the template, make sure to also specify the -cec flag.
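
For your own input file you also need values for -numExons and -maxExon. The Python sketch below shows one way to obtain them, under the assumption that every FASTA record in the input is exactly one exon; if your file is organized differently, the numbers will not be what nmer_stats expects. The function name is hypothetical and is not part of this package.

```python
def exon_stats(fasta_lines):
    """Return (number of records, length of the longest record).

    Assumes one exon per FASTA record (an assumption, not confirmed here).
    """
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return len(lengths), max(lengths, default=0)

example = [">exon1", "ATGGCC", "AAATGA", ">exon2", "ATGTAA"]
num_exons, max_exon = exon_stats(example)

# With a real file, pass the open file handle instead of a list:
#   with open("data_all_cds.fas") as handle:
#       num_exons, max_exon = exon_stats(handle)
```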

-----
THE OUTPUT N-MER COUNTS

After a successful run there should be 20 output files that have the prefix you gave using the -f flag, with suffixes in the form i_j where i goes from 0 to 3 and j goes from 0 to 4. Each of these files contains the real counts for all n-mers of the specified length under the column "Nreal", the randomized counts for all n-mers of the specified length under the column "Nrand", and columns containing the p-values, standard deviations, and z-scores for these counts.
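
For readers unfamiliar with z-scores, the Python sketch below shows the conventional definition: the real count minus the mean of the random counts, divided by their standard deviation. This README does not state the exact formula nmer_stats uses, so treat this as an assumption; in particular, the use of the sample standard deviation and the function name are illustrative choices.

```python
import statistics

def z_score(n_real, random_counts):
    """Conventional z-score of a real count against per-genome random counts.

    Uses the sample standard deviation (an assumption); returns None when
    the random counts do not vary, since the z-score is then undefined.
    """
    mean = statistics.mean(random_counts)
    sd = statistics.stdev(random_counts)
    if sd == 0:
        return None
    return (n_real - mean) / sd

# Hypothetical counts of one n-mer in three random genomes, and its real count.
z = z_score(16, [8, 10, 12])
```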

The suffixes indicate which subset of n-mers was counted in a given output file. After a full randomization of the input data, the program counts n-mers in each of the frames, both over the entire exon length and in just the beginning, middle and end of the exons. The breakdown of the suffixes is as follows:

i=0 : entire exons
i=1 : first 50 basepairs of exons
i=2 : middle 50 basepairs of exons
i=3 : last 50 basepairs of exons

j=0 : all frames
j=1 : in-frame (0 frame)
j=2 : -1 frame
j=3 : +1 frame
j=4 : out-of-frame (-1 and +1 frames)

Thus an output file with the suffix "0_4" contains counts of occurrences of n-mers over the entire lengths of the exons but only n-mers that start in the -1 and +1 frames.
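
The suffix scheme above can be captured in a small lookup table, which is handy when post-processing many output files. The Python sketch below only decodes the "i_j" suffix; it does not assume how the prefix and suffix are joined in the actual file names. The names REGIONS, FRAMES and describe_suffix are hypothetical.

```python
# Meaning of i (exon region) and j (frame), copied from the lists above.
REGIONS = {
    0: "entire exons",
    1: "first 50 basepairs of exons",
    2: "middle 50 basepairs of exons",
    3: "last 50 basepairs of exons",
}
FRAMES = {
    0: "all frames",
    1: "in-frame (0 frame)",
    2: "-1 frame",
    3: "+1 frame",
    4: "out-of-frame (-1 and +1 frames)",
}

def describe_suffix(suffix):
    """Translate a suffix such as "0_4" into a readable description."""
    i, j = (int(part) for part in suffix.split("_"))
    return f"{REGIONS[i]}, {FRAMES[j]}"

# All 20 suffixes expected after a successful run.
all_suffixes = [f"{i}_{j}" for i in REGIONS for j in FRAMES]
print(describe_suffix("0_4"))
```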

