# Readme for STR profiling/analysis
### Press et al. (in prep)
This file describes the contents of this package of data and code for reproducing the analyses in our study. For inquiries, email maximilian [at] alumni.reed.edu or queitsch [at] uw.edu.

### Contents
You will find the following directories/files in this package:

1) `araport_annot/` (dir): data files relating to the annotation of STRs.

2) `code/` (dir): code for data analysis and processing.

3) `data_summaries/` (dir): some high-level descriptions of results, some of which also appear in supplemental tables.

4) `expansion_analysis/` (dir): data related to analysis of STR expansions discovered in this study, including sequencing, gel images, and qPCR. 

5) `Kinmat.Rdat` (R data file): the kinship matrix of A. thaliana strains used in this study (e.g., in association analyses).

6) `LD_analysis/` (dir): STR linkage disequilibrium analysis data and code. See the Readme for this dir therein. 

7) `mip_design_troubleshoot/` (dir): information regarding STRs in the A. thaliana genome, which STRs were chosen for MIP targeting, the MIPs themselves, and results of genotyping.

8) `output/` (dir): provides a location for output of data from overall data analysis driver script.

9) `problem_mip_genotypes_081016.txt` (space-delim text file): STR genotypes for each strain, with strains as columns and STRs as rows.

10) `strwa/` (dir): data and results pertinent to the association of STRs with phenotypes ("STR-wide association"). 

11) `strwa_analysis/` (dir): STR-phenotype association analysis data and code. See the Readme for this dir therein. Some components are copied to `strwa/`, from which further analyses read their data.
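
As a minimal sketch of how the two top-level data files listed above might be loaded in R (this is not code from the package): the helper name `read_str_genotypes` is hypothetical, and the presence of a header row of strain names in the genotype file is an assumption. The names of the objects stored in `Kinmat.Rdat` are not documented here, so the sketch simply prints whatever `load()` reports.

```r
# Hypothetical helper (not part of the package): read the genotype table,
# with STRs as rows and strains as columns.
# Assumption: the first line is a header of strain names. If the header has
# one fewer field than the data rows, read.table() uses the first column
# (here, presumably STR identifiers) as row names.
read_str_genotypes <- function(path) {
  # The default sep = "" splits on any run of whitespace, including spaces.
  read.table(path, header = TRUE,
             check.names = FALSE, stringsAsFactors = FALSE)
}

geno_file <- "problem_mip_genotypes_081016.txt"
if (file.exists(geno_file)) {
  geno <- read_str_genotypes(geno_file)
  print(dim(geno))            # number of STRs x number of strains
  str(geno, list.len = 5)     # peek at the first few strain columns
}

kin_file <- "Kinmat.Rdat"
if (file.exists(kin_file)) {
  loaded <- load(kin_file)    # load() returns the names of the restored objects
  print(loaded)
}
```

Both steps are guarded with `file.exists()` so the snippet does nothing if it is run outside the package directory.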

### Analysis
We provide a driver script which, when run, should reproduce the majority of the analyses and figures reported in our paper. Some intensive analyses, such as the LD and association analyses, can be performed with code in their own directories but are left out of the driver script so that it stays simple and runs in a relatively short time (<5 minutes on my 2016 MacBook Pro laptop). MIP sequence data and processing code (e.g., for calling genotypes) are provided separately (see manuscript methods).

The script will print out descriptive summaries and test results, in addition to generating a bunch of plots used as figures in the text. It will generate more figures than were ultimately used, sometimes non-intuitively, but I chose to err on the side of too much output.

#### Dependencies of driver script:
 * R >= 3.2.1
 * R libraries:
   * MASS 
   * kernlab (provides `ksvm`)
   * stringr
   * vegan
   * DAAG
   * beeswarm
   * RColorBrewer
   * lme4
   * lmerTest
 * a few GB of memory.
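
Before running the driver script, it may help to check that these dependencies are in place. The following is a minimal sketch, not part of the package: `missing_packages` is a hypothetical helper, and the check assumes that the `ksvm` entry refers to the kernlab package, which, as far as I know, is where the `ksvm()` function lives.

```r
# Packages from the dependency list above.
# Assumption: "ksvm" is provided by the kernlab package.
deps <- c("MASS", "kernlab", "stringr", "vegan", "DAAG", "beeswarm",
          "RColorBrewer", "lme4", "lmerTest")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The driver script expects R >= 3.2.1.
stopifnot(getRversion() >= "3.2.1")

to_install <- missing_packages(deps)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The `install.packages()` call is left commented out so that the check itself never changes your R library.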
 
#### Running the driver script
If you just want to run the script, which will spit out some numbers and figures, you could in principle run one of these:

```
$ Rscript str_analysis_SUBMIT.R # (*nix terminal)
> source('str_analysis_SUBMIT.R') # R console
```

but I would recommend running it with knitr to generate a report with figures embedded, using an R console or RStudio:

```
> knitr::spin('str_analysis_SUBMIT.R') # spin() comes from the knitr package
```

which will generate an HTML report with everything in it. It also generates a bunch of figures in a `figure/` dir and an .md file that can be compiled into a PDF report using pandoc or a similar program, if you prefer that.

If you are using RStudio, you can simply use the "Compile Report" button/command to generate such a report.

I make no guarantees regarding the readability of the report or its exchangeability with the manuscript; the goal is instead reproducibility and transparency about where the results come from.