# STR-wide association analysis of A. thaliana phenotypes
### Press et al. (in prep).
This file has information and instructions about the framework used to detect STR-phenotype associations.

### Contents
You will find the following directories/files in this directory:

1. `code` (dir): Contains code for the analyses, including `code/mipstr_phenos_lme_Atwell_logs_SHUFFLE_031717.R` and `code/mipstr_phenos_lme_Atwell_logs_031717.R`. The first runs the association analysis with permuted phenotypes (useful for evaluating inflation); the second runs it with the actual phenotypes. The other scripts make plots or run specific follow-up analyses; see the notes on these below.

2. `output` (dir): Holds some of the output files that you can expect from running the analyses. 

3. `data` (dir): Contains (processed) input files for association analyses. Includes a kinship matrix, STR genotypes, phenotypes, and some SNP/STR genotype files prepared for specific analyses.

4. `pca_corr_stuff` (dir): Results of an abortive analysis (output as part of the same scripts) that tried to use PCA to correct for population structure. It didn't work very well, so we omitted it from the analysis, but keep it here for archival purposes.

Other files will populate this directory during analysis, but I will not treat them here.


### Running the analysis
To run this analysis and generate plots (all paths are relative to this directory):

#### Dependencies
  * R (>=3.2.1)
  * R libraries:
    * coxme
    * beeswarm
    * stringr
    * MASS
    * car

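Before starting a long run, it can help to confirm that the dependencies above are in place. The following is a minimal sketch, not part of the repository: `check_r_packages` and the `RSCRIPT` variable are hypothetical names introduced here, and the check only tests that each package can be loaded and that R is at least version 3.2.1.

```shell
# Sketch: verify the R version and the packages listed under Dependencies.
# RSCRIPT and check_r_packages are illustrative names, not part of this repo.
RSCRIPT="${RSCRIPT:-Rscript}"

check_r_packages() {
    "$RSCRIPT" \
        -e 'stopifnot(getRversion() >= "3.2.1")' \
        -e 'pkgs <- c("coxme", "beeswarm", "stringr", "MASS", "car")' \
        -e 'ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)' \
        -e 'if (!all(ok)) { cat("Missing:", pkgs[!ok], "\n"); quit(status = 1) }' \
        -e 'cat("All packages found\n")'
}
```

Calling `check_r_packages` returns a non-zero status if R is too old or any package cannot be loaded.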
#### Steps

0) Preprocess SNP, phenotype, and STR data into useful data objects. This is already done, but a few notes are given here for transparency.

  * The SNP-to-kinship conversion is part of the `code/mipstr_phenos_lme_Atwell_logs*.R` scripts. You do not need to run it, because the result is already provided (`data/Kinship.Rdat`), and it will not run unless you download the TAIR9 RegMap data and put it in this directory.
  * STR genotypes are in `data/mip_geno_filtered_table.txt`.
  * Phenotype data is in `data/phenotype_published_raw_rename.txt`, which has been lightly edited to make it easier to parse.
  * Slices of SNP data useful for adjusting association analyses or effect size estimation are in `data/control_snp_phenos_strwa_control_032317.txt` and `data/est_effect_snps_031617.txt`. You can get these from the script (in the code dir for the full repo): `../code/preprocess_data_for_controlsnp_012417.R`.
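
Since the preprocessed inputs listed above are expected to exist already, a quick presence check can catch a bad checkout before a long run. This is a small sketch under that assumption; `check_inputs` is a hypothetical helper, and the file list is simply the paths named in this step.

```shell
# Sketch: report any preprocessed input file that is absent.
# check_inputs is an illustrative helper, not part of this repo.
# Usage: check_inputs [root_dir]   (root_dir defaults to the current directory)
check_inputs() {
    root="${1:-.}"
    status=0
    for f in \
        data/Kinship.Rdat \
        data/mip_geno_filtered_table.txt \
        data/phenotype_published_raw_rename.txt \
        data/control_snp_phenos_strwa_control_032317.txt \
        data/est_effect_snps_031617.txt
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

Run `check_inputs` from this directory; it prints one line per missing file and returns a non-zero status if anything is absent.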

1) Run the association analysis on both the actual and the shuffled data:

```
$ Rscript code/mipstr_phenos_lme_Atwell_logs_031717.R # real data
$ Rscript code/mipstr_phenos_lme_Atwell_logs_SHUFFLE_031717.R # shuffled (permuted) phenotypes
```

This takes a LONG time (~30hr) and 2-4G of memory. I usually run it from a shell script or submit it to a cluster computer.
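
One way to run a job of this length from a shell script is to detach it from the terminal and capture its output in a log. The sketch below is only an illustration of that approach: `run_detached`, the `RSCRIPT` variable, and the `logs/` directory are assumptions made here, not something the repository provides, and a cluster scheduler would need its own submission command instead.

```shell
# Sketch: start one analysis script in the background, immune to hangups,
# with stdout and stderr written to logs/<script name>.log.
# run_detached, RSCRIPT and logs/ are illustrative, not part of this repo.
RSCRIPT="${RSCRIPT:-Rscript}"

run_detached() {
    script="$1"
    mkdir -p logs
    log="logs/$(basename "$script" .R).log"
    nohup "$RSCRIPT" "$script" > "$log" 2>&1 &
    echo "$!"   # PID of the background job, for later monitoring
}
```

For example, `run_detached code/mipstr_phenos_lme_Atwell_logs_031717.R` starts the real-data run and prints its process ID; the shuffled run can be started the same way.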

2) Estimate effect sizes. Script: `code/est_snp_effectsize_032417.R`. 
Run:
```
$ Rscript code/est_snp_effectsize_032417.R 
```

3) Control for nearby SNPs (determined manually) affecting LD flowering. Script: `code/control_for_snps_many_strwas_032317.R`.
Run:
```
$ Rscript code/control_for_snps_many_strwas_032317.R 
```

4) Make beeswarm/boxplots of the phenotype/STR associations at some threshold. Script: `code/str_pheno_swarmer.R`.
Run:
```
$ Rscript code/str_pheno_swarmer.R 
```
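Steps 2 through 4 can also be chained so that a failure stops the sequence. This is a minimal sketch that assumes the scripts are run in the order listed above and that step 1 has already finished; `run_followups` and `RSCRIPT` are hypothetical names used only for illustration.

```shell
# Sketch: run the follow-up scripts (steps 2-4) in the order listed,
# stopping at the first one that fails.
# run_followups and RSCRIPT are illustrative, not part of this repo.
RSCRIPT="${RSCRIPT:-Rscript}"

run_followups() {
    for script in \
        code/est_snp_effectsize_032417.R \
        code/control_for_snps_many_strwas_032317.R \
        code/str_pheno_swarmer.R
    do
        echo "Running $script"
        "$RSCRIPT" "$script" || { echo "FAILED: $script" >&2; return 1; }
    done
}
```

Calling `run_followups` from this directory prints each script name as it starts and returns a non-zero status at the first failure.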
