# Exploratory analysis of new HGT results 
## 11/05/2014, max
#### following a depressing but predictable election day

### first read in and preprocess data.
#### note that there are both p-values and overall counts (i.e. the observed number of gains of gene 1 in the presence of gene 2), and the two matrices have non-identical dimensions because of some filtering.

```r
# working_dir = 'MOtree_GLrun'
working_dir = "noalphabeta2014_MOtree_results"

# comment out if rerunning this code in an R session because these take
# forever to read in.
pval_splits = dir(file.path(working_dir, paste(working_dir, "_null_simed_genes_new", 
    sep = "")), pattern = "sim_null_pvals")
allps = c()
for (split in pval_splits) {
    load(file.path(working_dir, paste(working_dir, "_null_simed_genes_new", 
        sep = ""), split))
    allps = rbind(allps, pvals)
}

load(file.path(working_dir, "Cijmat.Rdat"))

# put both matrices on the same sorted gene set
pvals = as.matrix(allps[sort(rownames(allps)), sort(rownames(allps))])
koko = as.matrix(C_ij[rownames(pvals), rownames(pvals)])
```
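a possible tidier version of the read-in step, sketched on toy files so it runs anywhere. everything here is an assumption for illustration and not from the real run: the `.Rdat` extension and file names, the toy gene ids, and the helpers `write_toy_split` and `read_split`. each split is loaded into its own environment so the `pvals` object stored in the file cannot overwrite a `pvals` already in the session, and both matrices are cut down to their shared genes, which is one way to deal with the non-identical dimensions.

```r
set.seed(1)
genes = c("K3", "K1", "K2", "K4")  # toy gene ids (hypothetical)
tmp = tempfile("toy_splits")
dir.create(tmp)

# write a toy split file holding one object named `pvals`
write_toy_split = function(rows, path) {
    pvals = matrix(runif(length(rows) * length(genes)), nrow = length(rows), 
        dimnames = list(rows, genes))
    save(pvals, file = path)
}
write_toy_split(c("K3", "K1"), file.path(tmp, "sim_null_pvals_1.Rdat"))
write_toy_split(c("K2", "K4"), file.path(tmp, "sim_null_pvals_2.Rdat"))

# load one split into a private environment and return its `pvals`
read_split = function(path) {
    env = new.env()
    load(path, envir = env)
    get("pvals", envir = env)
}

# read every split once, then bind the rows in a single call
split_files = dir(tmp, pattern = "sim_null_pvals", full.names = TRUE)
allps_toy = do.call(rbind, lapply(split_files, read_split))

# toy count matrix that covers only some genes, mimicking the extra filtering
counts_toy = matrix(1:9, nrow = 3, dimnames = list(c("K1", "K2", "K3"), 
    c("K1", "K2", "K3")))

# keep the genes present in both matrices, in one sorted order
shared = sort(intersect(rownames(allps_toy), rownames(counts_toy)))
pvals_toy = allps_toy[shared, shared]
koko_toy = counts_toy[shared, shared]
```

binding once with `do.call(rbind, ...)` avoids regrowing the matrix on every pass of the loop, which may help if the read-in is the slow part.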


### look at pvals and qvals a little.
#### now plot pvals overall... looks terrible. the fdr correction is a little nasty.

```r
hist(as.matrix(pvals), 100, xlab = "P-val", main = "All, no filtering")
```

![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-2.png) 


```r
# how many pvals are exactly 0?
length(which(as.matrix(pvals) == 0))
```

```
## [1] 1150
```

```r

# now estimate qvals - how do different thresholds look?
qs = p.adjust(pvals, method = "fdr")
length(which(qs < 0.05))
```

```
## [1] 5688
```

```r
length(which(qs < 0.01))
```

```
## [1] 1150
```

```r
length(which(qs < 0.1))
```

```
## [1] 15239
```

```r
summary(qs)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   0.989   1.000   1.000
```

### this is obviously terrible, but it includes a lot of meaningless hypothesis tests. for instance, when c_ij represents a probabilistic count of <1, the data are too sparse for an upper-tail test to be worth doing. there are probably elegant ways to decide which tests are worth doing, but our purpose here is fairly heuristic, so we are going to impose an arbitrary cutoff.

### try filtering out observations with a count of 1 or less (i.e. keep koko > 1)... looks a lot better.  could go further with the filter, but this is likely to be good enough for government work.

```r
hist(as.matrix(pvals)[koko > 1], 100, xlab = "P-val", main = "Filter all comparisons of <1 count")
```

![plot of chunk unnamed-chunk-4](figure/unnamed-chunk-4.png) 

### a little more than half of tests/observations are thus filtered out (11,728,670 of 23,639,044 remain).  the fdr correction gets quite a bit better.

```r
length(koko[koko > 1])
```

```
## [1] 11728670
```

```r
length(koko)
```

```
## [1] 23639044
```

```r
filterps = as.matrix(pvals[koko > 1])

# now estimate qvals
qs = p.adjust(filterps, method = "fdr")
summary(qs)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.876   1.000   0.896   1.000   1.000
```

```r
length(which(qs < 0.01))
```

```
## [1] 1727
```

```r
length(which(qs < 0.05))
```

```
## [1] 15517
```
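a small toy illustration of why dropping hopeless tests helps the fdr correction. the numbers below are simulated and have nothing to do with the real p-values; the variable names are made up for the sketch. `p.adjust(..., method = "fdr")` is the Benjamini-Hochberg adjustment, which scales each sorted p-value by (number of tests / rank), so tests that can never be significant only inflate the q-values of everything else.

```r
set.seed(42)
# toy p-values: 50 real signals plus 5000 informative null tests
signal = runif(50, 0, 1e-04)
null_ok = runif(5000)
# 5000 more tests that can never be significant (p = 1), standing in for
# pairs whose counts are too small to test
hopeless = rep(1, 5000)

q_all = p.adjust(c(signal, null_ok, hopeless), method = "fdr")
q_filtered = p.adjust(c(signal, null_ok), method = "fdr")

# q-values of the 50 signals with and without the hopeless tests
signal_q_all = q_all[1:50]
signal_q_filtered = q_filtered[1:50]

# number of signals that survive a 0.01 cutoff in each case
n_all = sum(signal_q_all < 0.01)
n_filtered = sum(signal_q_filtered < 0.01)
```

the filter here is chosen without looking at the p-values themselves (it uses the counts), which is what keeps this kind of pre-filtering from being circular.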

```r

# figure out the threshold and use that to get an adjacency matrix of the
# network.
qthresh = 0.01
summary(filterps[qs < qthresh])
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 3.34e-07 1.00e-06 1.00e-06
```

```r

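# turn the q-value cutoff into a p-value cutoff, then binarize.  only pairs
# that passed the count filter (koko > 1) were tested, so only they can be
# links.
thresh = max(filterps[qs < qthresh])
adjmat = (pvals <= thresh & koko > 1) * 1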
cat("total link number =", sum(adjmat), "for raw network\n")
```

```
## total link number = 1727 for raw network
```
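an alternative way to get the same kind of adjacency matrix, sketched on toy data: write the q-values back into a matrix of the original shape and threshold that directly, instead of converting the q cutoff into a p cutoff. the toy matrices `pv` and `cnt`, the gene ids and the planted signals are all made up for illustration.

```r
set.seed(7)
genes = paste0("K", 1:5)  # toy gene ids (hypothetical)
pv = matrix(runif(25), 5, 5, dimnames = list(genes, genes))
cnt = matrix(rpois(25, 3), 5, 5, dimnames = list(genes, genes))
# plant two strong toy signals with counts that pass the filter
pv[1, 2] = 0
pv[3, 4] = 1e-06
cnt[1, 2] = 5
cnt[3, 4] = 4

# only pairs passing the count filter are tested
keep = cnt > 1

# q-values live in a matrix shaped like pv; untested pairs stay NA
qmat = matrix(NA_real_, nrow(pv), ncol(pv), dimnames = dimnames(pv))
qmat[keep] = p.adjust(pv[keep], method = "fdr")

# a link is a tested pair whose q-value is under the cutoff
qthresh = 0.01
adj = (!is.na(qmat) & qmat < qthresh) * 1
```

keeping `qmat` around also makes it easy to try other cutoffs later without recomputing anything.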

```r

# do a little tidying, writing, reprocessing of files
write.table(adjmat, file.path(working_dir, "raw_hgt_net_adj.txt"), quote = FALSE)
source("code/linklist_adjmat_thin.R")
links = adjmat_to_list(adjmat)
write.table(links, file.path(working_dir, "raw_hgt_net_list.cyto"), quote = FALSE, 
    row.names = FALSE, col.names = FALSE)
# remove(adjmat)
remove(links)

# at some point we need a DAG.  here is one potential command to get it; hope
# for a minimal feedback arc set (FAS) of 1.  if the net gets much bigger or
# more complicated, could reimplement using igraph for faster network
# functionality.

# system(paste('python code/arcset_remover.py',
#     file.path(working_dir, 'raw_hgt_net_list.cyto'), '2', sep = ' '))
# SAGE DOESN'T HAVE NETWORKX INSTALLED SO I AM RUNNING THIS LOCALLY


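# reconstruct adjacency matrix of net- now DAG.  read gene ids as character
# so they index adjmat by name rather than by factor code.
links = read.table(file.path(working_dir, "dag_hgt_net_list.cyto"), 
    stringsAsFactors = FALSE)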
```

```
## Warning: cannot open file
## 'noalphabeta2014_MOtree_results/dag_hgt_net_list.cyto': No such file or
## directory
```

```
## Error: cannot open the connection
```

```r
adjmat[adjmat > 0] = 0
for (link in seq_len(nrow(links))) {
    adjmat[links[link, 1], links[link, 2]] = 1
}
```

```
## Error: object 'links' not found
```

```r
cat("total link number =", sum(adjmat), "for DAG network\n")
```

```
## total link number = 0 for DAG network
```
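a sketch of the edge-list-to-adjacency step on toy data, with two guards that might have made the failure above easier to read: stop with a clear message when the edge list is missing, and fill the matrix in one vectorized assignment. the helpers `read_edges` and `fill_from_edges`, the toy gene ids and the temp file are assumptions for illustration; the real `dag_hgt_net_list.cyto` is assumed to be a headerless two-column file like the one written earlier.

```r
# read a two-column edge list, failing loudly if the file is not there
read_edges = function(path) {
    if (!file.exists(path)) {
        stop("edge list not found: ", path)
    }
    # character columns, so gene ids are matched by name
    read.table(path, stringsAsFactors = FALSE)
}

# zero out adj and set one entry per (from, to) pair
fill_from_edges = function(adj, edges) {
    adj[, ] = 0
    if (nrow(edges) > 0) {
        # a two-column character matrix indexes (row name, column name) pairs
        adj[cbind(as.character(edges[[1]]), as.character(edges[[2]]))] = 1
    }
    adj
}

# toy data: 4 genes and 3 directed links, written out and read back in
genes = c("K1", "K2", "K3", "K4")
adj_toy = matrix(1, 4, 4, dimnames = list(genes, genes))
toy_edges = data.frame(from = c("K1", "K2", "K1"), to = c("K2", "K3", "K4"))
toy_path = tempfile(fileext = ".cyto")
write.table(toy_edges, toy_path, quote = FALSE, row.names = FALSE, col.names = FALSE)

adj_toy = fill_from_edges(adj_toy, read_edges(toy_path))
cat("total link number =", sum(adj_toy), "for toy network\n")
```

with the guard in place, a missing file stops the chunk at the read rather than letting later steps run on an all-zero matrix.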

```r

rowed = rownames(adjmat)[rowSums(adjmat) > 0]
coled = colnames(adjmat)[colSums(adjmat) > 0]
both = sort(unique(append(rowed, coled)))
adjmat = adjmat[both, both]

# next, do a transitive reduction.  this will take a while.
source("code/transitivereduction.R")
```

```
## [1] "whatev"
```

```r
adjmat = transReduce(adjmat)
```

```
## Error: subscript out of bounds
```

```r

cat("total link number =", sum(adjmat), "for transitive-reduced DAG\n")
```

```
## total link number = 0 for transitive-reduced DAG
```
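for reference, a minimal base-R sketch of a transitive reduction on a 0/1 adjacency matrix. this is not the `transReduce` from `code/transitivereduction.R` (that file is not shown here); the function names are made up and the approach is one standard way to do it, assuming the input is a DAG: compute reachability with Warshall's algorithm, then drop every edge u -> v for which some other path of two or more steps from u to v exists. an empty matrix is returned untouched instead of failing.

```r
# reachability by paths of length >= 1 (Warshall's algorithm)
transitive_closure = function(adj) {
    reach = adj > 0
    for (k in seq_len(nrow(reach))) {
        # i reaches j if it already did, or if i reaches k and k reaches j
        reach = reach | outer(reach[, k], reach[k, ], "&")
    }
    reach
}

trans_reduce_sketch = function(adj) {
    # nothing to reduce in an empty network
    if (nrow(adj) == 0) {
        return(adj)
    }
    reach = transitive_closure(adj)
    if (any(diag(reach))) {
        stop("graph has a cycle; remove a feedback arc set first")
    }
    direct = (adj > 0) * 1
    # entry [u, v] > 0 means u has an edge to some w that reaches v,
    # i.e. there is a path of two or more steps from u to v
    two_step = (direct %*% (reach * 1)) > 0
    (direct > 0 & !two_step) * 1
}

# toy DAG: the chain A -> B -> C -> D plus two shortcut edges
nodes = c("A", "B", "C", "D")
toy = matrix(0, 4, 4, dimnames = list(nodes, nodes))
toy["A", "B"] = 1
toy["B", "C"] = 1
toy["C", "D"] = 1
toy["A", "C"] = 1  # implied by A -> B -> C
toy["A", "D"] = 1  # implied by A -> B -> C -> D
reduced = trans_reduce_sketch(toy)
cat("total link number =", sum(reduced), "for toy transitive-reduced DAG\n")
```

the closure loop does one full-matrix update per node, so it is cubic in the number of genes; fine for a net of this size, probably not for a much bigger one.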

```r

save(adjmat, file = file.path(working_dir, "hgt_net_dag_transreduced.Rdat"))
# load('hgt_net_dag_transreduced.Rdat')
# adjmat = tr_adj_mat
```


## describing the net.
### great! things coming along.  now, calculate some very basic network stats and look at them.

```r
indeg = colSums(adjmat)
outdeg = rowSums(adjmat)
summary(indeg)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 
```

```r
summary(outdeg)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 
```

### so it looks like in and out degree don't follow the same distribution.
### plot these values.

```r
hist(indeg, 100, xlab = "In-degree", main = "Filtered<1")
```

```
## Warning: no non-missing arguments to min; returning Inf
```

```
## Warning: no non-missing arguments to max; returning -Inf
```

```
## Error: hist.default: pretty() error, breaks=
```

```r
hist(outdeg, 50, xlab = "Out-degree", main = "Filtered<1")
```

```
## Warning: no non-missing arguments to min; returning Inf
```

```
## Warning: no non-missing arguments to max; returning -Inf
```

```
## Error: hist.default: pretty() error, breaks=
```


### no obvious weirdness there.  but is degree super-well explained by the parameters of the genes in question (prevalence, gain count)?  that would be a bad sign.  read in these parameters.


```r
real_genes = read.table(file.path(paste(working_dir, "_null_simed_genes_new", 
    sep = ""), "real_gene_bins_1M.txt"), header = TRUE)
```

```
## Warning: cannot open file
## 'noalphabeta2014_MOtree_results_null_simed_genes_new/real_gene_bins_1M.txt':
## No such file or directory
```

```
## Error: cannot open the connection
```
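the path in the failed read above has no leading `working_dir` component, unlike the paths used for the p-value splits. a sketch of building it the same way as those, under the assumption (not confirmed here) that `real_gene_bins_1M.txt` sits in the same directory as the splits. the names `sim_dir` and `bins_file` are made up for the sketch.

```r
working_dir = "noalphabeta2014_MOtree_results"

# same directory layout as the p-value splits (assumed location of the file)
sim_dir = file.path(working_dir, paste0(working_dir, "_null_simed_genes_new"))
bins_file = file.path(sim_dir, "real_gene_bins_1M.txt")

# only read when the file is really there; otherwise say where we looked
if (file.exists(bins_file)) {
    real_genes = read.table(bins_file, header = TRUE)
} else {
    message("not found: ", bins_file)
}
```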

```r
cor.test(real_genes[, "prevalence"], outdeg)
```

```
## Error: object 'real_genes' not found
```

```r
cor.test(real_genes[, "prevalence"], outdeg, method = "spearman")
```

```
## Error: object 'real_genes' not found
```

```r
cor.test(real_genes[, "gain_num"], indeg)
```

```
## Error: object 'real_genes' not found
```

```r
cor.test(real_genes[, "gain_num"], indeg, method = "spearman")
```

```
## Error: object 'real_genes' not found
```


### so some slight correlations, but nothing like the crazy previous observations.  this suggests we were right that these parameters affect POWER but are not confounders for the detection of edges themselves.
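one thing to watch in these correlations: `adjmat` was cut down to the genes that have at least one link, so the degree vectors may be shorter than, and ordered differently from, the gene parameter table. a toy sketch of matching the two by gene id before testing. it assumes the parameter table carries gene ids as row names, which is a guess about `real_genes`; all values and names below are made up.

```r
genes = paste0("K", 1:8)  # toy gene ids (hypothetical)
# toy parameter table with gene ids as row names
params = data.frame(prevalence = c(10, 40, 25, 5, 60, 33, 18, 52), 
    gain_num = c(2, 9, 4, 1, 12, 7, 3, 10), row.names = genes)

# toy degree vectors, named by gene, for the genes that are in the network
outdeg_toy = c(K5 = 6, K2 = 4, K8 = 5, K3 = 2, K1 = 1)
indeg_toy = c(K5 = 3, K2 = 5, K8 = 1, K3 = 4, K1 = 2)

# genes present in both, used to index both sides in the same order
shared = intersect(names(outdeg_toy), rownames(params))

res_out = cor.test(params[shared, "prevalence"], outdeg_toy[shared], method = "spearman")
res_in = cor.test(params[shared, "gain_num"], indeg_toy[shared], method = "spearman")
```

if the two vectors had different lengths `cor.test` would stop with an error, but equal lengths in a different gene order would silently give a meaningless correlation, so matching by name seems worth the extra line.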

### plot the values.

```r
plot(real_genes[, "gain_num"], indeg, xlab = "Gain count", ylab = "In-degree", 
    main = "filter<1, q<.01", pch = ".")
```

```
## Error: object 'real_genes' not found
```

```r
plot(real_genes[, "prevalence"], outdeg, xlab = "Prevalence", ylab = "Out-degree", 
    main = "filter<1, q<.01", pch = ".")
```

```
## Error: object 'real_genes' not found
```

```r


# and the relationship between the 2 measures of degree?
plot(jitter(outdeg), jitter(indeg), ylab = "In-degree", xlab = "Out-degree", 
    main = "filter<1, q<.01", pch = ".")
```

```
## Warning: no non-missing arguments to min; returning Inf
```

```
## Warning: no non-missing arguments to max; returning -Inf
```

```
## Warning: no non-missing arguments to min; returning Inf
```

```
## Warning: no non-missing arguments to max; returning -Inf
```

```
## Error: need finite 'xlim' values
```

```r
cor.test(outdeg, indeg, method = "spearman")
```

```
## Error: not enough finite observations
```





