Results

Some results from the SBIR project. We first load in the data that we’ll analyze and visualize.

# load in requisite packages
library(tidyverse)
library(ggthemes)
library(patchwork)
library(rms)
library(glue)
library(knitr)
library(kableExtra)

# load in precomputed data
scores = read_csv("../data/scores.csv")
coverage = read_csv("../data/coverage.csv")
concordance = read_csv("../data/concordance.csv")
binned_r2s = read_csv("../data/binned_r2s.csv")
random_samples = read_csv("../data/all.random.samples", col_names = c("sample"))
bgi_cl = read_csv("../data/bgi-cell-lines.csv", col_names = c("cell_line"))

Recall that we sequenced 120 individuals (60 AFR and 60 EUR) in triplicate at 0.5x and 1x on Illumina machines, and a subset of 30 from each of those populations at 1x on BGI machines (with no replicates). We also sequenced NA12878 30 times on Illumina machines at a target coverage of 1x.

There are a couple of views we can take of the data. The most obvious one is to simply look at every single sample that we sequenced. Another is, for each cell line, to take a random sample of one replicate for each “assay type” (of which there are 4: Illumina 1x, Illumina 0.5x, Illumina GSA, BGI 1x) and then compare the same stats for the same cell lines across assay types.
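As a sketch of that second view, one random replicate per cell line and assay type can be drawn with dplyr; the toy data frame here is an illustrative stand-in with the same `cell_line`/`assay_type`/`sample` columns as the real data, not the actual sampling code we used:

```r
library(dplyr)

# toy stand-in: two cell lines, two assay types, two replicates each
d = tibble::tibble(
  cell_line  = rep(c("HG00096", "HG00101"), each = 4),
  assay_type = rep(c("ilmn_1x", "ilmn_1x", "ilmn_0.5x", "ilmn_0.5x"), 2),
  sample     = paste0("s", 1:8)
)

set.seed(1)
picked = d %>%
  group_by(cell_line, assay_type) %>%  # one draw per cell line x assay type
  slice_sample(n = 1) %>%
  ungroup()
picked
```

This yields one row per (cell line, assay type) pair, which is the shape needed for head-to-head comparisons across assay types.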

We treated the 1000 Genomes Phase 3 (1KGP3) release on GRCh37 as the gold standard “truth” against which we computed accuracy metrics; for all assay types, imputation was performed in a leave-one-out manner.

We will proceed by giving a quick outline of the data that we have, then introduce “effective coverage” and present cohort-level results. After that we will discard replicates, take a single iteration of each cell line for each assay type, and examine apples-to-apples results.

Data

We have coverage data on our samples:

head(coverage)
## # A tibble: 6 x 12
##   cell_line sample bases_all bases_deduped bases_deduped_m… coverage_all
##   <chr>     <chr>      <dbl>         <dbl>            <dbl>        <dbl>
## 1 HG00096   HG000…    2.29e9    2144581577       2113404909        0.693
## 2 HG00096   HG000…    2.32e9    2119265370       2106079748        0.704
## 3 HG00101   HG001…    2.65e9    2481927204       2460618235        0.803
## 4 HG00101   HG001…    4.81e9    4362939338       4346604158        1.46 
## 5 HG00101   HG001…    2.85e9    2586347177       2574220820        0.865
## 6 HG00101   HG001…    4.98e9    4494825154       4477408663        1.51 
## # … with 6 more variables: coverage_deduped <dbl>,
## #   coverage_deduped_mapped <dbl>, pileup_covered <dbl>, assay_type <chr>,
## #   prop_pileup_cov <dbl>, eff_cov <dbl>

We have concordance information at the allele frequency bin level, broken down by a number of variables:

glimpse(concordance)
## Rows: 68,820
## Columns: 39
## $ cell_line               <chr> "HG00096", "HG00096", "HG00096", "HG00096", "…
## $ sample                  <chr> "HG00096-0-0-1-0", "HG00096-0-0-1-0", "HG0009…
## $ stat_type               <chr> "GCsAF", "GCsAF", "GCsAF", "GCsAF", "GCsAF", …
## $ set_id                  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ af_midpoint             <dbl> 0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.0…
## $ rr_hom_matches          <dbl> 65522996, 2558239, 1167426, 704983, 488206, 3…
## $ ra_het_matches          <dbl> 36669, 27024, 23208, 22665, 22782, 24777, 258…
## $ aa_hom_matches          <dbl> 219, 485, 1070, 792, 926, 988, 1296, 1384, 15…
## $ rr_hom_mismatches       <dbl> 17967, 3811, 2162, 1553, 1337, 1225, 1268, 12…
## $ ra_het_mismatches       <dbl> 34500, 6801, 3984, 2792, 2346, 2304, 2349, 21…
## $ aa_hom_mismatches       <dbl> 193, 229, 172, 160, 158, 185, 180, 204, 220, …
## $ dosage_rsquared         <dbl> 0.354936, 0.711463, 0.802475, 0.841478, 0.861…
## $ n_genotypes             <dbl> 65612544, 2596589, 1198022, 732945, 515755, 3…
## $ site_type               <chr> "allsites", "allsites", "allsites", "allsites…
## $ tech                    <chr> "sequence", "sequence", "sequence", "sequence…
## $ pop                     <chr> "GBR", "GBR", "GBR", "GBR", "GBR", "GBR", "GB…
## $ super_pop               <chr> "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EU…
## $ rr_hom_concordance      <dbl> 0.9997259, 0.9985125, 0.9981515, 0.9978020, 0…
## $ ra_het_concordance      <dbl> 0.5152384, 0.7989357, 0.8534863, 0.8903249, 0…
## $ aa_hom_concordance      <dbl> 0.5315534, 0.6792717, 0.8615137, 0.8319328, 0…
## $ total_matches           <dbl> 65559884, 2585748, 1191704, 728440, 511914, 3…
## $ total_sites             <dbl> 65612544, 2596589, 1198022, 732945, 515755, 3…
## $ overall_concordance     <dbl> 0.9991974, 0.9958249, 0.9947263, 0.9938536, 0…
## $ nr_concordance          <dbl> 0.4119355, 0.7173142, 0.7935024, 0.8388885, 0…
## $ nrd_numerator           <dbl> 52660, 10841, 6318, 4505, 3841, 3714, 3797, 3…
## $ nrd_denominator         <dbl> 89548, 38350, 30596, 27962, 27549, 29479, 309…
## $ rr_hom_count            <dbl> 65540963, 2562050, 1169588, 706536, 489543, 3…
## $ ra_het_count            <dbl> 71169, 33825, 27192, 25457, 25128, 27081, 282…
## $ aa_hom_count            <dbl> 412, 714, 1242, 952, 1084, 1173, 1476, 1588, …
## $ assay_type              <chr> "ilmn_0.5x", "ilmn_0.5x", "ilmn_0.5x", "ilmn_…
## $ bases_all               <dbl> 2287104286, 2287104286, 2287104286, 228710428…
## $ bases_deduped           <dbl> 2144581577, 2144581577, 2144581577, 214458157…
## $ bases_deduped_mapped    <dbl> 2113404909, 2113404909, 2113404909, 211340490…
## $ coverage_all            <dbl> 0.693062, 0.693062, 0.693062, 0.693062, 0.693…
## $ coverage_deduped        <dbl> 0.649873, 0.649873, 0.649873, 0.649873, 0.649…
## $ coverage_deduped_mapped <dbl> 0.640426, 0.640426, 0.640426, 0.640426, 0.640…
## $ pileup_covered          <dbl> 26484435, 26484435, 26484435, 26484435, 26484…
## $ prop_pileup_cov         <dbl> 0.3128861, 0.3128861, 0.3128861, 0.3128861, 0…
## $ eff_cov                 <dbl> 0.3752553, 0.3752553, 0.3752553, 0.3752553, 0…

We also have polygenic risk scores for breast cancer (brca) and coronary artery disease (CAD) which were calculated off the imputed dosages for each sample:

head(scores)
## # A tibble: 6 x 19
##   cell_line sample trait score true_score pop   super_pop assay_type
##   <chr>     <chr>  <chr> <dbl>      <dbl> <chr> <chr>     <chr>     
## 1 HG00096   HG000… brca  0.131      0.139 GBR   EUR       ilmn_GSA  
## 2 HG00096   HG000… cad   0.460      0.370 GBR   EUR       ilmn_GSA  
## 3 HG00096   HG000… brca  0.147      0.139 GBR   EUR       ilmn_GSA  
## 4 HG00096   HG000… cad   0.436      0.370 GBR   EUR       ilmn_GSA  
## 5 HG00096   HG000… brca  0.145      0.139 GBR   EUR       ilmn_GSA  
## 6 HG00096   HG000… cad   0.478      0.370 GBR   EUR       ilmn_GSA  
## # … with 11 more variables: squared_error <dbl>, error <dbl>, bases_all <dbl>,
## #   bases_deduped <dbl>, bases_deduped_mapped <dbl>, coverage_all <dbl>,
## #   coverage_deduped <dbl>, coverage_deduped_mapped <dbl>,
## #   pileup_covered <dbl>, prop_pileup_cov <dbl>, eff_cov <dbl>

We have a list of randomly selected samples, one for each assay type, so that we can compare head-to-head stats across assay types:

head(random_samples)
## # A tibble: 6 x 1
##   sample   
##   <chr>    
## 1 HG00101_1
## 2 HG00102_2
## 3 HG00105_2
## 4 HG00107_3
## 5 HG00108_1
## 6 HG00110_3

as well as the imputation \(r^2\)s as calculated for each assay type between those individuals and the gold standard data, binned by minor allele frequency:

head(binned_r2s)
## # A tibble: 6 x 8
##   cohort   af_bin   mean_r2 n_snps af_bin_index af_midpoint super_pop assay_type
##   <chr>    <chr>      <dbl>  <dbl>        <dbl>       <dbl> <chr>     <chr>     
## 1 afr-0-i… [0,0.01]   0.618 8.77e6            1       0.005 AFR       ilmn_0.5x 
## 2 afr-0-i… (0.01,0…   0.853 2.06e6            2       0.015 AFR       ilmn_0.5x 
## 3 afr-0-i… (0.02,0…   0.873 1.03e6            3       0.025 AFR       ilmn_0.5x 
## 4 afr-0-i… (0.03,0…   0.880 6.59e5            4       0.035 AFR       ilmn_0.5x 
## 5 afr-0-i… (0.04,0…   0.880 4.80e5            5       0.045 AFR       ilmn_0.5x 
## 6 afr-0-i… (0.05,0…   0.879 3.80e5            6       0.055 AFR       ilmn_0.5x

We also have a list of the cell lines that BGI ran (recall that 58 out of the 60 succeeded, so it’s useful to note which didn’t).

head(bgi_cl)
## # A tibble: 6 x 1
##   cell_line
##   <chr>    
## 1 HG00102  
## 2 HG00105  
## 3 HG00107  
## 4 HG00111  
## 5 HG00131  
## 6 HG00132

Let’s start from the bottom, i.e., the details of what we got out of the sequencer.

Samples

As mentioned previously, we have 60 AFR and 60 EUR individuals in this dataset: specifically,

concordance %>% distinct(cell_line, super_pop, pop)
## # A tibble: 120 x 3
##    cell_line pop   super_pop
##    <chr>     <chr> <chr>    
##  1 HG00096   GBR   EUR      
##  2 HG00101   GBR   EUR      
##  3 HG00102   GBR   EUR      
##  4 HG00105   GBR   EUR      
##  5 HG00107   GBR   EUR      
##  6 HG00108   GBR   EUR      
##  7 HG00110   GBR   EUR      
##  8 HG00111   GBR   EUR      
##  9 HG00116   GBR   EUR      
## 10 HG00119   GBR   EUR      
## # … with 110 more rows

We can group these by population and superpopulation and print them out to a LaTeX table for the manuscript:

concordance %>%
  distinct(cell_line, super_pop, pop) %>% 
  group_by(pop, super_pop) %>% 
  tally() %>% 
  arrange(super_pop, -n) %>% 
  rename( #rename for human readable
    Population = pop,
    `Super Population` = super_pop,
  ) %>% 
  kable("latex", booktabs = TRUE, linesep = "") %>% 
  cat(file = "../paper/src/tabs/sample-pops.tex")

We also write out the number of samples for each cell line for each experiment which passed QC.

# count how many samples for each experiment passed QC 
samples_passing_qc = concordance %>% 
  distinct(sample, cell_line, pop, super_pop, assay_type) %>% 
  group_by(cell_line, pop, super_pop, assay_type) %>% 
  tally() %>% 
  pivot_wider(names_from = assay_type, values_from = n) %>% 
  rename(
    # rename to human-readable
    `Cell line` = cell_line,
    Population = pop,
    `Super population` = super_pop,
    `Exp. A` = ilmn_0.5x,
    `Exp. B` = ilmn_1x,
    `Exp. C` = ilmn_1x_repl,
    `Exp. D` = bgi_1x,
    `Exp. E` = ilmn_GSA
  ) %>% 
  select(`Cell line`, Population, `Super population`, sort(tidyselect::peek_vars()))
# replace NA with something more human-readable: depending on the experiment, either zero it out (ran but did not succeed) or use a dot (cell line was not part of that experiment)
samples_passing_qc_print = samples_passing_qc %>% 
  mutate(`Exp. D` = as.character(`Exp. D`)) %>% 
  mutate(`Exp. D` = if_else(`Cell line` %in% bgi_cl$cell_line, `Exp. D`, ".")) %>% 
  replace_na(
    list(
      `Exp. A` = 0,
      `Exp. B` = 0,
      `Exp. C` = ".",
      `Exp. D` = 0,
      `Exp. E` = 0
    )
  ) %>% 
  arrange(
    `Super population`
  )

# write out 
samples_passing_qc_print %>% 
  kable(
    "latex", 
    booktabs = TRUE, 
    longtable = TRUE, 
    caption = "This table shows the number of samples for each cell line in each experiment which passed QC. A dot indicates that the particular cell line was not part of the experiment.",
    label = "supptab:samples-passing-qc"        
  ) %>% 
  kable_styling(
    latex_options = c("repeat_header", "HOLD_position")
    ) %>% 
  cat(file = "../paper/src/tabs/samples-passing-qc.tex")

Effective coverage

Even though we sequence samples to target coverages, there’s always variation in what you actually get back. For example, we can compare the expected number of bases sequenced for each sample with the actual number. We’ve precomputed the overall coverage (bases sequenced / size of genome), along with deduped coverage, deduped mapped coverage, and more; this is stored in coverage:

head(coverage)
## # A tibble: 6 x 12
##   cell_line sample bases_all bases_deduped bases_deduped_m… coverage_all
##   <chr>     <chr>      <dbl>         <dbl>            <dbl>        <dbl>
## 1 HG00096   HG000…    2.29e9    2144581577       2113404909        0.693
## 2 HG00096   HG000…    2.32e9    2119265370       2106079748        0.704
## 3 HG00101   HG001…    2.65e9    2481927204       2460618235        0.803
## 4 HG00101   HG001…    4.81e9    4362939338       4346604158        1.46 
## 5 HG00101   HG001…    2.85e9    2586347177       2574220820        0.865
## 6 HG00101   HG001…    4.98e9    4494825154       4477408663        1.51 
## # … with 6 more variables: coverage_deduped <dbl>,
## #   coverage_deduped_mapped <dbl>, pileup_covered <dbl>, assay_type <chr>,
## #   prop_pileup_cov <dbl>, eff_cov <dbl>

which we recode to include Experiment notation:

coverage = coverage %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "ilmn_1x_repl" ~ "C",
      assay_type == "bgi_1x" ~ "D"
    ),
    
    # clean up names
    `Mapped coverage` = coverage_deduped_mapped,
    `Effective coverage` = eff_cov
    )

and we can plot out the target vs realized coverages:

realized_plot = coverage %>%
  ggplot(aes(x = coverage_all, fill = assay_type)) + 
  geom_density(alpha = 0.7) + theme_few() + scale_fill_tableau() + 
  labs(title = "Overall realized coverage") 

deduped_plot = coverage %>% 
  ggplot(aes(x = coverage_deduped, fill = assay_type)) + 
  geom_density(alpha = 0.7) + theme_few() + scale_fill_tableau() + 
  labs(title = "Overall deduped coverage") 

mapped_deduped_plot = coverage %>% 
  ggplot(aes(x = coverage_deduped_mapped, fill = assay_type)) + 
  geom_density(alpha = 0.7) + theme_few() + scale_fill_tableau() + 
  labs(title = "Overall mapped deduped coverage")

realized_plot / deduped_plot / mapped_deduped_plot

However, the raw number of bases is not the only thing that matters; ideally, shotgun sequencing distributes reads randomly across the whole genome. We can illustrate how closely this hews to reality by counting the number of imputed sites actually covered by at least one read and seeing how close that count is to its expected value.

We can obtain this by treating the number of reads \(k\) on a site as a Poisson random variable (with PMF \(f(k; \lambda) = \lambda^k e^{-\lambda} / k!\)) where the target coverage is \(\lambda\). Then

\[ \begin{aligned} P(k > 0; \lambda) &= \sum_{k=1}^{\infty}f(k; \lambda) \\ &= 1 - f(0; \lambda) \\ &= 1 - e^{-\lambda}. \end{aligned} \]

So, for example, for a given site on a genome sequenced at a target coverage of \(\lambda = 1\), the probability that it has at least one read on it is \(P(k > 0 ; 1) = 1 - e ^ {-1} = 0.632120559\). Then, assuming the \(n\) sites we impute are IID, we would expect to see around \(n(1 - e^{-\lambda})\) sites covered with at least one read. From this the idea of an “effective coverage” follows naturally for a sequenced genome: i.e., the value of \(\lambda_{\mathrm{eff}}\) such that \(f_{\mathrm{covered}} = 1 - e^{-\lambda_{\mathrm{eff}}}\) holds, where \(f_{\mathrm{covered}}\) is the fraction of sites actually covered by at least one read.

More formally, for a single site we can model whether it was covered by at least one read as a Bernoulli trial, with coverage by at least one read counted as a “success”; as derived above, the probability of success is \(p = 1 - e^{-\lambda}\). The number of pileup sites covered by at least one read over the whole genome (denoted \(X\)) can then be modeled by a binomial distribution with PMF

\[ \begin{align} f(k; n, p) = \Pr(X = k) = {n \choose k}p^k (1 - p)^{n-k} = {n \choose k}(1 - e^{-\lambda})^k ( e^{-\lambda})^{n - k} = {n \choose k}(1 - e^{-\lambda})^k e^{\lambda(k -n)} \end{align} \] where we plug in \(p = 1 - e ^ {-\lambda}\). Then \(\mathrm{E}[X] = np = n(1 - e^{-\lambda})\) and \(\mathrm{Var}(X) = np(1 - p) = n(1-e^{-\lambda})e^{-\lambda}\). Defining \(f_{\mathrm{covered}} = X / n\), we have \(\mathrm{E}[f_{\mathrm{covered}}] = 1 - e^{-\lambda}\).

We therefore obtained \(f_{\mathrm{covered}}\) for each sample and calculated \(\lambda_{\mathrm{eff}} = -\ln(1 - f_{\mathrm{covered}})\). Plotting, we find that the effective coverage is consistently lower than the target coverage, even though our overall realized coverage is higher than the target, indicating that sequencing often deviates from truly spatially uniform sampling:
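The forward map and its inversion are each one line; a minimal sketch (function names here are illustrative, not anything in our pipeline):

```r
# fraction of sites expected to be covered by >= 1 read at true coverage lambda
expected_frac_covered = function(lambda) 1 - exp(-lambda)

# invert to get the effective coverage implied by an observed covered fraction
effective_coverage = function(f_covered) -log(1 - f_covered)

expected_frac_covered(1)                          # 1 - exp(-1), i.e. ~0.632
effective_coverage(expected_frac_covered(0.75))   # round-trips back to 0.75
```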

Let’s write these density plots out to files as well:

# nominal mapped coverage 
coverage %>% 
  ggplot(aes(x = `Mapped coverage`, fill = Experiment)) + 
  geom_density(alpha = 0.7) + 
  scale_fill_tableau() + theme_few() + 
  labs(
    title = "Distribution of nominal mapped coverage by experiment"
  )

ggsave("../paper/src/figs/density-mapped-coverage.pdf")
## Saving 6 x 4 in image
# effective  coverage 
coverage %>% 
  ggplot(aes(x = `Effective coverage`, fill = Experiment)) + 
  geom_density(alpha = 0.7) + 
  scale_fill_tableau() + theme_few() + 
  labs(
    title = "Distribution of effective coverage by experiment"
  )

ggsave("../paper/src/figs/density-effective-coverage.pdf")
## Saving 6 x 4 in image

Indeed, we can also plot the realized overall mapped bases against the effective coverage:

## Saving 5 x 4 in image

and see that there’s nontrivial variation in effective coverage; however, given a sequencing technology type (e.g., Illumina sequencing), effective coverage tracks overall coverage quite well:

# calculate Pearson's correlation between the points plotted above
coverage %>% filter(assay_type != "bgi_1x") %>% summarise(`r between overall coverage and effective coverage` = cor(coverage_all, eff_cov))
## # A tibble: 1 x 1
##   `r between overall coverage and effective coverage`
##                                                 <dbl>
## 1                                               0.926

We can summarize the effective and nominal coverages by experiment:

# group by assay type, then get mean and std
coverage_summary = coverage %>% 
  group_by(assay_type) %>% 
  summarise(
    mean_mapped_coverage = mean(coverage_deduped_mapped),
    std_mapped_coverage = sqrt(var(coverage_deduped_mapped)),
    
    # eff cov
    mean_eff_cov = mean(eff_cov),
    std_eff_cov = sqrt(var(eff_cov)),
    
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
coverage_summary
## # A tibble: 4 x 6
##   assay_type  mean_mapped_cove… std_mapped_cover… mean_eff_cov std_eff_cov     n
##   <chr>                   <dbl>             <dbl>        <dbl>       <dbl> <int>
## 1 bgi_1x                  1.26             0.0892        1.24        0.106    58
## 2 ilmn_0.5x               0.668            0.318         0.412       0.188   351
## 3 ilmn_1x                 1.25             0.263         0.717       0.180   350
## 4 ilmn_1x_re…             1.20             0.289         0.526       0.112    30

which summarizes the coverage trends by assay type.

Cohort trends

We can first examine cohort-level trends by assay and effective coverage; we want to see to what extent this “effective coverage” construct actually tracks performance. But first, a word about non-reference concordance.

Non-reference Concordance (NRC)

Since variants in 1KGP3 are by and large found at low minor allele frequencies, overall concordance can be a little less than informative: low-MAF variants can be imputed to hom ref with high confidence, since the prior is overwhelmingly against there being variation at that site.
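As a toy illustration of how stark this is (the numbers here are hypothetical, computed under Hardy–Weinberg equilibrium, not from our data): at MAF 0.005, a caller that blindly emits hom ref for every site is still right about 99% of the time.

```r
maf = 0.005
# Hardy-Weinberg genotype frequencies at this minor allele frequency
p_hom_ref = (1 - maf)^2          # ~0.990
p_het     = 2 * maf * (1 - maf)
p_hom_alt = maf^2

# an all-hom-ref caller matches truth exactly when truth is hom ref,
# so its overall concordance is ~99% -- yet it never gets a non-ref site right
p_hom_ref
```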

We therefore consider non-reference concordance, defined as follows: taking the variant calls at the same site between two normed VCFs, we consider all possible pairwise combinations of genotype calls (order doesn’t matter):

        0  1  2
   0    a  b  c
   1    d  e  f
   2    g  h  i

then we write non-reference concordance \(\mathrm{NRC} = (e + i) / (b + c + d + e + f + g + h + i)\); in other words, it’s the concordance at all sites where the pair of genotypes is not 0 and 0.
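A minimal sketch of that definition as a function of the 3×3 table above (treating cell [i, j] as the count of sites with genotype i−1 in one callset and j−1 in the other; the orientation is immaterial to the formula):

```r
# non-reference concordance from a 3x3 genotype agreement table
nrc = function(tab) {
  matches = tab[2, 2] + tab[3, 3]   # e + i: agreeing het and hom-alt pairs
  total   = sum(tab) - tab[1, 1]    # every cell except the (0, 0) cell a
  matches / total
}

# toy table: 40 matching hets, 10 matching hom alts, 10 het/hom-ref discordances
tab = matrix(c(100, 10, 0,
               0,  40, 0,
               0,   0, 10), nrow = 3, byrow = TRUE)
nrc(tab)  # (40 + 10) / 60
```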

We can compute this from our concordance data for each sample, taking care to group by whether the site is imputed confidently and also whether the variant is an indel or a SNP:

# sequence non-reference concordance by sample
seq_conc_bysample = concordance %>% 
  filter(tech == "sequence") %>% 
  group_by(
    cell_line, sample, pop, super_pop, assay_type, eff_cov, coverage_deduped_mapped, site_type, stat_type
    ) %>% 
  summarise(
    # non-reference and overall concordance
    nrc = 1 - sum(nrd_numerator) / sum(nrd_denominator),
    conc = sum(total_matches) / sum(total_sites)
    )
## `summarise()` regrouping output by 'cell_line', 'sample', 'pop', 'super_pop', 'assay_type', 'eff_cov', 'coverage_deduped_mapped', 'site_type' (override with `.groups` argument)
head(seq_conc_bysample)
## # A tibble: 6 x 11
## # Groups:   cell_line, sample, pop, super_pop, assay_type, eff_cov,
## #   coverage_deduped_mapped, site_type [3]
##   cell_line sample pop   super_pop assay_type eff_cov coverage_dedupe… site_type
##   <chr>     <chr>  <chr> <chr>     <chr>        <dbl>            <dbl> <chr>    
## 1 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640 allsites 
## 2 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640 allsites 
## 3 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640 passing  
## 4 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640 passing  
## 5 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638 allsites 
## 6 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638 allsites 
## # … with 3 more variables: stat_type <chr>, nrc <dbl>, conc <dbl>

We can then plot the broken-down sample-level NRCs against effective coverage for SNPs, stratified by population and by whether sites are filtered to passing. Let’s first make a copy of the dataframe with more human-readable labels, so we can plot it without manually relabeling the plots.

# make a copy of by-sample concordance, replacing the labels to be plotted with human-friendly values
seq_conc_bysample_plot = seq_conc_bysample %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "ilmn_1x_repl" ~ "C",
      assay_type == "bgi_1x" ~ "D"
    ),
    # just capitalize it
    NRC = nrc,
    Concordance = conc,
    # unfiltered status 
    `Site Type` = case_when(
      site_type == "allsites" ~ "Unfiltered",
      site_type == "passing" ~ "Filtered"
    ),
    # coverage columns
    `Nominal Mapped Coverage` = coverage_deduped_mapped,
    `Effective Coverage` = eff_cov
  )
seq_conc_bysample_plot
## # A tibble: 3,156 x 17
## # Groups:   cell_line, sample, pop, super_pop, assay_type, eff_cov,
## #   coverage_deduped_mapped, site_type [1,578]
##    cell_line sample pop   super_pop assay_type eff_cov coverage_dedupe…
##    <chr>     <chr>  <chr> <chr>     <chr>        <dbl>            <dbl>
##  1 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640
##  2 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640
##  3 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640
##  4 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.375            0.640
##  5 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638
##  6 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638
##  7 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638
##  8 HG00096   HG000… GBR   EUR       ilmn_0.5x    0.380            0.638
##  9 HG00101   HG001… GBR   EUR       ilmn_0.5x    0.534            0.746
## 10 HG00101   HG001… GBR   EUR       ilmn_0.5x    0.534            0.746
## # … with 3,146 more rows, and 10 more variables: site_type <chr>,
## #   stat_type <chr>, nrc <dbl>, conc <dbl>, Experiment <chr>, NRC <dbl>,
## #   Concordance <dbl>, `Site Type` <chr>, `Nominal Mapped Coverage` <dbl>,
## #   `Effective Coverage` <dbl>
# plot NRC by effective coverage for SNPs
seq_conc_bysample_plot %>% filter(stat_type == "GCsAF") %>% 
  ggplot(aes(x = `Effective Coverage`, y = NRC, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "Non-reference concordance by effective coverage for SNPs")

ggsave("../paper/src/figs/effcov_nrc_snps.pdf")
## Saving 7 x 6 in image

and the same for indels:

# filtered NRC by indels 
seq_conc_bysample_plot %>% filter(stat_type == "GCiAF") %>%
  ggplot(aes(x = `Effective Coverage`, y = nrc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "Non-reference concordance by effective coverage for indels")

ggsave("../paper/src/figs/effcov_nrc_indels.pdf")
## Saving 7 x 6 in image

In both of these we observe, firstly, that SNPs are consistently imputed with greater accuracy with respect to NRC; secondly, that imputation quality is relatively well calibrated (i.e., the NRC at confidently imputed “passing” sites is consistently higher than at unfiltered sites); thirdly, that AFR samples tend to be imputed with lower NRC; and finally, that NRC depends quite a bit on effective coverage.

The same pattern holds for overall concordance as well. For SNPs, we see:

# plot overall concordance for snps
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCsAF") %>% 
  ggplot(aes(x = `Effective Coverage`, y = conc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + 
  scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "Overall concordance by effective coverage for SNPs")

ggsave("../paper/src/figs/effcov_oconc_snps.pdf")
## Saving 7 x 6 in image

and the same for indels

# plot overall concordance for indels
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCiAF") %>% 
  ggplot(aes(x = `Effective Coverage`, y = conc, color = Experiment)) + 
  geom_point(alpha = 0.7) + theme_few() + scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) +
  labs(title = "Overall concordance by effective coverage for indels")

ggsave("../paper/src/figs/effcov_oconc_indels.pdf")
## Saving 7 x 6 in image

For nominal coverage, as measured by deduped mapped bases, we can make the same NRC plot for SNPs:

# plot NRC for snps for unfiltered snps
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCsAF") %>% 
  ggplot(aes(x = `Nominal Mapped Coverage`, y = nrc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + 
  scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "NRC by nominal mapped coverage for SNPs")

ggsave("../paper/src/figs/nomcov_nrc_snps.pdf")
## Saving 7 x 6 in image

and indels

# plot NRC for indels for nominal  coverage
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCiAF") %>% 
  ggplot(aes(x = `Nominal Mapped Coverage`, y = nrc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + 
  scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "NRC by nominal mapped coverage for indels")

ggsave("../paper/src/figs/nomcov_nrc_indels.pdf")
## Saving 7 x 6 in image

And the same for overall concordance but by nominal coverage, for SNPs

# plot overall concordance for snps by nominal coverage
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCsAF") %>% 
  ggplot(aes(x = `Nominal Mapped Coverage`, y = conc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + 
  scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "Overall concordance by nominal mapped coverage for SNPs")

ggsave("../paper/src/figs/nomcov_conc_snps.pdf")
## Saving 7 x 6 in image

and indels

# plot overall concordance for indels by nominal coverage
seq_conc_bysample_plot %>% 
  filter(stat_type == "GCiAF") %>% 
  ggplot(aes(x = `Nominal Mapped Coverage`, y = conc, color = Experiment)) + 
  geom_point(alpha = 0.7) + 
  theme_few() + 
  scale_color_tableau() + 
  facet_grid(rows = vars(`Site Type`), cols = vars(super_pop)) + 
  labs(title = "Overall concordance by nominal mapped coverage for indels")

ggsave("../paper/src/figs/nomcov_conc_indels.pdf")
## Saving 7 x 6 in image

Fitting effective coverage and NRC

Let’s see whether nominal coverage or effective coverage better predicts NRC. We use the rms package to fit splines with knots at percentile locations along the x axis, as recommended by Harrell, and evaluate the spline fits by \(r^2\).

We restrict ourselves to EUR samples sequenced on Illumina machines and consider the NRC for unfiltered SNPs. Modeling the NRC as the response variable of a \(k=5\)-knot restricted cubic spline with knot locations at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of the \(x\)-axis (Harrell’s rule of thumb), we can plot the spline fits with 95% confidence intervals along the range of the predictor.

# filter down to relevant
cov_predict = seq_conc_bysample %>% 
  filter(stat_type == "GCsAF" & site_type == "allsites" & super_pop == "EUR") %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "ilmn_1x_repl" ~ "C",
      assay_type == "bgi_1x" ~ "D"
    )
  )
  

# define distribution
ddist = datadist(cov_predict, q.display = c(0, 1))
options(datadist = "ddist")

# fit and plot
model = ols(
  nrc ~ rcs(eff_cov, quantile(eff_cov, c(0, .05, 0.275, 0.5, 0.725, 0.95, 1))), 
  data = cov_predict, 
  x = TRUE, 
  y = TRUE
  )

model_r2 = signif(model$stats[4], 2)

eff_cov_model_plot = ggplot(Predict(model)) + 
  geom_point(data = cov_predict, aes(x = eff_cov, y = nrc, color = Experiment)) + 
  theme_few() + 
  scale_color_tableau() + 
  labs(
    x = "Effective Coverage",
    y = "NRC"
    ) + 
  annotate(
    "text", x = 1, y = .88, label = glue("R ^ 2 == {model_r2}"),
    parse = TRUE
  ) + 
  theme(legend.position = "none")

# model for nominal coverage
model1 =  ols(
  nrc ~ rcs(coverage_deduped_mapped, quantile(coverage_deduped_mapped,  c(0, .05, 0.275, 0.5, 0.725, 0.95, 1))),
  data = cov_predict, 
  x = TRUE, 
  y = TRUE
  )
model1_r2 = signif(model1$stats[4], 2)

nominal_cov_model_plot = ggplot(Predict(model1)) + 
  geom_point(data = cov_predict, aes(x = coverage_deduped_mapped, y = nrc, color = Experiment)) + 
  theme_few() + scale_color_tableau() + 
  labs(
    x = "Nominal Mapped Coverage",
    y = "NRC"
    ) + 
  annotate(
    "text", x = 1.5, y = .88, label = glue("R ^ 2 == {model1_r2}"),
    parse = TRUE
  )

# put these together
patched = (eff_cov_model_plot | nominal_cov_model_plot)
patched + plot_annotation(
  title = "NRC by effective coverage vs. NRC by nominal mapped coverage"
)

ggsave("../paper/src/figs/nrc_coverage_splines.pdf")
## Saving 10 x 6 in image

from which we observe that effective coverage is a better predictor for non-reference concordance. As such, coverage-dependent metrics will be considered relative to effective coverage rather than nominal mapped coverage in the following.

Polygenic risk scores and effective coverage

Overall pattern

Polygenic risk scores were calculated for brca and CAD for all individuals; the CAD score comprises ~1.7M variants while the brca score comprises only ~130. Treating the PRS calculated from the 1KGP3 gold standards as “truth”, we can compute the error of each imputed estimate and plot it against effective coverage.
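Concretely, the per-sample error columns amount to something like the following (the sign convention, imputed minus truth, is an assumption; the two toy rows just mimic the `score`/`true_score` columns shown above):

```r
library(dplyr)

# toy rows mimicking scores: imputed PRS vs the 1KGP3 "truth" PRS
s = tibble::tibble(score = c(0.131, 0.460), true_score = c(0.139, 0.370))

s = s %>% mutate(
  error = score - true_score,   # signed error of the imputed estimate
  squared_error = error^2
)
s
```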

Since the plots in this section will probably make it into the paper, let's start by recoding some variables as we did in the previous section:

# replace with human-readable experiment and coverage indications
scores_plot = scores %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "ilmn_1x_repl" ~ "C",
      assay_type == "bgi_1x" ~ "D", 
      assay_type == "ilmn_GSA" ~ "E"
    ),
    # coverage columns
    `Nominal Mapped Coverage` = coverage_deduped_mapped,
    `Effective Coverage` = eff_cov,
    # PRS names
    Trait = case_when(
      trait == "brca" ~ "Breast Cancer",
      trait == "cad" ~ "CAD"
    )
  )

scores_plot
## # A tibble: 2,294 x 23
##    cell_line sample trait  score true_score pop   super_pop assay_type
##    <chr>     <chr>  <chr>  <dbl>      <dbl> <chr> <chr>     <chr>     
##  1 HG00096   HG000… brca   0.131     0.139  GBR   EUR       ilmn_GSA  
##  2 HG00096   HG000… cad    0.460     0.370  GBR   EUR       ilmn_GSA  
##  3 HG00096   HG000… brca   0.147     0.139  GBR   EUR       ilmn_GSA  
##  4 HG00096   HG000… cad    0.436     0.370  GBR   EUR       ilmn_GSA  
##  5 HG00096   HG000… brca   0.145     0.139  GBR   EUR       ilmn_GSA  
##  6 HG00096   HG000… cad    0.478     0.370  GBR   EUR       ilmn_GSA  
##  7 HG00101   HG001… brca  -0.311    -0.229  GBR   EUR       ilmn_GSA  
##  8 HG00101   HG001… cad    0.148     0.0221 GBR   EUR       ilmn_GSA  
##  9 HG00101   HG001… brca  -0.333    -0.229  GBR   EUR       ilmn_GSA  
## 10 HG00101   HG001… cad    0.126     0.0221 GBR   EUR       ilmn_GSA  
## # … with 2,284 more rows, and 15 more variables: squared_error <dbl>,
## #   error <dbl>, bases_all <dbl>, bases_deduped <dbl>,
## #   bases_deduped_mapped <dbl>, coverage_all <dbl>, coverage_deduped <dbl>,
## #   coverage_deduped_mapped <dbl>, pileup_covered <dbl>, prop_pileup_cov <dbl>,
## #   eff_cov <dbl>, Experiment <chr>, `Nominal Mapped Coverage` <dbl>,
## #   `Effective Coverage` <dbl>, Trait <chr>

With that in place, we can drill down to the sequence scores and calculate absolute and squared error:

# filter down to sequence samples only and add absolute error and squared error variables
seq_scores = scores_plot %>% 
  filter(assay_type != "ilmn_GSA") %>% 
  mutate(
    abs_error = abs(error),
    squared_error = error ^ 2,
    
    # the same, but for display
    `Absolute Error` = abs_error,
    `Squared Error` = squared_error
    )
seq_scores
## # A tibble: 1,578 x 26
##    cell_line sample trait   score true_score pop   super_pop assay_type
##    <chr>     <chr>  <chr>   <dbl>      <dbl> <chr> <chr>     <chr>     
##  1 HG00096   HG000… cad    0.368      0.370  GBR   EUR       ilmn_0.5x 
##  2 HG00096   HG000… brca   0.0923     0.139  GBR   EUR       ilmn_0.5x 
##  3 HG00096   HG000… cad    0.268      0.370  GBR   EUR       ilmn_0.5x 
##  4 HG00096   HG000… brca   0.212      0.139  GBR   EUR       ilmn_0.5x 
##  5 HG00101   HG001… cad    0.124      0.0221 GBR   EUR       ilmn_0.5x 
##  6 HG00101   HG001… brca  -0.314     -0.229  GBR   EUR       ilmn_0.5x 
##  7 HG00101   HG001… cad    0.0629     0.0221 GBR   EUR       ilmn_0.5x 
##  8 HG00101   HG001… brca  -0.268     -0.229  GBR   EUR       ilmn_0.5x 
##  9 HG00101   HG001… cad    0.0701     0.0221 GBR   EUR       ilmn_0.5x 
## 10 HG00101   HG001… brca  -0.305     -0.229  GBR   EUR       ilmn_0.5x 
## # … with 1,568 more rows, and 18 more variables: squared_error <dbl>,
## #   error <dbl>, bases_all <dbl>, bases_deduped <dbl>,
## #   bases_deduped_mapped <dbl>, coverage_all <dbl>, coverage_deduped <dbl>,
## #   coverage_deduped_mapped <dbl>, pileup_covered <dbl>, prop_pileup_cov <dbl>,
## #   eff_cov <dbl>, Experiment <chr>, `Nominal Mapped Coverage` <dbl>,
## #   `Effective Coverage` <dbl>, Trait <chr>, abs_error <dbl>, `Absolute
## #   Error` <dbl>, `Squared Error` <dbl>

Effective coverage, absolute error, squared error

We can estimate the "typical" measurement error of a microarray-based PRS by taking the array scores, computing the absolute and squared error for each sample, averaging across replicates, and then averaging across cell lines. We also calculate the standard error of the mean for each of these:

# take array scores absolute error and average among replicates, then group by trait and super pop to get population-level error
# we do this for squared error as well
array_score_summ = scores_plot %>% 
  filter(assay_type == "ilmn_GSA") %>% 
  mutate(
    abs_error = abs(error), # compute absolute error
    squared_error = error ^ 2 # compute squared error
    ) %>% 
  group_by(cell_line, Trait, super_pop) %>% 
  summarise(
    mean_abs_error = mean(abs_error), # mean across replicates
    mean_squared_error = mean(squared_error) # mean across replicates
    ) %>% 
  group_by(super_pop, Trait) %>% 
  summarise(
    # absolute error
    population_mean_abs_error = mean(mean_abs_error), # mean across samples
    population_std_abs_error = sqrt(var(mean_abs_error)), # std across samples
    population_mean_abs_error_stderr = population_std_abs_error / sqrt(n()),
    
    # mse
    population_mean_squared_error = mean(mean_squared_error),
    population_std_squared_error = sqrt(var(mean_squared_error)),
    population_mean_squared_error_stderr = population_std_squared_error / sqrt(n())
    )
## `summarise()` regrouping output by 'cell_line', 'Trait' (override with `.groups` argument)
## `summarise()` regrouping output by 'super_pop' (override with `.groups` argument)
array_score_summ
## # A tibble: 4 x 8
## # Groups:   super_pop [2]
##   super_pop Trait population_mean… population_std_… population_mean…
##   <chr>     <chr>            <dbl>            <dbl>            <dbl>
## 1 AFR       Brea…           0.152            0.123           0.0159 
## 2 AFR       CAD             0.0703           0.0442          0.00571
## 3 EUR       Brea…           0.111            0.0790          0.0102 
## 4 EUR       CAD             0.0893           0.0585          0.00755
## # … with 3 more variables: population_mean_squared_error <dbl>,
## #   population_std_squared_error <dbl>,
## #   population_mean_squared_error_stderr <dbl>
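The two-stage aggregation above (replicates within a cell line, then cell lines within a cohort) can be sketched in base R on toy numbers:

```r
# Two-stage aggregation sketch (base R, toy numbers): mean across replicates
# per cell line, then mean and SEM across cell lines.
toy <- data.frame(
  cell_line = rep(c("CL1", "CL2", "CL3"), each = 3),
  abs_error = c(0.10, 0.12, 0.11, 0.20, 0.18, 0.22, 0.15, 0.14, 0.16)
)
per_cl      <- tapply(toy$abs_error, toy$cell_line, mean)  # replicate means
cohort_mean <- mean(per_cl)                                # mean across cell lines
cohort_sem  <- sd(per_cl) / sqrt(length(per_cl))           # SEM across cell lines
round(c(mean = cohort_mean, sem = cohort_sem), 4)
```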

We can also obtain the average error by sequence experiment group:

# take MSE on sequence data similarly
seq_score_summ = seq_scores %>% 
  group_by(Experiment, Trait, super_pop, cell_line) %>% 
  summarise(cl_mse = mean(squared_error)) %>% 
  group_by(Experiment, Trait, super_pop) %>% 
  summarise(
    mse = mean(cl_mse),
    mse_se = sqrt(var(cl_mse)) / sqrt(n())
    )
## `summarise()` regrouping output by 'Experiment', 'Trait', 'super_pop' (override with `.groups` argument)
## `summarise()` regrouping output by 'Experiment', 'Trait' (override with `.groups` argument)
seq_score_summ
## # A tibble: 14 x 5
## # Groups:   Experiment, Trait [8]
##    Experiment Trait         super_pop     mse    mse_se
##    <chr>      <chr>         <chr>       <dbl>     <dbl>
##  1 A          Breast Cancer AFR       0.0429   0.00547 
##  2 A          Breast Cancer EUR       0.0347   0.00586 
##  3 A          CAD           AFR       0.00651  0.00100 
##  4 A          CAD           EUR       0.0124   0.00169 
##  5 B          Breast Cancer AFR       0.0247   0.00318 
##  6 B          Breast Cancer EUR       0.0197   0.00260 
##  7 B          CAD           AFR       0.00494  0.000770
##  8 B          CAD           EUR       0.00512  0.000633
##  9 C          Breast Cancer EUR       0.0411  NA       
## 10 C          CAD           EUR       0.00725 NA       
## 11 D          Breast Cancer AFR       0.0112   0.00434 
## 12 D          Breast Cancer EUR       0.0180   0.00508 
## 13 D          CAD           AFR       0.00271  0.000623
## 14 D          CAD           EUR       0.00272  0.00105

We can then plot the absolute error of PRS estimated from sequence data, marking the mean absolute error from the array estimates for each super population and trait:

## Saving 8 x 6 in image

We can plot the same for squared error, with standard error bars:

## Saving 8 x 6 in image

We can also write out the cohort-level means as tables:

# array score summary 
array_score_summ_table = array_score_summ %>% 
  select(super_pop, Trait, population_mean_squared_error, population_mean_squared_error_stderr) %>% 
  rename(
    `Super population` = super_pop,
    `MSE` = population_mean_squared_error,
    `SE of MSE` = population_mean_squared_error_stderr
  ) %>% 
  arrange(Trait)

array_score_summ_table %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat( file = "../paper/src/tabs/array-score-mse.tex")

array_score_summ_table
## # A tibble: 4 x 4
## # Groups:   Super population [2]
##   `Super population` Trait             MSE `SE of MSE`
##   <chr>              <chr>           <dbl>       <dbl>
## 1 AFR                Breast Cancer 0.0400      0.00749
## 2 EUR                Breast Cancer 0.0195      0.00347
## 3 AFR                CAD           0.00757     0.00102
## 4 EUR                CAD           0.0116      0.00168

and similarly for the sequence data:

# reshape to human readable.
# we also take care to join with the array mse in order to get the ratio 
array_mses_tojoin = array_score_summ %>% 
  select(super_pop, Trait, population_mean_squared_error) %>% 
  rename(array_mse = population_mean_squared_error)

# left join with seq score summs and get the ratio of array / seq 
seq_score_summ_table = seq_score_summ %>% 
  left_join(array_mses_tojoin, by = c("super_pop", "Trait")) %>% 
  mutate(
    `MSE fold decrease` = array_mse / mse
  ) %>% 
  rename(
    `Super population` = super_pop,
    `MSE` = mse,
    `SE of MSE` = mse_se
  ) %>% 
  select(-array_mse) %>% 
  arrange(Trait)

seq_score_summ_table %>% 
  kable("latex", booktabs = TRUE, linesep = "", digits = 4) %>% 
  cat(file = "../paper/src/tabs/seq-score-mse.tex")

seq_score_summ_table
## # A tibble: 14 x 6
## # Groups:   Experiment, Trait [8]
##    Experiment Trait      `Super populatio…     MSE `SE of MSE` `MSE fold decrea…
##    <chr>      <chr>      <chr>               <dbl>       <dbl>             <dbl>
##  1 A          Breast Ca… AFR               0.0429     0.00547              0.932
##  2 A          Breast Ca… EUR               0.0347     0.00586              0.563
##  3 B          Breast Ca… AFR               0.0247     0.00318              1.62 
##  4 B          Breast Ca… EUR               0.0197     0.00260              0.988
##  5 C          Breast Ca… EUR               0.0411    NA                    0.475
##  6 D          Breast Ca… AFR               0.0112     0.00434              3.57 
##  7 D          Breast Ca… EUR               0.0180     0.00508              1.09 
##  8 A          CAD        AFR               0.00651    0.00100              1.16 
##  9 A          CAD        EUR               0.0124     0.00169              0.935
## 10 B          CAD        AFR               0.00494    0.000770             1.53 
## 11 B          CAD        EUR               0.00512    0.000633             2.27 
## 12 C          CAD        EUR               0.00725   NA                    1.60 
## 13 D          CAD        AFR               0.00271    0.000623             2.80 
## 14 D          CAD        EUR               0.00272    0.00105              4.27

Significance tests

We can perform significance tests for differences in mean squared error of the PRS estimates by trait and population, comparing each sequence experiment's results to the array estimates.

# inelegant variable naming
exp_A_eur_cad = scores_plot %>% filter(Experiment == "A", super_pop == "EUR", trait == "cad") %>% pull(squared_error)
exp_A_eur_brca = scores_plot %>% filter(Experiment == "A", super_pop == "EUR", trait == "brca") %>% pull(squared_error)
exp_B_eur_cad = scores_plot %>% filter(Experiment == "B", super_pop == "EUR", trait =="cad") %>% pull(squared_error)
exp_B_eur_brca = scores_plot %>% filter(Experiment == "B", super_pop == "EUR", trait == "brca") %>% pull(squared_error)
exp_C_eur_cad = scores_plot %>% filter(Experiment == "C", super_pop == "EUR", trait == "cad") %>% pull(squared_error)
exp_C_eur_brca = scores_plot %>% filter(Experiment == "C", super_pop == "EUR", trait == "brca") %>% pull(squared_error)
exp_D_eur_cad = scores_plot %>% filter(Experiment == "D", super_pop == "EUR", trait == "cad") %>% pull(squared_error)
exp_D_eur_brca = scores_plot %>% filter(Experiment == "D", super_pop == "EUR", trait == "brca") %>% pull(squared_error)
exp_E_eur_cad = scores_plot %>% filter(Experiment == "E", super_pop == "EUR", trait == "cad") %>% pull(squared_error)
exp_E_eur_brca = scores_plot %>% filter(Experiment == "E", super_pop == "EUR", trait == "brca") %>% pull(squared_error)

exp_A_afr_cad = scores_plot %>% filter(Experiment == "A", super_pop == "AFR", trait == "cad") %>% pull(squared_error)
exp_A_afr_brca = scores_plot %>% filter(Experiment == "A", super_pop == "AFR", trait == "brca") %>% pull(squared_error)
exp_B_afr_cad = scores_plot %>% filter(Experiment == "B", super_pop == "AFR", trait == "cad") %>% pull(squared_error)
exp_B_afr_brca = scores_plot %>% filter(Experiment == "B", super_pop == "AFR", trait == "brca") %>% pull(squared_error)
# exp_C_afr_cad = scores_plot %>% filter(Experiment == "C", super_pop == "AFR", trait == "cad") %>% pull(squared_error)
# exp_C_afr_brca = scores_plot %>% filter(Experiment == "C", super_pop == "AFR", trait == "brca") %>% pull(squared_error)
exp_D_afr_cad = scores_plot %>% filter(Experiment == "D", super_pop == "AFR", trait == "cad") %>% pull(squared_error)
exp_D_afr_brca = scores_plot %>% filter(Experiment == "D", super_pop == "AFR", trait == "brca") %>% pull(squared_error)
exp_E_afr_cad = scores_plot %>% filter(Experiment == "E", super_pop == "AFR", trait == "cad") %>% pull(squared_error)
exp_E_afr_brca = scores_plot %>% filter(Experiment == "E", super_pop == "AFR", trait == "brca") %>% pull(squared_error)

print("==========EUR=========")
## [1] "==========EUR========="
paste("t test p value EUR exp A vs array, cad", t.test(exp_A_eur_cad, exp_E_eur_cad)$p.value)
## [1] "t test p value EUR exp A vs array, cad 0.624883771856766"
paste("t test p value EUR exp A vs array, brca", t.test(exp_A_eur_brca, exp_E_eur_brca)$p.value)
## [1] "t test p value EUR exp A vs array, brca 0.00620730712207844"
paste("t test p value EUR exp B vs array, cad", t.test(exp_B_eur_cad, exp_E_eur_cad)$p.value)
## [1] "t test p value EUR exp B vs array, cad 4.6990465415024e-08"
paste("t test p value EUR exp B vs array, brca", t.test(exp_B_eur_brca, exp_E_eur_brca)$p.value)
## [1] "t test p value EUR exp B vs array, brca 0.935713738119754"
paste("t test p value EUR exp C vs array, cad", t.test(exp_C_eur_cad, exp_E_eur_cad)$p.value)
## [1] "t test p value EUR exp C vs array, cad 0.040631359553411"
paste("t test p value EUR exp C vs array, brca", t.test(exp_C_eur_brca, exp_E_eur_brca)$p.value)
## [1] "t test p value EUR exp C vs array, brca 0.0352135014412338"
paste("t test p value EUR exp D vs array, cad", t.test(exp_D_eur_cad, exp_E_eur_cad)$p.value)
## [1] "t test p value EUR exp D vs array, cad 2.44475644336255e-08"
paste("t test p value EUR exp D vs array, brca", t.test(exp_D_eur_brca, exp_E_eur_brca)$p.value)
## [1] "t test p value EUR exp D vs array, brca 0.76799891806175"
print("==========AFR=========")
## [1] "==========AFR========="
paste("t test p value AFR exp A vs array, cad", t.test(exp_A_afr_cad, exp_E_afr_cad)$p.value)
## [1] "t test p value AFR exp A vs array, cad 0.308274683938933"
paste("t test p value AFR exp A vs array, brca", t.test(exp_A_afr_brca, exp_E_afr_brca)$p.value)
## [1] "t test p value AFR exp A vs array, brca 0.672128739346235"
paste("t test p value AFR exp B vs array, cad", t.test(exp_B_afr_cad, exp_E_afr_cad)$p.value)
## [1] "t test p value AFR exp B vs array, cad 0.00426109269477669"
paste("t test p value AFR exp B vs array, brca", t.test(exp_B_afr_brca, exp_E_afr_brca)$p.value)
## [1] "t test p value AFR exp B vs array, brca 0.00471850482505407"
paste("t test p value AFR exp D vs array, cad", t.test(exp_D_afr_cad, exp_E_afr_cad)$p.value)
## [1] "t test p value AFR exp D vs array, cad 6.07713503377865e-07"
paste("t test p value AFR exp D vs array, brca", t.test(exp_D_afr_brca, exp_E_afr_brca)$p.value)
## [1] "t test p value AFR exp D vs array, brca 1.31143555348574e-05"

Head-to-head results

In order to obtain directly comparable results, we chose, for each cell line and each assay type within each super population, one sample at random from among the replicates. We can filter down to those samples and then aggregate by assay type.
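The random draw can be sketched as follows (base R; the sample sheet here is a toy stand-in for the real replicate manifest):

```r
# Sketch: pick one replicate at random per (cell_line, assay_type) group.
set.seed(3)
sheet <- data.frame(
  cell_line  = rep(c("CL1", "CL2"), each = 4),
  assay_type = rep(c("ilmn_1x", "ilmn_1x", "bgi_1x", "bgi_1x"), times = 2),
  sample     = paste0("s", 1:8)
)
pick_one <- function(df) df[sample(nrow(df), 1), , drop = FALSE]
groups   <- split(sheet, list(sheet$cell_line, sheet$assay_type))
picked   <- do.call(rbind, lapply(groups, pick_one))
nrow(picked)   # one row per cell_line x assay_type combination (4 here)
```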

Concordance

We filter the results down to the samples in question. Then, as before, since the material in this section will eventually be put into figures, we make a version in which the column names and values are more readable:

# just filtered down, no annotations
keep_concordance = concordance %>% 
  filter(sample %in% random_samples$sample)

# for plotting down the line
keep_concordance_plot = keep_concordance %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "bgi_1x" ~ "D",
      assay_type == "ilmn_GSA" ~ "E"
    ),
    # unfiltered status 
    `Site Type` = case_when(
      site_type == "allsites" ~ "Unfiltered",
      site_type == "passing" ~ "Filtered",
    ),
    # allele frequency
    `Non-Reference Allele Frequency` = af_midpoint,
    
    `Super population` = super_pop
  )
  
head(keep_concordance)
## # A tibble: 6 x 39
##   cell_line sample stat_type set_id af_midpoint rr_hom_matches ra_het_matches
##   <chr>     <chr>  <chr>      <dbl>       <dbl>          <dbl>          <dbl>
## 1 HG00101   HG001… GCsAF          2       0.005       65520717          44179
## 2 HG00101   HG001… GCsAF          2       0.015        2560128          27662
## 3 HG00101   HG001… GCsAF          2       0.025        1169680          22824
## 4 HG00101   HG001… GCsAF          2       0.035         706309          22324
## 5 HG00101   HG001… GCsAF          2       0.045         489012          22909
## 6 HG00101   HG001… GCsAF          2       0.055         368235          23583
## # … with 32 more variables: aa_hom_matches <dbl>, rr_hom_mismatches <dbl>,
## #   ra_het_mismatches <dbl>, aa_hom_mismatches <dbl>, dosage_rsquared <dbl>,
## #   n_genotypes <dbl>, site_type <chr>, tech <chr>, pop <chr>, super_pop <chr>,
## #   rr_hom_concordance <dbl>, ra_het_concordance <dbl>,
## #   aa_hom_concordance <dbl>, total_matches <dbl>, total_sites <dbl>,
## #   overall_concordance <dbl>, nr_concordance <dbl>, nrd_numerator <dbl>,
## #   nrd_denominator <dbl>, rr_hom_count <dbl>, ra_het_count <dbl>,
## #   aa_hom_count <dbl>, assay_type <chr>, bases_all <dbl>, bases_deduped <dbl>,
## #   bases_deduped_mapped <dbl>, coverage_all <dbl>, coverage_deduped <dbl>,
## #   coverage_deduped_mapped <dbl>, pileup_covered <dbl>, prop_pileup_cov <dbl>,
## #   eff_cov <dbl>

Here we pause to note exactly which samples remain. To calculate comparable statistics across assay types for a given ancestry group, a cell line must have at least one replicate in each assay type that passed upstream QC. That was not the case for every cell line: in particular, three EUR cell lines did not satisfy this condition and were excluded.

Specifically, the two TSI cell lines NA20802 and NA20822 did not have a single passing sample, and HG00096 did not have any successful 1x Illumina samples.

We can tally the number of samples in each of these groups to confirm:

keep_concordance %>% 
  select(cell_line, super_pop, assay_type, sample) %>% 
  distinct(sample, .keep_all = TRUE) %>% 
  group_by(super_pop, assay_type) %>% 
  tally()
## # A tibble: 8 x 3
## # Groups:   super_pop [2]
##   super_pop assay_type     n
##   <chr>     <chr>      <int>
## 1 AFR       bgi_1x        30
## 2 AFR       ilmn_0.5x     60
## 3 AFR       ilmn_1x       60
## 4 AFR       ilmn_GSA      60
## 5 EUR       bgi_1x        28
## 6 EUR       ilmn_0.5x     57
## 7 EUR       ilmn_1x       57
## 8 EUR       ilmn_GSA      57

Note also that although the BGI samples are included here, they cannot be compared quite as directly, since they are not the same samples and the \(n\) is lower.
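For reference, non-reference concordance (NRC) can be derived from count columns like those in `concordance`. The sketch below uses one common definition (non-reference matches over non-reference matches plus all mismatches) on toy counts; the upstream computation may differ in detail (the data also carry precomputed `nrd_numerator` / `nrd_denominator`):

```r
# Hedged sketch: one common NRC definition from genotype match/mismatch
# counts; toy counts, and the upstream definition may differ in detail.
nrc_from_counts <- function(ra_het_matches, aa_hom_matches,
                            rr_hom_mismatches, ra_het_mismatches,
                            aa_hom_mismatches) {
  nr_matches <- ra_het_matches + aa_hom_matches
  mismatches <- rr_hom_mismatches + ra_het_mismatches + aa_hom_mismatches
  nr_matches / (nr_matches + mismatches)
}
nrc_from_counts(90, 45, 2, 5, 3)   # 135 / 145
```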

Overall results

We can calculate sample-level results and then aggregate by assay type: for each assay type, site type, stat type, and super population, we can take the mean concordance and NRC of the samples in each of those bins.

Coverage and effective coverage for random samples

Let's look at the actual coverages for the representative samples:

keep_concordance_plot %>% 
  distinct(cell_line, sample, coverage_deduped_mapped, eff_cov, assay_type, super_pop) %>%
  group_by(assay_type, super_pop) %>% 
  summarise(
    mean_eff_cov = mean(eff_cov),
    std_eff_cov = sqrt(var(eff_cov)),
    mean_nominal_cov = mean(coverage_deduped_mapped),
    std_nominal_cov = sqrt(var(coverage_deduped_mapped))
  )
## `summarise()` regrouping output by 'assay_type' (override with `.groups` argument)
## # A tibble: 8 x 6
## # Groups:   assay_type [4]
##   assay_type super_pop mean_eff_cov std_eff_cov mean_nominal_cov std_nominal_cov
##   <chr>      <chr>            <dbl>       <dbl>            <dbl>           <dbl>
## 1 bgi_1x     AFR              1.24       0.0877            1.26           0.0850
## 2 bgi_1x     EUR              1.23       0.124             1.26           0.0948
## 3 ilmn_0.5x  AFR              0.371      0.0859            0.641          0.117 
## 4 ilmn_0.5x  EUR              0.474      0.298             0.728          0.515 
## 5 ilmn_1x    AFR              0.669      0.167             1.25           0.241 
## 6 ilmn_1x    EUR              0.751      0.157             1.21           0.171 
## 7 ilmn_GSA   AFR             NA         NA                NA             NA     
## 8 ilmn_GSA   EUR             NA         NA                NA             NA

or, looking at experiments overall (without stratifying by population),

# coverage by experiment 
keep_concordance_plot %>% 
  distinct(cell_line, sample, coverage_deduped_mapped, eff_cov, assay_type, super_pop) %>%
  group_by(assay_type) %>% 
  summarise(
    mean_eff_cov = mean(eff_cov),
    std_eff_cov = sqrt(var(eff_cov)),
    mean_nominal_cov = mean(coverage_deduped_mapped),
    std_nominal_cov = sqrt(var(coverage_deduped_mapped))
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
##   assay_type mean_eff_cov std_eff_cov mean_nominal_cov std_nominal_cov
##   <chr>             <dbl>       <dbl>            <dbl>           <dbl>
## 1 bgi_1x            1.24        0.106            1.26           0.0892
## 2 ilmn_0.5x         0.421       0.222            0.683          0.370 
## 3 ilmn_1x           0.709       0.167            1.23           0.210 
## 4 ilmn_GSA         NA          NA               NA             NA

and we can also write out all of the samples actually kept, along with their metadata:

keep_concordance_plot %>% 
  distinct(cell_line, sample, pop, super_pop, Experiment) %>% 
  write_csv("../data/out/representative-cohorts.csv")

Performance

Since we are going to print tables of summary statistics, we should also recode the assay type to the Experiment nomenclature.

## `summarise()` regrouping output by 'sample', 'cell_line', 'stat_type', 'site_type', 'super_pop' (override with `.groups` argument)
## `summarise()` regrouping output by 'super_pop', 'stat_type', 'site_type' (override with `.groups` argument)
## # A tibble: 32 x 14
##    super_pop stat_type site_type assay_type mean_concordance mean_nrc
##    <chr>     <chr>     <chr>     <chr>                 <dbl>    <dbl>
##  1 AFR       GCiAF     allsites  bgi_1x                0.978    0.888
##  2 AFR       GCiAF     allsites  ilmn_0.5x             0.972    0.858
##  3 AFR       GCiAF     allsites  ilmn_1x               0.975    0.874
##  4 AFR       GCiAF     allsites  ilmn_GSA              0.960    0.800
##  5 AFR       GCiAF     passing   bgi_1x                0.987    0.930
##  6 AFR       GCiAF     passing   ilmn_0.5x             0.987    0.926
##  7 AFR       GCiAF     passing   ilmn_1x               0.987    0.929
##  8 AFR       GCiAF     passing   ilmn_GSA              0.988    0.914
##  9 AFR       GCsAF     allsites  bgi_1x                0.997    0.942
## 10 AFR       GCsAF     allsites  ilmn_0.5x             0.994    0.900
## # … with 22 more rows, and 8 more variables: concordance_quantile_25 <dbl>,
## #   concordance_quantile_75 <dbl>, nrc_quantile_25 <dbl>,
## #   nrc_quantile_75 <dbl>, mean_concordance_display <chr>,
## #   mean_nrc_display <chr>, Experiment <chr>, `Super Population` <chr>

which is still grouped by quite a few variables. Comparing assay types directly is a bit clearer if we restrict ourselves to NRC for unfiltered SNPs.

Note that in the course of producing these data we also save them to disk in TeX format so that they can be imported directly into the manuscript.

Overall results, NRC

# show mean NRC across samples for unfiltered SNPs by assay type and super pop

unfiltered_snps_mean_nrc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCsAF", site_type == "allsites") %>% 
  select(`Super Population`, Experiment, mean_nrc_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_nrc_display) %>% 
  select(`Super Population`, sort(current_vars())) %>% 
  rename(`E (array)` = E) 

unfiltered_snps_mean_nrc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  add_header_above(c(" ", "Experiment" = 4)) %>% 
  cat(file = "../paper/src/tabs/unfiltered-snps-mean-nrc.tex")

unfiltered_snps_mean_nrc 
## # A tibble: 2 x 5
##   `Super Populatio… A              B              D             `E (array)`     
##   <chr>             <chr>          <chr>          <chr>         <chr>           
## 1 AFR               0.9002 (0.889… 0.9217 (0.912… 0.9419 (0.93… 0.8308 (0.8119-…
## 2 EUR               0.9227 (0.913… 0.9438 (0.939… 0.9562 (0.95… 0.9067 (0.898-0…

from which a sharp contrast is immediately apparent across assay types and populations: for this particular set of sites, the GSA performs worse than all of the sequencing results. If we consider filtered SNPs instead,

# do the same but for filtered sites

filtered_snps_mean_nrc = keep_sample_concordance_aggregated %>% 
  filter(stat_type == "GCsAF", site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_nrc_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_nrc_display) %>% 
  select(`Super Population`, sort(current_vars()))


filtered_snps_mean_nrc %>%
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/filtered-snps-mean-nrc.tex")

filtered_snps_mean_nrc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9456 (0.939… 0.9533 (0.947… 0.9609 (0.958… 0.9214 (0.908…
## 2 EUR                0.9626 (0.958… 0.9694 (0.966… 0.9731 (0.970… 0.9663 (0.961…

the difference is smaller; in fact, in EUR samples the GSA outperforms the 0.5x Illumina sequence data, while it still performs worse in AFR samples.

We can do the same for NRC for filtered and unfiltered indels:

# show mean NRC across samples for unfiltered indels by assay type and super pop
unfiltered_indels_mean_nrc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCiAF", site_type == "allsites") %>% 
  select(`Super Population`, Experiment, mean_nrc_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_nrc_display) %>% 
  select(`Super Population`, sort(current_vars()))

unfiltered_indels_mean_nrc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/unfiltered-indels-mean-nrc.tex")

unfiltered_indels_mean_nrc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.8577 (0.846… 0.8737 (0.863… 0.8875 (0.881… 0.7998 (0.778…
## 2 EUR                0.8662 (0.855… 0.8823 (0.876… 0.8908 (0.886… 0.8549 (0.848…

for filtered sites:

# mean NRC for filtered indels
filtered_indels_mean_nrc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCiAF", site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_nrc_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_nrc_display) %>% 
  select(`Super Population`, sort(current_vars()))

filtered_indels_mean_nrc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/filtered-indels-mean-nrc.tex")

filtered_indels_mean_nrc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9256 (0.919… 0.9286 (0.923… 0.9303 (0.926… 0.9144 (0.901…
## 2 EUR                0.9379 (0.932… 0.9398 (0.934… 0.9389 (0.935… 0.948 (0.943-…

Overall results, overall concordance

Similarly, we can compute the same for overall concordance:
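Overall concordance, unlike NRC, also credits reference-homozygous matches; a toy sketch of the idea (the upstream computation may differ in detail):

```r
# Toy sketch: overall concordance = all matching genotypes / all sites.
overall_concordance <- function(rr_hom, ra_het, aa_hom, mismatches) {
  (rr_hom + ra_het + aa_hom) / (rr_hom + ra_het + aa_hom + mismatches)
}
overall_concordance(9500, 300, 150, 50)   # 9950 / 10000
```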

unfiltered:

# show mean overall concordance across samples for unfiltered SNPs by assay type and super pop
unfiltered_snps_mean_oconc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCsAF", site_type == "allsites") %>% 
  select(`Super Population`, Experiment, mean_concordance_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance_display)%>% 
  select(`Super Population`, sort(current_vars()))
# print to latex 
unfiltered_snps_mean_oconc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/unfiltered-snps-mean-oconc.tex")

unfiltered_snps_mean_oconc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9943 (0.993… 0.9956 (0.995… 0.9967 (0.996… 0.9903 (0.989…
## 2 EUR                0.9964 (0.996… 0.9974 (0.997… 0.998 (0.9978… 0.9957 (0.995…

filtered:

# show mean overall concordance across samples for filtered SNPs by assay type and super pop
filtered_snps_mean_oconc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCsAF", site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_concordance_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance_display)%>% 
  select(`Super Population`, sort(current_vars()))

filtered_snps_mean_oconc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/filtered-snps-mean-oconc.tex")

filtered_snps_mean_oconc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9973 (0.997… 0.9976 (0.997… 0.9979 (0.997… 0.9967 (0.996…
## 2 EUR                0.9985 (0.998… 0.9987 (0.998… 0.9988 (0.998… 0.9988 (0.998…

and the same for indels. Unfiltered:

# show mean overall concordance across samples for unfiltered indels by assay type and super pop
unfiltered_indels_mean_oconc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCiAF", site_type == "allsites") %>% 
  select(`Super Population`, Experiment, mean_concordance_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance_display)%>% 
  select(`Super Population`, sort(current_vars()))

unfiltered_indels_mean_oconc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/unfiltered-indels-mean-oconc.tex")

unfiltered_indels_mean_oconc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9716 (0.968… 0.9749 (0.972… 0.9777 (0.976… 0.9596 (0.955…
## 2 EUR                0.9766 (0.974… 0.9795 (0.978… 0.981 (0.9803… 0.9746 (0.973…

vs filtered:

# show mean overall concordance across samples for filtered indels by assay type and super pop
filtered_indels_mean_oconc = keep_sample_concordance_aggregated %>%
  filter(stat_type == "GCiAF", site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_concordance_display) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance_display)%>% 
  select(`Super Population`, sort(current_vars()))

filtered_indels_mean_oconc %>% 
  kable("latex", booktabs = TRUE, digits = 4) %>% 
  cat(file = "../paper/src/tabs/filtered-indels-mean-oconc.tex")

filtered_indels_mean_oconc
## # A tibble: 2 x 5
##   `Super Population` A              B              D              E             
##   <chr>              <chr>          <chr>          <chr>          <chr>         
## 1 AFR                0.9873 (0.986… 0.9874 (0.986… 0.9873 (0.986… 0.9879 (0.986…
## 2 EUR                0.9909 (0.990… 0.9908 (0.99-… 0.9904 (0.99-… 0.9933 (0.992…

We can also compute the mean unfiltered overall concordance by experiment for variants with MAF > 1% (for the marketing one-pager).

# compute mean concordance metrics at common variants
sample_common_concordance = keep_concordance %>%
  filter(af_midpoint != 0.005) %>% 
  group_by(sample, cell_line, stat_type, site_type, super_pop, assay_type) %>%  
  summarise(
    overall_concordance = sum(total_matches) / sum(total_sites),
    nrc = 1 - sum(nrd_numerator) / sum(nrd_denominator),
    total_nrd_numerator = sum(nrd_numerator),
    total_nrd_denominator = sum(nrd_denominator)
    ) %>% 
  group_by(super_pop, stat_type, site_type, assay_type) %>% 
  summarise(
    mean_concordance = mean(overall_concordance), 
    mean_nrc = mean(nrc)
    ) %>% 
  ungroup() %>%  
  mutate(
        # replace assay type with Experiment X
        Experiment = case_when(
            assay_type == "ilmn_0.5x" ~ "A",
            assay_type == "ilmn_1x" ~ "B",
            assay_type == "bgi_1x" ~ "D",
            assay_type == "ilmn_GSA" ~ "E"
        ),
        `Super Population` = super_pop
    )
## `summarise()` regrouping output by 'sample', 'cell_line', 'stat_type', 'site_type', 'super_pop' (override with `.groups` argument)
## `summarise()` regrouping output by 'super_pop', 'stat_type', 'site_type' (override with `.groups` argument)
# mean concordance at AF >1%
sample_common_concordance %>% 
  filter(stat_type == "GCsAF" & site_type == "allsites") %>% 
  select(`Super Population`, Experiment, mean_concordance) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance) %>% 
  select(`Super Population`, sort(current_vars()))
## # A tibble: 2 x 5
##   `Super Population`     A     B     D     E
##   <chr>              <dbl> <dbl> <dbl> <dbl>
## 1 AFR                0.973 0.980 0.987 0.950
## 2 EUR                0.982 0.988 0.991 0.977

and the same for filtered

# mean concordance at AF >1%, filtered
sample_common_concordance %>% 
  filter(stat_type == "GCsAF" & site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_concordance) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance) %>% 
  select(`Super Population`, sort(current_vars()))
## # A tibble: 2 x 5
##   `Super Population`     A     B     D     E
##   <chr>              <dbl> <dbl> <dbl> <dbl>
## 1 AFR                0.989 0.990 0.992 0.985
## 2 EUR                0.993 0.994 0.995 0.995

We can also do this for MAF > 5%:

# concordance at common (af >5%) 
sample_common_concordance_05 = keep_concordance %>%
  filter(af_midpoint > 0.045) %>% 
  group_by(sample, cell_line, stat_type, site_type, super_pop, assay_type) %>%  
  summarise(
    overall_concordance = sum(total_matches) / sum(total_sites),
    nrc = 1 - sum(nrd_numerator) / sum(nrd_denominator),
    total_nrd_numerator = sum(nrd_numerator),
    total_nrd_denominator = sum(nrd_denominator)
    ) %>% 
  group_by(super_pop, stat_type, site_type, assay_type) %>% 
  summarise(
    mean_concordance = mean(overall_concordance), 
    mean_nrc = mean(nrc)
    ) %>% 
  ungroup() %>%  
  mutate(
        # replace assay type with Experiment X
        Experiment = case_when(
            assay_type == "ilmn_0.5x" ~ "A",
            assay_type == "ilmn_1x" ~ "B",
            assay_type == "bgi_1x" ~ "D",
            assay_type == "ilmn_GSA" ~ "E"
        ),
        `Super Population` = super_pop
    )
## `summarise()` regrouping output by 'sample', 'cell_line', 'stat_type', 'site_type', 'super_pop' (override with `.groups` argument)
## `summarise()` regrouping output by 'super_pop', 'stat_type', 'site_type' (override with `.groups` argument)
# mean concordance at AF >5%, filtered
sample_common_concordance_05 %>% 
  filter(stat_type == "GCsAF" & site_type == "passing") %>% 
  select(`Super Population`, Experiment, mean_concordance) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_concordance) %>% 
  select(`Super Population`, sort(current_vars()))
## # A tibble: 2 x 5
##   `Super Population`     A     B     D     E
##   <chr>              <dbl> <dbl> <dbl> <dbl>
## 1 AFR                0.985 0.987 0.990 0.983
## 2 EUR                0.990 0.992 0.993 0.993

Tests for significance

We can perform paired two-sample t-tests for differences in means between the methods within each super population.

# snp data
tdata = keep_concordance %>%
  filter(af_midpoint > 0.045, stat_type == "GCsAF", site_type == "allsites") %>% 
  group_by(sample, cell_line, stat_type, site_type, super_pop, assay_type) %>%  
  summarise(
    overall_concordance = sum(total_matches) / sum(total_sites),
    nrc = 1 - sum(nrd_numerator) / sum(nrd_denominator),
    total_nrd_numerator = sum(nrd_numerator),
    total_nrd_denominator = sum(nrd_denominator)
    )
## `summarise()` regrouping output by 'sample', 'cell_line', 'stat_type', 'site_type', 'super_pop' (override with `.groups` argument)
# compare against array results 
eur_array_nrcs = tdata %>% filter(super_pop == "EUR", assay_type == "ilmn_GSA") %>% ungroup() %>% select(nrc, cell_line)
afr_array_nrcs = tdata %>% filter(super_pop == "AFR", assay_type == "ilmn_GSA") %>% ungroup() %>% select(nrc, cell_line)


print("Two-sample paired t-tests for difference in means between sequence and array NRCs")
## [1] "Two-sample paired t-tests for difference in means between sequence and array NRCs"
for (i in c("ilmn_0.5x", "ilmn_1x", "bgi_1x")) {
    df = tdata %>% filter(super_pop == "EUR", assay_type == i) %>% ungroup() %>% select(nrc, cell_line) %>%
      inner_join(eur_array_nrcs, by = "cell_line")
    print(paste("p value for eur", i, "vs array", t.test(df$nrc.x, df$nrc.y, paired = TRUE)$p.value))
    
    df = tdata %>% filter(super_pop == "AFR", assay_type == i) %>% ungroup() %>% select(nrc, cell_line) %>%
      inner_join(afr_array_nrcs, by = "cell_line")
    print(paste("p value for afr", i, "vs array", t.test(df$nrc.x, df$nrc.y, paired = TRUE)$p.value))
}
## [1] "p value for eur ilmn_0.5x vs array 3.82136349569441e-08"
## [1] "p value for afr ilmn_0.5x vs array 1.61839038739243e-35"
## [1] "p value for eur ilmn_1x vs array 3.06898401678837e-43"
## [1] "p value for afr ilmn_1x vs array 9.93919703169578e-47"
## [1] "p value for eur bgi_1x vs array 2.94571376861514e-28"
## [1] "p value for afr bgi_1x vs array 1.07839738467597e-24"

We can also perform unpaired two-sample t-tests for differences in means between the AFR and EUR results:

print("two-sample unpaired t-tests for differences in means between populations")
## [1] "two-sample unpaired t-tests for differences in means between populations"
for (i in c("ilmn_0.5x", "ilmn_1x", "bgi_1x")) {
  eur_nrcs = tdata %>% filter(assay_type == i) %>% ungroup() %>% filter(super_pop == "EUR") %>% pull(nrc)
  afr_nrcs = tdata %>% filter(assay_type == i) %>% ungroup() %>% filter(super_pop == "AFR") %>% pull(nrc)
  
  print(paste("p value for", i, "between eur and afr NRCs", t.test(eur_nrcs, afr_nrcs, paired = FALSE)$p.value))
}
## [1] "p value for ilmn_0.5x between eur and afr NRCs 3.7395104769465e-06"
## [1] "p value for ilmn_1x between eur and afr NRCs 5.78135528251449e-18"
## [1] "p value for bgi_1x between eur and afr NRCs 2.4241127136494e-12"

By allele frequency

Aggregating sample metrics for each bin

Here we calculate NRC for each sample for each bin and then average across samples for each bin.

# concordance by nraf 
keep_concordance_plot_bynraf = keep_concordance_plot %>% 
  group_by(Experiment, super_pop, `Site Type`, stat_type, `Non-Reference Allele Frequency`) %>% 
  summarise(
    `Mean NRC` = mean(nr_concordance), 
    n = n(),
    `Mean concordance` = mean(overall_concordance),
    
    # these are replicated to create the display names below
    mean_nrc = mean(nr_concordance),
    mean_concordance = mean(overall_concordance),
    
    concordance_quantile_25 = quantile(overall_concordance, 1/4),
    concordance_quantile_75 = quantile(overall_concordance, 3/4),
    nrc_quantile_25 = quantile(nr_concordance, 1/4),
    nrc_quantile_75 = quantile(nr_concordance, 3/4),
    
    # display strings of the form `mean (25th-75th percentile)`
    mean_concordance_display = paste(round(mean_concordance, 3), " (", round(concordance_quantile_25, 3), "-", round(concordance_quantile_75, 3), ")", sep = ""),
    mean_nrc_display = paste(round(mean_nrc, 3), " (", round(nrc_quantile_25, 3), "-", round(nrc_quantile_75, 3), ")", sep = "")
    ) 
## `summarise()` regrouping output by 'Experiment', 'super_pop', 'Site Type', 'stat_type' (override with `.groups` argument)
keep_concordance_plot_bynraf
## # A tibble: 480 x 16
## # Groups:   Experiment, super_pop, Site Type, stat_type [32]
##    Experiment super_pop `Site Type` stat_type `Non-Reference … `Mean NRC`     n
##    <chr>      <chr>     <chr>       <chr>                <dbl>      <dbl> <int>
##  1 A          AFR       Filtered    GCiAF                0.005      0.723    60
##  2 A          AFR       Filtered    GCiAF                0.015      0.891    60
##  3 A          AFR       Filtered    GCiAF                0.025      0.917    60
##  4 A          AFR       Filtered    GCiAF                0.035      0.924    60
##  5 A          AFR       Filtered    GCiAF                0.045      0.926    60
##  6 A          AFR       Filtered    GCiAF                0.055      0.926    60
##  7 A          AFR       Filtered    GCiAF                0.065      0.924    60
##  8 A          AFR       Filtered    GCiAF                0.075      0.923    60
##  9 A          AFR       Filtered    GCiAF                0.085      0.921    60
## 10 A          AFR       Filtered    GCiAF                0.095      0.919    60
## # … with 470 more rows, and 9 more variables: `Mean concordance` <dbl>,
## #   mean_nrc <dbl>, mean_concordance <dbl>, concordance_quantile_25 <dbl>,
## #   concordance_quantile_75 <dbl>, nrc_quantile_25 <dbl>,
## #   nrc_quantile_75 <dbl>, mean_concordance_display <chr>,
## #   mean_nrc_display <chr>

Since we’re plotting here, let’s use the new variables we renamed above:

# for each AF bin, group by assay type and average the NRC for _all_ sites, not just passing
keep_concordance_plot_bynraf %>% 
  ungroup() %>% 
  mutate(
    Experiment = if_else(Experiment == "E", "E (array)", Experiment)
  ) %>% 
  filter(stat_type == "GCsAF" & `Site Type` == "Unfiltered") %>% 
  ggplot(
    aes(x = `Non-Reference Allele Frequency`, y = `Mean NRC`, color = Experiment)
    ) + 
  facet_wrap(~super_pop) + 
  geom_line() + geom_point() + theme_few() + scale_color_tableau() + 
  labs(title = "Average NRC by non-reference allele frequency\nfor unfiltered sites")

ggsave("../paper/src/figs/unfiltered_snp_nrc_by_nraf.pdf")
## Saving 7 x 4 in image

where we observe higher performance for imputed sequence data compared to imputed array data across the allele frequency spectrum for unfiltered variant calls.

We can do this on a log scale too:

# for each AF bin, group by assay type and average the NRC for _all_ sites, not just passing
keep_concordance_plot_bynraf %>% 
  ungroup() %>% 
  mutate(
    Experiment = if_else(Experiment == "E", "E (array)", Experiment)
  ) %>% 
  filter(stat_type == "GCsAF" & `Site Type` == "Unfiltered") %>% 
  ggplot(
    aes(x = `Non-Reference Allele Frequency`, y = `Mean NRC`, color = Experiment)
    ) + 
  facet_wrap(~super_pop) + 
  geom_line() + geom_point() + theme_few() + scale_color_tableau() + 
  labs(title = "Average NRC by non-reference allele frequency\nfor unfiltered sites") + 
  scale_x_log10()

ggsave("../paper/src/figs/unfiltered_snp_nrc_by_nraf_logscale.pdf")
## Saving 7 x 4 in image

Filtering down to passing sites, we observe:

# for each AF bin, group by assay type and average the NRC for passing sites
keep_concordance_plot_bynraf %>% 
  ungroup() %>% 
  filter(stat_type == "GCsAF" & `Site Type` == "Filtered") %>% 
  mutate(
    Experiment = if_else(Experiment == "E", "E (array)", Experiment)
  ) %>% 
  ggplot(
    aes(x = `Non-Reference Allele Frequency`, y = `Mean NRC`, color = Experiment)
    ) + 
  facet_wrap(~super_pop) + 
  geom_line() + geom_point() + theme_few() + scale_color_tableau() + 
  labs(title = "Average NRC by non-reference allele frequency for filtered sites")

ggsave("../paper/src/figs/filtered_snp_nrc_by_nraf.pdf")
## Saving 7 x 4 in image

We can also write out the exact numbers here to a LaTeX table for completeness:

# values plotted above for unfiltered snps NRC
keep_concordance_plot_bynraf %>%
  filter(stat_type == "GCsAF" & `Site Type` == "Unfiltered") %>%
  ungroup() %>%
  rename(`Super Population` = super_pop) %>%
  select(`Super Population`, `Non-Reference Allele Frequency`, `Mean NRC`, Experiment) %>%
  pivot_wider(names_from = Experiment, values_from = `Mean NRC`) %>%
  kable("latex", booktabs = TRUE, align = c("llrrrr"), digits = 4) %>%
  cat(file = "../paper/src/tabs/unfiltered-snp-nrc-by-nraf.tex")

Imputation r2s

We have binned \(r^2\) values for biallelic SNPs and are interested in performance across the allele frequency spectrum. The bin breakpoints are 0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, 0.25, and 0.5, and the x-axis is the minor allele frequency in 1KGP3.
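For reference, here is a sketch of how variants could be assigned to those bins and labeled by bin midpoint; the `af_midpoint` construction is an assumption about how the precomputed column was derived, and the MAFs are illustrative:

```r
# bin breakpoints as listed above
breaks <- c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
            0.1, 0.15, 0.2, 0.25, 0.5)
midpoints <- head(breaks, -1) + diff(breaks) / 2  # 0.005, 0.015, ..., 0.375

# assign some illustrative MAFs to bins and look up the midpoint label
maf <- c(0.004, 0.012, 0.3)
bin <- cut(maf, breaks = breaks, include.lowest = TRUE)
af_midpoint <- midpoints[as.integer(bin)]  # 0.005, 0.015, 0.375
```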

Let’s recode the assay type to Experiment notation:

binned_r2s = binned_r2s %>% 
  mutate(
    # replace assay type with Experiment X
    Experiment = case_when(
      assay_type == "ilmn_0.5x" ~ "A",
      assay_type == "ilmn_1x" ~ "B",
      assay_type == "bgi_1x" ~ "D",
      assay_type == "gsa" ~ "E"
    ),
    `Minor Allele Frequency` = af_midpoint,
  )

Looking at the number of variants in each of these bins, we are reminded that the n per bin differs from cohort to cohort: a correlation can only be computed between two vectors when neither one is constant, so such sites drop out.
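To see why constant vectors drop out, note that `cor()` is undefined when either input has zero standard deviation (the values below are made up):

```r
truth   <- c(0, 0, 0, 0)  # e.g. every sample homozygous reference at these sites
imputed <- c(0, 1, 0, 2)
sd(truth)            # 0, so the denominator of Pearson's r is zero
cor(truth, imputed)  # NA, with a warning that the standard deviation is zero
```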

# for EUR
binned_r2s %>% 
  filter(super_pop == "EUR") %>% 
  ggplot(aes(x = factor(af_midpoint), y = n_snps)) + geom_bar(stat = "identity") + 
  facet_wrap(~assay_type) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(title = "EUR")

# for AFR
binned_r2s %>% 
  filter(super_pop == "AFR") %>% 
  ggplot(aes(x = factor(af_midpoint), y = n_snps)) + geom_bar(stat = "identity") + 
  facet_wrap(~assay_type) + 
  theme(axis.text.x = element_text(angle = 90)) + labs(title = "AFR")

With that in mind, we can proceed to plotting the mean \(r^2\) for each cohort, assay type, and allele frequency bin. On a log scale, we have

## Saving 7 x 4 in image

Interestingly, the GSA performs better than all other assay types for the AFR cohort at the lowest AF bin. We can also plot this on a linear scale:

## Saving 7 x 4 in image

We can also write out the exact values plotted for each super population to LaTeX tables:

# AFR
binned_r2s %>% 
  filter(super_pop == "AFR") %>% 
  select(mean_r2, `Minor Allele Frequency`, Experiment) %>% 
  rename(`Mean r^2` = mean_r2) %>% 
  pivot_wider(names_from = Experiment, values_from = `Mean r^2`) %>% 
  kable("latex", booktabs = TRUE, align = c("lrrrr"), digits = 4) %>% 
  cat(file = "../paper/src/tabs/r2-by-maf-afr.tex")

# EUR
binned_r2s %>% 
  filter(super_pop == "EUR") %>% 
  select(mean_r2, `Minor Allele Frequency`, Experiment) %>% 
  rename(`Mean r^2` = mean_r2) %>% 
  pivot_wider(names_from = Experiment, values_from = `Mean r^2`) %>% 
  kable("latex", booktabs = TRUE, align = c("lrrrr"), digits = 4) %>% 
  cat(file = "../paper/src/tabs/r2-by-maf-eur.tex")

We can also summarize this by taking the variant-count-weighted average \(r^2\) by experiment and population:

# we can back-calculate the average for bins 
mean_r2s = binned_r2s %>% 
  select(`Minor Allele Frequency`, mean_r2, super_pop, n_snps, Experiment) %>% 
  group_by(super_pop, Experiment, `Minor Allele Frequency`) %>% 
  mutate(r2xnsnps = mean_r2 * n_snps) %>% 
  group_by(Experiment, super_pop) %>% 
  filter(`Minor Allele Frequency` > 0.045) %>% # bins with midpoint above 0.045 correspond to MAF > 5%
  summarise(
    mean_r2 = sum(r2xnsnps) / sum(n_snps)
    ) %>% 
  pivot_wider(names_from = Experiment, values_from = mean_r2) %>% 
  rename(`Super population` = super_pop)
## `summarise()` regrouping output by 'Experiment' (override with `.groups` argument)

mean_r2s
## # A tibble: 2 x 5
##   `Super population`     A     B     D     E
##   <chr>              <dbl> <dbl> <dbl> <dbl>
## 1 AFR                0.892 0.921 0.949 0.830
## 2 EUR                0.916 0.943 0.960 0.907

Polygenic scores

We can compare the PRS head-to-head across these cohorts by filtering down to the samples we keep and then computing the MSE within each category:

scores_keep = scores %>% filter(sample %in% random_samples$sample)
scores_keep_mse = scores_keep %>% 
  group_by(assay_type, super_pop, trait) %>% 
  summarise(
    mse = mean(squared_error), 
    mean_eff_cov = mean(eff_cov)
    )
## `summarise()` regrouping output by 'assay_type', 'super_pop' (override with `.groups` argument)
head(scores_keep_mse)
## # A tibble: 6 x 5
## # Groups:   assay_type, super_pop [3]
##   assay_type super_pop trait     mse mean_eff_cov
##   <chr>      <chr>     <chr>   <dbl>        <dbl>
## 1 bgi_1x     AFR       brca  0.0112         1.24 
## 2 bgi_1x     AFR       cad   0.00271        1.24 
## 3 bgi_1x     EUR       brca  0.0180         1.23 
## 4 bgi_1x     EUR       cad   0.00272        1.23 
## 5 ilmn_0.5x  AFR       brca  0.0393         0.371
## 6 ilmn_0.5x  AFR       cad   0.00672        0.371

and see the average squared error:

scores_keep_mse %>% ggplot(aes(x = assay_type, fill = super_pop, y = mse)) + geom_bar(stat = "identity", position = "dodge") + facet_wrap(~trait) + theme_few() + scale_fill_tableau() + labs(title = "MSE for PRS by super population and trait")

from which we observe that the 1x-target-sequenced Illumina data have consistently lower measurement error than imputed GSA data across traits. However, if we pay attention to our nifty new metric \(\lambda_{\mathrm{eff}}\), we can see that the Illumina runs targeted at 0.5x and 1x come in at mean effective coverages well below those targets.

scores_keep_mse
## # A tibble: 16 x 5
## # Groups:   assay_type, super_pop [8]
##    assay_type super_pop trait     mse mean_eff_cov
##    <chr>      <chr>     <chr>   <dbl>        <dbl>
##  1 bgi_1x     AFR       brca  0.0112         1.24 
##  2 bgi_1x     AFR       cad   0.00271        1.24 
##  3 bgi_1x     EUR       brca  0.0180         1.23 
##  4 bgi_1x     EUR       cad   0.00272        1.23 
##  5 ilmn_0.5x  AFR       brca  0.0393         0.371
##  6 ilmn_0.5x  AFR       cad   0.00672        0.371
##  7 ilmn_0.5x  EUR       brca  0.0221         0.474
##  8 ilmn_0.5x  EUR       cad   0.0126         0.474
##  9 ilmn_1x    AFR       brca  0.0237         0.669
## 10 ilmn_1x    AFR       cad   0.00695        0.669
## 11 ilmn_1x    EUR       brca  0.0164         0.751
## 12 ilmn_1x    EUR       cad   0.00638        0.751
## 13 ilmn_GSA   AFR       brca  0.0337        NA    
## 14 ilmn_GSA   AFR       cad   0.00757       NA    
## 15 ilmn_GSA   EUR       brca  0.0211        NA    
## 16 ilmn_GSA   EUR       cad   0.0128        NA

We can additionally see how the PRS estimates correlate with the true scores by experiment:

scores_plot %>% 
  ggplot(aes(x = score, y = true_score, color = Experiment)) + 
  facet_grid(cols = vars(super_pop), rows = vars(Trait)) + 
  geom_point(alpha = 0.7) + geom_abline() + theme_few() + scale_color_tableau() + 
  labs(
    title = "PRS estimates vs. true score by population and trait",
    x = "Score",
    y = "True Score"
  )

ggsave("../paper/src/figs/prs-estimated-true-scatterplot.pdf")
## Saving 7 x 7 in image

We can also write out the squared Pearson correlations (\(r^2\)) between estimated and true scores for each experiment, super population, and trait:

# correlation between true and imputed scores
scores_correlation = scores_plot %>% 
  filter(Experiment != "C") %>% 
  group_by(Experiment, super_pop, Trait) %>% 
  summarise(correlation = cor(score, true_score) ^ 2)  %>% 
  pivot_wider(names_from = Experiment, values_from = correlation) %>% 
  arrange(Trait) %>% 
  rename(`Super population` = super_pop)
## `summarise()` regrouping output by 'Experiment', 'super_pop' (override with `.groups` argument)
scores_correlation %>% 
  kable("latex", booktabs = TRUE,  digits = 4) %>% 
  add_header_above(c(" " = 2, "Experiment" = 4)) %>% 
  cat(file = "../paper/src/tabs/prs-corr-with-truth.tex")

scores_correlation
## # A tibble: 4 x 6
## # Groups:   Super population [2]
##   `Super population` Trait             A     B     D     E
##   <chr>              <chr>         <dbl> <dbl> <dbl> <dbl>
## 1 AFR                Breast Cancer 0.887 0.935 0.975 0.894
## 2 EUR                Breast Cancer 0.905 0.947 0.970 0.949
## 3 AFR                CAD           0.892 0.920 0.938 0.873
## 4 EUR                CAD           0.957 0.983 0.989 0.972

Tables as LaTeX

Copy/paste these code snippets and run them in a console to get the LaTeX code:

concordance %>%
  distinct(cell_line, super_pop, pop) %>% 
  group_by(pop, super_pop) %>% 
  tally() %>% 
  arrange(super_pop, -n) %>% 
  kable("latex", booktabs = TRUE)

Blog post figures

We want some figures for our blog post/one-pager. In particular, we want the \(r^2\) and PRS figures from above, but with different legends:

## Warning: Problem with `mutate()` input `Experiment`.
## x Unknown levels in `f`: Illumina 0.53x effective coverage
## ℹ Input `Experiment` is `fct_relevel(...)`.
## Warning: Unknown levels in `f`: Illumina 0.53x effective coverage

## Saving 9 x 4 in image
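The `Unknown levels` warnings above are benign: `forcats::fct_relevel()` warns when asked to move a level that is absent from the factor, presumably because a legend label in that chunk does not exactly match a level in the data. A minimal reproduction with made-up levels:

```r
library(forcats)

f <- factor(c("Illumina 1x", "BGI 1x"))
# the requested level does not exist in f, so fct_relevel() warns
# and returns f with its levels unchanged
fct_relevel(f, "Illumina 0.53x effective coverage")
```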

and we can also plot this with just the BGI and GSA data.

## Warning: Problem with `mutate()` input `Experiment`.
## x Unknown levels in `f`: Illumina 0.53x effective coverage
## ℹ Input `Experiment` is `fct_relevel(...)`.
## Warning: Unknown levels in `f`: Illumina 0.53x effective coverage

## Saving 9 x 4 in image

and the same for PRS

## Saving 10 x 6 in image

Revisions

Revisions requested included an analysis using different imputation programs for both the array and sequence data. We can do this on the randomly selected subset of individuals, using GLIMPSE for the sequence data and IMPUTE5 for the array data.