# PRiMeR Software Examples

## Introduction
This directory contains the `PRiMeR_quickstart.ipynb` notebook, which demonstrates how to use the PRiMeR framework and the baseline models on your own data. The experiments in the paper, including the simulations, used real exposures from the UK Biobank (UKBB). Here, we provide an example based on synthetic data generated by the `generate_data.py` script, so that users can explore PRiMeR's functionality without access to UKBB data.

## Inputs to PRiMeR Quickstart
The notebook utilizes the following data files:
- **risk factor file**: Parquet file containing data for `n_individuals` x `n_risk_factors`.
- **genotype file**: Binary PLINK files for `n_individuals` and selected independent variants (variants are selected through the GWAS and joint clumping procedure on risk factors as described in the paper).
- **covariate file**: Parquet file containing data for `n_individuals` x `n_covariates`.
- **sumstats file**: Parquet file containing external disease summary statistics, harmonized with the variants in the BED file.
- **follow-up outcome**: Parquet file containing `n_individuals` x `1` follow-up data. This file is not required to train PRiMeR and may be unobserved in real data applications. It is available in the notebook because the data are simulated, and it is used to validate the predictors learned by the MR-based models.
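
Before running the notebook on your own data, it can help to check that the inputs line up. The sketch below is not part of PRiMeR: the function name `check_inputs` is hypothetical, and it assumes that the risk factor and covariate tables are indexed by individual ID and that the summary statistics are indexed by variant ID in the same order as the genotype variants.

```python
import pandas as pd


def check_inputs(risk_factors, covariates, variant_ids, sumstats):
    """Hypothetical helper: verify that the input tables are aligned.

    Assumes (not confirmed by the PRiMeR docs) that individuals are the
    row index of the risk factor and covariate tables, and that variant
    IDs are the row index of the summary statistics table.
    """
    if not risk_factors.index.equals(covariates.index):
        raise ValueError("risk factor and covariate rows are not aligned")
    if sumstats.index.tolist() != list(variant_ids):
        raise ValueError("summary statistics are not harmonized with the variants")
    return {
        "n_individuals": risk_factors.shape[0],
        "n_risk_factors": risk_factors.shape[1],
        "n_covariates": covariates.shape[1],
        "n_variants": len(variant_ids),
    }


# Tiny in-memory example with made-up values.
ids = ["id1", "id2", "id3"]
rf = pd.DataFrame({"rf_0": [0.1, -0.3, 1.2], "rf_1": [0.5, 0.0, -0.7]}, index=ids)
cov = pd.DataFrame({"age": [50, 61, 47]}, index=ids)
variants = ["snp_a", "snp_b"]
ss = pd.DataFrame({"beta": [0.02, -0.01], "se": [0.01, 0.01]}, index=variants)
shapes = check_inputs(rf, cov, variants, ss)
```

In practice the tables would be loaded with `pd.read_parquet(...)` and the variant IDs taken from the PLINK fileset; the column names above are illustrative only.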

## Example with Generated Data
To showcase the formats of the data and the execution of all methods, we use the simulated data generated by `generate_data.py`.
The script simulates the data using the following steps:
1. **Simulate Variants**: Variants are simulated from a binomial distribution with two trials and allele frequencies drawn uniformly between 2% and 20%. Each risk factor is affected by `num_qtls_rf` QTLs, and `num_shared_qtls_rf` QTLs affect all risk factors.
2. **Simulate Risk Factors**: `num_rfs` risk factors are simulated such that the variance explained cumulatively by all QTLs is `var_geno` for each risk factor.
3. **Simulate Aggregate Risk and Outcome**: The aggregate risk and the outcome are simulated from all risk factors, following the strategy described in the simulations section of our paper.
4. **Data Splitting**: The dataset is split into an outcome split with sample size `no` (where the GWAS of the outcome is performed and summary statistics are computed) and a population cohort split with sample size `ne` (where population-level data for risk factors and genetics are available; see also Figure 1 of the main manuscript).
5. **Data Export**: All data are exported to the designated output directory, by default `../data`.
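
Steps 1, 2, and 4 can be sketched in a few lines of NumPy. This is a simplified illustration, not the code in `generate_data.py`: the function name `simulate_toy_data` is hypothetical, the shared QTLs are assumed to count towards each risk factor's `num_qtls_rf`, effect sizes are assumed to be standard normal, and step 3 is omitted because it follows the strategy described in the paper.

```python
import numpy as np


def simulate_toy_data(no=1000, ne=1000, num_rfs=3, num_qtls_rf=10,
                      num_shared_qtls_rf=2, var_geno=0.2, seed=0):
    """Toy version of steps 1, 2 and 4 (assumptions stated in the text)."""
    rng = np.random.default_rng(seed)
    n = no + ne
    num_unique = num_qtls_rf - num_shared_qtls_rf  # QTLs specific to one risk factor
    num_variants = num_shared_qtls_rf + num_rfs * num_unique

    # Step 1: allele frequencies in [2%, 20%], genotypes ~ Binomial(2, freq).
    freqs = rng.uniform(0.02, 0.20, size=num_variants)
    genotypes = rng.binomial(2, freqs, size=(n, num_variants))
    z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

    # Step 2: genetic component scaled to variance var_geno, plus noise.
    genetic = np.empty((n, num_rfs))
    risk_factors = np.empty((n, num_rfs))
    shared = np.arange(num_shared_qtls_rf)
    for k in range(num_rfs):
        start = num_shared_qtls_rf + k * num_unique
        idx = np.concatenate([shared, np.arange(start, start + num_unique)])
        g = z[:, idx] @ rng.normal(size=idx.size)
        g = (g - g.mean()) / g.std() * np.sqrt(var_geno)
        e = rng.normal(size=n)
        e = (e - e.mean()) / e.std() * np.sqrt(1.0 - var_geno)
        genetic[:, k] = g
        risk_factors[:, k] = g + e

    # Step 4: disjoint outcome split (size no) and population cohort split (size ne).
    perm = rng.permutation(n)
    return {
        "genotypes": genotypes,
        "risk_factors": risk_factors,
        "genetic_component": genetic,
        "outcome_idx": perm[:no],
        "exposure_idx": perm[no:],
    }


sim = simulate_toy_data()
```

Scaling the genetic component and the noise separately means each risk factor has approximately unit variance, with a fraction of roughly `var_geno` attributable to its QTLs.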

We can generate data with default parameters `num_rfs=30`, `num_qtls_rf=30`, `num_shared_qtls_rf=5`, `var_geno=0.2`, `no=50_000`, and `ne=50_000`, using the following command:
```bash
python generate_data.py
```
By default, all files are written to `../data`.
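
After the script finishes, a quick way to confirm that the export worked is to list what landed in the output directory. The helper below is a hypothetical convenience function, not part of PRiMeR. It makes no assumption about the exact file names and only groups files by extension: `.parquet` for the tables and `.bed`/`.bim`/`.fam` for the binary PLINK fileset.

```python
from pathlib import Path

PLINK_SUFFIXES = {".bed", ".bim", ".fam"}


def summarize_output_dir(data_dir="../data"):
    """Hypothetical helper: group exported files by type."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"output directory not found: {data_dir}")
    files = sorted(p for p in data_dir.iterdir() if p.is_file())
    return {
        "parquet": [p.name for p in files if p.suffix == ".parquet"],
        "plink": [p.name for p in files if p.suffix in PLINK_SUFFIXES],
    }
```

Calling `summarize_output_dir()` from this directory should report the Parquet tables and one complete PLINK fileset; an empty list for either group suggests the export did not complete.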

## Conclusion
This example is a practical guide to using the PRiMeR software with simulated data and to adapting the workflow to your own datasets. Together, the released software, example code, and simulation functions can be used to reproduce all experiments in the PRiMeR paper, given access to the relevant fields in the UKBB dataset.
