Overview of DSM. (A) We consider the design of a match score function that quantifies how likely it is that a given gene expression profile and a given genotype profile originated from the same individual. (B) A malicious actor can use such a function to link individuals across different data sets, which could lead to the reidentification of a data sample corresponding to an individual. (C) Owing to model limitations, existing works investigating this possibility analyzed only a subset of eQTLs (genetic variants associated with gene expression levels) that are statistically independent of one another (unshaded nodes). (D) We introduce the discriminative sequence model (DSM), which builds upon the standard Li–Stephens hidden Markov model of genetic sequences to jointly leverage the predictive signals across all known eQTLs; it calibrates for redundancy and correlation among eQTLs to provide a more accurate, sequence-level match score. Our modeling approach helps reveal the full extent of genotypic information in gene expression profiles to better inform privacy risk assessment.
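
To make the notion of a match score in panels (A) and (C) concrete, the following is a minimal sketch of an independence-based baseline, not the DSM itself. It assumes that each eQTL contributes a separate log-likelihood ratio comparing an expression-derived genotype probability against a population frequency. The function name `independent_match_score`, the input layout, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def independent_match_score(posteriors, priors, genotypes):
    """Sum per-eQTL log-likelihood ratios, assuming independent eQTLs.

    posteriors[i][g]: assumed P(genotype g at eQTL i | expression profile)
    priors[i][g]:     assumed population frequency of genotype g at eQTL i
    genotypes[i]:     observed genotype at eQTL i (0, 1, or 2 alternate alleles)
    """
    score = 0.0
    for post, prior, g in zip(posteriors, priors, genotypes):
        # Positive terms mean the expression profile makes this genotype
        # more likely than it is in the general population.
        score += math.log(post[g]) - math.log(prior[g])
    return score


# Toy inputs for two independent eQTLs (illustrative values only).
priors = [[0.25, 0.50, 0.25], [0.49, 0.42, 0.09]]
posteriors = [[0.60, 0.30, 0.10], [0.20, 0.50, 0.30]]

# A genotype profile that agrees with the expression-derived probabilities
# scores higher than one that disagrees with them.
consistent = independent_match_score(posteriors, priors, [0, 2])
inconsistent = independent_match_score(posteriors, priors, [2, 0])
print(consistent > inconsistent)
```

Summing the terms is only justified when the eQTLs are statistically independent. With correlated eQTLs, the same evidence would be counted more than once, which is the limitation that restricts such a baseline to the unshaded nodes in panel (C).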
