DSM leads to fewer false positives when linking individuals with unknown membership in the target genotype set. When it is unknown whether the individual associated with a query expression profile is included in the target data set, a match score threshold can be applied so that links are drawn only for scores greater than the threshold. The choice of this threshold determines the tradeoff between detecting true matches and avoiding spurious links. Receiver-operating-characteristic (ROC) curves (A) and precision-recall curves (B) illustrating this tradeoff are shown for DSM, EBL, and GNB, evaluated on the expanded HRC data set with 22,288 individuals. The average area under the ROC curve (AUROC) and area under the precision-recall curve (AUPRC) are reported in parentheses for each method. Full descriptions of the relevant evaluation metrics (e.g., true-positive rate and false-positive rate) are provided in Supplemental Note S2. The curves are averaged over 100 trials of a holdout experiment: before each trial of the linking procedure, the genotype profiles of a random half of the FUSION individuals were excluded from the target set so that false detection error could be assessed. DSM finds more true matches than previous methods while drawing fewer incorrect links for individuals without a match.
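
The threshold tradeoff described above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code, and the exact metric definitions are those of Supplemental Note S2, which may differ. The sketch assumes each query has a single best match score and a label saying whether that top-scoring candidate is the true individual (necessarily false for held-out individuals). The function names `link_metrics`, `roc_points`, and `auroc`, and the toy scores, are hypothetical.

```python
from typing import List, Tuple


def link_metrics(scores: List[float], is_true_match: List[bool],
                 threshold: float) -> Tuple[float, float, float]:
    """Return (TPR, FPR, precision) when links are drawn only for scores > threshold.

    Assumption: a query counts as a positive if its top-scoring candidate is the
    true individual, and as a negative otherwise (including held-out individuals).
    """
    tp = fp = 0
    for score, correct in zip(scores, is_true_match):
        if score > threshold:  # a link is drawn for this query
            if correct:
                tp += 1
            else:
                fp += 1
    positives = sum(is_true_match)
    negatives = len(is_true_match) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    # Convention chosen here: precision is 1.0 when no links are drawn.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tpr, fpr, precision


def roc_points(scores: List[float],
               is_true_match: List[bool]) -> List[Tuple[float, float]]:
    """Sweep the threshold from the highest score down to -inf; return (FPR, TPR) pairs."""
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tpr, fpr, _ = link_metrics(scores, is_true_match, t)
        points.append((fpr, tpr))
    return points


def auroc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    is_true_match = [True, True, False, True, False, False]
    print(link_metrics(scores, is_true_match, threshold=0.5))
    print(auroc(roc_points(scores, is_true_match)))
```

Raising the threshold moves along the curve toward fewer links of both kinds, so a stricter threshold lowers the false-positive rate at the cost of missing some true matches. Averaging curves over repeated holdout trials, as in the caption, would require running this sweep once per trial.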
