De novo detection of somatic variants in high-quality long-read single-cell RNA sequencing data

  1. Niko Beerenwinkel1,2
  1. 1Department of Biosystems Science and Engineering, ETH Zurich, 4056 Basel, Switzerland;
  2. 2SIB Swiss Institute of Bioinformatics, 4056 Basel, Switzerland;
  3. 3Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02142, USA;
  4. 4Ovarian Cancer Research, Department of Biomedicine, University Hospital Basel and University of Basel, 4031 Basel, Switzerland
  • Corresponding author: niko.beerenwinkel{at}bsse.ethz.ch
  • Abstract

    In cancer, genetic and transcriptomic variations generate clonal heterogeneity, leading to treatment resistance. Long-read single-cell RNA sequencing (LR scRNA-seq) has the potential to detect genetic and transcriptomic variations simultaneously. Here, we present LongSom, a computational workflow leveraging high-quality LR scRNA-seq data to call de novo somatic single-nucleotide variants (SNVs), including in mitochondria (mtSNVs), copy number alterations (CNAs), and gene fusions, to reconstruct the tumor clonal heterogeneity. Before somatic variant calling, LongSom reannotates marker gene-based cell types using cell mutational profiles. LongSom distinguishes somatic SNVs from noise and germline polymorphisms by applying an extensive set of hard filters and statistical tests. Applying LongSom to human ovarian cancer samples, we detected clinically relevant somatic SNVs that were validated against matched DNA samples. Leveraging somatic SNVs and fusions, LongSom found subclones with different predicted treatment outcomes. In summary, LongSom enables de novo variant detection without the need for normal samples, facilitating the study of cancer evolution, clonal heterogeneity, and treatment resistance.

    Cancer cells accumulate somatic genomic variations, such as single-nucleotide variants (SNVs), copy number alterations (CNAs), and gene fusions during their lifetime, leading to intratumor heterogeneity, i.e., the existence of cancer subpopulations with distinct genotypes and phenotypes. This is presumed to be a leading cause of therapy resistance and one of the main reasons for poor overall survival in cancer patients with metastatic disease (Jamal-Hanjani et al. 2015; Vasan et al. 2019; Ramón Y Cajal et al. 2020). While genetic mechanisms are often evoked as the primary cause of therapeutic resistance in those subpopulations, the adaptive mechanisms underlying therapy resistance are both of genetic (SNVs, CNAs, gene fusions, etc.) and nongenetic (epigenetic, transcriptomic, microenvironment, etc.) origin (Mansoori et al. 2017; Marine et al. 2020). The first step in identifying therapy-resistant subclones is, therefore, to examine these interlinked features jointly (Foord et al. 2023) by capturing genetic and transcriptomic variants at the single-cell level (Mansoori et al. 2017; Dagogo-Jack and Shaw 2018; Marine et al. 2020).

    Droplet-based scRNA-seq (e.g., 10x Genomics Chromium) can detect same-cell genetic and transcriptomic variants. However, those protocols can only capture RNA molecules via their 3′ or 5′ ends, and short-read (SR) scRNA-seq coverage is heavily biased toward the 3′/5′ end of genes. Recently, methods to call SNVs (Zhang et al. 2023; Muyas et al. 2024) and CNAs (Serin Harmanci et al. 2020; Gao et al. 2021, 2023) in SR scRNA-seq were developed, compensating the 3′ capture bias by pooling large amounts of cells or sequencing at very high read depths. However, SR scRNA-seq is unsuited to detect isoforms or gene fusions. Long-read (LR) scRNA-seq, in contrast, sequences full-length RNA molecules, and we have shown in recent work that high-quality LR scRNA-seq can simultaneously detect clinically relevant SNVs, CNAs, fusions, and isoform-level expression in the same cells (Joglekar et al. 2021; Dondi et al. 2023; Shiau et al. 2023; Al'Khafaji et al. 2024; Qin et al. 2025).

    Somatic variants are typically identified by comparing variants from tumor biopsies with those from matched normal biopsies derived from respective healthy tissue. However, as matched normals are rarely available, methods to detect somatic variants de novo were developed for bulk sequencing, mainly relying on germline variant allele fraction (VAF) profiles and tumor purity estimates combined with extensive filtering against public databases (Teer et al. 2017; Sun et al. 2018; Shiau et al. 2023). As sensitivity/recall is similar to variant detection with a matched normal, but precision is lower, tumor-only detection in bulk sequencing is usually deemed more appropriate for detecting known mutations than detecting variants de novo (Teer et al. 2017). For scRNA-seq, SComatic (Muyas et al. 2024) was developed to call variants de novo without matched DNA-seq normal, leveraging noncancer microenvironment cells in the tumor biopsy to differentiate somatic from germline variants. This approach relies on initial cell type annotation based on gene expression patterns, an open challenge due to overlapping, poorly expressed, or incomplete marker gene sets, and even a low percentage of cancer cells misannotated as noncancer will lead to false-negative variants filtered out as germline (Muyas et al. 2024). Consequently, as somatic variant calling depends intrinsically on the quality of cell type annotations, methods ensuring the correctness of the annotations are needed.

    Cell types and clonal substructures are traditionally identified in scRNA-seq using gene expression profiles. However, identifying different cancer clones requires sufficient transcriptional divergence between them. Instead, clonal substructures can also be reconstructed using somatic variants (Zhou et al. 2020; Gao et al. 2021, 2023; Kannan et al. 2022; Muyas et al. 2024). For this, mitochondrial SNVs (mtSNVs) serve as an excellent complement to nuclear SNVs and fusions (Kwok et al. 2022; Miller et al. 2022), as mitochondrial RNA (mtRNA) is highly available (Osorio and Cai 2021) and mutated (>10-fold higher than in the nuclear genome) (Wallace 1994) in scRNA-seq. mtSNVs can also be pathogenic in cancer (Koshikawa et al. 2017), yet, few variant calling methods properly integrate mtSNVs detection (Mukherjee et al. 2023).

    This study aims to provide a computational workflow for detecting variants (SNVs, mtSNVs, fusions, and CNAs) in LR scRNA-seq of tumor tissue samples without requiring matched normal samples, subsequently integrating them to reconstruct the samples’ clonal heterogeneity, and identifying subclones with different predicted treatment outcomes.

    Results

    Overview of LongSom, a computational workflow for LR scRNA-seq variant detection and clonal reconstruction

    We developed LongSom, a workflow for detecting genetic variants and finding cancer subclones in LR scRNA-seq data without requiring matched normal. Briefly, LongSom takes BAM files and cell type annotations as input, reassesses the cell type annotations, subsequently calls SNVs, mtSNVs, fusions, and CNAs in single cells based on the reannotated cell types, and finally reconstructs the clonal heterogeneity (Fig. 1A).

    Figure 1.

    Overview of LongSom. (A) LongSom's methodology for detecting somatic SNVs, fusions, and CNAs and subsequently inferring cancer subclones in LR scRNA-seq individual patients data. (1) SNV and (2) fusion candidates are detected from pseudo-bulk samples. (3) High-confidence cancer variants (SNVs and fusions) are selected based on mutated cell fraction in cancer and noncancer cells. (4) Cells are reannotated based on high-confidence cancer variants. (5) A new set of candidate variants is called based on reannotated barcodes. (6) Candidate SNVs are filtered through a set of 10 filters. (7) Cells are clustered based on somatic fusions and SNVs. In parallel, (8) gene expression per cell is computed, (9) CNAs are detected, (10) cells are clustered based on CNAs, and (11) CNA clones are incorporated into the fusions and SNVs clustered matrix. (B) Candidate nuclear SNV successive filtering steps. Candidates passing all 10 steps are called as somatic SNVs (Methods). (C) Candidate mtSNVs filtering steps. ΔMCF represents the difference of mutated cell fractions (MCFs) between cancer and noncancer cells. Candidates passing all five steps are called as somatic mtSNVs (Methods).

    To avoid false-negative calls due to cell type misannotation or cells containing high levels of ambient cancer RNA, LongSom reannotates the marker-based cell types. For this, LongSom calls a set of “high-confidence cancer variants” (SNVs, mtSNVs, and fusions) following eight filtering steps, and reannotates cells based on their mutational burden (Methods). LongSom then recalls variants using reannotated cell types. Somatic SNVs are identified in 10 filtering steps including hard filters and statistical testing (Fig. 1B; Methods). Calling mtSNVs remains challenging, as high levels of ambient mtRNA, released by dead or dying cells (Young and Behjati 2020), can cause false-negative calls on the bulk level (loci wrongly excluded as germline) and false-positive calls on the single-cell level (contaminated noncancer cells called as mutated). Therefore, LongSom treats mtSNVs differently from nuclear SNVs and calls them in five filtering steps (Fig. 1C; Methods). Finally, LongSom infers the clonal structure of the samples using two different approaches. One approach leverages the detected SNVs, mtSNVs, and fusions as input for the Bayesian nonparametric clustering method BnpC (Borgsmüller et al. 2020). The other approach predicts CNAs based on gene expression in cancer cells and defines subclusters using inferCNV (https://github.com/broadinstitute/infercnv) (Methods).

    LongSom reannotates cell types based on mutational profiles

    We applied LongSom to previously published high-quality (Pacific Biosciences [PacBio]) LR scRNA-seq data of omentum metastasis samples obtained from three chemo-naive high-grade serous carcinoma (HGSOC) patients: P1, P2, and P3 (Dondi et al. 2023). Those samples were composed of 337 cancer cells (41,959 median unique molecular identifiers [UMI] per cell) and 1225 microenvironment cells (11,716 median UMI per cell), referred to as “noncancer cells” in the following. After cell type reannotation, we found that cells reannotated as cancer were mostly clustering with cells previously annotated as cancer cells based on expression data, but some clustered with noncancer cells (Fig. 2A). We found that 8%, 2%, and 27% of the cells that LongSom annotated as cancer were previously annotated as noncancer cells in the tumor biopsy samples of patients P1, P2, and P3, respectively (Fig. 2B). The tumor biopsy of patient P3 had only 10% cancer cells (Dondi et al. 2023), which could explain the high level of cell misannotation. Cells reannotated from cancer to noncancer cells had a similar mutational burden than cells previously annotated as noncancer in all patients. In patients P1 and P2, cells reannotated from noncancer to cancer cells had a similar mutational burden and mean fraction of mutated loci as cells previously annotated as cancer (P > 0.05, Tukey–Krammer's test) (Kramer 1956), while being significantly different from cells previously annotated a noncancer (P < 0.001) (Fig. 2C,D). In P3, cells reannotated from noncancer to cancer cells were significantly different from cells previously annotated as both cancer and noncancer. Those cells were likely misannotated due to low expression in the first place, leading to a lower mutational burden. Furthermore, cells with the lowest mutational burden are also cells not clustering with cancer cells (Fig. 2A; Supplemental Fig. S1A). Those are likely cells with high levels of ambient RNA, and while likely not cancer cells, they would still cause false negatives if they were not reannotated. In summary, cell type reannotation reduced the cell-variants noise (Fig. 2E; Supplemental Fig. S1B,C), and in the following, cancer or noncancer cells refer to the reannotated cell types.

    Figure 2.

    Validation of the cell type reannotation based on mutational profiles. (A) UMAP embeddings of LR scRNA-seq expression per patient. Cells are colored by annotation status; light-red cells were previously predicted as noncancer using marker gene expression-based annotation and were reannotated as cancer by LongSom based on high-confidence cancer variants. (B) Confusion matrices of cells predicted as cancer or noncancer using marker genes, and cells reannotated as cancer or noncancer by LongSom, colored and annotated by the percentage of the total number of cells in each category. For example, the bottom left square represents cells previously annotated as noncancer that were reannotated as cancer (false-negative cancer cells). (C,D) Boxplots of (C) the fraction of SNV loci that were found mutated in each cell, considering only loci with minimum coverage of one read at the locus in a cell, and (D) the total number of SNVs mutated in each cell, per patient, colored by their annotation status. Points represent individual cells, and boxes display the first to third quartile with the median as horizontal line, whiskers encompass 1.5 times the interquartile range. P values were calculated using Tukey–Krammer's test and are described with the following symbols: n.s: P > 0.05, (*) P ≤ 0.05, (**) P ≤ 0.01, (***) P ≤ 0.001. (E) Cell-variant matrixes of patient P3 before (left) and after (right) reannotation. Red indicates that a locus is mutated in a cell (bet-binomial test P-value < 0.05), and white that it is not (either P > 0.05 or no coverage).

    Validation of LongSom somatic calls using scRNA-seq and scWGS data

    After reannotation, we processed 4,271,449 candidate loci with at least one mutated read and five reads coverage in aggregated cancer cells (Fig. 3A). LongSom called 822 somatic SNVs passing all filters, which mapped to intronic regions (75%), exonic regions (10.9%), 3′-UTR regions (6.2%), intergenic regions (4.8%), splicing sites (1.7%), and 5′-UTR regions (1.4%) (Fig. 3A,B). In this data set, we previously identified that 32% of the reads resulted from contaminating DNA internally primed on their intronic poly(A)-rich regions (Dondi et al. 2023), a common phenomenon in scRNA-seq known as “intrapriming” (Verwilt et al. 2023). While those reads are removed from transcript counts, they are valuable to call SNVs in scRNA-seq (as they come from same-cell DNA) and they can explain the large fraction of intronic variants we observed. A whole-genome sequencing (WGS) study of a cohort of 962 individuals (Morrison et al. 2013) found 58% of intergenic variants, 36% of intronic variants, and 6% of “functional” (exonic, splicing, 3′ UTR, and 5′ UTR and splicing) variants. This study observed a 1–17 ratio of functional to intronic and intergenic variants, while we observed a 1–5 ratio, indicating that droplet-based scRNA-seq still selects functional variants despite intrapriming.

    Figure 3.

    Validation of LongSom somatic calls using scRNA and scWGS data. (A) LongSom SNV filtering workflow, indicating the number of loci passing each of the 10 filtering steps (top, green) or being filtered (bottom, red), starting from all loci with at least one mutated read and five reads coverage in cancer cells. (B) Waffle plot representing each of the 822 somatic SNVs detected by LongSom, colored by their RefSeq functional annotation. (C) Waffle plot of all loci called by LongSom with sufficient power in scWGS for validation (Methods), colored by their status in scWGS. (D) Boxplot of the fraction of all loci called after filtering step 9 that is supported by scWGS data as either somatic or germline, colored by the distance from the closest mapping SNV also detected. LongSom calls represent all 822 calls after filtering step 10 (SNVs not within 10,000 bp or less from each other). Each point is a patient.

    To validate those calls, we used single-cell WGS (scWGS) data from matched omental metastases for each patient. In addition to diploid clones (likely noncancer), we found two aneuploid clones (likely cancer) in patient P1 scWGS data, one in P2, and two in P3 (Methods; Supplemental Fig. S2). In the 211 loci called as somatic by LongSom with sufficient scWGS depth, we called SNVs in scWGS clones, and loci called in at least one aneuploid clone and in none of the diploid clones were defined as somatic, while loci called in diploid clones were defined as germline. We found that 80% of the calls were supported as somatic in scWGS, while 11% were supported as germline calls and therefore likely to be false positive (Fig. 3C). We also found that 9% of the loci were not called in any clone, possibly due to tumor heterogeneity between the scWGS and scRNA-seq samples. We also investigated the scWGS support of exonic, 3′ UTR, 5′ UTR, and splice-site variants versus intronic and intergenic variants, and found similar results, confirming that the intronic and intergenic variants identified are not false positives (Supplemental Fig. S3).

    For patients P1 and P3, we had access to scRNA-seq data from matching distal tumor-free omental tissues (Dondi et al. 2023). As an additional validation, we called SNVs in those samples and found that 13% (P1) and 4% (P3) of LongSom somatic calls in the omentum metastasis samples were also mutated in the matched normal, i.e., germline false positives (Supplemental Fig. S4). We used patients P1 and P3's matched normal as a long-read “panel of normals” (PoN) in our analysis (excluding the matched samples for P1 and P3, see Methods). This PoN filtered 9%–25% of the somatic variants called by LongSom (Supplemental Fig. S4). Those filtered loci had low noncancer coverage (typically 5–10 reads); therefore, they were more sensitive to germline false positives. Altogether, the high support for somatic mutations in scWGS data, as well as the low amount of false-positive germline calls found in both scWGS and matched normal scRNA data show LongSom's ability to call somatic variants without matched normal.

    We investigated the correlation between the distance separating somatic SNVs and their support in scWGS. When taking all 2519 loci passing filter 9 (Fig. 3A), we found that the lower the mapping distance between two somatic SNVs, the lower the support in scWGS data, with high rates of germline and low rates of somatic variants (Fig. 3D). Even SNVs within a 1000–10,000 bp distance had low scWGS support, suggesting gene-wise allelic expression differences between cancer and noncancer cells (Fig. 3D). Therefore, LongSom filters loci within a 10,000 bp mapping distance from each other.

    Somatic mitochondrial reads contaminate tumor microenvironment cells in scRNA-seq and scWGS data

    LongSom detected five mtSNVs in patient P1 at positions 2815, 3092, 5179, 13,635, and 16,192, two in P2 at positions 2573 and 16,065, and none in P3. In patient P1, cancer cells had a mean VAF per cell ranging from 10% to 98% for all identified loci, while noncancer cells had a mean VAF per cell ranging from 0.1% to 3% (Fig. 4A). Cells from a matched normal biopsy had mean VAF per cell <0.001% at all loci, discarding germline heteroplasmies and suggesting a contamination of noncancer cells by cancer-derived mtRNA. In Patient P2, we also observed a similar VAF profile in noncancer cells at locus 2573 (Fig. 4B). To assess the levels of ambient mtRNA derived from cancer cells, we computed the VAF of each mtSNV loci in all empty droplets containing no cell and only ambient RNA (Methods). We found that empty droplets had cancer-like VAF profiles, with mean VAFs 3.4 (±1) times higher than the mean VAF of the biopsy (Fig. 4C). We also found a strong correlation (R = 0.93) between aggregated mutated reads in empty droplets and noncancer cells (Fig. 4D). In contrast, the correlation between the total coverage and the number of mutated reads was weaker (R = 0.84). This supports that mutated reads observed in noncancer cells come from cancer ambient mtRNA. We investigated the mtSNVs in a matched scWGS sample and found similar contamination profiles in cells from noncancer clones at all loci (Fig. 4E,F). This suggests the presence of ambient cancer mtDNA, too, and possibly of entire mitochondria. In comparison, the seven mtSNVs detected by LongSom were all filtered by SComatic (Muyas et al. 2024) due to the mitochondrial noise levels described above.

    Figure 4.

    Mitochondrial reads harboring somatic mutations are detected in noncancer cells. (A,B) Violin plots of the VAF of each cell for (A) patient P1 and (B) P2 mtSNV loci in scRNA-seq data, categorized by reannotated cell types and empty droplets in tumor and normal biopsies. Individual points are cells or droplets. The blue dashed line represents the mean VAF in cells from the tumor biopsy. The red dashed line represents the mean VAF in each category. n refers to the number of cells with at least one read covering the locus. (C,D) Log aggregated mutated reads in noncancer cells, as a function of (C) log aggregated mutated reads in empty droplets and (D) log aggregated total reads in noncancer cells, for all loci from P1 and P2 except locus 16,065 in P2 which was discarded due to low expression. (E,F) Violin plots of the VAF of each cell for (E) patient P1 and (F) P2 mtSNV loci in scWGS data, categorized by clones in the tumor biopsy. Individual points are cells. The red dashed line represents the mean VAF in each clone. n refers to the number of cells with at least one read covering the locus.

    LongSom outperforms SComatic for somatic variant calling

    We compared somatic SNV calls from LongSom and SComatic. The main differences between the two algorithms are that LongSom corrects cell type annotations, uses a stricter beta-binomial threshold when filtering out germline variants, and filters the final calls within 10,000 bp of each other. To ensure that LongSom performance was not only driven by the 10,000 bp filter, we also applied this filter to SComatic calls. LongSom found 342, 145, and 340 somatic SNVs in patients P1, P2, and P3, respectively, while SComatic found 319, 155, and 260 (Fig. 5A; Supplemental Tables S1, S2). The calls overlapped by only 63%, 40%, and 72% between the two methods in each patient, while 21%, 10%, and 33% of the calls were unique to LongSom, versus 16%, 18%, and 13% for SComatic. The largest difference was observed in P3, which is concordantly the patient with the largest proportion of cells reannotated, followed by P1 (Figs. 2, 5C). Calls unique to LongSom had 0.82–0.83 somatic support in scWGS, similar to calls common to both methods (0.72–0.86), while calls unique to SComatic had lower somatic support (0.38–0.50) (Fig. 5C). Calls unique to LongSom (0–0.17) and common (0.11–0.14) to both methods also had a lower proportion of germline support in scWGS data (0.11–0.5) than calls unique to SComatic or common to both methods (Fig. 5C).

    Figure 5.

    Comparison of LongSom versus SComatic performance using scWGS data. (A) Venn diagrams of the intersection between LongSom and SComatic somatic calls in LR scRNA-seq data from patients P1, P2, and P3. (B) Venn diagrams of the intersections between scWGS SNV calls. LongSom LR scRNA-seq calls and SComatic LR scRNA-seq calls. (C) Boxplot of all loci called after filtering step 9, colored by the distance from the closest mapping SNV also detected, and classified by somatic or germline scWGS support. Each point is a patient. (D) Performance of LongSom and SComatic for detecting somatic mutations in LR scRNA-seq data. Each point is a patient, the bars represent the mean value, and the error bars are the standard deviation for each statistic computed.

    Next, we compared the performances of LongSom versus SComatic. To this end, we considered a set of 323 mutations detected in scWGS data with sufficient coverage in scRNA-seq data for benchmarking (Methods). LongSom call set had more overlap with scWGS callset than SComatic in all patients, especially in patient P3 (Fig. 5B). In scWGS calls supported by at least one read in scRNA-seq data, LongSom achieved a score of 0.44–0.5 precision across the three samples, similar to SComatic (0.41–0.46) (Fig. 5D). However, LongSom achieved a superior sensitivity (0.19–0.55) than SComatic (0–0.13), and a higher F1 score. The higher sensitivity is largely due to LongSom cell reannotation that prevents false-negative germline calls, whereas SComatic filters out SNVs that are supported as somatic by scWGS data. The LR PoN filtered very similar sets of false-positive somatic calls from LongSom and SComatic (overlap 0.82–0.91), further suggesting that those false positives are caused by low noncancer coverage (Supplemental Fig. S4).

    LongSom detects panel-validated variants

    The three patients also underwent bulk panel DNA sequencing (Methods), where 29 SNVs were found (Supplemental Table S3). All three patients had at least one somatic SNV called in TP53 (including a variant introducing a stop codon in patient P3) with a VAF >30%, and LongSom detected all of those. The SNVs detected in other genes of the panel were not retained with our method for the following reasons: they were either identified as germline variants (n = 5), detected in cancer but with insufficient coverage in noncancer cells (n = 3), detected but not in enough cancer cells (n = 7), not detected despite sufficient coverage (n = 3), or not covered (n = 8) (Supplemental Table S3). Overall, 62% of the SNVs detected in the panel also found support in scRNA data. Since the scRNA-seq and panel-seq samples originated from different regions of the biopsy, the false negatives could be due to tumor heterogeneity or to the low VAF (<0.1) of some variants (n = 4) (Supplemental Table S3). In comparison, SComatic only detected one TP53 variant (P1) in LR scRNA-seq data. Of note, two deletions were found in panel sequencing, and they were detected manually in the LR scRNA-seq data (Supplemental Table S3).

    LongSom identifies subclones in LR scRNA-seq data matching subclones in scWGS data

    LongSom detected four fusions in patient P1, 16 in P2, and 2 in P3, using CTAT-LR-fusion as described in Qin et al. (2025) (Supplemental Table S4). Next, LongSom inferred the clonal structure of the tumors based on the SNVs and fusions it detected using BnpC. LongSom also inferred the clonal structure from CNA profiles in the same cells, using inferCNV (Supplemental Fig. S5; Methods). We also clustered the cells based on their gene expression, manually annotated the cancer clusters, and used those clusters as transcriptomic validation. Finally, we used the subclones inferred from scWGS as external validation (Fig. 3).

    In patient P1, LongSom found two cancer subclones based on SNVs and fusions, referred to as A and B (Fig. 6A). The larger subclone A (n = 50 cells) was predominantly defined by a set of seven SNVs, including mtSNVs chrM:5179 and chrM:13635, and the smaller subclone B (n = 30 cells) was mainly defined by a set of seven SNVs, including mtSNV chrM:2815, as well as two fusions SMG7‐‐CH507-513H4.1 and GS1-279B7.2‐‐GNG4. In expression-based Uniform Manifold Approximation and Projection (UMAP) embedding, cancer cells formed two distinct expression clusters that near perfectly overlapped the genotypic cancer subclones found based on SNVs and fusions (Fig. 6A–D). Clonal assignments based on SNVs and fusions and on CNA data were also very similar (Fig. 6A,E; Supplemental Fig. S5). In patient P1's matched scWGS data, we also found two aneuploid (cancer) subclones, CC1 and CC3, based on CNA profiles (Supplemental Fig. S2). We found that three loci exclusively mutated in subclone A (Chr7:25123800, chrM:5179, chrM:13635) were also exclusively mutated in subclone CC3, and one locus exclusively mutated in subclone B (chrM:2815) was exclusively mutated in subclone CC1 (Figs. 4E, 6A). All subclone-specific loci had <5 reads coverage in scWGS clones, except for Chr M loci, which had 186–861 reads coverage and were strictly subclone exclusive (Fig. 4E). Therefore, using mtSNVs, we could confidently match scRNA-seq subclones to scWGS subclones.

    Figure 6.

    Analysis of intratumor heterogeneity using somatic variants detected in LR scRNA-seq in Patient P1. (A) BnpC clustering of single cells from the tumor biopsy of patient P1 (columns) by somatic SNVs and fusions called by LongSom in LR scRNA-seq data (rows). Red indicates that a loci is mutated in a cell (beta-binomial P-value < 0.05), white that it is not, and gray indicates no coverage in the cell at a given locus. Rows are colored according to the mutation status of aggregated scWGS diploid (Noncancer Clones) or aneuploid (Cancer Clones 1 and 3) cells. Fusions appear in blue. Columns are colored from top to bottom by cell types reannotated by LongSom, inferCNV CNAs subclones, expression clusters, and BnpC subclones inferred from somatic SNVs and fusions. (B,C) UMAP embedding of patient P1 gene expression data, colored by (B) Seurat clusters and (C) BnpC subclones. (D,E) Confusion matrix of cells in each expression-derived cancer cluster (rows) and (D) cells in the subclones inferred from BnpC, and (E) cells in the subclones inferred from inferCNV (columns), colored by the percentage of the total number of cells in each subclone and annotated by absolute numbers. (F) Volcano plot of differentially expressed genes identified between subclones B and A. Keratin genes downregulated in subclone B are annotated. (G) ScisorWiz representation of CHPF isoforms in subclones A and B. Colored areas are exons, whitespace areas are intronic space, not drawn to scale, and each horizontal line represents a single read colored according to subclones.

    In patient P2, LongSom found one cancer clone using mutations and fusions, coinciding well with the aneuploid CNA scRNA clone and the gene-expression-based cancer cluster, similarly, in scWGS data, we only saw one aneuploid CNA clone (Supplemental Figs. S2B, S5B, S6). Therefore, all available data modalities point toward a monoclonal cancer population in this patient.

    In patient P3, LongSom found one clone, coinciding with the gene expression-based cancer cluster; however, two aneuploid subclones were detected in both scWGS and scRNA-seq data using CNA analysis (Supplemental Figs. S2C, S5C, S7). This difference could be due to the low number of cancer cells or intersample heterogeneity.

    Subclones identified in patient P1 have differing predicted treatment outcomes

    To explore the potential therapeutic resistance of subclones A and B identified in patient P1, we investigated the genomic and transcriptomic variations between them. In subclone A, we identified a missense variant in the ferroptosis regulator ALDH3A2 (Val321Leu) (Supplemental Table S5) indicating a lower cisplatin resistance (Dong et al. 2023). In subclone B, we identified a missense in CCAR2 (Arg722Trp) (Supplemental Table S5), a suppressor of homologous recombination, indicating a potential resistance against poly(ADP-ribose) polymerase (PARP) inhibitors (Iyer et al. 2022). Therefore, based on SNVs, subclone A is more likely to be treatment-sensitive, while subclone B is more likely to be treatment-resistant. On the transcriptomic level, subclone B had notably downregulated expression of keratin genes KRT8 and KRT18, two epithelial markers used to differentiate HGSOC cells from noncancer cells (Fig. 6F; Supplemental Fig. S8A,B). It has been shown in vitro that KRT8 and KRT18 have a protective effect against cell death (Bozza et al. 2018), and their loss leads to increased invasiveness but also cisplatin sensitivity (Fortier et al. 2013). Subclone B is, therefore, more likely to be chemosensitive than subclone A. We additionally investigated differential isoform usage, and while both subclones were mostly similar, we found a significant difference in the isoform expression of genes CHPF (Fig. 6G), MYL6, the tumor suppressor BTG2, and NUTM2B-AS1 (Supplemental Fig. S8C–E); however, we could not predict their pathogenicity.

    LR greatly outperforms SR scRNA-seq for variant detection and clonal reconstruction

    Finally, we aimed to compare LR to SR scRNA-seq ability to call somatic SNVs. The HGSOC study had LR and SR scRNA-seq data from the same cells available. When we applied SComatic to SR scRNA-seq, we found only 114 loci (7.3 times less than LongSom calls in LR data), with only nine SNVs common to both technologies (Fig. 7A). The lower amount of loci identified could be related to the fact that, while the SR data set had 4.3 times more sequenced reads compared to LR (mean 117.4 k vs. 26.9 k reads per cell), it had 3.5 times fewer mapped bases (mean 11.4 Gb mapped vs. 3.3 Gb mapped) due to shorter read length (Supplemental Fig. S9A,B). Notably, somatic SNVs identified in SR data contained 26% of 3′-UTR variants, against 6% in LR data (Figs. 3B, 7B). As reads are captured by their 3′ end poly(A), a large fraction of the SR coverage is located in 3′ UTR as reads are too short to exceed it. This explains the overrepresentation of 3′-UTR variants in SR data, and previous studies reported up to 40% of them (Muyas et al. 2024). To validate in scWGS data the somatic SNV loci identified in SR data, we pooled samples together due to the low number of calls in patients P1 and P3 (Supplemental Fig. S9C–E). Out of the 18 loci called in SR scRNA-seq data with sufficient coverage in scWGS, 50% were supported as somatic in scWGS, a lower support than LR data, and no SNV supported as germline was found (Fig. 7C). Furthermore, SR data identified none of the panel-supported TP53 variants nor other clinically relevant variants. Therefore, we found fewer calls in SR data, and those had lower scWGS and bulk panel data support. Using those calls, we identified no clonal structure in either patient P1 or P3, and only identified a partial clonal reconstruction in P2, the patient in which we detected most SR calls (Fig. 7D; Supplemental Fig. S9F,G).

    Figure 7.

    Analysis of SR scRNA-seq data and comparison with LR data. (A) Venn diagram of the intersection between SR (SComatic) and LR (LongSom) somatic calls in LR scRNA-seq data from all patients aggregated. (B) Waffle plot representing each of the 114 somatic SNVs detected by SComatic in SR scRNA-seq data, colored by their RefSeq functional annotation. (C) Barplot of the fraction of all loci that is supported by scWGS data as either somatic or germline, colored by whether they were called in LR data, SR data, or both. (D) BnpC clustering of single cells from the tumor biopsy of patient P1 (columns) by somatic SNVs and fusions (rows) called in SR scRNA-seq data. Red indicates that a loci is mutated in a cell (bet-binomial P-value < 0.05), white that it is not, and gray indicates no coverage in the cell at a given locus. Rows are colored according to the mutation status of aggregated scWGS diploid (Noncancer Clones) or aneuploid (Cancer Clones) cells. Fusions appear in blue. Columns are colored from top to bottom by marker-based annotation and BnpC subclones inferred from somatic SNVs and fusions.

    Discussion

    SNVs, mtSNVs, CNAs, fusions, gene expression, isoforms expression, and the microenvironment composition can all affect cancer treatment outcomes (Marine et al. 2020). Assessing all of these variations simultaneously from a single patient sample is particularly relevant in a clinical setting, where biological material is limited. Here, we show for the first time that this is possible using LR scRNA-seq data and we introduce LongSom, a workflow for detecting de novo somatic SNVs, fusions, and CNAs in LR scRNA-seq without matched normal. When applied to data from three HGSOC patients, we showed the LongSom outperformed SComatic for SNV detection and detected scWGS- and panel-validated SNVs, including clinically relevant ones. By integrating SNVs and fusions, LongSom successfully reconstructed the clonal heterogeneity and identified scWGS-matched subclones. Finally, in each subclone, we identified differentially expressed genes and subclone-specific SNVs with different implications for treatment resistance. Thus, we demonstrated that LR scRNA-seq is suitable for predicting treatment outcomes.

    The performance of somatic SNV calling using noncancer cells from the tumor biopsy as a “pseudo-normal” is contingent on reliable cell type annotations (Muyas et al. 2024). The cell type reannotation step implemented in LongSom, based on the somatic variation profile of cells, led to the detection of up to 31% more somatic SNVs (patient P3) and significantly increased sensitivity without sacrificing precision.

    LongSom is the first method combining de novo detection of SNVs, mtSNVs, and fusions from the same cell to reconstruct clonal heterogeneity. In the HGSOC data set, the mtSNVs were called in most cancer and noncancer cells, and some fusion calls were expressed in most clones or subclones (P2: IGF2BP2::TESPA1, P1: SMG7::CH507-513H4.1, etc.), making them ideal variations for cell type reannotation and clustering. However, we demonstrated that mtSNVs require special filtering thresholds, as noncancer cells frequently contained cancer mitochondrial reads. We showed that those reads were likely ambient mtRNA from dead or dying cancer cells encapsulated jointly with noncancer cells during single-cell preparation. A conclusive answer to whether those are only technical artifacts or also originate from a biological mechanism, e.g., microenvironment revitalization (Liu et al. 2021; Zampieri et al. 2021), will require further investigation.

    Despite the popularity of droplet-based scRNA-seq, multiple technical limitations remain unsolved, limiting the potential of downstream analysis. First, variant detection remains challenging due to the sparsity and low coverage of scRNA-seq assays, especially in LR assays despite rapid progress in the field (Dondi et al. 2023; Joglekar et al. 2023; Marx 2023; Al'Khafaji et al. 2024). To ensure that somatic calls are not germline polymorphisms, LongSom excludes sites with coverage <5 in noncancer cells (39%), leading to potential false negatives. Second, read coverage is also uneven within a transcript, as transcripts produced by droplet-based scRNA-seq remain incomplete on the 5′ end due to intrapriming from intronic poly(A)-rich regions and on the 3′ end due to incomplete cDNA production (Nam et al. 2002; Hsu et al. 2022; Dondi et al. 2023; Verwilt et al. 2023). Third, RNA-seq is inherently limited to detecting only expressed SNVs and fusions. Nevertheless, as described above, LongSom detected a large fraction of variants in intronic or even intergenic regions. Last, similar to the mitochondrial reads contamination we observed, droplet-based scRNA-seq is sensitive to the encapsulation and subsequent sequencing of ambient RNA. This is especially true in cancer, where RNA from dead cancer cells is encapsulated with noncancer cells, leading to false-negative calls.

    The performance of LongSom is dependent on a high sequencing quality (>Q20) and it has only been tested with PacBio data so far. Nanopore recently greatly improved its read quality and reached the Q20 threshold with chemistry R.10.4 (Ni et al. 2023), and LongSom will need to be tested on cancer biopsy scRNA-seq data sets generated with this technology.

    In summary, LR scRNA-seq provides a unique snapshot of the cellular mechanisms by capturing multiple genomic and transcriptomic readouts from the same cell. With decreasing costs and increasing data size, we envision that LR scRNA-seq will become more common, potentially facilitating a better understanding of the processes underlying cancer treatment resistance. LongSom can be a valuable first step in guiding these analyses.

    Methods

    scRNA expression analysis

    The raw sequencing data from HGSOC samples was retrieved from the European Genome–phenome Archive (EGA; https://ega-archive.org) under accession number EGAS00001006807.

    Marker gene expression–based annotation

    Cell annotation was retrieved from GitHub (https://github.com/cbg-ethz/scIsoPrep/tree/master/bc_to_celltype) (Dondi et al. 2023). We used “HGSOC” labels as cancer cells and “Mesothelial.cells,” “Fibroblast,” “T.NK.cells,” “B.cells,” “Myeloid.cells,” and “Endothelial.cells” labels as noncancer cells.

    Clustering and visualization

    Similar cells were grouped using Seurat FindClusters (Hao et al. 2024), and clusters with a majority (>90%) of noncancer cells were grouped together as “noncancer.” The results of the clustering and cell typing are visualized in a low-dimensional representation using UMAP.

    Differential gene expression analysis

    Differential expression was computed using Seurat FindMarkers (Hao et al. 2024), which uses a Wilcoxon test, corrected for multiple testing using the Bonferroni correction. A threshold of corrected P-value <0.01 and abs(log2(fold change)) > 1 was used for significance.

    Differential isoform usage analysis

    Isoform classification and quantification were performed using scIsoPrep. Differential isoform testing was performed using a χ2 test as previously described in Scisorseqr (Joglekar et al. 2021). Differentially used isoforms were visualized using ScisorWiz (Stein et al. 2022).

    Somatic variant calling in LR scRNA-seq data with LongSom

    To call somatic variants in LR scRNA-seq, we developed LongSom, a workflow implemented in Python 3 using Snakemake (Köster and Rahmann 2012) and available at GitHub (see Software availability). LongSom is designed to be run on a high-performance cluster. The HGSOC data set was analyzed using 16 CPUs for a total of 64 GB memory for 3 h. For high-quality data (median > Q30, e.g., PacBio data), we recommend running LongSom on data sets with at least 2 billion mapped bases from unique reads per cell type (5–10 billion for data with a median quality <Q30, e.g., Nanopore data), with a minimum coverage of 5 reads per loci per cell type (10–20 reads for <Q30 data), and a minimum of three mutated reads in at least two different cells in a unique cell type (5–10 mutated reads in at least five different cells for <Q30 data). Please note that LongSom has not been tested on <Q30 data.

    Preprocessing

    PacBio long reads with minimal quality Q20 were deconcatenated, adapters were trimmed, demultiplexed, poly(A) tails were trimmed and finally, UMIs were deduplicated using scIsoPrep (https://github.com/cbg-ethz/scIsoPrep/tree/master) as described in Dondi et al. (2023). Reads were mapped to the GRCh38 genome using minimap2 (Li 2018) with options -t 30 -ax splice -uf ‐‐secondary=no -C5.

    Error rate modeling

    To distinguish true somatic SNVs from technical artifacts such as sequencing errors, mapping errors, or ambient RNA captured during cell encapsulation, LongSom models the background error rate using a beta-binomial distribution as described in Muyas et al. (2024). Specifically, nonreference allele counts at homozygous reference sites are modeled using a binomial distribution with parameter P (error rate), which is a random variable that follows a beta distribution with parameters α and β, inferred using base count information from 500,000 sites in the genome randomly selected from patient P1 and P3 normal samples. Next, for each candidate loci, the beta-binomial distribution is used to test whether the nonreference allele counts are significantly higher than expected based on the error rate computed.

    Panel of normals

    To discard positions affected by recurrent droplet-based scRNA-seq technical artifacts, LongSom uses a SR PoN derived from Muyas et al. (2024). To also remove technical artifacts specific to LR-seq, we built an LR PoN using the two normal samples available from P1 and P3. The LR PoN includes all sites with nonreference allele counts significantly higher than the background error rate modeled with the beta-binomial distribution in any of the normal samples provided. Matched samples were not included in the LR PoN during analysis: for P1, we only used the P3 normal sample and vice-versa.

    Cell type reannotation

    To reannotate cells, LongSom first identifies high-confidence cancer variants (HCCVs): SNVs, mtSNVs, and fusions. Candidate SNVs are identified using SComatic with default parameters except ‐‐min_mq 60. Then LongSom performs a series of eight filters: (1) loci with <20 reads coverage in aggregated cancer cells or aggregated noncancer cells are filtered. (2) LongSom tests if, when ignoring the reads harboring candidate allele mutation reads, other nonreference allele counts at the locus are significantly higher than expected given the background error rate (beta-binomial test, significance threshold 0.05). Loci (3) within homopolymers, (4) present in the gnomAD database (Chen et al. 2024) with a frequency of at least 1% of the total population, (5) present in RNA-editing databases, and (6) present in LR or SR PoN were filtered. (7) The SNVs where VAFNoncancer < 0.2, and (8) ΔMCF > 0.4 are filtered, with ΔMCF defined as follows: ΔMCF = MCFCancer − MCFNoncancer where MCF(Non)Cancer is the fraction of mutated (non)cancer cells, including only cells with minimum coverage of 1 at the position. Finally, (9) adjacent SNVs mapping within a 10,000 bp distance are filtered. SNVs passing all 8 filters are considered HCCVs. Forcing low VAFNoncancer ensures that the candidate mutation is not a germline polymorphism or resulting from a loss of heterozygosity, while allowing VAFNoncancer > 0 enables us to detect misannotated noncancer cells. For samples with high levels of cross-contamination between cancer and noncancer cells, higher VAFNoncancer threshold can be used.

    mtSNVs are considered HCCVs if they pass all SNV filters except (9) and (6). Instead, only the LR PoN is used. Fusions with a MCFCancer > 0.05 and MCFNoncancer < 0.01 were selected as HCCVs.

    Cells with <3 HCCV covered were filtered. Cells with at least 25% of covered HCCVs mutated were reannotated as cancer, while the others were reannotated as noncancer.

    Somatic nuclear SNV identification

    Candidate SNVs are identified using a modified version of SComatic (available at https://github.com/cbg-ethz/LongSom/tree/main/SComatic) with default parameters except ‐‐min_mq 60. LongSom then applies a set of 10 filters to identify somatic mutations, divided into three categories: coverage, noise, and germline filters. (1) Loci with <5 reads coverage in aggregated cancer cells or aggregated noncancer cells are filtered. (2) Loci with <3 alternative allele reads in at least two cancer cells are filtered. Then, LongSom applies filters intended to remove noise: (3) candidate somatic SNVs are distinguished from background noise and artifacts using a beta-binomial test parameterized using normal samples (Muyas et al. 2024), and loci with a nonsignificant test (threshold 0.001) in cancer cells were filtered as noise. (4) LongSom tests if, when ignoring the reads harboring candidate allele mutation reads, other nonreference allele counts at the locus are significantly higher than expected given the background error rate (beta-binomial test, significance threshold 0.05). (5) Mutations mapping within 4 bp or more of mononucleotide tracts (homopolymers) are filtered. (6) SNV loci present in RNA-editing databases are filtered. (7) SNV loci present in either SR or LR PoN are filtered. Finally, LongSom applies filters targeting germline variants: (8) Loci with a significant beta-binomial test in noncancer cells (threshold 0.05) were filtered. Here, LongSom uses a stricter threshold than the original SComatic (0.001), in order to filter germline variants more efficiently. (9) SNV loci present in the gnomAD database (Chen et al. 2024) with a frequency of at least 1% of the total population were filtered. (10) Finally, adjacent SNV loci within a 10,000 bp distance are filtered, as these are likely misalignment artifacts in low-complexity regions or caused by allele-specific expression in cancer cells. This last filter was not part of the original SComatic.

    Somatic mtSNV calling

    Due to the observed levels of cancer mitochondrial reads in noncancer, we use specific rules to call somatic mtSNVs: (1) Loci with <100 reads coverage in aggregated cancer cells or aggregated noncancer cells are filtered. (2) Loci present in the gnomAD database (Chen et al. 2024) with a frequency of at least 1% of the total population are filtered. (3) Loci where ΔMCF < 0.35 are filtered, with ΔMCF defined above. (4) Loci where VAFCancer < 0.1 are filtered.

    Those VAF and ΔMCF parameters were determined based on the contamination level observed in the HGSOC data set, and can be adjusted depending on the level of mitochondrial contamination.

    Somatic fusion identification

    LongSom detects fusions on the single-cell level using CTAT-LR-fusion v0.13.0 (https://github.com/TrinityCTAT/CTAT-LR-fusion/releases/tag/ctat-LR-fusion-v0.13.0) with standard options: -T fastq –vis (Qin et al. 2025). Fusions present in more than 5% of the cancer cells and <1% of the noncancer cells were considered as somatic. LongSom allows fusions to appear in a low proportion of noncancer cells as they can still harbor fusion reads due to ambient cancer RNA contamination.

    SNV annotation

    SNVs were annotated using ANNOVAR (v2019Oct24) (Wang et al. 2010). An SNV was considered clinically relevant if it completed one of these conditions: it was flagged as pathogenic by ClinVar (Landrum et al. 2014), P-value was <0.05, the VEST (Carter et al. 2013) P-value was <0.05, the DANN (Quang et al. 2015) rankscore was <0.05, or FATHMM (Rogers et al. 2018) flagged it as deleterious.

    Single-cell genotyping

    LongSom computes the alleles observed in each unique cell for each somatic SNV called. A cell is considered mutated at a position if the beta-binomial test is significant (with a significance threshold of 0.01) when applied to reads supporting the alternative allele compared to reads supporting the reference allele. For mtSNVs, to avoid false positives due to contamination, a cell is considered mutated if VAFCell > 0.3, as determined from the HGSOC data (Fig. 4A,B).

    Clonal detection based on SNVs and fusions

    To detect subclones in cancer samples, LongSom only uses somatic SNVs covered in at least five cells, and fusions present in at least three cells, and then filters cells with <3 SNVs or fusions. Cells are then clustered using Bayesian nonparametric clustering (BnpC) (Borgsmüller et al. 2020), with arguments: cores (-n) 16, Markov chain Monte Carlo steps (‐‐steps) 1000, alpha value of the Beta function used as prior for the concentration parameter of the Chinese Restaurant Process (‐‐DPa_prior) [1,1], probability of updating the Chinese Restaurant Process concentration parameter (‐‐conc_update_prob) 0, Beta(a, b) values of the Beta function used as parameter prior (‐‐param_prior) [1,1].

    Clonal detection based on CNAs

    LongSom first computes cell-gene matrices using featureCounts from Subread v2.0.6 (https://subread.sourceforge.net/) with parameters -L, using GRCh38 and GENCODE v36 as reference. It then uses those matrices as input for inferCNV to detect CNA subclones (https://github.com/broadinstitute/infercnv). For running CreateInfercnvObject, reannotated noncancer cells are used as a reference, and the parameter min_max_counts_per_cell=c(1e3,1e7) is used. For running inferCNV, the parameters cutoff=0.1 and leiden_resolution=0.01 are used. The CNA profiles displayed in this study are the ones obtained from the hidden Markov model learned by inferCNV.

    SNV calling using SComatic

    As a comparison for LongSom, we called somatic SNVs in LR scRNA-seq using SComatic (Muyas et al. 2024). For this, we used the marker-based cell types, and default parameters except the mapping quality (‐‐min_mq) of 60 (maximum value for minimap2), the alpha and beta parameters computed for LR data (‐‐alpha1 0.21, ‐‐beta1 104.95, ‐‐alpha2 0.25, ‐‐beta2 162.04), and a minimum distance (‐‐min_distance) between loci of 0. We instead applied the same 10,000 bp distance within “PASS” loci, as we did for LongSom.

    Empty droplet analysis

    To estimate the mtSNV VAF in empty droplets, we first retrieved the empty droplet barcodes in each sample from the 10x Genomics CellRanger analysis. Then, we extracted the reads attached to those barcodes from the original BAM file, and computed the VAF of Chr M loci in each empty droplet barcode using pysam (version 0.21.0) pileup.

    scWGS

    Cell suspensions were loaded and processed using the 10x Genomics Chromium platform with the single-cell CNV kit on the 10x Genomics Chromium Single Cell Controller (10x Genomics PN-120263) according to the manufacturer's instructions. Paired-end sequencing was performed on the Illumina NovaSeq platform (100 cycles, 380 pM loading concentration with 1% addition of PhiX) at 0.1× depth per cell.

    Preprocessing and clonal reconstruction

    Our scDNA-seq data analysis pipeline relied on CellRanger (https://www.10xgenomics.com/products/single-cell-cnv) for read mapping, quantification, binning, and GC and mappability correction. After filtering, we found 282 (P1), 182 (P2), and 290 (P3) cells per sample. The bin size used was 20 kb, leading to very low coverage per bin. We further processed the resulting counts per bin to remove outlier bins with more than 3 times the median counts, and also outlier cells with highly imbalanced read count distribution as assessed by the Gini index. We used SCICoNE (Kuipers et al. 2020) to further segment the data into regions of at least 100 bins, resulting in CNAs that spanned at least 2 Mb. We obtained subclonal copy number trees and assigned cells to the resulting CNA profiles. Subclones were considered as cancer subclones if they had an aneuploid CNA profile, and as noncancer subclones if they had a fully diploid CNA profile.

    Variant allele calling in scWGS subclones

    Cancer subclones were pooled together as well as noncancer subclones due to low coverage (<10× per subclone). To determine if a locus was mutated in scWGS clones, we performed a beta-binomial test parametrized on 500,000 sites randomly selected from aneuploid cells from all samples (significance threshold 0.01). We only considered loci called in scRNA-seq data for scWGS analysis if they fulfilled one of those two conditions: there was at least one mutated read in the locus or at least 17 reads coverage. Loci called in scRNA-seq data were considered as somatic if they were called in cancer clones only, and as germline if they were called in noncancer clones.

    SNV validation in scWGS subclones

    Cancer subclones were pooled together as well as noncancer subclones due to low coverage (<10× coverage per subclone). To determine if a locus was mutated in scWGS clones, we performed a beta-binomial test parametrized on 500,000 sites randomly selected from aneuploid cells from all samples (significance threshold 0.01). We only considered loci called in scRNA-seq data for scWGS analysis if they fulfilled one of two conditions: there was at least one mutated read in the locus or at least 17 reads coverage. Mutations called in scRNA-seq data were considered as somatic if they were called in cancer clones only, and as germline if they were called in noncancer clones.

    De novo scWGS SNV calling

    To call SNVs de novo in scWGS, we considered sites with at least five reads in both cancer and noncancer cells in scRNA-seq, and with at least five reads in both aneuploid and diploid scWGS pooled clones (8–33 M sites). Loci with at least three mutated reads in two aneuploid cells were considered. Loci with a beta-binomial test P < 0.001 in aneuploid clones and P > 0.05 in diploid clones are considered somatic in scWGS. Sensitivity, precision, and F1 performance statistics were computed as described in Muyas et al. (2024).

    Panel DNA sequencing

    Panel sequencing was performed using the FoundationOneCDx assay (FMI, Roche, Switzerland) (Milbury et al. 2022). DNA was extracted from FFPE tissue blocks with at least 20% tumor content with the Maxwell 16 FFPE Plus LEV DNA Purification Kit (AS1135, Promega, Dübendorf, Switzerland). Samples were assayed by adaptor ligation hybrid capture, performed for all coding exons of the FoundationOne panel. Sequencing was performed using the Illumina HiSeq instrument to a median exon coverage ≥500×.

    Software availability

    LongSom is available freely at GitHub (https://github.com/cbg-ethz/LongSom). It can be run as a flexible Snakemake workflow (Köster and Rahmann 2012), which takes as input a high-quality LR scRNA-seq BAM file and a cell type TSV file. All scripts necessary to reproduce this study are available as Supplemental Code and are also available at GitHub (https://github.com/cbg-ethz/LongSom/tree/paper_version).

    Data access

    All raw and processed sequencing data generated in this study have been submitted to the European Genome–phenome Archive (EGA, https://ega-archive.org/) under accession number EGAD50000001292.

    Competing interest statement

    The authors declare no competing interests.

    Acknowledgments

    We thank Joanna Hård for her help with mitochondrial mutation contamination. We thank Ivan Topolsky for his support with cloud computing. We thank Lara Fuhrmann for naming LongSom. BioRender was used to generate the figures. A.D. and N.Bo. were supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement (#766030 to N.Be.). B.J.H. was supported by the National Cancer Institute grant U24CA180922.

    Author contributions: N.Be. acquired the funding for this project. A.D. conceptualized the idea of this project. A.D. designed the LongSom pipeline with the help of N.Bo. The Tumor Profiler Consortium provided scWGS data and NGS panel results, and the full list of authors can be found in Supplementary Data S1. A.D. conducted all computational analyses except scWGS subclones inference (performed by P.F.F.). A.D. did all the visualizations and figures. A.D., N.Bo., F.J., and N.Be. interpreted the results. A.D. and N.Bo. wrote the manuscript with contributions from all authors. B.J.H. assisted with fusion transcript identification via CTAT-LR-fusion. All authors read and approved the final manuscript.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279281.124.

    • Freely available online through the Genome Research Open Access option.

    • Received March 6, 2024.
    • Accepted February 28, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

    References

    | Table of Contents
    OPEN ACCESS ARTICLE

    Preprint Server