Abstract
Spatial transcriptomics enables fine-scale characterization of spatial heterogeneity and cellular niches within tissues and has substantially advanced our understanding of tissue architecture and functional organization. However, existing spatial transcriptomics integration methods often struggle to capture the rich morphological information provided by histology, which limits their capacity for comprehensive cross-modality learning. In this paper, we present SYMOL, a unified synergistic self-supervised multimodal framework that integrates spatial coordinates, gene expression, and histological images, covering both multichannel immunohistochemistry (IHC) and hematoxylin and eosin (H&E) stains, for effective spatial transcriptomics integration and representation learning. Specifically, SYMOL extracts distinct visual characteristics via several pretrained large vision models and synergistically aggregates cross-modal features into unified morphology-aware embeddings. Comprehensive benchmarking on multiple publicly available spatial transcriptomics datasets with multichannel IHC and H&E images shows that SYMOL consistently surpasses state-of-the-art methods across downstream tasks, including cellular niche identification, multislice integration, cross-dataset label transfer, and gene-expression enhancement. In addition, SYMOL accurately delineates the tumor microenvironment in lung tissues with histopathological imaging and enables fine-scale mapping of cellular niches in the mouse brain, demonstrating both clinical relevance and robustness in complex neuroanatomical settings.