Co-expression networks reveal the tissue-specific regulation of transcription and splicing

Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.


Introduction
Gene co-expression networks are an essential framework for elucidating gene function and interactions, identifying sets of genes that respond in a coordinated way to environmental and disease conditions, and identifying regulatory relationships and potential drug targets (Penrod et al., 2011;Xiao et al., 2014;Yang et al., 2014). Each edge in a co-expression network represents a correlation between two transcriptional products, represented as nodes (Stuart et al., 2003). The majority of gene co-expression networks, traditionally estimated from microarray data, have focused on correlation between total gene expression levels, with edges representative of transcriptional co-regulation. However, post-transcriptional modifications, including alternative splicing, have significant effects on the abundance of specific RNA isoforms (Matlin et al., 2005); Mutations that lead to disruption of splicing play an important role in tissue-and disease-specific pathways (Li et al., 2016b;Lee et al., 2012;DeBoever et al., 2015;Ward and Cooper, 2010;López-Bigas et al., 2005), including tumor progression (Ghigna et al., 2008) and risk of Alzheimer's disease (Hutton et al., 1998;Glatz et al., 2006). While a number of splicing factors are known, regulation of splicing and the specific regulatory genes involved remain poorly understood relative to the regulation of transcription (Melé et al., 2015;Scotti and Swanson, 2015).
RNA-sequencing now allows quantification of isoform-level expression, providing an opportunity to study regulation of splicing through network analysis. However, current research estimating RNA isoform-level networks (Li et al., , 2015(Li et al., , 2016a has not used detailed network representations that distinguish identification of splicing regulation-using relative quantities such as isoform ratios-from transcription regulation-using absolute quantities such as total expression level of each isoform. Moreover, these methods, and initial work on clustering relative quantities, have not be applied to large RNA-seq studies for well-powered network reconstruction in diverse contexts (Dai et al., 2012;Iancu et al., 2015).
Another important gap in our interpretation of regulatory effects in complex traits is a global characterization of gene co-expression networks that are only identified in a disease-relevant tissue type. Pertissue networks have been estimated for multiple tissues (Pierson et al., 2015;Piro et al., 2011), but have been based on limited sample size. Recent studies have recognized the essential role that tissue-specific pathways play in disease etiology (Greene et al., 2015), but have developed these per-tissue networks by aggregating single tissue samples across multiple studies. However, differences in study design, technical effects, and tissue-specific expression will make the cross-study results difficult to interpret mechanistically, with large groups of genes expressed in similar tissues and studies tending to be highly connected rather than providing detailed network structure (Lee et al., 2004). Most importantly, these analyses do not directly separate effects unique to each tissue from shared, cross-tissue effects.
In this work, we reconstructed co-expression networks from the Genotype Tissue Expression (GTEx) v6 RNA-sequencing data (GTEx Consortium, 2015). These data include 449 human donors with genotype information and 7,051 RNA-sequencing samples across 44 tissues. Here, we identified networks that reveal novel relationships from previous analyses and address two important goals in regulatory biology: identification of edges reflecting regulation of splicing, and discovery of edges arising from gene relationships unique to specific tissues.
To do this, we first built co-expression networks for sixteen tissues, with our framework of Transcriptome-Wide Networks (TWNs), including as variables both total gene expression levels and transcript isoform ratios. Hub genes in these networks with many isoform ratio neighbors were significantly enriched for RNA binding genes and genes known to be involved in splicing, demonstrating the utility of this framework in characterizing regulation of splicing. From the GTEx tissue TWNs, we identified novel candidates of both shared and per-tissue transcriptional regulators of relative isoform abundance and splicing. Next, we built tissue-specific gene co-expression networks across 27 tissues, where each network edge corresponds to correlation between genes that is uniquely found in a single tissue. We show that the TSN nodes with greatest connectivity correspond to genes that are essential for tissue-specific processes. Finally, we confirmed specific network edges from both network analyses by testing associations between a regulatory genetic variant local to one gene with the neighbors of that gene in the co-expression networks. Through this analysis, we provided a list of tissue-specific trans-expression QTLs (trans-eQTLs) and trans-splicing QTLs (trans-sQTLs). Interpretation of regulatory and disease studies will benefit greatly from these networks, providing a much more comprehensive description of regulatory processing including alternative splicing across diverse tissues.

Results
Reconstructing Transcriptome-Wide Networks (TWNs) across human tissues First, we aimed to identify networks that capture a global view of regulation across the transcriptome of diverse human tissues using the GTEx project v6 data. We developed an approach for learning Transcriptome-Wide Networks (TWNs) from RNA-seq data, which capture diverse regulatory relationships beyond co-expression, including co-regulation of alternative splicing across multiple genes, and interactions between transcription and splicing. To build a TWN, we first quantified both total expression levels and isoform expression levels of each gene in each RNA-seq sample, and then computed isoform ratios (Fig. 1A), representing the relative, rather than total, abundance of each isoform with respect to the total expression of the gene. All values were projected to quantiles of a standard normal distribution. We included both isoform ratios and total expression levels as network nodes, as opposed to estimating a network across all isoform expression levels. This allowed us to distinguish relationships driven by transcriptional regulation from relationships driven by regulation of relative isoform abundance (including alternative splicing). For example, a network over isoform expression levels instead of ratios would represent the effects of a transcription factor on its target with edges to all isoform levels of the target gene, and the same would be true for a splicing factor. In contrast, in a TWN, a transcription factor would only be connected to the total expression node of its target, and a splicing factor to isoform ratio nodes. (Fig. 1C). TWNs can therefore capture, in interpretable form, relationships such as the total expression of a splicing factor affecting relative isoform abundance of other genes (Sveen et al., 2015).
We then applied graphical lasso (Friedman et al., 2008) to estimate edge weights of a sparse Gaussian Markov Random Field (GMRF) (Rue and Held, 2005) over all nodes jointly, including both the total expression of each gene and the isoform ratio for each isoform (Fig. 1B), Methods). A sparse GMRF captures direct relationships between nodes -a nonzero entry in the precision matrix (interpreted as an edge between two nodes) indicates that the nodes are directly correlated even controlling for effects of all other nodes in the network (i.e., a partial correlation) (Schäfer and Strimmer, 2005). We modified the basic graphical lasso approach to penalize edges between different node types with different weights, and selected the 1 regularization parameters for edges between different node types each based separately on  Figure 1. Transcriptome-Wide Network conceptual framework. (A) Schematic of the effect of a splicing regulator on inclusion of a cassette exon, and resulting total expression and isoform ratios of the target gene. Splicing factor expression levels can affect splicing of target genes (Sveen et al., 2015). Higher expression of a splicing regulator S (first row) results in relatively more transcripts of isoform-1 and fewer of isoform-2. Total expression level is constant (5), but isoform ratios are different (0.4 and 0.6) as splicing factor S levels change. (B) The (i,j)th cell of the sample covariance matrix contains covariance (C ij ) between ith and jth feature in data. We created a sparse precision matrix Θ (inverse covariance) from the sample covariance matrix using graphical lasso to estimate parameters of a Gaussian Markov Random Field. A non-zero value (Θ ij ) in the precision matrix denotes an edge between ith feature and jth feature in the network. (C) Edges in a TWN represent diverse relationships between total expression (TE) and isoform ratio (IR) nodes. Dotted rectangles group together isoform ratios for different isoforms of the same gene. Of particular focus are network "hub" nodes; in a TWN, there are four possible hub configurations depending on the node type of the central and neighboring nodes.  Figure 2. GTEx Transcriptome-Wide Networks summary and replication. A) For each tissue, number of edges and number of hub nodes (≥ 10 neighbors), segmented by the type of nodes connected by each edge (total expression or isoform ratio). For instance, a "TE-IR" hub is a total expression node with multiple isoform ratio neighbors, whereas a "IR-TE" hub is an isoform ratio node with multiple total expression neighbors. B) Fraction of whole blood TWN edges replicating in an independent RNA-seq sample (DGN) Mostafavi et al., 2014).

TWN hubs are enriched for regulators of splicing
We used the sixteen TWNs to characterize the regulation of relative isoform abundance in each GTEx tissue. Here, we focused on evaluation of network hubs. Hub genes, or genes with a high number of neighbors, in a biological network tend to be essential in biological mechanisms and, in co-expression networks, are more likely to have regulatory function (Barabasi and Oltvai, 2004;Jeong et al., 2001;Albert, 2005). Unlike traditional networks, TWNs have two different types of nodes-total expression and isoform ratio-therefore, we have four categories of hub genes that likely reflect different essential and regulatory function (Fig. 1C). For instance, a transcription factor may appear as a total expression node with many total expression neighbors. On the other hand, a hub arising from a total expression node connected to a large number of isoform ratio nodes may reflect a gene important in regulation of alternative splicing. We identified the top hub nodes by degree centrality for all categories in each of the 16 tissues (Supplementary Helps forming large splicing enhancing complexes (Chen and Manley, 2009). A mutation in SRRM2 predisposes papillary thyroid carcinoma by changing alternative splicing (Tomsic et al., 2015). 9 SRSF11 14 A known serine/arginine-rich splicing factor (Zhang and Wu, 1996;Wu et al., 2006) Kulisz and Simon, 2008) that regulates alternative splicing in vivo (mouse). (Kumar P. et al., 2014) 17 PNISR 15 Interacts with PNN, a suggested splicing regulator, and colocalizes with SRrp300, a known component of the splicing machinery (Zimowska et al., 2003).

18
PNN 12 Likely to be involved in RNA metabolism including splicing (Li et al., 2003) 19 PTMS 12 Involved in RNA synthesis processing (Vareli et al., 2000). 20 CCDC85B 15 Table 1. Top 20 cross-tissue TE-IR hubs. Rank is the rank-product rank of the gene. #Tissues is the number of tissues, out of 16, for which the hub gene (total expression) has at least one isoform ratio neighbor.

Co-regulation of expression and isoform ratios reflect biological pathways
It has been observed that genes with similar function or that participate in the same pathway often have correlated patterns of gene expression (Khatri et al., 2012;Hormozdiari et al., 2015). In the GTEx TWNs, we observed enrichment of edges between transcription factors and known target genes (Methods, Supplementary Fig. 5), as expected (Prieto et al., 2008;Roider et al., 2009). We also observed enrichment of closely connected genes for a number of Reactome pathways (Fabregat et al., 2016)    components / total number of tested pathways for that tissue, considering only total expression nodes. (B) Enrichment for shared Reactome pathway annotation among gene pairs connected by edges between two isoform ratio nodes. (C) Enrichment for shared Reactome pathway annotation among gene pairs connected by an edge between a total expression node and an isoform ratio node.
In contrast to correlations among total expression levels of genes, patterns of correlation among relative isoform abundances are far less well-studied, and it has not been established whether or not the regulation of splicing is coordinated across functionally related genes. Initial studies have identified such correlation in particular tissues (Iancu et al., 2015) and specific processes (Dai et al., 2012). We evaluated each TWN for enrichment of edges between functionally related genes. For all 16 tissues, the TWNs demonstrated significant abundance of edges between isoform ratios of two distinct genes that participate in the same Reactome pathway (all tissues significant at BH corrected p ≤ 0.05; median p ≤ 10 −14 ; Fig. 4A). Similarly, edges between total expression of one gene and isoform ratio of a second gene were enriched for a gene annotation in the same pathway (median p ≤ 10 −5 ; Fig. 4B). As expected, we also observed pathway enrichment for edges between total expression nodes ( Supplementary Fig. 6). The patterns of functional enrichment were in fact somewhat stronger among total expression nodes, which could be influenced by more accurate quantification of total expression compared with isoform ratios from RNA-seq data, derivation of functional annotation from gene expression studies, or tighter co-regulation of transcription compared with splicing of functionally related genes. These results provide evidence in a large, multitissue data set that functionally related genes display correlated variation in relative isoform abundance, indicative of co-regulation of splicing. Leveraging this finding allows TWNs to be used to identify sets of possibly co-functional genes, or to predict the function of unknown genes (Warde-Farley et al., 2010) based on a more comprehensive understanding of co-regulation including regulation of splicing.

Comparison between TWNs reveals per-tissue hub genes
We evaluated the overall similarity of the TWNs between tissues. We tested concordance of hubs between each pair of tissues using Kendall's rank correlation computed over genes ordered by degree centrality ( Supplementary Fig. 7). We observed greater than random levels of similarity between most tissues for all hub types (median p ≤ 1.0 × 10 −5 for each hub type) and functionally related tissues displaying greater levels of similarity. For example, the two skin tissues were grouped together for each hub type, and were found to be similar to esophagus -mucosa, which contains primarily epithelial tissue (Squier and Kremer, 2001). Skeletal muscle and heart -left ventricle grouped together, and breast was highly similar to the two adipose tissues, reflecting shared adipose cell type composition. While these results may be partially influenced by overlap in donors, they provide evidence that splicing displays variability between tissues and is regulated in a tissue-specific manner (Qian et al., 2005;Ong and Corces, 2011).
To identify candidate tissue-specific regulatory genes, we evaluated TE-IR hubs that were ranked high by degree centrality in related tissues, but ranked low among unrelated tissues (Methods, Supplementary Table 5). Several of the top ranked tissue-specific hubs were genes with evidence of known tissue-specific function or relevance. In the tissue group including breast and two adipose tissues, the top tissue-specific TE-IR hub was TTC36, a gene highly expressed in breast cancer (Liu et al., 2008). The top ranked gene hub for the skeletal muscle and heart -left ventricle tissues group was PLEKHG4, which has been reported to be associated with cardiac failure in mice (Mullick et al., 2006).

Tissue-specific networks identify gene co-expression patterns unique to tissues
To directly assess the tissue-specificity of co-expression relationships, we built tissue=specific networks by considering all GTEx samples across tissues simultaneously, decomposing co-expression signal into co-expression signals shared across tissues and those specific to individual tissues. To do this, we applied a Bayesian biclustering framework, BicMix, and reconstructed tissue-specific networks from the fitted model . BicMix uses a sparsity-inducing prior to differentiate between co-expression relationships specific to a single tissue and those shared across tissues, controlling for batch effects, population structure, and shared individual effects across tissues . The strength of this joint approach applied to over 7,000 RNA-seq samples is that, with more thorough and comprehensive sampling of heterogeneous tissues types, we are able to isolate co-expression signals unique to single tissues and to reconstruct much more precise tissue-specific networks (TSNs).
We built TSNs for 27 GTEx tissues. Here, we limited network nodes to total gene expression rather than isoform ratio for simplicity. Across the 27 TSNs, the mean number of nodes (considering only genes with tissue-specific edges) was 48, and the average number of edges was 436. However, this average is driven by four larger networks (Supplementary Table 6 in these TSNs as compared to the TWNs because we only included genes in the network that had one or more tissue-specific edge. For the eight tissues for which we also constructed TWNs, we found that tissue-specific networks indeed overlapped with the TWNs, generally showing greater concordance for the matched tissue ( Supplementary Fig. 9). As expected, TSNs contained a small subset of edges from full per-tissue TWN, representing the co-expression components that are tissue-specific rather than shared.
Additionally, we built four tissue-specific networks for subsets of similar tissues (GTEx Consortium, 2015) to capture gene relationships common within each subset but unique compared with all other tissues.
Most tissues showed expression patterns close to at least one other assayed tissue (Supplementary Fig.   10), leading to a depletion of tissue-specific effects, and motivating evaluation of similar tissues together.
We considered four groups: 1) all 13 brain tissues; 2) two adipose tissues and breast; 3) two heart tissues and three artery tissues and 4) four digestive tissues. On average, these tissue subset networks contained 4,779 edges and 189 nodes. However, this was skewed by the brain network, which contained 18,854 edges connecting 648 nodes. Excluding the brain network, we found 87 edges and 36 nodes on average across the other three tissue subset networks. The 27 single tissue TSNs and the four tissue subset networks are available at http://gtexportal.org (in progress).
Functional analysis of tissue-specific networks. We investigated the functional properties of the genes and edges identified in each tissue-specific network. First, we measured sharing of network components between the 27 distinct TSNs. We found minimal sharing of network nodes and even less sharing of network edges among all pairs of tissues (Jaccard coefficient; Fig. 5A). This was expected as a result of BicMix's strong control over shared individuals and gene expression levels across tissues. Tissue pairs that appeared to share genes predominantly included brain tissues. We confirmed enrichment of known tissue-specific genes in the TSNs using a previously defined list of Gene Ontology (GO) terms (Ashburner et al., 2000) indicative of tissue-specific function available for eleven tissues (Pierson et al., 2015). We Next, we evaluated the hub genes in each TSN, considering three thresholds of centrality: ≥ 5 edges ("small hubs"), ≥ 10 edges ("hubs"), and ≥ 50 edges ("large hubs"). Hubs were not enriched overall for cross-tissue transcription factors (TFs) (hypergeometric test across all TSNs, p ≤ 0.84; small and large hubs showed similar results), or for cross-tissue and tissue-specific TFs (hypergeometric test across all TSNs, p ≤ 0.90; small and large hubs showed similar results); this result echos previous work on transcription factor enrichment in genes with cis-eQTLs that appear to have broad regulatory effects on transcription (Jo et al., 2016;Weiser et al., 2014). However, hubs in several networks included genes known to play a role in tissue-specific function and disease. Specifically, we found that the single large hub in brain -caudate, MAGOH, which is a part of the exon junction complex that binds RNA, has been found to regulate brain size in mice through its role in neural stem cell division (Silver et al., 2010).
The single large hub for artery -aorta, MAB21L1, has been shown to be an essential gene for embryonic heart and liver development in mice by regulating cell proliferation of proepicardial cells (Saito et al., 2012).
Additionally, we measured enrichment of known pathways in the TSNs. While we did not observe  Figure 5. Cross tissue comparison of TSN results. (A) Jaccard coefficient quantified on shared edges (upper triangular) and shared nodes (lower triangular) across pairs of TSNs; (B) Gene expression levels, removing factors from BicMix not included in the network, for the genes identified in the TSN for artery -aorta. The y-axis is ordered by similarity to artery -aorta, with a star by the samples from artery -aorta. The colors on the y-axis correspond to the GTEx tissue legend above. The x-axis is ordered by expression similarity (i.e., hierarchical clustering), and hub genes are labeled with the large hub denoted in bold; (C) tissue-specific network for artery -aorta. Node size reflects betweenness centrality of the nodes. Orange nodes reflect replication in the BioCarta acute myocardial infarction (AMI) pathway; orange edges show the neighbors of the AMI pathway nodes. enrichment across all tissues, we found that the hematopoietic cell lineage KEGG pathway was significantly enriched in the TSN for EBV transformed lymphoblastoid cell lines (LCLs; Fisher's test, B-H adjusted p ≤ 0.05); a hematopoietic stem cell is the developmental precursor of leukocytes, reflecting the expression signature of the parent cell type. The LCL TSN also had significant enrichment in the BioCarta IL-17 signaling and T cytotoxic cell surface molecules pathways (LCLs; Fisher's test, B-H adjusted p ≤ 1.50 × 10 −4 ). IL-17 is a cytokine produced from T-cells that is involved in inflammation, and the cytotoxic t-cell pathway is involved in eliminating cells with certain surface antibodies. Although not significant after multiple testing correction, artery -coronary showed enrichment in four tissue-relevant pathways (uncorrected p ≤ 0.016 for all): the ACE2 pathway, which regulates heart function, acute myocardial infarction (AMI) pathway, the intrinsic prothrombin activation pathway, which is involved in one phase of blood coagulation, the platelet amyloid precursor protein (APP) pathway, which includes genes involved in anti-coagulation functions in platelets and senile plaques, and the vitamin C in the brain pathway, which is responsible for the cellular uptake of reduced ascorbate from platelets (Fig. 5).

Integration of networks with regulatory genetic variants
Both TWNs and TSNs were estimated using gene expression data alone. However, the GTEx v6 data also include genotype information for each donor. We intersected the edges detected by our networks with QTL (quantitative trait locus) association statistics to replicate specific network edges through evidence of conditional associations with genetic variants across those edges and to increase power to detect long range (trans) effects of genetic variation on gene regulation. First, we demonstrated that, for both TWNs and TSNs, there was enrichment for associations between the top cis-eVariant (the variant with lowest p-value per gene with a significant cis-eQTL) for each gene and the expression level or isoform ratio of its network neighbors based on QTL mapping in the corresponding tissue (Fig. 6). This provides evidence of a relationship between the pairs of genes connected by an edge. For TWNs, among total expression nodes with an isoform ratio neighbor, we found evidence for 61 trans (i.e., inter-chromosomal) associations and 86 intra-chromosomal associations tested between a cis-eVariant for the total expression gene and the isoform ratio of the neighboring node (FDR ≤ 0.05), providing support for these TWN edges. Our top two associations were between two variants, rs113305055 in artery -tibial and rs59153288 in breast (both near TMEM160 ), with isoform ratios of CST3 (p ≤ 9.3 × 10 −8 , and p ≤ 4.0 × 10 −7 , respectively).
TMEM160 is the top cross-tissue hub in our TWNs with many IR neighbors (Table 1). Thus, we tested for association of these variants with all isoform ratios genome-wide and observed a substantial enrichment of low p-values in numerous tissues ( Fig 6A; Supplementary Fig. 11). In the TSNs, we identified seven cis-eVariants across five tissues associated with eight different trans-eGenes through seven unique cis-eGene targets, one of which was intra-chromosomal (FDR ≤ 0.2; Supplementary Table 7). We also observed enrichment for low p-values over the tests corresponding to each network edge (Fig. 6B).
We used the TWNs and TSNs in a trans-QTL discovery framework that does not rely on prior cis-eQTL association testing. In the test described above, it is possible that artifactual correlations among gene expression levels, identified as edges in the networks, may also lead to artifactual trans-eQTL associations among cis-eVariants and the neighboring genes of the cis-eGene detected from the same data. However, large-scale, independent RNA-seq data sets are unavailable for most of the tissue types represented in GTEx, and identifying trans-eQTLs from a standard genome-wide association test has proven challenging (Jo et al., 2016), in part due to the large number of statistical tests. Restricting association tests to a plausible subset, according to prior knowledge (Jo et al., 2016) such as network relationships (Weiser et al., 2014) can increase statistical power substantially.
Here, we identified novel trans-splicing QTLs (trans-sQTLs) and trans-expression QTLs (trans-eQTLs) by restricting the set of tests to those suggested by the TWNs and TSN edges in the corresponding tissue. From the TWNs, we sought to identify trans-splicing QTLs (sQTLs) based on TE-IR hub genes, using the top 500 hubs by degree centrality. We tested every SNP within 20 Kb of the TE hub-gene's transcription start site (TSS) for association with isoform ratios of each neighbor in the TWN. Using this approach, we identified 58 trans-sQTLs corresponding to six unique genes (sGenes) at FDR ≤ 0.2 (Table 2). For example, we identified a trans-sQTL association in skeletal muscle between rs115419420 and CARNS1 (p ≤ 2.18 × 10 −5 ) that is supported by a cis association with the TE-IR hub CRELD1.
This eVariant also showed enrichment for low p-values with numerous isoform ratios genome wide (Fig.   6C). In the TSNs, we identified 27 trans-eQTLs using variants within 20 Kb of each gene and testing for association with the neighbors of those genes in the gene expression data of the same tissue (FDR ≤ 0.2; Supplementary Table 8). All of these associations were with inter-chromosomal. Overall, we saw a substantial enrichment of p-values for association between genetic variants local to a gene and the genes' neighbors in each network (Fig. 6B).

Discussion
We reconstructed co-expression networks that capture novel regulatory relationships in diverse human tissues using large-scale RNA-seq data from the GTEx project. First, we specified an approach for integrating both total expression and relative isoform ratios in a single sparse Transcriptome-Wide Network (TWN) based on graphical lasso. Splicing is a critical process in a number of tissue-and disease-specific processes and pathways (Ghigna et al., 2008;Hutton et al., 1998;Glatz et al., 2006;D'Souza et al., 1999), but, critically, isoform ratios have not been included in co-expression network analysis to allow the study of splicing regulation. We estimated TWNs from sixteen tissues and demonstrated that hubs in TWNs are strongly enriched for genes involved in RNA binding and RNA splicing. We found that, across tissues, the top hub genes with isoform ratio neighbors included many genes with known impact on splicing such as RBM14, a hub in all 16 tissues with TWNs. We identified a number of novel shared and tissue-specific candidate regulators of alternative splicing. TWNs may be extended to include diverse expression phenotypes such as allelic expression or individual splicing event frequencies. As more large-scale RNA-seq studies become available, this method will be useful for analyzing diverse types of regulatory relationships in disease, longitudinal, and context-specific studies.
Next, we estimated tissue-specific networks for 27 single tissues and across four tissue subsets; these networks represent co-expression relationships unique to individual tissues and sets of closely related tissues. Distinguishing between shared and tissue-specific structure across single tissue co-expression networks is challenging, but is essential for understanding tissue-specific regulatory processes in disease.
From these tissue-specific networks, we identified hub genes involved in the essential tissue-specific regulation of transcription, such as MAGOH in the brain -caudate specific network and MAB21L1 in the artery -aorta specific network, both of which are essential for the development of their specific organs.
We used these networks to quantify shared relationships across tissues, and found minimal sharing of relationships across these 27 tissues. Finally, we replicated edges in our networks using integration of genetic variation, and we identified 33 novel trans-QTLs affecting both expression and splicing. Together, our results provide the most comprehensive map of gene regulation, splicing, and co-expression in the largest set of tissues available to date. These networks will provide a basis for interpreting the transcriptome-wide effects of genetic variation, differential expression and splicing in complex disease, and impact of diverse regulatory genes in the human genome.

RNA-sequencing alignment and transcript quantification
The RNA-seq processing pipeline follows previously described steps . Adapter sequences and overrepresented contaminant sequences, identified by FastQC (v.0.10.1) (Andrews, 2010), were trimmed using Trimmomatic (v.0.30) (Bolger et al., 2014) with 2 seed mismatches and a simple clip threshold of 20. Leading and trailing nucleotides (low quality or N s) were trimmed from all reads until a canonical base was encountered with quality greater than 3. For adaptive quality trimming, reads were scanned with a 4-base sliding window, trimming when the average quality per base dropped below 20.
Any remaining sequences shorter than 30 nucleotides were discarded.
We aligned the RNA-seq reads using the Star aligner in 2-pass mode (Dobin et al., 2013). After preparing the genome with STAR aligner genomeGenerate mode using a splice junction database (sjd-bGTFfile) set to GENCODE v.19 annotation, the splice junction database overhang (sjdbOverhang) set to 75 bp, and defaults for all remaining settings. STAR aligner alignReads mode was run using default settings except outFilterMultimapNmax was set to 1 so that only uniquely mapping reads were retained.
We performed transcript and gene quantification using RSEM v1.2.20 (Li and Dewey, 2011). We used default settings, allowing for four threads and using paired-end aware quantification.

Pre-processing for per-tissue TWNs
We considered only protein-coding genes on the autosomes and X chromosome to construct TWNs in all tissues. We filtered genes and isoforms using a threshold of 10 samples with ≥ 1 TPM and ≥ 10 samples with ≥ 6 reads. We also filtered out genes where the Ensembl gene ID did not uniquely map to a single HGNC gene symbol. Isoform ratio was computed by using annotated isoforms in Gencode V19 annotation, and undefined ratios (0/0, when none of the isoforms was expressed) were imputed from the mean ratio per isoform across individuals. Each gene's least abundant isoform was excluded to avoid linear dependency between isoform ratio values. We log-transformed the total expression data and standardized both total expression levels and isoform ratios. To correct hidden confounding factors, we applied HCP (Hidden covariates with prior) (Mostafavi et al., 2013), whose parameters were selected based on an external signal relevant to regulatory relationships. Namely, we selected parameters that produced maximal replication of an independent set of trans-eQTLs from meta-analysis of a large collection of independent whole blood studies (Westra et al., 2013). For both total expression levels and isoform ratios of genes in all tissues, the best HCP parameters (k = 10, λ = 1, σ 1 = 5, σ 2 = 1), which consistently reproduced a largest subset of the gold-standard trans-eQTLs in GTEx whole blood samples even when subsetting the number of samples, were used for correcting data. Finally, quantile-normalization to a standard normal distribution was applied.
To avoid spurious associations due to miss-mapped reads, we filtered out genes and isoforms with mappability score constraints (≥ 0.97). We first downloaded mappability scores of all 75-mers and 36-mers in the human reference genome (hg19) from the UCSC genome browser for exonic regions and untranslated regions (UTRs), respectively (accession: wgEncodeEH000318, wgEncodeEH00032) (Derrien et al., 2012). For each gene, we then measured mappability scores for either exonic regions or UTRs with corresponding k-mers matched to the regions and aggregated the mappability scores for two regions by computing their weighted average. The weights were proportional to the total length of exonic regions and UTRs.
We filtered out isoforms of a gene if the mean IR of a most dominant isoform was ≥ 0.95. In each tissue, we further reduced the number of features to 6, 000 genes and 9, 000 isoforms for computational tractability. To do so, we first considered genes or isoforms if > 10 samples have TPM > 2 or reads > 6.
To obtain the final set of genes, we first considered the top 9, 000 genes based on their average expression levels and then selected the top 6, 000 highly variable genes across individuals. Similarly, to obtain the final set of isoforms, we first considered the 13, 500 genes with the highest expressed isoform levels on average. We reduced this to 11, 250 genes based on the entropy of isoform ratios across individuals, normalized by the maximum entropy possible with the same number of isoforms, and finally took the top 9, 000 most highly variable isoforms in terms of TPM values.

Per tissue Transcriptome-Wide Networks (TWNs)
We built per-tissue Transcriptome-Wide Networks (TWNs) using scalable graphical lasso (Hsieh et al., 2011). We estimated a sparse precision matrix (Θ) by optimizing the following objective with Λ specifying different penalties for different types of edges: where the entry in rth row and cth column of Λ was (2) Here, gene(k) denotes the gene that the kth feature belongs to. type(k) denotes whether or not the kth feature represents total expression ('TE') or isoform ratio ('IR').
We did not penalize diagonal entries (λ d = 0), and put a small non-zero penalty for edges between distinct features belonging to the same gene (λ s = 0.05), such as distinct isoforms of the same gene. We selected the other penalties (λ tt , λ ti , λ ii ) such that the network has a scale-free topology with a reasonable number of edges. The empirical pairwise correlations distribution for different types of edges were different: correlations between two total expression nodes were generally much higher than those between two isoform ratio nodes or between a total expression node and an isoform ratio node ( Supplementary Fig 1), while the later two distributions were apparently similar. We tried all (λ tt , λ ti , λ ii ) combinations where λ tt {0.3, 0.35, 0.4, 0.45, 0.5}, Λ ti {0.25, 0.3, 0.35, 0.4}, and λ ti = λ ii . We measured the scale-free property by the square of correlation (R 2 ) between log(p(d)) and log(d), where d is an integer and p(d) represents the fraction of nodes in the network with d neighbors (Zhang and Horvath, 2005). We selected penalty parameters so that R 2 ≈ 0.85 and there were at least 5,000 edges of each type. Selected parameters for each tissue are shown in Supplementary Table 1. Each non-zero element in Λ rc in the precision matrix with selected penalty parameters represents an edge between the rth and cth features in our network.
We excluded some edges from our networks for quality purpose and interpretability. In other words, we excluded edges between nodes belonging to same gene for downstream analysis. Then, we aligned every 75-mer in exonic regions and 36-mers in UTRs of every gene with mappability < 1.0 to the reference human genome (hg19) using Bowtie (v 1.1.2) (Langmead et al., 2009). If any of the alignments started within an exon or a UTR of another gene, then these two genes were considered "cross-mappable", and we excluded edges between these pairs of genes. We also excluded edges between genes with overlapping positions in the reference genome to avoid mapping artifacts.

Replication of whole blood TWN
We replicated our network edges with GTEx whole blood tissue in an independent RNA-seq data set: Depression Genes and Networks (DGN) Mostafavi et al., 2014). These RNA-seq data include 15,231 genes and 12,080 isoforms from whole blood in 922 samples, out of which 5,609 genes and 1,464 isoforms were uniquely mapped to the set of genes and isoforms used in GTEx whole blood network construction. Firstly, to check if the genes and isoforms directly connected in the GTEx whole blood network were supported by correlation in the DGN data set, we computed the fraction of significantly correlated (Spearman correlation, FDR ≤ 0.05) gene-gene/gene-isoform/isoform-isoform pairs in DGN.
We then compared these fractions with those in random pairs generated by permuting genes/isoforms labels in the TWN. Next, to verify if our method could reproduce relationships in GTEx whole blood network from DGN data, we tested if node pairs connected directly or indirectly in the GTEx whole blood network had a shorter distance between them in the DGN network compared to the same network with the node labels shuffled. We performed a one-sided Wilcoxon rank-sum test between two groups: i) pairwise distances between GTEx-connected TE-TE / TE-IR / IR-IR pairs in the DGN network. ii) those in random DGN networks generated by permuting genes/isoforms among themselves. Here we generated random networks ten times to estimate the null distribution.

Hub ranking
We ordered the network hubs by degree centrality for each tissue according to the number of unique gene-level connections to avoid the effect of different number of isoforms per gene. To do this, we created a gene-level network from original TWNs by keeping TE nodes as they were and grouping all isoforms of the same gene together to form a compound IR node. We put an edge between a compound IR node and a total expression node (or another compound isoform ratio node) if any isoform of the compound had an edge with the TE node (or any isoform of the other compound) in the original TWN, and the weight is equal to the sum of absolute weights of all such edges in the original TWN. TE-TE and IR-TE hubs were ordered by the number of TE nodes they were connected with. TE-IR and IR-IR hubs were ordered by the number of compound IR nodes they were connected with. If multiple hubs had the same number of connections, ties were broken by the sum of corresponding edge weights.

Splicing and RNA binding enrichment in top TE-IR hubs
We downloaded a list of human genes annotated with RNA Splicing (GO:0008380) and RNA Binding (GO:0003723) using topGO (Alexa and Rahnenfuhrer, 2016). We computed the enrichment of these RNA splicing and RNA binding genes in top 500 TE-IR hubs using Fisher's exact test. The set of all genes represented in the corresponding network was used as background.

General hubs and tissue-specific hubs
We used rank-product (Zhong et al., 2014) to find hubs generally ranked high in a set of tissues. We first ranked genes by the number of neighbors in the gene-level network. If a gene had no edge in the network, its rank was considered to be the number of genes with neighbors plus one. A gene's rank-product is the product of its ranks from each network. The top general hub gene had the least rank-product.
To find hubs specific to a group of tissues, we used rank-product to rank hubs in both the target group of tissues, and all other tissues, separately. Then, we normalized ranks so that the top-and bottom-ranked hub have a score of 1 and 0, respectively. Let the normalized rank of a gene in the target group of tissues and other tissue be r t and r o , respectively. Then, the F-score for the gene (r): will be high if it ranks high in the target group, but low in other tissues.
networks, we looked across each run with a network specific to a tissue, and every edge that appeared in > r runs was included in the final network. The parameter r was chosen for each tissue based on the number of runs with networks and the number of samples and genes in the run specific networks.

Genotypes from GTEx data
The 2.5M and 5M BeadChip genotypes were merged to yield approximately 1.9 million genotyped SNPs.
A greater set of genotypes were imputed using IMPUTE2 (Howie et al., 2009), yielding a satisfactory distribution of imputation scores for MAF ≥ 0.01 (mean INFO of 0.888 and median of 0.951 for variants with MAF between 0.01 and 0.05). The genotypes were filtered for MAF ≥ 0.05, leaving approximately 6 million variants. In order to take full advantage of SNP imputation, we used continuous (MLE of the dosage, ranging from 0 to 2) genotypes in association mapping. The genotype-level principle components were computed with the imputed genotypes.

eQTL mapping
For each tissue, the gene expression values were projected to the quantiles of a standard normal, and Matrix-eQTL (Shabalin, 2012) was used to test all SNPs within the 150 Kb window of a gene's transcription start site (TSS) or end site (TES) using an additive linear model. For cis-eQTL analyses, we optimized the number of PEER factors by tissue to a test chromosome (chromosome 11) to maximize the number of identified cis-eQTLs. We included in association mapping a tissue-specific number of PEER factors, sex, genotyping batch, and three genotype principal components. The correlation between SNP and gene expression levels was evaluated using the estimated t-statistic from this model. False discover rate (FDR) was calculated using the q-value R package (Storey, 2003;Dabney et al., 2010). We used these cis-eQTLs for the trans-eQTL analysis for the TSN edge replication.

Trans-splicing QTLs
We computed trans-splicing QTLs using two approaches. In the first approach, we computed the best cis-associated variant per gene (smallest p-value) located within 1 Mb from the transcription start site (TSS) of the gene . Then for every total expression node connected with an isoform ratio in the network, we measured association between the gene's best cis-associated variant and all the