Integrative analysis of RNA, translation, and protein levels reveals distinct regulatory variation across humans

  1. Michael P. Snyder1
  1. 1Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA;
  2. 2Institute of Software Technology and Interactive Systems, Vienna University of Technology, A-140 Vienna, Austria;
  3. 3RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, Massachusetts 01605, USA;
  4. 4CIRI, International Center for Infectiology Research, Eukaryotic and Viral Translation Team, Université de Lyon, INSERM U1111, Lyon, 69634, France
  1. Corresponding author: mpsnyder{at}stanford.edu

Abstract

Elucidating the consequences of genetic differences between humans is essential for understanding phenotypic diversity and personalized medicine. Although variation in RNA levels, transcription factor binding, and chromatin have been explored, little is known about global variation in translation and its genetic determinants. We used ribosome profiling, RNA sequencing, and mass spectrometry to perform an integrated analysis in lymphoblastoid cell lines from a diverse group of individuals. We find significant differences in RNA, translation, and protein levels suggesting diverse mechanisms of personalized gene expression control. Combined analysis of RNA expression and ribosome occupancy improves the identification of individual protein level differences. Finally, we identify genetic differences that specifically modulate ribosome occupancy—many of these differences lie close to start codons and upstream ORFs. Our results reveal a new level of gene expression variation among humans and indicate that genetic variants can cause changes in protein levels through effects on translation.

Deciphering the molecular mechanisms that underlie human variation is essential for understanding human diversity and personalized medicine. To date, genetic variants that affect protein function in humans have been well studied, but those that control protein levels are less well characterized. Yet, misregulation of protein levels can have profound consequences for human health. For example, transcriptional regulatory mutations that increase telomerase gene expression have been identified in ∼70% of melanoma patients (Horn et al. 2013; Huang et al. 2013) and are frequent in several other cancers (Huang et al. 2013). Similarly, changes in protein levels of SHANK3, neuroligins and neurexins have been linked to autism spectrum disorder, schizophrenia, and learning disorders (Darnell 2011). Therefore, understanding how RNA levels and translation efficiency control protein levels on an individual basis is required not only for understanding human phenotypic diversity, but also for personalized medicine as thousands of human genome sequences become available.

Protein expression is determined at many levels, including (1) RNA expression, (2) translation efficiency, and (3) protein stability. Recent studies have begun to unravel the extent of human variation at RNA levels and its control through transcription factor binding sites and chromatin (Stranger et al. 2007; Kasowski et al. 2010, 2013; McDaniell et al. 2010; Montgomery et al. 2010; Pickrell et al. 2010; Pai et al. 2012; Westra et al. 2013). However, protein levels often correlate poorly with RNA expression (Vogel and Marcotte 2012; Ly et al. 2014). Translation efficiency, i.e., the number of proteins synthesized per mRNA, has been suggested to account for a large component of the unexplained variation in protein levels (Schwanhäusser et al. 2011; Marguerat et al. 2012). Although recent studies in yeast have begun to address the genetic control of translation efficiency (Albert et al. 2014; Artieri and Fraser 2014; McManus et al. 2014; Muzzey et al. 2014), little is currently known about variation in translation efficiency and its genetic determinants in humans. Further, an integrated view of how expression is controlled at many different levels is lacking in humans.

Here, we utilized RNA-seq and ribosome profiling to identify ribosome occupancy profiles. Ribosome profiling involves RNase digestion of unprotected RNA and isolation of ribosome-bound mRNA segments. The sequences of ribosome-protected mRNA fragments can then be used to deduce the number of ribosomes per message in conjunction with RNA-seq data. We integrated these measurements with quantitative proteomics to reveal a comprehensive view of the variation in gene expression programs across a diverse set of humans.

Results

Measuring ribosome occupancy across individuals at a global scale

To measure genome-wide ribosome occupancy of mRNAs, we first improved and adapted the ribosome profiling protocol (Ingolia et al. 2009, 2011, 2012) for lymphoblastoid cell lines (LCLs) (Fig. 1A). A critical step in the ribosome profiling method involves RNase digestion of unprotected RNAs before isolating and sequencing ribosome-associated mRNAs. While optimizing the protocol, we observed that RNase I digestion caused extensive degradation of ribosome integrity (Fig. 1B,C; Supplemental Fig. S1A). The loss of polyribosome signal was not accompanied by a corresponding increase in monosome signal (Fig. 1C; Supplemental Fig. S1A), but rather a shift toward lighter fractions indicative of free and degraded RNAs (Fig. 1C; Supplemental Fig. S1A). Hence, RNase I treatment can lead to a loss in ribosome integrity in addition to producing the expected 80S ribosome footprints (i.e., monosomes).

Figure 1.

Choice of RNase is critical for generating ribosome profiling data. (A) A schematic representation of the ribosome profiling strategy is shown. A key step is the digestion of unprotected RNA segments with an RNase. The ribosome-protected RNA segments are isolated using a sucrose cushion and prepared for high-throughput sequencing. (B) Human lymphoblastoid cells (GM12878) were lysed in the presence of cycloheximide. The samples were ultracentrifuged through a 10%–50% sucrose gradient. Samples were fractionated while continuously monitoring absorbance at 254 nm. A representative polysome profile is shown. (C) Samples were prepared for ultracentrifugation as in B with the following exception: The cleared lysate was incubated with 100 units of RNase I (Ambion) for 30 min at RT before the ultracentrifugation. (D) Samples were prepared as in B, except 300 units of RNase T1 (Fermentas) and 500 ng of RNase A (Ambion) were used for the RNase digestion step. A complete digestion of polysomes into monosomes was observed. (E) Schematic representation of the data sets used in the current study. Genotype, ribosome profiling, RNA-seq, and mass spectrometry-based proteomics data were collected from lymphoblastoid cells derived from a diverse group of 30 individuals.

We tested whether other RNases could alleviate this problem and found that treatment with RNase A and RNase T1 (which collectively cut after C, U, and G) resulted in complete digestion of polyribosomes into monosomes (Fig. 1D; Supplemental Fig. S1A). Recent work in Drosophila and other systems also reported the importance of optimizing nuclease digestion to generate robust ribosome profiling data (Dunn et al. 2013; Ricci et al. 2014). Using our optimized ribosome profiling protocol, we generated ribosome occupancy maps for LCLs obtained from 30 individuals of diverse ethnic backgrounds: five Europeans, two Asians, and 23 Yorubans with significant genetic diversity (Fig. 1E). These lines were chosen because (1) their genomes have been sequenced (The International HapMap 3 Consortium 2010; The 1000 Genomes Project Consortium 2012); (2) their relative protein and RNA levels have been previously measured (Khan et al. 2013; Wu et al. 2013); and (3) they can be grown in large quantities. Importantly, the ribosome occupancy maps were based on at least two replicate samples for the majority of individuals. In parallel, we generated 44 deep RNA-seq libraries (with a median of ∼12 million uniquely mapped transcriptome reads) from the same cells and combined these with those from previous work (Pickrell et al. 2010; Lappalainen et al. 2013; ’t Hoen et al. 2013), thereby providing multiple RNA-seq replicates for most individuals.

We leveraged replicate measurements to assess data quality and its dependence on several parameters, including alignment strategy, mRNA enrichment method, PCR artifacts, gene length normalization, and batch effects (Supplemental Fig. S2A–E; Supplemental Methods). In addition to verifying the high quality of the data, replicate measurements also enabled modeling of gene-specific variance in RNA expression and ribosome occupancy per individual, allowing robust derivation of individual-specific translation efficiency estimates. Specifically, we developed a linear modeling–based approach to regress out the effects of RNA expression from ribosome occupancy measurements to calculate translation efficiency (Supplemental Methods).

Finally, for the 28 individuals studied here, we previously measured relative protein abundances via isobaric tag-based quantitative proteomics using the same cell lines (Wu et al. 2013). In total, we present a combined analysis of 133 high-throughput sequencing libraries (83 RNA-seq and 50 ribosome profiling libraries) and extensive protein expression measurements (Fig. 1E).

Integrative analysis of RNA expression, ribosome occupancy, and protein levels

We first considered the relationship between RNA expression, ribosome occupancy, and protein expression across genes. As expected, RNA expression and ribosome occupancy were highly correlated (Spearman ρ = 0.87, P-value < 2.2 × 10−16; outlier robust correlation 0.88 using Donoho-Stahel estimator) (Supplemental Fig. S2F), albeit still lower than biological replicates of RNA expression data (Spearman ρ ∼ 0.98, P-value < 2.2 × 10−16), indicating that control of ribosome occupancy levels is distinct from RNA levels. Importantly, ribosome occupancy correlated better with protein levels than RNA expression correlated with protein levels (Spearman ρ of 0.54 and 0.43, Donoho-Stahel estimator based correlation coefficient 0.56 and 0.42, respectively; permutation test for difference in correlation coefficient P-value < 10−4) (Supplemental Fig. S2F,G). Consistent with previous results (Ingolia et al. 2009), these results suggest that ribosome occupancy is a better predictor of protein level differences between genes.

Although the correlation analysis reveals pairwise relationships, the interdependencies between RNA expression, translational efficiency, and protein levels are not captured. For example, some genes with high protein levels can have low RNA expression but very high translation efficiency, yielding a decreased correspondence between RNA expression and protein levels. To reveal such interdependencies, we utilized self-organizing maps (SOM), an integrative machine learning method that is robust to noise and allows assessment of all relationships simultaneously (Fig. 2A; Kohonen 1990; Wehrens and Buydens 2007). Because SOMs are sensitive to differences in mean and variance of the input variables, we first converted each measurement into its relative rank order expressed as percentiles ensuring equal weighting of the input variables for the SOM training (Supplemental Methods). After training, each neuron within the SOM contains genes that share a similar pattern of expression and protein level (Fig. 2A; Supplemental Fig. S2H).

Figure 2.

Ribosome occupancy correlates better with absolute protein levels than RNA expression and protein levels. (A) A self-organizing map (SOM) was trained using ribosome occupancy, RNA expression, translation efficiency (TE), and protein levels. These measurements were converted into their relative rank order before training. After training, each neuron in the SOM contains several genes sharing similar expression patterns. (B) Four different colorings of the trained SOM depict the mean ribosome occupancy, RNA expression, translation efficiency, or protein levels for each neuron. (C) Neurons of the SOM were grouped using affinity propagation clustering (Frey and Dueck 2007). Shared coloring between nodes indicates membership to the same cluster. For each cluster, the mean rank in ribosome occupancy (RO), RNA expression (RE), translation efficiency (TE), and protein level (PL) was shown for the representative neuron of the cluster. The number of genes in each cluster (n) is shown. (D) For four of nine clusters, significantly enriched gene ontology (GO) terms were identified (FuncAssociate; permutation-based corrected P-value < 0.05) (Supplemental Table S1; Berriz et al. 2009). For Clusters 5 and 8, selected GO categories were shown (log2 odds ratio). Supplemental Table S1 contains the full list of enriched terms.

The emerging map recapitulated the pairwise relationships between RNA expression, ribosome occupancy, and protein levels across neurons (Fig. 2B). We further grouped neurons in the SOM using a clustering approach (affinity propagation clustering) (Frey and Dueck 2007). This approach uncovered nine clusters in the SOM, revealing the distinct relationships between RNA expression and translation efficiency in determining protein levels. For example, genes in Cluster 6 have relatively high RNA expression but do not reach high protein levels as they are translationally repressed (Fig. 2C).

We then examined functional (GO term) enrichments (Berriz et al. 2003, 2009) across the different clusters within the SOM and found specific functional enrichments for four of the nine clusters (Fig. 2C; Supplemental Table S1). Genes with high translation efficiency and high protein levels were enriched for diverse functional categories such as the proteasome complex, glycolysis, mRNA splicing, and DNA damage checkpoint (P-value < 0.05 for all categories using permutation-based multiple hypothesis correction) (Supplemental Table S1; selected examples are shown in Fig. 2D). Conversely, genes associated with translation and cytosolic ribosome constituents were enriched among those that exhibited very high RNA and protein levels despite having lower translation efficiencies (P-value < 0.05 for all categories using permutation-based multiple hypothesis correction) (Supplemental Table S1; Fig. 2D), raising the possibility that higher protein stability or feedback mechanisms on translation efficiency modulate the levels of translation components. These findings indicate that some sets of functionally coherent genes adopt alternative strategies to achieve their respective steady-state protein levels.

Gene expression variability between individuals

We next focused on how ribosome occupancy and RNA expression differ between individuals. We leveraged replicate measurements and identified genes with significant inter-individual variance in RNA expression or ribosome occupancy, exceeding technical noise. We found that ∼27% of genes had statistically significant inter-individual variation in RNA expression compared to only ∼7% of genes that had detectable variation in ribosome occupancy (Holm’s method adjusted P-value < 0.05 based on a simulation based likelihood ratio test) (Fig. 3A,B; Holm 1979). Consequently, ∼20% of all genes exhibit inter-individual RNA expression variation that is not reflected in ribosome occupancy. These results were not explained by different sensitivities of the measurements (Supplemental Fig. S3A). These results were also consistent when restricting the analysis to only the Yoruban individuals or when excluding RNA expression data not generated by our laboratory, indicating the robustness of the results.

Figure 3.

Identification of genes with significant inter-individual variability in RNA expression and ribosome occupancy improves the ability to identify personal differences in protein levels. (A) Ribosome occupancy and RNA expression was modeled using a linear mixed model treating individuals as a random effect and mean expression as the fixed effect. A simulation-based exact likelihood ratio test (Scheipl et al. 2008) was used to compare the linear mixed model to a linear model that did not include the individual as a predictor. The number of genes that show significant inter-individual in RNA expression or inter-individual variation in ribosome occupancy is plotted (Holm's corrected P-value < 0.05). (B) The Venn diagram depicts the overlap between the two groups. (C) Enriched gene ontology (GO) terms among genes with significant inter-individual variation in both RNA expression and ribosome occupancy was determined using FuncAssociate (Berriz et al. 2009). Cytoscape (Smoot et al. 2011) was used to visualize the enriched GO terms (permutation test corrected P-value < 0.05, odds ratio > 3) (Supplemental Table S2). Nodes correspond to GO terms and are colored by the corrected P-value. The size of the node is proportional to the logarithm of the odds ratio. The similarity between GO terms was quantified using Kappa similarity. The strength of the similarity was visualized using darker edge colors (Supplemental Methods). An edge-weighted spring embedded layout is shown. (D) For each gene, Spearman correlation was calculated between individual specific RNA expression and relative protein levels. The distribution of the correlation coefficients was plotted as a density. Genes that showed significant variation in both RNA expression and ribosome occupancy between individuals are plotted with red bars and genes without detectable variation in RNA expression and ribosome occupancy are shown with white bars.

Genes that exhibited significant inter-individual variation in both RNA expression and ribosome occupancy were highly enriched for gene ontology terms, including “chemokine receptor activity,” “complement activation,” “leukocyte migration,” and “antigen binding” (P < 0.05 permutation-based multiple hypothesis correction) (Fig. 3C; Supplemental Table S2), indicating a role in immune functions. Consistently, protein levels that exhibit the most variation between individuals were previously shown to be enriched for “immune system process” (Wu et al. 2013). These functional categories are highly specific to the function of the studied cell type, LCLs (Fig. 3C; Supplemental Table S2). Given that genes with significant inter-individual variation were directly pertinent to the function of the cell line studied here, it is likely that carrying out similar studies in other cell types will expand the set of genes whose expression levels differ significantly between individuals.

Within genes that exhibited significant inter-individual variation in both RNA expression and ribosome occupancy, we identified three subsets. Within the first subset, the variability in RNA expression was comparable to variability in ribosome occupancy (Supplemental Fig. S3E). This first subset contained 54% of all genes exhibiting inter-individual variation in both RNA expression and ribosome occupancy. The second subset consisted of genes that had higher RNA-level variability compared to ribosome occupancy variability. This subset encompassed nearly twice as many genes as the third subset, where ribosome occupancy variability was higher than that of RNA expression (Supplemental Fig. S3E). These results were consistent with the observation that for many genes, inter-individual RNA expression variation is not reflected in ribosome occupancy. Taken together, our results are consistent with yeast studies that reported translational buffering of divergent RNA expression (Artieri and Fraser 2014; McManus et al. 2014). However, we note that an alternative explanation of our findings is the presence of an untranslated pool of mRNA (e.g., nuclear-retained or sequestered cytoplasmically in P bodies) that is variable between individuals.

A small, but interesting fraction of genes (0.7%) exhibited differential ribosome occupancy between individuals with no apparent differences at the RNA level (Supplemental Table S3). These were enriched in genes coding for proteins involved in “cellular response to chemical stimulus” and the “Golgi apparatus” (P < 0.05 permutation-based multiple hypothesis correction) (Supplemental Table S2). These results suggest that translational control may play important roles in cellular signaling, whereby rapid cellular responses are often required.

Relationship between individual differences in protein levels, ribosome occupancy, and RNA expression

An outstanding question in understanding phenotypic variation is how individual-specific protein levels relate to corresponding differences in gene expression. We previously measured relative protein levels for approximately 6000 proteins using the same cell lines (Wu et al. 2013). As expected, the protein level measurements were skewed toward genes that are more highly expressed and translated (Wilcoxon rank-sum test P < 2.2 × 10−16) (Supplemental Fig. S3B). We first calculated the correlation between RNA expression and the corresponding protein level across individuals. Consistent with previous results (Wu et al. 2013), the median correlation coefficient was 0.22, with 11% of genes showing a statistically significant correlation (Spearman correlation coefficient, 5% FDR using the Benjamini-Hochberg method) (Fig. 3D; Supplemental Fig. S3C; Benjamini and Hochberg 1995).

We next repeated this analysis for the set of genes that we identified as having significant RNA expression variability between individuals. Among this subset, relative protein levels and RNA expression had a median correlation coefficient of 0.43 (Spearman correlation coefficient) (Supplemental Fig. S3D), indicating a partial correlation between RNA and protein variability.

Finally, we tested whether joint measurement of RNA expression and ribosome occupancy improved this correspondence. Specifically, we considered genes that exhibit significant inter-individual variation in both ribosome occupancy and RNA expression. Strikingly, 83% of these genes had statistically significant correlation between differences in protein levels and RNA expression with a median correlation coefficient of 0.67 (Spearman correlation coefficient, 5% FDR using Benjamini Hochberg method) (Fig. 3D; Supplemental Fig. S3C). The large difference between the correlation coefficients indicates that measuring both ribosome occupancy and RNA levels simultaneously greatly improves the ability to identify gene expression variability between individuals that will eventually result in personal differences in protein levels.

Genetic determinants of variability in ribosome occupancy

We next investigated whether genetic differences between individuals were associated with the observed variation in gene expression, specifically at the ribosome occupancy level. We used two complementary approaches. First, we used the 21 unrelated individuals from the Yoruban population and conducted a cis-quantitative trait loci (cis-QTL) mapping approach. Using the cis-QTL mapping strategy, we identified significant association between single nucleotide polymorphisms (SNPs) and ribosome occupancy for 67 genes (30% FDR) (Supplemental Fig. S4A–D; Supplemental Table S4). Although 34 of the 67 ribosome occupancy QTLs (roQTLs) were not associated with significant differences in RNA expression (nominal association P-value > 0.05), this analysis cannot conclusively show that these roQTLs are not associated with RNA expression. Overall roQTLs had consistent effects on RNA expression and protein levels (Spearman ρ = 0.86, P < 2.2 × 10−16) (Supplemental Fig. S4A–D). Consistent with recent work comparing two yeast strains (Albert et al. 2014), these results suggest that genetic effects that were propagated through RNA expression to ribosome occupancy caused consistent changes in protein levels for this set of genes.

The role of uORFs in modulating ribosome occupancy

We next adopted a targeted approach that was both better powered and enabled detection of combined effects of multiple genetic variants on ribosome occupancy. We first analyzed variants modifying upstream open reading frames (uORFs), which can alter protein expression by regulating translation (Wethmar et al. 2014). In humans, approximately half of annotated transcripts contain uORFs, and presence of uORFs is widely polymorphic across individuals (Supplemental Table S5; Calvo et al. 2009; Barbosa et al. 2013; Waern and Snyder 2013). Disruption of a uORF in the HR gene has been previously shown to lead to Marie Unna hereditary hypotrichosis by modulating the translation of the main ORF, suggesting that human disease can be associated with changes in uORFs (Wen et al. 2009).

We correlated genetic alterations to uORF presence with ribosome occupancy of the main coding region and found 33 transcripts with significant association (5% FDR) (Fig. 4A–D; Supplemental Fig. S4E,F). One particular advantage of this targeted approach was the ability to detect changes to uORFs caused by multiple genetic variants. For example, two different SNPs in the ZNF215 gene result in merging of two uORFs by removing a stop codon (Fig. 4D). Merging of the uORFs significantly increased ribosome occupancy of the main coding region (P < 0.001) (Fig. 4D).

Figure 4.

Nucleotide variants that modify upstream ORFs can alter ribosome occupancy of the main coding region. (A) We identified single nucleotide polymorphisms that generate, delete, or otherwise modify an upstream open reading frame (uORF). We tested whether changes to uORFs affected ribosome occupancy of the main coding region using a linear regression framework. The absolute value of the effect size from the regression was plotted against the P-value of association. For 17 uORF changes shown with red circles, the association was solely with ribosome occupancy (nominal P-value > 0.05 or opposite signed regression coefficients for RNA expression). Supplemental Table S5 shows the robustness to population stratification and linear mixed model. (B) A SNP in the 5′ UTR of the LENG8 gene introduces a premature in-frame stop codon that shortens an existing uORF. This event results in lower ribosome occupancy of the main coding region, as shown in the boxplot (pRibo = 0.002). The horizontal bar reflects the median of the distribution, and the box depicts the interquartile range. The whiskers are drawn to 1.5 times the interquartile range. (C) In another example, SRRM1, a SNP completely eliminates an existing uORF by removing its start codon. The loss of this uORF is associated with reduced ribosome occupancy of the main coding region (pRibo = 0.0004; pRNA = 0.19). (D) The reference sequence of ZNF215 gene has two short uORFs. Two different genetic variants eliminate the stop codon of the first uORF (UGA to UAC or UGA to CAA), resulting in merging of the two short uORFs into a single long uORF. The merging of the uORF significantly modulates both ribosome occupancy and RNA expression (pRibo = 0.0001 and pRNA = 10−9, respectively).

In addition to impacting translational efficiency, nucleotide variants changing uORFs may alter RNA abundance. For example, they may change transcript stability by triggering nonsense-mediated decay (Kervestin and Jacobson 2012). Alternatively, the variant or variants in linkage disequilibrium may alter transcriptional output as a proximal element downstream from the transcription start site. Of the 33 significant associations between changes in uORFs and ribosome occupancy, ∼52% (17 of 33) also had a significant effect on RNA levels (nominal P-value < 0.05), indicating the presence of uORFs and RNA levels are often coupled. However, for 16 other genes, the observed effect was solely on ribosome occupancy, suggesting direct modulation of the translation efficiency of the main reading frame (Fig. 4A,B, Supplemental Table S5). We further observed that presence of uORFs could be associated with both increased and decreased ribosome occupancy of the main coding region (Supplemental Fig. S4G). We verified the robustness of these results by limiting the analysis to data from Yoruban individuals and using an alternative statistical framework based on linear mixed models (Supplemental Fig. S4E,F; Supplemental Table S5). These results reveal that natural genetic variation within the human population can specifically cause personal differences in translation through changes to uORFs.

The role of sequences surrounding the start codon in modulating ribosome occupancy

We next analyzed the Kozak sequence, the region surrounding the start codon for its effect on translation efficiency (Fig. 5A). Previous work has suggested that Kozak sequence is important for both start codon selection and translation efficiency of specific transcripts (Kozak 1987). However, the extent and impact of natural genetic variation affecting the Kozak sequence and the global effect of the Kozak region on translation efficiency have not been studied.

Figure 5.

Nucleotide variants modulating the sequence around the translation initiation site alter translation efficiency. (A) The Kozak region is defined as the 6 nt preceding and 2 nt following the start codon. The derived position weight matrix was visualized using WebLogo (Crooks et al. 2004). The upper panel shows the effects of each nucleotide at the −3 position on translation efficiency. The effect of nucleotides on translation efficiency was tested using the Kruskal–Wallis test. (B) The effect of a Kozak region variant on the ribosome occupancy of NTPCR was assessed using a linear model (P-value = 1.1 × 10−6). A boxplot was used to visualize the distribution of ribosome occupancy for individuals with given genotypes. The horizontal bar reflects the median of the distribution and the box is drawn to depict the interquartile range. (C) WDR11 had two naturally occurring SNPs in its Kozak region. An additive model was adopted to calculate the change in the position weight matrix score of the Kozak region. (D) 5′ UTRs with or without Kozak variants were cloned into a translation efficiency reporter. The reporter expresses a biscistronic mRNA, in which the Renilla luciferase is translated under the control of the cloned 5′ UTR, and the firefly luciferase is translated under the control of Hepatitis C virus (HCV) internal ribosome entry site (IRES). (E,F) The ratio of Renilla to firefly luciferase activity was plotted for NTPCR (E) and WDR11 (F). Error bars represent SEM. The difference between the ratios was assessed using a two-sided two-sample t-test. (*) P-value < 0.05.

We first determined whether certain positions of the Kozak sequence have a global effect on translation efficiency. We found a highly significant and large effect of the nucleotides at position −3 and at position −2 on translation efficiency (Bonferroni adjusted Kruskal–Wallis test P = 5.7 × 10−20 for position −3; P = 1.2 × 10−17 at position −2) (Fig. 5A; Supplemental Fig. S5A). Additionally, the 2 nt immediately after the start codon had statistically significant effects on translation efficiency (Bonferroni adjusted P < 2.8 × 10−7) (Supplemental Fig. S5A). Although previous work using reporter systems anticipated the significance of these features (e.g., Kozak 1987), our analyses highlight the role of sequence composition near the ATG in modulating translation efficiency of endogenous genes at a global scale.

The extent and potential role of natural variation that might alter the Kozak sequence across the genome remains largely unexplored in the human population (Supplemental Table S6; Xu et al. 2010). Among the set of individuals studied here, there were approximately 150 genetic variants altering the Kozak region in at least three individuals. Approximately 65% of Kozak region variants reduced the position weight matrix (PWM) score of the reference sequence (Supplemental Fig. S5B). This effect was even more pronounced for Kozak variants that were observed in a single individual. Of these, 77% reduced the PWM score of the reference sequence, suggesting that selective pressure may be acting to optimize the Kozak sequence.

We next tested the effect of these variants on ribosome occupancy of the main coding region. We utilized the position weight matrix for the Kozak region to score the impact of each variant on the Kozak strength (Fig. 5A). We found nine genes with Kozak variants that modified ribosome occupancy significantly with no significant effect on the RNA levels (10% FDR using Benjamini-Hochberg correction; RNA expression association P-value > 0.01; using a conservative linear mixed model, two of these genes had P < 0.01) (Fig. 5B,C; Supplemental Fig. S5C), indicating the presence of variants specifically affecting translation efficiency.

Finally, to directly examine the role of genetic variation on translation efficiency, we used reporter assays (Jang et al. 1988) for six genes. These included four genes with Kozak region variants and two genes with uORF variants. We cloned the reference 5′ UTR or the variant 5′ UTR with a single base change at the Kozak region or the uORF upstream of a Renilla luciferase and transfected the resulting constructs into human HEK 293 cells. To normalize differences in RNA expression and transfection efficiency, an HCV internal ribosome entry site driven firefly luciferase was cloned after the Renilla stop codon, and the ratio of the Renilla to firefly luciferase was quantified. Differences in this ratio between the reference and variant 5′-UTR-containing reporters for four genes recapitulated the results from our ribosome profiling data, i.e., sequences that were associated with reduced translational efficiency also gave low luciferase ratios. These results provide an independent validation of our conclusion that natural genetic variation can modify sequences surrounding the start codon leading to personal differences in translation (Fig. 5D–F; Supplemental Fig. S5D).

Discussion

This study demonstrates that translation efficiency varies among individuals, and nucleotides important for regulating translation efficiency can be identified. In several cases, we uncovered the mechanisms controlling translation efficiency variation in humans. These included uORFs and sequences near the translation initiation sites. Our study revealed that genetic differences between individuals could lead to gene expression differences at the level of translation.

We leveraged replicate measurements to identify genes with significant variability in RNA expression or ribosome occupancy between individuals. We found that genes that exhibit significant variability in both RNA expression and ribosome occupancy were highly enriched for functions directly pertinent to LCLs such as immune response and leukocyte migration (Fig. 3). Hence, extending this analysis framework to additional cell types or tissues will likely uncover more genes with variable expression among individuals.

We also investigated the relationship between differences in protein levels and variability in RNA expression and translation. We found that joint analysis of RNA expression and translation improved our ability to identify the extent of gene expression variation that would be reflected in protein levels, indicating a tight coupling of translation efficiency and protein levels. These analyses were skewed toward genes with higher expression levels due to missing protein level measurements (Supplemental Fig. S3B). Hence, further improvements in proteomics technology will be needed to test the generalizability of our results to lowly expressed proteins. Despite the significant improvements obtained by joint analysis of ribosome occupancy and RNA level measurements, there remains unexplained variability in protein levels. One potential contributor to this discrepancy is variability in protein degradation rates (Vogel and Marcotte 2012). Another important future direction will be to investigate the contribution of RNA sequence features (e.g., Vogel et al. 2010) to the relationship between RNA expression, ribosome occupancy, and protein levels.

Importantly, genes that have individual variability only in RNA expression are less likely to have corresponding differences at the protein level. Among this subset of genes, only 40% had statistically significant covariation between RNA levels and protein levels (5% FDR). An important implication of this result concerns ongoing efforts that aim to identify genetic determinants of RNA expression (Lappalainen et al. 2013; Battle et al. 2014). These studies are in part motivated by the finding that most disease risk factors identified by genome-wide association studies lie in noncoding regions (Edwards et al. 2013). By linking genetic differences to RNA expression, these studies hope to uncover functional connections to disease states. Yet, our analyses suggest that the functional impact of RNA-level differences needs to be carefully considered to establish causal relationships to phenotype.

Recent consortium efforts measured RNA expression in large sets of genotyped samples (approximately 900 in Battle et al. 2014 and approximately 500 samples in Lappalainen et al. 2013) to identify trans-acting genetic effects on RNA expression. Interestingly, 85% of the trans-effects on RNA expression were mediated by the effects of the associated SNP on a nearby gene (Battle et al. 2014), indicating that changes in regulators of RNA expression lead to differences in RNA levels of distant transcripts. Similarly, genetic variation in translation regulators is likely to have trans-effects on ribosome occupancy of many transcripts. For example, levels of global regulators of translation such as MTOR, and translation initiation factors (e.g., EIF4E), have the potential to modulate the translation of a large number of targets (Mamane et al. 2007; Thoreen et al. 2012). In fact, a recent study that compared translation in two different strains of yeast suggested that the relative importance of trans-effects on translation is comparable to that for RNA levels (Albert et al. 2014). Future studies in the human population will likely uncover trans-acting and additional cis-acting genetic variants associated with translation and reveal the contribution of population-level variation to translation variability.

A recent analysis (Battle et al. 2015) of RNA expression, ribosome occupancy, and protein measurements from several human LCLs concluded that there is a scarcity of human genetic variants associated with translation-specific effects. However, we note critical limitations in their ribosome profiling data. As demonstrated in Figure 1 and recently by Miettinen and Björklund (2015), the nuclease digestion conditions used in Battle et al. (2015) lead to severe ribosomal degradation and significantly lower monosome purity in ribosome profiling libraries. Moreover, the Battle et al. (2015) study design lacks replicate experiments, precluding proper assessment of the reproducibility of ribosome profiling measurements. In resequencing experiments, Battle et al. (2015) reported rank correlations lower than 0.9 (Spearman ρ) for the majority of their samples (Supplemental Fig. S2A in Battle et al. 2015). In contrast, we consistently achieved greater than 0.98 rank correlations between biological replicates using independently grown cells and independently prepared ribosome profiling libraries. Here, by leveraging higher quality ribosome profiling data sets with replicates and independent reporter experiments, we identify genetic variants associated with translation efficiency undetected in Battle et al. (2015).

Our study revealed several genetic variants that control translation efficiency variation in humans, including those affecting the Kozak region and upstream open reading frames (uORFs). A particularly interesting question is the molecular mechanisms of these sequence-function relationships. An intriguing feature of genetic variants modifying uORFs on translation was the observation that both gain and loss events could lead to increased translation of the downstream open reading frame (Supplemental Fig. S4G), consistent with previous work that implicated both positive and negative regulation of translation efficiency by uORFs (Brar et al. 2012; Waern and Snyder 2013). Whereas several mechanisms have been implicated in negative regulation of translation efficiency by uORFs, including nonsense mediated decay (Kervestin and Jacobson 2012), less is known about the mechanisms of positive regulation by uORFs. Recent work identified a complex, DENR-MCT1, that catalyzes translation reinitiation downstream from certain uORFs (Schleich et al. 2014), suggesting that DENR-MCT1 or similar factors may act on subsets of uORFs to increase reinitiation frequency of the downstream ORFs leading to higher translation efficiency.

Recent structural analysis of the yeast 48S translation initiation complex permitted an unprecedented view of the molecular environment of the start codon in eukaryotes (Hussain et al. 2014), revealing a potential mechanism by which Kozak region variants affect translation efficiency. Remarkably, this structural analysis revealed that eIF2A directly contacts nucleotides at positions −2 and −3, the same two positions that our global analysis of Kozak variants highlighted as being the most important for translation efficiency (Fig. 5A). Thus, our results provide functional evidence that these residues are of general importance for translational efficiency.

Together, these results demonstrate that genetic alterations in the human population and disease-associated mutations may penetrate to phenotype through changes in translation. In the era of personal genome sequencing, this information is crucial for understanding the role of genetic variants on gene expression, phenotypic traits, and human disease susceptibility.

Methods

RNA-seq experiments and ribosome profiling experiments

Human lymphoblastoid cell lines (LCLs) were obtained from Coriell Cell Repository. For replicate ribosome profiling experiments, cells were grown separately to a density of 0.8–1.0 × 106 cells/mL. Approximately 10 million cells were pelleted at 250g at 4°C and washed with PBS. The pellets were frozen in liquid nitrogen prior to cell lysis. Seven A260 units of the cleared cell lysates were incubated with 300 units of RNase T1 (Fermentas) and 500 ng of RNase A (Ambion). A 34% (Weight/Volume) sucrose cushion was used to isolate ribosomes. Library preparation and sequencing was done as previously described with some modifications (Supplemental Methods; Ingolia et al. 2012).

For RNA-seq experiments, LCLs were grown to a density of 3 × 105–6 × 105 cells/mL. Total RNA was extracted using TRIzol reagent according to the manufacturer's instructions (Life Technologies), then purified using the Qiagen RNeasy kit (Qiagen) and treated with RNase-free DNase (Qiagen). RNA integrity was checked with a Bioanalyzer (Agilent), and only samples with an RNA integrity number (RIN) of >9.5 were subsequently subjected to either ribosomal depletion or poly-A selection. For ribosomal RNA depletion, 5 μg of purified total RNA was depleted of rRNAs using the Ribo-Zero Magnetic Gold Kit (Human/Mouse/Rat) (Epicentre Biotechnologies). For poly-A selection, 10 μg of purified total RNA were enriched by performing two cycles of selection using the Dynabeads mRNA Purification Kit (Life Technologies). Stranded libraries were prepared following the dUTP protocol (Parkhomchuk et al. 2009). For each cell line, we generated 2 × 101 bp paired-end RNA-seq data using two biological replicates of ribosomal RNA depleted and three biological replicates of poly-A-selected RNA.

Sequence alignment and processing

To enable comparable analysis of high-throughput sequencing data sets, we used a uniform alignment and preprocessing pipeline. Reads were sequentially aligned using Bowtie 2 v.2.0.5 (Langmead and Salzberg 2012). All reads mapping to human rRNA and tRNA sequences were filtered out. The remaining reads were aligned to APPRIS principal transcripts (release 12) (Rodriguez et al. 2013) from the GENCODE mRNA annotation v.15 (Harrow et al. 2012). For all transcript level analyses, reads that map only to coding regions were used. For details, see Supplemental Methods.

Ribosome profiling sample identity verification

The cell line identity for all ribosomal profiling libraries were verified by comparing empirically generated genotype calls to the reference genotypes. Specifically, we utilized SAMtools mpileup utility in combination with BCFtools (Li et al. 2009) to generate genotype calls from the ribosomal profiling read alignments. Finally, a custom Perl script was used to compare the number of perfect matches between empirically called genotypes and the reference genotype that was available from the HapMap and the 1000 Genomes Project (The International HapMap 3 Consortium 2010; The 1000 Genomes Project Consortium 2012). For details, see Supplemental Methods.

Genotype data and processing

Genome sequences were obtained from the 1000 Genomes Project pilot 2 trios and Phase1v3 (The International HapMap 3 Consortium 2010; The 1000 Genomes Project Consortium 2012) for 27 of the 30 individuals. The genome sequences of three cell lines (NA19139, NA19193, and NA19201) were imputed from HapMap release 28 data (The International HapMap Consortium 2007; The International HapMap 3 Consortium 2010) to the 1000 Genomes Phase1v3 reference panel (The 1000 Genomes Project Consortium 2012).

We included all variant calls provided by both release and pilot data sets without additional score or source filtering. We subsetted all single nucleotide polymorphisms (SNPs) that overlap APPRIS transcripts and retained all phasing information from the VCF files. About 8% of the variants in the pilot data set were unphased; and for these variants, we randomly assigned the phase. For details, see Supplemental Methods.

Sequence data normalization and quality control

After accounting for differences in the mRNA enrichment method, ∼9600 transcripts had a read count per million reads mapped (cpm) (as implemented in the edgeR package) (McCarthy et al. 2012) greater than one in at least 40 RNA-seq libraries and 36 ribosome profiling libraries. We used trimmed mean of M-values to account for differences in library size (Robinson and Oshlack 2010) and estimated the mean to variance relationship in the data using the voom method (Law et al. 2014). We explicitly specified the individual identifier to indicate which libraries were replicates from the same individual while applying the voom method. The inverse variance weights obtained from the voom method were used in all analyses where applicable. For details, see Supplemental Methods.

Calculation of translation efficiency

When combined with RNA expression measurements, ribosome profiling enables the estimation of translation efficiency by capturing a snapshot of the transcriptome-wide ribosome occupancy. We treated ribosome profiling and RNA-seq as two experimental manipulations of the RNA pool of the cell. Translation efficiency was calculated using a linear model, in which the normalized expression values are dependent on the treatment (RNA-seq or ribosome profiling) and the individual identifiers (limma R package) (Smyth 2005). For details, see Supplemental Methods.

Self-organizing maps for integrative gene expression analysis

We used SOMs to explore the relationship between protein levels and the three expression measurements: RNA levels, ribosome occupancy, and translation efficiency. SOMs rely on a suitable measure of distance between the transcripts for the clustering. To avoid skewing distance calculation due to difference in scale and variance of the expression measurements, expression levels and protein amounts were converted to percentiles using the empirical cumulative distribution function for each level. The kohonen R package (Wehrens and Buydens 2007) was used for training the SOM with custom modifications to the plotting functions following (Xie et al. 2013). We then clustered the codebook vectors of the 140 units in the SOM using affinity propagation clustering (Frey and Dueck 2007) as implemented in the apcluster R package (Bodenhofer et al. 2011). For details, see Supplemental Methods.

Gene set enrichment analysis

FuncAssociate 2.0 was used for gene set enrichment analyses (Berriz et al. 2009). The background gene list was explicitly defined as the set of all genes that could potentially be included in the query set. We defined significant enrichments as GO terms with an odds ratio >2 and adjusted P-value < 0.05. P-value adjustment was carried out using a permutation method to account for the overlap between the GO terms. We calculated the Kappa Similarity Score between all pairs of significantly enriched GO terms. We retained edges between all pairwise GO terms whose Kappa similarity score was >0.1. Enriched GO terms were visualized with Cytoscape (Smoot et al. 2011) using the edge-weighted spring embedded layout. For details, see Supplemental Methods.

Analysis of inter-individual variation in RNA expression and ribosome occupancy

Replicate measurements for RNA-seq and ribosome profiling were used to determine inter-individual variance while controlling for platform specific variance observed between replicates from the same individual. To decompose these two variance components, we used a linear mixed effects model in which we treated the individual as a random effect. As before, we utilized the inverse variance weights obtained from the voom approach and fitted the model using log-likelihood instead of a restricted maximum likelihood approach. We tested the null hypothesis that the variance of the random effect is zero. Rejection of the null hypothesis implied that there was significant inter-individual variance in the expression of the given transcript. We adopted a simulation-based approach using an exact likelihood ratio test implemented in the RLRsim R package (Scheipl et al. 2008). Multiple-hypothesis correction was applied to RNA expression and ribosome occupancy P-values separately using Holm's method. For details, see Supplemental Methods.

Cis-QTL identification

Association between gene expression and the genotype at each variant position located in the exons of the APPRIS transcripts was tested in the set of 21 unrelated Yoruban individuals using PLINK v1.07 (Purcell et al. 2007). For each transcript, replicate gene expression measurements were averaged for this analysis. The expression values were regressed on variant genotypes assuming an additive genetic model where genotype was coded as 0,1, or 2 copies of the alternate allele and restricting the testing to variants with a minor allele frequency >10% in the 21 unrelated Yoruban individuals.

Genetic determinants of variability in ribosome occupancy

Testing the effect of uORF events on ribosome occupancy

We used AUG and CUG as potential start codons, and UAG, UAA, and UGA as potential stop codons. CUG initiation has been reported in few well-documented cases, such as FGF2, VEGF, MYC, and MHC class I transcripts (Hann et al. 1988; Vagner et al. 1996; Meiron et al. 2001; Schwab et al. 2003). Additionally, recent studies mapping genome-wide translation initiation sites have suggested that upstream translation initiates frequently from non-AUG codon, most prominently at CUG sites (Ingolia et al. 2011; Leek et al. 2012).

To group individuals by uORF differences on a given transcript, we first determined all possible combinations of uORF gain/loss events. We then tested whether the copy number of the uORF variants affects ribosome occupancy of the main coding region using two approaches. In the first approach, we used linear regression. In the second, more conservative approach, we fitted a linear mixed model assuming the difference in cell lines is an individual-specific random effect, i.e., treating the different cell lines of the same individual as “technical replicates.” For details, see Supplemental Methods.

The effect of Kozak region sequence on translation efficiency

We defined the Kozak region as the 6 nt preceding the start codon and the 2 nt following the start codon. We extracted the nucleotide sequence of this region from all annotated APPRIS transcripts and built a PWM, which recapitulated the known Kozak sequence (Fig. 5A). We tested whether the nucleotide content of the Kozak sequence affected translation efficiency using the Kruskal–Wallis test. Specifically, we tested whether transcripts split into four categories based on the nucleotide at a given position has the same translation efficiency. We corrected the P-value from this test using Bonferroni correction for the eight tests (number of positions) that were performed.

Association between Kozak region genetic variants and ribosome occupancy

Next, we collected all SNPs that intersect annotated Kozak regions. We scored the variant and the reference Kozak sequence using the PWM matrix obtained above. We coded each variant by the PWM score change and assumed an additive relationship between different positions in the Kozak region and copy number of the allele. We then tested whether the variants in the Kozak regions affect ribosome occupancy of the main coding region using a linear model. For all Kozak variants affecting ribosome occupancy, we conducted the same association test using RNA expression level as the phenotype. As for the uORF analysis, we deemed RNA association to be not significant if the nominal P-value was >0.05 or if the regression coefficient had the opposite sign. For details, see Supplemental Methods.

Luciferase reporter assays

To assay translation efficiency, we used a bicistronic luciferase reporter construct (Jang et al. 1988). This construct has an SV40 promoter that drives the expression of a bicistronic transcript that includes both the firefly and Renilla luciferase. Although the Renilla luciferase translation is cap-dependent, firefly luciferase has a Hepatitis C virus (HCV) internal ribosome entry site (IRES) that enables cap-independent translation.

Gene segments were synthesized and cloned right in front of the start codon (ATG) of the Renilla luciferase using the CloneEZ system (GenScript). The bicistronic constructs were transfected into HEK293 cells. Cap-dependent translation was calculated by taking the ratio of Renilla (cap) to firefly (HCV IRES) luciferase activity and derived from five replicate experiments. The HCV-IRES-dependent translation of firefly luciferase accounted for differences in RNA expression and transfection efficiency. Outlier detection was carried out as described (Jacobs and Dinman 2004). The difference between Renilla to firefly luciferase ratios was assessed using a Welch two-sample two-sided t-test. For details, see Supplemental Methods.

Data access

The sequencing data from this study have been submitted to the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE65912.

Acknowledgments

C.C. is supported by the Child Health Research Institute, the Lucile Packard Foundation for Children's Health, and the Stanford CTSA grant number UL1TR000093. G.W.B. is supported by a Benchmark Stanford Graduate Fellowship. S.I.C. is supported by a Stanford Transformative and Innovation grant. D.S. was supported by National Institutes of Health, National Human Genome Research Institute grant NIH/NHGRI T32 HG000044, and the Genentech Graduate Fellowship. Hua Tang is supported by the NIH grant GM07305907. This research was supported by NIH grants 1U01HG00761101, HG007611, and HG006996. We thank Basar Cenik, Douglas H. Phanstiel, and Robert J. Nichols for comments. We thank Ghia Euskirchen for sequencing support. Illumina sequencing services were performed by the Stanford Center for Genomics and Personalized Medicine.

Author contributions: C.C. and M.P.S. designed the study. C.C. carried out ribosome profiling experiments and coordinated the computational analyses. G.W.B. carried out uORF analysis; S.I.C. contributed to ribosome occupancy QTL analysis; B.A. developed visualization tools; and Hagen Tilgner contributed to RNA-seq analysis. E.S.C. performed luciferase reporter assays; F.G. generated RNA-seq data; and D.S. carried out cell culture/maintenance. E.R. helped optimize the ribosome profiling protocol; and C.L.A. and Hua Tang provided statistical analysis guidance. C.C., E.S.C., C.L.A., and M.P.S. wrote the manuscript with input from all authors.

Footnotes

  • [Supplemental material is available for this article.]

  • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.193342.115.

  • Freely available online through the Genome Research Open Access option.

  • Received April 24, 2015.
  • Accepted August 20, 2015.

This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

References

| Table of Contents
OPEN ACCESS ARTICLE

Preprint Server