Synchronized replication of genes encoding the same protein complex in fast-proliferating cells

  1. Wenfeng Qian1,2,3
  1. 1State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China;
  2. 2Key Laboratory of Genetic Network Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China;
  3. 3University of Chinese Academy of Sciences, Beijing 100049, China;
  4. 4Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona 08003, Spain;
  5. 5Center for Quantitative Biology and Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
  • Corresponding authors: wfqian{at}genetics.ac.cn, ychen{at}genetics.ac.cn
  • Abstract

    DNA replication perturbs the dosage balance among genes; at mid-S phase, early-replicating genes have doubled their copies while late-replicating ones have not. Dosage imbalance among genes, especially within members of a protein complex, is toxic to cells. However, the molecular mechanisms that cells use to deal with such imbalance remain not fully understood. Here, we validate at the genomic scale that the dosage between early- and late-replicating genes is imbalanced in HeLa cells. We propose the synchronized replication hypothesis that genes sensitive to stoichiometric relationships will be replicated simultaneously to maintain stoichiometry. In support of this hypothesis, we observe that genes encoding the same protein complex have similar replication timing but mainly in fast-proliferating cells such as embryonic stem cells and cancer cells. We find that the synchronized replication observed in cancer cells, but not in slow-proliferating differentiated cells, is due to convergent evolution during tumorigenesis that restores synchronized replication timing within protein complexes. Taken together, our study reveals that the demand for dosage balance during S phase plays an important role in the optimization of the replication-timing program; this selection is relaxed during differentiation as the cell cycle prolongs and is restored during tumorigenesis as the cell cycle shortens.

    The balance hypothesis asserts that the stoichiometric relationship among subunits of a protein complex is essential for the survival and proliferation of cells; the disruption of this relationship perturbs functions of protein complexes and sometimes even causes cytotoxicity (Papp et al. 2003; Veitia 2005; Birchler and Veitia 2010, 2012). The balance hypothesis provides a unique framework for understanding a variety of biological phenomena, especially the proliferation rate of aneuploid cells, the fate of duplicated genes, and the X-Chromosome inactivation. Aneuploidy, defined as a karyotype that is not a multiple of the haploid complement, generates dosage imbalance among genes on different chromosomes (Birchler et al. 2001; Chen et al. 2019). Consistent with the balance hypothesis, aneuploidy often results in a more severe growth defect than polyploidy, which keeps the dosage balance among genes (Birchler and Newton 1981; Otto and Whitton 2000; Birchler and Veitia 2010). Furthermore, the addition of a larger chromosome, which leads to a dosage imbalance among more genes, often results in a greater reduction in fitness (Torres et al. 2007, 2010; Pavelka et al. 2010; Birchler and Veitia 2012; Stingele et al. 2012). Individual gene duplication confers the second type of dosage imbalance, between duplicate genes and singletons. Consistent with the balance hypothesis, genes often reduce their expression soon after duplication (Qian et al. 2010), through which the dosage balance is restored. Furthermore, genes encoding protein complexes exhibit a higher retention rate after the whole genome duplication so that the dosage balance among subunits is maintained (Papp et al. 2003; Qian and Zhang 2008; Freeling 2009; Tasdighian et al. 2017). In addition, dosage balance in a protein complex also explains the evolution of gene size (Chen et al. 2009) as well as mammalian X-Chromosome inactivation (Lin et al. 2012).

    A probably more prevalent but less studied source of dosage imbalance is caused by DNA replication that occurs in each cell cycle. During the DNA synthesis phase (S phase) of a cell cycle, the genome is replicated in a defined temporal order known as the replication-timing program (Pope and Gilbert 2013; Sima et al. 2019). In the middle of S phase, early-replicating genes have doubled their copy number, but late-replicating genes have not, leading to a dosage imbalance between early- and late-replicating genes. Such dosage imbalance likely causes a growth defect, especially among genes sensitive to dosage relationships, such as those encoding the same protein complex (Papp et al. 2003; Birchler and Veitia 2012). Although acetylated histones (H3K56ac) can incorporate into newly replicated DNA regions and partly suppress the expression of newly replicated genes in yeast (Voichek et al. 2016), this compensatory mechanism cannot completely restore the dosage balance; the mRNA levels of early-replicating genes still exhibited an ∼20% increase compared to late-replicating genes during mid-S phase (Voichek et al. 2016). Consistently, higher expression was observed when a GFP reporter was inserted into the early-replicating regions in budding yeast (Chen and Zhang 2016).

    The dosage imbalance during S phase could be more severe in mammalian cells, where H3K56ac may not mark newly replicated DNA (Stejskal et al. 2015). Consistently, in mouse embryonic stem cells (ESCs), the transcription rates of Pou5f1 and Nanog increased by 28% and 50%, respectively, upon DNA replication (Skinner et al. 2016). A similar phenomenon was also observed in UBC in human primary fibroblasts (Padovan-Merhar et al. 2015). These data show that replication can cause dosage imbalance during S phase and suggest that additional mechanisms should exist in mammalian cells to solve the problem.

    In this study, we aim to explore the specific mechanisms by which mammalian cells maintain dosage balance within a protein complex and to investigate if such mechanism plays a role in tumorigenesis as the fraction of time cells spend in S phase increases.

    Results

    The single-cell transcriptomes of HeLa cells through the cell cycle

    Increased gene expression after DNA replication has been observed in individual genes in mammalian cells using single-molecule fluorescence in situ hybridization. To determine if this phenomenon is generalizable to other genes in the genome, we characterized the transcriptomes of HeLa, a cell line derived from cervical cancer cells, through the cell cycle. Taking advantage of single-cell RNA sequencing (scRNA-seq) technology, we obtained the transcriptomes of 884 individual cells. These cells are neither synchronized with drugs nor sorted by flow cytometers and therefore are more likely to represent the native cell-cycle status (Cooper 2003). We performed principal component analysis (PCA) using the expression levels of 151 previously identified cell-cycle phase-specific genes (Supplemental Table S1; Whitfield et al. 2002). We observed that cells formed a circle in the scatterplot of the first two principal components (Fig. 1A). To determine if this circle is related to the cell cycle, we assigned each cell to one of the five cell-cycle phases (G1/S, S, G2, G2/M, or M/G1) by scoring for the expression specificity of the phase-specific genes (Fig. 1B; Supplemental Table S1; Macosko et al. 2015). Cells assigned to the same phase are clustered in the PCA plot, and five clusters are in the order of the known cell-cycle progress in a clockwise direction (Fig. 1A), underscoring the competence of the single-cell transcriptome analysis in recognizing the cell-cycle statuses of individual cells.

    Figure 1.

    Cell-cycle analysis of the HeLa scRNA-seq data. (A) The PCA plot of 884 individual HeLa cells shows asynchrony in the cell cycle. (B) The cell-cycle status of individual HeLa cells. Each cell is classified into one of the five cell-cycle phases based on the phase-specific scores; 259, 172, 86, 201, and 166 cells were classified into G1/S, S, G2, G2/M, and M/G1 phases, respectively. The phase-specific scores were standardized among five phases for each cell. (C) S-phase cells were separated into six subgroups based on the PC1 values in the PCA plot. Subgroups 1–6 represent the six stages of S phase from early to late. (D) The expression level of S-phase marker genes among cells. CCNE1 (left) and CENPA (right) are known to be highly expressed in the early- and late-S phases, respectively. The average expression level among cells is shown for each S-phase subgroup (bottom).

    The overexpression of early-replicating genes relative to late-replicating ones is predicted to take place during mid-S phase when the replication of early-replicating genes has finished and that of late-replicating genes has not yet started. To test this prediction, we further divided S-phase HeLa cells into six subgroups based on their order in the clockwise direction in the PCA plot (Fig. 1C). Two lines of evidence suggest that subgroups 1–6 reflect genuine S-phase progression. First, genes known to be specifically expressed in a certain stage of S phase were precisely expressed in the cells at this stage. For example, CCNE1, a protein required for the G1/S transition (Koff et al. 1992), was more highly expressed in the cells of subgroup 1 (Fig. 1D). In contrast, CENPA, a protein that plays a role in centromere assembly (Zeitlin et al. 2001), was more highly expressed in the cells of subgroups 4–6 (Fig. 1D). Second, we inferred the mRNA abundance in each cell from the total number of unique molecular identifiers (UMI) of all genes in the scRNA-seq data. mRNA abundance in each cell increased from subgroup 1 to 6 (Supplemental Fig. S1), consistent with previous findings of the increase of cell volume and mRNA abundance along the progress of the cell cycle (Boye and Nordström 2003; Mitchison 2003; Padovan-Merhar et al. 2015). Both observations indicate that we successfully obtained the transcriptomes at different S-phase stages.

    Early-replicating genes are overexpressed during S phase in HeLa cells

    To determine if early-replicating genes are overexpressed during mid-S phase (Fig. 2A), we compared the expression of early- versus late-replicating genes during S phase. We retrieved DNA replication-timing profiles of HeLa cells (Weddington et al. 2008; Oda et al. 2012) and defined 2000 (∼10%) genes with the highest or lowest replication timing values as early- or late-replicating genes, respectively (Fig. 2B). We generated all possible gene pairs in which one is from the 2000 early-replicating genes and the other is from the 2000 late-replicating ones. We estimated the expression ratio of early-/late-replicating genes for each gene pair and calculated the mean of all gene pairs for each S-phase subgroup. The overexpression of early-replicating genes was observed in almost all S-phase subgroups and peaked in subgroup 4 (by ∼10%, P = 8 × 10−8, bootstrap hypothesis testing) (Fig. 2B, blue dots). As a control, we randomly assigned 2000 genes from the genome as “pseudo” early- or late-replicating genes; the overexpression was no longer observed (Fig. 2B, orange dots). Note that this observation cannot be due to the effect of a small number of genes exempt from dosage compensation (Müller and Nieduszynski 2017), as the same conclusion can be reached when we randomly dropped 10% of the 2000 early- and 2000 late-replicating genes (Supplemental Fig. S2A). These results revealed that, although the dosage imbalance was partly compensated during S phase, an average ∼10% overexpression of early-replicating genes remained.

    Figure 2.

    Early-replicating genes are overexpressed during S phase. (A) The dosage between early- and late-replicating genes is predicted to be imbalanced during mid-S phase because early-replicating genes have finished replication and late-replicating ones have not yet started. (B) Early-replicating genes are overexpressed during mid-S phase. The observed (blue) average expression ratio of all gene pairs in each S-phase subgroup was normalized by that in G1/S. The random expectation (orange) was obtained from 2000 “pseudo” early- or “pseudo” late-replicating genes that were randomly sampled from the genome. Error bars represent the standard errors among gene pairs in each subgroup. (C) The distribution of the expression ratio of early- to late-replicating gene in each gene pair in subgroup 4.

    Such imbalance was stronger when we considered individual gene pairs. We observed that 28.3%, 7.3%, and 1.8% of gene pairs with the relative expression of early-replicating genes increased by at least 20%, 50%, and 80% in subgroup 4, respectively (Fig. 2C). Genes replicating at different times show variation in the relative expression level during S phase, and vice versa. A pair of genes with a large variance of expression ratio through S phase tended to replicate at different times (Supplemental Fig. S2B). Collectively, the dosage balance between early- and late-replicating genes is disrupted during S phase.

    Genes encoding subunits of the same protein complex tend to replicate simultaneously in HeLa cells

    When dosage imbalance occurs among genes sensitive to the stoichiometric relationship, such as those encoding protein complexes, fitness is likely reduced (Papp et al. 2003; Birchler and Veitia 2012; Oromendia and Amon 2014). Nevertheless, the reduction in fitness can be avoided if genes encoding the same protein complex are replicated simultaneously. Following this logic, we proposed the synchronized replication hypothesis that the replication of genes encoding the same protein complex is synchronized.

    Genes encoding some protein complexes are replicated almost simultaneously in HeLa cells, as exemplified in Figure 3A. To test the synchronized replication hypothesis at the genomic scale, we retrieved the components of 1521 protein complexes from the Human Protein Reference Database (Peri et al. 2003). For each protein complex, we calculated the standard deviation of replication timing of all genes encoding this protein complex (Fig. 3B). As a control, we shuffled among genes encoding protein complexes to constitute “pseudo” protein complexes, keeping the number of complexes and the number of subunits in each complex unchanged (Fig. 3B). We performed the shuffling 1000 times. The median of observed standard deviations was significantly smaller than the random expectation (P < 0.001, permutation test) (Fig. 3B), indicating synchronized replication within protein complexes.

    Figure 3.

    Genes encoding the same protein complex are replicated simultaneously in HeLa cells. (A) Two protein complexes exemplify the synchronized replication of genes encoding the same protein complex. The standard deviation (sd) of replication timing within a protein complex is shown on the right. (B) The observed median standard deviation of replication timing within a protein complex is significantly smaller than the random expectation in which the protein-complex–coding genes were shuffled. (C) The comparison of the similarity in replication timing between the S-phase–expressed and –unexpressed protein complexes. The expressed protein complexes are defined as the complexes with at least two-thirds of subunits expressed; the expressed subunits are defined as the subunits expressed in at least half of the S-phase cells (CPM > 0). Outliers are omitted in the box plot. P-value was given by the Mann–Whitney U test.

    The synchronized replication hypothesis makes two more predictions. First, genes encoding the same protein complex should exhibit better mRNA dosage balance through S phase than the random expectation; this was indeed observed (Supplemental Fig. S3A). Second, the S-phase–expressed protein complexes should exhibit better synchronized replication than the unexpressed ones. Consistently, the standard deviations of replication timing of the expressed protein complexes were significantly smaller than those of the unexpressed ones (P = 0.0036, Mann–Whitney U test) (Fig. 3C), indicating that synchronized replication within a protein complex is driven by the expression of this complex in S phase.

    There are several confounding factors that should be controlled for. First, genes encoding the same protein complex tend to form clusters on chromosomes (Albig and Doenecke 1997; Hurst et al. 2004), which are likely to simultaneously replicate because they have similar physical distances to the closest replication origin. We discarded protein complexes of which at least two subunits are encoded by the genes on the same chromosome; synchronized replication remained observed (P = 0.002, permutation test) (Supplemental Fig. S3B). Second, to test if the size of protein complexes affects the level of synchronized replication, we classified the protein complexes into three categories according to its subunit number (<4, =4, or >4) and tested if replication is synchronized by shuffling genes within each category. Synchronized replication remained observed in all three categories (Supplemental Fig. S3C). Third, we tested the synchronized replication hypothesis in 14 most validated protein complexes that were identified by multiple methods, including the transcription factor II D (TFIID) complex, exon junction complex, and cyclin-dependent kinase 8 (CDK8) subcomplex; synchronized replication remained observed (P = 0.007, permutation test) (Supplemental Fig. S3D).

    Synchronized replication mainly occurs in fast-proliferating cells

    The replication-timing program varies among cell types. To determine if synchronized replication occurs uniformly among various human cells, we retrieved the replication-timing programs in 19 cell lines/types (Supplemental Table S2) from the ReplicationDomain database (Weddington et al. 2008) or NCBI Gene Expression Omnibus (GEO) (Edgar et al. 2002). They include six human ESC (hESC) lines, five cancer cell lines, and eight differentiated cell types derived from hESCs, such as liver, pancreas, smooth muscle, and mesothelial cells (Fig. 4A); the proliferation of ESCs and cancer cells is fast whereas that of differentiated cells is slow. These cell lines/types exhibit various levels of synchronized replication within protein complexes (as exemplified in Fig. 4B and Supplemental Fig. S4A). We further used the P-value of the permutation test, as in Figure 3B, to infer the level of synchronized replication for each cell line/type (Supplemental Fig. S4B, with three examples shown in Fig. 4C) and labeled synchronized replication for those with P < 0.05 (Fig. 4D). Synchronized replication was mainly observed in fast-proliferating cells (P = 0.001, Fisher's exact test) (Fig. 4D).

    Figure 4.

    Synchronized replication of genes in the same protein complex occurs mainly in fast-proliferating cells. (A) Nineteen cell lines/types across ESCs, differentiated cells, and cancer cells are used to detect the synchronized replication. Differentiated cells include liver, pancreas, smooth muscle (SM), and mesothelial cells (mesothel) derived from hESCs. (B) Replication timing of three genes encoding a three-subunit protein complex in each cell line/type. HCT116 (shown in Supplemental Fig. S4A) is not shown here because it deviates from others in replication timing. (C) Three examples of the tests for synchronized replication. The observed median standard deviation (sd) of replication timing of genes encoding the same protein complex (magenta) is showed on the distribution of the random expectations (blue). (D) Synchronized replication occurs in 11 fast-proliferating cell lines and two slow-proliferating cell types. The dendrogram (left) shows the clustering of 19 cell lines/types based on the replication-timing profile of all protein-coding genes and was generated with Ward's method in the hclust function of R (R Core Team 2018). The heat map on the right shows the level of synchronized replication estimated from the permutation test. The asterisk indicates that replication is synchronized (P < 0.05) in the corresponding cell line/type. (E) The proliferation rate was estimated from the average expression level of 30 meta-PCNA genes.

    The absence of synchronized replication in six differentiated cells may be due to the fact that the protein-complex–coding genes are no longer expressed or that the dosage balance among them is no longer important. To test these possibilities, we performed two additional analyses. First, we retrieved the mRNA levels of protein-complex–coding genes (Rivera-Mulia et al. 2015) and found them highly correlated between ESCs and differentiated cells (Supplemental Fig. S5A). Second, we tested the mRNA dosage balance, as for HeLa cells in Supplemental Figure S3A, for all cell lines/types; the mRNA dosage balance remained in differentiated cells (Supplemental Fig. S5B). These results showed that neither explanation worked.

    We then speculated that the loss of synchronized replication in differentiated cells was caused by the reduced power of natural selection for synchronized replication in slow-proliferating cells that spend a greater fraction of time in G0 phase and a smaller fraction of time in S phase (S%) (Fig. 5A). To test it, we estimated the proliferation rate of these cell lines/types from proliferation marker genes. The expression of the gene encoding proliferating cell nuclear antigen (PCNA), which promotes DNA replication in actively proliferating cells, has been widely used for estimating cell proliferation rate (Kubben et al. 1994; Moldovan et al. 2007; Bologna-Molina et al. 2013). We calculated the proliferation rate for each cell line/type (Fig. 4E) as the average expression level of PCNA and 29 more genes whose expression is positively correlated with it (meta-PCNA) (Supplemental Table S3). As expected, the proliferation rate was positively correlated with the proportion of cells in S phase among the eight cell lines/types for which flow-cytometry data were available (r = 0.8, P = 0.03, Pearson's correlation) (Supplemental Fig. S6). The proliferation rate was positively correlated with the level of synchronized replication among 19 cell lines/types (r = 0.7, P = 4 × 10−4, Pearson's correlation) (Fig. 5B), and the positive correlation remained after the normalization of the replication-timing data (r = 0.8, P = 2.3 × 10−4) (Supplemental Fig. S7), suggesting an S%-dependent optimization of the replication-timing program in human cells.

    Figure 5.

    Synchronized replication of genes in the same protein complex is S%-dependent. (A) Dosage imbalance between early- and late-replicating genes during S phase in fast- (top) and slow-proliferating (bottom) cells. (B) The proliferation rate and the level of synchronized replication are positively correlated among 19 cell lines/types.

    Genes causing changes in synchronized replication reside in long interspersed nuclear element (LINE)-rich regions

    How does synchronized replication get lost in differentiated cells? To answer this question, we identified 165 protein complexes (as exemplified in Fig. 6A) in which the similarity of replication timing within a protein complex was significantly reduced during differentiation (P < 0.05 in the t-tests for the standard deviations of replication timing in a protein complex, between ESCs and differentiated cells). Among the 491 genes encoding these protein complexes, 92% exhibited a delay in replication in differentiated cells (top in Fig. 6B). In contrast, only 73% of noncomplex encoding genes exhibited delays in replication timing (P < 2.2 × 10−16, Fisher's exact test), suggesting that the differentiated cells lose synchronized replication preferentially through a replication delay.

    Figure 6.

    Genes causing changes in synchronized replication reside in the genomic regions with higher LINE density. (A) The average replication timing (RT) among cells in the same group: ESCs, six differentiated cells that do not exhibit synchronized replication (liver and pancreas cells), or cancer cells. A six-subunit protein complex as an example shows the change in replication timing among three groups. (B) Protein complexes losing synchronized replication during cell differentiation or restoring synchronized replication during tumorigenesis were identified. Genes encoding these protein complexes were classified into categories according to the change in replication timing during these two processes. The fraction of genes in each category is shown. Error bars represent the standard deviation estimated from bootstrapping. (C) The comparison of the LINE densities between genes encoding 79 protein complexes that restore synchronized replication during tumorigenesis (orange) and genes encoding other complexes (blue). Outliers are omitted in the box plot.

    Among the 165 protein complexes in which synchronized replication is lost in differentiated cells, 79 protein complexes restored synchronized replication in cancer cells (P < 0.05 in the t-tests for the standard deviations of replication timing in a protein complex, between differentiated and cancer cells). Presumably, the restoration could occur either through (1) reversing the change in replication timing during cell differentiation, or (2) through an inter-genic suppression such that the change in replication timing of a second gene in the same protein complex follows that of the first. We found that during tumorigenesis, 82% of genes encoding these 79 protein complexes reversed the changes during cell differentiation (bottom in Fig. 6B), significantly higher than the fraction (72%) among genes not encoding protein complexes (P = 7 × 10−4, Fisher's exact test).

    That the changes in replication timing during cell differentiation were often reversed during tumorigenesis suggested that genes affecting synchronized replication reside in the transition or variable regions of the genome. It has been reported that replication-timing transitions during cell differentiation often occur in LINE-rich regions (Hiratani et al. 2004, 2008). The 79 complexes with synchronized replication restored in cancer cells were also encoded by genes located in the genomic region with higher LINE density (P = 2.3 × 10–5, Mann–Whitney U test) (Fig. 6C), suggesting that the change in synchronized replication during cell differentiation or tumorigenesis is caused by the variable firing times of replication origins in the transition regions of the genome.

    Discussion

    Abnormal replication-timing programs have been known to be related to disease and cancer (Smith et al. 2001; Ryba et al. 2012). Our study provides a new mechanism that explains why a proper regulation of the replication-timing program is essential, especially for fast-proliferating cells: to maintain the dosage balance among stoichiometry-sensitive genes during S phase. Synchronized replication was observed in ESCs and cancer cells but not in most differentiated cells. Since cancer cells are “evolved” from differentiated cells rather than ESCs, regaining synchronized replication in cancer cells implies convergent evolution of the replication-timing program during tumorigenesis. These observations echo previous analyses on the evolution of tumor cells at different levels, such as those at the transcriptome or the amino-acid usage level (Chen et al. 2015; Chen and He 2016; Zhang et al. 2018).

    We showed that the demand for dosage balance during S phase could cause synchronized replication of genes encoding the same protein complex. Presumably, such synchronized replication could also have evolved under other selection pressures. For example, it may evolve to meet the demand for similar expression levels of genes encoding the same protein complex because replication timing is associated with gene expression level (Schübeler et al. 2002; Woodfine et al. 2004). Nevertheless, this mechanism cannot explain why synchronized replication is lost in differentiated cells, where the genes encoding protein complexes remain expressed (Supplemental Fig. S5A) and the dosage balance among subunits remains important (Supplemental Fig. S5B). The selection for dosage balance during S phase uniquely predicts the loss of synchronized replication in slow-proliferating cells and the regain of it in cancer cells.

    Whereas the dosage imbalance during S phase can be partly relieved by the H3K56ac-associated transcription repression of newly replicated genes in yeast, the balance is not completely restored; early-replicating genes still exhibit an ∼20% higher expression level during mid-S phase (Voichek et al. 2016). In mammalian cells, the challenge of the dosage imbalance during S phase is likely greater because S phase lasts for a much longer time. For example, a HeLa cell divides every 24 h and its S phase lasts for ∼8 h (Volpe and Eremendo-Volpe 1970); HeLa cells need to suffer from the imbalance between early- and late-replicating genes for a few hours every 24 h. In contrast, yeast has adapted to the life cycle of 24–48 h per generation in the fermentation industry (Gallone et al. 2016) or in nature, which can be mimicked by a synthetic oak exudate medium (Murphy et al. 2006; Qian et al. 2012). However, S phase lasts for <1 h even in a poor carbon source (Leitao and Kellogg 2017), likely because the total time for DNA replication is mainly determined by the elongation rate of the DNA polymerase. Therefore, the imbalance between early- and late-replicating genes lasts for only dozens of minutes every 24–48 h in yeast. Collectively, the natural selection for synchronized replication is likely stronger in mammalian cells.

    The components of protein complexes may be heterogeneous among cell types (Long et al. 2017), which in principle can drive changes in replication-timing programs during development. For example, protein A may form a complex with protein B in one cell type but form another complex with protein C in a second cell type. To maintain the dosage balance in multiple cell types, the replication times of genes A and B should be more similar in the former cell type, while those of genes A and C should be more similar in the latter. It would be of great interest to test this possibility when the genome-wide heterogeneity data of protein complexes become available.

    Genes encoding the same protein complex sometimes form clusters on chromosomes (Albig and Doenecke 1997), which is often explained by the demand for coordinated gene expression (Hurst et al. 2004), reduction in expression noise (Batada and Hurst 2007), the positive epistatic relationship among genes (Yang et al. 2017), and the cofluctuation of stochastic gene expression (Sun and Zhang 2019; Xu et al. 2019). We showed that the synchronized replication hypothesis remained supported after controlling for such gene clusters (Supplemental Fig. S3B); yet the gene cluster itself could, in turn, be an evolutionary outcome of the selection for the dosage balance during S phase. Replication origins fire stochastically at the single-cell level (Bechhoefer and Rhind 2012), and therefore, the synchronization of replication timing is not robust in individual cells when these genes are interspersed in the genome and use different replication origins. Forming a cluster on chromosomes is a more robust strategy for maintaining dosage balance among genes.

    Our results also have implications for evolution at the nucleotide level. For example, since replication timing is a major determinant of mutation rate (Stamatoyannopoulos et al. 2009; Woo and Li 2012), we predict that genes encoding the same protein complex will have similar mutation rates due to their synchronized replication. Consistently, similar evolutionary rates have been reported between a pair of genes that share a biological function or are coexpressed (Clark et al. 2012). Collectively, our study not only identifies the driving forces underlying the evolution of the replication-timing program but also provides new insights into the evolution of DNA sequences.

    Methods

    Cell culture and the preparation of the single-cell suspension

    HeLa cells were cultured in Dulbecco's Modified Eagle Medium containing 10% fetal calf serum and 2 mM L-glutamine at 37°C in a 5% CO2 incubator. HeLa cells were grown to 70% confluence and were digested by 0.25% trypsin-EDTA solution. The digestion was stopped by adding culture medium. Cells were gently mixed and were washed with 1 × PBS with 0.04% bovine serum albumin twice. Finally, cells were filtered with a cell strainer (BD, #352340) and were assessed for viability with Trypan blue. The final cell viability was >95%.

    Preparation of single-cell sequencing libraries

    We performed scRNA-seq to analyze transcriptomes of “natural” cells through the cell cycle, to avoid the stresses imposed on cells by conventional experimental strategies such as cell-cycle synchronization or fluorescence-activated cell sorting (FACS); agents that prevent DNA synthesis or inhibit the formation of mitotic spindles not only arrest the cell cycle at certain points but also kill important fractions of the cells due to their toxic effect (Cooper 2003).

    Single-cell suspensions were loaded onto 10x Genomics Single Cell 3′ Chips (Single Cell A Chip kit, #120236) and were captured along with the barcoded gel beads. Single-cell libraries were constructed using Chromium Single Cell 3′ Library & Gel Bead kit v2 (10x Genomics, #120237) and Chromium i7 Multiplex kit (10x Genomics, #120262), following the manufacturer's protocol (#CG00052). The libraries were sequenced with the Illumina HiSeq 2500 System.

    Initial processing of the sequencing data

    Cell Ranger (v2.0.0) was used to process initial sequencing data. The FASTQ files were aligned to the reference genome (GRCh38-1.2.0). UMIs were counted for each bead barcode, and the number of cells was estimated from the barcode count distribution. The transcriptome data of a total of 884 cells were obtained in this study, and the average sequencing depth of them was 378,473 reads per cell.

    Cell-cycle analyses for HeLa scRNA-seq data

    The list of genes that are specifically expressed in each of the five phases of the cell cycle (G1/S, S, G2/M, M, and M/G1) was retrieved from a previous study (Whitfield et al. 2002). The cell-cycle status of each cell was inferred following Macosko et al. (2015). Briefly, we defined a total of 151 genes as phase-specific genes (Supplemental Table S1), the expression levels (in the unit of counts per million, CPM) of which were used for PCA. The cell-cycle status of each cell was inferred from the five phase-specific scores calculated as follows. The expression level (log2[CPM + 1]) of each of the 151 phase-specific gene was standardized (z-score) among the 884 single cells. For each cell-cycle phase, the phase-specific score of a cell was calculated as the average z-score of all marker genes of this phase. Each cell was assigned to the cell-cycle phase with the greatest score.

    Calculating the expression ratio between early- and late-replicating genes

    Early- or late-replicating genes were defined as the 2000 genes with highest or lowest replication-timing values, respectively. Genes specifically expressed (Supplemental Table S1) and unexpressed in S phase were excluded. The average expression level of each gene among all cells in an S-phase subgroup was normalized by dividing its average expression level among all cells in G1/S phase. For each gene pair, in which one gene from the 2000 early-replicating genes and the other from the 2000 late-replicating ones, the expression ratio of the early-replicating gene versus the late-replicating one was calculated in each S-phase subgroup.

    Data sources

    The information of 1521 protein complexes in humans was downloaded from the Human Protein Reference Database release 9 (www.hprd.org) (Peri et al. 2003). Among them, 1317 were annotated with complete information and were used in this study. The information of protein complexes was also retrieved from the Database of Interacting Proteins (DIP, dip.doe-mbi.ucla.edu), which documents experimentally determined protein complexes with the methods of purification (Xenarios et al. 2000).

    The replication-timing profiles used in this study were downloaded from the ReplicationDomain database (www.replicationdomain.org) (Weddington et al. 2008) or NCBI GEO (www.ncbi.nlm.nih.gov/geo/) (Edgar et al. 2002). The accession numbers are listed in Supplemental Table S2 (Thurman et al. 2007; Weddington et al. 2008; Hansen et al. 2010; Oda et al. 2012; Rivera-Mulia et al. 2015). Specifically, all available replication-timing profiles of ESCs and cancer cell lines were retrieved, together with eight hESC-derived differentiated cell types generated in Rivera-Mulia et al. (2015).

    Gene expression data used in this study were downloaded from NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/), and the accession numbers are listed in Supplemental Table S2.

    Estimation of the replication timing for genes

    The chromosomal locations of 19,805 protein-coding genes were retrieved from the human GRCh38.p12 annotation file, which was downloaded from Ensembl release 87 (www.ensembl.org) (Zerbino et al. 2018). The average intensity ratio of probes (when measuring replication timing) that overlap with a gene was defined as its replication timing.

    Estimation of the proliferation rate with gene expression profiles

    PCNA is a component of DNA polymerase delta, and its expression level is a reporter of DNA synthesis. We defined a panel of 30 meta-PCNA genes whose expression is positively correlated with PCNA. Specifically, we normalized the average expression level of genes (log2[RPKM + 1]) in 19 cell lines/types and calculated the Pearson's correlation between PCNA and each of 131 previously identified candidate meta-PCNA genes (Venet et al. 2011). Genes with a correlation coefficient >0.8 were defined as the meta-PCNA genes of this study. The proliferation rate was inferred from the average expression level of the 30 meta-PCNA genes, which are listed in Supplemental Table S3.

    Estimation of the fraction of cells in S phase with flow cytometry

    Flow-cytometry data of propidium iodide-stained cells in eight cell lines/types (listed in Supplemental Table S2) were generated in a previous study (Rivera-Mulia et al. 2015). We estimated the fraction of cells that belong to each of the three stages (G1, S, and G2/M) using the tool for cell-cycle analysis in FlowJo.

    Calculation of LINE density

    The human LINE annotation file (rmsk.txt) was downloaded from the UCSC Genome Browser (www.genome.ucsc.edu, hg38) (Kent et al. 2002). LINE density of a gene was defined following a previous study (Hiratani et al. 2004) as the percentage of repetitive sequences in the 400-kb region surrounding the transcription start site of this gene.

    Data access

    The raw scRNA-seq data reported in this study have been submitted to the Genome Sequence Archive (Wang et al. 2017) in the Beijing Institute of Genomics (BIG) Data Center (https://bigd.big.ac.cn/gsa) under accession number CRA001055. All codes to analyze the data and generate figures are available at www.github.com/YingChen10/Synchronized-replication-during-S-phase and as Supplemental Code.

    Acknowledgments

    We thank Takayo Sasaki and David M. Gilbert for providing the flow cytometry data generated in Rivera-Mulia et al. (2015). We thank Zeyu Zhang for helpful discussions, and Xionglei He and Mengyi Sun for critical comments on the manuscript. This work was supported by grants from the National Natural Science Foundation of China to W.Q. (91731302 and 31922014).

    Author contributions: Y.C., K.L., and W.Q. conceived the idea; K.L. performed the experiments; Y.C., K.L., and X.C. analyzed the data; Y.C., L.B.C., and W.Q. wrote the manuscript.

    Footnotes

    • Received July 4, 2019.
    • Accepted October 28, 2019.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    References

    | Table of Contents

    Preprint Server