Theories and Applications for Sequencing Randomly Selected Clones

  1. Michael C. Wendl1,3,
  2. Marco A. Marra2,
  3. LaDeana W. Hillier1,
  4. Asif T. Chinwalla1,
  5. Richard K. Wilson1, and
  6. Robert H. Waterston1
  1. 1Genome Sequencing Center, Washington University, St. Louis, Missouri 63108, USA; 2British Columbia Cancer Research Centre, Vancouver, British Columbia V5Z 1L3, Canada

Abstract

Theory is developed for the process of sequencing randomly selected large-insert clones. Genome size, library depth, clone size, and clone distribution are considered relevant properties and perfect overlap detection for contig assembly is assumed. Genome-specific and nonrandom effects are neglected. Order of magnitude analysis indicates library depth is of secondary importance compared to the other variables, especially as clone size diminishes. In such cases, the well-known Poisson coverage law is a good approximation. Parameters derived from these models are used to examine performance for the specific case of sequencing random human BAC clones. We compare coverage and redundancy rates for libraries possessing uniform and nonuniform clone distributions. Results are measured against data from map-based human-chromosome-2 sequencing. We conclude that the map-based approach outperforms random clone sequencing, except early in a project. However, simultaneous use of both strategies can be beneficial if a performance-based estimate for halting random clone sequencing is made. Results further show that the random approach yields maximum effectiveness using nonbiased rather than biased libraries.

Genomic mapping and sequencing have benefited greatly from the development of stable large-insert cloning platforms, such as the bacterial artificial chromosome (BAC) clone (Shizuya et al. 1992). Sequencing random clones has recently been discussed, both as a primary strategy and/or in the role of augmenting map-based sequencing. By random clone sequencing, we mean that large-insert clones are randomly selected from a library without a priori knowledge of their positions in the genome. They are then sequenced by standard shotgun techniques and assembled into contigs (i.e., sets of contiguous clones) using sequence comparison methods. Because positional information is not known in advance, there is no guarantee that a clone will extend a contig or will not be fully redundant with respect to clones already sequenced. This random-clone approach contrasts with map-based sequencing, by which clone selection is guided by a preexisting physical map. Here, redundancy and contig information are known at the outset, enabling some optimization of the selection procedure. For example, if constant minimal redundancy can be maintained, map-based sequencing is essentially a linear coverage process.

As with map-based sequencing, issues for random clone sequencing revolve largely around performance, as measured by such parameters as rates of progress and redundancy accumulation. These parameters have not been empirically quantified for random sequencing because there remains a lack of substantial data for large genomes. Conversely, map-based sequencing is better understood from an empirical standpoint. For example, a map-based approach was employed for theCaenorhabditis elegans sequencing project ( C. elegansSequencing Consortium 1998) and has figured prominently in the Human Genome Project (Sanger Centre and Washington University Genome Sequencing Center 1998). These projects provide a good benchmark of the map-based approach for large genomes.

As for mathematical modeling, the random-sequencing scenario described above has not been specifically addressed. Theoretical developments have concentrated mainly on mapping techniques, for example, the seminal work of Lander and Waterman (1988) for the fingerprint method and later models for other procedures (Arratia et al. 1991; Barillot et al. 1991; Zhang and Marr 1993; Port et al. 1995; Schbath 1997). Owing to considerations of similarity, it has been postulated that mapping models could be directly applied to random clone sequencing (Lander and Waterman 1988; Roach 1995). However, a subtle issue related to clone overlaps arises. Mapping models are necessarily based on the ability of the method to detect overlaps for contig assembly. In particular, a prescribed clone-length fraction necessary for detectionθ 0 > 0 accounts for the fact that some percentage of overlaps will go undetected (the full description of nomenclature is given in Table 1). A measure of overlap itself, for example, as characterized by library depth ϕ is not considered. However, for random clone sequencing, contig assembly relies on ex post facto sequence comparison, which can be assumed to detect all overlaps (θ 0 ➞ 0). Application of mapping theory, for instance, the Lander and Waterman model withθ 0 = 0, would not explicitly account for library depth. Modeling liability of this idealization, if any, has not been established.

Table 1.

Nomenclature

In this report, we formulate a theory for the random-clone-sequencing procedure described above. Our primary purpose is to provide a mechanism to estimate performance of an actual sequencing project based on this paradigm. Moreover, we also evaluate the importance of library depth as a modeling factor and, thus, the applicability of mapping models to this problem. We focus on two types of libraries: one generated by random means, for example, mechanical shearing, and one created by partial digest using a restriction enzyme. The former is expected to have essentially a uniform clone distribution because there is no preference for cut sites, whereas the latter depends upon the inherent nonuniformity of restriction sites throughout a genome and could have appreciable bias in the distribution of clones. We apply the theory to the particular case of human BAC sequencing using a randomly generated library and a library created by partial digest with HindIII. We refer to these as the randomly generated and HindIII libraries, respectively. Map-based results from human-chromosome-2 BAC data are used for comparison. No account is made of genome-specific or nonrandom phenomena; therefore, theory and results are considered to be first-order approximations.

RESULTS

Assessment of the Importance of Library Depth

If we set Θ 0 = 0 in the Lander and Waterman (1988) model to simulate perfect overlap detection, the main difference between it and our model in equation 3 is consideration of library depth. For perfect detection, Lander-Waterman coverage reduces to the well-known Poisson coverage expressionL(i) = G − Ge L0 i/G. Equation yields the same result if we allow clone size to vanish (L 0 ➞ 0). Thus, the importance of considering library depth diminishes with clone size, at least in the approximate context of these models.

The physical interpretation is that L 0 ➞ 0 tends toward an idealized point model, for which partial overlap of one clone with another is not possible. For the point model, clones only overlap completely or not at all. As would be expected, one identically recovers the Poisson coverage law if our model is derived without accounting for partial overlap. Because clone size is usually small relative to genome size, it appears from equation 3 that library depth is actually a secondary consideration compared to the other variables. A corollary is that random clone sequencing can often be reasonably approximated by the Poisson coverage expression.

Assessment of the Importance of Clone Size Variation

Actual clone sizes in any library can be expected to vary somewhat; however, uniform clone size has been a standard theoretical assumption (e.g., Lander and Waterman 1988; Barillot et al. 1991; Zhang and Marr 1993; Port et al. 1995; Roach 1995). We use data from chromosome-2 BAC clones to evaluate the effect upon the present model. Table2 indicates an average clone length of 181.8 kb, with a 19-kb standard deviation. A rudimentary test is to determine difference in coverage over one standard deviation in clone size, that is, for 181.8 ± 9.5 kb (the test is more conservative than it may initially seem because the size distribution is entirely concentrated at a single point for each case). Equation yields a maximum difference below 10% when evaluated over the whole sequencing project. As with library depth, variation of clone size does not appear to be a primary factor in modeling random clone sequencing.

Table 2.

Overlapping BAC Clones from Chromosome 2 RPCI-11 Library

DISCUSSION

Performance Early in a Project

Results from 7000 randomly sequenced human BAC clones are used to evaluate performance early in a project. Modeling constants are genome length G = 3 Gb, nominal clone sizeL 0 = 170 kb (derived from a genomewide sampling of Genome Sequencing Center BAC clones), and library depthϕ = 10×. Other parameters are as discussed in Methods; in particular, the average value of redundancy for map-based sequencing is θ exp = 14.4%. Figures1 and 2 show redundancy and coverage results, respectively, for each approach. They confirm that randomly sequenced BACs, regardless of library type, initially yield comparable rates of coverage and lower redundancy compared to map-based sequencing. After some number of clones, the advantage shifts to the map-based approach. From a redundancy standpoint, this crossover occurs at approximately 5600 clones for the randomly generated library and between 2900 and 4800 clones for theHindIII library. In terms of coverage, the numbers are comparable: 5600 nonbiased clones and between 3500 and 5400HindIII clones. The upper limit for performance crossover is about 5600 BAC clones, using a randomly generated library. At this point, a total of ∼one-fourth of the genome has been covered. This value is also a reasonable estimate for the mouse genome because its parameters are nominally the same as for the human genome. Figure 2also indicates that performance is essentially independent of library type and bias level for the first several thousand clones. In fact, rate of progress for the first 10%–15% of the genome is essentially fixed. This observation suggests that conventional restriction digest libraries could effectively be used for random clone sequencing while mapping work is in progress.

Figure 1.

Expected redundancy for the first 7000 BAC clones using both map-based and randomly selected BAC sequencing approaches. The randomly generated library appears as a single curve and the HindIII library appears as a shaded region resulting from low- and high-restriction site-bias estimates. The top bound denotes high bias; the bottom corresponds to low bias. Results for map-based sequencing are given by the horizontal line.

Figure 2.

Effective expected coverage for map-based and randomly selected BAC sequencing for the first 7000 BAC clones. The shaded area denotes theHindIII library; high bias is represented by the lower of the bounding curves.

Simultaneous Random, Map-Based Sequencing

In a general scenario, random and map-based sequencing could be conducted simultaneously. At the time of this writing, ∼50% of map-based human sequencing was complete, well past the 5600 clone crossover juncture. However, Figure 3(inset) suggests that sequencing from the random library would still be justified because its coverage rate, as deduced from the slope, remains comparable to the map-based approach. Conversely, suitability of theHindIII library depends strongly on bias. For low bias, the coverage rate is slightly less than that for a random library; however, the rate is much decreased at higher bias.

Figure 3.

Predicted coverage up to 176,000 BAC clones (∼10 genome equivalents). The inset shows a magnified view in the 30%–45% range of coverage. HindIII coverage is denoted by the shaded area; the high-bias simulation is the lower of the bounding curves.

To determine when random clone sequencing should be halted, the following procedure can be applied. First, specify minimum acceptable performance in terms of model parameters; then evaluate the resulting number of clones obtained for each parameter. Choose the lowest value as the stopping point. If finished sequence is already available, as in this case, subtract the equivalent number of already-sequenced clones to obtain the number of clones that can still be sequenced within the original constraints. Consider the following example: Random sequencing is performed while its rate of coverage is at least half that of map-based sequencing and its redundancy is <50%. We work the problem for a random library because it is already clear that this will yield a higher number of clones. Using θexp = 0.144 from Table 2, rate of progress for map-based sequencing is computed asR = 0.856L 0 per BAC clone. Taking half this value indicates that random clone sequencing should be continued until R falls to 0.428L 0 per BAC. Solving equation 4 for i and substitutingR = 0.428L 0 yields ∼15,000 clones as one possible stopping point. The other possibility is given by the number of clones for 50% redundancy, which can be read directly from Figure 4 as ∼28,000. Our prescribed coverage rate governs the problem; so we choose 15,000 as the maximum number of randomly sequenceable BAC clones. At 50% coverage, the equivalent number of already-sequenced BAC clones is about 12,000, which means 15,000–12,000 or 3000 BACs could still be sequenced within our original performance constraints.

Figure 4.

Predicted redundancy up to 176,000 bacterial artificial chromosome (BAC) clones (∼10 genome equivalents). The inset shows a magnified view in the lower redundancy region. The map-based redundancy of 14.4% coincides with the abscissa in both plots.

Extrapolation to Higher Coverage

Figures 3 and 4 extend the simulation to 10 genome equivalents (176,000 BACs) to examine results at higher coverage levels. There has been considerable debate on the efficacy of a clone map for sequencing (Green 1997; Weber and Myers 1997); however, Figures 3 and 4 clearly show its superior performance at higher coverage levels. Specifically, map-based clone sequencing behaves linearly, whereas random clone sequencing is an asymptotic process. The random library displays better coverage and lower redundancy than the best HindIII library and converges to full coverage. All bias estimates predict theHindIII library falling short of 100% coverage due to unrepresented regions of the genome.

Summary

Performance of map-based sequencing is superior to random clone sequencing, except early in a project. Moreover, a randomly generated library can be expected to outperform a restriction digest library, perhaps by a significant margin. These conclusions are likely true for many combinations of organism, clone type, enzyme, etc. Results suggest that sequencing via map-based and random strategies simultaneously is reasonable, especially if a suitable performance-based estimate for halting random clone sequencing is made.

Other cases of interest can be treated by applying the same procedures shown here for human BAC clones. Of special interest is the whole-genome shotgun method, which is now being used for large genomes (Adams et al. 2000). The procedure relies on random fragmentation so that a uniform subclone distribution can reasonably be assumed. Here,L 0 is the subclone length. Because this value is considerably small compared to genome size, a point model can readily be assumed, permitting usage of the Poisson coverage approximation discussed above.

METHODS

Models for map-based and random clone sequencing are derived here.Solomon (1978) and Hall (1988) are good introductions to the general topic of coverage theory, which encompasses these processes.

Empirical Model of Map-Based Sequencing

A clone library supported by a physical map can be characterized by the average overlap of the sequenced clonesθexp . We estimate this parameter empirically using 43 human-chromosome-2 BAC clones sequenced at the Genome Sequencing Center. Table 2 shows derived overlap and size data for these clones, which originate from the RPCI-11 library (Osoegawa et al. 1998). In the present context, our documentation system for clone lengths and finishing boundaries has the effect of tallying overlaps twice. Therefore, the average overlap of 28.9% calculated from Table 2must be divided in half, yieldingθexp  = 14.4%. This group of clones is taken as representative of the overall map-based sequencing process for human BAC clones. In other words, overlap is assumed to be a constant given byθ(i) = θ exp. Therefore, the coverage added by each sequenced clone, that is, the rate of progress isR(i) = (1 − θ exp)L 0and the total coverage L(i) is simplyiR(i).

Random-Sequencing Model for a Randomly Generated Library

Random fragmentation implies that all base positions are equally likely to be breakage sites, which results in a uniform clone distribution. We assume a constant clone size of L 0bases, a genome length of G bases, and a library depth ofϕ. The standard equation of expected value is employed for estimating coverageL = Σp ɛ L ɛ, where L ɛ is the coverage contributed by a particular event ɛ and p ɛ is its probability of occurrence. Three events are considered: the new event, in which the clone does not overlap any established sequence (all added sequence is new); the partial event, in which part of the clone overlaps established sequence; and the buried event, in which the entire clone is buried in previously generated sequence. The coverage after randomly sequencing a clone is then Σ(pL)new + Σ(pL)partial + Σ(pL)buried. Then define a segment length b, the average number of bases from one breakage site to the next. Because clone ends are synonymous with breakage sites, clones must overlap in units that are multiples ofb. The number of clones in a libraryN = ϕG/L 0, equals the number of right (or left) clone ends, and therefore, the number of uniformly distributed breakage sites. By definition, b can be computed by dividing the genome length by the number of breakage sites, yieldingb = G/N = L 0/ϕ. A clone end can only occupy discrete positions given by multiples ofb; so the total number of possible positions isK = N − ϕ + 1 and the probability of any given position isp = K −1. This expression accounts for so-called end effects (Port et al. 1995; Roach 1995) and prevents clones from running off the end of the genome. No such constraint exists for circular genomes (Parke 1997).

The length of unique sequence after i clones have been sequenced is L i, whereL i = lib. For the new event, the length after sequencing another clone isL new = L i + L 0and the number of ways to obtain this event isK − (l i + ϕ − 1). For the buried event, the length remains the same as it was before the clone was sequenced, that is,L buried = L i, and the number of ways this could occur isl i − ϕ + 1. These combinations yieldp new = (K − l i − ϕ + 1)/Kandp buried = (l i − ϕ + 1)/K. Length for the partial event is simply L newminus the amount of overlap, that is,L partial = L i + L 0 − h b, where h is the number of b-length segments of overlap. Because the new clone is, at most, equal in length to the established sequence, this can happen a total ofh = ϕ − 1 ways on either side of the established sequence. Therefore, the partial event can be written as ∑(pL)partial = 2∑ϕ−1 h=1(Li + L 0h ·b)/K. Using a summation identity, this expression reduces to ∑(pL)partial = 2(ϕ − 1)(Li + L 0)/K − (ϕ − 1)L 0/K. Taking all of these events and simplifying, the final recursion for coverage isL i+1 = Li + [1 −Li /(GL 0 +L 0/ϕ]L 0. Rearranging, we obtainFormula The left hand side is a finite difference approximation of the rate of change of coverage with respect to the number of clones sequenced. Eqn. (1) is therefore a discrete analog of the differential equationFormula which can be solved usinge L0 i/[G−L0 (1−1/ϕ)]as an integrating factor (Martin and Reissner 1956). Initial conditions require that the first clone sequenced yields coverage equal to its own length, L(1) = L 0, which leads to the final coverage expressionFormula where c 0 = GL 0(1 − 1/ϕ) and c 1 =c 0L 0. Various performance parameters can be derived from this equation. For example, in the ideal case of end-to-end clone placement, coverage is simplyL 0 i, which allows redundancy to be defined in a normalized sense as θ(i) = 1 −L(i)/(L 0 i). Rate of progress, R(i), can be calculated by differentiation asFormula

Random-Sequencing Model for a Restriction Digest Library

Modeling the sequencing process for a restriction digest library must account for nonuniform clone distribution, which arises from the natural restriction-site bias of a genome. Rather than attempt to derive an exact solution for this case, it is more expedient to employ Monte-Carlo simulation (Press et al. 1991). A set of restriction sites is first established according to an appropriate nonuniform-probability-density function. Sequence coverage is then simulated by randomly selecting a site as a left clone end. Based upon local site distribution and nominal clone length, the right end is then determined, after which cumulative coverage is recomputed. Iteration is continued until a user-specified stopping point is encountered.

Poisson probability density functions provide a suitable model for restriction sites in the sense that a large segment size can be specified, for example, 4s where s is enzyme specificity, while the probability of a site occurring in any neighborhood of a segment is small. Because site bias cannot be known a priori (Green 1997), we follow the methodology of Port et al. (1995)and Schbath (1997) in using simple functions to model bias. We select (m + 1)x m, where m is the user-specified bias level and x is the distance along the linearly arranged genome. This function conserves the total number of sites in a genome toμavgG/4s, where μ avg is an empirically sampled value of the number of restriction sites in a segment. A general variable-rate Poisson process for restriction site distribution is then given byFormula where n is the number of sites expected per segment. Global distribution of sites is computed along 0 < x ≤ G using equation 5; however, the local distribution in each segment is taken to be uniform. No coverage is allowed between base position 1 and the first restriction sitex 1. Thus, bias yields a segmentx 1 − 1, which cannot be covered by sequence due to lack of representation in the library.

As with the uniform distribution model, other performance parameters can be obtained using coverage results. Redundancy is computed exactly as defined for the uniform model. For rate of progress, no closed-form expression is available. However, it can be calculated by finite difference approximation (Tannehill et al. 1997). Due to the nonsmooth nature of Monte-Carlo simulation, it may be necessary to average out local fluctuations by computing each difference over a large set of clones. We note that this introduces an additional component of numerical error (Tannehill et al. 1997).

As applied specifically to a HindIII human BAC library, the parameters are s = 6 (because this enzyme is a 6-cutter recognizing AAGCTT) and a segment size of 4096 bases. To estimateμ avg, 868 finished sequences from the Genome Sequencing Center encompassing ∼105 megabases were analyzed for AAGCTT. We found μ avg ≈ 1.25, implying ∼916,000 total sites for a 3-Gb genome size. A coefficient of dispersion of 1.046 indicates that Poisson modeling is acceptable (Sokal and Rohlf 1981). We assume lower- and upper-bound functions for bias as 3√ x /2 and 3x 2, respectively.

A code using this model was tested with m = 0 and results compared well to equation 3. This method is computationally intensive because each restriction site and its coverage status must be stored explicitly and these arrays are traversed for each succeeding clone. Placing a clone in the genome and computing coverage require approximately L 0/4s andG/4s operations, respectively. Because clone ends must be restriction sites, the simulated length of a clone will vary in a range of ∼4s around the nominal value of L 0. Exceptions are that near the end of the genome, a clone may be much smaller because it cannot run off the end, and in restriction-site-poor areas, a clone can be significantly longer because it must extend to the next restriction site.

Acknowledgments

This work was supported by a grant from the National Human Genome Research Institute (HG02042). We thank J. Wallis of the Genome Sequencing Center for useful discussions and anonymous reviewers for insightful critiques.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

  • 3 Corresponding author.

  • E-MAIL mwendl{at}watson.wustl.edu; FAX (314) 286-1810.

  • Article published online before print: Genome Res.,10.1101.gr.133901.

  • Article and publication are at www.genome.org/cgi/doi/10.1101/gr.133901.

    • Received February 2, 2000.
    • Accepted November 21, 2000.

REFERENCES

| Table of Contents

Preprint Server