Generalized Gap Model for Bacterial Artificial Chromosome Clone Fingerprint Mapping and Shotgun Sequencing

Michael C. Wendl; Robert H. Waterston

doi:10.1101/gr.655102

Abstract

We develop an extension to the Lander-Waterman theory for characterizing gaps in bacterial artificial chromosome fingerprint mapping and shotgun sequencing projects. It supports a larger set of descriptive statistics and is applicable to a wider range of project parameters. We show that previous assertions regarding inconsistency of the Lander-Waterman theory at higher coverages are incorrect and that another well-known but ostensibly different model is in fact the same. The apparent paradox of infinite island lengths is resolved. Several applications are shown, including evolution of the probability density function, calculation of closure probabilities, and development of a probabilistic method for computing stopping points in bacterial artificial chromosome shotgun sequencing.

Complete DNA sequences are critical resources for biomedical research. Motivated both by the need for such information and by enabling advances in technology, sequencing efforts continue to expand dramatically. Several “model” organisms have already been completed (e.g., Johnston et al. 1997; The Caenorhabditis elegans Sequencing Consortium 1998; Adams et al. 2000; The Arabidopsis Genome Initiative 2000), and draft versions of the human genome have recently been announced (International Human Genome Sequencing Consortium [IHGSC] 2001; Venter et al. 2001). Numerous additional projects are either planned or underway.

There are a number of views regarding optimal strategies toward sequencing. Experience derived from recent human projects (IHGSC 2001; McPherson et al. 2001) confirms that a fingerprint approach based on bacterial artificial chromosome (BAC) clones (Shizuya et al. 1992) is effective for large genomes. Conversely, small genomes can usually be sequenced directly using the random shotgun method (e.g., Heidelberg et al. 2000). The seminal work of Lander and Waterman (1988) provided the first step toward a fundamental theoretical basis for these two important procedures. In particular, the Lander and Waterman (L-W) theory permits calculation of the expected number of gaps as a function of the number of clones or subclones processed and the resolution for detecting overlaps (Fig. 1). Because project completion basically depends on the number of outstanding gaps (Roach et al. 1999), this statistic is useful both in planning and troubleshooting and remains one of scientists' standard analytical tools (Myers 1999).

Figure 1.

Schematic representation of fingerprint mapping and shotgun sequencing. Crossbars represent average amount of overlap required for detection. Some predicted gaps will be genuine as in (a) for which no clone spans the region, whereas others will be falsely predicted as in (b) because of insufficient detection resolution.


Mathematical descriptions of mapping and sequencing are rooted in classical theories of probabilistic coverage processes (Kendall and Moran 1963; Solomon 1978). These early results are idealized in the sense that they do not consider biologically relevant parameters, such as detection resolution for clone overlaps. The L-W theory was the first practical advance in this regard. The L-W model posits a simple geometric coverage process from which expected values are deduced. Conversely, Roach (1995) proposes a process governed by a binomial distribution and argues that the geometric model is valid only for limited coverage. Wendl et al. (2001) cast some doubt on this conclusion by showing that L-W results can be obtained independently of a geometric assumption, but they did not further resolve the discrepancy. Other idealized results have been developed, for example, the probability of closure in which the alphabet of nucleotide bases is infinite (Derrida and Fink 2002). The text by Hall (1988) discusses some related problems.

Here, we formulate a rigorous extension to L-W theory. This work was motivated by three concerns. First, L-W theory is based on the assumption of vanishing clone size. This simplification is actually embedded in all the standard models discussed previously, in which it is invoked in equivalent forms of infinite genome size or a continuum representation of the problem rather than a discrete one. The degree to which projects such as BAC fingerprinting small genomes (e.g., Tomkins et al. 2001) violate the vanishing clone length assumption is unclear. Second, there are apparent theoretical discrepancies with other models, especially the well-known paradox of infinite island lengths (Roach 1995). Finally, L-W theory does not support descriptive statistics beyond the expected value. The current generalization fully resolves each of these issues. We show several example applications that give a more accurate and comprehensive gap characterization of mapping and sequencing than has previously been available.

RESULTS

A combinatorially exact distribution describing gaps appears in equation 4. Variables L and G denote clone and project lengths, respectively, T specifies the average length of overlap required for detection, and N represents the number of clones processed. Statistics are characterized by the moment-generating function in equation 5, from which are derived expected number and variance of gaps in equations 6 and 7. Higher moments can be derived in a straightforward fashion from equation 5. Corresponding approximate results appear in equations 9 through 12. We quantify errors arising in the latter set of equations and show that they are equivalent to models by Lander and Waterman (1988) and Roach (1995).

Error Quantification for Approximate Models

The approximate model is obtained by invoking two simplifications with respect to equation 3. First, asymptotic approximation is used, that is, (1 − α)^N → e^{−αN}, where α = (L − T)/G is small (Seed 1982; Torney 1991; Marr et al. 1992). Second, gap limits are not established as in equation 3. Finite probabilities are therefore permitted for numbers of gaps in excess of the physical maximum, int(G/L). In general, the resulting probability density given by equation 9 is artificially disperse compared with the combinatorially exact result in equation 4 (Fig. 2). Consequently, approximation is only valid when clone length is “small enough” compared with project size.

Figure 2.

Representative probability density functions for a hypothetical mapping project (L/G = 0.001, T/L = 0) at 1× coverage.


Current mapping and sequencing projects encompass L/G ratios that vary over five orders of magnitude, with the maximum being of order 10⁻² for certain fingerprint projects (Table 1). Exact theory is difficult to compute for low L/G, whereas approximate theory is not valid for high L/G. Delineating values for which each is appropriate is therefore useful. Figure 3 shows error evaluation for the expected number of gaps in a set of projects having 0.00085 ≤ L/G ≤ 0.03 (Zhu et al. 1999; Chang et al. 2001). Predictably, the worst case is that in which relative clone size is largest. Yet, even at this extreme, the maximum error is only on the order of 2%. Asymptotic theory is therefore a remarkably robust predictor of expected gaps. Figure 4 shows the corresponding error evaluation for standard deviation of the gap distribution. Here, error is more sensitive, being about five times as large as that of the expected value. Adopting a 2% error limit therefore suggests applying the exact model for BAC shotgun sequencing and small genome fingerprinting (Table 1).
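The comparison behind Figure 3 can be sketched numerically. The Python fragment below is an illustrative sketch, not the authors' program. It assumes that the exact expected number of gaps equals N(1 − α)^{N − 1}, which follows from equation 3 because each of the N clones is followed by a gap with probability f(1), and compares it with the asymptotic value N e^{−α(N − 1)} of equation 11. The function names, the 10× scanning limit, and the choice T = 0 (so that α = L/G) are assumptions made here for illustration.

```python
import math

def exact_expected_gaps(alpha, n):
    """Exact mean number of gaps, N * f(1), with f(1) = (1 - alpha)_+^(N - 1)."""
    return n * max(0.0, 1.0 - alpha) ** (n - 1)

def asymptotic_expected_gaps(alpha, n):
    """Equation 11: N * exp(-alpha * (N - 1))."""
    return n * math.exp(-alpha * (n - 1))

def max_scaled_error(alpha, max_coverage=10.0):
    """Largest overprediction of equation 11, scaled by the peak exact mean.

    Scans N from 1 up to the clone count that gives `max_coverage` = N * alpha.
    """
    n_max = int(max_coverage / alpha)
    counts = range(1, n_max + 1)
    peak = max(exact_expected_gaps(alpha, n) for n in counts)
    worst = max(asymptotic_expected_gaps(alpha, n) - exact_expected_gaps(alpha, n)
                for n in counts)
    return worst / peak
```

Calling `max_scaled_error(0.03)` evaluates the largest relative clone size in Table 1 (Zhu et al. 1999), and `max_scaled_error(0.00085)` the smallest value in the range examined in Figure 3; the scaled error grows with L/G, in line with the trend described above.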

Table 1.

Representative Fingerprint Mapping and Shotgun Sequencing Projects

Project description	Approximate L/G	Reference
Whole genome shotgun sequencing of complex organisms	1.8 × 10⁻⁷	Venter et al. (2001)
	4.6 × 10⁻⁶	Adams et al. (2000)
BAC clone fingerprinting of large genomes	6.0 × 10⁻⁵	McPherson et al. (2001)
Bacterial whole genome shotgun sequencing	1.4 × 10⁻⁴	Heidelberg et al. (2000)
	2.6 × 10⁻⁴	Fleischmann et al. (1995)
BAC clone fingerprinting of intermediate-size genomes	7.7 × 10⁻⁴	Mozo et al. (1999)
	8.5 × 10⁻⁴	Chang et al. (2001)
BAC shotgun sequencing	3.0 × 10⁻³	IHGSC (2001)
BAC clone fingerprinting of small genomes	3.3 × 10⁻³	Martin et al. (2002)
	1.1 × 10⁻²	Dewar et al. (1998)
	1.7 × 10⁻²	Tomkins et al. (2001)
	2.1 × 10⁻²	Diaz-Perez et al. (1997)
	3.0 × 10⁻²	Zhu et al. (1999)

[i] BAC, bacterial artificial chromosome.


Figure 3.

Parametric characterization of how asymptotic theory overpredicts expected value of gaps. Ordinate is scaled by the maximum exact expected value for each project.


Figure 4.

Parametric characterization of how asymptotic theory overpredicts standard deviation of gaps. Ordinate is scaled as in Figure 3.


Unification of Previous Models

Equations 3 through 12 resolve a long-standing controversy between two established theories. The Lander and Waterman (1988) model can be considered the standard: It is widely applied and characterizes the expected number of islands and their expected lengths via the simple expressions N e^{−αN} and G(e^{αN} − 1)/N. Roach (1995) developed an alternative model, which is thought to be fundamentally different from the L-W model. Roach asserts that L-W results are inconsistent at higher coverages. In particular, expected island length is unbounded and exceeds that of the project itself for coverage depths above approximately 6× to 8×. This trend appears in the original Lander and Waterman article, although it is not discussed per se. It is then argued by Roach that the fundamental basis of the L-W theory is not valid in this range. Kupfer et al. (1995) have raised similar concerns. Consequently, many investigators resort exclusively to the Roach model when coverages of interest exceed 5× (Smith et al. 1997; Yamada et al. 2000).

By applying a slightly modified interpretation to one of the L-W results, we show not only that this assertion is incorrect but also that the Lander and Waterman (1988) and Roach (1995) models are basically identical and both consistent. The paradox of unbounded island length is really a matter of correctly characterizing limiting behavior and can be resolved as follows. Although investigators usually regard gap number and island number as equal, the latter must converge to one greater than the former in the limit of closure, that is

\lim_{N_{gaps} \to 0} N_{islands} = \lim_{N_{gaps} \to 0} (N_{gaps} + 1) = 1 . \tag{1}

Suppose that we increment the L-W expression for the expected number of islands by 1 to obtain the correct limiting behavior as closure is approached. Although not as important for practical calculations, let us also replace N with N − 1 to obtain the correct behavior at project initiation, that is, the first clone yields exactly 1 island. The result is N e^{−α(N − 1)} + 1 − ε, where ε = e^{−α(N − 1)} is a small quantity that quickly vanishes. This expression is identical within ε to E〈I〉 + 1, where E〈I〉 is given by equation 11. Because equation 11 represents the expected value of gaps, the Lander and Waterman (1988) result above should be more properly regarded as the number of gaps rather than the number of islands. In this context, the model is fully consistent and limiting behavior is correct. For example, the quotient of bases covered, G(1 − e^{−αN}), and number of islands (with correct end-limiting behavior) yields a more reasonable L-W approximation for expected island length

E 〈 L_{island} 〉 = \frac{G (1 - e^{- α N})}{N e^{- α N} + 1} . \tag{2}

Equation 2 correctly converges to the project length G.
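The contrast between the original L-W island length and equation 2 can be checked with a few lines of Python. This is a minimal sketch rather than the authors' code: the function names are hypothetical, and the default parameters (G = 40,000, L = 500, T = 20) are those of the Roach (1995) case study discussed in this section.

```python
import math

def lw_island_length(G, L, T, N):
    """Original L-W expected island length, G * (exp(alpha * N) - 1) / N."""
    alpha = (L - T) / G
    return G * (math.exp(alpha * N) - 1.0) / N

def corrected_island_length(G, L, T, N):
    """Equation 2: bases covered divided by (expected gaps + 1)."""
    alpha = (L - T) / G
    covered = G * (1.0 - math.exp(-alpha * N))
    islands = N * math.exp(-alpha * N) + 1.0
    return covered / islands

def compare(G=40_000, L=500, T=20, coverages=(1, 2, 4, 6, 8, 10)):
    """Return (coverage, original, corrected) rows for each coverage depth."""
    rows = []
    for c in coverages:
        N = round(c * G / L)  # clones needed for coverage depth c = N * L / G
        rows.append((c, lw_island_length(G, L, T, N),
                     corrected_island_length(G, L, T, N)))
    return rows
```

In the rows returned by `compare()`, the original expression grows without bound and eventually exceeds G, whereas the corrected value stays below G and approaches it as coverage increases.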

Furthermore, equation 11 is derived from equation 9, which is essentially the same density function given by Roach (1995), that is, a binomial distribution based on the probability of a gap. The Lander and Waterman (1988) and Roach (1995) models are thus fundamentally equivalent, although Roach provides the underlying density function that did not appear in the Lander and Waterman article. Differences in appearance of the equations between the two articles are second-order and can be neglected for practical calculations. Specifically, Roach (1995) uses N − 1 rather than N but does not explicitly use exponentiation. Strictly speaking, his result remains asymptotic because gap limits are not rigorously established as in equation 3. This leads to a one-term approximation of equation 4. To illustrate the equivalency, we repeat a case study by Roach (1995) that compares expected island lengths for a shotgun sequencing project (Fig. 5). Whereas original L-W theory diverges, equation 2 duplicates results obtained by Roach within the second-order differences mentioned above. Amending limiting behavior as we have described here promises to resolve similar anomalies in other models (Arratia et al. 1991; Port et al. 1995).

Figure 5.

Repeat of a case study by Roach (1995) that compares expected island length for a shotgun sequencing project having G = 40,000, L = 500, and T = 20. Crosses represent average values derived from a series of Monte Carlo simulations performed by Roach (1995). Coordinate axes are scaled exactly as in Roach (1995).


DISCUSSION

Past work has largely focused on expected value of gaps, islands, and so forth. Here we broaden these results by several example calculations using both our combinatorially exact and asymptotically approximate models.

Evolution of Gaps

The process by which gaps evolve in a project can be examined by plotting probability density as a function of coverage depth N L/G (Fig. 6). Dispersion is minimal at the outset, which is expected, given that the number of possible arrangements for a limited number of clones is relatively small. Distributions are not symmetric. As a project progresses toward 1× coverage, distributions rapidly become disperse and symmetric. It is in this region that theoretical predictions for expected gaps are most likely to differ from results obtained in the laboratory. The shape remains almost constant for several increments in coverage. As deeper coverage is reached, for example, 5× in this case, distributions start to contract and become asymmetric. The trend becomes more exaggerated as closure is approached. Dispersion also increases with L/G as characterized by the quotient of maximum ς to maximum E〈I〉 (Fig. 7). In general, this implies that estimates of the expected number of gaps are more likely to reflect actual laboratory observations for smaller L/G.

Figure 6.

Evolution of probability density function for a hypothetical project (L/G = 0.001, T/L = 0) up to 5× coverage as evaluated by equation 4. Arrows indicate whether the average number of gaps is increasing (→) or decreasing (←) for each distribution.


Figure 7.

Dispersion of probability density function characterized by the quotient of maximum standard deviation and maximum expected gaps.


Closure Probabilities

Although it is not a rigorous indicator, some estimate of the difficulty of a project can be obtained by examining the probability of closure, that is, the absence of gaps. Straightforward simplification of equations 4 and 9 yields p(0, N). It is clear from Figure 8 that closure is approached faster for projects having larger L/G values. Maximizing clone length (or sequencing read length) is therefore critical. Similar behavior has been noted previously for random subcloning by Roach (1995) using the Flatto and Konheim (1962) theory and for pairwise end sequencing using computer simulation (Roach et al. 1995). In our opinion, idealized models that predict lower coverages, for example, 15× for shotgun sequencing a typical human chromosome of 10⁸ bases (Derrida and Fink 2002), are incorrect. Trends in Figure 8 approximately follow (1 − e^{−NL/G})^N, as shown by equation 9, which penalizes short clones because N must be larger to attain a given coverage. This reflects the fact that larger clones are more effective at closing gaps than smaller ones and explains why BAC clones can be shotgunned to within a few gaps, whereas whole genome shotgun projects retain many gaps at the same coverage. These expectations extrapolate in large degree to fingerprinting as well. For example, projects having L/G of 3.3 × 10⁻³ (Martin et al. 2002) or above reach a probability of closure of 99% or higher by 13× coverage. In practice, some bias will likely exist, meaning that a small number of gaps must still be closed by directed means.
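The asymptotic closure probability is simple enough to evaluate directly. The sketch below sets i = 0 in equation 9, giving p(0, N) = (1 − b)^N with b = e^{−α(N − 1)}. The function name, its arguments, and the default T/L = 0 are assumptions made for illustration; the exact model of equation 4, which was used for cases 1 and 2 of Figure 8, is not reproduced here.

```python
import math

def closure_probability(l_over_g, coverage, t_over_l=0.0):
    """Asymptotic p(0, N) = (1 - b)^N with b = exp(-alpha * (N - 1)) (equation 9)."""
    n = max(1, round(coverage / l_over_g))   # clones giving depth N * L / G
    alpha = l_over_g * (1.0 - t_over_l)      # alpha = (L - T) / G
    b = math.exp(-alpha * (n - 1))
    if b >= 1.0:                             # b = 1 when N = 1 or alpha = 0
        return 0.0
    return math.exp(n * math.log1p(-b))      # numerically stable (1 - b)^N
```

Evaluating `closure_probability(3.3e-3, 13)` corresponds to the Martin et al. (2002) example quoted above, while `closure_probability(1.8e-7, 13)` illustrates how a whole genome shotgun project with a very small L/G remains far from closure at the same depth.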

Figure 8.

Probability of closure as a function of depth of coverage for various projects: 1. Zhu et al. (1999); 2. Dewar et al. (1998); 3. Fleischmann et al. (1995); 4. McPherson et al. (2001); 5. Adams et al. (2000); 6. Venter et al. (2001). Abbreviations “f.p.” and “w.g.s.” represent fingerprint mapping and whole genome shotgun sequencing projects, respectively. Cases 1 and 2 were evaluated using equation 4, whereas the remaining cases were determined using equation 9.


BAC Shotgun Sequencing

The concept of closure probability can also be applied to deriving probabilistic stopping points in BAC clone shotgun sequencing. Current practice uses a simple linear scale: 5× coverage is considered a “half shotgun” and 10× coverage is a “full shotgun.” However, these figures do not take into account clone size or the average read length obtained from sequencing reactions. Roach (1995) proposed a criterion based on the expected cost for incrementally closing a gap, but the scale increases exponentially near closure. A more systematic method unaffected by the exponential problem can be defined according to confidence levels, for example, a 90% confidence of closure. BAC clone length is typically on the order of 150 kb (IHGSC 2001) but can average as low as 58 kb (Diaz-Perez et al. 1997) or show significantly higher values, for example, 235 kb for some human clones (Wendl et al. 2001). Read length is generally in the range of 500 to 800 base pairs in a large-scale production environment. Figure 9 shows that reasonable stopping points vary between about 8.5× and 12× coverage and decrease approximately linearly with read length. “Full shotgun” of a typical 150-kb BAC coincides with 10× coverage for an average read length of 650 bases and a 90% confidence level of closure. Longer clones, lower read lengths, or higher confidence values would require additional coverage beyond 10×.
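A confidence-based stopping point can be sketched as a simple search over the number of reads. The code below is an illustration under stated assumptions, not the procedure used to produce Figure 9: it uses only the asymptotic closure probability from equation 9, treats the BAC length as G and the read length as L, and defaults the overlap threshold T to zero because the value used for the figure is not stated in the text. All function and parameter names are hypothetical.

```python
import math

def closure_probability(alpha, n):
    """Asymptotic p(0, N) = (1 - b)^N with b = exp(-alpha * (N - 1))."""
    b = math.exp(-alpha * (n - 1))
    return 0.0 if b >= 1.0 else math.exp(n * math.log1p(-b))

def stopping_coverage(clone_length, read_length, confidence=0.90, overlap=0.0,
                      max_coverage=50.0):
    """Smallest coverage N * L / G at which closure probability reaches `confidence`.

    Here G is the BAC clone length and L the read length, both in bases.
    """
    alpha = (read_length - overlap) / clone_length
    n_limit = int(max_coverage * clone_length / read_length)
    for n in range(2, n_limit + 1):          # scan upward, one read at a time
        if closure_probability(alpha, n) >= confidence:
            return n * read_length / clone_length
    raise ValueError("confidence level not reached within max_coverage")
```

For instance, `stopping_coverage(150_000, 650)` addresses the "full shotgun" case described above, and raising `confidence` to 0.95 or `clone_length` to 235,000 shifts the stopping point to deeper coverage.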

Figure 9.

Stopping points for bacterial artificial chromosome (BAC) shotgun sequencing based on confidence levels for closure of 90% and 95%. Short clones averaging 58 kb (Diaz-Perez et al. 1997) were evaluated using equation 4, whereas “typical clones” of 150 kb (IHGSC 2001) and longer clones of 235 kb (Wendl et al. 2001) were determined using equation 9.


METHODS

We briefly describe assumptions used in modeling BAC clone mapping and shotgun sequencing and then construct a theory describing evolution of gaps for these processes.

Assumptions

The following assumptions collectively represent what is possible in the laboratory regarding implementation of BAC clone and subclone libraries. Well-made libraries would be expected to display characteristics reasonably close to these.

First, we make the conventional assumption of a uniform clone distribution. Techniques used for BAC clone libraries enable a high degree of uniformity (Osoegawa et al. 1998, 2000; Cheung et al. 2001; Osoegawa et al. 2001), and subclone libraries are usually created by mechanical means, which are not significantly biased (e.g., sonication). We assume that cloning biases are small or can be minimized. Second, we make the standard assumption of a constant clone length L. Although length variability is largely governed by fractionation protocols, it is typically small in practice (Osoegawa et al. 1998). Third, chimerism is low in a well-made library, for example, less than 1% for BACs (Osoegawa et al. 2000), so it is ignored. Fourth, end effects are neglected because they are genome and project specific. Although they have little influence on large projects (Arratia et al. 1991; Balding and Torney 1991; Ewens et al. 1991), they can have a small biasing effect on fingerprint mapping if L/G is comparatively large. Conversely, for circular architectures found in bacterial fingerprint projects (Tomkins et al. 2001), the assumption is identically satisfied. Some models account for end effects on a linear representation of the DNA target; however, this is spurious for genomes with more than one chromosome. One would have to properly model all chromosome-specific end effects. Lacking such genome-specific considerations, the appropriate configuration is a circular DNA target. Last, we assume that overlap detection can be adequately modeled using the simple threshold constant T used by previous theories (Lander and Waterman 1988; Roach 1995). This parameter can be thought of as an expected value required for an overlap to be detected.

Theoretical Development

Let N be the number of clones that have been processed in a fingerprint mapping or shotgun sequencing project and I be a random variable representing the number of gaps i among these N clones. Following Lander and Waterman (1988) and Roach (1995), we define the effective clone length as α = (L − T)/G. This expression accounts for the penalty involved in not detecting an actual overlap. That is, if a real overlap is less than T, a gap is assumed. No restrictions are imposed on clone size except 0 < L/G < 1. In other words, we do not explicitly invoke the asymptotic approximation.

We begin by deriving probabilities of gaps immediately following particular sets of clones. Let the target DNA segment be represented by a circle of unit circumference so that each of the N clones contributes a fractional coverage α. A gap occurs when the starting positions of two clones are greater than α apart. Following Solomon (1978), we can infer the probability of gaps following particular sets of clones by applying a geometric translation operator to each set. For example, the probability of a gap immediately following any one specific clone of the N clones is f(1) = (1 − α)^{N − 1}. For gaps following any two particular clones, the probability is f(2) = (1 − 2α)^{N − 1}. Generalizing this procedure for m specific clones leads to

f (m) = (1 - m α)_{+}^{N - 1}, \tag{3}

where the “plus” notation (Siegel 1979) is defined as (j)₊ = max (0, j). This restriction arises from the fact that the number of gaps is bounded by the minimum number of clones required to cover the project exactly one time. In other words, there can be, at most, a tiny gap between each clone as 1× coverage is approached. The probability of a number of gaps greater than this value is zero. Results from equation 3 are biased upward as T increases because gaps are presumed when overlaps are too small to be detected.
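The geometric picture behind equation 3 can be checked with a small Monte Carlo experiment. The sketch below is illustrative only: it places N clone start positions uniformly on a unit circle, counts a gap whenever the next start lies more than α away, and compares the simulated mean with N f(1), which is the expected number of gaps implied by equation 3 because each clone is followed by a gap with probability f(1). The function names, trial count, and random seed are assumptions made here.

```python
import random

def count_gaps(starts, alpha):
    """Count clones on the unit circle whose successor starts more than alpha away."""
    ordered = sorted(starts)
    n = len(ordered)
    if n == 1:
        return 1 if alpha < 1.0 else 0       # a lone clone is followed by the rest of the circle
    gaps = 0
    for k in range(n):
        nxt = ordered[(k + 1) % n]
        spacing = (nxt - ordered[k]) % 1.0   # modulo handles the wrap-around pair
        if spacing > alpha:
            gaps += 1
    return gaps

def simulate_mean_gaps(n_clones, alpha, trials=2000, seed=1):
    """Average gap count over `trials` random placements of n_clones clones."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.random() for _ in range(n_clones)]
        total += count_gaps(starts, alpha)
    return total / trials

def expected_gaps_from_eq3(n_clones, alpha):
    """N * f(1), using f(1) = (1 - alpha)_+^(N - 1) from equation 3."""
    return n_clones * max(0.0, 1.0 - alpha) ** (n_clones - 1)
```

Comparing `simulate_mean_gaps(50, 0.03)` with `expected_gaps_from_eq3(50, 0.03)` gives a quick sanity check of the circular model; agreement is only statistical and tightens as `trials` increases.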

Next, we must account for the various ways these gap arrangements can be realized. For example, in the case of m = 2, gaps could follow the first and second clones, the first and third clones, and so forth. Stevens' Theorem (Stevens 1939; Solomon 1978) can be applied directly for this calculation. We thus obtain the probability density function for i gaps distributed among N clones

p (i, N) = C_{N, i} \sum_{m = 0}^{N - i} C_{N - i, m} (- 1)^{m} f (m + i), \tag{4}

where C_{j,k} is the binomial coefficient for j gaps taken k at a time. By applying the definition of the moment-generating function (Ross 2000), we obtain

φ (t) = E 〈 e^{tI} 〉 = \sum_{i = 0}^{N} e^{ti} p (i, N), \tag{5}

from which all moments of interest can be derived.

The standard gap statistic provided by previous models is the expected number of gaps E〈I〉 resulting from N clones. Evaluating the first moment E〈I〉 = φ′(0), we obtain

E 〈 I 〉 = \sum_{i = 0}^{N} i p (i, N) . \tag{6}

This result is more general than corresponding expressions given by Lander and Waterman (1988) and Roach (1995) because it can be applied with larger L/G ratios. Variance is a useful measure of dispersion and can be computed as a combination of the first and second moments ς² = E〈I²〉 − (E〈I〉)². Evaluation of E〈I²〉 = φ″(0) from equation 5 along with some algebraic manipulation shows

ς^{2} = \sum_{i = 0}^{N} [i - E 〈 I 〉] i p (i, N) . \tag{7}

Standard deviation ς is obtained by taking the square root of equation 7. Higher moments such as skewness and kurtosis could be derived by similar operations.

These equations become progressively more difficult to evaluate as L/G decreases. Specifically, N becomes very large for coverages of interest, making the ranges of both the summations and the binomial coefficients correspondingly large. Moreover, full precision of the binomial coefficients must be retained, otherwise round-off error quickly destabilizes the calculation. Here, we use Perl, which implements arbitrary precision integer and floating point object classes (Wall et al. 2000). In most cases, we do not evaluate the equations “exactly,” that is, over the entire distribution such that the total probability is identically 1. Instead, we truncate computations for the moments in equations 6 and 7 such that the total probability is at least 0.9998. This dramatically reduces computational time without significant loss of accuracy.
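For small N, equations 3, 4, 6, and 7 can be evaluated exactly with rational arithmetic. The following sketch uses Python's `fractions.Fraction` and `math.comb` in place of the arbitrary-precision Perl classes mentioned above; this substitution and the function names are assumptions made for illustration. It evaluates the full distribution and does not implement the 0.9998 truncation, so it is practical only for modest clone counts.

```python
from fractions import Fraction
from math import comb

def f(m, alpha, N):
    """Equation 3: (1 - m * alpha)_+^(N - 1), with (j)_+ = max(0, j)."""
    return max(Fraction(0), 1 - m * alpha) ** (N - 1)

def gap_density(i, alpha, N):
    """Equation 4: probability of exactly i gaps among N clones."""
    total = sum(comb(N - i, m) * (-1) ** m * f(m + i, alpha, N)
                for m in range(N - i + 1))
    return comb(N, i) * total

def gap_moments(alpha, N):
    """Return (density, mean, variance) from equations 4, 6, and 7."""
    density = [gap_density(i, alpha, N) for i in range(N + 1)]
    mean = sum(i * p for i, p in enumerate(density))                    # equation 6
    variance = sum((i - mean) * i * p for i, p in enumerate(density))   # equation 7
    return density, mean, variance
```

Passing `alpha` as a `Fraction`, for example `gap_moments(Fraction(1, 10), 5)`, keeps every intermediate value exact, so the alternating sum in equation 4 cannot be destabilized by round-off.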

Asymptotic Approximation

When L/G is small enough, one can invoke the so-called asymptotic approximation (Seed 1982; Torney 1991; Marr et al. 1992), whereby (1 − α)^N → e^{−αN} for suitable α and N. In this case, the specific probability in equation 3 follows the limit [1 − (i + m)α]_{+}^{N − 1} → e^{−α(i + m)(N − 1)}. Let b = e^{−α(N − 1)}, then equation 4 becomes

p (i, N) = b^{i} C_{N, i} \sum_{m = 0}^{N - i} C_{N - i, m} (- 1)^{m} b^{m} . \tag{8}

The summation in equation 8 is simply an expansion of (1 − b)^{N − i}. Thus, the density function in equation 4 reduces to the binomial distribution

p (i, N) = C_{N, i} b^{i} (1 - b)^{N - i} . \tag{9}

Following equation 5, we substitute this expression to obtain the moment-generating function, which can be simplified via the Binomial Theorem to obtain

φ (t) = (b e^{t} + 1 - b)^{N} . \tag{10}

Equation 10 is the well-known generating function for a binomial distribution having a Bernoulli “success” probability of b (Ross 2000). Deriving the appropriate moments, we find the expected value to be

E 〈 I 〉 = N e^{- α (N - 1)} \tag{11}

and the variance to be

ς^{2} = E 〈 I 〉 (1 - e^{- α (N - 1)}) . \tag{12}

Higher moments can be derived in a straightforward fashion by succeeding derivatives of φ(t).
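The asymptotic results are inexpensive to evaluate even for large N. The sketch below computes the mean and variance of equations 11 and 12 and the binomial density of equation 9; working in log space via `math.lgamma` is a choice made here to avoid overflow in the binomial coefficient, not something stated in the article, and the function names are hypothetical.

```python
import math

def asymptotic_gap_stats(alpha, n):
    """Return (mean, variance) from equations 11 and 12."""
    b = math.exp(-alpha * (n - 1))
    mean = n * b
    return mean, mean * (1.0 - b)

def asymptotic_gap_density(i, alpha, n):
    """Equation 9: binomial probability of i gaps, evaluated in log space."""
    log_b = -alpha * (n - 1)
    b = math.exp(log_b)
    if b >= 1.0:                     # N = 1 or alpha = 0: every clone is followed by a gap
        return 1.0 if i == n else 0.0
    log_p = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
             + i * log_b + (n - i) * math.log1p(-b))
    return math.exp(log_p)
```

With the Figure 2 parameters (L/G = 0.001, T/L = 0, 1× coverage, hence N = 1000), `asymptotic_gap_stats(0.001, 1000)` returns the mean and variance of the approximate distribution, and summing `asymptotic_gap_density(i, 0.001, 1000)` over i from 0 to 1000 should recover a total probability of 1 to within floating-point error.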

Availability

Programs implementing the theory developed in this article are written in Perl and are freely available from the authors. The Perl language itself and necessary modules used here are freely available at www.cpan.org on the World Wide Web.

We thank Drs. Warren Gish and Gary Stormo of the Washington University Genetics Department for reviewing draft manuscripts and Drs. Marco Marra of the British Columbia Cancer Research Centre and John Wallis of the Washington University Genome Sequencing Center for informative discussions. This work was supported by a grant from the National Human Genome Research Institute (HG02042).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Notes

[2] Corresponding author.


[3] Genome Sequencing Center, Box 8501, 4444 Forest Park Blvd., Saint Louis, MO 63108. E-MAIL [email protected].

[4] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.655102.

REFERENCES

↵
M.D. AdamsS.E. CelnikerR.A. HoltC.A. EvansJ.D. GocayneP.G. AmanatidesS.E. SchererP.W. LiR.A. HoskinsR.F. Galle(2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195.
Google Scholar CrossRef PubMed Web of Science
↵
R. ArratiaE.S. LanderS. TavaréM.S. Waterman(1991) Genomic mapping by anchoring random clones: A mathematical analysis. Genomics 11:806–827.
Google Scholar CrossRef PubMed Web of Science
↵
D.J. BaldingD.C. Torney(1991) Statistical analysis of DNA fingerprint data for ordered clone physical mapping of human chromosomes. Bull. Math. Biol. 53:853–879.
Google Scholar PubMed
↵
Y.L. ChangQ. TaoC. ScheuringK. DingK. MeksemH.-B. Zhang(2001) An integrated map of Arabidopsis thaliana for functional analysis of its genome sequence. Genetics 159:1231–1242.
Google Scholar PubMed Web of Science
↵
V.G. CheungN. NowakW. JangI.R. KirschS. ZhaoX.N. ChenT.S. FureyU.J. KimW.L. KuoM. Olivier(2001) Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409:953–958.
Google Scholar CrossRef PubMed
↵
Derrida, B. and Fink, T.M.A. 2002. Sequence determination from overlapping fragments: A simple model of whole-genome shotgun sequencing. Phys. Rev. Lett. 88: art. no. 068106..
Google Scholar
↵
K. DewarL. SabbaghG. CardinalF. VeilleuxF. SanschagrinB. BirrenR.C. Levesque(1998) Pseudomonas aeruginosa PAO1 bacterial artificial chromosomes: Strategies for mapping, screening, and sequencing 100 kb loci of the 5.9 Mb genome. Microb. Comp. Genomics 3:105–117.
Google Scholar PubMed
↵
S.V. Diaz-PerezF. Alatriste-MondragonR. HernandezB. BirrenR.P. Gunsalus(1997) Bacterial artificial chromosome (BAC) library as a tool for physical mapping of the archaeon Methanosarcina thermophila TM-1. Microb. Comp. Genomics 2:275–286.
Google Scholar PubMed
↵
W.J. EwensC.J. BellP.J. DonnellyP. DunnE. MatallanaJ.R. Ecker(1991) Genome mapping with anchored clones: Theoretical aspects. Genomics 11:799–805.
Google Scholar CrossRef PubMed
↵
L. FlattoA.G. Konheim(1962) The random division of an interval and the random covering of a circle. SIAM Rev. 4:211–222.
Google Scholar
↵
R.D. FleischmannM.D. AdamsO. WhiteR.A. ClaytonE.F. KirknessA.R. KerlavageC.J. BultJ.F. TombB.A. DoughertyJ.M. Merrick(1995) Whole-genome random sequencing and assembly of H. influenzae rd. Science 269:496–512.
Google Scholar CrossRef PubMed Web of Science
↵
P. Hall(1988) Introduction to the theory of coverage processes. (John Wiley & Sons, New York, NY).
Google Scholar
↵
J.F. HeidelbergJ.A. EisenW.C. NelsonR.A. ClaytonM.L. GwinnR.J. DodsonD.H. HaftE.K. HickeyJ.D. PetersonL. Umayam(2000) DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483.
Google Scholar CrossRef PubMed
↵
International Human Genome Sequencing Consortium(2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
Google Scholar CrossRef PubMed
↵
M. JohnstonL. HillierL. RilesK. AlbermannB. AndreW. AnsorgeV. BenesM. BrucknerH. DeliusE. Dubois(1997) The nucleotide sequence of Saccharomyces cerevisiae chromosome XII. Nature 387:87–90.
Google Scholar PubMed
↵
M.G. KendallP.A.P. Moran(1963) Geometrical probability. (Hafner Publishing Company, New York, NY).
Google Scholar
↵
K. KupferM.W. SmithJ. QuackenbushG.A. Evans(1995) Physical mapping of complex genomes by sampled sequencing: A theoretical analysis. Genomics 27:90–100.
Google Scholar CrossRef PubMed
↵
E.S. LanderM.S. Waterman(1988) Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2:231–239.
Google Scholar CrossRef PubMed
↵
T.G. MarrX. YanQ. Yu(1992) Genomic mapping by single copy landmark detection: A predictive model with a discrete mathematical approach. Mamm. Genome 3:644–649.
Google Scholar CrossRef PubMed
↵
S.L. Martin, B.P. Blackmon, R. Rajagopalan, T.D. Houfek, R.G. Sceeles, S.O. Denn, T.K. Mitchell, D.E. Brown, R.A. Wing, R.A. Dean (2002) MagnaportheDB: A federated solution for integrating physical and genetic map data with BAC end derived sequences for the rice blast fungus Magnaporthe grisea. Nucleic Acids Res. 30:121–124.
J.D. McPherson, M. Marra, L. Hillier, R.H. Waterston, A. Chinwalla, J. Wallis, M. Sekhon, K. Wylie, E.R. Mardis, R.K. Wilson (2001) A physical map of the human genome. Nature 409:934–941.
T. Mozo, K. Dewar, P. Dunn, J.R. Ecker, S. Fischer, S. Kloska, H. Lehrach, M. Marra, R. Martienssen, S. Meier-Ewert (1999) A complete BAC-based physical map of the Arabidopsis thaliana genome. Nat. Genet. 22:271–275.
G. Myers (1999) Whole-genome DNA sequencing. Comput. Sci. Eng. 1:33–43.
K. Osoegawa, P.Y. Woon, B. Zhao, E. Frengen, M. Tateno, J.J. Catanese, P.J. de Jong (1998) An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52:1–8.
K. Osoegawa, M. Tateno, P.Y. Woon, E. Frengen, A.G. Mammoser, J.J. Catanese, Y. Hayashizaki, P.J. de Jong (2000) Bacterial artificial chromosome libraries for mouse sequencing and functional analysis. Genome Res. 10:116–128.
K. Osoegawa, A.G. Mammoser, C. Wu, E. Frengen, C. Zeng, J.J. Catanese, P.J. de Jong (2001) A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res. 11:483–496.
E. Port, F. Sun, D. Martin, M.S. Waterman (1995) Genomic mapping by end-characterized random clones: A mathematical analysis. Genomics 26:84–100.
J.C. Roach (1995) Random subcloning. Genome Res. 5:464–473.
J.C. Roach, C. Boysen, K. Wang, L. Hood (1995) Pairwise end sequencing: A unified approach to genomic mapping and sequencing. Genomics 26:345–353.
J.C. Roach, A.F. Siegel, G. van den Engh, B. Trask, L. Hood (1999) Gaps in the human genome project. Nature 401:843–845.
S.M. Ross (2000) Introduction to probability models. 7th edition. (Academic Press, San Diego, CA).
B. Seed (1982) Theoretical study of the fraction of a long-chain DNA that can be incorporated in a recombinant DNA partial-digest library. Biopolymers 21:1793–1810.
H. Shizuya, B. Birren, U.J. Kim, V. Mancino, T. Slepak, Y. Tachiiri, M. Simon (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci. 89:8794–8797.
A.F. Siegel (1979) Asymptotic coverage distributions on the circle. Ann. Probability 7:651–661.
D.R. Smith, P. Richterich, M. Rubenfield, P.W. Rice, C. Butler, H.M. Lee, S. Kirst, K. Gundersen, K. Abendschan, Q.X. Xu (1997) Multiplex sequencing of 1.5 Mb of the Mycobacterium leprae genome. Genome Res. 7:802–819.
H. Solomon (1978) Geometric probability. (Society for Industrial and Applied Mathematics, Philadelphia, PA).
W.L. Stevens (1939) Solution to a geometrical problem in probability. Ann. Eugen. 9:315–320.
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815.
The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282:2012–2018.
J.P. Tomkins, T.C. Wood, M.G. Stacey, J.T. Loh, A. Judd, J.L. Goicoechea, G. Stacey, M.J. Sadowsky, R.A. Wing (2001) A marker-dense physical map of the Bradyrhizobium japonicum genome. Genome Res. 11:1434–1440.
D.C. Torney (1991) Mapping using unique sequences. J. Mol. Biol. 217:259–264.
J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, C.A. Evans, R.A. Holt (2001) The sequence of the human genome. Science 291:1304–1351.
L. Wall, T. Christiansen, J. Orwant (2000) Programming Perl. 3rd edition. (O’Reilly & Associates, Inc., Sebastopol, CA).
M.C. Wendl, M.A. Marra, L.W. Hillier, A.T. Chinwalla, R.K. Wilson, R.H. Waterston (2001) Theories and applications for sequencing randomly selected clones. Genome Res. 11:274–280.
K. Yamada, H. Ogawa, G. Tamiya, M. Ikeno, M. Morita, S. Asakawa, N. Shimizu, T. Okazaki (2000) Genomic organization, chromosomal localization, and the complete 22 kb DNA sequence of the human GCMa/GCM1, a placenta-specific transcription factor gene. Biochem. Biophys. Res. Commun. 278:134–139.
H. Zhu, B.P. Blackmon, M. Sasinowski, R.A. Dean (1999) Physical map and organization of chromosome 7 in the rice blast fungus Magnaporthe grisea. Genome Res. 9:739–750.
