Human Whole-Genome Shotgun Sequencing
Large-scale sequencing of the human genome is now under way (Boguski et al. 1996; Marshall and Pennisi 1996). Although many doubted the scientific value of sequencing the entire human genome at the beginning of the Genome Project, these doubts have evaporated almost entirely (Gibbs 1995; Olson 1995). Primary reasons for generating the human genomic sequence are listed in Table 1.
Table 1. Primary Reasons for Sequencing Human Genomic DNA
The approach being taken for human genomic sequencing is the same as that used for the Saccharomyces cerevisiae and Caenorhabditis elegans genomes, namely construction of overlapping arrays of large insert Escherichia coli clones, followed by complete sequencing of these clones one at a time. In this article, we outline an alternative approach to sequencing the human and other large genomes, which we argue is less costly and more informative than the clone-by-clone approach.
A Plan for Human Whole-Genome Shotgun Sequencing
Although there are many conceivable variations, the crux of our plan involves high-quality, semiautomated sequencing from both ends of very large numbers of randomly selected human genomic DNA fragments. DNA of high molecular weight purified from at least a few different human donors would be sheared, size-selected, and cloned into E. coli. Insert sizes would fall into two classes. Long inserts would be 5–20 kb in size and would be cloned into plasmid, phage, or possibly cosmid vectors. Short inserts would be 0.4–1.2 kb in size and would be cloned into plasmid vectors. Read lengths would be of sufficient magnitude so that the two sequence reads from the ends of the short inserts overlap. The ratio of long to short inserts would be ⩾1. Standard, gel-based methods would be utilized to generate at least 30 billion nucleotides of raw sequence (10-fold coverage of the genome). Many laboratories throughout the world could participate in raw sequence generation, but all sequences would be deposited in a common, public database, and only a few or possibly even one large informatics group would assume the primary task of sequence assembly. Following initial assembly, gaps in sequence coverage would need to be filled and uncertainties in assembly would need to be resolved.
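The scale of the plan can be made concrete with a short calculation. The sketch below is illustrative only: the 3-Gb genome size is implied by the statement that 30 billion nucleotides correspond to 10-fold coverage, whereas the 400-base read length and the function names are assumptions chosen for the example, not values taken from the plan.

```python
GENOME_SIZE = 3_000_000_000  # bases; implied by 30 Gb of raw sequence = 10-fold coverage


def reads_required(coverage, read_length, genome_size=GENOME_SIZE):
    """Number of sequence reads needed to reach the given fold coverage."""
    total_bases = coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


def clones_required(coverage, read_length, genome_size=GENOME_SIZE):
    """Each insert is sequenced from both ends, so one clone yields two reads."""
    return -(-reads_required(coverage, read_length, genome_size) // 2)


if __name__ == "__main__":
    read_length = 400  # hypothetical read length in bases
    print("reads :", reads_required(10, read_length))
    print("clones:", clones_required(10, read_length))
```

Under these assumptions, 10-fold coverage corresponds to 75 million reads from 37.5 million clones; longer reads reduce both numbers proportionally.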
Sequencing from both ends of relatively long insert subclones is an essential feature of the plan. Initially, Edwards and colleagues (1990) and, more recently, several other groups (Chen et al. 1993; Smith et al. 1994; Kupfer et al. 1995; Roach et al. 1995; Nurminsky and Hartl 1996) recognized that sequence information from both ends of relatively long inserts dramatically improves the efficiency of sequence assembly. In contrast to single sequence reads from one end of shotgun subclones, the pairs of sequence reads from both ends have known spacing and orientation. Use of relatively long insert subclones also aids in the assembly of sequences containing interspersed repetitive elements. Roach and colleagues (1995) showed that use of a mixture of long and short inserts can be as effective in enhancing assembly as use of only long inserts. Precise knowledge of the length of the long insert clones is not required to realize the advantages of end sequencing.
Another essential feature of the plan is the attachment of quality values to the raw sequences. The quality values would indicate the likelihood that each base call is correct. Quality values would aid sequence assembly (Churchill and Waterman 1992; Giddings et al. 1993; Lawrence and Solovyev 1994; Lipshutz et al. 1994), would help to distinguish true DNA polymorphisms from sequencing errors, and would also label uncertain sequences. Quality values would not obviate the need for relatively low error rates in the sequencing (Fleischmann et al. 1995). Low error rates would minimize the number of overlapping nucleotides required for sequence joining and also the ultimate sequence redundancy that is required. Frequent and appropriate quality controls would need to be utilized to ensure that the raw sequence generated is high quality. The quality of the combined sequences from the ends of the short inserts would be enhanced because the overlapping segment occurs at the ends of the sequence reads where base calling is typically least reliable.
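One way in which quality values could be used when reads overlap is sketched below. This is a minimal illustration rather than a method described in the plan: it assumes that each quality value is the probability that the base call is correct, that errors in different reads are independent, that a miscalled base is equally likely to be any of the other three bases, and that all four bases are equally likely a priori. The function name `consensus_base` is hypothetical.

```python
import math

BASES = "ACGT"


def consensus_base(calls):
    """Combine aligned base calls into one consensus call.

    calls: list of (base, p_correct) pairs for a single aligned position,
           with 0 < p_correct < 1.
    Returns (best_base, posterior probability of best_base).
    """
    if not calls:
        raise ValueError("at least one base call is required")
    log_like = {}
    for candidate in BASES:
        total = 0.0
        for base, p_correct in calls:
            if not 0.0 < p_correct < 1.0:
                raise ValueError("p_correct must lie strictly between 0 and 1")
            # Assumed error model: a wrong call is spread evenly over the other 3 bases.
            p = p_correct if base == candidate else (1.0 - p_correct) / 3.0
            total += math.log(p)
        log_like[candidate] = total
    peak = max(log_like.values())  # subtract the peak for numerical stability
    weights = {b: math.exp(v - peak) for b, v in log_like.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


if __name__ == "__main__":
    # Two end reads of a short insert disagree; the higher-quality call prevails.
    print(consensus_base([("A", 0.60), ("G", 0.99)]))
```

In the overlapping segment of a short insert, two low-quality calls that agree reinforce each other, whereas a disagreement is settled in favor of the call with the higher quality value.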
Feasibility of Whole-Genome Shotgun Sequencing
The feasibility of human whole-genome shotgun sequencing was evaluated by computer simulation designed to determine whether sufficient coverage and linkage information would result from such an approach. The simulation considered sequencing from both ends of two classes of inserts, long and short. The simulation also modeled both short and long interspersed repetitive elements (SINEs and LINEs). To be conservative, all interspersed repeats were considered to be identical in sequence so that overlaps in reads that fell within repetitive elements were useless for joining sequences. Many parameters such as fold coverage of the genome, sequence read length, amount of repetitive DNA, ratio of long to short inserts, and nucleotides of overlap required to join sequences were varied in the simulations. Default parameters (Table 2) are assumed to be in force unless otherwise stated. The default value for LINE length was conservatively chosen to be 1.5 kb, because although full-length LINE-1 (L1) elements are 6–7 kb in length, the vast majority of human L1 elements are truncated with average length ∼0.7 kb (Smit et al. 1995; A. Smit, pers. comm.). Note that the simulation does not solve an assembly problem over simulated data, but instead analyzes the nature of the sampling obtained. Details of the simulation, including source code, can be obtained from Gene Myers (gene{at}cs.arizona.edu).
Table 2. Simulation Default Parameters
Two outcomes of the simulation, contig length and scaffold length, were monitored particularly closely. Contigs are defined as sequence assemblies without any discontinuities. Scaffolds (Roach et al. 1995) are defined as collections of two or more contigs joined by long inserts whose ends are in different contigs. Scaffolds, by definition, contain discontinuities, but the positions and approximate sizes of the discontinuities are known. The simulation confirmed that coverage of the genome is largely a function of the amount of raw sequence generated (Lander and Waterman 1988; Fleischmann et al. 1995). As shown in Table 3, the average simulated contig length increased dramatically as the fold coverage of the genome increased from 0.5 to 10. Average contig length was also dependent on the amount of interspersed repetitive DNA and the ratio of long to short inserts (Fig. 1). Increasing amounts of repetitive DNA led to shorter average contigs. Even at 50% total repetitive DNA, however, maximum contig length was still near 100 kb. When long-to-short insert ratios were greater than 1, contig length was largely independent of the ratio. These results were only modestly affected by read length (from 200 to 800 bases) and by the minimum overlap required for sequence joining (from 20 to 60 bases) (data not shown).
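The dependence of contig length on coverage can also be illustrated with the closed-form expectations of Lander and Waterman (1988). The sketch below is not the simulation described here: it ignores repetitive elements and paired end reads, and the read length (400 bases) and minimum overlap (40 bases) are assumed values for the example, so its output should not be expected to match Table 3.

```python
import math


def lander_waterman(genome_size, read_length, coverage, min_overlap):
    """Expected shotgun statistics for randomly placed reads of equal length.

    coverage must be > 0. Repeats and end-pair information are ignored.
    """
    n_reads = coverage * genome_size / read_length
    # sigma: fraction of a read that must lie beyond the required overlap
    sigma = 1.0 - min_overlap / read_length
    contigs = n_reads * math.exp(-coverage * sigma)
    contig_length = read_length * (
        (math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma)
    )
    uncovered_fraction = math.exp(-coverage)
    return {
        "contigs": contigs,
        "contig_length": contig_length,
        "uncovered_fraction": uncovered_fraction,
    }


if __name__ == "__main__":
    for c in (0.5, 1, 2, 5, 10):
        stats = lander_waterman(3_000_000_000, 400, c, 40)
        print(c, round(stats["contigs"]), round(stats["contig_length"]))
```

Even this idealized model shows the steep, roughly exponential growth of expected contig length as coverage rises toward 10-fold.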
Table 3. Simulated Effects of Genome Coverage
Figure 1. Average simulation contig length as a function of repeat density and long-to-short insert ratio. At each level of repetitive DNA, 80% of the repeats were assumed to be SINEs and 20% LINEs. All simulation parameters not specified in the plots were set to default values (see Table 2). Average contig length excluded those contigs consisting of only single reads. The single-read contigs comprised only ∼0.1% of all reads.
Given the large number of contigs that would be generated with the whole-genome shotgun approach, a pivotal question is whether the simulation contigs could be ordered into scaffolds. For a hypothetical human chromosome, 400 Mb in size, one scaffold spanning the entire chromosome length was obtained in each of 100 simulation iterations. After assembly, an average of 160 contigs and six small scaffolds remained unconnected to the single, very large scaffold (scaffolds can overlap without being connected by common sequence).
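The ordering of contigs into scaffolds amounts to grouping contigs that are connected by long inserts whose two end reads fall in different contigs. A minimal sketch using a union-find structure follows; the contig identifiers, the `links` input format, and the function name are hypothetical, and the sketch groups contigs only, without computing their order, orientation, or gap sizes.

```python
def build_scaffolds(contig_ids, links):
    """Group contigs into scaffolds.

    contig_ids: iterable of contig identifiers.
    links: iterable of (contig_a, contig_b) pairs, one per long insert whose
           end reads lie in two different contigs.
    Returns (scaffolds, unlinked): scaffolds are groups of two or more contigs;
    unlinked lists contigs that no insert connects to another contig.
    """
    contig_ids = list(contig_ids)
    parent = {c: c for c in contig_ids}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for c in contig_ids:
        groups.setdefault(find(c), []).append(c)

    scaffolds = sorted(sorted(g) for g in groups.values() if len(g) >= 2)
    unlinked = sorted(g[0] for g in groups.values() if len(g) == 1)
    return scaffolds, unlinked


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4", "c5"]
    end_pair_links = [("c1", "c2"), ("c2", "c3")]
    print(build_scaffolds(contigs, end_pair_links))
```

In this toy input, contigs c1 to c3 form one scaffold, and c4 and c5 remain unconnected, analogous to the contigs left outside the single large scaffold in the simulation.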
Using the default parameters, only ∼16,000 gaps between contigs (0.04% of the genome) with average size of ∼70 bp and maximum size <1700 bp remained after assembly. Although filling these gaps would certainly require a large effort, because the gaps are short, it should be possible to fill virtually all of them using PCR. Additional effort, if deemed necessary, would be required to sequence the complementary strand of segments with only single-strand coverage. Simulation results indicate that under default conditions, 616,000 of these single-stranded regions would exist with an average size of 106 bases.
Although a large amount of computing power would be required to perform the sequence similarity searches necessary for assembly, such power is already available. Using conservative and sensitive overlap detection algorithms, it would currently be possible to span sequence-tagged sites (STSs) spaced at 100 kb at a rate of at least one STS pair per day per 100 mips (million instructions per second) workstation. With a cluster of 100 such workstations the assembly of the entire human genome would take 300 days. By using less sensitive, but faster, overlap detection software, this time could be reduced by nearly a factor of 10. Note also that the power of computer processors has doubled every 18 months for many years, and this trend is likely to continue (Patterson 1995). If contemplated machines such as the 3-teraflop supercomputer planned in 1998 for Lawrence Livermore National Laboratory (Macilwain 1996) were recruited to the task of assembly, then the human genome could be assembled, in principle, in 4 min.
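The 300-day estimate for the workstation cluster follows from simple arithmetic, sketched below. The 3-Gb genome size is an assumption implied by the coverage figures given earlier, and the function name is hypothetical.

```python
def assembly_days(genome_size, sts_spacing, workstations,
                  pairs_per_day_per_workstation=1.0):
    """Days needed to span every adjacent STS pair across the genome."""
    sts_pairs = genome_size / sts_spacing
    return sts_pairs / (workstations * pairs_per_day_per_workstation)


if __name__ == "__main__":
    genome_size = 3e9   # bases (assumed)
    spacing = 100e3     # STSs spaced at 100 kb
    print(assembly_days(genome_size, spacing, workstations=100))
    # Faster, less sensitive overlap detection: roughly 10 pairs per day.
    print(assembly_days(genome_size, spacing, workstations=100,
                        pairs_per_day_per_workstation=10.0))
```

A genome of 3 Gb with STSs every 100 kb contains 30,000 intervals, which 100 workstations would process in 300 days at one interval per workstation per day, or in about 30 days with software 10 times faster.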
It is important to realize that because of significant progress in the genetic and physical mapping of STSs (Olson et al. 1989), the real task of shotgun sequence assembly would be greatly simplified to the task of building contigs and scaffolds that span adjacent STSs. Each of the STSs would serve as a nucleation site for this linking process. Already >30,000 total human STSs, including >16,000 genes, have been physically mapped, and the tally is increasing rapidly (Cox et al. 1994; Hudson et al. 1995; Schuler et al. 1996 and Web sites listed therein). Expressed sequence tags (ESTs) (Adams et al. 1991, 1995; Hillier et al. 1996) are particularly valuable for sequence assembly because the coding sequences are often interrupted by introns. For the purposes of assembly, a single EST will therefore usually be the equivalent of an array of ordered STSs, a nearly ideal framework for assembly. Plans to generate full-length cDNA sequences (Marshall 1996) will only enhance the utility of these sequences for assembly. Some genes like the dystrophin and neurofibromatosis I genes, for example, cover enormous segments of the genome (2.3 and 0.35 Mb, respectively) (Heim et al. 1995; Prior et al. 1995). Assuming, conservatively, a total of 80,000 human ESTs and an average of three exons per sequence, a grand total of >250,000 STSs with an average spacing of only 12 kb is already available for assembly (Table 4).
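The average spacing quoted above can be checked with a few lines of arithmetic. The sketch assumes a 3-Gb genome and evenly distributed STSs; it does not attempt to reproduce the exact tally in Table 4, and the function names are hypothetical.

```python
def exon_derived_sts(est_count, exons_per_est):
    """Each exon within an EST acts as a separate, ordered STS."""
    return est_count * exons_per_est


def average_spacing_kb(genome_size, sts_count):
    """Mean distance between STSs, in kb, if they were evenly distributed."""
    return genome_size / sts_count / 1000.0


if __name__ == "__main__":
    print(exon_derived_sts(80_000, 3))                    # STS equivalents from ESTs
    print(average_spacing_kb(3_000_000_000, 250_000))     # kb between STSs
```

Three exons in each of 80,000 ESTs give 240,000 STS equivalents, and 250,000 STSs spread over 3 Gb lie 12 kb apart on average.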
Table 4. Human STSs
At present, the process for human whole-genome shotgun sequence assembly can only be projected. Nevertheless, a possible scenario for assembly would be to begin with all existing mapped STSs (including ESTs) within a specific chromosomal interval, to add shotgun reads in a very conservative fashion utilizing only sequence overlaps of high probability, to meld these growing assemblies to unmapped STSs within the database, and then to add in lower probability overlapping sequences. The sequence assemblies would continually be examined for disagreements with EST structure or with existing map information and also for the presence of forks or loops, which would indicate the presence of unrecognized interspersed (forks) or tandem (loops) repeats, or other errors in assembly or cloning artifacts. Software for assembly on this scale does not exist, but we have begun work in this direction. Our initial perception is that STS anchors provide sufficient directional information to allow resolution of low copy number repeats (of any scale) and that high copy number repeats can be factored as a consensus sequence that can be resolved at specific sites on a case-by-case basis. The development of such software poses difficult technical questions, but we believe these are surmountable in a several man–year horizon. We note, for example, that human coding sequences have been assembled from individual reads by several groups despite the presence of sequence errors, polymorphisms, alternative splicing, and repetitive elements (Schuler et al. 1996). Also, software developed for assembly of human sequences would be applied in the future to many other organisms.
Whole-genome shotgun sequencing would not result in a single unbroken sequence for entire chromosomes. Even using recombination and restriction-deficient E. coli strains (Chalker et al. 1988; Raleigh et al. 1988; Doherty et al. 1993), a small portion of the genome would likely be resistant to cloning or would not yield stable clones. Sequences from long arrays of tandem repeats such as centromeric satellite DNA, rDNA repeats, and some minisatellites could not be assembled perfectly. Note, however, that these limitations apply to both whole-genome shotgun and clone-by-clone sequencing approaches.
The feasibility of whole-genome shotgun sequencing was also supported by the recent success achieved by Venter and colleagues in sequencing three bacterial genomes with sizes ranging from 0.6 to 1.8 Mb (Fleischmann et al. 1995; Fraser et al. 1995; Bult et al. 1996). Neither raw sequence generation, sequence assembly, nor sequence finishing was an impediment to the shotgun sequencing of the bacterial chromosomes. Distances between human STSs are much smaller than the sizes of the bacterial genomes.
Our strategy for whole-genome shotgun sequencing is also entirely consistent with the bacterial artificial chromosome (BAC) end sequencing strategy proposed recently by Venter et al. (1996). Although we feel that large-scale BAC end sequencing would probably not be absolutely required, it would certainly assist in the assembly of the shotgun sequence fragments. BAC clones would likely span some arrays of tandem repeats that are too large for our “long insert” clones.
Advantages of Whole-Genome Shotgun Sequencing
Whole-genome shotgun sequencing of human genomic DNA holds a number of important advantages compared to conventional clone-by-clone sequencing. Foremost among these advantages are detection of large numbers of DNA polymorphisms, more complete and less artifactual coverage of the genome, and improved speed and cost.
A significant fraction of all common human DNA polymorphisms can be detected through shotgun sequencing. Polymorphisms are important because they are used to map genes through linkage analysis (Terwilliger and Ott 1994), to presymptomatically predict disease status (Antonarakis 1989; Weber 1994), to detect submicroscopic chromosomal rearrangements (Lupski et al. 1991), to identify individuals in, for example, paternity and forensic testing (Hagelberg et al. 1991; Frigeau and Fourney 1993; Smith 1995; Urquhart et al. 1995), and to study a wide range of biological phenomena such as evolution (Bowcock and Cavalli-Sforza 1991; Bowcock et al. 1994; Jorde et al. 1995), population biology (Edwards et al. 1992; Deka et al. 1995; Morell et al. 1995), and recombination (Tanzi et al. 1992; Weber et al. 1993). Polymorphisms within coding and regulatory elements are also the source of relative risk for many common diseases. Common variants of the apolipoprotein E gene on chromosome 19, for example, strongly influence an individual’s risk of developing late onset Alzheimer’s disease (Saunders et al. 1993; Kamboh 1995; Kamboh et al. 1995). Many highly informative human DNA polymorphisms based on short tandem repeats have already been identified, but the vast majority of the much more frequent biallelic base substitution and short insertion/deletion polymorphisms remain unknown (Kwok et al. 1994, 1996). Although allele frequencies vary widely, most human DNA polymorphisms are common to all populations (Bowcock and Cavalli-Sforza 1991; Jorde et al. 1995; Bowcock et al. 1994; Deka et al. 1995; Edwards et al. 1992; Morell et al. 1995).
DNA polymorphisms would not usually be detected through clone-by-clone sequencing because only one variant for each genomic region would be sampled. If the genome is sequenced through the clone-by-clone approach, then much additional funding would be required to identify the polymorphisms at a later date and many years would be lost. Calculation of the exact fraction of polymorphisms that would be identified through whole-genome shotgun sequencing requires a distribution of polymorphisms as a function of informativeness, which is not yet known. However by generating 6 billion nucleotides of raw sequence from each of five unrelated individuals, it can be calculated that ∼65% of all 20% heterozygosity biallelic polymorphisms and >99% of all 80% multiallelic polymorphisms would, for example, be detected. To optimize polymorphism detection, DNA should ideally be sequenced from donors with widely differing geographic ancestry.
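The general form of such a calculation can be sketched for the biallelic case. The model below is a deliberate simplification and is not the calculation used to obtain the percentages quoted above: it assumes Hardy-Weinberg proportions, treats every read covering a site as an independent draw from the population of chromosomes, and ignores the finite number of donors and the random variation in coverage. It is therefore not expected to reproduce those figures exactly. The function names are hypothetical.

```python
import math


def minor_allele_frequency(heterozygosity):
    """Minor allele frequency q of a biallelic site with heterozygosity 2q(1 - q)."""
    if not 0.0 <= heterozygosity <= 0.5:
        raise ValueError("biallelic heterozygosity must lie between 0 and 0.5")
    return (1.0 - math.sqrt(1.0 - 2.0 * heterozygosity)) / 2.0


def prob_both_alleles_seen(heterozygosity, n_reads):
    """Probability that n independent reads include both alleles at least once."""
    q = minor_allele_frequency(heterozygosity)
    p = 1.0 - q
    # Detection fails only if every read shows the same allele.
    return 1.0 - p ** n_reads - q ** n_reads


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, round(prob_both_alleles_seen(0.20, n), 3))
```

The sketch makes the qualitative point plain: the chance of sampling both alleles rises quickly with the number of reads covering a site, and rises faster for more informative polymorphisms.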
Sequencing errors would likely be encountered much more frequently in whole-genome shotgun sequencing than true polymorphisms. Sequencing error rates would likely be at least 1%, whereas the rate of polymorphisms would likely be on the order of 0.1%. Although confirmation may be necessary in many cases, several factors would allow many of the polymorphisms to be identified despite the background of sequencing errors. True polymorphisms would often have multiple sequence reads per allele, true polymorphisms would usually have high-quality values attached to each allele, and true polymorphisms do not occur randomly throughout the genome. Specific sequence features will spotlight polymorphisms. For example, it has been known for many years that CpG dinucleotides are more commonly polymorphic than other dinucleotides (Schumm et al. 1988; Deininger and Batzer 1993; Becker et al. 1996; Sommer and Ketterling 1996).
Rearrangements in the large insert contig clones and biases in the coverage of these clones will, to a large degree, be eliminated by whole-genome shotgun sequencing. Many of the cosmid clones projected for use in sequencing have been developed from hybrid tissue culture cell lines which, themselves, have been propagated for many cell generations. Rearrangements and artifacts have undoubtedly been introduced into the cloned material during this process. Although BACs/PACs (P1-derived artificial chromosomes) appear to be more stable than cosmids, artifacts such as chimeras and deletions still occur at a significant frequency (Kim et al. 1996; Boysen et al. 1997). By starting with total human genomic DNA, many of these artifacts will be eliminated. The cosmid or BAC/PAC assemblies will also likely exclude at least some long arrays of tandem repeats. The genome will be more equally represented with shotgun sequencing using small inserts. In addition, overlaps between large insert clones will lead to largely unproductive duplicative sequencing or to the expenditure of resources to avoid this duplication.
Whole-genome shotgun sequencing would also be less expensive and therefore faster than the clone-by-clone approach. The steps of preparation, mapping, storage, and tracking of tens of thousands of sequence-ready large-insert clones; parallel generation, storage and tracking of subclones for each of the large insert clones; and avoidance of large-insert clone overlap would be entirely eliminated with shotgun sequencing. The processes of sequence assembly and sequence finishing could be carried out much more efficiently in central facilities. Reducing the process of DNA sequencing to the core task of raw sequence generation would also allow efforts to be focused on driving down the costs of a few relatively straightforward procedures in large factory-like operations. With shotgun sequencing there would be no need to wait for expensive, sequence-ready large-insert clone assemblies to be generated and no need to sequence one chromosome or one chromosomal segment at a time. To date, no one has generated overlapping cosmid or BAC/PAC assemblies that span even significant portions of human chromosomes without many gaps (Ashworth et al. 1995; Doggett et al. 1995). Perhaps this can be accomplished eventually but only through great effort, time, and cost. The assertion that collection of large-insert templates for sequencing is trivial is simply wrong. Although initiation of genome-wide sequence assembly would probably not be worthwhile until ∼2.5-fold sequence coverage was obtained, completion of partial cDNA sequences, identification of regulatory regions, definition of intron/exon boundaries, and identification of polymorphisms are all tasks that could be undertaken continuously from the start of shotgun sequence generation. The large number of laboratories worldwide undertaking positional cloning projects, for example, could utilize the shotgun sequences from the outset.
Estimating the actual costs of human genomic sequencing is certainly hazardous. Nevertheless, our best effort is summarized in Table 5. Assuming optimistically that clone-by-clone sequencing of human DNA can be completed for $0.30 per finished base, and assuming that sequencing is completed by the end of the year 2003, an average cost per year of $130 million is calculated. Assuming conservatively a cost of $0.01 for generation of a single base of raw sequence, spending of $130 million per year would give 10-fold coverage by about the end of the millennium with $90 million remaining for software development and computer assembly. Filling gaps and resolving uncertainties would add additional costs to whole-genome shotgun sequencing in the next century.
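The arithmetic behind these estimates can be laid out explicitly. In the sketch below, the per-base costs and the 10-fold coverage come from the text; the 3-Gb genome size, the 7-year span for clone-by-clone sequencing (1997 through 2003), and the 3-year span for shotgun sequencing (1997 through 1999) are assumptions made to reproduce the rounded figures, and the function names are hypothetical.

```python
GENOME_SIZE = 3_000_000_000  # bases (assumed)


def clone_by_clone_cost(cost_per_finished_base=0.30, genome_size=GENOME_SIZE):
    """Total cost of finished sequence for one copy of the genome."""
    return genome_size * cost_per_finished_base


def shotgun_raw_cost(coverage=10, cost_per_raw_base=0.01, genome_size=GENOME_SIZE):
    """Total cost of generating raw shotgun sequence at the given coverage."""
    return coverage * genome_size * cost_per_raw_base


if __name__ == "__main__":
    total = clone_by_clone_cost()
    per_year = total / 7            # assumed: 1997 through 2003
    print(f"clone-by-clone: ${total / 1e6:.0f}M total, ${per_year / 1e6:.0f}M per year")

    budget = 3 * 130e6              # assumed: three years at $130 million per year
    raw = shotgun_raw_cost()
    print(f"shotgun raw sequence: ${raw / 1e6:.0f}M, "
          f"remaining: ${(budget - raw) / 1e6:.0f}M")
```

Under these assumptions, clone-by-clone sequencing costs $900 million in total, or roughly $130 million per year, whereas 10-fold raw shotgun coverage costs $300 million and leaves $90 million of a three-year budget for software development and assembly.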
Table 5. Costs of Human Genomic Sequencing
We assert that the goals listed in Table 1 are the true motivation for sequencing the human genome, not the accomplishment of some arbitrary, mythical goal of 99.99% accuracy of a single, artifactual (in places) and nonrepresentative copy of the genome. Most research laboratories, both public and private, want discrete genomic sequence information, and they want it as early as possible. They are interested in information such as the intron/exon structure of specific genes, the polymorphisms that may occur in specific coding and regulatory sequences, and lists of coding sequences that lie within specific chromosomal intervals. The sooner this critical information is available, the sooner it can be applied to accelerating research progress. Americans spend ∼$35 billion per year, public and private, on biomedical research (Silverstein et al. 1995). If the efficiency of this research is improved by even 1%, and this is probably a gross underestimate, then savings would be $350 million per year, far more than the cost of sequencing. Whole-genome shotgun sequencing will allow these savings to be realized far sooner than with clone-by-clone sequencing. We should generate as much of the critical sequence information as rapidly as possible and leave cleanup of gaps and problematic regions for future years.
It is not too late to change strategies for sequencing the human genome. Only a few percent of the sequence has been generated at this time. Even if the human genome is not sequenced via the shotgun approach, there are still many other large genomes that will be sequenced in the future, including many agriculturally important species. It will likely be too expensive to sequence other large genomes via the clone-by-clone approach. A possible general strategy for sequencing other large genomes would be a random cDNA sequencing project, followed possibly by some radiation hybrid physical mapping of the ESTs, followed by whole-genome shotgunning.
About a decade ago, when the Genome Project was just being contemplated, Fred Blattner proposed whole-genome shotgun sequencing of both the E. coli and human genomes. His proposals were neglected. Today, no one considers for a moment sequencing bacterial genomes by any method other than whole-genome shotgun sequencing. Even at several dollars per finished base the human sequence is probably one of the greatest bargains in human history. We laud efforts now under way in several large sequencing centers to generate human genomic sequence. The reality, however, is that research dollars are always limited. We should sequence the human and other eukaryotic genomes using the most rapid, cost effective, and productive strategy.
Footnotes
3 Corresponding author. E-MAIL weberj{at}mfldclin.edu; FAX (715) 389-3808.
Cold Spring Harbor Laboratory Press



