Table 1.

Software Performance

                              Human 4×          Human 4× + 2×     Rice 4.2×
Target region (Mb)            11.9              11.9              430
Masked sequence               17.2%             17.2%             42.2%
Number of contigs by LW       2018              462               59,512
Unmasked Phrap
 Max. memory use (Gb)         3.085             x                 x
 Computer time (h)            48                x                 x
 Number of contigs            2703              x                 x
 N50 contig size (kb)         7.05              x                 x
 Phrap error estimate         0.099% (0.086%)   x                 x
 BAC discrepancies            0.066% (0.063%)   x                 x
 Contig misassembly           5.77%             x                 x
Repeat-masked Phrap
 Max. memory use (Gb)         0.614             1.040             50
 Computer time (h)            1.8               3.4               79
 Number of contigs            3536              2219              167,975
 N50 contig size (kb)         5.35              11.12             3.41
 Phrap error estimate         0.091% (0.13%)    0.043% (0.096%)   0.129% (0.145%)
 BAC discrepancies            0.077% (0.076%)   0.044% (0.059%)   0.52% (0.78%)
 Contig misassembly           0.51%             0.68%             0.71%
Repeat-gap closure
 Max. memory use (Gb)         0.007             0.007             2
 Computer time (h)            2.0               3.0               50
 Number of contigs            3181              1810              127,550
 N50 contig size (kb)         6.13              14.51             6.69
 Phrap error estimate         0.09% (0.108%)    0.041% (0.076%)   0.111% (0.103%)
 BAC discrepancies            0.075% (0.065%)   0.042% (0.05%)    0.54% (0.73%)
 Contig misassembly           1.1%              1.33%             1.85%
Scaffold construction
 Max. memory use (Gb)         0.035             0.08              1.3
 Computer time (h)            0.05              0.07              2
 Number of scaffolds          2284              750               103,044
 N50 scaffold size (kb)       10.61             196.80            11.76
 Phrap error estimate         0.09% (0.108%)    0.041% (0.076%)   0.111% (0.103%)
 BAC discrepancies            0.075% (0.065%)   0.042% (0.05%)    0.54% (0.73%)
 Scaffold misassembly         0%                0.13%             0%

[i] There are two human data sets, at 4× and 4×+2× coverage. The clone-insert size is 2 kb for the first 4×; in the 4×+2× data set, the clone-insert size is 15 kb for the last 2×. The rice data set is discussed in another paper (Yu et al. 2002). We list the total size of the target region and the fraction of the shotgun sequence masked as exact 20-mer repeats, as determined from the shotgun data. Statistics are listed after each RePS stage: repeat-masked Phrap, repeat-gap closure, and scaffold construction. Computations were done on a Sun E10K, using only 1 of the 64 CPUs for the human data but 40 of the 64 CPUs for the rice data. Lander-Waterman (LW) numbers assume a 26-bp minimum detectable overlap, based on Phrap's minscore setting. N50 contig or scaffold sizes are the sizes above which 50% of the assembled sequence can be found. Single-base error rates are computed separately for unique and repeated (in parentheses) sequence. Phrap-derived error estimates are compared with measurements based on alignments to finished BACs. Misassembly rates are defined as the number of bad contigs (or scaffolds) divided by the total number of contigs (or scaffolds). Note that interleaving scaffold problems are counted as bad in our definition of scaffold misassembly.
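
The summary statistics defined in this note can be illustrated with a minimal Python sketch. It is not the code used to produce Table 1. The function names are hypothetical, the Lander-Waterman expression is the standard textbook form for the expected number of contigs, and the 500-bp read length in the usage note is an assumption (the note gives only the 26-bp minimum overlap).

```python
import math


def n50(sizes):
    """Size above which 50% of the assembled sequence is found.

    Pieces are taken from largest to smallest; the N50 is the size of the
    piece at which the running total first reaches half of the grand total.
    """
    ordered = sorted(sizes, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for size in ordered:
        running += size
        if running >= half:
            return size
    raise ValueError("sizes must be non-empty")


def lander_waterman_contigs(target_bp, coverage, read_len, min_overlap):
    """Expected number of contigs under the Lander-Waterman model.

    N * exp(-c * sigma), with N = c * G / L reads and
    sigma = 1 - T / L, where T is the minimum detectable overlap.
    """
    n_reads = coverage * target_bp / read_len
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-coverage * sigma)


def misassembly_rate(n_bad, n_total):
    """Bad contigs (or scaffolds) divided by all contigs (or scaffolds)."""
    return n_bad / n_total
```

For example, `lander_waterman_contigs(11.9e6, 4, 500, 26)` evaluates the model for the human 4× data set under the assumed 500-bp read length. The result is of the same order as the LW entry in the table, but an exact match should not be expected, because the true read-length distribution is not given here.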