Genome Research

Table 1.

Software Performance

	Human 4×	Human 4× + 2×	Rice 4.2×
Target region (Mb)	11.9	11.9	430
Masked sequence	17.2%	17.2%	42.2%
Number of contigs by LW	2018	462	59,512
Unmasked Phrap
Max. memory use (Gb)	3.085	x	x
Computer time (h)	48	x	x
Number of contigs	2703	x	x
N50 contig size (kb)	7.05	x	x
Phrap error estimate	0.099% (0.086%)	x	x
BAC discrepancies	0.066% (0.063%)	x	x
Contig misassembly	5.77%	x	x
Repeat-masked Phrap
Max. memory use (Gb)	0.614	1.040	50
Computer time (h)	1.8	3.4	79
Number of contigs	3536	2219	167,975
N50 contig size (kb)	5.35	11.12	3.41
Phrap error estimate	0.091% (0.13%)	0.043% (0.096%)	0.129% (0.145%)
BAC discrepancies	0.077% (0.076%)	0.044% (0.059%)	0.52% (0.78%)
Contig misassembly	0.51%	0.68%	0.71%
Repeat-gap closure
Max. memory use (Gb)	0.007	0.007	2
Computer time (h)	2.0	3.0	50
Number of contigs	3181	1810	127,550
N50 contig size (kb)	6.13	14.51	6.69
Phrap error estimate	0.09% (0.108%)	0.041% (0.076%)	0.111% (0.103%)
BAC discrepancies	0.075% (0.065%)	0.042% (0.05%)	0.54% (0.73%)
Contig misassembly	1.1%	1.33%	1.85%
Scaffold construction
Max. memory use (Gb)	0.035	0.08	1.3
Computer time (h)	0.05	0.07	2
Number of scaffolds	2284	750	103,044
N50 scaffold size (kb)	10.61	196.80	11.76
Phrap error estimate	0.09% (0.108%)	0.041% (0.076%)	0.111% (0.103%)
BAC discrepancies	0.075% (0.065%)	0.042% (0.05%)	0.54% (0.73%)
Scaffold misassembly	0%	0.13%	0%

[i] There are two human data sets, at coverage 4× and 4×+2×. The clone-insert size is 2 kb for the first 4×. In the 4×+2× data set, the clone-insert size is 15 kb for the last 2×. The rice data set is discussed in another paper (Yu et al. 2002). We list the total size of the target region and the fraction of the shotgun sequence masked as exact 20-mer repeats, as determined from the shotgun data. Statistics are listed after each RePS stage: repeat-masked Phrap, repeat-gap closure, and scaffold construction. Computations were done on a Sun E10K, using only 1 of the 64 CPUs for the human data but 40 of the 64 CPUs for the rice data. Lander-Waterman (LW) numbers assume a 26-bp minimum detectable overlap, based on Phrap's minscore setting. The N50 contig or scaffold size is the size at or above which 50% of the assembled sequence is found. Single-base error rates are computed separately for unique and repeated sequence, with the repeated-sequence value in parentheses. Phrap-derived error estimates are compared to measurements based on alignments with finished BACs. Misassembly rates are defined as the number of bad contigs (or scaffolds) divided by the total number of contigs (or scaffolds). Note that scaffolds with interleaving problems are counted as bad in our definition of scaffold misassembly.
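
The summary statistics defined in the footnote can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce Table 1. The function names are hypothetical. The Lander-Waterman function uses the standard textbook expression N·exp(-cσ), with coverage c = NL/G and σ = 1 - T/L. The read length L is not stated in the table, so it is left as a parameter the caller must supply. Only the 26-bp minimum overlap T is taken from the footnote.

```python
import math


def n50(lengths):
    """Size at or above which 50% of the assembled sequence is found."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the largest contig down until half the total is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def misassembly_rate(bad, total):
    """Number of bad contigs (or scaffolds) divided by the total number."""
    if total <= 0:
        raise ValueError("total must be positive")
    return bad / total


def lander_waterman_contigs(n_reads, read_len, target_len, min_overlap=26):
    """Expected contig count under the Lander-Waterman model.

    read_len is an assumption supplied by the caller; min_overlap
    defaults to the 26 bp quoted in the footnote.
    """
    coverage = n_reads * read_len / target_len   # c = N * L / G
    sigma = 1 - min_overlap / read_len           # sigma = 1 - T / L
    return n_reads * math.exp(-coverage * sigma)


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # made-up lengths, in kb
    print("N50:", n50(toy_contigs))
    print("Misassembly rate:", misassembly_rate(3, 200))
```

On the toy list, the total is 54 kb, and the three largest contigs (10, 9, and 8 kb) together reach half of that, so the N50 is 8 kb. The lengths and counts in the usage example are invented for illustration and do not come from the assemblies in Table 1.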