Table 1.

Software Performance

There are two human data sets, at 4× and 4×+2× coverage. The clone-insert size is 2 kb for the first 4×; in the 4×+2× data set, it is 15 kb for the additional 2×. The rice data set is discussed in another paper (Yu et al. 2002). We list the total size of the target region and the fraction of the shotgun sequence masked by exact 20-mer repeats determined from the shotgun data. Statistics are listed after each RePS stage: repeat-masked Phrap, repeat-gap closure, and scaffold construction. Computations were done on a Sun E10K, using only 1 of the 64 CPUs for the human data but 40 of the 64 CPUs for the rice data. Lander-Waterman numbers assume a 26-bp minimum detectable overlap, based on Phrap's minscore setting. N50 contig or scaffold sizes are the sizes above which 50% of the assembled sequence can be found. Single-base error rates are computed separately for unique and repeated (in parentheses) sequence. Phrap-derived error estimates are compared to measurements based on alignments with finished BACs. Misassembly rates are defined as the number of bad contigs (or scaffolds) divided by the total number of contigs (or scaffolds). Note that interleaving scaffold problems are counted as bad in our definition of scaffold misassembly.
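As an illustration of two statistics named in this caption, the following is a minimal Python sketch, not the code used to produce the table. The function names, the sample contig lengths, and the read count, read length, and target size in the usage lines are hypothetical. The Lander-Waterman function uses the standard expected-contig formula, N·exp(−cσ) with c = NL/G and σ = 1 − T/L, where T is the minimum detectable overlap (26 bp here).

```python
import math


def n50(lengths):
    """Return the N50: the largest size s such that contigs (or scaffolds)
    of size >= s contain at least 50% of the total assembled sequence."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down, accumulating assembled bases.
    for size in sorted(lengths, reverse=True):
        running += size
        if 2 * running >= total:  # integer test for "at least half"
            return size


def lander_waterman_contigs(n_reads, read_len, target_size, min_overlap=26):
    """Expected number of contigs under the Lander-Waterman model.

    min_overlap defaults to the 26-bp minimum detectable overlap quoted in
    the caption; all other arguments are supplied by the caller.
    """
    coverage = n_reads * read_len / target_size  # c = N * L / G
    sigma = 1.0 - min_overlap / read_len         # sigma = 1 - T / L
    return n_reads * math.exp(-coverage * sigma)


# Hypothetical usage: contig lengths in bp, then a 5x shotgun of a 100-kb target.
print(n50([5, 4, 3, 2, 1]))
print(lander_waterman_contigs(n_reads=1000, read_len=500, target_size=100_000))
```

Sorting once and accumulating from the largest contig keeps the N50 computation at O(n log n) and matches the "sizes above which 50% of the assembled sequence can be found" wording directly.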

RePS: A Sequence Assembler That Masks Exact Repeats Identified from the Shotgun Data
