RePS: A Sequence Assembler That Masks Exact Repeats Identified from the Shotgun Data

Table 1. Software Performance

Data set                    Human 4×          Human 4×+2×       Rice 4.2×
Target region (Mb)          11.9              11.9              430
Masked sequence             17.2%             17.2%             42.2%
Number of contigs by LW     2,018             462               59,512
Unmasked Phrap
  Max. memory use (Gb)      3.085             x                 x
  Computer time (h)         48                x                 x
  Number of contigs         2,703             x                 x
  N50 contig size (kb)      7.05              x                 x
  Phrap error estimate      0.099% (0.086%)   x                 x
  BAC discrepancies         0.066% (0.063%)   x                 x
  Contig misassembly        5.77%             x                 x
Repeat-masked Phrap
  Max. memory use (Gb)      0.614             1.040             50
  Computer time (h)         1.8               3.4               79
  Number of contigs         3,536             2,219             167,975
  N50 contig size (kb)      5.35              11.12             3.41
  Phrap error estimate      0.091% (0.13%)    0.043% (0.096%)   0.129% (0.145%)
  BAC discrepancies         0.077% (0.076%)   0.044% (0.059%)   0.52% (0.78%)
  Contig misassembly        0.51%             0.68%             0.71%
Repeat-gap closure
  Max. memory use (Gb)      0.007             0.007             2
  Computer time (h)         2.0               3.0               50
  Number of contigs         3,181             1,810             127,550
  N50 contig size (kb)      6.13              14.51             6.69
  Phrap error estimate      0.09% (0.108%)    0.041% (0.076%)   0.111% (0.103%)
  BAC discrepancies         0.075% (0.065%)   0.042% (0.05%)    0.54% (0.73%)
  Contig misassembly        1.1%              1.33%             1.85%
Scaffold construction
  Max. memory use (Gb)      0.035             0.08              1.3
  Computer time (h)         0.05              0.07              2
  Number of scaffolds       2,284             750               103,044
  N50 scaffold size (kb)    10.61             196.80            11.76
  Phrap error estimate      0.09% (0.108%)    0.041% (0.076%)   0.111% (0.103%)
  BAC discrepancies         0.075% (0.065%)   0.042% (0.05%)    0.54% (0.73%)
  Scaffold misassembly      0%                0.13%             0%
  • There are two human data sets, at 4× and 4×+2× coverage. The clone-insert size is 2 kb for the first 4× and 15 kb for the additional 2× in the 4×+2× data set. The rice data set is discussed in another paper (Yu et al. 2002). We list the total size of the target region and the fraction of the shotgun sequence masked as exact 20-mer repeats determined from the shotgun data. Statistics are listed after each RePS stage: repeat-masked Phrap, repeat-gap closure, and scaffold construction. Computations were done on a Sun E10K, employing only 1 of the 64 CPUs for the human data but 40 of the 64 CPUs for the rice data. Lander-Waterman (LW) numbers assume a 26-bp minimum detectable overlap, based on Phrap's minscore setting. N50 contig or scaffold sizes are the sizes above which 50% of the assembled sequence can be found. Single-base error rates are computed separately for unique and repeated (in parentheses) sequence. Phrap-derived error estimates are compared with measurements based on alignments with finished BACs. Misassembly rates are defined as the number of bad contigs (or scaffolds) divided by the total number of contigs (or scaffolds). Note that interleaving scaffold problems are counted as bad in our definition of scaffold misassembly.
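The summary statistics defined in the note above can be illustrated with a minimal Python sketch. It is not the RePS code. The function names (`n50`, `misassembly_rate`, `lander_waterman_contigs`) are hypothetical, and the Lander-Waterman function uses the standard closed-form expectation N·exp(-c·σ), which may differ in detail from the calculation behind the table. The read count and read length it takes as inputs are not given in the table, so they are left as parameters.

```python
import math


def n50(sizes):
    """Size above which 50% of the assembled sequence is found.

    Sort sizes from largest to smallest and return the size at which
    the running total first reaches half of the total assembled length.
    """
    ordered = sorted(sizes, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for size in ordered:
        running += size
        if running >= half_total:
            return size
    raise ValueError("sizes must be non-empty")


def misassembly_rate(bad, total):
    """Number of bad contigs (or scaffolds) divided by the total number."""
    return bad / total


def lander_waterman_contigs(n_reads, read_len, target_len, min_overlap=26):
    """Expected contig count under the Lander-Waterman model.

    Assumption: uniform read length `read_len` (not stated in the table).
    `min_overlap` defaults to the 26-bp minimum detectable overlap
    mentioned in the note.
    """
    coverage = n_reads * read_len / target_len   # c = N * L / G
    sigma = 1 - min_overlap / read_len           # usable fraction of a read
    return n_reads * math.exp(-coverage * sigma)


# Five contigs totalling 28 kb: the two largest (10 + 8) cover half.
print(n50([10, 8, 5, 3, 2]))  # prints 8
```

Sorting in descending order makes the N50 a single pass over the sizes. The same function applies unchanged to scaffold sizes.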

This Article

  1. Genome Res. 12: 824-831
