The Atlas Genome Assembly System

Table 3.

Quality Assessment Measurements at Various Stages of Atlas Assembly


Reads

BCM trace-quality (TQ) analysis on all traceruns; primary feedback to production group.
BCM-cross-repeat scan for wrong-organism repeats (similar to RepeatMasker-primspec).
All reads scanned for first and last 50-base window with no contaminant matches and <1.25 expected errors.
Head and tail beyond the windows trimmed off. Remaining insert required to have 100 bases of Phred quality ≥20 (for WGS) or 50 such bases (for BAC reads).
Trimmed reads used to compute oligo frequencies (32-mers) over all WGS sequence; only oligos with frequency ≤12 (∼3 times the coverage) used to seed overlaps.
Untrimmed BAC and WGS reads used in assembly (masked for contaminant and vector); a WGS read must have passed quality or be the mate of a passed and fished read.
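
The read-level filters above can be illustrated with a minimal Python sketch. It is not the Atlas code: the function names are hypothetical, contaminant matching is omitted, and the expected error count of a window is assumed to be the sum of the per-base error probabilities implied by the Phred scores. Only the window length (50), the error limit (1.25), the Phred threshold (20), the oligo length (32), and the frequency cap (12) come from the table.

```python
from collections import Counter

WINDOW = 50                 # window length, from the table
MAX_EXPECTED_ERRORS = 1.25  # per-window limit, from the table


def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def trim_bounds(quals, window=WINDOW, max_err=MAX_EXPECTED_ERRORS):
    """Return (start, end) of the region to keep, or None if no window passes.

    start is the left edge of the first passing window; end is the
    (exclusive) right edge of the last passing window. Contaminant
    screening is not modeled here.
    """
    passing = [
        i for i in range(len(quals) - window + 1)
        if expected_errors(quals[i:i + window]) < max_err
    ]
    if not passing:
        return None
    return passing[0], passing[-1] + window


def passes_quality(quals, min_good_bases, threshold=20):
    """True if the trimmed insert has enough bases with Phred >= threshold.

    min_good_bases would be 100 for WGS reads and 50 for BAC reads.
    """
    bounds = trim_bounds(quals)
    if bounds is None:
        return False
    start, end = bounds
    good = sum(1 for q in quals[start:end] if q >= threshold)
    return good >= min_good_bases


def seed_oligos(reads, k=32, max_freq=12):
    """Count k-mers over all reads; keep those with frequency <= max_freq."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n <= max_freq}
```

For example, `passes_quality(quals, 100)` would apply the WGS criterion and `passes_quality(quals, 50)` the BAC criterion to one read's list of Phred scores. A production oligo counter would also need to treat an oligo and its reverse complement as the same sequence, which this sketch does not do.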

eBAC assembly internal checks

Post-Phrap, paired ends used to split and trim contigs which are inconsistent internally or cannot be consistently scaffolded.

BAC purification QC

Each tracerun (96-well group) checked for coassembly against other traceruns.
Coassembly indicated by participation in same contigs (one test) or in same scaffold (for comparison).
Groups of traceruns not coassembling with the bulk of a project are pulled and, if comprising ≥200 passed reads, placed in a “synthetic project” (or relocated to their correct original project if possible, based on both sequence similarity and lab tracking proximity).
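
As a rough illustration of the contig-based coassembly test, the sketch below scores each tracerun by the fraction of its placed reads that share a contig with reads from another tracerun. It makes several assumptions not found in the table: the function names and input dictionaries are hypothetical, the 0.5 cutoff for "not coassembling with the bulk" is invented for the example, and each tracerun is judged on its own rather than in groups. Only the 200-read limit comes from the source.

```python
from collections import defaultdict

SYNTHETIC_MIN_READS = 200   # from the table
MIN_SHARED_FRACTION = 0.5   # hypothetical cutoff, not from the source


def coassembly_fractions(read_contig, read_tracerun):
    """For each tracerun, the fraction of its placed reads that sit in
    contigs also containing reads from at least one other tracerun."""
    runs_in_contig = defaultdict(set)
    for read, contig in read_contig.items():
        runs_in_contig[contig].add(read_tracerun[read])

    placed = defaultdict(int)
    shared = defaultdict(int)
    for read, contig in read_contig.items():
        run = read_tracerun[read]
        placed[run] += 1
        if len(runs_in_contig[contig]) > 1:
            shared[run] += 1
    return {run: shared[run] / placed[run] for run in placed}


def classify_traceruns(read_contig, read_tracerun, passed_reads):
    """Label each tracerun 'keep', 'synthetic project', or 'pull'."""
    labels = {}
    fractions = coassembly_fractions(read_contig, read_tracerun)
    for run, frac in fractions.items():
        if frac >= MIN_SHARED_FRACTION:
            labels[run] = "keep"
        elif passed_reads.get(run, 0) >= SYNTHETIC_MIN_READS:
            labels[run] = "synthetic project"
        else:
            labels[run] = "pull"
    return labels
```

The same scoring could be repeated with scaffold membership in place of contig membership to obtain the comparison test mentioned above.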

Bactigging QC

Linearized sequence for each enriched BAC scaffold BLASTZ'd against others (rendered efficient by prefilter for shared WGS reads).
Enriched BACs with excess overlaps flagged for closer examination in BAC purification.
Bactigs reassembled, scaffolded into superbactigs, laid out by markers. Adjacent bactigs whose terminal BACs had low-confidence overlaps are re-examined for overlaps and joined if confirmed.

Mapping QC

Markers, mouse synteny, human synteny, and FPC all examined simultaneously along with superbactig data primarily driven by BAC ends; feedback between assembly layout of BACs and FPC mapping group.

Overall checks

Alignment with finished BACs (and multi-BAC regions from NISC); dot-plotting (BLASTZ and atlas-dot). Alignment and scoring using MUMmer.
Large-scale alignment with Mouse; dot-plotting (BLASTZ and atlas-dot); see Mapping QC.
Duplications and collapses:
Oligo analysis (24-mers): check for regions overrepresented in assembly (artifactual duplications), especially at bactig and superbactig boundaries; a small number of cases (<6) found and corrected.
Approximately 4% of unique WGS oligomers missing in final assembly (as compared with 1% in Mouse). Oligomers with frequency 20-50 underrepresented in Mouse (by 10% to 20%). Oligomer representation in Rnor 3.1 consistently ∼96% beyond frequency = 100.

cDNA and EST alignment consistency.
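
The oligo-based duplication and collapse checks listed under "Overall checks" can be sketched in Python as follows. This is an illustrative reading, not the Atlas implementation: the function names are hypothetical, "unique" is taken to mean an oligo seen exactly once in the reads, reverse complements are ignored, and the `coverage` and `slack` parameters are assumptions introduced for the example.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer over a collection of sequences."""
    return Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))


def missing_unique_fraction(reads, assembly, k=24):
    """Fraction of k-mers seen exactly once in the reads that occur
    nowhere in the assembly (a rough indicator of collapsed or
    missing sequence)."""
    read_counts = kmer_counts(reads, k)
    asm_kmers = set(kmer_counts(assembly, k))
    unique = [kmer for kmer, n in read_counts.items() if n == 1]
    if not unique:
        return 0.0
    missing = sum(1 for kmer in unique if kmer not in asm_kmers)
    return missing / len(unique)


def overrepresented_kmers(reads, assembly, k=24, coverage=4.0, slack=2.0):
    """k-mers whose copy number in the assembly exceeds `slack` times
    the copy number implied by read frequency / coverage (candidate
    artifactual duplications). Both parameters are assumptions."""
    read_counts = kmer_counts(reads, k)
    asm_counts = kmer_counts(assembly, k)
    return {
        kmer for kmer, n in asm_counts.items()
        if n > slack * max(read_counts[kmer] / coverage, 1.0)
    }
```

In practice the flagged oligos would be mapped back to assembly coordinates so that clusters near bactig and superbactig boundaries can be inspected.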

This Article

  1. Genome Res. 14: 721-732
