(A) Length of consensus sequences produced by CRAW assembly/analysis of 5′ and 3′ ESTs from UniGene98. The x-axis denotes the number of sequences in the UniGene cluster; the y-axis represents consensus length. Assembling between 10 and 15 ESTs doubles the length of the resulting contig on average. Assemblies made from clusters containing >45 ESTs yield contigs that are 400% longer than unassembled sequences. The effective assembly length approaches the actual gene length in UniGene101: the sequences classified as multipass/full-length have an average length (♦) of 2102 bases and a median length (▴) of 1695 bases. (B) Length of the maximal ORF after CRAW assembly/analysis of 5′ ESTs from UniGene98 clusters. The longest ORF of the resulting consensus sequence (in residues) is plotted against the number of 5′ sequences in the cluster. The axes are as in A. The effective ORF size generated from EST fragments easily surpasses 50% of the maximal ORF length of full-length genes: the sequences classified as multipass or full-length in UniGene101 have an average maximal ORF length (♦) of 478 residues and a median length (▴) of 367 residues. The improvement shown results from both the assembly of ESTs into longer contigs and the correction of insertion and deletion errors using sequence redundancy.
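
The maximal ORF length plotted in B can be illustrated with a minimal sketch. The caption does not state how an ORF was delimited, so the code below makes two labeled assumptions: an ORF is any run of consecutive codons without a stop codon (no start codon is required, since EST-derived contigs may lack the 5′ end of the coding region), and all six reading frames are scanned. The function names are hypothetical and are not part of CRAW.

```python
# Illustrative sketch only; not the CRAW implementation.
# Assumption: an ORF is a stop-free run of codons (standard genetic code),
# and its length is reported in residues (codons).

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, frame):
    """Longest run of consecutive non-stop codons in one reading frame."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0  # a stop codon ends the current open stretch
        else:
            current += 1
            best = max(best, current)
    return best


def maximal_orf_length(seq):
    """Maximal ORF length in residues over all six reading frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return max(longest_orf_in_frame(s, f) for s in strands for f in range(3))
```

Under these assumptions, a frameshift caused by an insertion or deletion error moves the reading frame onto out-of-frame stop codons and truncates the measured ORF. This is consistent with the caption's point that error correction, and not only contig length, contributes to longer ORFs.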

