Table 1.

Structure of UniGene Data Set Before and After CRAW Treatment Is Applied

UniGene 101 Structure
 1.   11,751  multipass/full-length gene sequences
 2.  213,885  3′ ESTs
 3.  270,012  5′ ESTs
 4.   26,995  other ESTs
 5.  522,643  total number of sequences subjected to clustering (= rows 1 + 2 + 3 + 4)
 6.   45,918  number of index groups resulting from UniGene clustering
Index Structure of UniGene After Treatment with CRAW Analysis
 7.    9,671  UniGene clusters that are singletons (containing 9,671 sequences)
 8.      359  UniGene clusters ignored by this study (containing 106,810 sequences)
 9.   35,888  remaining UniGene clusters subjected to analysis (containing 406,162 sequences)
10.   41,268  non-singleton subgroups resulting from CRAW analysis
11.   58,070  singleton subgroups after treatment (includes the 9,671 singleton UniGene clusters from row 7)
12.   13.96%  percent singleton sequences [(100 × row 11)/415,833, where 415,833 = 406,162 + 9,671]
Index Structure of TIGR Gene Index v. 2.3
13.  619,528  ESTs
14.    6,635  HTs
15.  626,163  total number of sequences subjected to indexing (= rows 13 + 14)
16.   41,268  THCs (non-singletons)
17.  135,140  singleton sequences
18.   21.81%  percent singleton sequences (100 × row 17/row 15)

[i] UniGene 101 contained 11,751 gene sequences, 213,885 3′ ESTs, and 270,012 5′ ESTs, as well as 26,995 EST sequences not classified as 3′ or 5′. Of the total 45,918 UniGene 101 clusters, 9,671 were singletons (contained only one transcript). Of the larger clusters, 359 were excluded from our analysis. The remaining 35,888 (= 45,918 − 9,671 − 359) clusters were subjected to our processing, and from these the CRAW analysis generated 99,338 subgroups, of which 58,070 (including the 9,671 singleton UniGene clusters) were singleton subgroups. A total of 415,833 sequences (406,162 in the analyzed clusters plus the 9,671 singletons) were input into our analysis, so we measure a fragmentation rate of 13.96% (percent of sequences isolated from subgroups = 58,070 × 100/415,833).
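The arithmetic in this footnote can be rechecked with a few lines of Python. This is only a sketch: the figures are copied from Table 1, and the variable names are illustrative rather than part of UniGene or CRAW.

```python
# Figures copied from Table 1 (UniGene 101 rows); variable names are illustrative.
gene_seqs, est_3p, est_5p, est_other = 11_751, 213_885, 270_012, 26_995
total_clusters, singleton_clusters, ignored_clusters = 45_918, 9_671, 359
analyzed_seqs = 406_162          # sequences in the 35,888 analyzed clusters
subgroups_total = 99_338         # subgroups reported in the footnote
singleton_subgroups = 58_070     # includes the 9,671 singleton clusters

# Row 5: total sequences subjected to clustering.
total_seqs = gene_seqs + est_3p + est_5p + est_other

# Row 9: clusters left after removing singletons and ignored clusters.
remaining_clusters = total_clusters - singleton_clusters - ignored_clusters

# Row 10: non-singleton subgroups.
nonsingleton_subgroups = subgroups_total - singleton_subgroups

# Denominator of the fragmentation rate: analyzed sequences plus singletons.
input_seqs = analyzed_seqs + singleton_clusters

# Row 12: percent of input sequences that end up as singletons.
fragmentation_pct = 100 * singleton_subgroups / input_seqs

print(total_seqs, remaining_clusters, nonsingleton_subgroups,
      input_seqs, round(fragmentation_pct, 2))
```

Each derived value can then be compared against the corresponding table row.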

[ii] For comparative purposes, structure information on the TIGR Gene Index is included. The TIGR Gene Index takes 626,163 sequences as input and yields 135,140 singletons, a fragmentation rate of 21.81%. The lower fragmentation rate of the CRAW-processed UniGene 101 is suggestive; however, the comparison is not rigorous, because the initial data sets differ and our analysis ignores 359 of the initial 45,918 UniGene clusters.
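The TIGR figures can be rechecked the same way. The sketch below uses only the numbers in rows 13 to 17, with illustrative variable names, and computes the singleton percentage against two candidate denominators. Dividing by the ESTs alone reproduces the quoted 21.81%, whereas dividing by all input sequences gives a slightly lower value. Which denominator was intended is not stated here, so the sketch reports both.

```python
# Figures copied from Table 1 (TIGR Gene Index rows); names are illustrative.
ests, hts = 619_528, 6_635
singletons = 135_140

# Row 15: total sequences subjected to indexing.
total_input = ests + hts

# Singleton percentage relative to all input sequences (ESTs + HTs).
pct_of_total = 100 * singletons / total_input

# Singleton percentage relative to ESTs only.
pct_of_ests = 100 * singletons / ests

print(total_input, round(pct_of_total, 2), round(pct_of_ests, 2))
```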