Structure of UniGene Data Set Before and After CRAW Treatment Is Applied
| UniGene 101 Structure | ||
| 1. 11,751 | multipass/full-length gene sequence | |
| 2. 213,885 | 3′ ESTs | |
| 3. 270,012 | 5′ ESTs | |
| 4. 26,995 | other ESTs | |
| 5. 522,643 | total number of sequences subjected to clustering | (= 1 + 2 + 3 + 4) |
| 6. 45,918 | number of index groups resulting from UniGene clustering | |
| Index Structure of UniGene After Treatment with CRAW Analysis | ||
| 7. 9,671 | UniGene clusters that are singletons | (containing 9,671 sequences) |
| 8. 359 | UniGene clusters ignored by this study | (containing 106,810 sequences) |
| 9. 35,888 | remaining UniGene clusters subjected to analysis | (containing 406,162 sequences) |
| 10. 41,268 | non-singleton subgroups resulting from CRAW analysis | |
| 11. 58,070 | singleton subgroups after treatment | (includes 9,671 singleton UniGene clusters from 7) |
| 12. 13.96% | percent singleton sequences | [(100 * 11)/415,833, where 415,833 = 406,162 + 9,671 input sequences] |
| Index Structure of TIGR Gene Index v. 2.3 | ||
| 13. 619,528 | ESTs | |
| 14. 6,635 | HTs | |
| 15. 626,163 | total number of sequences subjected to indexing | (= 13 + 14) |
| 16. 41,268 | THCs (non-singletons) | |
| 17. 135,140 | singleton sequences | |
| 18. 21.81% | percent singleton sequences | [(100 * 17)/15] |
[i] UniGene 101 contained 11,751 gene sequences, 213,885 3′ ESTs, 270,012 5′ ESTs, and 26,995 EST sequences not classified as 3′ or 5′. Of the 45,918 UniGene 101 clusters, 9,671 were singletons (containing only one transcript). Of the larger clusters, 359 (containing 106,810 sequences) were excluded from our analysis. The remaining 35,888 (= 45,918 − 9,671 − 359) clusters, containing 406,162 sequences, were subjected to our processing, and from these the CRAW analysis generated 99,338 subgroups, of which 58,070 (including the 9,671 singleton UniGene clusters) were singleton subgroups. A total of 415,833 sequences (= 406,162 + 9,671) were input into our analysis, so we measure a fragmentation rate of 13.96% (percent of sequences isolated from subgroups = 58,070 × 100/415,833).
[ii] For comparison, structural information on the TIGR Gene Index is included. The TIGR Gene Index takes 626,163 sequences as input and yields 135,140 singletons, a fragmentation rate of 21.81%. The lower fragmentation rate of the CRAW-processed UniGene 101 is suggestive; however, the comparison is not rigorous because the initial data sets differ and our analysis ignores 359 of the initial 45,918 UniGene clusters.
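The bookkeeping behind the table can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the variable names and the `fragmentation_rate` helper are assumptions introduced here, and every count is copied from the table and footnotes. With the counts as quoted, the UniGene/CRAW rate reproduces 13.96%, while the TIGR counts give roughly 21.58% rather than the reported 21.81%. The table does not indicate which figure accounts for that difference.

```python
# Counts copied from the table; all names here are illustrative assumptions.
unigene_inputs = {
    "gene": 11_751,        # row 1: multipass/full-length gene sequences
    "est_3prime": 213_885, # row 2
    "est_5prime": 270_012, # row 3
    "est_other": 26_995,   # row 4
}
total_sequences = sum(unigene_inputs.values())  # row 5

clusters_total = 45_918     # row 6
clusters_singleton = 9_671  # row 7
clusters_ignored = 359      # row 8
clusters_analyzed = clusters_total - clusters_singleton - clusters_ignored  # row 9

seqs_in_analyzed = 406_162  # sequences in the analyzed clusters (row 9)
seqs_input = seqs_in_analyzed + clusters_singleton  # denominator used in footnote [i]

subgroups_total = 99_338      # footnote [i]
subgroups_singleton = 58_070  # row 11
subgroups_non_singleton = subgroups_total - subgroups_singleton  # row 10


def fragmentation_rate(singletons: int, total: int) -> float:
    """Percent of input sequences left as singletons."""
    return 100.0 * singletons / total


craw_rate = fragmentation_rate(subgroups_singleton, seqs_input)
tigr_rate = fragmentation_rate(135_140, 619_528 + 6_635)  # rows 17 and 15

print(f"UniGene total sequences: {total_sequences:,}")
print(f"Clusters analyzed:       {clusters_analyzed:,}")
print(f"CRAW fragmentation rate: {craw_rate:.2f}%")
print(f"TIGR fragmentation rate: {tigr_rate:.2f}%")
```

Using 415,833 rather than 406,162 as the denominator matters: the 9,671 singleton UniGene clusters are counted among the 58,070 singleton subgroups, so they must also appear in the total for the ratio to be consistent.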