Table 1.

Simulated Results Assuming the Genome Given by Observations

| Model | Assumed | Unique, no errors | Unique | Random | Non-random |
|---|---|---|---|---|---|
| A. 9 Base sequences | | | | | |
| Unique tags | 15,720 | 7,994 ± 5 | 11,029 ± 6 | 10,930 ± 6 | 10,427 ± 5 |
| % 1–5 | 64.16 | 38.86 ± 0.02 | 53.63 ± 0.02 | 53.33 ± 0.02 | 51.26 ± 0.02 |
| % 5–50 | 31.0373 | 52.21 ± 0.02 | 40.67 ± 0.02 | 40.26 ± 0.02 | 41.88 ± 0.02 |
| % 50–500 | 4.3815 | 8.17 ± 0.01 | 5.77 ± 0.007 | 5.87 ± 0.007 | 6.28 ± 0.007 |
| % 500–5000 | 0.4212 | 0.76 ± 0.003 | 0.54 ± 0.002 | 0.54 ± 0.002 | 0.57 ± 0.002 |
| % Errors novel | | | 94.0 ± 0.01 | 94.2 ± 0.01 | 84.6 ± 0.3 |
| % Unique genes | | 100 ± 0 | 100 ± 0 | 94.2 ± 0.01 | 81.6 ± 0.01 |
| B. 10 Base sequences | | | | | |
| Unique tags | 15,720 | 8,003 ± 5 | 11,460 ± 6 | 11,428 ± 6 | 11,268 ± 5 |
| % 1–5 | 64.16 | 38.86 ± 0.02 | 55.44 ± 0.02 | 55.43 ± 0.02 | 54.65 ± 0.02 |
| % 5–50 | 31.0373 | 52.23 ± 0.02 | 38.51 ± 0.02 | 38.50 ± 0.02 | 39.15 ± 0.02 |
| % 50–500 | 4.3815 | 8.16 ± 0.01 | 5.53 ± 0.006 | 5.54 ± 0.006 | 5.68 ± 0.006 |
| % 500–5000 | 0.4212 | 0.75 ± 0.003 | 0.52 ± 0.002 | 0.52 ± 0.002 | 0.52 ± 0.002 |
| % Errors novel | | | 98.5 ± 0.007 | 98.5 ± 0.007 | 95.0 ± 0.01 |
| % Unique genes | | 100 ± 0 | 100 ± 0 | 98.5 ± 0.004 | 94.0 ± 0.008 |
| C. 10 Base sequences (five times larger genome) | | | | | |
| Unique tags | 78,600 | 47,086 ± 10 | 64,364 ± 10 | 63,407 ± 10 | 58,573 ± 8 |
| % 1–5 | 64.16 | 43.35 ± 0.01 | 58.24 ± 0.009 | 57.77 ± 0.009 | 53.94 ± 0.009 |
| % 6–50 | 31.0373 | 48.71 ± 0.01 | 36.07 ± 0.01 | 36.46 ± 0.01 | 39.77 ± 0.01 |
| % 51–500 | 4.3815 | 7.23 ± 0.004 | 5.26 ± 0.003 | 5.34 ± 0.003 | 5.80 ± 0.003 |
| % 501–5000 | 0.4212 | 0.71 ± 0.001 | 0.43 ± 0.0009 | 0.44 ± 0.0009 | 0.48 ± 0.001 |
| % Errors novel | | | 92.5 ± 0.007 | 92.8 ± 0.006 | 79.4 ± 0.01 |
| % Unique genes | | 100 ± 0 | 100 ± 0 | 92.8 ± 0.004 | 75.4 ± 0.006 |

Simulated results of SAGE experiments. In all cases, the genome is assumed to be as given in the column “Assumed.” The columns “Unique, no errors,” “Unique,” “Random,” and “Non-random” represent, in that order, the increasingly realistic assumptions about the SAGE process outlined in Methods. The row heading “Unique tags” gives the assumed or detected number of unique tag sequences, and the “%” rows give the percentage of those tags falling in each copy-number range. “% Errors novel,” the percentage of erroneously sequenced tags that are novel (not present on some other mRNA). “% Unique genes,” the percentage of actively transcribed genes that have unique tag sequences. A and B, 9- and 10-base tag sequences, respectively, assuming published findings for SAGE experiments. C, 10-base tags assuming a genome with 5 times the number of unique tags and 5 times the number of tags. In all cases, the number of unique genes detected is significantly underestimated, as is the fraction of low copy number transcripts. Values following ± are standard errors of the mean for 1000 simulations.
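
As an illustration of how entries of the form "mean ± standard error" could be obtained from repeated simulations, the following is a minimal Python sketch. It is not the procedure described in Methods: the toy generator `simulate_once`, the function names, and the default parameters are all hypothetical, and the bins use the non-overlapping ranges from panel C. Only the binning and standard-error arithmetic are meant to carry over.

```python
import math
import random

# Copy-number bins (inclusive), taken from the row labels of panel C.
BINS = [(1, 5), (6, 50), (51, 500), (501, 5000)]


def bin_percentages(copy_numbers):
    """Percentage of unique tags whose copy number falls in each bin."""
    total = len(copy_numbers)
    return [
        100.0 * sum(lo <= c <= hi for c in copy_numbers) / total
        for lo, hi in BINS
    ]


def mean_and_sem(values):
    """Sample mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def simulate_once(rng, n_tags=200):
    """Hypothetical stand-in for one simulated experiment (toy copy numbers)."""
    return [rng.choice([1, 2, 3, 10, 40, 100, 1000]) for _ in range(n_tags)]


def summarize(n_runs=1000, seed=0):
    """Mean and SEM of each bin's percentage across n_runs simulations."""
    rng = random.Random(seed)
    runs = [bin_percentages(simulate_once(rng)) for _ in range(n_runs)]
    # Transpose so that each bin has its own list of per-run percentages.
    return [mean_and_sem(list(column)) for column in zip(*runs)]


if __name__ == "__main__":
    for (lo, hi), (mean, sem) in zip(BINS, summarize()):
        print(f"% {lo}-{hi}: {mean:.2f} ± {sem:.3f}")
```

Because the standard error shrinks with the square root of the number of runs, averaging over 1000 simulations yields the small uncertainties seen in the table even when individual runs vary noticeably.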