A Quantitative Evaluation of SAGE

  1. Jes Stollberg¹,
  2. Johann Urschitz,
  3. Zsolt Urban, and
  4. Charles D. Boyd

Pacific Biomedical Research Center, University of Hawai'i at Manoa, Honolulu, Hawaii 96822

Abstract

Serial Analysis of Gene Expression (SAGE) is an innovative technique that offers the potential of cataloging both the identity and the relative frequencies of mRNA transcripts in a given poly(A+) RNA preparation. Although it is a very effective approach for determining the expression of mRNA populations, the observed results carry significant biases that are inherent in the experimental process. These biases are caused by sampling error, sequencing error, and the nonuniqueness and nonrandomness of tag sequences. The quantitative information desired from SAGE experiments consists of estimates of the number of genes and of the frequency distribution of transcript copy numbers. Of additional concern is the extent to which a given tag sequence can be assumed to be unique to its gene. The present study takes these biases into account mathematically and presents a basis for maximum likelihood estimation of gene number and transcript copy frequencies, given a set of experimental results. These estimates of the true state of genomic expression are markedly different from those based directly on the observations from the underlying experiments. It is also shown that, although in many cases a given tag sequence is probably unique within the genome, this cannot be safely assumed in larger genomes.
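The point about tag uniqueness in larger genomes can be illustrated with a minimal sketch. It is not the authors' method: it assumes that tags are drawn uniformly and independently from the 4^k possible sequences of length k, an idealization that the abstract itself questions when it lists nonrandomness of tag sequences as a source of bias. The function name `prob_tag_unique`, the tag lengths, and the gene counts are hypothetical choices made for illustration only.

```python
from math import exp, log1p


def prob_tag_unique(n_genes: int, tag_length: int) -> float:
    """Probability that one gene's tag is shared by no other gene.

    Assumption (not from the source): every gene's tag is an independent,
    uniform draw from the 4**tag_length possible sequences. The fixed
    restriction-site bases are excluded from tag_length because they
    do not vary between tags.
    """
    if n_genes < 1 or tag_length < 1:
        raise ValueError("n_genes and tag_length must be positive")
    n_tags = 4 ** tag_length
    # (1 - 1/n_tags) ** (n_genes - 1), computed via log1p for accuracy
    # when 1/n_tags is very small.
    return exp((n_genes - 1) * log1p(-1.0 / n_tags))


if __name__ == "__main__":
    # Illustrative genome sizes and tag lengths only; not values from the paper.
    for k in (9, 10):
        for n in (6_000, 20_000, 100_000):
            p = prob_tag_unique(n, k)
            print(f"tag length {k}, {n:>7} genes: P(unique) = {p:.3f}")
```

Under these assumptions the probability of uniqueness falls as the number of genes grows relative to the number of possible tags, which is the qualitative behavior described above. Real tag sequences are not uniformly distributed, so the idealized figure is likely to overstate uniqueness.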

Footnotes

  • 1 Corresponding author.

  • E-MAIL jesse{at}pbrc.hawaii.edu; FAX (808) 956–6984.

  • 150 Note that the four-base restriction enzyme sequence by which tags are manipulated does not enter into this calculation. As the sequence is by experimental design unvarying, it does not contribute to the number of possible tag sequences or to the simulations presented below.

  • 151 The program performing these simulations is available free of charge for noncommercial use. Contact the corresponding author for information regarding this software.

  • 152 Note that many of the random number generators supplied with standard languages are inadequate for a task of this size, and care should be taken in choosing the algorithm (Press et al. 1998).

    • Received December 9, 1999.
    • Accepted May 18, 2000.