Published online before print
January 2, 2007, 10.1101/gr.5646507 Genome Res. 17:231-239, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Methods
Improving gene annotation using peptide mass spectrometry
1 Bioinformatics Program, University of California, San Diego, La Jolla, California 92093-0419, USA; 2 Department of Biology, University of California, San Diego, La Jolla, California 92093-0346, USA; 3 Department of Computer Science, George Washington University, Washington, DC 20052, USA; 4 Centre de Regulació Genòmica, 08003 Barcelona, Spain; 5 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0404, USA
Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of recent advances in computational gene finding, a comprehensive annotation of protein-coding genes remains challenging. In most annotation pipelines, a computationally predicted gene must be confirmed by independent evidence and/or manual validation before it is accepted. The additional evidence is often in the form of conservation across distant organisms or evidence of transcription. This evidence, while compelling, is not sufficient (see Gupta et al. 2004). Conservation across species is not limited to protein-coding regions. Roughly 5%–20% of the human genome is conserved against mouse, of which just 1%–2% is considered to be coding for proteins (Waterston et al. 2002). Therefore, it is customary to provide a conservative genome annotation and then rely upon community efforts to refine annotations and fill in missing genes. While the genome annotation process is unlikely to be fully automated, high-throughput methods are an important part of any genome annotation strategy. Tandem mass spectrometry is an attractive technique for validating gene predictions. It measures proteins directly, verifying putative gene products at the level of translation. It also provides an orthogonal line of evidence, with different error sources than nucleotide-based approaches.
A tandem mass spectrum can be viewed as a collection of fragment masses from a single peptide (eight to 30 amino acids from an enzymatically digested protein). This set of mass values is a "fingerprint" that identifies the peptide. The spectra are usually not analyzed de novo. Instead, they are compared against peptides from a database of known proteins (Aebersold and Mann 2003).
In this context, it is natural to ask if we can search translated genomic databases directly. Each match from such a search confirms a genomic locus to be part of a protein-coding gene. This has been proposed in a number of studies (Yates et al. 1995a).
We overcome these issues with several technical improvements. First, instead of searching translated genomes directly, we search a compact representation of all putative exons, splice variants and polymorphisms. This representation takes the form of a directed acyclic graph which we call the exon graph. Our search is efficient, using a database filtering technique based on tagging (Frank et al. 2005).
Exon and intron predictions
Exon predictions were generated by GeneID (Parra et al. 2000 1 were retained, producing 4,110,476 exons with considerable overlap. Splice junctions were considered between all pairs of exons with compatible reading frames and intron length between 25 and 20,000 bases. Each interval was linked to the closest intervals with a compatible reading frame. At most 10 introns were considered per genomic position.
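The junction rules above (compatible reading frames, intron length between 25 and 20,000 bases, at most 10 introns per genomic position) can be illustrated with a minimal sketch. The `Exon` record, the phase convention (codon position of the first base), the reading of "per genomic position" as "per donor coordinate", and the quadratic scan are all assumptions made for illustration; they are not taken from the article.

```python
from dataclasses import dataclass

MIN_INTRON = 25        # bases, from the text
MAX_INTRON = 20_000    # bases, from the text
MAX_INTRONS_PER_POSITION = 10  # from the text


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record: half-open interval [start, end)."""
    start: int
    end: int
    start_phase: int  # assumed convention: codon position (0, 1, 2) of the first base

    @property
    def end_phase(self) -> int:
        # Codon position that the next coding base would occupy.
        return (self.start_phase + (self.end - self.start)) % 3


def candidate_junctions(exons):
    """Map each donor coordinate to its closest compatible acceptor coordinates."""
    by_donor = {}
    for donor in exons:
        for acceptor in exons:
            intron_len = acceptor.start - donor.end
            if (MIN_INTRON <= intron_len <= MAX_INTRON
                    and donor.end_phase == acceptor.start_phase):
                by_donor.setdefault(donor.end, set()).add(acceptor.start)
    # Keep only the closest acceptors for each donor position.
    return {d: sorted(a)[:MAX_INTRONS_PER_POSITION] for d, a in by_donor.items()}
```

The all-pairs loop is only suitable for small examples; a set of millions of exons would need a sorted sweep or an interval index.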
We extracted human sequences from dbEST (6,587,476 sequences) (Boguski et al. 1993). These sequences were aligned against the May 2004 assembly of the human genomic sequence using ESTMapper (Florea et al. 2005).
Database construction
Gene prediction algorithms often produce putative exons of various lengths which overlap. Similarly, because ESTs have varying read lengths, it is common for them to map to overlapping genomic intervals. If intervals Ii and Ij overlap, we can merge them into a larger interval without loss of information, so long as
We perform all such legal merges. This phase greatly reduces the redundancy of the set of intervals. If an interval overlaps the edge of a putative intron, we cut the interval into two subintervals at the junction point. At the end of this phase, our set of intervals is disjoint. We now add an edge between any adjacent intervals (Ii and Ij such that Ri = Lj). For each putative intron, we add a splice edge between the corresponding intervals. We now incorporate polymorphisms. If an interval contains a coding SNP, we add intervals for each allele. Thus, each SNP produces a "bulge" in the graph. We derive an exon graph from the genomic interval graph. For each node in the interval graph, add one node to the exon graph for each legal reading frame. Each exon graph node has a protein sequence and may have an untranslated prefix and suffix. If intervals are joined by an edge, then the corresponding exons (with compatible reading frame) are similarly joined. Edges are annotated with an amino acid when a codon is split between exons. In order to remove noncoding "noise" from the database, we remove all nodes and edges that are not part of a coding sequence of length 50 or more. This procedure removes nodes corresponding to translation of EST mappings in the wrong reading frame. The finished exon graph contains a total of 133 M amino acids, in 3.5 M exons, with 2 M splice junctions.
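As an illustration of the cutting and edge-adding steps described above, the following sketch builds a small interval graph. It assumes the legal merges have already been performed, so the input intervals are disjoint, and it represents intervals as half-open `(left, right)` tuples and introns as `(donor_end, acceptor_start)` pairs. These representations and the function name are hypothetical. SNP "bulges" and reading-frame expansion into the exon graph are not modelled.

```python
def build_interval_graph(intervals, introns):
    """Cut intervals at intron boundaries, then add adjacency and splice edges.

    intervals: disjoint half-open (left, right) tuples (merging assumed done).
    introns:   (donor_end, acceptor_start) coordinate pairs.
    """
    cut_points = {p for intron in introns for p in intron}

    # Cut every interval at each junction point that falls strictly inside it.
    pieces = []
    for left, right in sorted(intervals):
        inner = sorted(p for p in cut_points if left < p < right)
        bounds = [left, *inner, right]
        pieces.extend(zip(bounds, bounds[1:]))

    by_left = {l: (l, r) for l, r in pieces}
    by_right = {r: (l, r) for l, r in pieces}

    # Adjacency edge: I_i -> I_j whenever R_i = L_j.
    adjacency = [(a, by_left[a[1]]) for a in pieces if a[1] in by_left]

    # Splice edge: interval ending at the donor -> interval starting at the acceptor.
    splice = [(by_right[d], by_left[a])
              for d, a in introns if d in by_right and a in by_left]
    return pieces, adjacency, splice
```

For example, intervals `(100, 200)` and `(300, 400)` with one intron `(150, 300)` yield three pieces, one adjacency edge across the cut at 150, and one splice edge from `(100, 150)` to `(300, 400)`.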
Mass spectra
An Agilent 1100 HPLC system (Agilent Technologies) was used to deliver a flow rate of 300 nL/min to the mass spectrometer through a splitter. Chromatographic separation was accomplished using a three-phase capillary column. Using an in-house constructed pressure cell, 5 µm Zorbax SB-C18 (Agilent) packing material was packed into a fused silica capillary tubing (200-µm inner diameter (ID), 360-µm outer diameter (OD), 20 cm long) to form the first dimension RP column (RP1). A similar column (200-µm ID, 5 cm long) packed with 5 µm PolySulfoethyl (PolyLC) packing material was used as the SCX column. A zero dead volume 1-µm filter (Upchurch, M548) was attached to the exit of each column for column packing and connecting. A fused silica capillary (100-µm ID, 360-µm OD, 20 cm long) packed with 5 µm Zorbax SB-C18 (Agilent) packing material was used as the analytical column (RP2). One end of the fused silica tubing was pulled to a sharp tip with the ID <1 µm using a laser puller (Sutter P-2000) as the electro-spray tip. The peptide mixtures were loaded onto the RP1 column using the same in-house pressure cell. To avoid sample carryover and keep good reproducibility, a new set of three columns with the same length was used for each sample. Peptides were first eluted from the RP1 column to the SCX column using a 0%–80% acetonitrile gradient for 150 min. The peptides were fractionated by the SCX column using a series of salt gradients (from 10 mM to 1 M ammonium acetate for 20 min), followed by high-resolution reverse phase separation using an acetonitrile gradient of 0%–80% for 120 min. We have found that a three-dimensional run can provide significantly more resolving power but at the cost of a longer separation time. For three dimensions, we elute fractions with acetonitrile from RP1 in 10% increments and then perform the salt elutions as described above but with a resolving gradient for RP2 of acetonitrile equal to the gradient used to elute from RP1.
Spectra were acquired on LTQ linear ion trap tandem mass spectrometers (Thermo Electron Corporation) employing automated, data-dependent acquisition. The mass spectrometer was operated in positive ion mode with a source temperature of 150°C. As a final purification step, gas phase separation in the ion trap was employed to separate the peptides into three mass classes prior to scanning; the full MS scan range was divided into three smaller scan ranges (300–800, 800–1100, and 1100–2000 Da) to improve dynamic range. Each mass spectrometry (MS) scan was followed by 4 MS/MS scans of the most intense ions from the parent MS scan. A dynamic exclusion of 1 min was used to improve the duty cycle.
In addition, we downloaded all human, non-ICAT-labeled spectra publicly available (as of March 2006) in the PeptideAtlas data repository (Desiere et al. 2004). The HEK293 mass spectra are available from http://bioinfo2.ucsd.edu, together with spectrum annotations.
Database search
When a tag and its flanking masses are matched, a candidate peptide is produced. Each candidate peptide is scored to compute the probability of that peptide generating the query spectrum (Tanner et al. 2005).
The empirical distribution of F-scores can be fit by a mixture model of a gamma distribution (representing false annotations) and a normal distribution (representing true annotations) (Keller et al. 2002). As an additional measurement of false discovery rate, we constructed a reversed database by reversing the sequences of all nodes and reversing the direction of each edge. We measured an empirical false discovery rate by searching 700,000 spectra against the reversed databases. Our F-score cutoff yields 1200 matches on the reversed database, for a false annotation rate of 0.2%. In a search of the forward database, 47,000 spectra passed this same score cutoff. Based on these results, we estimate that 1200 of the 47,000 spectrum matches against the true database are incorrect, for a false discovery rate of 2.5%. In addition to this filter at the spectrum level, we pay particular attention to exons hit by multiple peptides; no such instances were observed for the search of the reversed database. Post-processing of the search results was performed to deal with peptides which occur in multiple proteins. We note that in addition to closely related paralogs, the predicted exons may include some pseudogenes highly similar to their source genes. As an extreme example, the peptide AMGIMNSFVNDIFER (from H2B histone family, member S) is found in >20 valid and invalid ORFs. Therefore, when measuring coverage, we iteratively select a set of genes. At each stage, the gene which can be used to annotate the greatest number of spectra is selected, and the selected gene "absorbs" all shared peptides. We require at least two peptide hits before judging a protein present. This procedure ensures that redundant or questionable protein records are not selected. When considering alternative splicing, we select multiple isoforms of a protein only if we must do so in order to account for all the peptides matched.
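The iterative gene selection described above resembles a greedy set cover. The sketch below is one possible reading, not the authors' implementation: the function name, the input dictionaries, the alphabetical tie-breaking, and the choice to apply the two-peptide requirement to the peptides a gene still holds after earlier absorptions are all assumptions.

```python
def select_genes(gene_peptides, spectrum_counts, min_peptides=2):
    """Greedily pick the gene explaining the most not-yet-absorbed spectra.

    gene_peptides:   dict gene -> set of peptides matched to that gene.
    spectrum_counts: dict peptide -> number of spectra annotated by it.
    """
    remaining = {g: set(p) for g, p in gene_peptides.items()}
    selected = []
    while True:
        # Assumed rule: a gene needs at least `min_peptides` unabsorbed peptides.
        eligible = {g: p for g, p in remaining.items() if len(p) >= min_peptides}
        if not eligible:
            break
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(eligible),
                   key=lambda g: sum(spectrum_counts[p] for p in eligible[g]))
        absorbed = remaining.pop(best)
        selected.append(best)
        # The selected gene "absorbs" all peptides it shares with other genes.
        for peptides in remaining.values():
            peptides -= absorbed
    return selected
```

With genes `A = {p1, p2, p3}`, `B = {p2, p3}`, `C = {p4, p5}`, `D = {p5}`, gene `A` is chosen first and absorbs everything `B` had, then `C` is chosen and leaves `D` empty, so only `A` and `C` are reported.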
Mapping known proteins to the genome
The heuristic alignment algorithm enumerates 6-mers from the protein found in the six-frame translation of the genomic region of interest. Adjacent hits are merged into putative exons. Using dynamic programming, we find a chain of exons which cover the entire protein. Exons close to each other can be merged, to step over mismatches between the protein sequence and genome. Finally, exon endpoints are refined to capture the best available splice signals.
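The seeding and merging steps can be sketched as follows for a single, already translated reading frame. The function names, the `(protein_pos, frame_pos)` hit representation, and the rule that only hits on the same diagonal with consecutive start positions are merged are illustrative assumptions. Six-frame translation, exon chaining by dynamic programming, and splice-signal refinement are not shown.

```python
from collections import defaultdict

K = 6  # seed length, from the text


def seed_hits(protein, translated_frame, k=K):
    """Return (protein_pos, frame_pos) pairs where a k-mer of the protein occurs."""
    index = defaultdict(list)
    for j in range(len(translated_frame) - k + 1):
        index[translated_frame[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(protein) - k + 1)
            for j in index.get(protein[i:i + k], [])]


def merge_hits(hits, k=K):
    """Merge adjacent hits on the same diagonal into ungapped segments.

    Returns (protein_start, frame_start, length) tuples, in residues.
    """
    segments = []
    for i, j in sorted(hits, key=lambda h: (h[1] - h[0], h[0])):
        if segments:
            pi, pj, plen = segments[-1]
            same_diagonal = (pj - pi) == (j - i)
            # Adjacent: starts no later than one past the last seed in the segment.
            if same_diagonal and i <= pi + plen - k + 1:
                segments[-1] = (pi, pj, max(plen, i - pi + k))
                continue
        segments.append((i, j, k))
    return segments
```

Each returned segment is a candidate exon in amino-acid coordinates; mapping it back to genomic coordinates depends on the frame and strand, which this sketch leaves out.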
A total of 56,725 proteins (98%) were mapped against the genome with Each peptide identified in our database was compared to the locations of known proteins. If a peptide was found multiple times in the genome, or if two matches had equivalent match scores, we considered each locus. When selecting a locus, the order of preference was as follows: match to a known gene, match a known gene with SNPs, match a novel single-exon peptide, match a novel intron-spanning peptide. This procedure helps us avoid proposing new exons which correspond to pseudogenes.
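The locus preference order stated above can be expressed as a simple ranking. The category labels and the dictionary layout of a candidate locus below are hypothetical names chosen for illustration.

```python
# Preference order from the text, most preferred first (labels are hypothetical).
PREFERENCE = [
    "known_gene",
    "known_gene_with_snp",
    "novel_single_exon",
    "novel_intron_spanning",
]
RANK = {category: rank for rank, category in enumerate(PREFERENCE)}


def choose_locus(candidates):
    """Pick the candidate locus whose category is most preferred.

    candidates: list of dicts with at least a "category" key.
    Ties keep the first candidate in input order.
    """
    return min(candidates, key=lambda c: RANK[c["category"]])
```

Ranking a known-gene locus ahead of any novel locus is what keeps a peptide shared with a pseudogene from being reported as a new exon.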
Improving gene predictions
We first ran GeneID against the human genome, retaining all predicted exons with score
For each exon, we consider three parameters. The parameter c is equal to the number of spectrum annotations that are contained in the exon of interest. The parameter Pa is set to the best P-value of a peptide match covering the splice acceptor of the exon. We set Pa = 1 if there are not at least two spectrum annotations covering the acceptor site. Otherwise, we add 0.001 to the P-value to limit the effects of matches with extremely low P-values. Similarly, Pd is the best P-value of a match covering the splice donor. The score S of each exon is modified as follows:
For each gene of interest, we extract the genomic interval containing the exons from the gene. We run GeneID in exon-chaining mode to predict a gene on this interval using the original exons, then using the rescored exons.
Search algorithm comparison
We compared the performance of Inspect to that of SpectrumMill (version 3.1, Agilent) on a collection of 800,000 spectra (34 runs) from the HEK293 data set. Both tools searched these spectra against the same database consisting of the IPI database, together with the reversed sequence of each protein. We assume that spurious matches are distributed randomly throughout the database. Using this assumption, if 5% of all matches come from reversed proteins, then the false discovery rate among matches from valid proteins is also 5%. Sorting the SpectrumMill matches by score, we obtain 94,633 spectrum annotations (27,845 distinct peptides) at a false discovery rate of 5%. Sorting the Inspect matches by score, we obtain 135,192 spectrum annotations (43,311 distinct peptides) at this same false discovery rate. These results (40% more spectra, 70% more peptides) indicate that Inspect's filtering and scoring are effective on this data set.
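The reversed-database estimate can be sketched as a score-sorted scan. The sketch follows the convention stated above, treating the fraction of accepted matches that come from reversed proteins as the false discovery rate. The function name and the `(score, is_decoy)` input format are assumptions, and tied scores are not handled specially.

```python
def matches_at_fdr(matches, max_decoy_fraction=0.05):
    """Find the most permissive score cutoff meeting the decoy-fraction limit.

    matches: iterable of (score, is_decoy) pairs; higher scores are better.
    Returns (cutoff_score, number_of_non_decoy_matches_accepted).
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    best = (None, 0)
    decoys = 0
    for n, (score, is_decoy) in enumerate(ranked, start=1):
        decoys += bool(is_decoy)
        # Fraction of accepted matches that hit reversed (decoy) proteins.
        if decoys / n <= max_decoy_fraction:
            best = (score, n - decoys)
    return best
```

Running the same function on the score-sorted output of two search tools gives annotation counts at a common false discovery rate, which is how the comparison above is framed.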
Exon graph construction
To verify the completeness of the exon graph, we considered the IPI database (version 3.15) as a representative corpus of known human proteins (Kersey et al. 2004). The mapped proteins include multiple isoforms of many genes. Counting known proteins that share exons as one gene, we reach a gene count of 32,493, of which 10,583 have multiple isoforms (Supplemental Fig. 1). These gene mappings include a total of 442,572 distinct exons. We show later the annotation of peptides corresponding to isoforms that are not contained in the IPI database but have been deposited in GenBank. For each mapped protein, we determined whether GeneID predictions and/or EST mappings captured the genomic intervals (exons) and putative splice junctions (introns) of the protein. Table 1 summarizes the results.
This table reflects the extremely high EST coverage of the human proteome. The exon predictions from GeneID cover most true exons, but the intron coverage is lower. The low intron coverage likely results from the simplistic exon-joining algorithm used in constructing the exon graph. A more sophisticated approach may cover more splice junctions. The exons missed in this construction typically come from the edges of the protein. The coverage rates for first and last exons are 81% for ESTs and 60% for GeneID, significantly lower than the average overall. Further research will target these problematic exons. Given the high coverage of known proteins by the algorithmically derived exon graph, we turn now to the results of mass spectrometric annotation with the exon graph.
Search results
Each annotation includes the genomic location of the peptide. We compare these loci to the chromosomal locations of known proteins. We then categorize peptide matches based upon their relationship to known genes (see Methods). Recall that the human genome is heavily annotated. Therefore, the degree to which known proteins are covered by annotations from this data set is a reasonable estimate of our coverage of the full proteome. See Figure 4 for an initial breakdown of the results. The majority (89%) of peptides match known genes. Of these, 24% span an exon boundary, confirming splicing events at the protein level. A total of 121 peptides (in 1517 spectra) span two exon boundaries; these represent cases where a tryptic peptide fully spans a short exon. A total of 11,050 splice events are confirmed by identified peptides. Given that only
Protein coverage
The search results include 6252 proteins confirmed by two or more distinct peptides, and a total of 3745 proteins are matched by five or more distinct peptides. As noted earlier, we select a minimal set of proteins which account for spectrum annotations. This allows us to avoid listing records corresponding to multiple isoforms of the same protein unless both forms are in fact present.
Because protein abundances within the cell vary greatly, we see extreme variation in the number of spectra matching each protein, with >25,000 matches from enolase 1, (alpha) but only one or two matches to other proteins. As with other high-throughput techniques such as cDNA sequencing, the repeated sampling of common elements eventually reaches saturation. We count the number of distinct peptides (from known proteins) discovered for a given number of identifications and plot the resulting discovery curve. The discovery rate slows as more peptides are found (Fig. 5), but is still far from saturation. The discovery curve is fit well by the function y
Novel peptides
Matches to the exon graph which do not correspond to known proteins are potentially of great interest, since they may come from uncharacterized exons or even unannotated genes. We investigated and categorized all peptide matches that are not present in the IPI reference database. We reiterate that searching a larger database increases the likelihood of obtaining a high-scoring match by chance, and we employ several safeguards to filter such matches. First, we use a cutoff based on the false discovery rate (see Methods) to limit the number of such matches. Second, we used the results of a standard database search to filter any novel matches that can be explained away by a known peptide that is missing from the exon graph. An example of a peptide removed by this filtering is LGEHNVEVLEGNEQFINAAK, coded by an intron of TRBC1 (GI:135523) on the forward strand of chromosome 7. The spectra for this peptide are annotated by a fragment of porcine trypsin with similar sequence (LGEHNIDVLEGNEQFINAAK). Many of the peptides not present in IPI are present in other isoforms or proteins found in the NCBI nonredundant database. We observe a total of 90 such peptides (1938 spectra). See Supplemental Table 1 for the complete list. These cases illustrate the danger of selecting a limited set of "representative" splice forms for a protein database. After removing such annotations, we retain 58,000 novel spectra (6100 peptides). We note that incorrect matches are more likely to be novel peptides, since 80% of the exon graph database is novel sequence. Let us conservatively assume the incorrect matches all fall within the novel peptides. Given a 2.5% false discovery rate across all 1.2 M annotations, we estimate that 28,000 spectra are correctly annotated by novel peptides. These correspond to an estimated 3300 peptides, based on the mean number of spectra per novel peptide. A report of all novel peptides is provided in Supplemental Table 2.
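The spectrum-level estimate above follows from simple arithmetic on the rounded figures quoted in the text. The variable names are illustrative, and the peptide-level figure is not recomputed here because it depends on a mean that the text does not state.

```python
# Rounded figures quoted in the text.
total_annotations = 1_200_000   # all spectrum annotations
novel_spectra = 58_000          # annotations by novel peptides
fdr_per_thousand = 25           # 2.5% false discovery rate, kept as an integer

# Expected number of incorrect annotations overall.
expected_incorrect = total_annotations * fdr_per_thousand // 1000

# Conservative assumption from the text: every incorrect match is a novel one.
estimated_correct_novel = novel_spectra - expected_incorrect

print(expected_incorrect, estimated_correct_novel)
```

Integer arithmetic is used only to avoid floating-point rounding in the illustration.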
In the remainder of our analysis, we restrict our attention to those novel peptides strongly supported by additional lines of evidence. We find evidence for novel exons (or extensions of known exons) in 16 genes. These instances are supported by sequence homology and by the discovery of one or more peptides in close proximity along the genome. The discovery of translated peptides demonstrates that these sites are indeed exons and not conserved noncoding sequences. See Figure 6 for an example of the evidence for one exon.
Table 2 summarizes these exon discoveries. While the main purpose of our project is the preliminary annotation of nonannotated or sparsely annotated genomes, the discovery of new exons on the human genome demonstrates the power of the technique. In most cases, the novel translation is immediately upstream of known exons. We note that many of the reference protein sequences are derived from cDNA sequences. The 5' portions of such sequences are often inferred or absent due to truncation of cDNA. In addition, predicted translation start sites are often incorrect. With the exon graph, we can use mass spectra not only to confirm translation of these genes but to correct their sequence annotations. Supplemental Table 3 reports the peptide hits to these novel exons, as well as peptides from the known exons of the protein. Supplemental Figure 2 illustrates one such case.
Two peptides were observed that fall within splicing factor 1 (GI:42544130) but not in the annotated reading frame. These peptides are of particular interest since they fall within one of the genomic regions selected by the ENCODE project (ENCODE Project Consortium 2004).
Alternative splicing
We examined our search results for evidence of alternative splicing. We consider all splice donors and splice acceptors that have multiple partners. We ignore matches where the splice boundaries are not part of a known protein, or where the peptide covers six or fewer base pairs on either side of the intron. We highlight a total of 40 instances of alternative splicing in this way. We report these events in Supplemental Table 4. In 24 of these instances, only one of the two isoforms is present in the IPI database. As a conservative filter, we report such splice junctions only if they are supported by EST evidence and/or supported by sequences in the NCBI nonredundant database.
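The search for donors and acceptors with multiple partners can be sketched as below. The tuple layout `(donor, acceptor, left_bp, right_bp)` and the function name are hypothetical. The filters on known-protein boundaries and on EST or nonredundant-database support are not modelled.

```python
from collections import defaultdict

MAX_IGNORED_FLANK_BP = 6  # peptides covering six or fewer bp on a side are ignored


def alternative_sites(junction_peptides):
    """Find splice donors and acceptors observed with more than one partner.

    junction_peptides: iterable of (donor, acceptor, left_bp, right_bp), where
    left_bp and right_bp are the bases the peptide covers on each side of the intron.
    """
    acceptors_of = defaultdict(set)
    donors_of = defaultdict(set)
    for donor, acceptor, left_bp, right_bp in junction_peptides:
        if left_bp <= MAX_IGNORED_FLANK_BP or right_bp <= MAX_IGNORED_FLANK_BP:
            continue  # too little sequence on one side to trust the junction
        acceptors_of[donor].add(acceptor)
        donors_of[acceptor].add(donor)
    alt_donors = {d: sorted(a) for d, a in acceptors_of.items() if len(a) > 1}
    alt_acceptors = {a: sorted(d) for a, d in donors_of.items() if len(d) > 1}
    return alt_donors, alt_acceptors
```

Requiring both junctions of an event to be observed directly is the strict criterion used above; peptides unique to a non-standard isoform would only count as indirect evidence.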
Polymorphisms
Hypothetical proteins
Refining gene predictions
We ran GeneID on the genomic intervals containing 1386 protein-coding genes. We selected genes for which one or more peptides were mapped to the coding region, and for which a single splice isoform was known (from the IPI database). We then rescored all predicted exons by incorporating peptide matches from our database search. The sensitivity and selectivity of gene assembly improved (Table 3), with a gain of 863 correctly identified exons. The improvements are greatest for proteins that are well sampled (data not shown). We also note that since we examine a broad selection of genes, including 100 that span >100,000 bp, accuracy on this corpus may be lower than on other test sets. Figure 7 shows an example of a gene prediction improved by this method.
In a few cases (20 genes), predictions worsened after rescoring. The peptide annotations used for these genes appear to be correct. In most cases, an incorrect exon (which overlaps the true exon) was boosted and selected for the final gene prediction. One instance of a peptide mapped to an incorrect splice boundary was also observed. Further work will focus on improved incorporation of MS/MS data, and integration of MS/MS search results alongside other data that can corroborate exons (ESTs and comparative genomics). We anticipate that refinement of the algorithm as well as acquisition of additional spectra will improve results.
Delineating the protein-coding genes within a eukaryotic genome remains a complex and labor-intensive process. To cite one example, a human-curated annotation of the human X chromosome required an estimated 15,000 person-hours (Harsha et al. 2005).
The exon graph is a compact representation of protein splice isoforms and polymorphisms. We observe a near 10-fold reduction in database size between dbEST and the exon graph. We emphasize that this is difficult to accomplish with a typical database, stored in FASTA format. Enumeration of all protein sequences greatly increases search time and creates confusion when matches to dozens of "records" are explained by one gene. Many databases sidestep the problem by including one or two representative sequences for each protein, but this approach omits isoforms and polymorphisms. Algorithmic improvements are one way to reduce redundancy from linear protein databases (Edwards and Lippert 2004). We used two data sources that complement each other to construct the exon graph. An advantage of the EST evidence is that it includes evidence for introns. Short exons, or exons with unusual hexamer count, are difficult to identify de novo but may be covered by ESTs. A limitation of EST evidence is that ESTs may not be available for all genes, and may not cover the 5' portion of a gene. Many genes are transcribed only in certain tissues or under certain conditions and may never have been captured as ESTs. Another drawback of EST data is the presence of unprocessed and truncated transcripts, as well as genomic contaminants. Exon predictions have the advantage that they explicitly indicate reading frame. Database construction proceeds from putative exons and introns, independent of any specific exon prediction method. We are working to integrate other signals including the output from multiple gene finding programs, evolutionarily conserved regions, etc.
Our results include 40 instances of alternative splicing. We emphasize that we have highlighted only those instances where two splicing events are observed at the same locus. These results directly confirm both splice events. Many other peptide identifications are unique to splice isoforms that are not considered standard, giving indirect evidence of alternative splicing. It is notable that many splice isoforms differ by the inclusion of a single amino acid. These are cases where two splice donor (or acceptor) sites are present, separated by 3 bp. Some isoforms of biological significance differ by presence or absence of a single amino acid (Tadokoro et al. 2005).
Fully characterizing splice events from tryptic peptides gives rise to a phasing problem which may be avoided by top-down mass spectrometry of complete proteins (Roth et al. 2005).
Our focus in this article is on cataloging coding exons and splice events. We note that mass spectrometry can measure other types of information that are invaluable for annotation of genes. These include post-translational modifications (Jensen 2006). We argue that high-throughput proteomics experiments should accompany each genome sequencing project. Mass spectrometry is a practical technique for annotating protein-coding regions. The search is able to tolerate a substantial overhead of "noise" in exon predictions. In addition, the technique is orthogonal to standard transcript-level methods such as cDNA sequencing. Mass spectrometry complements other experimental methods. With recent advances in instrumentation, the data volume we consider in this article can be produced in 10 instrument-weeks with two person-weeks of labor. Scaling up mass spectrometry experiments to help annotate a large portion of proteomes is an attractive prospect at feasible cost.
S.T. is supported by NSF IGERT training grant DGE0504645. This research was supported in part by NIH (RR016522-04A1), and by the UCSD FWGrid Project, NSF Research Infrastructure Grant Number EIA-0303622. Part of this investigation was supported using the computing facility made possible by the Research Facilities Improvement Program Grant Number C06 RR017588 awarded to the Whitaker Biomedical Engineering Institute, and the Biomedical Technology Resource Centers Program Grant Number P41 RR08605 awarded to the National Biomedical Computation Resource, UCSD, from the National Center for Research Resources, National Institutes of Health.
6 Corresponding author.
E-mail stanner{at}ucsd.edu; fax (858) 534-7029 [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5646507
Aebersold, R. and Mann, M. 2003. Mass spectrometry-based proteomics. Nature 422: 198–207.
Aho, A. and Corasick, M. 1975. Efficient string matching: An aid to bibliographic search. Commun. ACM 18: 333–340.
Bafna, V. and Edwards, N. 2001. SCOPE: A probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 17: 13–21.
Blanco, E., Parra, G., and Guigó, R. 2002. Using GeneID to identify genes. In Current Protocols in Bioinformatics. John Wiley & Sons Inc., New York. Unit 4.3.
Boguski, M.S., Tolstoshev, C.M., and Bassett Jr., D.E. 1993. Gene discovery in dbEST. Science 265: 1993–1994.
Carlton, J.M., Angiuoli, S.V., Suh, B.B., Kooij, T.W., Pertea, M., Silva, J.C., Ermolaeva, M.D., Allen, J.E., Selengut, J.D., and Koo, H.L., et al. 2002. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 419: 512–519.
Choudhary, J., Blackstock, W., Creasy, D., and Cottrell, J. 2001. Interrogating the human genome using uninterpreted mass spectrometry data. Proteomics 1: 651–667.
Craig, R. and Beavis, R. 2003. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17: 2310–2316.
Creasy, D. and Cottrell, J. 2002. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics 2: 1426–1434.
Desiere, F., Deutsch, E., Nesvizhskii, A., Mallick, P., King, N., Eng, J., Aderem, A., Boyle, R., Brunner, E., and Donohoe, S., et al. 2004. Integration of peptide sequences obtained by high-throughput mass spectrometry with the human genome. Genome Biol. 1: R9.
Dunkley, T.P.J., Hester, S., Shadforth, I.P., Runions, J., Weimar, T., Hanton, S.L., Griffin, J.L., Bessant, C., Brandizzi, F., and Hawes, C., et al. 2006. Mapping the Arabidopsis organelle proteome. Proc. Natl. Acad. Sci. 103: 6518–6523.
Edwards, N. and Lippert, R. 2004. Sequence database compression for peptide identification from tandem mass spectra. In The 4th Workshop on Algorithms in Bioinformatics (WABI). Bergen, Norway.
ENCODE Project Consortium 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306: 636–640.
Fermin, D., Allen, B., Blackwell, T., Menon, R., Adamski, M., Xu, Y., Ulintz, P., Omenn, G., and States, D. 2006. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 7: R35.
Florea, L., Francesco, V., Miller, J., Turner, R., Yao, A., Harris, M., Walenz, B., Mobarry, C., Merkulov, G., and Charlab, R., et al. 2005. Gene and alternative splicing annotation with AIR. Genome Res. 15: 54–66.
Frank, A., Tanner, S., Bafna, V., and Pevzner, P. 2005. Peptide sequence tags for fast database search in mass spectrometry. J. Proteome Res. 4: 1287–1295.
Godovac-Zimmermann, J., Kleiner, O., Brown, L.R., and Drukier, A.K. 2005. Perspectives in spicing up proteomics with splicing. Proteomics 5: 699–709.
Gupta, S., Zink, D., Korn, B., Vingron, M., and Haas, S. 2004. Strengths and weaknesses of EST-based prediction of tissue-specific alternative splicing. BMC Genomics 5: 72.
Harsha, H., Suresh, S., Amanchy, R., Deshpande, N., Shanker, K., Yatish, A., Muthusamy, B., Vrushabendra, B., Rashmi, B., and Chandrika, K., et al. 2005. A manually curated functional annotation of the human X chromosome. Nat. Genet. 37: 331–332.
Heber, S., Alekseyev, M., Sze, S., Tang, H., and Pevzner, P.A. 2002. Splicing graphs and EST assembly problem. Bioinformatics 18: S181–S188.
Jensen, O.N. 2006. Interpreting the protein language using proteomics. Nat. Rev. Mol. Cell Biol. 7: 391–403.
Keller, A., Nesvizhskii, A., Kolker, E., and Aebersold, R. 2002. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74: 5383–5392.
Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. 2004. The international protein index: An integrated database for proteomics experiments. Proteomics 4: 1985–1988.
Korf, I., Flicek, P., Duan, D., and Brent, M. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140–S148.
Kuster, B., Mortensen, P., Andersen, J.S., and Mann, M. 2001. Mass spectrometry allows direct identification of proteins in large genomes. Proteomics 1: 641–650.
Leipzig, J., Pevzner, P., and Heber, S. 2004. The Alternative Splicing Gallery (ASG): Bridging the gap between genome and transcriptome. Nucleic Acids Res. 32: 3977–3983.
Lill, J. 2003. Proteomic tools for quantitation by mass spectrometry. Mass Spectrom. Rev. 22: 182–194.
Lu, B. and Chen, T. 2003. A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications. Bioinformatics 19: 113–121.
Mironov, A., Fickett, J., and Gelfand, M. 1999. Frequent alternative splicing of human genes. Genome Res. 9: 1288–1293.
Modrek, B. and Lee, C. 2002. A genomic view of alternative splicing. Nat. Genet. 30: 13–19.
Omenn, G., States, D., Adamski, M., Blackwell, T., Menon, R., Hermjakob, H., Apweiler, R., Haab, B., Simpson, R., and Eddes, J., et al. 2005. Overview of the hupo plasma proteome project: Results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly available database. Proteomics 5: 3226–3245.
Parra, G., Blanco, E., and Guigó, R. 2000. GeneID in Drosophila. Genome Res. 10: 511–515.
Perkins, D., Pappin, D., Creasy, D., and Cottrell, J. 1999. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20: 3551–3567.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. 2005. NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33: 501–504.
Resing, K., Meyer-Arendt, K., Mendoza, A., Aveline-Wolf, L., Jonscher, K., Pierce, K., Old, W., Cheung, H., Russell, S., and Wattawa, J., et al. 2004. Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal. Chem. 76: 3556–3568.
Roth, M.J., Forbes, A.J., Boyne, M.T.N., Kim, Y.-B., Robinson, D.E., and Kelleher, N.L. 2005. Precise and parallel characterization of coding polymorphisms, alternative splicing, and modifications in human proteins by mass spectrometry. Mol. Cell. Proteomics 4: 1002–1008.
Sadygov, R. and Yates, J. 2003. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 75: 3792–3798.
Tabb, D., Smith, L., Breci, L., Wysocki, V., Lin, D., and Yates, J. 2003. Statistical characterization of ion trap tandem mass spectra from doubly charged tryptic peptides. Anal. Chem. 75: 1155–1163.
Tadokoro, K., Yamazaki-Inoue, M., Tachibana, M., Fujishiro, M., Nagao, K., Toyoda, M., Ozaki, M., Ono, M., Miki, N., and Miyashita, T., et al. 2005. Frequent occurrence of protein isoforms with or without a single amino acid residue by subtle alternative splicing: The case of gln in drpla affects subcellular localization of the products. J. Hum. Genet. 50: 382–394.
Tanner, S., Shu, H., Frank, A., Wang, L., Zandi, E., Mumby, M., Pevzner, P., and Bafna, V. 2005. Inspect: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal. Chem. 77: 4626–4639.
Tsur, D., Tanner, S., Zandi, E., Bafna, V., and Pevzner, P. 2005.
Identification of post-translational modifications via blind search of mass-spectra. Nat. Biotechnol. 23: 15621567.[CrossRef][Medline] Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., and An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520562.[CrossRef][Medline] Yates, J., Eng, J., and McCormack, A. 1995a. Mining genomes: Correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal. Chem. 67: 32023210.[Medline] Yates, J., Eng, J., McCormack, A., and Schieltz, D. 1995b. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 67: 14261436.[Medline]
Received June 15, 2006; accepted in revised form November 9, 2006.