Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature

Table 1.

The Training and Testing Corpus

| Category | GO code | Training | Test 2000 | Test 2001 | PubMed query |
|---|---|---|---|---|---|
| Autophagy | GO:0006914 | 177 | 22 | 1 | (autophagy [TI] OR autophagocytosis [MAJR]) AND (Proteins [MH] OR Genes [MH]) AND 1940:1999 [DP] |
| Biogenesis | GO:0016043 | 1023 | 132 | 4 | (biogenesis [TI] OR ((cell wall [MAJR] OR cell membrane structures [MAJR] OR cytoplasmic structures [MAJR]) AND (organization [TI] OR arrangement [TI]))) AND (Genetics [MH]) AND 1984:1999 [DP] |
| Cell adhesion | GO:0007155 | 1025 | 133 | 5 | (cell adhesion [MAJR]) AND (genetics [MH]) AND 1993:1999 [DP] |
| Cell cycle | GO:0007049 | 1085 | 303 | 19 | (cell cycle [MAJR]) AND Genes [MH] AND 1996:1999 [DP] |
| Cell death | GO:0008219 | 1154 | 434 | 28 | (cell death [MAJR]) AND Genes [MH] AND 1997:1999 [DP] |
| Cell fusion | GO:0006947 | 740 | 20 | 0 | (cell fusion [MAJR] OR (mating [TI] AND Saccharomyces Cerevisiae [MAJR])) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Cell motility | GO:0006928 | 1094 | 269 | 23 | (cell movement [MAJR]) AND (Genetics [MH]) AND 1995:1999 [DP] |
| Cell proliferation | GO:0008283 | 394 | 0 | 0 | (cell proliferation [TI]) AND (Genes [MH]) AND 1940:1999 [DP] |
| Cell–cell signaling | GO:0007267 | 237 | 41 | 0 | (synaptic transmission [MAJR] OR synapses [MAJR] OR gap junctions [MAJR]) AND (Genes [MH]) AND 1940:1999 [DP] |
| Chemi-mechanical coupling | GO:0006943 | 1011 | 147 | 6 | (contractile proteins [MAJR] OR kinesins [MAJR]) AND (Genes [MH]) AND 1993:1999 [DP] |
| Intracellular protein traffic | GO:0006886 | 1107 | 322 | 28 | (endocytosis [MAJR] OR exocytosis [MAJR] OR transport vesicles [MAJR] OR protein transport [MAJR] OR nucleocytoplasmic [TI]) AND (Genetics [MH]) AND 1994:1999 [DP] |
| Invasive growth | GO:0007125 | 492 | 52 | 4 | ((invasive [TI] AND growth [TI]) OR neoplasm invasiveness [MAJR]) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Ion homeostasis | GO:0006873 | 424 | 64 | 5 | ((na [TI] OR k [TI] OR ion [TI] OR calcium [TI] OR sodium [TI] OR hydrogen [TI] OR potassium [TI] OR pH [TI] OR water [TI] AND (concentration [TI] OR senses [TI] OR sensing [TI] OR homeostasis [TI] OR homeostasis [MAJR]) AND (genetics [MH]) AND 1940:1999 [DP] |
| Meiosis | GO:0007126 | 1003 | 151 | 7 | ((meiosis [MAJR])) AND (Genes [MH] OR Proteins [MH]) AND 1986:1999 [DP] |
| Membrane fusion | GO:0006944 | 317 | 58 | 4 | (membrane fusion [MAJR]) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Metabolism | GO:0008152 | 1005 | 225 | 30 | (metabolism [MAJR]) AND Genes [MH] AND 1989:1999 [DP] |
| Oncogenesis | GO:0007048 | 1043 | 168 | 15 | (cell transformation, neoplastic [MAJR]) AND Genes [MH] AND 1994:1999 [DP] |
| Signal transduction | GO:0007165 | 1168 | 302 | 25 | (signal transduction [MAJR]) AND Genes [MH] AND 1995:1999 [DP] |
| Sporulation | GO:0007151 | 847 | 49 | 0 | (sporulation [TI]) AND (genetics [MH]) AND 1940:1999 [DP] |
| Stress response | GO:0006950 | 1068 | 253 | 22 | (Wounds [MAJR] OR DNA repair [MAJR] OR DNA Damage [MAJR] OR Heat-Shock Response [MAJR] OR stress [MAJR] OR starvation [TI] OR soxR [TI] OR (oxidation-reduction [MAJR] NOT Electron-Transport [MAJR])) AND (Genes [MH]) AND 1996:1999 [DP] |
| Transport | GO:0006810 | 1022 | 84 | 8 | (biological transport [MAJR] OR transport [TI]) AND (Genes [MH]) AND 1985:1999 [DP] |
  • This table lists the category name in the first column, the corresponding Gene Ontology code in the second column, and the PubMed query used to obtain abstracts in the final column. For the training dataset, the articles were obtained by using the query as listed in the table. Within a PubMed query, the [MAJR] label specifies MeSH major headings, [MH] specifies MeSH headings, [TI] specifies title words, and [DP] specifies publication date ranges. The test2000 and test2001 datasets were obtained by modifying the publication date limit to restrict articles to those published in 2000 and 2001, respectively. Titles were omitted from the test datasets. The table lists the number of articles obtained for each category for the training and test sets.
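
As an illustration of the date-limit modification described in the note, the following minimal Python sketch rewrites the [DP] clause of a training query so that it covers a single year. The function name `restrict_to_year` and the `2000:2000 [DP]` form of the single-year limit are assumptions made for this example; the article does not state exactly how the modified limit was written.

```python
import re

# Matches a publication-date clause such as "1996:1999 [DP]".
DATE_RANGE = re.compile(r"\b\d{4}:\d{4}\s*\[DP\]")


def restrict_to_year(query: str, year: int) -> str:
    """Return `query` with its [DP] range replaced by a single year.

    Hypothetical helper: the single-year form "YYYY:YYYY [DP]" is an
    assumption, not something specified in the article.
    """
    new_query, count = DATE_RANGE.subn(f"{year}:{year} [DP]", query)
    if count != 1:
        raise ValueError("expected exactly one [DP] date range in the query")
    return new_query


# Training query for the "Cell cycle" category, as listed in the table.
training_query = "(cell cycle [MAJR]) AND Genes [MH] AND 1996:1999 [DP]"
test2000_query = restrict_to_year(training_query, 2000)
test2001_query = restrict_to_year(training_query, 2001)
```

The helper raises an error instead of silently returning the query when no date clause (or more than one) is found, so a malformed query cannot be mistaken for a test-set query.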

This Article

  1. Genome Res. 12: 203-214
