Table 1.

The Training and Testing Corpus

| Category | GO code | Training | Test 2000 | Test 2001 | PubMed query |
| --- | --- | --- | --- | --- | --- |
| Autophagy | GO:0006914 | 177 | 22 | 1 | (autophagy [TI] OR autophagocytosis [MAJR]) AND (Proteins [MH] OR Genes [MH]) AND 1940:1999 [DP] |
| Biogenesis | GO:0016043 | 1023 | 132 | 4 | (biogenesis [TI] OR ((cell wall [MAJR] OR cell membrane structures [MAJR] OR cytoplasmic structures [MAJR]) AND (organization [TI] OR arrangement [TI]))) AND (Genetics [MH]) AND 1984:1999 [DP] |
| Cell adhesion | GO:0007155 | 1025 | 133 | 5 | (cell adhesion [MAJR]) AND (genetics [MH]) AND 1993:1999 [DP] |
| Cell cycle | GO:0007049 | 1085 | 303 | 19 | (cell cycle [MAJR]) AND Genes [MH] AND 1996:1999 [DP] |
| Cell death | GO:0008219 | 1154 | 434 | 28 | (cell death [MAJR]) AND Genes [MH] AND 1997:1999 [DP] |
| Cell fusion | GO:0006947 | 740 | 20 | 0 | (cell fusion [MAJR] OR (mating [TI] AND Saccharomyces Cerevisiae [MAJR])) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Cell motility | GO:0006928 | 1094 | 269 | 23 | (cell movement [MAJR]) AND (Genetics [MH]) AND 1995:1999 [DP] |
| Cell proliferation | GO:0008283 | 394 | 0 | 0 | (cell proliferation [TI]) AND (Genes [MH]) AND 1940:1999 [DP] |
| Cell–cell signaling | GO:0007267 | 237 | 41 | 0 | (synaptic transmission [MAJR] OR synapses [MAJR] OR gap junctions [MAJR]) AND (Genes [MH]) AND 1940:1999 [DP] |
| Chemi-mechanical coupling | GO:0006943 | 1011 | 147 | 6 | (contractile proteins [MAJR] OR kinesins [MAJR]) AND (Genes [MH]) AND 1993:1999 [DP] |
| Intracellular protein traffic | GO:0006886 | 1107 | 322 | 28 | (endocytosis [MAJR] OR exocytosis [MAJR] OR transport vesicles [MAJR] OR protein transport [MAJR] OR nucleocytoplasmic [TI]) AND (Genetics [MH]) AND 1994:1999 [DP] |
| Invasive growth | GO:0007125 | 492 | 52 | 4 | ((invasive [TI] AND growth [TI]) OR neoplasm invasiveness [MAJR]) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Ion homeostasis | GO:0006873 | 424 | 64 | 5 | ((na [TI] OR k [TI] OR ion [TI] OR calcium [TI] OR sodium [TI] OR hydrogen [TI] OR potassium [TI] OR pH [TI] OR water [TI]) AND (concentration [TI] OR senses [TI] OR sensing [TI] OR homeostasis [TI] OR homeostasis [MAJR])) AND (genetics [MH]) AND 1940:1999 [DP] |
| Meiosis | GO:0007126 | 1003 | 151 | 7 | ((meiosis [MAJR])) AND (Genes [MH] OR Proteins [MH]) AND 1986:1999 [DP] |
| Membrane fusion | GO:0006944 | 317 | 58 | 4 | (membrane fusion [MAJR]) AND (Genetics [MH]) AND 1940:1999 [DP] |
| Metabolism | GO:0008152 | 1005 | 225 | 30 | (metabolism [MAJR]) AND Genes [MH] AND 1989:1999 [DP] |
| Oncogenesis | GO:0007048 | 1043 | 168 | 15 | (cell transformation, neoplastic [MAJR]) AND Genes [MH] AND 1994:1999 [DP] |
| Signal transduction | GO:0007165 | 1168 | 302 | 25 | (signal transduction [MAJR]) AND Genes [MH] AND 1995:1999 [DP] |
| Sporulation | GO:0007151 | 847 | 49 | 0 | (sporulation [TI]) AND (genetics [MH]) AND 1940:1999 [DP] |
| Stress response | GO:0006950 | 1068 | 253 | 22 | (Wounds [MAJR] OR DNA repair [MAJR] OR DNA Damage [MAJR] OR Heat-Shock Response [MAJR] OR stress [MAJR] OR starvation [TI] OR soxR [TI] OR (oxidation-reduction [MAJR] NOT Electron-Transport [MAJR])) AND (Genes [MH]) AND 1996:1999 [DP] |
| Transport | GO:0006810 | 1022 | 84 | 8 | (biological transport [MAJR] OR transport [TI]) AND (Genes [MH]) AND 1985:1999 [DP] |

[i] This table lists the category name in the first column, the corresponding Gene Ontology code in the second column, and the PubMed query used to obtain abstracts in the final column. For the training data set, the articles were obtained by using the query as listed in the table. Within a PubMed query, the [MAJR] label specifies MeSH major headings, [MH] specifies MeSH headings, [TI] specifies title words, and [DP] specifies publication date ranges. The test2000 and test2001 data sets were obtained by modifying the publication date limit to restrict articles to those published in 2000 and 2001, respectively. Titles were omitted from the test data sets. The table lists the number of articles obtained for each category for the training and test sets.
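
The date-limit modification described in the note can be illustrated with a short sketch. It is not the authors' procedure: the helper name `restrict_to_year` is hypothetical, and the single-year form `2000:2000 [DP]` is an assumption about how the limit was rewritten, since the source says only that the publication date limit was modified.

```python
import re

# Matches a PubMed publication-date clause such as "1996:1999 [DP]".
DATE_CLAUSE = re.compile(r"\b\d{4}:\d{4}\s*\[DP\]")


def restrict_to_year(query: str, year: int) -> str:
    """Return `query` with its [DP] range replaced by a single year.

    Hypothetical helper: the single-year range format is an assumption.
    """
    new_query, n = DATE_CLAUSE.subn(f"{year}:{year} [DP]", query)
    if n != 1:
        raise ValueError(f"expected exactly one [DP] clause, found {n}")
    return new_query


# Training query for the "Cell cycle" category, as listed in the table.
training_query = "(cell cycle [MAJR]) AND Genes [MH] AND 1996:1999 [DP]"

test2000_query = restrict_to_year(training_query, 2000)
test2001_query = restrict_to_year(training_query, 2001)

print(test2000_query)
print(test2001_query)
```

Only the date clause changes, so the subject terms that define each category stay identical between the training and test queries.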